Overview
To run a model, we need to create an inference Session with the configuration we discussed in the previous step. The Session is the core entity used to communicate with the model. After you call the load method, the model weights are loaded into the device’s RAM and remain there until you deallocate the Session object. Once loaded, the same Session can be reused for multiple requests until you drop it.
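As a minimal sketch of this lifecycle (the engine value and SessionConfig come from the previous step; the exact names here are assumptions and may differ in your SDK version):

```swift
import Uzu

// Create a session for a specific model; `engine` and the model
// identifier are assumed from the previous steps.
let session = try engine.createSession("Meta-Llama-3.2-1B-Instruct")

// load moves the model weights into the device's RAM; they stay
// resident until the session object is deallocated.
try session.load(config: config)
```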
Each model may consume a significant amount of RAM, so it’s important to keep only one session loaded at a time. For iOS apps, we recommend adding the Increased Memory Limit capability to ensure your app can allocate the required memory.
Don’t try to run a single Session instance from different threads, as it’s not thread-safe and doesn’t support batch requests.

To run the Session, you need to build an input prompt. To do this, use the SessionInput enum, which can have two possible values:
| Value | Description |
| --- | --- |
| messages | A list of role-based messages. Possible roles are: system, user, assistant. |
| text | Text to use as the prompt. |
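For example, hedged sketches of both cases (the exact message initializer shape is an assumption):

```swift
// Role-based chat input
let chatInput: SessionInput = .messages([
    .init(role: .system, content: "You are a helpful assistant."),
    .init(role: .user, content: "What is the capital of France?")
])

// Plain-text input used directly as the prompt
let textInput: SessionInput = .text("Once upon a time")
```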
To execute generation, call the run method with the input and a max tokens limit. You can also optionally specify a sampling method if you want to override the one recommended by the model. The run method accepts a callback with intermediate results and expects a boolean flag that indicates whether generation should continue.
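A hedged sketch of such a call (the tokensLimit parameter name and the partialOutput.text field are assumptions based on the description above):

```swift
let output = session.run(
    input: chatInput,
    tokensLimit: 256
) { partialOutput in
    // Called with intermediate results; return false to stop generation early.
    print(partialOutput.text)
    return true
}
```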
Steps
Now we can finally run the model. Choose the use case that’s closest to your product.
Simple flow for chat-based use cases
1. Create a session for the selected model
2. Load the session with a specific configuration
3. Build an input prompt
4. Run the session
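Putting the four steps together, a minimal end-to-end sketch might look like this (UzuEngine, SessionConfig, and the parameter names are assumptions based on this guide; substitute the actual identifiers from your setup):

```swift
import Uzu

// 1. Create a session for the selected model
let engine = UzuEngine(apiKey: "YOUR_API_KEY") // hypothetical setup
let session = try engine.createSession("Meta-Llama-3.2-1B-Instruct")

// 2. Load the session with a specific configuration
try session.load(config: SessionConfig(
    preset: .general,
    samplingSeed: .default,
    contextLength: .default
))

// 3. Build an input prompt
let input: SessionInput = .messages([
    .init(role: .user, content: "Tell me a short, funny story")
])

// 4. Run the session
let output = session.run(input: input, tokensLimit: 512) { _ in
    true // keep generating until the limit is reached
}
print(output.text)
```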
As a result, the session returns a SessionOutput object, which contains execution stats such as tokens per second or prefill time. If you run into any problems with the flow described above, check the troubleshooting section.
Now that we’ve integrated AI into the app, you can also check out a detailed overview of how the uzu inference engine works under the hood.