Overview
To run a model, we need to create an inference Session with the configuration we discussed in the previous step. The Session is the core entity used to communicate with the model: when it is created, the model weights are loaded into the device's RAM and stay resident until the Session object is deallocated. Once loaded, the same Session can be reused for multiple requests.
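For illustration, here is a minimal sketch of session creation. The engine and configuration names (`InferenceEngine`, `createSession`, `SessionConfig`) are assumptions for this sketch, not necessarily the SDK's exact API:

```swift
// Hypothetical names for illustration; consult the SDK for the exact API.
let engine = try InferenceEngine()

// Creating the session loads the model weights into RAM; they stay
// resident until the session object is deallocated.
let session = try engine.createSession(
    model: "model-identifier",          // the model selected earlier
    config: SessionConfig()             // the configuration from the previous step
)
```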
After creating a Session, you need to build an input prompt. To do this, use the Input enum, which has two cases:
Input

| Case | Description |
| --- | --- |
| messages | A list of role-based messages. Possible roles are: system, user, and assistant. |
| text | Plain text to use as the prompt. |
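A hedged sketch of building both Input cases, assuming Swift-style enum cases with associated values; the `Message` type and role spellings follow the table above but are assumptions:

```swift
// Role-based messages for chat-style prompts.
let chatInput = Input.messages([
    Message(role: .system, content: "You are a helpful assistant."),
    Message(role: .user, content: "Summarize this article in two sentences."),
])

// Or a raw text prompt.
let textInput = Input.text("Summarize this article in two sentences.")
```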
Finally, call the run method with the input and a max tokens limit. You can also optionally specify a sampling method if you want to override the one recommended by the model. run accepts a callback that receives intermediate results and returns a boolean flag indicating whether generation should continue.
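A sketch of calling run with a token limit and a streaming callback; the parameter names (`tokensLimit`, `samplingMethod`) and the callback shape are assumptions based on the description above:

```swift
// Returning true from the callback continues generation; false stops it.
let output = try session.run(
    input: chatInput,
    tokensLimit: 256,
    samplingMethod: .greedy   // optional: override the model's recommended sampler
) { partial in
    print(partial.text)       // intermediate result so far
    return true               // keep generating
}
```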
Steps
Now we can finally run the model. Choose the use case that's closest to your product.
- Chat
- Summarization
- Classification
Simple flow for chat-based use cases
1. Create a session for the selected model
2. Build an input prompt
3. Run the session
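Putting the three steps together, a minimal end-to-end chat sketch (using the same hypothetical type and method names as the snippets above):

```swift
// Hypothetical API names for illustration only.

// 1. Create a session for the selected model.
let engine = try InferenceEngine()
let session = try engine.createSession(model: "model-identifier", config: SessionConfig())

// 2. Build an input prompt from role-based messages.
let input = Input.messages([
    Message(role: .system, content: "You are a helpful assistant."),
    Message(role: .user, content: "What is the capital of France?"),
])

// 3. Run the session, streaming intermediate results.
let output = try session.run(input: input, tokensLimit: 128) { partial in
    print(partial.text)
    return true
}
```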
As a result, the session returns an Output object, which contains execution stats such as tokens per second and prefill time:
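For example, a sketch of reading those stats; the field names (`stats.tokensPerSecond`, `stats.prefillTime`) are assumptions:

```swift
print("Generated: \(output.text)")
print("Tokens/s: \(output.stats.tokensPerSecond)")
print("Prefill time: \(output.stats.prefillTime)s")
```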
Now that we've integrated AI into the app, you can also check out a detailed overview of how the uzu inference engine works under the hood.