Overview

To run a model, we need to create an inference Session with the configuration we discussed in the previous step. Session is the core entity used to communicate with the model. After you call the load method, the model weights are loaded into the device’s RAM and stay there until you deallocate the Session object; until then, the same Session can be reused for multiple requests.
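For example, once the session from the steps below is loaded, back-to-back requests reuse the resident weights. This sketch assumes the .text input case and the Bool-returning callback described later in this guide:

let first = try session.run(
    input: .text(text: "Tell me a joke"),
    tokensLimit: 64,
    samplingConfig: .argmax
) { _ in true }
// No reload between requests: the weights are already in RAM
let second = try session.run(
    input: .text(text: "Now explain it"),
    tokensLimit: 64,
    samplingConfig: .argmax
) { _ in true }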
Each model may consume a significant amount of RAM, so it’s important to keep only one session loaded at a time. For iOS apps, we recommend adding the Increased Memory Limit entitlement (com.apple.developer.kernel.increased-memory-limit) to ensure your app can allocate the required memory.
Don’t try to run a single Session instance from different threads, as it’s not thread-safe and doesn’t support batch requests.
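If multiple parts of your app need the model, one option is to funnel all calls through a single Swift actor so they never overlap. A minimal sketch, assuming the Session, SessionInput, SessionOutput, and SamplingConfig types used in the steps below:

actor SerializedSession {
    private let session: Session

    init(session: Session) {
        self.session = session
    }

    // Actor isolation guarantees that at most one run executes at a time
    func run(input: SessionInput, tokensLimit: UInt32, samplingConfig: SamplingConfig) throws -> SessionOutput {
        try session.run(
            input: input,
            tokensLimit: tokensLimit,
            samplingConfig: samplingConfig
        ) { _ in true }
    }
}

Callers invoke it with try await, and concurrent requests queue up instead of racing.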
After creating a Session, you need to build an input prompt. To do this, use the SessionInput enum, which has two cases:

SessionInput

messages: A list of role-based messages. Possible roles are: system, user, assistant.
text: Text to use as the prompt.
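For example, both input forms can be built like this (the text: argument label is an assumption, chosen by analogy with the messages: label used below):

// Role-based chat input
let chatInput: SessionInput = .messages(messages: [
    SessionMessage(role: .user, content: "Hello!")
])
// Plain-text input; the `text:` label is assumed by analogy with `messages:`
let plainInput: SessionInput = .text(text: "Hello!")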
The last step is calling the run method with the input and a maximum token limit. You can optionally specify a sampling method to override the one recommended by the model. The run method also accepts a callback that receives intermediate results and returns a Boolean indicating whether generation should continue.
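For example, returning false from the callback stops generation early. A minimal sketch, reusing the session and input built in the steps below, that cuts generation off after 50 callback invocations:

var steps = 0
let output = try session.run(
    input: input,
    tokensLimit: 256,
    samplingConfig: .argmax
) { _ in
    steps += 1
    return steps < 50 // returning false stops generation before the token limit
}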

Steps

Now we can finally run the model. Choose the use case that’s closest to your product.
Simple flow for chat-based use cases
1. Create a session for the selected model

// `engine` and `localModelId` come from the setup in the previous step
let modelId: ModelId = .local(id: localModelId)
let session = try engine.createSession(modelId)
2. Load the session with a specific configuration

// Loading places the model weights in RAM; they stay resident until the session is deallocated
try session.load(
    preset: .general,
    samplingSeed: .default,
    contextLength: .default
)
3. Build an input prompt

let messages = [
    SessionMessage(role: .system, content: "You are a helpful assistant."),
    SessionMessage(role: .user, content: "Tell me a short, funny story about a robot."),
]
let input: SessionInput = .messages(messages: messages)
4. Run the session

let tokensLimit: UInt32 = 128
let sampling: SamplingConfig = .argmax

let output = try session.run(
    input: input,
    tokensLimit: tokensLimit,
    samplingConfig: sampling
) { partialOutput in
    handlePartialOutput(partialOutput)
    return true // return false to stop generation early
}
As a result, the session returns a SessionOutput object, which contains execution stats such as tokens per second and prefill time.
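A quick sketch of inspecting the result; the property names below are assumptions, so check the SessionOutput definition in your SDK version:

print(output.text)                    // generated text (name assumed)
print(output.stats.tokensPerSecond)   // generation speed (name assumed)
print(output.stats.prefillDuration)   // prefill time (name assumed)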
If you run into any problems with the flow described above, check the troubleshooting section.

Now that we’ve integrated AI into the app, you can also check out a detailed overview of how the uzu inference engine works under the hood.