To run a model, create an inference Session with the configuration from the previous step. The Session is the core entity used to communicate with the model. When a Session is created, the model weights are loaded into the device’s RAM and stay there until you deallocate the Session object; until then, the same Session can be reused for multiple requests.
Each model can consume a significant amount of RAM, so it’s important to keep only one Session loaded at a time. For iOS apps, we recommend adding the Increased Memory Limit entitlement to ensure your app can allocate the required memory.
Don’t try to run a single Session instance from different threads, as it’s not thread-safe and doesn’t support batch requests.
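If several parts of your app need the model, one way to respect this restriction is to funnel every request through a single actor, so the Session is never called from two threads at once. A minimal sketch (the `ModelRunner` wrapper and the `Output` return type are our own assumptions, not part of the SDK):

```swift
// Hypothetical wrapper: actor isolation guarantees that run calls
// on the underlying Session execute one at a time.
actor ModelRunner {
    private let session: Session

    init(session: Session) {
        self.session = session
    }

    func generate(_ input: Input, config: RunConfig) throws -> Output {
        // Always continue generation until the token limit is reached.
        try session.run(input: input, config: config) { _ in true }
    }
}
```

Callers then `await` the actor method, and concurrent requests are queued rather than racing on the Session.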
After creating a Session, you need to build an input prompt. To do this, use the Input enum, which can have two possible values:
Input
messages — A list of role-based messages. Possible roles are: system, user, assistant.
text — Text to use as the prompt.
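For example, both cases can be constructed like this (the prompt contents are illustrative):

```swift
// Role-based chat prompt:
let chatInput: Input = .messages(messages: [
    Message(role: .system, content: "You are a helpful assistant."),
    Message(role: .user, content: "What is on-device inference?"),
])

// Plain-text prompt:
let textInput: Input = .text(text: "Summarize on-device inference in one sentence.")
```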
The last step is calling the run method with the input and a max tokens limit. You can optionally specify a sampling method to override the one recommended by the model. run accepts a callback that receives intermediate results and returns a Boolean indicating whether generation should continue.
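The callback makes it easy to stream partial output and stop early. A sketch, assuming the intermediate result exposes the generated text as a `text` property (check your SDK's actual type):

```swift
// Stream partial results and stop once enough text has been produced.
// Returning true continues generation; returning false stops it.
var generated = ""
let output = try session.run(
    input: input,
    config: RunConfig().tokensLimit(1024)
) { partial in
    generated = partial.text        // assumption: intermediate result exposes `text`
    print(partial.text)
    return generated.count < 2_000  // stop early after roughly 2,000 characters
}
```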
Now we can finally run the model. Choose the use case closest to your product.
Chat
Summarization
Classification
Simple flow for chat-based use cases
1
Create a session for the selected model
let session = try engine.chatSession(model, config: Config(preset: .general))
2
Build an input prompt
let messages = [
    Message(role: .system, content: "You are a helpful assistant."),
    Message(role: .user, content: "Tell me a short, funny story about a robot."),
]
let input: Input = .messages(messages: messages)
3
Run the session
let runConfig = RunConfig()
    .tokensLimit(1024)
let output = try session.run(
    input: input,
    config: runConfig
) { _ in
    return true
}
Extract a summary of the input text
1
Create a session for the selected model
let session = try engine.chatSession(model, config: Config(preset: .summarization))
2
Build an input prompt
let textToSummarize = "A Large Language Model (LLM) is a type of artificial intelligence that processes and generates human-like text. It is trained on vast datasets containing books, articles, and web content, allowing it to understand and predict language patterns. LLMs use deep learning, particularly transformer-based architectures, to analyze text, recognize context, and generate coherent responses. These models have a wide range of applications, including chatbots, content creation, translation, and code generation. One of the key strengths of LLMs is their ability to generate contextually relevant text based on prompts. They utilize self-attention mechanisms to weigh the importance of words within a sentence, improving accuracy and fluency. Examples of popular LLMs include OpenAI's GPT series, Google's BERT, and Meta's LLaMA. As these models grow in size and sophistication, they continue to enhance human-computer interactions, making AI-powered communication more natural and effective."
let input: Input = .text(
    text: "Text is: \"\(textToSummarize)\". Write only summary itself."
)
For classification, build a prompt that asks the model to pick one value from a fixed list:
let textToDetectFeature = "Today's been awesome! Everything just feels right, and I can't stop smiling."
let prompt = "Text is: \"\(textToDetectFeature)\". Choose \(feature.name) from the list: \(feature.values.joined(separator: ", ")). Answer with one word. Don't add a dot at the end."
let input: Input = .text(text: prompt)
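The `feature` value above is defined by your app, not by the SDK. A minimal hypothetical definition for sentiment classification might look like:

```swift
// Hypothetical helper describing one classification feature.
struct Feature {
    let name: String
    let values: [String]
}

let feature = Feature(
    name: "sentiment",
    values: ["positive", "neutral", "negative"]
)
```

Keeping the allowed values in one place makes it easy to validate the model's one-word answer against the same list after generation.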