Quick Start
Rust
First, add the uzu dependency to your Cargo.toml:
[dependencies]
uzu = { git = "https://github.com/trymirai/uzu", branch = "main", package = "uzu" }
Then, create an inference Session with a specific model and configuration:
use std::path::PathBuf;
use uzu::session::{
    sampling_config::SamplingConfig,
    session::Session,
    session_config::{SessionConfig, SessionRunConfig},
    session_input::SessionInput,
    session_output::SessionOutput,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Point this at the directory of a downloaded model
    let model_path = PathBuf::from("MODEL_PATH");
    let mut session = Session::new(model_path)?;
    session.load_with_session_config(SessionConfig::default())?;

    let input = SessionInput::Text("Tell about London".to_string());
    let tokens_limit = 128;
    let run_config = SessionRunConfig::new_with_sampling(
        tokens_limit,
        SamplingConfig::default(),
    );

    // The optional callback fires as tokens are generated;
    // return true to continue, false to stop early.
    let output = session.run(input, run_config, Some(|_: SessionOutput| true));
    println!("{}", output.text);

    Ok(())
}
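The per-token callback also lets you inspect partial results or stop generation early. Here is a minimal sketch, assuming (as in the example above) that SessionOutput.text holds the text generated so far:
// Minimal sketch: stop generation early from the per-token callback.
// Assumes SessionOutput.text contains the full text generated so far.
let output = session.run(input, run_config, Some(|partial: SessionOutput| {
    // Returning false stops generation; here we stop at the first blank line.
    !partial.text.contains("\n\n")
}));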
Swift
Setup
Add the uzu-swift dependency to your Package.swift:
dependencies: [
    .package(url: "https://github.com/trymirai/uzu-swift.git", from: "0.1.0")
]
Set up your project via Platform, obtain an API_KEY, and initialize the engine:
import Uzu
let engine = UzuEngine(apiKey: "API_KEY")
Model state
Refresh the model registry:
let registry = try await engine.updateRegistry()
let modelIdentifiers = registry.map(\.key)
Control the model's state:
let modelIdentifier = "Meta-Llama-3.2-1B-Instruct-float16"
engine.download(identifier: modelIdentifier)
engine.pause(identifier: modelIdentifier)
engine.resume(identifier: modelIdentifier)
engine.delete(identifier: modelIdentifier)
Observe the model's state:
@Environment(UzuEngine.self) private var engine
...
ProgressView(value: engine.states[modelIdentifier]?.progress ?? 0.0)
Possible model state values:
.notDownloaded
.downloading(progress: Double)
.paused(progress: Double)
.downloaded
.error(message: String)
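For example, you can switch over the current state to drive UI or retry logic. A minimal sketch, assuming engine.states[modelIdentifier] yields one of the values above:
// Minimal sketch: react to the current model state.
if let state = engine.states[modelIdentifier] {
    switch state {
    case .notDownloaded:
        engine.download(identifier: modelIdentifier)
    case .downloading(let progress), .paused(let progress):
        print("Download progress: \(progress)")
    case .downloaded:
        print("Model is ready")
    case .error(let message):
        print("Download failed: \(message)")
    }
}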
Session
Session is the core entity used to communicate with the model:
let session = try engine.createSession(identifier: modelIdentifier)
Session offers different configuration presets that can provide significant performance boosts for common use cases like classification and summarization:
let config = SessionConfig(
    preset: .general,
    samplingSeed: .default,
    contextLength: .default
)
try session.load(config: config)
Once loaded, the same Session can be reused for multiple requests until you drop it. Each model may consume a significant amount of RAM, so it's important to keep only one session loaded at a time. For iOS apps, we recommend adding the Increased Memory Capability entitlement to ensure your app can allocate the required memory.
Inference
After loading, you can run the Session with a specific prompt or a list of messages:
let input = SessionInput.messages([
    .init(role: .system, content: "You are a helpful assistant"),
    .init(role: .user, content: "Tell about London")
])
let output = session.run(
    input: input,
    maxTokens: 128,
    samplingMethod: .argmax
) { partialOutput in
    // Access the current text using partialOutput.text
    return true // Return true to continue generation
}
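Because a loaded Session can be reused, you can issue several requests back to back without reloading the model. A minimal sketch:
// Minimal sketch: reuse one loaded session for multiple prompts.
for prompt in ["Tell about London", "Tell about Paris"] {
    let output = session.run(
        input: .text(prompt),
        maxTokens: 128,
        samplingMethod: .argmax
    ) { _ in true }
    print(output.text)
}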
SessionOutput also includes generation metrics such as prefill duration and tokens per second. Note that you should run a release build to obtain accurate metrics.
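For example, you might log these metrics after a run. The property names below are hypothetical placeholders rather than the actual API; check SessionOutput in the package for the real field names:
// Hypothetical property names, for illustration only;
// consult SessionOutput for the actual metrics API.
print("Prefill duration: \(output.stats.prefillDuration)s")
print("Generation speed: \(output.stats.tokensPerSecond) tokens/s")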
Presets
Summarization
In this example, we will extract a summary of the input text:
let textToSummarize = "A Large Language Model (LLM) is a type of artificial intelligence that processes and generates human-like text. It is trained on vast datasets containing books, articles, and web content, allowing it to understand and predict language patterns. LLMs use deep learning, particularly transformer-based architectures, to analyze text, recognize context, and generate coherent responses. These models have a wide range of applications, including chatbots, content creation, translation, and code generation. One of the key strengths of LLMs is their ability to generate contextually relevant text based on prompts. They utilize self-attention mechanisms to weigh the importance of words within a sentence, improving accuracy and fluency. Examples of popular LLMs include OpenAI's GPT series, Google's BERT, and Meta's LLaMA. As these models grow in size and sophistication, they continue to enhance human-computer interactions, making AI-powered communication more natural and effective."
let text = "Text is: \"\(textToSummarize)\". Write only summary itself."
let config = SessionConfig(
    preset: .summarization,
    samplingSeed: .default,
    contextLength: .default
)
try session.load(config: config)
let input = SessionInput.text(text)
let output = session.run(
    input: input,
    maxTokens: 1024,
    samplingMethod: .argmax
) { _ in
    return true
}
This will generate 34 output tokens with only 5 model runs during the generation phase, instead of the usual 34 (one run per token).
Classification
Let’s look at a case where you need to classify input text based on a specific feature, such as sentiment:
let feature = SessionClassificationFeature(
    name: "sentiment",
    values: ["Happy", "Sad", "Angry", "Fearful", "Surprised", "Disgusted"]
)
let textToDetectFeature = "Today's been awesome! Everything just feels right, and I can't stop smiling."
let text = "Text is: \"\(textToDetectFeature)\". Choose \(feature.name) from the list: \(feature.values.joined(separator: ", ")). Answer with one word. Don't add a dot at the end."
let config = SessionConfig(
    preset: .classification(feature),
    samplingSeed: .default,
    contextLength: .default
)
try session.load(config: config)
let input = SessionInput.text(text)
let output = session.run(
    input: input,
    maxTokens: 32,
    samplingMethod: .argmax
) { _ in
    return true
}
In this example, you will get the answer Happy immediately after the prefill step, and the actual generation won't even start.
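You can then read the classification result straight from the output. A minimal sketch, assuming output.text contains the single-word answer:
// Minimal sketch: validate the single-word answer against the feature values.
let answer = output.text.trimmingCharacters(in: .whitespacesAndNewlines)
if feature.values.contains(answer) {
    print("Detected \(feature.name): \(answer)")
} else {
    print("Unexpected answer: \(answer)")
}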