Quick Start

Rust

First, add the uzu dependency to your Cargo.toml:

[dependencies]
uzu = { git = "https://github.com/trymirai/uzu", branch = "main", package = "uzu" }

Then, create an inference Session with a specific model and configuration:

use std::path::PathBuf;
use uzu::session::{
    sampling_config::SamplingConfig,
    session::Session,
    session_config::{SessionConfig, SessionRunConfig},
    session_input::SessionInput,
    session_output::SessionOutput
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model_path = PathBuf::from("MODEL_PATH");

    // Create a session for the model and load it with the default configuration
    let mut session = Session::new(model_path)?;
    session.load_with_session_config(SessionConfig::default())?;

    let input = SessionInput::Text("Tell me about London".to_string());

    // Cap generation at 128 tokens, using the default sampling settings
    let tokens_limit = 128;
    let run_config = SessionRunConfig::new_with_sampling(
        tokens_limit,
        SamplingConfig::default()
    );

    // The progress callback receives intermediate output; returning true continues generation
    let output = session.run(input, run_config, Some(|_: SessionOutput| true));
    println!("{}", output.text);
    Ok(())
}

Swift

Setup

Add the uzu-swift dependency to your Package.swift:

dependencies: [
    .package(url: "https://github.com/trymirai/uzu-swift.git", from: "0.1.0")
]
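
You'll also need to add the library product to your app target. The product name Uzu below is an assumption, inferred from the import Uzu statement used later:

.target(
    name: "MyApp",
    dependencies: [
        // "Uzu" is assumed to be the product name exposed by uzu-swift
        .product(name: "Uzu", package: "uzu-swift")
    ]
)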

Set up your project via Platform, obtain an API_KEY, and initialize the engine:

import Uzu

let engine = UzuEngine(apiKey: "API_KEY")

Model state

Refresh the model registry:

let registry = try await engine.updateRegistry()
let modelIdentifiers = registry.map(\.key)

Control the model's state:

let modelIdentifier = "Meta-Llama-3.2-1B-Instruct-float16"

engine.download(identifier: modelIdentifier)
engine.pause(identifier: modelIdentifier)
engine.resume(identifier: modelIdentifier)
engine.delete(identifier: modelIdentifier)

Observe the model's state:

@Environment(UzuEngine.self) private var engine

...

ProgressView(value: engine.states[modelIdentifier]?.progress ?? 0.0)

Possible model state values, with a small handling sketch after the list:

  • .notDownloaded

  • .downloading(progress: Double)

  • .paused(progress: Double)

  • .downloaded

  • .error(message: String)
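
As a sketch of how these states might drive UI, assuming the cases are exposed on a state type exactly as listed above (the ModelState name itself is an assumption), a helper could map each state to a label:

// ModelState is a hypothetical name for the state type behind engine.states
func statusText(for state: ModelState) -> String {
    switch state {
    case .notDownloaded:
        return "Not downloaded"
    case .downloading(let progress):
        return "Downloading: \(Int(progress * 100))%"
    case .paused(let progress):
        return "Paused at \(Int(progress * 100))%"
    case .downloaded:
        return "Ready"
    case .error(let message):
        return "Error: \(message)"
    }
}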

Session

Session is the core entity used to communicate with the model:

let session = try engine.createSession(identifier: modelIdentifier)

Session offers different configuration presets that can provide significant performance boosts for common use cases like classification and summarization:

let config = SessionConfig(
    preset: .general,
    samplingSeed: .default,
    contextLength: .default
)
try session.load(config: config)

Once loaded, the same Session can be reused for multiple requests until you drop it. Each model may consume a significant amount of RAM, so it's important to keep only one session loaded at a time. For iOS apps, we recommend adding the Increased Memory Limit entitlement to ensure your app can allocate the required memory.
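
As a minimal sketch of this reuse, using the run API shown in the Inference section below, one loaded session can serve several prompts in sequence:

let prompts = ["Tell me about London", "Tell me about Paris"]
for prompt in prompts {
    // The same loaded session handles each request; no reload is required
    let output = session.run(
        input: .text(prompt),
        maxTokens: 64,
        samplingMethod: .argmax
    ) { _ in true }
    print(output.text)
}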

Inference

After loading, you can run the Session with a specific prompt or a list of messages:

let input = SessionInput.messages([
    .init(role: .system, content: "You are a helpful assistant"),
    .init(role: .user, content: "Tell me about London")
])
let output = session.run(
    input: input,
    maxTokens: 128,
    samplingMethod: .argmax
) { partialOutput in
    // Access the current text using partialOutput.text
    return true // Return true to continue generation
}
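
Returning false from the callback stops generation early, as the comment above suggests. For example, a sketch that cuts generation off once enough text has accumulated:

let boundedOutput = session.run(
    input: input,
    maxTokens: 128,
    samplingMethod: .argmax
) { partialOutput in
    // Stop once 200 characters have been generated
    partialOutput.text.count < 200
}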

SessionOutput also includes generation metrics, such as prefill duration and tokens per second. Note that you should run a release build to obtain accurate metrics.
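
As an illustration only (the stats property and its field names here are hypothetical, not the library's confirmed API), reading those metrics might look like:

// Hypothetical property and field names, shown only to illustrate the idea
let stats = output.stats
print("Prefill: \(stats.prefillDuration)s, speed: \(stats.tokensPerSecond) tokens/s")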

Presets

Summarization

In this example, we will extract a summary of the input text:

let textToSummarize = "A Large Language Model (LLM) is a type of artificial intelligence that processes and generates human-like text. It is trained on vast datasets containing books, articles, and web content, allowing it to understand and predict language patterns. LLMs use deep learning, particularly transformer-based architectures, to analyze text, recognize context, and generate coherent responses. These models have a wide range of applications, including chatbots, content creation, translation, and code generation. One of the key strengths of LLMs is their ability to generate contextually relevant text based on prompts. They utilize self-attention mechanisms to weigh the importance of words within a sentence, improving accuracy and fluency. Examples of popular LLMs include OpenAI's GPT series, Google's BERT, and Meta's LLaMA. As these models grow in size and sophistication, they continue to enhance human-computer interactions, making AI-powered communication more natural and effective.";
let text = "Text is: \"\(textToSummarize)\". Write only summary itself."

let config = SessionConfig(
    preset: .summarization,
    samplingSeed: .default,
    contextLength: .default
)
try session.load(config: config)

let input = SessionInput.text(text)
let output = session.run(
    input: input,
    maxTokens: 1024,
    samplingMethod: .argmax
) { _ in
    return true
}

This will generate 34 output tokens with only 5 model runs during the generation phase, instead of the 34 runs that one-token-per-run decoding would require.

Classification

Let’s look at a case where you need to classify input text based on a specific feature, such as sentiment:

let feature = SessionClassificationFeature(
    name: "sentiment",
    values: ["Happy", "Sad", "Angry", "Fearful", "Surprised", "Disgusted"]
)

let textToDetectFeature = "Today's been awesome! Everything just feels right, and I can't stop smiling."
let text = "Text is: \"\(textToDetectFeature)\". Choose \(feature.name) from the list: \(feature.values.joined(separator: ", ")). Answer with one word. Dont't add dot at the end."

let config = SessionConfig(
    preset: .classification(feature),
    samplingSeed: .default,
    contextLength: .default
)
try session.load(config: config)

let input = SessionInput.text(text)
let output = session.run(
    input: input,
    maxTokens: 32,
    samplingMethod: .argmax
) { _ in
    return true
}

In this example, you will get the answer "Happy" immediately after the prefill step, and the actual generation won't even start.