Inference Engine
uzu (うず, Japanese for "vortex") is a modular, high-performance inference engine for AI models, designed with support for multiple backends in mind.
Before we start
Before diving into implementation details, it's important to understand how Apple's ML stack is built. Each Apple device has multiple compute units (CPU, GPU, and ANE) that can be switched between efficiently thanks to the unified memory architecture.
While using the CPU and GPU is straightforward, a common misconception is that CoreML is the only way to access the ANE. This is not true. Under the hood, CoreML uses a lower-level API called Metal Performance Shaders Graph. While MPSGraph hides direct configuration from the end user behind high-level settings like optimization profiles, it is still possible to specify the exact device that will execute a specific set of operations. By splitting the execution graph into smaller, encodable blocks, you can granularly control which compute unit is used for each part. MPSGraph itself compiles to MLIR, which allows it to perform additional optimizations.
Another important aspect to understand is the characteristics of Apple's hardware. The current generation of edge devices has enough compute to execute a small model's forward pass. In token-by-token generation, the main limitation is the memory bandwidth required to stream the model's weights. This bandwidth defines the theoretical upper bound on achievable generation speed. Therefore, accelerators like the ANE can provide massive improvements for large batch computations (e.g., prefilling a large context, VLMs, or processing large speculation trees), but not for single-token generation.
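To make that bound concrete, here is a back-of-the-envelope estimate. The model size, quantization, and bandwidth figures below are illustrative assumptions, not measurements of any particular device:

```rust
/// Every generated token has to stream all model weights through memory once,
/// so tokens/s is bounded above by bandwidth / weight bytes.
fn bandwidth_bound_tokens_per_second(
    params: f64,          // number of parameters
    bytes_per_param: f64, // e.g. 0.5 for ~4-bit quantization
    bandwidth_gb_s: f64,  // effective memory bandwidth
) -> f64 {
    (bandwidth_gb_s * 1e9) / (params * bytes_per_param)
}

fn main() {
    // Assumption: a 3B-parameter model at ~4 bits on a device with ~100 GB/s.
    let bound = bandwidth_bound_tokens_per_second(3e9, 0.5, 100.0);
    println!("theoretical ceiling: ~{bound:.0} tokens/s"); // ~67 tokens/s
}
```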
With this in mind, we wanted to design an architecture that allows for hybrid implementations. Some layers can be compiled into MPSGraph executables with specified device placement, while others can be implemented as custom, highly optimized Metal kernels.
Why Rust
While Swift is native to Apple platforms, we chose Rust for uzu's core logic to prioritize performance, memory safety, and future portability. Rust offers C-level performance, and its ownership model prevents common bugs at compile time, ensuring a robust inference engine.
Our primary goal is a cross-platform codebase. By writing the core logic in Rust, we create a portable foundation. The current Metal backend is just one implementation; we plan to add others soon. This modular design maximizes code reuse.
Finally, Rust's excellent Foreign Function Interface (FFI) support, streamlined by uniffi, allows us to create a lightweight bridge. This lets us call the high-performance Rust code seamlessly from Swift on iOS/macOS and will allow easy integration with other languages like Kotlin or Python in the future.
Architecture
Let's take a look at the core parts of the engine.
Executables
Executables are the lowest-level components in the system, representing individual computational blocks that can be dispatched to the compute unit. Each layer, such as MLP, Attention, RMSNorm, or Embedding, must be represented as an Executable.
The only requirement for an Executable is that it must be encodable into a command buffer for execution.
Currently, the system supports two types of executables:
MPSGraphBlock. These executables use Apple's MPSGraph framework. To work with MPSGraph, we provide the mpsgraph-rs crate.
Kernel. Used when:
An operation cannot be implemented efficiently in MPSGraph (e.g., attention)
Specific memory access patterns are required
Custom algorithms need to be implemented (e.g., sampling strategies)
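As a rough sketch of this abstraction (the trait and type names are placeholders rather than uzu's actual API, and the Metal calls assume the metal-rs crate):

```rust
use metal::{CommandBufferRef, ComputePipelineState};

/// Anything that can encode its work into a Metal command buffer.
trait Executable {
    fn encode(&self, command_buffer: &CommandBufferRef);
}

/// A block compiled ahead of time with MPSGraph (via mpsgraph-rs),
/// with device placement decided at compilation time.
struct MpsGraphBlock {/* compiled MPSGraph executable handle */}

impl Executable for MpsGraphBlock {
    fn encode(&self, _command_buffer: &CommandBufferRef) {
        // Encode the compiled MPSGraph executable into the command buffer.
    }
}

/// A hand-written Metal kernel, used where MPSGraph is not a good fit
/// (attention, custom memory-access patterns, sampling, ...).
struct Kernel {
    pipeline: ComputePipelineState,
}

impl Executable for Kernel {
    fn encode(&self, command_buffer: &CommandBufferRef) {
        let encoder = command_buffer.new_compute_command_encoder();
        encoder.set_compute_pipeline_state(&self.pipeline);
        // Bind buffers from the forward-pass state, dispatch threadgroups, ...
        encoder.end_encoding();
    }
}
```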
For each generation step, the system creates a ForwardPassState that manages all heap-allocated buffers required for computation. This state includes:
Input buffers (token_ids, token_positions, mask, ...)
Intermediate activation buffers
Output buffers (logits, sampling results)
Persistent buffers (KV cache, weights)
Each executable receives a reference to this state and accesses required buffers through corresponding identifiers.
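Conceptually, the state is a map from buffer identifiers to heap-allocated Metal buffers. The sketch below uses hypothetical names for illustration, not uzu's internal types:

```rust
use std::collections::HashMap;

use metal::Buffer;

/// Identifiers for the buffers a forward pass touches.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum BufferId {
    TokenIds,
    TokenPositions,
    Mask,
    Activations(usize), // intermediate activations, per layer
    KvCache(usize),     // persistent KV cache, per layer
    Logits,
    SamplingResult,
}

/// Owns every heap-allocated buffer needed for one generation step.
struct ForwardPassState {
    buffers: HashMap<BufferId, Buffer>,
}

impl ForwardPassState {
    /// Executables look up the buffers they need by identifier.
    fn buffer(&self, id: BufferId) -> &Buffer {
        &self.buffers[&id]
    }
}
```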
Generator
The generator is the core component of the inference process, responsible for managing the step-by-step generation of new tokens. It also supports speculative decoding, a technique used to accelerate text generation. At the start of a generation session, the generator prepares all required resources. This includes ensuring that the compiled executables are ready and memory buffers are allocated.
The generation process, whether consuming the initial prompt or producing additional tokens, follows a structured cycle:
Speculation: Instead of generating tokens one by one, the generator uses a lightweight draft model (the speculator) to predict a tree of possible future token sequences. These serve as candidate paths for the main model to evaluate.
Linearization: The speculative branches are flattened into a single batch, allowing the main model to evaluate multiple token positions in parallel.
Causal masking: Before dispatching the batch to the compute device, the generator builds an attention mask to enforce causal constraints. Each token can only attend to earlier tokens. For speculative tokens, the mask grants access to the original prompt and to ancestor tokens along the same speculative path (see the sketch after this list).
Execution and validation: The batch is processed in a single forward pass. The generator compares the main model’s output logits against the speculator’s predictions to identify correctly guessed tokens.
Acceptance and state update: The longest sequence of validated tokens is accepted and appended to the output. The KV cache is updated to reflect the new state.
The cycle then repeats using the extended context. To reduce latency, the generator overlaps computation where possible, encoding the most probable next pass while executing the previous one.
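The masking step can be sketched as follows. The layout (the committed prompt followed by the flattened tree, with a parent index per speculative token) is an assumption made for illustration:

```rust
/// `prompt_len` committed tokens are followed by the flattened speculation tree,
/// where `parents[i]` is the in-batch index of token i's parent (None for roots).
/// Returns one boolean row per speculative token over all visible positions.
fn build_speculation_mask(prompt_len: usize, parents: &[Option<usize>]) -> Vec<Vec<bool>> {
    let batch = parents.len();
    let mut mask = vec![vec![false; prompt_len + batch]; batch];
    for i in 0..batch {
        // Every speculative token sees the whole committed prompt...
        for j in 0..prompt_len {
            mask[i][j] = true;
        }
        // ...itself, and every ancestor along its own speculative path.
        let mut node = Some(i);
        while let Some(n) = node {
            mask[i][prompt_len + n] = true;
            node = parents[n];
        }
    }
    mask
}

fn main() {
    // Prompt of 4 tokens, then a tiny tree: token 0 is a root with children 1 and 2.
    for row in build_speculation_mask(4, &[None, Some(0), Some(0)]) {
        println!("{:?}", row.iter().map(|&b| b as u8).collect::<Vec<_>>());
    }
}
```

A boolean row like this maps directly onto an additive attention mask (0 where attention is allowed, negative infinity elsewhere) before the batch is dispatched.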
Session
The session serves as the primary interface for text generation, managing the complete lifecycle from user input to streaming output. It supports configurable modes, enabling optimized execution for specific use cases such as classification or summarization.
During a single run, the session orchestrates the generation process across several key phases:
Input processing: The input text or message list is tokenized into a sequence of token_ids.
Prefill: The initial prompt is processed, optionally in chunks for long inputs.
Generation loop: Tokens are generated iteratively. The loop ends when an end-of-sequence token is emitted, the token limit is reached, or the user cancels via the callback.
Output handling: The generated tokens are decoded into text and returned, along with detailed performance metrics.
Throughout execution, the session records various metrics, including tokens per second, phase durations, and model invocation counts.
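Put together, a single run boils down to a loop like the one below. All names here are placeholders that illustrate the four phases, not uzu's public API; prefill chunking and metric collection are elided:

```rust
enum StopReason {
    EndOfSequence,
    TokenLimit,
    Cancelled,
}

fn run_session(
    tokenize: &dyn Fn(&str) -> Vec<u32>,   // input processing
    step: &mut dyn FnMut(&[u32]) -> u32,   // one generator step -> next token
    detokenize: &dyn Fn(&[u32]) -> String, // output handling
    prompt: &str,
    max_new_tokens: usize,
    eos_token: u32,
    mut on_token: impl FnMut(u32) -> bool, // user callback; returning false cancels
) -> (String, StopReason) {
    // Input processing: text -> token_ids.
    let mut tokens = tokenize(prompt);

    // Prefill and generation loop.
    let mut generated = Vec::new();
    let stop = loop {
        let next = step(&tokens);
        if next == eos_token {
            break StopReason::EndOfSequence;
        }
        tokens.push(next);
        generated.push(next);
        if !on_token(next) {
            break StopReason::Cancelled;
        }
        if generated.len() >= max_new_tokens {
            break StopReason::TokenLimit;
        }
    };

    // Output handling: tokens -> text.
    (detokenize(&generated), stop)
}
```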
Tracer
The tracer validates the numerical correctness of the inference implementation by comparing its outputs to reference activations. It is a critical tool to ensure that uzu’s implementation produces results identical to the original model.
At initialization, the tracer loads a trace file exported by lalamo. This file contains pre-computed activations from a reference implementation, including all intermediate values: layer inputs/outputs, KV cache state, and final logits for a given input sequence.
For each tensor, the tracer computes:
Maximum absolute and relative errors
Root mean square (RMS) of differences
RMS of both reference and produced values
Total number of violations
Each tensor element is checked against a tolerance criterion. Elements exceeding the threshold are counted as violations. A tensor is considered valid if the number of violations remains below a configurable limit, allowing for minor inconsistencies due to floating-point precision.
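A per-tensor comparison along these lines might look as follows. The exact criterion used here (an element counts as a violation only if it misses both an absolute and a relative threshold) is an assumption for illustration:

```rust
struct TensorReport {
    max_abs_error: f32,
    max_rel_error: f32,
    rms_of_diff: f32,
    rms_reference: f32,
    rms_produced: f32,
    violations: usize,
}

fn compare_tensor(reference: &[f32], produced: &[f32], abs_tol: f32, rel_tol: f32) -> TensorReport {
    let rms = |xs: &[f32]| (xs.iter().map(|x| x * x).sum::<f32>() / xs.len() as f32).sqrt();

    let mut max_abs = 0.0f32;
    let mut max_rel = 0.0f32;
    let mut violations = 0;
    let diffs: Vec<f32> = reference
        .iter()
        .zip(produced)
        .map(|(r, p)| {
            let abs = (r - p).abs();
            let rel = abs / r.abs().max(f32::EPSILON);
            max_abs = max_abs.max(abs);
            max_rel = max_rel.max(rel);
            // Count an element as a violation only if both tolerances are exceeded.
            if abs > abs_tol && rel > rel_tol {
                violations += 1;
            }
            r - p
        })
        .collect();

    TensorReport {
        max_abs_error: max_abs,
        max_rel_error: max_rel,
        rms_of_diff: rms(&diffs),
        rms_reference: rms(reference),
        rms_produced: rms(produced),
        violations,
    }
}

fn main() {
    let report = compare_tensor(&[1.0, 2.0, 3.0], &[1.0, 2.001, 2.9], 1e-2, 1e-2);
    println!("violations: {}", report.violations); // 1: only the last element is clearly off
}
```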
The tracer generates detailed reports for each tensor, flagging those that exceed tolerance limits. This enables automated validation of all supported models and makes the development of new features safer.
What's next?
We’re just getting started – more features are coming soon:
Android support
VLM / TTS / STT models
Advanced speculation
Specialized sessions
Tool calling
Deep end-platform integrations
Automatic routing between local and cloud inference
More magic coming soon! ❤️