Models

lalamo is a set of tools for adapting AI models to on-device inference using the uzu inference engine.

Approach

While developing an inference engine, it’s critical to have full control over the model implementation, because it allows you to:

  • Perform optimizations (e.g., RoPE precomputation; see the sketch after this list)

  • Use a modular export format, where each model is constructed from unified blocks that are easy to support on the inference side

  • Maintain a reference implementation for validating output correctness
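
As a concrete illustration of the first point, the sketch below shows the generic idea behind RoPE precomputation: the rotary cos/sin tables depend only on the position index and the head dimension, so they can be computed once and shipped with the exported model instead of being recomputed at inference time. This is a minimal NumPy sketch of the technique in general, not lalamo's actual implementation.

import numpy as np

def precompute_rope_tables(head_dim: int, max_positions: int, base: float = 10000.0):
    # Inverse frequencies for each pair of channels in a head.
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    # One rotation angle per (position, frequency) pair.
    angles = np.outer(np.arange(max_positions), inv_freq)
    # These tables can be stored alongside the exported weights,
    # so the engine never has to recompute them on each forward pass.
    return np.cos(angles), np.sin(angles)

cos_table, sin_table = precompute_rope_tables(head_dim=64, max_positions=4096)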

In contrast to libraries like transformers, where each model is implemented independently, lalamo introduces a unified intermediate representation. As a result, if a new model is composed of already supported blocks, adding support for it only requires providing a configuration.
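
As a purely illustrative sketch of what such a representation enables (the class and field names below are hypothetical and are not lalamo's actual API), a new model can reduce to a declarative description that wires together already supported blocks:

from dataclasses import dataclass

# Hypothetical unified blocks; lalamo's real block set may differ.
@dataclass
class AttentionConfig:
    num_heads: int
    num_kv_heads: int
    head_dim: int

@dataclass
class MLPConfig:
    hidden_dim: int
    activation: str

@dataclass
class DecoderConfig:
    vocab_size: int
    num_layers: int
    attention: AttentionConfig
    mlp: MLPConfig

# Supporting a new model then amounts to filling in a configuration,
# provided every block it needs already exists on the inference side.
new_model = DecoderConfig(
    vocab_size=32000,
    num_layers=22,
    attention=AttentionConfig(num_heads=32, num_kv_heads=4, head_dim=64),
    mlp=MLPConfig(hidden_dim=5632, activation="silu"),
)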

This also avoids issues found in approaches like those used in coremltools, where the library tries to trace the model to capture the computation graph. That process is often unstable and difficult to maintain due to changes in the underlying inference implementation.

Usage

To get the list of supported models, run:

uv run lalamo list-models

To convert a model, run:

uv run lalamo convert MODEL_REPO --precision float16

After that, you can find the converted model in the models folder. For more options, see uv run lalamo convert --help.

Model support

To add support for a new model, write the corresponding ModelSpec, as shown in the example below:

ModelSpec(
    vendor="Google",
    family="Gemma-3",
    name="Gemma-3-1B-Instruct",
    size="1B",
    quantization=None,
    repo="google/gemma-3-1b-it",
    config_type=HFGemma3TextConfig,
    config_file_name="config.json",
    weights_file_names=huggingface_weight_files(1),
    weights_type=WeightsType.SAFETENSORS,
    tokenizer_files=HUGGINGFACE_TOKENIZER_FILES,
    use_cases=tuple(),
)
