This section applies only to local models
Before we move on to running the model, it’s useful to say a few words about the inference configuration. It’s defined by the Config type, which has the following properties:
Config
| Property | Description |
| --- | --- |
| preset | See below. |
| samplingSeed | At each step of model inference, a new token is chosen from the distribution using a sampling method, which is usually stochastic. If you need reproducible results for testing, you can set a fixed samplingSeed. |
| contextLength | Each model has its own context length, which defines how long your conversation with the model can be. If you know you won’t need a long context, you can set a smaller value to save RAM during inference. |
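For illustration, here is a minimal Kotlin sketch of what such a configuration might look like. The constructor, defaults, and property types shown here are assumptions for the sake of the example and may differ from the actual Config API.

```kotlin
// Illustrative sketch only: the real Config type may have a different shape.
data class Config(
    val preset: Preset = Preset.General,   // speculative decoding strategy, see "Preset" below
    val samplingSeed: Long? = null,        // set a fixed value for reproducible sampling in tests
    val contextLength: Int? = null         // lower it to save RAM when a long context isn't needed
)

enum class Preset { General, Summarization, Classification }

// Example: a deterministic config with a short context, e.g. for unit tests.
val testConfig = Config(
    preset = Preset.General,
    samplingSeed = 42L,
    contextLength = 2048
)
```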
Preset
Basically, the preset defines the method of speculative decoding. When an LLM generates text, it goes step by step, predicting one token at a time. This is autoregressive decoding. It’s slow because each token needs a full forward pass. With speculative decoding, we use heuristics to guess multiple tokens in a single model run, then validate them.
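To make the draft-and-verify idea concrete, here is a minimal Kotlin sketch of a single speculative decoding step with greedy verification. This is not the SDK's implementation: `draftTokens` and `modelArgmax` are hypothetical stand-ins for the speculation heuristic and for one full forward pass of the model.

```kotlin
// Minimal sketch of the draft-and-verify loop behind speculative decoding.
fun speculativeStep(
    context: List<Int>,
    draftTokens: (List<Int>) -> List<Int>,      // cheaply guess the next k tokens
    modelArgmax: (List<Int>) -> List<Int>       // one forward pass over context + draft; returns the
                                                // model's own pick at each draft position plus one more
): List<Int> {
    val draft = draftTokens(context)            // 1. speculate k tokens in one go
    val verified = modelArgmax(context + draft) // 2. a single forward pass checks all of them
    val accepted = mutableListOf<Int>()
    for (i in draft.indices) {
        // 3. keep the longest prefix where the draft agrees with the model
        if (draft[i] == verified[i]) accepted += draft[i] else break
    }
    // 4. plus one token the model produced itself, so every pass yields at least one token
    accepted += verified[accepted.size]
    return accepted
}
```

If the draft rarely matches the model's own picks, each pass still pays for the forward pass over the whole draft, which is why the preset should match your use case.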
While you can use any input with any Preset, a poor choice can degrade performance: if the heuristic doesn’t match your actual use case, the speculated tokens will rarely be accepted, so you pay for the extra computation without getting any speedup.
If you’re using a thinking model, it’s better to go with the General preset, since other presets won’t give any boost during the thinking phase.
| Preset | Description |
| --- | --- |
| General | No speculation is performed. |
| Summarization | We use a prompt lookup technique, which helps in cases like summarization or code completion, where the model’s output often contains significant parts of the original text. |
| Classification | Use this preset when you want the LLM to answer with one of the predefined classes (like in sentiment analysis). This is defined by the SessionClassificationFeature. |
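As a rough guide, the sketch below (continuing the hypothetical Config from above) shows how a preset might be chosen per use case. The shape of SessionClassificationFeature and where it is supplied (to the session or the config) are assumptions; check the actual API reference.

```kotlin
// Hypothetical shape of the classification feature, for illustration only.
data class SessionClassificationFeature(val classes: List<String>)

// Chat or thinking models: no matching heuristic, so General is the safe default.
val chatConfig = Config(preset = Preset.General)

// Summarization / code completion: the output repeats parts of the input,
// so prompt lookup can speculate tokens straight from the prompt.
val summaryConfig = Config(preset = Preset.Summarization)

// Sentiment analysis: the answer is one of a fixed set of classes.
val sentimentClasses = SessionClassificationFeature(
    classes = listOf("positive", "neutral", "negative")
)
val classifierConfig = Config(preset = Preset.Classification)
```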
Finally, we can run the model