Before we move on to running the model, it’s useful to say a few words about the inference configuration. It’s defined by the SessionConfig type, which has the following properties:

SessionConfig

preset: See below.
samplingSeed: At each step of model inference, a new token is chosen from the distribution using a sampling method, which is usually stochastic. If you need reproducible results for testing, you can set a fixed samplingSeed.
contextLength: Each model has its own context length, which defines how long your conversation with the model can be. If you know you won't need a long context, you can set a smaller value to save RAM during inference.
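For example, a configuration with a fixed seed and a trimmed context might look roughly like this. This is a Swift-style sketch: only the property names come from the list above, while the initializer and the concrete values are assumptions.

```swift
// Rough sketch; the exact SessionConfig initializer is an assumption.
let config = SessionConfig(
    samplingSeed: 42,      // fixed seed -> reproducible sampling in tests
    contextLength: 2048    // smaller context window -> less RAM at inference time
)
```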

SessionPreset

Basically, the preset defines the method of speculative decoding. When an LLM generates text, it goes step by step, predicting one token at a time. This is autoregressive decoding. It’s slow because each token needs a full forward pass. With speculative decoding, we use heuristics to guess multiple tokens in a single model run, then validate them.
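To make the draft-and-verify idea concrete, here is a toy, self-contained sketch. It is not the library's implementation: `target` stands in for the full model's next-token choice and `draft` for a cheap heuristic guess, and both are trivial mocks here.

```swift
// Toy illustration of speculative decoding: draft several tokens cheaply,
// then keep only the prefix that the "real" model agrees with.
func target(_ context: [Int]) -> Int { (context.last ?? 0) + 1 }  // "expensive" model (mocked)
func draft(_ context: [Int]) -> Int { (context.last ?? 0) + 1 }   // cheap heuristic guess (mocked)

func speculativeStep(context: [Int], lookahead: Int) -> [Int] {
    // 1. Cheaply draft several tokens ahead.
    var ctx = context
    var drafted: [Int] = []
    for _ in 0..<lookahead {
        let t = draft(ctx)
        drafted.append(t)
        ctx.append(t)
    }

    // 2. Verify the drafted tokens against the target model and keep the
    //    agreeing prefix. (A real implementation checks all of them in a
    //    single batched forward pass, which is where the speed-up comes from.)
    var accepted: [Int] = []
    ctx = context
    for t in drafted {
        guard target(ctx) == t else { break }
        accepted.append(t)
        ctx.append(t)
    }

    // 3. Worst case: nothing was accepted, so we fall back to one ordinary
    //    autoregressive step, having wasted the drafting work.
    if accepted.isEmpty { accepted.append(target(context)) }
    return accepted
}

print(speculativeStep(context: [1, 2, 3], lookahead: 4))  // [4, 5, 6, 7]
```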
While you can use any input with any SessionPreset, a wrong choice can lead to performance degradation. If the heuristic doesn’t match your real use case, you’ll end up doing extra computation without any speedup, because almost none of the speculated tokens will be accepted.
If you’re using a thinking model, it’s better to go with the General preset, since other presets won’t give any boost during the thinking phase.
General: No speculation is performed.
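So, for a thinking model the choice might look like this (again a Swift-style sketch; the case spelling .general is an assumption):

```swift
// For thinking models, speculation gives no benefit during the thinking
// phase, so the no-speculation General preset avoids wasted work.
// The case name .general is an assumption.
let thinkingConfig = SessionConfig(preset: .general)
```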

Finally, we can run the model.