This section applies only to local models
Before we move on to running the model, it’s useful to say a few words about the inference configuration. It’s defined by the Config type, which has the following properties:
Config
| Property | Description |
| --- | --- |
| preset | See below. |
| samplingSeed | At each step of model inference, a new token is chosen from the distribution using a sampling method, which is usually stochastic. If you need reproducible results for testing, you can set a fixed samplingSeed. |
| contextLength | Each model has its own context length, which defines how long your conversation with the model can be. If you know you won’t need a long context, you can set a smaller value to save RAM during inference. |
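For illustration, here is a minimal Kotlin sketch of what such a configuration might look like. The constructor, defaults, and property types shown here are assumptions for the sake of the example and may differ from the actual Config API.

```kotlin
// Illustrative sketch only: the real Config type may have a different shape.
data class Config(
    val preset: Preset = Preset.General,   // speculative decoding strategy, see "Preset" below
    val samplingSeed: Long? = null,        // set a fixed value for reproducible sampling in tests
    val contextLength: Int? = null         // lower it to save RAM when a long context isn't needed
)

enum class Preset { General, Summarization, Classification }

// Example: a deterministic config with a short context, e.g. for unit tests.
val testConfig = Config(
    preset = Preset.General,
    samplingSeed = 42L,
    contextLength = 2048
)
```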
Preset
Basically, the preset defines the method of speculative decoding. When an LLM generates text, it goes step by step, predicting one token at a time. This is autoregressive decoding. It’s slow because each token needs a full forward pass. With speculative decoding, we use heuristics to guess multiple tokens in a single model run, then validate them.
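To make the draft-and-verify idea concrete, here is a minimal Kotlin sketch of a single speculative decoding step with greedy verification. This is not the SDK's implementation: `draftTokens` and `modelArgmax` are hypothetical stand-ins for the speculation heuristic and for one full forward pass of the model.

```kotlin
// Minimal sketch of the draft-and-verify loop behind speculative decoding.
fun speculativeStep(
    context: List<Int>,
    draftTokens: (List<Int>) -> List<Int>,      // cheaply guess the next k tokens
    modelArgmax: (List<Int>) -> List<Int>       // one forward pass over context + draft; returns the
                                                // model's own pick at each draft position plus one more
): List<Int> {
    val draft = draftTokens(context)            // 1. speculate k tokens in one go
    val verified = modelArgmax(context + draft) // 2. a single forward pass checks all of them
    val accepted = mutableListOf<Int>()
    for (i in draft.indices) {
        // 3. keep the longest prefix where the draft agrees with the model
        if (draft[i] == verified[i]) accepted += draft[i] else break
    }
    // 4. plus one token the model produced itself, so every pass yields at least one token
    accepted += verified[accepted.size]
    return accepted
}
```

If the draft rarely matches the model's own picks, each pass still pays for the forward pass over the whole draft, which is why the preset should match your use case.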
While you can use any input with any Preset, a poor choice can degrade performance: if the heuristic doesn’t match your actual use case, the speculated tokens will rarely be accepted, so you pay for the extra computation without getting any speedup.
If you’re using a thinking model, it’s better to go with the General preset, since other presets won’t give any boost during the thinking phase.
| Preset | Description |
| --- | --- |
| General | No speculation is performed. |
| Summarization | We use a prompt lookup technique, which helps in cases like summarization or code completion, where the model’s output often contains significant parts of the original text. |
| Classification | Use this preset when you want the LLM to answer with one of the predefined classes (like in sentiment analysis). This is defined by the SessionClassificationFeature. |
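As a rough guide, the sketch below (continuing the hypothetical Config from above) shows how a preset might be chosen per use case. The shape of SessionClassificationFeature and where it is supplied (to the session or the config) are assumptions; check the actual API reference.

```kotlin
// Hypothetical shape of the classification feature, for illustration only.
data class SessionClassificationFeature(val classes: List<String>)

// Chat or thinking models: no matching heuristic, so General is the safe default.
val chatConfig = Config(preset = Preset.General)

// Summarization / code completion: the output repeats parts of the input,
// so prompt lookup can speculate tokens straight from the prompt.
val summaryConfig = Config(preset = Preset.Summarization)

// Sentiment analysis: the answer is one of a fixed set of classes.
val sentimentClasses = SessionClassificationFeature(
    classes = listOf("positive", "neutral", "negative")
)
val classifierConfig = Config(preset = Preset.Classification)
```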
Finally, we can run the model