Inference parameters

The LLM Inference Parameters struct controls how a model is loaded and how it generates text. You pass these parameters when loading a model. This page describes each parameter and its effect.

Parameter Reference

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| Max Tokens | int32 | 512 | 1–8192 | Maximum number of tokens to generate in a single response |
| Temperature | float | 0.7 | 0.0–2.0 | Controls randomness. 0.0 = deterministic; higher values = more creative output |
| Top P | float | 0.9 | 0.0–1.0 | Nucleus sampling. Restricts sampling to the smallest set of tokens whose cumulative probability reaches this value |
| Top K | int32 | 40 | 0–200 | Limits selection to the K most probable tokens. 0 = disabled |
| Repeat Penalty | float | 1.1 | 0.0–3.0 | Penalizes tokens that already appear in the output. 1.0 = no penalty |
| Num GPU Layers | int32 | -1 | -1–200 | Model layers to offload to the GPU. -1 = auto, 0 = CPU only |
| Context Size | int32 | 2048 | 128–131072 | Maximum context window in tokens. Larger values use more memory |
| System Prompt | FString | "You are a helpful assistant." | — | System instruction that shapes the model's behavior |
| Seed | int32 | -1 | ≥ -1 | Random seed for reproducible output. -1 = random |
| Num Threads | int32 | 0 | 0–128 | CPU threads used for generation. 0 = automatic |
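
For readers working in C++, the struct maps onto something like the sketch below. The type name `FLLMInferenceParams` and the exact field names are assumptions for illustration (check the plugin's headers for the real declaration); the defaults and ranges mirror the table above.

```cpp
// Illustrative sketch only — the plugin's actual USTRUCT and field names may differ.
// A real USTRUCT also needs its matching .generated.h include.
USTRUCT(BlueprintType)
struct FLLMInferenceParams // hypothetical name
{
	GENERATED_BODY()

	UPROPERTY(EditAnywhere, BlueprintReadWrite) int32   MaxTokens     = 512;   // 1–8192
	UPROPERTY(EditAnywhere, BlueprintReadWrite) float   Temperature   = 0.7f;  // 0.0–2.0
	UPROPERTY(EditAnywhere, BlueprintReadWrite) float   TopP          = 0.9f;  // 0.0–1.0
	UPROPERTY(EditAnywhere, BlueprintReadWrite) int32   TopK          = 40;    // 0–200, 0 = disabled
	UPROPERTY(EditAnywhere, BlueprintReadWrite) float   RepeatPenalty = 1.1f;  // 1.0 = no penalty
	UPROPERTY(EditAnywhere, BlueprintReadWrite) int32   NumGPULayers  = -1;    // -1 = auto, 0 = CPU only
	UPROPERTY(EditAnywhere, BlueprintReadWrite) int32   ContextSize   = 2048;  // 128–131072
	UPROPERTY(EditAnywhere, BlueprintReadWrite) FString SystemPrompt  = TEXT("You are a helpful assistant.");
	UPROPERTY(EditAnywhere, BlueprintReadWrite) int32   Seed          = -1;    // -1 = random
	UPROPERTY(EditAnywhere, BlueprintReadWrite) int32   NumThreads    = 0;     // 0 = automatic
};
```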

Usage

Inference parameters appear as a struct pin on load and async nodes. Break the struct to set individual values:

Inference Parameters in Blueprint

To get a default set of parameters as a starting point, use Get Default Inference Params:

Get Default Inference Params
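
If you prefer C++ over Blueprint, the same flow would look roughly like this. `ULLMSubsystem`, `GetDefaultInferenceParams`, and `LoadModel` are hypothetical names that mirror the Blueprint nodes above, and `FLLMInferenceParams` is the illustrative struct from the earlier sketch — substitute the plugin's actual C++ API.

```cpp
// Sketch only: the subsystem and function names below are assumptions that
// mirror the Blueprint nodes; the plugin's real C++ API may differ.
void AMyActor::StartChat()
{
	ULLMSubsystem* LLM = GetGameInstance()->GetSubsystem<ULLMSubsystem>(); // hypothetical subsystem

	// Start from the defaults (the Get Default Inference Params node),
	// then override only the values you care about.
	FLLMInferenceParams Params = LLM->GetDefaultInferenceParams();
	Params.Temperature  = 0.4f;                                   // tighter, more focused replies
	Params.MaxTokens    = 256;                                    // keep responses short
	Params.SystemPrompt = TEXT("You are a terse shopkeeper NPC.");

	// The struct is passed when the model is loaded, as on the Blueprint load node.
	LLM->LoadModel(TEXT("MyModel.gguf"), Params);                 // illustrative model path
}
```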

Platform Recommendations

Mobile / VR (Android, iOS, Meta Quest)

  • Context Size: 1024–2048
  • Num GPU Layers: 0 (CPU only) unless the device has confirmed GPU compute support
  • Max Tokens: Under 256 for responsive interactions
  • Num Threads: 2–4 depending on the device

Desktop (Windows, Mac, Linux)

  • Context Size: 2048–8192 for most conversations
  • Num GPU Layers: -1 (auto) to leverage GPU acceleration when available
  • Num Threads: 0 (auto)
  • Max Tokens: 512–2048 for longer responses
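
As a concrete starting point, the sketch below applies these recommendations to the illustrative `FLLMInferenceParams` struct, switching on Unreal's platform macros. The specific values are mid-range picks from the ranges above; tune them against your target hardware.

```cpp
// Sketch only: picks platform-appropriate values from the recommendations above.
// FLLMInferenceParams is the illustrative struct used earlier on this page.
FLLMInferenceParams MakePlatformTunedParams()
{
	FLLMInferenceParams Params; // starts from the documented defaults

#if PLATFORM_ANDROID || PLATFORM_IOS
	// Mobile / VR: small context, CPU-only inference, short responses.
	Params.ContextSize  = 2048;
	Params.NumGPULayers = 0;   // CPU only unless GPU compute support is confirmed
	Params.MaxTokens    = 200; // stay under 256 for responsive interactions
	Params.NumThreads   = 4;   // 2–4 depending on the device
#else
	// Desktop: larger context; let the plugin auto-offload layers to the GPU.
	Params.ContextSize  = 4096;
	Params.NumGPULayers = -1;  // auto
	Params.MaxTokens    = 1024;
	Params.NumThreads   = 0;   // automatic
#endif
	return Params;
}
```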