Inference parameters

Unreleased Plugin

The documentation you are currently viewing is for a plugin that has not yet been released. Content may be incomplete or subject to change. Please check back once the plugin is officially available on the Fab marketplace.

The LLM Inference Parameters structure controls how the model loads and generates text. You pass these parameters when loading a model. This page describes each parameter and its effect.

Parameter Reference

| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| Max Tokens | int32 | 512 | 1–8192 | Maximum number of tokens to generate in a single response |
| Temperature | float | 0.7 | 0.0–2.0 | Controls randomness. 0.0 = deterministic; higher values = more creative output |
| Top P | float | 0.9 | 0.0–1.0 | Nucleus sampling. Restricts sampling to the smallest set of tokens whose cumulative probability exceeds this value |
| Top K | int32 | 40 | 0–200 | Limits selection to the K most probable tokens. 0 = disabled |
| Repeat Penalty | float | 1.1 | 0.0–3.0 | Penalizes tokens that already appear in the output. 1.0 = no penalty |
| Num GPU Layers | int32 | -1 | -1–200 | Model layers to offload to the GPU. -1 = auto, 0 = CPU only |
| Context Size | int32 | 2048 | 128–131072 | Maximum context window in tokens. Larger values use more memory |
| System Prompt | FString | "You are a helpful assistant." | — | System instruction that shapes the model's behavior |
| Seed | int32 | -1 | ≥ -1 | Random seed for reproducible output. -1 = random |
| Num Threads | int32 | 0 | 0–128 | CPU threads to use for generation. 0 = automatic |

Usage

Inference parameters appear as a struct pin on load and async nodes. Break the struct to set individual values:

[Screenshot: Inference Parameters struct pin broken in a Blueprint graph]

To get a default set of parameters as a starting point, use Get Default Inference Params:

[Screenshot: Get Default Inference Params node]

Platform Recommendations

Mobile / VR (Android, iOS, Meta Quest)

  • Context Size: 1024–2048
  • Num GPU Layers: 0 (CPU only) unless the device has confirmed GPU compute support
  • Max Tokens: Under 256 for responsive interactions
  • Num Threads: 2–4 depending on the device

Desktop (Windows, macOS, Linux)

  • Context Size: 2048–8192 for most conversations
  • Num GPU Layers: -1 (auto) to leverage GPU acceleration when available
  • Num Threads: 0 (auto)
  • Max Tokens: 512–2048 for longer responses