Overview

The documentation you are currently viewing is for a plugin that has not yet been released. Content may be incomplete or subject to change. Please check back once the plugin is officially available on the Fab marketplace.
Runtime Local LLM is a plugin that runs large language models entirely on-device using llama.cpp, with no internet connection required at runtime. It supports GGUF model files and provides a full Blueprint API for loading models, sending messages, and receiving token-by-token responses, all on a background thread with game-thread callbacks.
The plugin supports Windows, Mac, Linux, Android (including Meta Quest and other Android-based platforms), and iOS.
Key Features
- Complete offline inference: No cloud services or API keys at runtime
- GGUF model support: Load any GGUF-format model (Llama, Mistral, Phi, Gemma, Qwen, etc.)
- Up-to-date llama.cpp: Updated regularly on Fab to keep pace with llama.cpp releases, so the latest GGUF model formats are always supported
- GPU acceleration: Uses Vulkan on Windows and Linux, Metal on Mac and iOS; Android and Meta Quest use optimized CPU inference with SIMD intrinsics
- Multiple model loading methods:
  - Load from a local file path
  - Load by model name (dropdown selection in Blueprints)
  - Download from URL and load automatically
  - Download-only for pre-caching models
- Token-by-token streaming: Receive each token as it is generated, for real-time display
- Async Blueprint nodes: Nodes with output delegates for loading, sending messages, and downloading
- Configurable inference parameters: Temperature, Top-P, Top-K, repeat penalty, GPU layer offloading, context size, seed, thread count, and system prompt
- Conversation context management: Maintain multi-turn conversations with context reset support
- Editor model manager: Browse, download, import, delete, and test models directly in project settings
- Cross-platform packaging: Models ship with your project via NonUFS staging
How It Works
1. Manage models in the editor: Use the plugin settings panel to browse a catalog of pre-defined models, download them, or import your own GGUF files
2. Load a model at runtime: Call one of the load functions (by file, by name, by URL, or by metadata) with your inference parameters
3. Send messages: Pass a user message to the LLM instance; tokens stream back through delegates as the model generates a response
4. Use the response: Display tokens in a chat UI, drive NPC dialogue, generate dynamic content, or feed into other systems
All inference runs on a dedicated background thread. Callbacks (token generation, completion, errors) fire on the game thread, so you can safely update UI and game state from them.
Model Storage and Packaging
Models are stored as .gguf files in the Content/RuntimeLocalLLM/Models directory of your project. The plugin automatically configures Additional Non-Asset Directories To Copy (DirectoriesToAlwaysStageAsNonUFS) so that model files ship with your packaged project and remain accessible via standard file I/O at runtime.
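For reference, the staging the plugin configures is equivalent to an entry like the following in the project's packaging settings (DefaultGame.ini); the exact path shown here is illustrative:

```ini
[/Script/UnrealEd.ProjectPackagingSettings]
+DirectoriesToAlwaysStageAsNonUFS=(Path="RuntimeLocalLLM/Models")
```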
Each model also has a .json sidecar file that stores its metadata (display name, family, variant, description, parameter count).
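A sidecar might look like the following; the field names mirror the metadata listed above, but both the keys and the values are illustrative rather than the plugin's exact schema:

```json
{
  "DisplayName": "Llama 3.2 3B Instruct (Q4_K_M)",
  "Family": "Llama",
  "Variant": "Q4_K_M",
  "Description": "Small instruction-tuned model suitable for mobile and VR.",
  "ParameterCount": "3B"
}
```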
Supported Models
The plugin works with any model in GGUF format. The editor provides a catalog of popular pre-defined models for one-click download, and you can import any custom GGUF file. Common model families include:
- Llama (Meta) — 1B, 3B, 8B, and larger
- Mistral / Mixtral — 7B and larger
- Phi (Microsoft) — 2B, 3B, 4B
- Gemma (Google) — 2B, 7B
- Qwen (Alibaba) — 1.5B, 7B, and larger
- TinyLlama — 1.1B
- And many more community models
Quantization
Models come in various quantization levels that trade quality against file size and inference speed:
| Quantization | Quality | Size | Speed |
|---|---|---|---|
| Q2_K | Lower | Smallest | Fastest |
| Q4_K_M | Good | Medium | Fast |
| Q5_K_M | Better | Larger | Moderate |
| Q8_0 | High | Large | Slower |
| F16 / F32 | Highest | Largest | Slowest |
For mobile and VR devices, smaller quantizations (Q2_K through Q4_K_M) of compact models (1B–3B parameters) are recommended. On desktop, larger models and higher-precision quantizations (Q8_0, F16) are viable, depending on available RAM and CPU/GPU resources.
Additional Resources
- Plugin Support & Custom Development: [email protected] (tailored solutions for teams & organizations)