Overview

Runtime Local LLM is a plugin that runs large language models entirely on-device using llama.cpp, with no internet connection required at runtime. It supports GGUF model files and provides a full Blueprint API for loading models, sending messages, and receiving token-by-token responses, all on a background thread with game-thread callbacks.
The plugin supports Windows, Mac, Linux, Android (including Meta Quest and other Android-based platforms), and iOS.
Key Features
- Complete offline inference: No cloud services or API keys at runtime
- GGUF model support: Load any GGUF-format model (Llama, Mistral, Phi, Gemma, Qwen, etc.)
- Up-to-date llama.cpp: Updated regularly on Fab to track llama.cpp releases, so support for newly released GGUF model architectures arrives with plugin updates
- Hardware acceleration: Uses Vulkan on Windows and Linux and Metal on Mac and iOS; Android and Meta Quest run on the CPU with SIMD intrinsics
- Multiple model loading methods:
  - Load from a local file path
  - Load by model name (dropdown selection in Blueprints)
  - Download from URL and load automatically
  - Download-only for pre-caching models
- Token-by-token streaming: Receive each token as it generates for real-time display
- Async Blueprint nodes: Nodes with output delegates for loading, sending messages, and downloading
- Configurable inference parameters: Temperature, Top-P, Top-K, repeat penalty, GPU layer offloading, context size, seed, thread count, and system prompt
- Conversation management: Multi-turn conversations with context reset, save/load to disk, in-memory snapshots, and automatic summarization for long-running chats
- Editor model manager: Browse, download, import, delete, and test models directly in project settings
- Cross-platform packaging: Models ship with your project via NonUFS staging
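To give a feel for how the temperature, Top-K, and Top-P parameters listed above interact, here is a minimal standalone sketch of candidate filtering. It is an illustration only: `Candidate` and `FilterCandidates` are hypothetical names, and the order and details of the sampling steps inside llama.cpp or this plugin may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration; not part of the plugin or llama.cpp.
struct Candidate {
    int tokenId;
    double probability;
};

// Applies temperature, then Top-K, then Top-P to raw logits and returns the
// surviving candidates with probabilities renormalized to sum to 1.
// topK == 0 disables the Top-K cut; topP >= 1.0 disables the Top-P cut.
std::vector<Candidate> FilterCandidates(const std::vector<double>& logits,
                                        double temperature,
                                        std::size_t topK,
                                        double topP) {
    assert(!logits.empty() && temperature > 0.0);

    // Softmax with temperature; subtracting the max keeps exp() stable.
    const double maxLogit = *std::max_element(logits.begin(), logits.end());
    std::vector<Candidate> candidates;
    candidates.reserve(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        const double weight = std::exp((logits[i] - maxLogit) / temperature);
        candidates.push_back({static_cast<int>(i), weight});
        sum += weight;
    }
    for (Candidate& c : candidates) c.probability /= sum;

    // Most likely tokens first.
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) {
                  return a.probability > b.probability;
              });

    // Top-K: keep only the K most likely tokens.
    if (topK > 0 && candidates.size() > topK) candidates.resize(topK);

    // Top-P: keep the shortest prefix whose cumulative probability
    // (measured on the pre-cut distribution) reaches topP.
    double cumulative = 0.0;
    std::size_t keep = 0;
    while (keep < candidates.size()) {
        cumulative += candidates[keep].probability;
        ++keep;
        if (cumulative >= topP) break;
    }
    candidates.resize(keep);

    // Renormalize so the survivors form a proper distribution.
    double kept = 0.0;
    for (const Candidate& c : candidates) kept += c.probability;
    for (Candidate& c : candidates) c.probability /= kept;
    return candidates;
}
```

Lower temperatures sharpen the distribution before the cuts, so fewer tokens survive Top-P. The final token would then be drawn from the returned candidates with a random number generator, which is where a fixed seed makes output reproducible.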
How It Works
- Manage models in the editor: Use the plugin settings panel to browse a catalog of pre-defined models, download them, or import your own GGUF files
- Load a model at runtime: Call one of the load functions (by file, by name, by URL, or by metadata) with your inference parameters
- Send messages: Pass a user message to the LLM instance; tokens stream back through delegates as the model generates a response
- Use the response: Display tokens in a chat UI, drive NPC dialogue, generate dynamic content, or feed into other systems
All inference runs on a dedicated background thread. Callbacks (token generation, completion, errors) fire on the game thread, so you can safely update UI and game state from them.
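The threading model described here can be sketched in plain C++ as a worker thread that posts callbacks to a queue which the main thread drains. This is a generic illustration under stated assumptions, not the plugin's implementation: `MainThreadQueue` and `StartFakeGeneration` are hypothetical names, and the "model" simply replays a fixed list of tokens.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a game-thread task queue; not the plugin's API.
class MainThreadQueue {
public:
    // Safe to call from any thread: schedules a callback for the main thread.
    void Post(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(std::move(task));
    }

    // Call from the main thread only (for example once per frame).
    // Runs every callback that was pending when Drain() was called.
    void Drain() {
        std::queue<std::function<void()>> pending;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(pending, tasks_);
        }
        while (!pending.empty()) {
            pending.front()();
            pending.pop();
        }
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> tasks_;
};

// Simulates generation on a worker thread. Each "token" and the completion
// signal are posted to the queue instead of being invoked on the worker.
// The queue must outlive the returned thread; the caller must join it.
std::thread StartFakeGeneration(MainThreadQueue& queue,
                                std::vector<std::string> tokens,
                                std::function<void(const std::string&)> onToken,
                                std::function<void()> onComplete) {
    return std::thread([&queue,
                        tokens = std::move(tokens),
                        onToken = std::move(onToken),
                        onComplete = std::move(onComplete)]() {
        for (const std::string& token : tokens) {
            queue.Post([onToken, token]() { onToken(token); });
        }
        queue.Post(onComplete);
    });
}
```

Because the callbacks only run inside `Drain()`, they can touch UI and game state without locks. In Unreal Engine code, the same hand-off is commonly done with `AsyncTask(ENamedThreads::GameThread, ...)` rather than a hand-written queue.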
Common Use Cases
- In-game chatbots and assistants: Q&A, help systems, dynamic tutorials
- NPC dialogue: Conversational NPCs with persistent per-character memory using conversation snapshots
- Long-running roleplay and narrative systems: Automatic summarization keeps multi-hour conversations within context limits without losing key facts
- Procedural content: Generate quest descriptions, item lore, and dialogue trees on the fly
- Offline-first applications: Anything that needs LLM capabilities without a network connection
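The summarization idea mentioned for long-running conversations can be sketched as follows: when the history grows past a token budget, the oldest messages are folded into a running summary and only the most recent ones are kept verbatim. This is one common approach, shown under assumptions, and not necessarily how the plugin does it. `Message`, `Conversation`, `Compact`, and the four-characters-per-token estimate are all hypothetical, and in practice the summarizer callback would itself be an LLM call.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the plugin's API.
struct Message {
    std::string role;
    std::string text;
};

// Crude estimate (assumption): roughly four characters per token.
std::size_t EstimateTokens(const std::string& text) {
    return text.size() / 4 + 1;
}

struct Conversation {
    std::string summary;          // condensed older history
    std::vector<Message> recent;  // messages kept verbatim

    std::size_t TotalTokens() const {
        std::size_t total = summary.empty() ? 0 : EstimateTokens(summary);
        for (const Message& m : recent) total += EstimateTokens(m.text);
        return total;
    }
};

// Receives the previous summary and the messages being dropped,
// and returns the new summary.
using Summarizer = std::function<std::string(
    const std::string& previousSummary, const std::vector<Message>& dropped)>;

// If the conversation exceeds tokenBudget, folds everything except the
// last keepRecent messages into the summary. Does nothing otherwise.
void Compact(Conversation& conversation,
             std::size_t tokenBudget,
             std::size_t keepRecent,
             const Summarizer& summarize) {
    if (conversation.TotalTokens() <= tokenBudget) return;
    if (conversation.recent.size() <= keepRecent) return;

    const auto dropCount = static_cast<std::ptrdiff_t>(
        conversation.recent.size() - keepRecent);
    const auto first = conversation.recent.begin();
    std::vector<Message> dropped(first, first + dropCount);
    conversation.recent.erase(first, first + dropCount);
    conversation.summary = summarize(conversation.summary, dropped);
}
```

Calling `Compact` after each completed turn keeps the prompt bounded while the summary carries key facts forward.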
Model Storage and Packaging
Models are stored as .gguf files in the Content/RuntimeLocalLLM/Models directory of your project. The plugin automatically configures Additional Non-Asset Directories To Copy (DirectoriesToAlwaysStageAsNonUFS) so that model files ship with your packaged project and remain accessible via standard file I/O at runtime.
Each model also has a .json sidecar file that stores its metadata (display name, family, variant, description, parameter count).
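As a rough illustration of this layout, the sketch below builds the expected paths for a model and its sidecar from a content directory and a model name. `ModelPaths` and `ResolveModelPaths` are hypothetical helpers, and the assumption that the sidecar shares the model's base name with a `.json` extension is not confirmed by the text above. In an Unreal project the content directory would typically come from `FPaths::ProjectContentDir()`.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper type; not part of the plugin.
struct ModelPaths {
    fs::path model;    // the .gguf weights file
    fs::path sidecar;  // the .json metadata file (naming is an assumption)
};

// Builds the paths under <contentDir>/RuntimeLocalLLM/Models for a model
// name given without an extension. Does not touch the file system.
ModelPaths ResolveModelPaths(const fs::path& contentDir,
                             const std::string& modelName) {
    const fs::path modelsDir = contentDir / "RuntimeLocalLLM" / "Models";
    ModelPaths paths;
    paths.model = modelsDir / (modelName + ".gguf");
    paths.sidecar = modelsDir / (modelName + ".json");
    return paths;
}
```

Because the files are staged as non-UFS content, paths like these can be opened with ordinary file I/O in a packaged build. A check such as `fs::exists(paths.model)` before loading gives a clearer error than a failed load.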
Supported Models
The plugin works with any model in GGUF format. The editor provides a catalog of popular pre-defined models for one-click download, and you can import any custom GGUF file. Common model families include:
- Llama (Meta) — 1B, 3B, 8B, and larger
- Mistral / Mixtral — 7B and larger
- Phi (Microsoft) — 2B, 3B, 4B
- Gemma (Google) — 2B, 7B
- Qwen (Alibaba) — 1.5B, 7B, and larger
- TinyLlama — 1.1B
- And many more community models
Quantization
Models come in various quantization levels that trade off quality for size and speed:
| Quantization | Quality | Size | Speed |
|---|---|---|---|
| Q2_K | Lower | Smallest | Fastest |
| Q4_K_M | Good | Medium | Fast |
| Q5_K_M | Better | Larger | Moderate |
| Q8_0 | High | Large | Slower |
| F16 / F32 | Highest | Largest | Slowest |
For mobile and VR devices, compact models (1B–3B parameters) at smaller quantizations (Q2_K through Q4_K_M) are recommended. On desktop, you can use larger models and higher-precision quantizations (Q5_K_M, Q8_0, or F16), depending on available RAM and CPU/GPU resources.
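To decide whether a model fits on a target device, a back-of-the-envelope size estimate is parameter count times bits per weight, divided by eight. The sketch below uses ballpark bits-per-weight values that are assumptions for illustration: the K-quant figures in particular vary by model, so treat the results as rough guidance. `ApproxBitsPerWeight` and `EstimateWeightsGiB` are hypothetical helpers.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>

// Ballpark bits per weight for each quantization level (assumed values).
// Returns a negative number for an unknown level.
double ApproxBitsPerWeight(const std::string& quantization) {
    static const std::unordered_map<std::string, double> table = {
        {"Q2_K", 3.0},   {"Q4_K_M", 4.9}, {"Q5_K_M", 5.7},
        {"Q8_0", 8.5},   {"F16", 16.0},   {"F32", 32.0},
    };
    const auto it = table.find(quantization);
    return it != table.end() ? it->second : -1.0;
}

// Estimates the size of the weights in GiB. This covers the weights only;
// the context (KV cache) and runtime buffers need additional memory.
// Returns a negative number for an unknown quantization level.
double EstimateWeightsGiB(double parameterCount,
                          const std::string& quantization) {
    const double bitsPerWeight = ApproxBitsPerWeight(quantization);
    if (bitsPerWeight < 0.0) return -1.0;
    const double bytes = parameterCount * bitsPerWeight / 8.0;
    return bytes / (1024.0 * 1024.0 * 1024.0);
}
```

For example, `EstimateWeightsGiB(3e9, "Q4_K_M")` gives the approximate footprint of a 3B model at Q4_K_M, which can be compared against the memory available on the target headset or phone after leaving headroom for the context and the game itself.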
Additional Resources
- Get it on Fab
- Product website
- Download Demo (Windows)
- Video tutorial
- Plugin Support & Custom Development: [email protected] (tailored solutions for teams & organizations)