Overview

Runtime Local LLM is an Unreal Engine plugin that runs large language models entirely on-device using llama.cpp, with no internet connection required at runtime. It supports GGUF model files and provides a full Blueprint API for loading models, sending messages, and receiving token-by-token responses. Inference runs on a background thread, and results are delivered through game-thread callbacks.

The plugin supports Windows, Mac, Linux, Android (including Meta Quest and other Android-based platforms), and iOS.

Key Features

Complete offline inference: No cloud services or API keys at runtime
GGUF model support: Load any GGUF-format model (Llama, Mistral, Phi, Gemma, Qwen, etc.)
Up-to-date llama.cpp: Updated regularly on Fab to keep pace with llama.cpp releases, so the latest GGUF model formats are always supported
GPU acceleration: Uses Vulkan on Windows and Linux and Metal on Mac and iOS; Android and Meta Quest run on the CPU with SIMD intrinsics
Multiple model loading methods:
- Load from a local file path
- Load by model name (dropdown selection in Blueprints)
- Download from URL and load automatically
- Download-only for pre-caching models
Token-by-token streaming: Receive each token as it generates for real-time display
Async Blueprint nodes: Nodes with output delegates for loading, sending messages, and downloading
Configurable inference parameters: Temperature, Top-P, Top-K, repeat penalty, GPU layer offloading, context size, seed, thread count, and system prompt
Conversation management: Multi-turn conversations with context reset, save/load to disk, in-memory snapshots, and automatic summarization for long-running chats
Editor model manager: Browse, download, import, delete, and test models directly in project settings
Cross-platform packaging: Models ship with your project via NonUFS staging
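To give a feel for what the Temperature, Top-K, and Top-P parameters in the list above do, here is a minimal, generic sketch of candidate filtering over raw logits. It is a textbook illustration, not the plugin's or llama.cpp's actual sampler: the names Candidate and FilterCandidates are hypothetical, and the order of the steps (temperature first, then Top-K, then Top-P) is an assumption made for simplicity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration: one vocabulary entry and its probability.
struct Candidate {
    int token_id;
    double probability;
};

// Temperature-scaled softmax, then Top-K, then Top-P (nucleus) filtering.
// Returns the surviving candidates sorted by descending probability and
// renormalized to sum to 1. A top_k of 0 disables the Top-K step.
std::vector<Candidate> FilterCandidates(const std::vector<double>& logits,
                                        double temperature,
                                        std::size_t top_k,
                                        double top_p) {
    std::vector<Candidate> candidates;
    if (logits.empty() || temperature <= 0.0) return candidates;

    // Softmax with temperature; subtracting the max keeps exp() stable.
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        const double p = std::exp((logits[i] - max_logit) / temperature);
        candidates.push_back({static_cast<int>(i), p});
        sum += p;
    }
    for (auto& c : candidates) c.probability /= sum;

    // Most likely tokens first.
    std::stable_sort(candidates.begin(), candidates.end(),
                     [](const Candidate& a, const Candidate& b) {
                         return a.probability > b.probability;
                     });

    // Top-K: keep only the K most likely tokens.
    if (top_k > 0 && top_k < candidates.size()) candidates.resize(top_k);

    // Top-P: keep the shortest prefix whose cumulative probability reaches top_p.
    double cumulative = 0.0;
    std::size_t keep = candidates.size();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        cumulative += candidates[i].probability;
        if (cumulative >= top_p) {
            keep = i + 1;
            break;
        }
    }
    candidates.resize(keep);

    // Renormalize so the survivors form a valid distribution to sample from.
    double kept_mass = 0.0;
    for (const auto& c : candidates) kept_mass += c.probability;
    for (auto& c : candidates) c.probability /= kept_mass;
    return candidates;
}
```

In this sketch, a lower temperature sharpens the distribution before filtering, while smaller Top-K and Top-P values shrink the pool of tokens the model can pick from. Setting Top-K to 1 leaves a single candidate, which amounts to greedy decoding.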

How It Works

Manage models in the editor: Use the plugin settings panel to browse a catalog of pre-defined models, download them, or import your own GGUF files
Load a model at runtime: Call one of the load functions (by file, by name, by URL, or by metadata) with your inference parameters
Send messages: Pass a user message to the LLM instance; tokens stream back through delegates as the model generates a response
Use the response: Display tokens in a chat UI, drive NPC dialogue, generate dynamic content, or feed into other systems

All inference runs on a dedicated background thread. Callbacks (token generation, completion, errors) fire on the game thread, so you can safely update UI and game state from them.
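The threading model described above can be illustrated with a small standard C++ sketch. It does not use the plugin's or Unreal Engine's API: CallbackQueue and StartFakeGeneration are hypothetical names, and the "inference" is just a loop over pre-made tokens. The point is only the pattern, in which a worker thread produces tokens and posts callbacks to a queue that the main thread drains.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: a thread-safe queue of callbacks to run on the main thread.
class CallbackQueue {
public:
    // Safe to call from any thread.
    void Post(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(std::move(task));
    }

    // Call from the main thread (for example once per frame).
    // Runs every queued callback in order and returns how many ran.
    std::size_t Drain() {
        std::queue<std::function<void()>> pending;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(pending, tasks_);
        }
        const std::size_t count = pending.size();
        while (!pending.empty()) {
            pending.front()();
            pending.pop();
        }
        return count;
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> tasks_;
};

// Stand-in for inference: a worker thread "generates" tokens and queues one
// callback per token, followed by a completion callback.
std::thread StartFakeGeneration(CallbackQueue& queue,
                                std::vector<std::string> tokens,
                                std::function<void(const std::string&)> on_token,
                                std::function<void()> on_complete) {
    return std::thread([&queue,
                        tokens = std::move(tokens),
                        on_token = std::move(on_token),
                        on_complete = std::move(on_complete)]() {
        for (const std::string& token : tokens) {
            queue.Post([on_token, token]() { on_token(token); });
        }
        queue.Post(on_complete);
    });
}
```

Because the callbacks only run inside Drain(), which the main thread calls, they can touch UI or game state without extra locking. The worker thread never executes them directly.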

Common Use Cases

In-game chatbots and assistants: Q&A, help systems, dynamic tutorials
NPC dialogue: Conversational NPCs with persistent per-character memory using conversation snapshots
Long-running roleplay and narrative systems: Automatic summarization keeps multi-hour conversations within context limits without losing key facts
Procedural content: Generate quest descriptions, item lore, and dialogue trees on the fly
Offline-first applications: Anything that needs LLM capabilities without a network connection

Model Storage and Packaging

Models are stored as .gguf files in the Content/RuntimeLocalLLM/Models directory of your project. The plugin automatically configures Additional Non-Asset Directories To Copy (DirectoriesToAlwaysStageAsNonUFS) so that model files ship with your packaged project and remain accessible via standard file I/O at runtime.

Each model also has a .json sidecar file that stores its metadata (display name, family, variant, description, parameter count).

Supported Models

The plugin works with any model in GGUF format. The editor provides a catalog of popular pre-defined models for one-click download, and you can import any custom GGUF file. Common model families include:

Llama (Meta) — 1B, 3B, 8B, and larger
Mistral / Mixtral — 7B and larger
Phi (Microsoft) — 2B, 3B, 4B
Gemma (Google) — 2B, 7B
Qwen (Alibaba) — 1.5B, 7B, and larger
TinyLlama — 1.1B
And many more community models

Quantization

Models come in various quantization levels that trade off quality for size and speed:

Quantization	Quality	Size	Speed
Q2_K	Lower	Smallest	Fastest
Q4_K_M	Good	Medium	Fast
Q5_K_M	Better	Larger	Moderate
Q8_0	High	Large	Slower
F16 / F32	Highest	Largest	Slowest

For mobile and VR devices, smaller quantizations (Q2_K through Q4_K_M) with compact models (1B–3B parameters) are recommended. On desktop, you can use larger models and higher-precision quantizations (such as Q5_K_M or Q8_0), depending on available RAM and CPU/GPU resources.

Additional Resources

Get it on Fab
Product website
Download Demo (Windows)
Video tutorial
Plugin Support & Custom Development: [email protected] (tailored solutions for teams & organizations)

Join our Discord
