Runtime Local LLM Documentation

Runtime Local LLM is a plugin that runs large language models entirely on-device using llama.cpp, with no internet connection required at runtime. It supports GGUF model files and provides a full Blueprint API for loading models, sending messages, and receiving token-by-token responses. Inference runs on a background thread, and callbacks fire on the game thread.

The plugin supports Windows, Mac, Linux, Android (including Meta Quest and other Android-based platforms), and iOS.

Key Features

  • Complete offline inference: No cloud services or API keys at runtime
  • GGUF model support: Load any GGUF-format model (Llama, Mistral, Phi, Gemma, Qwen, etc.)
  • Up-to-date llama.cpp: Updated regularly on Fab to keep pace with llama.cpp releases, so newly released GGUF models and format changes are supported soon after llama.cpp adds them
  • Hardware acceleration: Uses the GPU through Vulkan on Windows and Linux and through Metal on Mac and iOS; on Android and Meta Quest, inference runs on the CPU with SIMD intrinsics
  • Multiple model loading methods:
    • Load from a local file path
    • Load by model name (dropdown selection in Blueprints)
    • Download from URL and load automatically
    • Download-only for pre-caching models
  • Token-by-token streaming: Receive each token as it generates for real-time display
  • Async Blueprint nodes: Nodes with output delegates for loading, sending messages, and downloading
  • Configurable inference parameters: Temperature, Top-P, Top-K, repeat penalty, GPU layer offloading, context size, seed, thread count, and system prompt
  • Conversation management: Multi-turn conversations with context reset, save/load to disk, in-memory snapshots, and automatic summarization for long-running chats
  • Editor model manager: Browse, download, import, delete, and test models directly in project settings
  • Cross-platform packaging: Models ship with your project via NonUFS staging
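
To make the Temperature, Top-K, and Top-P parameters listed above more concrete, here is a minimal standalone C++ sketch of how these three settings typically reshape a model's output distribution before a token is drawn. It is not the plugin's API or llama.cpp's sampler code: the function name FilteredProbabilities is hypothetical, and the real order of operations and edge-case handling in llama.cpp may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of the plugin): returns one probability per
// token id, in the same order as `logits`. Tokens removed by Top-K or Top-P
// get probability 0; the survivors sum to 1.
std::vector<double> FilteredProbabilities(const std::vector<double>& logits,
                                          double temperature,
                                          std::size_t top_k,
                                          double top_p) {
    assert(!logits.empty() && temperature > 0.0 && top_p > 0.0);

    // Order token ids from most to least likely (ties keep their id order).
    std::vector<std::size_t> order(logits.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return logits[a] > logits[b];
                     });

    // Top-K: keep only the k best candidates (0 means "no limit" in this sketch).
    const std::size_t keep =
        (top_k == 0) ? order.size() : std::min(top_k, order.size());

    // Temperature-scaled softmax over the kept candidates.
    // Subtracting the largest logit keeps exp() numerically stable.
    const double max_logit = logits[order[0]];
    std::vector<double> probs(keep);
    double sum = 0.0;
    for (std::size_t i = 0; i < keep; ++i) {
        probs[i] = std::exp((logits[order[i]] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;

    // Top-P: keep the smallest prefix whose cumulative probability reaches top_p.
    double cumulative = 0.0;
    std::size_t nucleus = 0;
    while (nucleus < keep) {
        cumulative += probs[nucleus++];
        if (cumulative >= top_p) break;
    }

    // Renormalize the survivors and scatter them back to token-id order.
    std::vector<double> result(logits.size(), 0.0);
    for (std::size_t i = 0; i < nucleus; ++i) {
        result[order[i]] = probs[i] / cumulative;
    }
    return result;
}
```

In this sketch, a lower temperature concentrates probability on the top candidates, a smaller Top-K caps how many candidates are considered at all, and a smaller Top-P drops the low-probability tail. A sampler would then draw the next token from the returned distribution using the configured seed.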

How It Works

  1. Manage models in the editor: Use the plugin settings panel to browse a catalog of pre-defined models, download them, or import your own GGUF files
  2. Load a model at runtime: Call one of the load functions (by file, by name, by URL, or by metadata) with your inference parameters
  3. Send messages: Pass a user message to the LLM instance; tokens stream back through delegates as the model generates a response
  4. Use the response: Display tokens in a chat UI, drive NPC dialogue, generate dynamic content, or feed into other systems

All inference runs on a dedicated background thread. Callbacks (token generation, completion, errors) fire on the game thread, so you can safely update UI and game state from them.
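
The threading model described above can be illustrated with a small standalone C++ sketch. It does not use the plugin or any engine API: MainThreadDispatcher, StartFakeGeneration, and RunToCompletion are hypothetical names, and the "inference" is a stand-in that simply emits a fixed list of tokens. The point is the pattern: the worker thread never touches UI or game state directly; it only queues callbacks, and the main thread runs them.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Thread-safe queue of callbacks. The worker thread fills it; the main
// ("game") thread drains it, for example once per frame.
class MainThreadDispatcher {
public:
    // Safe to call from any thread: schedule work for the main thread.
    void Post(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(std::move(task));
    }

    // Call from the main thread only: run everything queued so far.
    // Returns the number of callbacks executed.
    std::size_t Drain() {
        std::queue<std::function<void()>> pending;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            std::swap(pending, tasks_);
        }
        const std::size_t count = pending.size();
        while (!pending.empty()) {
            pending.front()();  // Runs outside the lock, on the calling thread.
            pending.pop();
        }
        return count;
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> tasks_;
};

// Stand-in for inference: "generates" the given tokens on a worker thread,
// posting one callback per token followed by a completion callback.
std::thread StartFakeGeneration(MainThreadDispatcher& dispatcher,
                                std::vector<std::string> tokens,
                                std::function<void(const std::string&)> on_token,
                                std::function<void()> on_complete) {
    return std::thread([&dispatcher, tokens = std::move(tokens),
                        on_token = std::move(on_token),
                        on_complete = std::move(on_complete)]() {
        for (const std::string& token : tokens) {
            dispatcher.Post([on_token, token]() { on_token(token); });
        }
        dispatcher.Post(on_complete);
    });
}

// Simulated frame loop: drains callbacks until the completion callback fires,
// then returns the concatenated streamed text.
std::string RunToCompletion(const std::vector<std::string>& tokens) {
    MainThreadDispatcher dispatcher;
    std::string text;
    bool done = false;  // Only touched on this thread, inside Drain().
    std::thread worker = StartFakeGeneration(
        dispatcher, tokens,
        [&text](const std::string& token) { text += token; },
        [&done]() { done = true; });
    while (!done) {
        dispatcher.Drain();
        std::this_thread::yield();
    }
    worker.join();
    assert(done);
    return text;
}
```

Because the callbacks only ever execute inside Drain() on the calling thread, the token handler can append to a string (or, in a game, update a widget) without any locking of its own.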

Common Use Cases

  • In-game chatbots and assistants: Q&A, help systems, dynamic tutorials
  • NPC dialogue: Conversational NPCs with persistent per-character memory using conversation snapshots
  • Long-running roleplay and narrative systems: Automatic summarization keeps multi-hour conversations within context limits without losing key facts
  • Procedural content: Generate quest descriptions, item lore, and dialogue trees on the fly
  • Offline-first applications: Anything that needs LLM capabilities without a network connection
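
The idea behind summarization for long-running conversations can be sketched in standalone C++. This is not the plugin's implementation, which is not described here: Turn, EstimateTokens, and CompactHistory are hypothetical names, the four-characters-per-token estimate is a rough assumption, and the summarize callback stands in for whatever produces the summary (in practice, likely a request to the model itself).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One message in a conversation (hypothetical structure for this sketch).
struct Turn {
    std::string role;  // "system", "user", or "assistant"
    std::string text;
};

// Rough token estimate: about 4 characters per token. This is an assumption;
// exact counts come from the model's tokenizer.
std::size_t EstimateTokens(const std::vector<Turn>& turns) {
    std::size_t chars = 0;
    for (const Turn& turn : turns) chars += turn.text.size();
    return (chars + 3) / 4;
}

// If the history exceeds `token_budget`, replace the oldest turns with a
// single summary turn and keep the most recent `keep_recent` turns verbatim.
// Returns true when the history was compacted.
bool CompactHistory(
    std::vector<Turn>& history,
    std::size_t token_budget,
    std::size_t keep_recent,
    const std::function<std::string(const std::vector<Turn>&)>& summarize) {
    if (EstimateTokens(history) <= token_budget ||
        history.size() <= keep_recent) {
        return false;  // Within budget, or nothing old enough to summarize.
    }
    const std::size_t split = history.size() - keep_recent;
    const std::vector<Turn> old_turns(history.begin(), history.begin() + split);

    std::vector<Turn> compacted;
    compacted.push_back(
        {"system", "Summary of earlier conversation: " + summarize(old_turns)});
    compacted.insert(compacted.end(), history.begin() + split, history.end());
    history = std::move(compacted);
    return true;
}
```

Calling CompactHistory before each new message keeps the prompt under the context size while preserving the latest exchanges word for word; how much detail survives depends entirely on the quality of the summary.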

Model Storage and Packaging

Models are stored as .gguf files in the Content/RuntimeLocalLLM/Models directory of your project. The plugin automatically configures Additional Non-Asset Directories To Copy (DirectoriesToAlwaysStageAsNonUFS) so that model files ship with your packaged project and remain accessible via standard file I/O at runtime.

Each model also has a .json sidecar file that stores its metadata (display name, family, variant, description, parameter count).

Supported Models

The plugin works with any model in GGUF format. The editor provides a catalog of popular pre-defined models for one-click download, and you can import any custom GGUF file. Common model families include:

  • Llama (Meta) — 1B, 3B, 8B, and larger
  • Mistral / Mixtral — 7B and larger
  • Phi (Microsoft) — 2B, 3B, 4B
  • Gemma (Google) — 2B, 7B
  • Qwen (Alibaba) — 1.5B, 7B, and larger
  • TinyLlama — 1.1B
  • And many more community models

Quantization

Models come in various quantization levels that trade off quality for size and speed:

Quantization   Quality   Size       Speed
Q2_K           Lower     Smallest   Fastest
Q4_K_M         Good      Medium     Fast
Q5_K_M         Better    Larger     Moderate
Q8_0           High      Large      Slower
F16 / F32      Highest   Largest    Slowest

For mobile and VR devices, smaller quantizations (Q2_K through Q4_K_M) with compact models (1B–3B parameters) are recommended. For desktop, you can use larger models and higher-precision quantizations (such as Q5_K_M or Q8_0), depending on available RAM and CPU/GPU resources.
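
As a back-of-the-envelope aid for choosing a quantization, file size can be approximated as parameter count times average bits per weight, divided by 8. The standalone C++ sketch below uses hypothetical helper names (EstimateModelSizeGB, FitsInMemory) that are not part of the plugin. F16 uses 16 bits per weight, and Q8_0 works out to 8.5 (each block of 32 weights stores 32 one-byte values plus a 2-byte scale). The K-quant variants mix block types, so their averages vary by model and should be treated as approximations.

```cpp
#include <cassert>
#include <cmath>

// Approximate model file size in gigabytes (10^9 bytes) from the parameter
// count and the average number of bits stored per weight. Ignores metadata
// and tokenizer data, which are small relative to the weights.
double EstimateModelSizeGB(double parameter_count, double bits_per_weight) {
    assert(parameter_count > 0.0 && bits_per_weight > 0.0);
    return parameter_count * bits_per_weight / 8.0 / 1e9;
}

// Rough fit check: the weights plus some headroom (context cache, the game
// itself) must fit in the memory available to the process.
// `headroom_gb` is a value you choose; this sketch does not derive it.
bool FitsInMemory(double model_size_gb, double available_gb, double headroom_gb) {
    return model_size_gb + headroom_gb <= available_gb;
}
```

For example, EstimateModelSizeGB(3e9, 16.0) gives about 6 GB for a 3B-parameter model in F16, while the same model at 8.5 bits per weight comes to roughly 3.2 GB. Actual memory use at runtime is higher than file size because the context also consumes memory, and it grows with context size.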

Additional Resources

Join our Discord