Overview

Runtime Text To Speech is a plugin that enables real-time, offline, and cross-platform text-to-speech synthesis. It supports 41 languages, over 900 voices, and 190+ voice qualities – now featuring Kokoro 🚀, a cutting-edge open-source voice model family with studio-quality output. The plugin is fast, lightweight, and ideal for games, apps, and projects requiring natural-sounding speech.

Currently, the plugin supports the following platforms: Windows, Linux, Mac, Android (including Meta Quest), and iOS.

📹 See It in Action
Watch the YouTube Demo or test generic voice samples at Piper Samples.

Kokoro

The plugin now supports Kokoro voice models: high-quality open-source TTS models published on Hugging Face.

49 high-quality models across 8 languages:
🇺🇸 English (US) • 🇬🇧 English (UK) • 🇨🇳 Simplified Chinese • 🇪🇸 Spanish • 🇧🇷 Portuguese • 🇮🇳 Hindi • 🇫🇷 French • 🇮🇹 Italian
Live preview available: Test Kokoro Voices

Why Kokoro?

The Kokoro voice models are among the highest-quality open-source TTS solutions currently available.

Key Features

Complete offline synthesis: No internet connection required
Multiple synthesis modes:
- Regular synthesis: Generate complete audio for the entire text
- Streaming synthesis: Process audio chunks in real-time as they're generated
Cancellation support: Interrupt ongoing synthesis operations at any time
Cross-platform compatibility: Works on all major platforms
Blueprint and C++ support: Full API access in both environments
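
The difference between regular and streaming synthesis, and how cancellation fits in, can be pictured with a minimal sketch in standard C++. Everything in it is an assumption made for illustration: `FakeSynthesizer`, `synthesizeStreaming`, `synthesizeAll`, and the chunk sizes are hypothetical names and values, not the plugin's API. The stand-in only produces silence so that the control flow stays visible.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a synthesizer; NOT the plugin's API.
// It "generates" silence in fixed-size chunks to show the control flow only.
struct FakeSynthesizer {
    std::size_t totalSamples = 1000;  // assumed length of the utterance, in samples
    std::size_t chunkSize = 256;      // assumed number of samples produced per step

    // Streaming mode: hand each chunk to onChunk as soon as it is produced.
    // Returns false if cancellation was requested before synthesis finished.
    bool synthesizeStreaming(
        const std::atomic<bool>& cancelRequested,
        const std::function<void(const std::vector<float>&)>& onChunk) const {
        assert(chunkSize > 0);
        std::size_t produced = 0;
        while (produced < totalSamples) {
            if (cancelRequested.load()) {
                return false;  // cancellation is checked between chunks
            }
            const std::size_t n = std::min(chunkSize, totalSamples - produced);
            onChunk(std::vector<float>(n, 0.0f));
            produced += n;
        }
        return true;
    }

    // Regular mode: collect every chunk and return the complete buffer at once.
    std::vector<float> synthesizeAll() const {
        std::vector<float> all;
        std::atomic<bool> neverCancel{false};
        synthesizeStreaming(neverCancel, [&all](const std::vector<float>& chunk) {
            all.insert(all.end(), chunk.begin(), chunk.end());
        });
        return all;
    }
};
```

In this sketch the caller of `synthesizeStreaming` can start consuming audio after the first chunk, while `synthesizeAll` returns nothing until the whole text has been processed. Raising the cancel flag from another thread, or from inside the chunk callback, stops the loop at the next chunk boundary.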

Installation

To get started, install voice models via the plugin settings on the first run. After installation, you can begin using the plugin in your project. For detailed instructions, refer to the How to use the plugin page.

Plugin Details

This plugin provides real-time text-to-speech synthesis using the Piper, Kokoro, and ONNX Runtime libraries. You can download and manage multiple voice models in the editor and then package them with your project.

The core functionality consists of text input processing and voice model selection for synthesis. Some voice models support multiple speakers: for instance, English LibriTTS includes over 900 different speakers, and German Thorsten Emotional has 7.

The output is PCM audio data (in float format) together with its sample rate and number of channels. This data can be processed in two ways:

Regular synthesis: Receive the complete audio data when synthesis is finished
Streaming synthesis: Receive audio data in chunks as they're generated, allowing for real-time processing
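
As a rough illustration of what this output looks like to calling code, the sketch below models it in standard C++. The `PcmAudio` struct, its field names, and the helper functions are hypothetical, and the interleaved sample layout is an assumption rather than something the plugin documents here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for synthesized audio; names are illustrative only.
struct PcmAudio {
    std::vector<float> samples;  // float PCM samples, assumed interleaved by channel
    int sampleRate = 0;          // frames per second
    int numChannels = 0;         // 1 = mono, 2 = stereo, ...
};

// Duration in seconds: frames = samples / channels, seconds = frames / sampleRate.
double durationSeconds(const PcmAudio& audio) {
    assert(audio.sampleRate > 0 && audio.numChannels > 0);
    const std::size_t frames =
        audio.samples.size() / static_cast<std::size_t>(audio.numChannels);
    return static_cast<double>(frames) / audio.sampleRate;
}

// Streaming consumer: append each incoming chunk to the running buffer.
// With regular synthesis, the whole buffer would arrive in a single call.
void appendChunk(PcmAudio& audio, const std::vector<float>& chunk) {
    audio.samples.insert(audio.samples.end(), chunk.begin(), chunk.end());
}
```

With streaming synthesis, `appendChunk` would be called once per received chunk, so `durationSeconds` grows as audio arrives. With regular synthesis, the full sample buffer is available only once synthesis has finished.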

Converting this raw audio data into a playable sound wave usually requires the Runtime Audio Importer plugin, which provides both regular and streaming playback capabilities.

Additional Resources

Get it on Fab
Product website
Download Demo (Windows)
Discord support server
Video tutorial
Custom Development: [email protected] (tailored solutions for teams & organizations)
