How to use the plugin

The Runtime Speech Recognizer plugin is designed to recognize words from incoming audio data. It uses a slightly modified version of whisper.cpp to work with the engine. To use the plugin, follow these steps:

Editor side

  1. Select the appropriate language models for your project as described here.

Runtime side

  1. Create a Speech Recognizer and set the necessary parameters (CreateSpeechRecognizer; for parameters, see here).
  2. Bind to the needed delegates (OnRecognitionFinished, OnRecognizedTextSegment, and OnRecognitionError).
  3. Start the speech recognition (StartSpeechRecognition).
  4. Process audio data and wait for results from the delegates (ProcessAudioData).
  5. Stop the speech recognizer when needed (e.g., after the OnRecognitionFinished broadcast).

The plugin accepts incoming audio in 32-bit floating-point interleaved PCM format. While it works well with the Runtime Audio Importer, it doesn't directly depend on it.

Recognition parameters

The plugin supports both streaming and non-streaming audio recognition. To adjust recognition parameters for your specific use case, call SetStreamingDefaults or SetNonStreamingDefaults. You can also set individual parameters manually, such as the number of threads, the step size, whether to translate the incoming language to English, and whether to use the past transcription. Refer to the Recognition Parameter List for a complete list of available parameters.

Improving performance

Refer to the How to improve performance section for tips on optimizing the plugin's performance.

Voice Activity Detection (VAD)

When processing audio input, especially in streaming scenarios, it's recommended to use Voice Activity Detection (VAD) to filter out empty or noise-only audio segments before they reach the recognizer. This filtering can be enabled on the capturable sound wave side using the Runtime Audio Importer plugin, and it helps prevent the language models from hallucinating, that is, finding patterns in noise and producing incorrect transcriptions. For detailed instructions on VAD configuration, refer to the Voice Activity Detection documentation.

In the demo project included with the plugin, VAD is enabled by default. You can find more information about the demo implementation at Demo Project.

Examples

A demo project is included in the plugin's Content -> Demo folder, which you can use as an implementation example.

These examples illustrate how to use the Runtime Speech Recognizer plugin with both streaming and non-streaming audio input, using the Runtime Audio Importer to obtain the audio data. Note that the Runtime Audio Importer must be downloaded separately to access the audio importing features showcased in the examples (e.g. the capturable sound wave and ImportAudioFromFile). The examples are intended only to illustrate the core concept and do not include error handling.

Streaming audio input examples

Note: In UE 5.3 and other versions, you might encounter missing nodes after copying Blueprints, due to differences in node serialization between engine versions. Always verify that all nodes in your implementation are properly connected.

1. Basic streaming recognition

This example demonstrates the basic setup for capturing audio data from the microphone as a stream using the Capturable sound wave and passing it to the speech recognizer. It records speech for about 5 seconds and then processes the recognition, making it suitable for quick tests and simple implementations. Copyable nodes.

Key features of this setup:

  • Fixed 5-second recording duration
  • Simple one-shot recognition
  • Minimal setup requirements
  • Perfect for testing and prototyping

2. Controlled streaming recognition

This example extends the basic streaming setup by adding manual control over the recognition process. It allows you to start and stop the recognition at will, making it suitable for scenarios where you need precise control over when the recognition occurs. Copyable nodes.

Key features of this setup:

  • Manual start/stop control
  • Continuous recognition capability
  • Flexible recording duration
  • Suitable for interactive applications

3. Voice-activated command recognition

This example is optimized for command recognition scenarios. It combines streaming recognition with Voice Activity Detection (VAD) to automatically process speech when the user stops talking. The recognizer starts processing accumulated speech only when silence is detected, making it ideal for command-based interfaces. Copyable nodes.

Key features of this setup:

  • Manual start/stop control
  • VAD enabled to detect speech segments
  • Automatic recognition triggering when silence is detected
  • Optimal for short command recognition
  • Reduced processing overhead by only recognizing actual speech

Non-streaming audio input

This example imports audio data into an Imported sound wave and recognizes the full audio once the import has completed. Copyable nodes.