Voice Activity Detection
Streaming Sound Wave, along with its derived types such as Capturable Sound Wave, supports Voice Activity Detection (VAD). VAD filters incoming audio data to populate the internal buffer only when voice is detected.
The plugin offers two VAD implementations:
- Default VAD
- Silero VAD
The default implementation uses libfvad, a lightweight voice activity detection library that works efficiently across all platforms and engine versions supported by Runtime Audio Importer.
Available as an extension plugin, Silero VAD is a neural network-based voice activity detector that provides higher accuracy, especially in noisy environments. It uses machine learning to more reliably distinguish speech from background noise.
Basic Usage
To enable VAD after creating a sound wave, use the ToggleVAD function:
- Blueprint
- C++
// Assuming StreamingSoundWave is a UE reference to a UStreamingSoundWave object (or its derived type, such as UCapturableSoundWave)
StreamingSoundWave->ToggleVAD(true);
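To illustrate what "populating the buffer only when voice is detected" means, here is a minimal standalone C++ sketch. It does not use the plugin or Unreal Engine: FVoiceGatedBuffer, FFrame, and the detector callable are hypothetical names invented for this example, and the ToggleVAD method only mirrors the plugin function by name. The real detection logic (libfvad or Silero) is replaced by whatever callable you pass in.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// One frame of 16-bit PCM samples (hypothetical alias for this sketch)
using FFrame = std::vector<int16_t>;

// Hypothetical buffer that mimics "populate only when voice is detected".
// This is NOT the plugin's implementation, only an illustration of the idea.
class FVoiceGatedBuffer
{
public:
    // The detector is any callable returning true when a frame contains voice
    explicit FVoiceGatedBuffer(std::function<bool(const FFrame&)> InDetector)
        : Detector(std::move(InDetector))
    {
        assert(Detector && "a detector callable is required");
    }

    // Mirrors the plugin's function name only; the behavior here is assumed
    void ToggleVAD(bool bEnable) { bVADEnabled = bEnable; }

    // Stores the frame unless VAD is enabled and the detector rejects it.
    // Returns true when the frame was stored.
    bool Append(const FFrame& Frame)
    {
        if (bVADEnabled && !Detector(Frame))
        {
            return false;
        }
        Samples.insert(Samples.end(), Frame.begin(), Frame.end());
        return true;
    }

    const FFrame& GetSamples() const { return Samples; }

private:
    std::function<bool(const FFrame&)> Detector;
    bool bVADEnabled = false;
    FFrame Samples;
};
```

With VAD disabled, every frame is stored; once it is enabled, frames rejected by the detector never reach the buffer.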
After enabling VAD, you can reset it at any time:
- Blueprint
- C++
// Reset the VAD
StreamingSoundWave->ResetVAD();
Default VAD Settings
When using the default VAD provider, you can adjust its aggressiveness by changing the VAD mode:
- Blueprint
- C++
// Set the VAD mode (only works with the default VAD provider)
StreamingSoundWave->SetVADMode(ERuntimeVADMode::VeryAggressive);
The mode parameter controls how aggressively the VAD filters audio. More aggressive modes are more restrictive: they are less likely to report false positives (non-speech classified as speech) but may miss some speech.
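The trade-off between modes can be pictured with a toy model. The standalone C++ sketch below assumes that each mode simply raises an energy threshold; this is a deliberate simplification and not how libfvad actually classifies audio. EToyVADMode, ToyThresholdForMode, ToyIsVoice, and the threshold values are all hypothetical; only the four-mode scale from 0 to 3 follows libfvad's numbering.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Four modes numbered 0 to 3, following libfvad's scale (names are illustrative)
enum class EToyVADMode { Quality = 0, LowBitrate = 1, Aggressive = 2, VeryAggressive = 3 };

// Toy assumption: a more aggressive mode means a higher energy threshold
int ToyThresholdForMode(EToyVADMode Mode)
{
    switch (Mode)
    {
        case EToyVADMode::Quality:        return 250;
        case EToyVADMode::LowBitrate:     return 500;
        case EToyVADMode::Aggressive:     return 750;
        case EToyVADMode::VeryAggressive: return 1000;
    }
    return 1000;
}

// Classifies a frame as voice when its mean absolute amplitude exceeds
// the threshold of the selected mode
bool ToyIsVoice(const std::vector<int16_t>& Frame, EToyVADMode Mode)
{
    if (Frame.empty())
    {
        return false;
    }
    long long Sum = 0;
    for (int16_t Sample : Frame)
    {
        Sum += std::abs(static_cast<int>(Sample));
    }
    const long long Mean = Sum / static_cast<long long>(Frame.size());
    return Mean > ToyThresholdForMode(Mode);
}
```

In this model a faint frame is accepted in the least aggressive mode and rejected in the most aggressive one, which is the same kind of behavior the mode setting trades off: fewer false positives at the cost of possibly missing quiet speech.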
VAD Providers
After enabling VAD with the ToggleVAD function, you can choose between different Voice Activity Detection providers to suit your needs. The default provider is built-in, while additional providers such as Silero VAD are available through extension plugins.
- Blueprint
- C++
// Assuming StreamingSoundWave is a UE reference to a UStreamingSoundWave object (or its derived type, such as UCapturableSoundWave)
// Make sure to call ToggleVAD(true) before setting the provider
// Set the VAD provider to Silero VAD
StreamingSoundWave->SetVADProvider(URuntimeSileroVADProvider::StaticClass());
Silero VAD Extension
Silero VAD provides more accurate speech detection using neural networks. To use it:
- Ensure the Runtime Audio Importer plugin is already installed in your project
- Download the Silero VAD extension plugin from Google Drive
- Extract the folder from the downloaded archive into the Plugins folder of your project (create this folder if it doesn't exist)
- Rebuild your project (this extension requires a C++ project)
- The default VAD works with all engine versions supported by Runtime Audio Importer (UE 4.24, 4.25, 4.26, 4.27, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, and 5.6)
- Silero VAD supports Unreal Engine 4.27 and all UE5 versions (4.27, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, and 5.6)
- Silero VAD is currently available for Windows only
- This extension is provided as source code and requires a C++ project to use
- For more information on how to build plugins manually, see the Building Plugins tutorial
Once installed, you can select it as your VAD provider by passing the Silero provider class (URuntimeSileroVADProvider) to the SetVADProvider function.
Speech Start and End Detection
In addition to detecting the presence of speech, Voice Activity Detection can report when speech activity starts and ends. This is useful for triggering events when speech begins or ends during playback or capture.
You can customize the sensitivity of speech start and end detection by adjusting the minimum speech duration and the silence duration. These parameters help avoid false triggers, such as a brief noise being reported as the start of speech or a short pause between words being reported as the end of speech.
Minimum Speech Duration
The Minimum Speech Duration parameter sets the minimum amount of continuous voice activity required to trigger a speech start event. This helps filter out brief noises that shouldn't be considered speech, to make sure that only sustained voice activity is recognized. The default value for Minimum Speech Duration is 300 milliseconds.
- Blueprint
- C++
// Assuming StreamingSoundWave is a UE reference to a UStreamingSoundWave object (or its derived type, such as UCapturableSoundWave)
// Set the minimum speech duration
StreamingSoundWave->SetMinimumSpeechDuration(200);
Silence Duration
The Silence Duration parameter sets the duration of silence required to trigger a speech end event. This prevents speech detection from ending prematurely during natural pauses between words or sentences. The default value for Silence Duration is 500 milliseconds.
- Blueprint
- C++
// Assuming StreamingSoundWave is a UE reference to a UStreamingSoundWave object (or its derived type, such as UCapturableSoundWave)
// Set the silence duration
StreamingSoundWave->SetSilenceDuration(700);
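The way these two parameters interact can be sketched as a small state machine. The standalone C++ example below is an assumption about the general technique, not the plugin's actual implementation: FSpeechBoundaryTracker and its members are hypothetical names, and the per-frame voice decision is supplied by the caller rather than by a real VAD.

```cpp
#include <cassert>
#include <functional>

// Hypothetical tracker that turns per-frame voice decisions into
// speech start and end events. Not part of the plugin.
class FSpeechBoundaryTracker
{
public:
    FSpeechBoundaryTracker(int InMinSpeechMs, int InSilenceMs)
        : MinSpeechMs(InMinSpeechMs), SilenceMs(InSilenceMs)
    {
        assert(MinSpeechMs >= 0 && SilenceMs >= 0);
    }

    // Optional callbacks invoked on state changes
    std::function<void()> OnSpeechStarted;
    std::function<void()> OnSpeechEnded;

    // Feed one voice decision covering FrameMs milliseconds of audio
    void ProcessFrame(bool bVoice, int FrameMs)
    {
        assert(FrameMs > 0);
        if (bVoice)
        {
            SilenceRunMs = 0;
            VoiceRunMs += FrameMs;
            // Start only after enough continuous voice has accumulated
            if (!bSpeaking && VoiceRunMs >= MinSpeechMs)
            {
                bSpeaking = true;
                if (OnSpeechStarted) OnSpeechStarted();
            }
        }
        else
        {
            VoiceRunMs = 0;
            SilenceRunMs += FrameMs;
            // End only after enough continuous silence has accumulated
            if (bSpeaking && SilenceRunMs >= SilenceMs)
            {
                bSpeaking = false;
                if (OnSpeechEnded) OnSpeechEnded();
            }
        }
    }

    bool IsSpeaking() const { return bSpeaking; }

private:
    int MinSpeechMs;
    int SilenceMs;
    int VoiceRunMs = 0;
    int SilenceRunMs = 0;
    bool bSpeaking = false;
};
```

With the default values mentioned above (300 ms and 500 ms), a 100 ms noise burst never triggers a start event in this sketch, and a 400 ms pause in the middle of an utterance does not trigger an end event.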
Binding to Speech Delegates
You can bind to delegates that are broadcast when speech starts or ends. This is useful for triggering custom behavior based on speech activity, such as starting or stopping text recognition, or adjusting the volume of other audio sources.
- Blueprint
- C++
// Assuming StreamingSoundWave is a UE reference to a UStreamingSoundWave object (or its derived type, such as UCapturableSoundWave)
// Bind to the OnSpeechStartedNative delegate
StreamingSoundWave->OnSpeechStartedNative.AddWeakLambda(this, [this]()
{
    // Handle the result when speech starts
});
// Bind to the OnSpeechEndedNative delegate
StreamingSoundWave->OnSpeechEndedNative.AddWeakLambda(this, [this]()
{
    // Handle the result when speech ends
});
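As one possible use of the handlers above, the following standalone C++ sketch collects the audio that arrives between a speech start and a speech end into separate utterances, for example to hand each one to a recognizer. FUtteranceCollector is a hypothetical helper, not part of the plugin; in a real project its HandleSpeechStarted and HandleSpeechEnded methods would be called from the lambdas bound to the delegates, and how audio chunks reach HandleAudio is an assumption left open here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper that groups audio into utterances based on
// speech start and end notifications. Not part of the plugin.
class FUtteranceCollector
{
public:
    // Call from the speech-started handler
    void HandleSpeechStarted()
    {
        Current.clear();
        bCapturing = true;
    }

    // Call whenever new audio arrives; ignored outside an utterance
    void HandleAudio(const std::vector<float>& Chunk)
    {
        if (bCapturing)
        {
            Current.insert(Current.end(), Chunk.begin(), Chunk.end());
        }
    }

    // Call from the speech-ended handler; stores the finished utterance
    void HandleSpeechEnded()
    {
        if (!bCapturing)
        {
            return;
        }
        bCapturing = false;
        Finished.push_back(std::move(Current));
        Current.clear(); // reset the moved-from vector to a known empty state
    }

    std::size_t NumUtterances() const { return Finished.size(); }

    // Returns a finished utterance by index
    const std::vector<float>& GetUtterance(std::size_t Index) const
    {
        assert(Index < Finished.size());
        return Finished[Index];
    }

private:
    bool bCapturing = false;
    std::vector<float> Current;
    std::vector<std::vector<float>> Finished;
};
```

Audio received before a start event or after an end event is dropped, so each stored utterance contains only the samples that arrived while speech was active.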
Comparing VAD Providers
- Default VAD
- Silero VAD
Default VAD (libfvad)
Advantages:
- Lightweight and efficient
- Works on all platforms
- Minimal resource usage
- Suitable for mobile and low-powered devices
Best for:
- Simple voice detection in quiet environments
- Mobile applications
- Projects where performance is a priority
- When universal platform support is required
Silero VAD
Advantages:
- Higher accuracy in detecting voice
- Superior noise tolerance in challenging environments
- More consistent results across different speakers
- Advanced configuration options for precise control
Best for:
- Applications requiring precise voice detection
- Environments with background noise
- Voice recognition systems
- Professional audio applications
Silero VAD may require more computational resources than the default VAD.