
Voice Activity Detection

Streaming Sound Wave, along with its derived types such as Capturable Sound Wave, supports Voice Activity Detection (VAD). VAD filters incoming audio data to populate the internal buffer only when voice is detected.
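The gating idea described above can be illustrated with a minimal, standalone C++ sketch. It is not the plugin's implementation: `GatedBuffer` and `IsLoudFrame` are hypothetical names, and the detector here is a simple RMS threshold standing in for a real VAD such as libfvad.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a real detector: treats a frame as voice when its
// RMS level exceeds a threshold. A real VAD uses a more elaborate model.
inline bool IsLoudFrame(const std::vector<int16_t>& Frame, double RmsThreshold)
{
    if (Frame.empty())
    {
        return false;
    }
    double SumSquares = 0.0;
    for (int16_t Sample : Frame)
    {
        SumSquares += static_cast<double>(Sample) * Sample;
    }
    return std::sqrt(SumSquares / Frame.size()) > RmsThreshold;
}

// Appends a frame to the internal buffer only when the detector reports voice.
class GatedBuffer
{
public:
    using DetectorFn = std::function<bool(const std::vector<int16_t>&)>;

    explicit GatedBuffer(DetectorFn InDetector) : Detector(std::move(InDetector)) {}

    // Returns true when the frame was kept, false when it was dropped.
    bool Push(const std::vector<int16_t>& Frame)
    {
        if (!Detector(Frame))
        {
            return false;
        }
        Samples.insert(Samples.end(), Frame.begin(), Frame.end());
        return true;
    }

    std::size_t Size() const { return Samples.size(); }

private:
    DetectorFn Detector;
    std::vector<int16_t> Samples;
};
```

Passing the detector in as a function keeps the buffer independent of how voice is detected, which mirrors the way the plugin lets you swap VAD providers.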

The plugin offers two VAD implementations: the built-in default and Silero VAD, which is available as an extension plugin.

The default implementation uses libfvad, a lightweight voice activity detection library that works efficiently across all platforms and engine versions supported by Runtime Audio Importer.

Basic Usage

To enable VAD after creating a sound wave, use the ToggleVAD function:

Toggle VAD node

After enabling VAD, you can reset it at any time:

Reset VAD node

Default VAD Settings

When using the default VAD provider, you can adjust its aggressiveness by changing the VAD mode:

Set VAD Mode node

The mode parameter controls how aggressively the VAD filters audio. Higher values are more restrictive, meaning they're less likely to report false positives but might miss some speech.

VAD Providers

After enabling VAD with the ToggleVAD function, you can choose between different Voice Activity Detection providers to suit your needs. The default provider is built-in, while additional providers such as Silero VAD are available through extension plugins.

Set VAD Provider node

Silero VAD Extension

Silero VAD provides more accurate speech detection using neural networks. To use it:

  1. Ensure the Runtime Audio Importer plugin is already installed in your project
  2. Download the Silero VAD extension plugin from Google Drive
  3. Extract the folder from the downloaded archive into the Plugins folder of your project (create this folder if it doesn't exist)
  4. Rebuild your project (this extension requires a C++ project)
Important
  • The default VAD works with all engine versions supported by Runtime Audio Importer (UE 4.24, 4.25, 4.26, 4.27, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, and 5.6)
  • Silero VAD supports Unreal Engine 4.27 and all UE5 versions (4.27, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5, and 5.6)
  • Silero VAD is currently available for Windows only
  • This extension is provided as source code and requires a C++ project to use
  • For more information on how to build plugins manually, see the Building Plugins tutorial

Once installed, you can select it as your VAD provider by passing the Silero provider class to the SetVADProvider function.

Speech Start and End Detection

Voice Activity Detection not only detects the presence of speech but also reports when speech activity starts and ends. This is useful for triggering events when speech begins or ends during playback or capture.

You can customize the sensitivity of speech start and end detection by adjusting two parameters: the minimum speech duration and the silence duration. Tuning them helps avoid false detections, such as treating a brief noise as speech or ending speech on a short pause between words.

Minimum Speech Duration

The Minimum Speech Duration parameter sets the minimum amount of continuous voice activity required to trigger a speech start event. This filters out brief noises so that only sustained voice activity is recognized as speech. The default value for Minimum Speech Duration is 300 milliseconds.

Set Minimum Speech Duration node

Silence Duration

The Silence Duration parameter sets the duration of silence required to trigger a speech end event. This prevents speech detection from ending prematurely during natural pauses between words or sentences. The default value for Silence Duration is 500 milliseconds.

Set Silence Duration node
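To show how the two durations interact, here is a minimal standalone C++ sketch of a start/end detector driven by per-frame VAD decisions. It assumes the defaults stated above (300 ms and 500 ms); the `SpeechSegmenter` class and `SpeechEvent` enum are hypothetical names for illustration, and the plugin's internal logic may differ.

```cpp
// Events the sketch can report for each processed frame.
enum class SpeechEvent { None, Started, Ended };

// Hypothetical helper: turns per-frame voice/no-voice decisions into
// speech start and end events using two duration thresholds.
class SpeechSegmenter
{
public:
    explicit SpeechSegmenter(int InMinSpeechMs = 300, int InSilenceMs = 500)
        : MinSpeechMs(InMinSpeechMs), SilenceMs(InSilenceMs) {}

    // Feed one frame's VAD decision together with the frame length in milliseconds.
    SpeechEvent Update(bool bVoiceDetected, int FrameMs)
    {
        if (bVoiceDetected)
        {
            SilenceRunMs = 0; // any voice interrupts the current silence run
            if (!bSpeaking)
            {
                VoiceRunMs += FrameMs;
                if (VoiceRunMs >= MinSpeechMs)
                {
                    bSpeaking = true;
                    VoiceRunMs = 0;
                    return SpeechEvent::Started;
                }
            }
        }
        else
        {
            VoiceRunMs = 0; // voice must be continuous to count as speech
            if (bSpeaking)
            {
                SilenceRunMs += FrameMs;
                if (SilenceRunMs >= SilenceMs)
                {
                    bSpeaking = false;
                    SilenceRunMs = 0;
                    return SpeechEvent::Ended;
                }
            }
        }
        return SpeechEvent::None;
    }

    bool IsSpeaking() const { return bSpeaking; }

private:
    int MinSpeechMs;
    int SilenceMs;
    int VoiceRunMs = 0;
    int SilenceRunMs = 0;
    bool bSpeaking = false;
};
```

In this sketch, a 200 ms burst of noise never produces a start event, and a pause shorter than 500 ms in the middle of speech never produces an end event. Raising either threshold makes detection more conservative at the cost of added latency.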

Binding to Speech Delegates

You can bind to specific delegates when speech starts or ends. This is useful for triggering custom behavior based on speech activity, such as starting or stopping text recognition, or adjusting the volume of other audio sources.

Bind Event To On Speech Started

Bind Event To On Speech Ended
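As an illustration of the pattern, the following standalone C++ sketch lowers the volume of other audio while speech is active. It uses plain `std::function` callbacks as a stand-in for the plugin's delegates; `SpeechSignal` and `MusicMixer` are hypothetical names and are not part of Runtime Audio Importer or Unreal Engine.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Plain C++ stand-in for a multicast delegate (not the Unreal delegate types).
class SpeechSignal
{
public:
    // Registers a handler to be called on every broadcast.
    void Bind(std::function<void()> Handler) { Handlers.push_back(std::move(Handler)); }

    // Calls every bound handler in the order it was registered.
    void Broadcast() const
    {
        for (const auto& Handler : Handlers)
        {
            Handler();
        }
    }

private:
    std::vector<std::function<void()>> Handlers;
};

// Hypothetical game-side state: background music is ducked while the user speaks.
struct MusicMixer
{
    float Volume = 1.0f;
    void Duck() { Volume = 0.25f; }
    void Restore() { Volume = 1.0f; }
};
```

A handler bound to the speech-started signal would call `Duck()`, and one bound to the speech-ended signal would call `Restore()`. The same shape applies to starting and stopping speech recognition.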

Comparing VAD Providers

Default VAD (libfvad)

Advantages:

  • Lightweight and efficient
  • Works on all platforms
  • Minimal resource usage
  • Suitable for mobile and low-powered devices

Best for:

  • Simple voice detection in quiet environments
  • Mobile applications
  • Projects where performance is a priority
  • When universal platform support is required