Audio Processing Guide
This guide covers how to set up different audio input methods to feed audio data to your lip sync generators. Make sure you've completed the Setup Guide before proceeding.
Audio Input Processing
You need to set up a method to process audio input. There are several ways to do this depending on your audio source.
- Microphone (Real-time)
- Microphone (Playback)
- Text-to-Speech (Local)
- Text-to-Speech (External APIs)
- From Audio File/Buffer
- Streaming Audio Buffer
Microphone (Real-time)
This approach performs lip sync in real time while speaking into the microphone:
Standard Model:
- Create a Capturable Sound Wave using Runtime Audio Importer
- Before starting to capture audio, bind to the `OnPopulateAudioData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
- Start capturing audio from the microphone
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`.
The Mood-Enabled Model uses the same audio processing workflow, but with the `MoodMetaHumanLipSyncGenerator` variable and additional mood configuration capabilities.
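For a C++ implementation, the flow looks roughly like the sketch below. It is a minimal sketch, not verbatim plugin code: the `UCapturableSoundWave` API comes from Runtime Audio Importer, while the `VisemeGenerator` member and the `ProcessAudioData(PCMData, SampleRate, NumChannels)` signature are assumptions based on this guide — verify both against your installed plugin headers.

```cpp
// Minimal sketch: real-time microphone lip sync (Standard Model).
// CapturableWave and VisemeGenerator are assumed to be UPROPERTY members
// of this component; include paths may differ between plugin versions.
#include "Sound/CapturableSoundWave.h"

void UMyLipSyncComponent::StartRealtimeMicLipSync()
{
    CapturableWave = UCapturableSoundWave::CreateCapturableSoundWave();

    // Bind BEFORE starting capture so no audio chunks are missed.
    CapturableWave->OnPopulateAudioDataNative.AddWeakLambda(this,
        [this](const TArray<float>& PCMData)
        {
            // Pass the sample rate/channel count your capture device actually delivers.
            VisemeGenerator->ProcessAudioData(PCMData, 44100, 1);
        });

    CapturableWave->StartCapture(/*DeviceId*/ 0);
}
```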
Microphone (Playback)
This approach captures audio from a microphone, then plays it back with lip sync:
Standard Model:
- Create a Capturable Sound Wave using Runtime Audio Importer
- Start audio capture from the microphone
- Before playing back the capturable sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`.
The Mood-Enabled Model uses the same audio processing workflow, but with the `MoodMetaHumanLipSyncGenerator` variable and additional mood configuration capabilities.
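In C++, only the binding point changes relative to the real-time sketch above: capture completes first, then you bind to the playback-side PCM delegate before playing the wave. Same naming assumptions as before, plus the assumption that `OnGeneratePCMDataNative` fires with each PCM chunk as the wave plays:

```cpp
// Minimal sketch: record first, lip sync during playback.
#include "Kismet/GameplayStatics.h"

void UMyLipSyncComponent::PlayCapturedAudioWithLipSync()
{
    CapturableWave->StopCapture();

    // Bind BEFORE playback so the animation stays aligned with the audio.
    CapturableWave->OnGeneratePCMDataNative.AddWeakLambda(this,
        [this](const TArray<float>& PCMData)
        {
            VisemeGenerator->ProcessAudioData(PCMData, 44100, 1);
        });

    UGameplayStatics::PlaySound2D(this, CapturableWave);
}
```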
Text-to-Speech (Local)
Regular
This approach synthesizes speech from text using local TTS and performs lip sync:
Standard Model:
- Use Runtime Text To Speech to generate speech from text
- Use Runtime Audio Importer to import the synthesized audio
- Before playing back the imported sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
Realistic Model: Local TTS provided by the Runtime Text To Speech plugin is not currently supported with the regular Realistic model due to ONNX runtime conflicts. For text-to-speech with the regular Realistic model, consider using external TTS solutions or use the Standard model instead.
Mood-Enabled Realistic Model:
- Use Runtime Text To Speech to generate speech from text
- Use Runtime Audio Importer to import the synthesized audio
- Before playing back the imported sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Mood-Enabled Lip Sync Generator
- Configure mood settings based on the text content or desired emotional expression
TTS with Mood Integration:
- Set appropriate moods before or during TTS generation
- Adjust mood intensity to match the emotional tone of the text
- Use different moods for different sections of longer text content
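A hedged C++ sketch of the regular (non-streaming) path: `SynthesizeSpeechToBuffer` is a hypothetical placeholder for the actual Runtime Text To Speech call, and the importer delegate, enum, and parameter types should be verified against your Runtime Audio Importer version. The result handler defined here is reused by the file and external-API sketches later in this guide.

```cpp
// Minimal sketch: local TTS -> import -> lip sync on playback.
void UMyLipSyncComponent::SpeakTextWithLipSync(const FString& Text)
{
    // HYPOTHETICAL placeholder -- check the Runtime Text To Speech API
    // for the real synthesis function and its output format.
    const TArray<uint8> SynthesizedAudio = SynthesizeSpeechToBuffer(Text);

    URuntimeAudioImporterLibrary* Importer = URuntimeAudioImporterLibrary::CreateRuntimeAudioImporter();
    Importer->OnResultNative.AddUObject(this, &UMyLipSyncComponent::OnAudioImported);
    Importer->ImportAudioFromBuffer(SynthesizedAudio, ERuntimeAudioFormat::Wav);
}

// Shared import-result handler, reused by the later file/API sketches.
void UMyLipSyncComponent::OnAudioImported(URuntimeAudioImporterLibrary* /*Importer*/,
    UImportedSoundWave* Wave, ERuntimeImportStatus Status)
{
    if (Status != ERuntimeImportStatus::SuccessfulImport)
    {
        return;
    }

    // Bind before playback, exactly as in the microphone examples.
    Wave->OnGeneratePCMDataNative.AddWeakLambda(this,
        [this](const TArray<float>& PCMData)
        {
            VisemeGenerator->ProcessAudioData(PCMData, 44100, 1);
        });
    UGameplayStatics::PlaySound2D(this, Wave);
}
```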
Streaming
This approach uses streaming text-to-speech synthesis with real-time lip sync:
Standard Model:
- Use Runtime Text To Speech to generate streaming speech from text
- Use Runtime Audio Importer to import the synthesized audio
- Before playing back the streaming sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
Realistic Model: Local TTS provided by the Runtime Text To Speech plugin is not currently supported with the regular Realistic model due to ONNX runtime conflicts. For text-to-speech with the regular Realistic model, consider using external TTS services or use the Standard model instead.
Mood-Enabled Realistic Model:
- Use Runtime Text To Speech to generate streaming speech from text
- Use Runtime Audio Importer to import the synthesized audio
- Before playing back the streaming sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Mood-Enabled Lip Sync Generator
- Adjust mood settings dynamically during streaming if needed
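For the mood-enabled streaming case, the audio path is the same as above; the only addition is adjusting mood between chunks. The snippet below is deliberately vague on the mood API: `SetMood` and `EMood` are hypothetical placeholders, since the actual mood configuration calls are covered in the Configuration guide.

```cpp
// HYPOTHETICAL sketch: adjust mood while streamed TTS audio is processed.
// SetMood/EMood stand in for the plugin's real mood configuration API.
void UMyLipSyncComponent::OnStreamedTTSChunk(const TArray<float>& PCMChunk)
{
    // The mood-enabled model works on 16 kHz frames (see Performance Tips);
    // pass the sample rate your TTS actually produces.
    MoodMetaHumanLipSyncGenerator->ProcessAudioData(PCMChunk, 16000, 1);

    // e.g., switch mood when the streamed text reaches a sad passage.
    MoodMetaHumanLipSyncGenerator->SetMood(EMood::Sad, /*Intensity*/ 0.6f); // hypothetical
}
```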
Text-to-Speech (External APIs)
Regular
This approach uses the Runtime AI Chatbot Integrator plugin to generate synthesized speech from AI services (OpenAI or ElevenLabs) and perform lip sync:
Standard Model:
- Use Runtime AI Chatbot Integrator to generate speech from text using external APIs (OpenAI, ElevenLabs, etc.)
- Use Runtime Audio Importer to import the synthesized audio data
- Before playing back the imported sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`.
The Mood-Enabled Model uses the same audio processing workflow, but with the `MoodMetaHumanLipSyncGenerator` variable and additional mood configuration capabilities.
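The one genuinely new element with external APIs is that the synthesized audio arrives as an encoded buffer (commonly MP3). A short sketch of the import step, assuming Runtime Audio Importer's `ImportAudioFromBuffer` and reusing the assumed `OnAudioImported` handler from the local TTS sketch:

```cpp
// Minimal sketch: import encoded TTS bytes from an external API response.
void UMyLipSyncComponent::OnExternalTTSResponse(const TArray<uint8>& ResponseBytes)
{
    URuntimeAudioImporterLibrary* Importer = URuntimeAudioImporterLibrary::CreateRuntimeAudioImporter();
    Importer->OnResultNative.AddUObject(this, &UMyLipSyncComponent::OnAudioImported);

    // ElevenLabs/OpenAI commonly return MP3; match this to your API settings.
    Importer->ImportAudioFromBuffer(ResponseBytes, ERuntimeAudioFormat::Mp3);
}
```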
Streaming
This approach uses the Runtime AI Chatbot Integrator plugin to generate synthesized streaming speech from AI services (OpenAI or ElevenLabs) and perform lip sync:
Standard Model:
- Use Runtime AI Chatbot Integrator to connect to streaming TTS APIs (like the ElevenLabs Streaming API)
- Use Runtime Audio Importer to import the synthesized audio data
- Before playing back the streaming sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`.
The Mood-Enabled Model uses the same audio processing workflow, but with the `MoodMetaHumanLipSyncGenerator` variable and additional mood configuration capabilities.
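For streamed responses, a streaming sound wave lets you append chunks as they arrive while playback and lip sync are already running. A hedged sketch, assuming Runtime Audio Importer's `UStreamingSoundWave` and `AppendAudioDataFromRAW` (names and parameter types may vary by version), with the chunk format matching whatever your API actually sends:

```cpp
// Minimal sketch: streamed TTS chunks -> streaming sound wave -> lip sync.
void UMyLipSyncComponent::BeginStreamingPlayback()
{
    StreamingWave = UStreamingSoundWave::CreateStreamingSoundWave();

    // Bind before playback, as in every example above.
    StreamingWave->OnGeneratePCMDataNative.AddWeakLambda(this,
        [this](const TArray<float>& PCMData)
        {
            VisemeGenerator->ProcessAudioData(PCMData, 44100, 1);
        });

    UGameplayStatics::PlaySound2D(this, StreamingWave);
}

// Called once per chunk by your streaming TTS callback.
void UMyLipSyncComponent::OnStreamedAudioChunk(const TArray<uint8>& RawPCM16)
{
    // 16-bit PCM, 44.1 kHz mono assumed here; match your API's actual format.
    StreamingWave->AppendAudioDataFromRAW(RawPCM16, ERuntimeRAWAudioFormat::Int16, 44100, 1);
}
```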
From Audio File/Buffer
This approach uses pre-recorded audio files or audio buffers for lip sync:
Standard Model:
- Use Runtime Audio Importer to import an audio file from disk or memory
- Before playing back the imported sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
- Play the imported sound wave and observe the lip sync animation
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`.
The Mood-Enabled Model uses the same audio processing workflow, but with the `MoodMetaHumanLipSyncGenerator` variable and additional mood configuration capabilities.
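The file case is the shortest path: only the import call differs from the TTS sketches, so the same assumed `OnAudioImported` handler can be reused. The file path passed in is purely illustrative.

```cpp
// Minimal sketch: kick off lip sync from an audio file on disk.
void UMyLipSyncComponent::PlayFileWithLipSync(const FString& FilePath)
{
    URuntimeAudioImporterLibrary* Importer = URuntimeAudioImporterLibrary::CreateRuntimeAudioImporter();
    Importer->OnResultNative.AddUObject(this, &UMyLipSyncComponent::OnAudioImported);

    // Auto-detect the container format (WAV, MP3, FLAC, ...).
    Importer->ImportAudioFromFile(FilePath, ERuntimeAudioFormat::Auto);
}
```

Once the import succeeds, the handler binds `OnGeneratePCMData` and starts playback, and the lip sync animation follows the audio.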
Streaming Audio Buffer
For streaming audio data from a buffer, you need:
Standard Model:
- Audio data in float PCM format (an array of floating-point samples) available from your streaming source (or use Runtime Audio Importer to support more formats)
- The sample rate and number of channels
- Call `ProcessAudioData` from your Runtime Viseme Generator with these parameters as audio chunks become available
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`.
The Mood-Enabled Model uses the same audio processing workflow, but with the `MoodMetaHumanLipSyncGenerator` variable and additional mood configuration capabilities.
Note: When using streaming audio sources, make sure to manage audio playback timing appropriately to avoid distorted playback. See the Streaming Sound Wave documentation for more information.
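Since no importer is involved, the direct-buffer case reduces to a single call per chunk. A minimal sketch, assuming the `ProcessAudioData(PCMData, SampleRate, NumChannels)` signature this guide describes:

```cpp
// Minimal sketch: feed float PCM chunks from any custom source directly.
void UMyLipSyncComponent::OnAudioChunkReceived(const TArray<float>& PCMChunk,
    int32 SampleRate, int32 NumChannels)
{
    // Call once per chunk, as soon as data becomes available from your stream.
    VisemeGenerator->ProcessAudioData(PCMChunk, SampleRate, NumChannels);
}
```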
Processing Performance Tips
- Chunk Size: If you want to process audio data in smaller chunks for more responsive lip sync, adjust the calculation in the `SetNumSamplesPerChunk` function. For example, dividing the sample rate by 150 (streaming every ~6.67 ms) instead of 100 (streaming every 10 ms) will provide more frequent lip sync updates (see the sketch after this list).
- Buffer Management: The mood-enabled model processes audio in 320-sample frames (20 ms at 16 kHz). Align your audio input timing with these frame boundaries for optimal performance.
- Generator Recreation: For reliable operation with Realistic models, recreate the generator each time you want to feed new audio data after a period of inactivity.
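For the chunk-size tip, the adjustment might look like the line below, assuming `SetNumSamplesPerChunk` accepts the computed chunk size; depending on how your project is wired up, you may instead edit the division inside that function.

```cpp
// Smaller chunks -> more frequent lip sync updates.
// 44100 / 150 = 294 samples per chunk (~6.67 ms at 44.1 kHz).
VisemeGenerator->SetNumSamplesPerChunk(44100 / 150);
```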
Next Steps
Once you have audio processing set up, you may want to:
- Learn about Configuration options to fine-tune your lip sync behavior
- Add laughter animation for enhanced expressiveness
- Combine lip sync with existing facial animations using the layering techniques described in the Configuration guide