How to use the plugin
This guide walks you through the process of setting up Runtime MetaHuman Lip Sync for your MetaHuman characters.
Note: Runtime MetaHuman Lip Sync works with both MetaHuman and custom characters. The plugin supports various character types including:
- Popular commercial characters (Daz Genesis 8/9, Reallusion CC3/CC4, Mixamo, ReadyPlayerMe, etc.)
- Characters with FACS-based blendshapes
- Models using ARKit blendshape standards
- Characters with Preston Blair phoneme sets
- 3ds Max phoneme systems
- Any character with custom morph targets for facial expressions
For detailed instructions on setting up custom characters, including viseme mapping references for all the above standards, see the Custom character setup guide.
Prerequisites
Before getting started, ensure:
- The MetaHuman plugin is enabled in your project (Note: Starting from UE 5.6, this step is no longer required as MetaHuman functionality is integrated directly into the engine)
- You have at least one MetaHuman character downloaded and available in your project
- The Runtime MetaHuman Lip Sync plugin is installed
Standard Model Extension Plugin
If you plan to use the Standard (Faster) Model, you'll need to install the extension plugin:
- Download the Standard Lip Sync Extension plugin from Google Drive
- Extract the folder from the downloaded archive into the `Plugins` folder of your project (create this folder if it doesn't exist)
- Ensure your project is set up as a C++ project (even if you don't have any C++ code)
- Rebuild your project
- This extension is only required if you want to use the Standard Model. If you only need the Realistic Model, you can skip this step.
- For more information on how to build plugins manually, see the Building Plugins tutorial
Additional Plugins
- If you plan to use audio capture (e.g., microphone input), install the Runtime Audio Importer plugin.
- If you plan to use text-to-speech functionality with my plugins (you may instead have your own custom TTS or another audio input), then in addition to the Runtime Audio Importer plugin, also install:
- For local TTS, the Runtime Text To Speech plugin.
- For external TTS providers (ElevenLabs, OpenAI), the Runtime AI Chatbot Integrator plugin.
Platform-Specific Configuration
Android / Meta Quest Configuration
If you're targeting Android or Meta Quest platforms and encounter build errors with this plugin, you'll need to disable the x86_64 (x64) Android architecture in your project settings:
- Go to Edit > Project Settings
- Navigate to Platforms > Android
- Under Platforms - Android > Build, find Support x86_64 [aka x64] and ensure it's disabled
This is because the plugin currently only supports arm64-v8a and armeabi-v7a architectures for Android / Meta Quest platforms.
Setup Process
Step 1: Locate and modify the face animation Blueprint
- UE 5.5 and Earlier (or Legacy MetaHumans in UE 5.6+)
- UE 5.6+ MetaHuman Creator Characters
You need to modify an Animation Blueprint that will be used for your MetaHuman character's facial animations. The default MetaHuman face Animation Blueprint is located at:
Content/MetaHumans/Common/Face/Face_AnimBP
You have several options for implementing the lip sync functionality:
- Edit Default Asset (Simplest Option)
- Create Duplicate
- Use Custom Animation Blueprint
Open the default `Face_AnimBP` directly and make your modifications. Any changes will affect all MetaHuman characters using this Animation Blueprint.
Note: This approach is convenient but will impact all characters using the default Animation Blueprint.
- Duplicate `Face_AnimBP` and give it a descriptive name
- Locate your character's Blueprint class (e.g., for character "Bryan", it would be at `Content/MetaHumans/Bryan/BP_Bryan`)
- Open the character Blueprint and find the Face component
- Change the Anim Class property to your newly duplicated Animation Blueprint
Note: This approach allows you to customize lip sync for specific characters while leaving others unchanged.
You can implement the lip sync blending in any Animation Blueprint that has access to the required facial bones:
- Create or use an existing custom Animation Blueprint
- Ensure your Animation Blueprint works with a skeleton that contains the same facial bones as the default MetaHuman's `Face_Archetype_Skeleton` (the standard skeleton used for any MetaHuman character)
Note: This approach gives you maximum flexibility for integration with custom animation systems.
Starting with UE 5.6, the new MetaHuman Creator system creates characters without the traditional `Face_AnimBP` asset. For these characters, the plugin provides a face Animation Blueprint located at:
Content/LipSyncData/LipSync_Face_AnimBP
This Animation Blueprint is located in the plugin's content folder and will be overwritten with each plugin update. To prevent losing your customizations, it's highly recommended to:
- Copy this asset to your project's Content folder (for example, to `YourProject/Content/MetaHumans/LipSync_Face_AnimBP`)
- Use your copied version in your character setup
- Make all your modifications to the copied version
This ensures your lip sync configurations will persist through plugin updates.
Using the Plugin's Face Animation Blueprint:
- Locate your MetaHuman Creator character's Blueprint class
- Open the character Blueprint and find the Face component
- Change the Anim Class property to the plugin's `LipSync_Face_AnimBP`
- Continue with Steps 2-4 to configure the Runtime MetaHuman Lip Sync functionality
Alternative Options:
- Use Legacy Instructions: You can still follow the UE 5.5 instructions above if you're working with legacy MetaHumans or prefer the traditional workflow
- Create Custom Animation Blueprint: Create your own Animation Blueprint that works with the MetaHuman Creator skeleton structure
Note: If you're using UE 5.6+ but working with legacy MetaHumans (not created through MetaHuman Creator), use the "UE 5.5 and Earlier" tab instructions instead.
Important: The Runtime MetaHuman Lip Sync blending can be implemented in any Animation Blueprint asset that has access to a pose containing the facial bones present in the default MetaHuman's `Face_Archetype_Skeleton`. You're not limited to the options above - these are just common implementation approaches.
Step 2: Event Graph setup
Open your Face Animation Blueprint and switch to the `Event Graph`. You'll need to create a generator that will process audio data and generate lip sync animation.
- Standard (Faster) Model
- Realistic (Higher Quality) Model
- Add the `Event Blueprint Begin Play` node if it doesn't exist already
- Add the `Create Runtime Viseme Generator` node and connect it to the Begin Play event
- Save the output as a variable (e.g., "VisemeGenerator") for use in other parts of the graph
- Add the `Event Blueprint Begin Play` node if it doesn't exist already
- Add the `Create Realistic MetaHuman Lip Sync Generator` node and connect it to the Begin Play event
- Save the output as a variable (e.g., "RealisticLipSyncGenerator") for use in other parts of the graph
- (Optional) Configure the generator settings using the Configuration parameter
- (Optional) Set the Processing Chunk Size on the Realistic MetaHuman Lip Sync Generator object
Note: The Realistic Model is optimized specifically for MetaHuman characters and is not compatible with custom character types.
Configuration Options
The `Create Realistic MetaHuman Lip Sync Generator` node accepts an optional Configuration parameter that allows you to customize the generator's behavior:
Model Type
The Model Type setting determines which version of the realistic model to use:
| Model Type | Performance | Visual Quality | Noise Handling | Recommended Use Cases |
|---|---|---|---|---|
| Highly Optimized (Default) | Highest performance, lowest CPU usage | Good quality | May show noticeable mouth movements with background noise or non-voice sounds | Clean audio environments, performance-critical scenarios |
| Optimized | Good performance, moderate CPU usage | High quality | Better stability with noisy audio | Balanced performance and quality, mixed audio conditions |
| Original Unoptimized | Suitable for real-time use on modern CPUs | Highest quality | Most stable with background noise and non-voice sounds | High-quality productions, noisy audio environments, when maximum accuracy is needed |
Performance Settings
Intra Op Threads: Controls the number of threads used for internal model processing operations.
- 0 (Default/Automatic): Uses automatic detection (typically 1/4 of available CPU cores, maximum 4)
- 1-16: Manually specify thread count. Higher values may improve performance on multi-core systems but use more CPU
Inter Op Threads: Controls the number of threads used for parallel execution of different model operations.
- 0 (Default/Automatic): Uses automatic detection (typically 1/8 of available CPU cores, maximum 2)
- 1-8: Manually specify thread count. Usually kept low for real-time processing
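For reference, here is a minimal standalone C++ sketch of the automatic-detection heuristic described above (roughly 1/4 of cores capped at 4 for Intra Op, 1/8 of cores capped at 2 for Inter Op). It mirrors the documented behavior, not the plugin's actual source:

```cpp
#include <algorithm>
#include <cstdio>
#include <thread>

int main()
{
    // Fall back to 1 if the core count cannot be detected.
    const unsigned Cores = std::max(1u, std::thread::hardware_concurrency());

    // Value 0 ("Automatic") resolves to roughly these counts per the description above.
    const unsigned IntraOpThreads = std::min(4u, std::max(1u, Cores / 4)); // ~1/4 of cores, max 4
    const unsigned InterOpThreads = std::min(2u, std::max(1u, Cores / 8)); // ~1/8 of cores, max 2

    std::printf("Cores: %u -> Intra Op Threads: %u, Inter Op Threads: %u\n",
                Cores, IntraOpThreads, InterOpThreads);
    return 0;
}
```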
Using Configuration
To configure the generator:
- In the `Create Realistic MetaHuman Lip Sync Generator` node, expand the Configuration parameter
- Set Model Type to your preferred option:
- Use Highly Optimized for best performance (recommended for most users)
- Use Optimized for balanced performance and quality
- Use Original Unoptimized only when maximum quality is essential
- Adjust Intra Op Threads and Inter Op Threads if needed (leave at 0 for automatic detection in most cases)
Performance Recommendations:
- For most projects with clean audio, use Highly Optimized for best performance
- If you're working with audio that contains background noise, music, or non-voice sounds, consider using Optimized or Original Unoptimized models for better stability
- The Highly Optimized model may show subtle mouth movements when processing non-voice audio due to optimization techniques applied during model creation
- The Original Unoptimized model, while requiring more CPU resources, is still suitable for real-time applications on modern hardware and provides the most accurate results with challenging audio conditions
- Only adjust thread counts if you're experiencing performance issues or have specific optimization requirements
- Higher thread counts don't always mean better performance - the optimal values depend on your specific hardware and project requirements
Processing Chunk Size Configuration: The Processing Chunk Size determines how many samples are processed in each inference step. The default value is 160 samples, which corresponds to 10ms of audio at 16kHz (the internal processing sample rate). You can adjust this value to balance between update frequency and CPU usage:
- Smaller values provide more frequent updates but increase CPU usage
- Larger values reduce CPU load but may decrease lip sync responsiveness
To set the Processing Chunk Size:
- Access your `Realistic MetaHuman Lip Sync Generator` object
- Locate the `Processing Chunk Size` property
- Set your desired value
It's recommended to use values that are multiples of 160, as this aligns with the model's internal processing structure. Recommended values include:
- `160` (default, minimum recommended)
- `320`
- `480`
- `640`
- etc.
The default Processing Chunk Size of `160` samples corresponds to 10 ms of audio at 16 kHz. Using multiples of 160 maintains alignment with this base unit, which can help optimize processing efficiency and maintain consistent behavior across different chunk sizes.
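As a quick reference, this standalone C++ sketch shows the chunk-size-to-update-interval arithmetic, assuming the 16 kHz internal processing rate stated above:

```cpp
#include <cstdio>

int main()
{
    // The Realistic generator processes audio at 16 kHz internally,
    // so a chunk of N samples covers N / 16000 seconds of audio.
    const int InternalSampleRate = 16000;
    const int ChunkSizes[] = {160, 320, 480, 640}; // recommended multiples of 160

    for (int Samples : ChunkSizes)
    {
        const double Milliseconds = 1000.0 * Samples / InternalSampleRate;
        std::printf("Processing Chunk Size %d -> one inference step every %.1f ms\n",
                    Samples, Milliseconds);
    }
    return 0;
}
```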
For reliable and consistent operation with the Realistic Model, it's required to recreate the Realistic MetaHuman Lip Sync Generator each time you want to feed new audio data after a period of inactivity. This is due to ONNX runtime behavior that can cause lip sync to stop working when reusing generators after periods of silence.
Example scenario: If you performed TTS lip sync and then stopped, and later want to perform lip sync again with new audio, create a new Realistic MetaHuman Lip Sync Generator instead of reusing the existing one.
Step 3: Set up audio input processing
You need to set up a method to process audio input. There are several ways to do this depending on your audio source.
- Microphone (Real-time)
- Microphone (Playback)
- Text-to-Speech (Local)
- Text-to-Speech (External APIs)
- From Audio File/Buffer
- Streaming Audio Buffer
This approach performs lip sync in real-time while speaking into the microphone:
- Standard (Faster) Model
- Realistic (Higher Quality) Model
- Create a Capturable Sound Wave using Runtime Audio Importer
- Before starting to capture audio, bind to the `OnPopulateAudioData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
- Start capturing audio from the microphone
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`. In each of the examples shown for the Standard Model, simply replace `VisemeGenerator` with your `RealisticLipSyncGenerator` variable - the function names and parameters remain identical between both models.
This approach captures audio from a microphone, then plays it back with lip sync:
- Standard (Faster) Model
- Realistic (Higher Quality) Model
- Create a Capturable Sound Wave using Runtime Audio Importer
- Start audio capture from the microphone
- Before playing back the capturable sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`. In each of the examples shown for the Standard Model, simply replace `VisemeGenerator` with your `RealisticLipSyncGenerator` variable - the function names and parameters remain identical between both models.
Note: If you want to process audio data in smaller chunks for more responsive lip sync, adjust the calculation in the `SetNumSamplesPerChunk` function. For example, dividing the sample rate by 150 (streaming every ~6.67 ms) instead of 100 (streaming every 10 ms) will provide more frequent lip sync updates.
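For illustration, here is a small standalone C++ sketch of that divisor arithmetic, assuming a hypothetical 44.1 kHz capture sample rate:

```cpp
#include <cstdio>

int main()
{
    // Hypothetical capture sample rate; the divisor controls how often
    // captured audio is streamed to the lip sync generator.
    const int SampleRate = 44100;

    const int SamplesPerChunkDefault = SampleRate / 100; // ~10 ms of audio per chunk
    const int SamplesPerChunkFaster  = SampleRate / 150; // ~6.67 ms of audio per chunk

    std::printf("SampleRate / 100 = %d samples (~%.2f ms per update)\n",
                SamplesPerChunkDefault, 1000.0 * SamplesPerChunkDefault / SampleRate);
    std::printf("SampleRate / 150 = %d samples (~%.2f ms per update)\n",
                SamplesPerChunkFaster, 1000.0 * SamplesPerChunkFaster / SampleRate);
    return 0;
}
```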
- Regular
- Streaming
This approach synthesizes speech from text and performs lip sync:
- Standard (Faster) Model
- Realistic (Higher Quality) Model
- Use Runtime Text To Speech to generate speech from text
- Use Runtime Audio Importer to import the synthesized audio
- Before playing back the imported sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
The local TTS provided by Runtime Text To Speech plugin is not currently supported with the Realistic model due to ONNX runtime conflicts. For text-to-speech with the Realistic model, consider using external TTS services (such as OpenAI or ElevenLabs via Runtime AI Chatbot Integrator) or use the Standard model instead.
Note: If you want to process audio data in smaller chunks for more responsive lip sync, adjust the calculation in the `SetNumSamplesPerChunk` function. For example, dividing the sample rate by 150 (streaming every ~6.67 ms) instead of 100 (streaming every 10 ms) will provide more frequent lip sync updates.
This approach uses streaming text-to-speech synthesis with real-time lip sync:
- Standard (Faster) Model
- Realistic (Higher Quality) Model
- Use Runtime Text To Speech to generate streaming speech from text
- Use Runtime Audio Importer to import the synthesized audio
- Before playing back the streaming sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
The local TTS provided by Runtime Text To Speech plugin is not currently supported with the Realistic model due to ONNX runtime conflicts. For text-to-speech with the Realistic model, consider using external TTS services (such as OpenAI or ElevenLabs via Runtime AI Chatbot Integrator) or use the Standard model instead.
Note: If you want to process audio data in smaller chunks for more responsive lip sync, adjust the calculation in the `SetNumSamplesPerChunk` function. For example, dividing the sample rate by 150 (streaming every ~6.67 ms) instead of 100 (streaming every 10 ms) will provide more frequent lip sync updates.
- Regular
- Streaming
This approach uses the Runtime AI Chatbot Integrator plugin to generate synthesized speech from AI services (OpenAI or ElevenLabs) and perform lip sync:
- Standard (Faster) Model
- Realistic (Higher Quality) Model
- Use Runtime AI Chatbot Integrator to generate speech from text using external APIs (OpenAI, ElevenLabs, etc.)
- Use Runtime Audio Importer to import the synthesized audio data
- Before playing back the imported sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`. In each of the examples shown for the Standard Model, simply replace `VisemeGenerator` with your `RealisticLipSyncGenerator` variable - the function names and parameters remain identical between both models.
Note: If you want to process audio data in smaller chunks for more responsive lip sync, adjust the calculation in the `SetNumSamplesPerChunk` function. For example, dividing the sample rate by 150 (streaming every ~6.67 ms) instead of 100 (streaming every 10 ms) will provide more frequent lip sync updates.
This approach uses the Runtime AI Chatbot Integrator plugin to generate synthesized streaming speech from AI services (OpenAI or ElevenLabs) and perform lip sync:
- Standard (Faster) Model
- Realistic (Higher Quality) Model
- Use Runtime AI Chatbot Integrator to connect to streaming TTS APIs (like ElevenLabs Streaming API)
- Use Runtime Audio Importer to import the synthesized audio data
- Before playing back the streaming sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`. In each of the examples shown for the Standard Model, simply replace `VisemeGenerator` with your `RealisticLipSyncGenerator` variable - the function names and parameters remain identical between both models.
Note: If you want to process audio data in smaller chunks for more responsive lip sync, adjust the calculation in the `SetNumSamplesPerChunk` function. For example, dividing the sample rate by 150 (streaming every ~6.67 ms) instead of 100 (streaming every 10 ms) will provide more frequent lip sync updates.
This approach uses pre-recorded audio files or audio buffers for lip sync:
- Standard (Faster) Model
- Realistic (Higher Quality) Model
- Use Runtime Audio Importer to import an audio file from disk or memory
- Before playing back the imported sound wave, bind to its `OnGeneratePCMData` delegate
- In the bound function, call `ProcessAudioData` from your Runtime Viseme Generator
- Play the imported sound wave and observe the lip sync animation
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`. In each of the examples shown for the Standard Model, simply replace `VisemeGenerator` with your `RealisticLipSyncGenerator` variable - the function names and parameters remain identical between both models.
Note: If you want to process audio data in smaller chunks for more responsive lip sync, adjust the calculation in the `SetNumSamplesPerChunk` function. For example, dividing the sample rate by 150 (streaming every ~6.67 ms) instead of 100 (streaming every 10 ms) will provide more frequent lip sync updates.
For streaming audio data from a buffer, you need:
- Standard (Faster) Model
- Realistic (Higher Quality) Model
- Audio data in float PCM format (an array of floating-point samples) available from your streaming source
- The sample rate and number of channels
- Call `ProcessAudioData` from your Runtime Viseme Generator with these parameters as audio chunks become available
Here's an example of processing lip sync from streaming audio data:
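As a rough, language-agnostic illustration of this data flow, here is a standalone C++ sketch. The stub type and the `ProcessAudioData` signature below are assumptions for illustration only, not the plugin's actual C++ API:

```cpp
#include <cstdio>
#include <functional>
#include <vector>

// Stand-in for the generator's ProcessAudioData call. The real node takes
// float PCM data plus the sample rate and channel count; this exact C++
// signature is assumed for illustration.
using FProcessAudioData =
    std::function<void(const std::vector<float>& PCMData, int SampleRate, int NumChannels)>;

// Forward each chunk from your streaming source to the generator as it arrives.
void OnAudioChunkReceived(const std::vector<float>& Chunk, int SampleRate, int NumChannels,
                          const FProcessAudioData& ProcessAudioData)
{
    if (!Chunk.empty())
    {
        ProcessAudioData(Chunk, SampleRate, NumChannels);
    }
}

int main()
{
    // Hypothetical hookup: print instead of driving the lip sync generator.
    const FProcessAudioData ProcessAudioData =
        [](const std::vector<float>& PCMData, int SampleRate, int NumChannels)
    {
        std::printf("Processed %zu samples (%d Hz, %d channel(s))\n",
                    PCMData.size(), SampleRate, NumChannels);
    };

    // Simulate a 10 ms mono chunk at 16 kHz arriving from the stream.
    const std::vector<float> Chunk(160, 0.0f);
    OnAudioChunkReceived(Chunk, 16000, 1, ProcessAudioData);
    return 0;
}
```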
Note: When using streaming audio sources, make sure to manage audio playback timing appropriately to avoid distorted playback. See the Streaming Sound Wave documentation for more information on proper streaming audio management.
The Realistic Model uses the same audio processing workflow as the Standard Model, but with the `RealisticLipSyncGenerator` variable instead of `VisemeGenerator`. In each of the examples shown for the Standard Model, simply replace `VisemeGenerator` with your `RealisticLipSyncGenerator` variable - the function names and parameters remain identical between both models.
Note: When using streaming audio sources, make sure to manage audio playback timing appropriately to avoid distorted playback. See the Streaming Sound Wave documentation for more information on proper streaming audio management.
Note: If you want to process audio data in smaller chunks for more responsive lip sync, adjust the calculation in the `SetNumSamplesPerChunk` function. For example, dividing the sample rate by 150 (streaming every ~6.67 ms) instead of 100 (streaming every 10 ms) will provide more frequent lip sync updates.
Step 4: Anim Graph setup
After setting up the Event Graph, switch to the `Anim Graph` to connect the generator to the character's animation:
Lip Sync
- Standard (Faster) Model
- Realistic (Higher Quality) Model
- Locate the pose that contains the MetaHuman face (typically from `Use cached pose 'Body Pose'`)
- Add the `Blend Runtime MetaHuman Lip Sync` node
- Connect the pose to the `Source Pose` of the `Blend Runtime MetaHuman Lip Sync` node
- Connect your `RuntimeVisemeGenerator` variable to the `Viseme Generator` pin
- Connect the output of the `Blend Runtime MetaHuman Lip Sync` node to the `Result` pin of the `Output Pose`
When lip sync is detected in the audio, your character will dynamically animate accordingly.
- Locate the pose that contains the MetaHuman face (typically from `Use cached pose 'Body Pose'`)
- Add the `Blend Realistic MetaHuman Lip Sync` node
- Connect the pose to the `Source Pose` of the `Blend Realistic MetaHuman Lip Sync` node
- Connect your `RealisticLipSyncGenerator` variable to the `Lip Sync Generator` pin
- Connect the output of the `Blend Realistic MetaHuman Lip Sync` node to the `Result` pin of the `Output Pose`
The Realistic Model provides enhanced visual quality with more natural mouth movements.
Note: The Realistic Model is designed exclusively for MetaHuman characters and is not compatible with custom character types.
Laughter Animation
You can also add laughter animations that will dynamically respond to laughter detected in the audio:
- Add the `Blend Runtime MetaHuman Laughter` node
- Connect your `RuntimeVisemeGenerator` variable to the `Viseme Generator` pin
- If you're already using lip sync:
  - Connect the output from the `Blend Runtime MetaHuman Lip Sync` node to the `Source Pose` of the `Blend Runtime MetaHuman Laughter` node
  - Connect the output of the `Blend Runtime MetaHuman Laughter` node to the `Result` pin of the `Output Pose`
- If using only laughter without lip sync:
  - Connect your source pose directly to the `Source Pose` of the `Blend Runtime MetaHuman Laughter` node
  - Connect the output to the `Result` pin
When laughter is detected in the audio, your character will dynamically animate accordingly.
Combining with Body Animations
To apply lip sync and laughter alongside existing body animations without overriding them:
- Add a `Layered blend per bone` node between your body animations and the final output. Make sure `Use Attached Parent` is true.
- Configure the layer setup:
  - Add 1 item to the `Layer Setup` array
  - Add 3 items to the `Branch Filters` for the layer, with the following `Bone Name`s:
    - `FACIAL_C_FacialRoot`
    - `FACIAL_C_Neck2Root`
    - `FACIAL_C_Neck1Root`
- Make the connections:
  - Existing animations (such as `BodyPose`) → `Base Pose` input
  - Facial animation output (from lip sync and/or laughter nodes) → `Blend Poses 0` input
  - Layered blend node → Final `Result` pose
Why this works: The branch filters isolate facial animation bones, allowing lip sync and laughter to blend exclusively with facial movements while preserving original body animations. This matches the MetaHuman facial rig structure, ensuring natural integration.
Note: The lip sync and laughter features are designed to work non-destructively with your existing animation setup. They only affect the specific facial bones needed for mouth movement, leaving other facial animations intact. This means you can safely integrate them at any point in your animation chain - either before other facial animations (allowing those animations to override lip sync/laughter) or after them (letting lip sync/laughter blend on top of your existing animations). This flexibility lets you combine lip sync and laughter with eye blinking, eyebrow movements, emotional expressions, and other facial animations without conflicts.
Configuration
Lip Sync Configuration
- Standard (Faster) Model
- Realistic (Higher Quality) Model
The `Blend Runtime MetaHuman Lip Sync` node has configuration options in its properties panel:
| Property | Default | Description |
|---|---|---|
| Interpolation Speed | 25 | Controls how quickly the lip movements transition between visemes. Higher values result in faster, more abrupt transitions. |
| Reset Time | 0.2 | The duration in seconds after which the lip sync is reset. This is useful to prevent the lip sync from continuing after the audio has stopped. |
The `Blend Realistic MetaHuman Lip Sync` node has configuration options in its properties panel:
| Property | Default | Description |
|---|---|---|
| Interpolation Speed | 30 | Controls how quickly the lip movements transition between positions. Higher values result in faster, more abrupt transitions. |
| Reset Time | 0.2 | The duration in seconds after which the lip sync is reset. This is useful to prevent the lip sync from continuing after the audio has stopped. |
Laughter Configuration
The `Blend Runtime MetaHuman Laughter` node has its own configuration options:
| Property | Default | Description |
|---|---|---|
| Interpolation Speed | 25 | Controls how quickly the lip movements transition between laughter animations. Higher values result in faster, more abrupt transitions. |
| Reset Time | 0.2 | The duration in seconds after which the laughter is reset. This is useful to prevent the laughter from continuing after the audio has stopped. |
| Max Laughter Weight | 0.7 | Scales the maximum intensity of the laughter animation (0.0 - 1.0). |
Choosing Between Lip Sync Models
When deciding which lip sync model to use for your project, consider these factors:
| Consideration | Standard Model | Realistic Model |
|---|---|---|
| Character Compatibility | MetaHumans and all custom character types | MetaHumans only |
| Visual Quality | Good lip sync with efficient performance | Enhanced realism with more natural mouth movements |
| Performance | Optimized for all platforms including mobile/VR | Slightly higher resource requirements |
| Use Cases | General applications, games, VR/AR, mobile | Cinematic experiences, close-up character interactions |
Engine Version Compatibility
If you're using Unreal Engine 5.2, the Realistic Model may not work correctly due to a bug in UE's resampling library. For UE 5.2 users who need reliable lip sync functionality, please use the Standard Model instead.
This issue is specific to UE 5.2 and does not affect other engine versions.
For most projects, the Standard Model provides an excellent balance of quality and performance while supporting the widest range of character types. The Realistic Model is ideal when you need the highest visual fidelity specifically for MetaHuman characters in contexts where performance overhead is less critical.