DreamID Omni logo

DreamID Omni: Unified Human Audio-Video AI

One framework for generation (R2AV), editing (RV2AV), and animation (RA2V). Built by Tsinghua University and ByteDance with Syn-RoPE identity binding—delivering lip-synced, identity-consistent videos for portraits, dubbing, and multi-person scenes.

Generate with DreamID Omni

Create Human-Centric Videos with DreamID Omni

Upload an image and an audio clip, then add a prompt. DreamID Omni—unified human audio-video AI—generates identity-consistent talking video driven by voice and timing (R2AV).

Image *

Click to upload an image with a human subject or character

Required. Use an image containing a human subject, face, or character.

Audio*

Click to upload speech or singing audio

Required. WAV or MP3, under 35 seconds. The audio length sets the video duration.

Prompt *


Required. Max 2000 characters. Describe the scene, movements, camera, etc. Supports Chinese, English, Japanese, Korean, Spanish, and Indonesian.

The backend requires an audio duration between 3 and 35 seconds; the video duration is set automatically from your audio length.
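These constraints can be checked before submission. Below is a minimal client-side sketch in Python; the `validate_request` helper and its field names are illustrative, not the actual DreamID Omni API schema:

```python
# Hypothetical pre-submission validation mirroring the form's constraints.
# Limits come from the page above; the function itself is not an official API.

MIN_AUDIO_SECONDS = 3
MAX_AUDIO_SECONDS = 35
MAX_PROMPT_CHARS = 2000
ALLOWED_AUDIO_EXTS = {".wav", ".mp3"}

def validate_request(image_path: str, audio_path: str,
                     audio_seconds: float, prompt: str) -> list[str]:
    """Return a list of validation errors (empty list means the request is OK)."""
    errors = []
    if not image_path:
        errors.append("Image is required.")
    ext = audio_path[audio_path.rfind("."):].lower() if "." in audio_path else ""
    if ext not in ALLOWED_AUDIO_EXTS:
        errors.append("Audio must be WAV or MP3.")
    if not (MIN_AUDIO_SECONDS <= audio_seconds <= MAX_AUDIO_SECONDS):
        errors.append("Audio duration must be between 3 and 35 seconds.")
    if not prompt:
        errors.append("Prompt is required.")
    elif len(prompt) > MAX_PROMPT_CHARS:
        errors.append("Prompt exceeds 2000 characters.")
    return errors
```

Running the check locally avoids a round trip to the backend for requests that would be rejected anyway.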

Requests are tied to your DreamID Omni account and will appear in your profile history once finished.

Preview

Demo clip · DreamID Omni
Core Capabilities

The Engine Behind DreamID Omni

Every layer is engineered for semantic consistency and audio-visual alignment.

Unified Omni-Framework demonstration

Unified Omni-Framework

End-to-End Pipeline · Cross-Modal Latents · Zero-Shot

A holistic backbone orchestrating R2AV (Generation), RV2AV (Editing), and RA2V (Animation). Eliminates the friction of stitching incompatible models.

// TECH SPECS: Shared Latent Space ensures character identity, motion trajectory, and audio semantics are intrinsically aligned.

Syn-RoPE Technology demonstration

Syn-RoPE Technology

Spatial-Temporal Binding · Identity Locking · Anti-Ambiguity

Proprietary rotary positional embeddings that solve referential ambiguity by rigidly binding identity tokens to specific spatial coordinates.

// TECH SPECS: Ensures Pixel-Perfect Identity Preservation, keeping faces and voices disentangled even in complex multi-subject scenes.
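Syn-RoPE itself is proprietary, but it builds on standard rotary positional embeddings (RoPE), which encode position by rotating pairs of feature dimensions. A minimal NumPy sketch of the vanilla mechanism, for intuition only; the identity-binding extension is not shown:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply a standard rotary positional embedding to one token vector.

    x   : feature vector with an even number of dimensions
    pos : integer position of the token in the sequence
    Each dimension pair is rotated by the angle pos * base**(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "feature dim must be even"
    half = d // 2
    # One rotation frequency per dimension pair.
    freqs = base ** (-np.arange(half) * 2.0 / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation applied pair-wise; rotations preserve the vector's norm.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Because position enters only through a rotation, relative offsets between tokens fall out of inner products, which is the property an identity-binding scheme can exploit to tie tokens to fixed spatial coordinates.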

Symmetric DiT Backbone demonstration

Symmetric DiT Backbone

Dual-Stream Diffusion · Micro-Expression · 4K Fidelity

Next-generation Dual-Stream Diffusion Transformer that performs bi-directional reasoning over audio and video signals simultaneously.

// TECH SPECS: Achieves Granular Lip-Sync, captures subtle micro-expressions, and maintains global illumination consistency.

Workflow Demonstration

From Audio to Video

Experience the DreamID Omni engine. Play the Source Audio to hear the raw input, then play the Generated Video to see the identity-consistent output.

GENERATED OUTPUT
Source Audio Input
Target Script

"Today he receives the silver star for bravery and valor."

Visual Context

Warm soft light, sub1 in black suit, white shirt. Serious, respectful tone.

GENERATED OUTPUT
Source Audio Input
Target Script

"Nice work. Tell DCA, get a fire team."

Visual Context

Dim industrial background. Middle-aged man, camo uniform, sweaty intense face.

GENERATED OUTPUT
Source Audio Input
Target Script

"Really bad guy, someone who might be threatening girls with scissors or a knife."

Visual Context

Long straight dark hair, grey shirt. Expression serious, concerned, focused.

GENERATED OUTPUT
Source Audio Input
Target Script

"Increasingly powerful bursts of aggression, uh, persecution, anxiety."

Visual Context

Dim room. Long blonde hair, looking down at screen. Furrowed brows, anxious.

GENERATED OUTPUT
Source Audio Input
Target Script

"About you. About how you're changing."

Visual Context

Outdoor blurred greenery. Man with shoulder-length hair, beige jacket. Intense gaze.

GENERATED OUTPUT
Source Audio Input
Target Script

"Cash. He was supposed to come back the next day for his shirt. But get this..."

Visual Context

Bright indoor. Light blue shirt, white tank top. Casual conversational tone.

Workflow Pipeline

From Static to Cinematic in 3 Steps

The DreamID Omni engine unifies generation, editing, and animation into a single, streamlined process.

01
DROP ASSETS HERE

Input Source Asset

Upload Portrait or Video

Start by uploading a single portrait image (for animation) or a source video (for editing).

R2AV Support · RV2AV Support
02
WAV / MP3

Driver Injection

Audio or Motion Reference

Upload your voice track (TTS/Recording). The engine extracts semantic motion and emotion cues.

Audio Semantics · Motion Driver
03
PROCESSING... 85%

Neural Rendering

Generate & Sync

The Symmetric DiT backbone fuses the source and driver, applying Syn-RoPE to lock identity and sync lips perfectly.

Syn-RoPE · 4K Rendering
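The three steps above can be sketched as a simple client-side flow. Everything below, the `DreamIDJob` class and its method names, is a hypothetical illustration of the sequence, not an actual SDK:

```python
from dataclasses import dataclass

@dataclass
class DreamIDJob:
    """Illustrative job record for the 3-step flow (not a real SDK type)."""
    source_asset: str = ""   # step 1: portrait image or source video
    driver: str = ""         # step 2: audio track or motion reference
    status: str = "empty"

    def input_source(self, path: str) -> "DreamIDJob":
        """Step 1: register the portrait or video asset."""
        self.source_asset = path
        self.status = "asset_loaded"
        return self

    def inject_driver(self, path: str) -> "DreamIDJob":
        """Step 2: attach the audio or motion driver."""
        if self.status != "asset_loaded":
            raise RuntimeError("upload a source asset first")
        self.driver = path
        self.status = "ready"
        return self

    def render(self) -> dict:
        """Step 3: in a real client this would call the backend; here we
        only report the planned fuse -> identity-lock -> lip-sync stages."""
        if self.status != "ready":
            raise RuntimeError("both asset and driver are required")
        self.status = "done"
        return {"asset": self.source_asset, "driver": self.driver,
                "stages": ["fuse", "identity_lock", "lip_sync"]}

job = DreamIDJob().input_source("portrait.png").inject_driver("voice.wav")
result = job.render()
```

The ordering constraint in the sketch mirrors the page: the driver cannot be injected before a source asset exists, and rendering requires both.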
Support & Details

DreamID Omni Frequently Asked Questions

Everything you need to know about the product and billing. Can’t find the answer you’re looking for? Chat with our team.

Still have questions?

We’re here to help you get started.

Contact Support

Ready to Revolutionize Video with DreamID Omni?

Explore the DreamID-Omni engine, test your own assets, and build the next generation of human-centric experiences.