Short Info: Alibaba built a new AI system called EMO (Emote Portrait Alive). It can make pictures of people talk and sing with remarkable expressiveness, even animating famous figures from the past! First showcased in early 2024, EMO remains a significant example of AI-driven video generation, particularly for animating portraits. Want to learn more? Read this article to the end!
Updated: May 15, 2025
Artificial intelligence (AI) is changing how videos are made. Alibaba’s video generator EMO (Emote Portrait Alive), introduced by its Institute for Intelligent Computing, is a striking example: it can turn a single picture into a moving video in which the person talks and sings with impressive realism. This specialized capability garnered significant attention, even when compared to broader AI video models like OpenAI’s Sora. People remain curious about how this technology will affect industries like entertainment, and about the ethical questions it raises. Check out the official research paper.
Alibaba’s EMO: Making Waves in AI Video
Alibaba’s Institute for Intelligent Computing introduced EMO, an incredibly advanced AI video maker. It can turn static face pictures into lively, animated characters. To demonstrate its capabilities, Alibaba shared demos where EMO animates various portraits, including making the famous “Sora lady” (a character featured in OpenAI’s Sora demos) sing Dua Lipa’s “Don’t Start Now.” But it doesn’t stop there: EMO can even animate historical figures like Audrey Hepburn, making them appear to say lines from modern viral videos or sing contemporary songs.
How EMO Works
What makes EMO special is how it syncs lip movements with audio while accurately capturing subtle emotions. It learns realistic facial expressions from a vast collection of video and audio data (over 250 hours). Unlike some older face-swapping technologies, EMO does not rely on intermediate 3D models: it generates lifelike video directly through a diffusion-based method, using attention mechanisms to draw details from the source image and the audio so that the generated character stays consistent with the original identity. A rough sketch of this conditioning pattern appears below.
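To make that concrete, here is a minimal PyTorch sketch of a single denoising step in which a noisy video latent cross-attends first to tokens from the reference portrait (to preserve identity) and then to audio tokens (to drive lips and expression). The module names, tensor shapes, and overall wiring are illustrative assumptions for exposition, not Alibaba’s published implementation.

```python
import torch
import torch.nn as nn

class AudioDrivenDenoiser(nn.Module):
    """Toy denoiser: one diffusion step conditioned on a portrait and audio."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Cross-attention over tokens from the source portrait (identity cues).
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention over speech features (lip and expression cues).
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, noisy_latent, ref_tokens, audio_tokens):
        # Attend to the reference image so the generated face stays consistent.
        x, _ = self.ref_attn(noisy_latent, ref_tokens, ref_tokens)
        # Attend to the audio so mouth and expression follow the soundtrack.
        x, _ = self.audio_attn(x, audio_tokens, audio_tokens)
        return self.proj(x)  # predicted noise for this denoising step

# Toy tensors: 1 clip, 64 latent tokens, 16 reference tokens, 50 audio frames.
denoiser = AudioDrivenDenoiser()
noise_pred = denoiser(
    torch.randn(1, 64, 256),   # noisy video latent
    torch.randn(1, 16, 256),   # encoded source portrait (stand-in)
    torch.randn(1, 50, 256),   # encoded speech features (stand-in)
)
print(noise_pred.shape)  # torch.Size([1, 64, 256])
```

In a real system, the reference tokens would come from encoding the source photo and the audio tokens from a pretrained speech encoder; random tensors stand in for both here.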

EMO vs. Sora: A Quick Comparison
| Feature | EMO (Alibaba) | Sora (OpenAI) | Notes |
| --- | --- | --- | --- |
| Type of AI | Expressive audio-driven portrait video generator | Text-to-video model for generating diverse scenes and characters | EMO focuses on animating existing single portrait images with audio; Sora creates entire video sequences from text prompts. |
| Key Capabilities | Realistic facial animation from a single image, expressive lip-sync, multilingual audio support, various artistic styles | Generating detailed and imaginative video scenes, characters, and motion based on textual descriptions | EMO excels at detailed face-focused videos from specific inputs; Sora offers broader creative capabilities for generating original video content. |
| Limitations (as of initial demonstrations) | May struggle with extreme emotions or very complex occlusions; primarily focused on the head and upper body | Can sometimes generate physically implausible scenarios; consistency over longer durations can be challenging; access is limited | The field is rapidly evolving; capabilities and limitations change with new developments. |
| Potential Use Cases | Entertainment, historical reenactments, personalized avatars, content creation | Filmmaking, artistic expression, virtual production, game development, immersive experiences | Both technologies offer unique and powerful possibilities in their respective domains. |
It’s also worth noting that the AI video generation landscape is incredibly dynamic, with other models from companies like Google (e.g., VLOGGER) and Microsoft (e.g., VASA-1) also demonstrating impressive capabilities in animating portraits or generating video content, each with unique strengths.
EMO’s Impressive Expressiveness
EMO doesn’t just do lip-syncing. It captures the small changes in expression during pauses, like a quick glance downward or pursed lips. This attention to detail makes the AI-made videos feel remarkably human, preserving the identity of the person in the photo while imbuing them with lifelike emotion tied to the audio input.

Questions and Considerations
EMO’s impressive abilities, like those of other advanced generative AI models, raise important concerns about how such technology could be misused (e.g., to create sophisticated deepfakes or spread misinformation) and about what it means for actors and content creators in terms of intellectual property and authenticity. The ease with which realistic talking heads can be generated from a single image underscores the need for ongoing ethical discussion and robust detection mechanisms. At the same time, the technology opens up exciting opportunities: fresh kinds of video entertainment, personalized digital assistants, and new ways to reimagine history or create novel artistic expressions.
As one viral post put it:

“EMO by Alibaba is insane. It’s not just AI lip sync, it expressively shows emotions, head motions, facial expressions, and even earring movements! Can you trust what you see after watching these AI videos? 5 crazy examples: 1. AI Lady from Sora with OpenAI’s Mira Murati voice” (Min Choi, @minchoi, February 29, 2024)
Conclusion
Alibaba’s EMO video generator represents a significant leap in expressive AI video creation from a single image and an audio track. Its ability to bring static images to life with accurate emotion and lip-sync is a compelling demonstration of AI’s evolving potential in storytelling, virtual persona creation, and entertainment. As this technology and others like it continue to evolve rapidly, we can expect even more impressive and potentially disruptive applications, making the ongoing dialogue about responsible development and deployment increasingly crucial.