Character animation has come a long way, but producing realistic character videos remains a challenge. Recent 2D models excel at image-guided synthesis, yet they offer limited control and struggle with complex 3D motions and object interactions. Alibaba aims to change this with its new framework, MIMO.
What is MIMO?
Alibaba’s Institute for Intelligent Computing has recently introduced MIMO, a generalizable model that enables controllable character video synthesis through spatial decomposed modelling. MIMO stands for Mimicking Motion Object. Users simply provide inputs such as a character image, a motion sequence, and a scene, and the model synthesizes a realistic video.
Demo Video of Alibaba MIMO
Key Features of Alibaba MIMO
Some key highlights of MIMO include:
1. Flexible User Control
MIMO allows users to control character, motion, and scene attributes by simply providing inputs like a character image, pose sequence, and scene video/image (see the illustrative sketch after this list).
2. Scalability
The model can synthesize videos for arbitrary characters by just using a single reference image as input.
3. Motion Generality
It achieves high generality for novel 3D motions, including those extracted from in-the-wild videos.
4. Scene Applicability
MIMO is effective at producing animations within complex, real-world scenes featuring object interactions.
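To make these inputs and outputs concrete, here is a purely hypothetical interface sketch in Python. MIMO has no public API in this form, so the class name, function name, and array shapes below are assumptions for illustration only, not the authors' code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MIMOInputs:
    # Single reference image of the character, (H, W, 3) RGB.
    character_image: np.ndarray
    # Driving motion as a 3D pose sequence, (T, num_joints, 3).
    pose_sequence: np.ndarray
    # Target scene: a single image (H, W, 3) or a video (T, H, W, 3).
    scene: np.ndarray

def synthesize_video(inputs: MIMOInputs) -> np.ndarray:
    """Placeholder signature: returns a (T, H, W, 3) video of the character
    performing the pose sequence inside the given scene."""
    raise NotImplementedError("Illustrative signature only; not MIMO's real API.")
```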
Example Videos Generated by Alibaba MIMO
Core Concept Behind MIMO
The core idea behind MIMO is spatial decomposed modelling. Unlike previous 2D techniques, it encodes video inputs in a 3D-aware manner by decomposing them into spatial components. Specifically, each frame is separated into three layers based on depth: the main human, the underlying scene, and floating occlusions.
The human layer is further disentangled into identity and motion codes using canonical appearance transfer and structured body codes, respectively. A shared VAE encoder embeds the scene and occlusion layers into a full scene code. These latent codes then control synthesis via a diffusion-based decoder. This 3D-aware approach enables flexible control and handling of challenging scenarios.
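For intuition, here is a minimal, hypothetical sketch of the depth-based layer separation and the shared encoding of the scene and occlusion layers. The masking heuristic, the toy encoder, and all names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SceneVAEEncoder(nn.Module):
    """Toy stand-in for the shared encoder applied to scene and occlusion layers."""
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def decompose_frame(frame, depth, human_mask, occlusion_margin=0.1):
    """Split one frame into human / scene / occlusion layers using depth.

    frame: (3, H, W) RGB, depth: (H, W) with smaller = closer to the camera,
    human_mask: (H, W) bool mask of the main human.
    """
    # Pixels clearly in front of the human (and not the human) are treated as
    # floating occlusions; everything else belongs to the underlying scene.
    human_depth = depth[human_mask].median() if human_mask.any() else depth.max()
    occlusion_mask = (~human_mask) & (depth < human_depth - occlusion_margin)
    scene_mask = ~(human_mask | occlusion_mask)

    human_layer = frame * human_mask
    occlusion_layer = frame * occlusion_mask
    scene_layer = frame * scene_mask
    return human_layer, scene_layer, occlusion_layer

# Toy usage with random data.
frame = torch.rand(3, 64, 64)
depth = torch.rand(64, 64)
human_mask = torch.zeros(64, 64, dtype=torch.bool)
human_mask[16:48, 16:48] = True

human, scene, occlusion = decompose_frame(frame, depth, human_mask)
scene_code = SceneVAEEncoder()(torch.stack([scene, occlusion]))  # one shared encoder
```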
How Alibaba MIMO Works
When a user provides inputs, MIMO embeds them into the corresponding latent codes. During training, it likewise spatially decomposes driving videos into these codes, which are fed into a diffusion-based decoder that reconstructs the video. The framework is trained jointly to minimize the noise-prediction error.
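The noise-prediction objective is the standard denoising-diffusion loss: the conditioned decoder learns to predict the Gaussian noise added to the video latents. The sketch below shows that generic epsilon-prediction loss under an assumed linear beta schedule; the dummy denoiser is a placeholder, not MIMO's actual decoder.

```python
import torch
import torch.nn.functional as F

def noise_prediction_loss(denoiser, video_latents, condition_codes, num_steps=1000):
    """Generic epsilon-prediction diffusion loss (illustrative, not MIMO's code).

    video_latents:   (B, C, T, H, W) latents of the target video clip.
    condition_codes: identity / motion / scene codes that condition the decoder.
    """
    b = video_latents.shape[0]
    device = video_latents.device

    # Sample a random diffusion timestep per clip.
    t = torch.randint(0, num_steps, (b,), device=device)

    # Assumed linear beta schedule -> cumulative alpha_bar at the sampled steps.
    betas = torch.linspace(1e-4, 0.02, num_steps, device=device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(b, 1, 1, 1, 1)

    # Forward process: corrupt the clean latents with Gaussian noise.
    noise = torch.randn_like(video_latents)
    noisy = alpha_bar.sqrt() * video_latents + (1.0 - alpha_bar).sqrt() * noise

    # The conditioned decoder predicts the added noise; the loss is plain MSE.
    predicted_noise = denoiser(noisy, t, condition_codes)
    return F.mse_loss(predicted_noise, noise)

# Toy run with a dummy denoiser that ignores its conditioning.
dummy_denoiser = lambda x, t, codes: torch.zeros_like(x)
loss = noise_prediction_loss(dummy_denoiser, torch.randn(2, 4, 8, 16, 16), condition_codes=None)
```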
Performance Evaluation of Alibaba MIMO
Alibaba researchers demonstrate MIMO’s abilities through various character video synthesis examples controlled by different attributes:
1. Arbitrary Character Control
The model animates diverse human and cartoon characters given just a single reference image.
2. Novel 3D Motion Control
The model faithfully mimics complex motions from large databases and in-the-wild videos.
3. Interactive Scene Control
It easily inserts characters into complicated real-world scenes with natural object interactions.
MIMO also outperforms prior 2D and 3D methods on tasks such as in-video character replacement, validating the effectiveness of its unified framework.
Potential Applications of Alibaba MIMO
MIMO opens doors for a wide range of applications in film, VR, content creation, e-commerce personalization, graphics and character animation. It could significantly lower video production costs and make animation accessible to all.
Future Work
Looking ahead, the researchers plan to enhance MIMO’s realism and dynamism. Collecting larger, more diverse training datasets could help improve the model, and exploring additional controllable attributes, such as facial expressions, is another promising direction. For more technical details, please see the arXiv paper.