In the world of artificial intelligence, video generation has become one of the most exciting, competitive, and challenging areas, with several popular AI video generators vying to outdo one another. Now, Alibaba’s Qwen team has open-sourced its Wan2.1 AI video models for free, allowing everyone to join the race.
The Wan2.1 models bring high-quality video generation to everyday users, letting people create impressive videos from text or images that rival most other AI tools, without needing expensive hardware.
Let’s get into the details behind Alibaba’s incredible innovation.
Alibaba’s Qwen Releases Wan2.1 AI Video Generators
The Wan2.1 AI Models Family
Alibaba has released several versions of Wan2.1, each optimized for specific tasks:
1. Wan2.1-T2V-1.3B
A lightweight Text-to-Video model requiring only 8.19 GB of VRAM.
2. Wan2.1-I2V-14B-480P
A powerful Image-to-Video model for 480P resolution.
3. Wan2.1-I2V-14B-720P
A high-resolution Image-to-Video model for 720P output.
4. Wan2.1-T2V-14B
The main Text-to-Video model supporting both 480P and 720P resolutions.
These models can be downloaded from both Hugging Face and ModelScope repositories.
Example Videos Generated by Wan2.1
Key Features of Alibaba Wan2.1
Wan2.1 AI video models outperform many existing models, both open-source and commercial, in quality and efficiency. The best part? The lightweight 1.3B model runs on consumer-grade GPUs, meaning you don’t need a supercomputer to use it.
1. Competitive Video Generation
Alibaba tested Wan2.1 using 1,035 prompts across 14 major categories and 26 subcategories. The results? Wan2.1 consistently outperformed or matched both open- and closed-source competitors, making it one of the best video generation models available.
2. Runs on Regular GPUs
Unlike many AI video tools that need expensive setups, Wan2.1’s T2V-1.3B model requires only 8.19 GB of VRAM. This means it can run on common gaming GPUs, like an RTX 4090, generating a 5-second 480P video in about 4 minutes. That’s a game-changer for creators who don’t have access to high-end hardware.
3. Multi-Purpose Capabilities
Wan2.1 isn’t just for making videos from text. It can handle multiple tasks, including video editing, Image-to-Video (I2V), Text-to-Image (T2I), and Video-to-Audio generation.
4. Visual Text Generation
One standout feature is Wan2.1’s ability to create videos with readable text in both Chinese and English. That’s a big deal for anyone making instructional videos, animated storytelling, or AI-assisted filmmaking.
How Alibaba Wan2.1 Works
Wan2.1 achieves its impressive results thanks to advanced AI architecture and smart data processing.
1. 3D Variational Autoencoders (VAEs)
At the core of Wan2.1 is a new 3D causal VAE architecture called Wan-VAE, specifically designed for video generation. This approach combines multiple strategies to improve video compression, reduce memory use, and maintain consistency over time.
What makes Wan-VAE especially powerful is its ability to process unlimited-length 1080P videos without losing information about what happened earlier. This makes it excellent for video generation tasks that need to stay consistent from beginning to end.
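The “causal” part of the design can be shown with a toy example: a causal temporal filter pads only on the past side, so each output frame depends solely on the current and earlier frames, which is what lets a model process arbitrarily long sequences without peeking ahead. This is a hand-rolled illustration of the general idea in plain Python, not Wan-VAE’s actual implementation.

```python
# Toy illustration of causal temporal filtering (not Wan-VAE's real code):
# each output frame is a weighted sum of the current and PREVIOUS frames only,
# so frame t never depends on frames that come after it.

def causal_filter(frames, kernel):
    """Apply a 1-D causal convolution along the time axis.

    frames: list of scalar per-frame features
    kernel: weights [w_now, w_prev1, w_prev2, ...] (most recent first)
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)  # pad the past side only
    out = []
    for t in range(len(frames)):
        window = padded[t : t + k]  # frames t-k+1 .. t, oldest first
        out.append(sum(w * x for w, x in zip(reversed(kernel), window)))
    return out

frames = [1.0, 2.0, 3.0, 4.0]
# Averaging each frame with its predecessor:
print(causal_filter(frames, [0.5, 0.5]))  # → [0.5, 1.5, 2.5, 3.5]
```

Because later frames never influence earlier outputs, a streaming encoder built this way can run over unbounded video length while keeping earlier results fixed.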
2. Video Diffusion Transformers (DiT)
Wan2.1 is built on the Flow Matching framework using a Diffusion Transformer setup. It employs an advanced T5 encoder for multilingual text input, ensuring that text prompts are accurately transformed into visuals.
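The core idea of Flow Matching is simple to state: interpolate on a straight line between a noise sample and a data sample, and train the model to predict the constant velocity that carries one to the other. The sketch below is a toy, scalar-valued illustration in plain Python; the function name and setup are invented for illustration and are not Wan2.1’s training code.

```python
import random

# Toy flow-matching training pair (illustrative only, not Wan2.1's code).
# A point on the straight path from noise x0 to data x1 at time t in [0, 1]:
#     x_t = (1 - t) * x0 + t * x1
# The regression target for the network is the constant velocity x1 - x0.

def flow_matching_pair(x0, x1, t):
    """Return (x_t, target velocity) for one training example."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

x0 = [random.gauss(0, 1) for _ in range(4)]  # "noise" vector
x1 = [1.0, 2.0, 3.0, 4.0]                    # "data" (e.g. a latent video patch)
x_t, v = flow_matching_pair(x0, x1, t=0.5)
# In training, the DiT would be asked to predict v given (x_t, t, text embedding).
```

At inference, generation runs the learned velocity field in reverse: start from noise and integrate the predicted velocities from t = 0 to t = 1 to arrive at a data sample.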
Performance Evaluation Across Different GPUs
Wan2.1 is designed to be efficient across different hardware setups. Here’s how it performs:
1. For the T2V-1.3B model
At 480P on an RTX 4090:
- Single GPU: 261.4 seconds (8.19 GB peak memory)
- 8 GPUs: 112.3 seconds (12.2 GB peak memory)
2. For the I2V and T2V 14B models
At 720P on high-end GPUs (H800/H100):
- Single GPU: 1837.9 seconds (69.1 GB peak memory)
- 8 GPUs: 287.9 seconds (29.9 GB peak memory)
These results show that while high-end hardware speeds things up, the model is still practical on standard GPUs.
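A quick sanity check on the scaling behavior, using the figures quoted above (plain Python arithmetic, nothing model-specific):

```python
# Multi-GPU speedup and parallel efficiency from the benchmark figures above.

def scaling(t_single, t_multi, n_gpus):
    speedup = t_single / t_multi
    efficiency = speedup / n_gpus
    return speedup, efficiency

# T2V-1.3B at 480P (RTX 4090): 261.4 s on 1 GPU vs 112.3 s on 8 GPUs
s, e = scaling(261.4, 112.3, 8)
print(f"1.3B: {s:.2f}x speedup, {e:.0%} efficiency")  # → 2.33x, 29%

# 14B at 720P (H800/H100): 1837.9 s on 1 GPU vs 287.9 s on 8 GPUs
s, e = scaling(1837.9, 287.9, 8)
print(f"14B:  {s:.2f}x speedup, {e:.0%} efficiency")  # → 6.38x, 80%
```

The larger 14B model scales far better across 8 GPUs (~80% parallel efficiency versus ~29% for the 1.3B model), which makes sense: the small model finishes so quickly that communication overhead dominates.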
Try Wan2.1 on Hugging Face
Before downloading the model, you can test Wan2.1 directly through the Hugging Face Space provided by Alibaba. This lets you experiment with both Text-to-Video and Image-to-Video capabilities without installing anything.
You can either enter a text prompt to generate a video or upload an image to animate it, and you can adjust the available generation settings before running.
Try it here: https://huggingface.co/spaces/Wan-AI/Wan2.1
How to Set Up Wan2.1 Locally
1. Clone the Repository and Install Dependencies
To install Wan2.1, first clone the repository and install dependencies:
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
pip install -r requirements.txt
Make sure you have PyTorch 2.4.0 or later.
2. Download the Models
Using Hugging Face CLI:
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
Using ModelScope CLI:
pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-14B --local_dir ./Wan2.1-T2V-14B
3. Running Text-to-Video
For T2V-14B:
python generate.py --task t2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-T2V-14B --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
For T2V-1.3B:
python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
4. Running Image-to-Video
Running Image-to-Video generation with Wan2.1 is similar to Text-to-Video, except that you also pass the input image via the --image flag, and you can run it with or without prompt extension. Here’s how to set it up:
Without Prompt Extension (using the sample image shipped in the repository; replace it with your own):
python generate.py --task i2v-14B --size 832*480 --ckpt_dir ./Wan2.1-I2V-14B-480P --image examples/i2v_input.JPG --prompt "Describe the video you want to generate from the image."
With Prompt Extension:
You can use the Dashscope API or a local model for this.
DASH_API_KEY=your_key python generate.py --task i2v-14B --size 1280*720 --ckpt_dir ./Wan2.1-I2V-14B-720P --image examples/i2v_input.JPG --prompt "Describe the video you want to generate from the image." --use_prompt_extend --prompt_extend_method 'dashscope' --prompt_extend_target_lang 'ch'
Gradio Demo
For those who prefer a graphical interface, Wan2.1 provides a Gradio-based web UI:
cd gradio
DASH_API_KEY=your_key python t2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir ./Wan2.1-T2V-14B
Real-World Testing of Wan2.1
YouTuber Bijan Bowen tested the tool and compared it with other models like OpenAI’s Sora and CogVideoX. He followed the setup steps from GitHub, created a dedicated environment for the software, and downloaded the model (about 16-18 GB). The whole process was straightforward.
He then tested the smallest model, T2V-1.3B; generation took about 6-7 minutes on his RTX 3090 Ti graphics card. Even with this lightweight model, the results were impressive compared to other AI video tools like Sora and CogVideoX.
The videos showed a good understanding of his prompts, smooth movement, and realistic details like flowing hair and clothing. He called Wan2.1 “fantastic” and said it was “definitely very much comparable to a lot of state-of-the-art models” despite being smaller than many competitors.
Wrapping Up
Alibaba’s Qwen Wan2.1 joins the AI video generation race by offering a powerful, open-source alternative to commercial models. Its ability to run on consumer hardware, support multiple video tasks, and generate high-quality results makes it a game-changer.
Whether you’re a developer, content creator, or researcher, this tool opens the door to endless creative possibilities. Now, with this tool, anyone with a decent computer can generate high-quality videos. Have you tested Wan2.1? Which model did you try? Let us know your experience in the comments below!