Artificial intelligence (AI) is rapidly advancing, and within this dynamic landscape, video understanding stands out as a pivotal area of innovation. Meta AI, in collaboration with Stanford researchers, has unveiled Apollo, a family of video-centric Large Multimodal Models (LMMs). Designed to tackle the intrinsic complexities of video data, namely its temporal dynamics (how videos change over time) and its multi-modal nature (how different types of information work together), the Apollo family sets a new standard for joint video and language understanding. Let's take a closer look at these impressive Apollo models and the key ideas behind them.
The Challenges of Video Understanding
Unlike still images, videos are dynamic: they combine spatial appearance, change over time, and multiple interacting modalities. This means models and training recipes must be designed specifically to capture motion, the story a video is telling, and how different parts of the video relate to each other.
To solve these problems, researchers have been figuring out:
- Encoder Selection: Choosing vision encoders that best represent image and video inputs.
- Training Strategies: Multi-stage processes to balance pretraining and fine-tuning.
- Data Composition: Determining the right mix of modalities (text, images, and videos) for optimal learning.
- Efficient Scaling: Developing smaller models that match or outperform larger counterparts.
Apollo: A Big Step Forward in Video-Language Models
Building on these findings, the team designed Apollo for both precision and scalability. Here are some key innovations:
Encoder Architecture:
Apollo integrates the Qwen2.5 series of Large Language Models (LLMs) with SigLIP-SO400M and InternVideo2 encoders. By combining the strengths of video-specific and image-based encoders, Apollo excels in aligning text, images, and videos.
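To make this concrete, here is a minimal sketch of how features from an image encoder and a video encoder might be projected into a shared token space and concatenated per frame before being fed to the LLM. The module and dimension names are illustrative assumptions, not Apollo's actual implementation.

```python
import torch
import torch.nn as nn

class DualVisionEncoder(nn.Module):
    """Hypothetical sketch: fuse per-frame tokens from an image encoder
    (e.g. SigLIP-SO400M) with tokens from a video encoder (e.g. InternVideo2)
    and project both into the LLM embedding space."""

    def __init__(self, image_encoder, video_encoder, image_dim, video_dim, llm_dim):
        super().__init__()
        self.image_encoder = image_encoder
        self.video_encoder = video_encoder
        # Simple linear projections into a shared token space (an assumption).
        self.image_proj = nn.Linear(image_dim, llm_dim)
        self.video_proj = nn.Linear(video_dim, llm_dim)

    def forward(self, frames):
        # frames: (batch, num_frames, 3, H, W)
        b, t = frames.shape[:2]
        img_feats = self.image_encoder(frames.flatten(0, 1))       # (b*t, n_img, image_dim)
        img_feats = img_feats.view(b, t, -1, img_feats.shape[-1])  # (b, t, n_img, image_dim)
        vid_feats = self.video_encoder(frames)                     # (b, t, n_vid, video_dim)
        # Concatenate the two token streams along the token axis for each frame.
        tokens = torch.cat([self.image_proj(img_feats),
                            self.video_proj(vid_feats)], dim=2)
        return tokens                                              # (b, t, n_img + n_vid, llm_dim)
```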

Training Strategies:
Apollo was trained with a three-stage schedule that progressively freezes and unfreezes the encoders while drawing on diverse multimodal datasets. Fine-tuning then targets domain-specific and reasoning tasks, particularly in long-form video contexts.
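A staged schedule like this is commonly implemented by toggling which parameters receive gradients. The sketch below is a hedged illustration of that idea; the exact stage boundaries and which components are frozen at each stage are assumptions, not the paper's precise recipe.

```python
def set_trainable(module, trainable: bool):
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage: int):
    """Illustrative three-stage schedule (stage contents are assumptions):
    1) train only the vision-to-LLM connector, encoders and LLM frozen;
    2) unfreeze the vision encoders on mixed multimodal data;
    3) fine-tune everything, including the LLM, on instruction/reasoning data."""
    if stage == 1:
        set_trainable(model.vision_encoders, False)
        set_trainable(model.llm, False)
        set_trainable(model.connector, True)
    elif stage == 2:
        set_trainable(model.vision_encoders, True)
        set_trainable(model.llm, False)
        set_trainable(model.connector, True)
    else:  # stage 3
        set_trainable(model.vision_encoders, True)
        set_trainable(model.llm, True)
        set_trainable(model.connector, True)
```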
Optimal Data Mix:
Apollo’s data composition prioritizes 10-14% text data within a video-heavy mix. This balance prevents catastrophic forgetting while improving fine-grained vision-language alignment.
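In practice, such a mixture can be enforced by weighted sampling across modality-specific datasets. The weights below are illustrative assumptions beyond the reported ~10-14% text share in a video-heavy mix.

```python
import random

# Illustrative mixture weights (only the ~10-14% text share is from the report;
# the image/video split shown here is an assumption).
MIXTURE = {
    "text":  0.12,   # pure-text data to prevent catastrophic forgetting
    "image": 0.28,   # image-text pairs for fine-grained alignment
    "video": 0.60,   # video-text data dominates the mix
}

def sample_batch(datasets, batch_size):
    """Draw a batch whose modality proportions follow MIXTURE.
    `datasets` maps a modality name to a list of examples."""
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    batch = []
    for _ in range(batch_size):
        modality = random.choices(names, weights=weights, k=1)[0]
        batch.append(random.choice(datasets[modality]))
    return batch
```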
Token Resampling:
A Perceiver Resampler compresses the encoder outputs into a fixed 32 tokens per frame, optimizing computational efficiency without sacrificing detail.
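The core idea of a Perceiver-style resampler is a fixed set of learned latent queries that cross-attend to a variable number of encoder tokens. The following is a minimal sketch of that mechanism; the layer count, head count, and dimensions are assumptions rather than Apollo's exact configuration.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch: 32 learned latent queries cross-attend to the per-frame
    encoder tokens, so every frame is reduced to a fixed 32 tokens.
    Head count and dimensions here are assumptions, not Apollo's config."""

    def __init__(self, dim, num_latents=32, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_input_tokens, dim) for a single frame
        b = frame_tokens.shape[0]
        queries = self.latents.unsqueeze(0).expand(b, -1, -1)    # (b, 32, dim)
        out, _ = self.attn(queries, frame_tokens, frame_tokens)  # cross-attention
        return self.norm(out)                                    # (b, 32, dim)

# Example: reduce 729 patch tokens per frame to 32 tokens per frame.
resampler = PerceiverResampler(dim=1152)
frame = torch.randn(2, 729, 1152)
print(resampler(frame).shape)  # torch.Size([2, 32, 1152])
```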
Unparalleled Performance Across Benchmarks
The Apollo family’s performance is nothing short of remarkable. It has outperformed many models two to three times its size across key video-language understanding benchmarks, including TempCompass, MLVU, Perception-Test, and ApolloBench.
Key Results:
- Apollo-1.5B: Surpasses larger models like LongVA-7B and Phi-3.5-Vision, demonstrating efficiency in smaller-scale models.
- Apollo-3B: Competes with cutting-edge 7B models like Oryx-7B and Video-XL, achieving impressive scores across multiple benchmarks.
- Apollo-7B: Sets a new standard by outperforming many 30B+ models, including Oryx-34B and VILA1.5-40B, showcasing the efficacy of its design.

For instance, on the MLVU benchmark, Apollo-7B scored 70.9, narrowly surpassing the 34B-parameter Oryx model’s score of 70.8.
The Importance of Data Composition
One of the standout insights from the Apollo project is the critical role of data composition in training video-LMMs.
- Text data: Including 10-14% text data preserves language capabilities and avoids overfitting to visual tasks.
- Video-heavy mixtures: Outperform image-heavy mixtures, providing richer context and better alignment for video understanding.

ApolloBench: A Streamlined Benchmark
Recognizing the inefficiencies in existing evaluations, the researchers introduced ApolloBench—a curated benchmark suite optimized for speed and relevance. It reduces evaluation time by 41× while maintaining strong correlations with traditional benchmarks.

Implications for AI Research
Apollo’s success has broader implications:
- Efficiency Over Scale: The results emphasize that smarter design and training can outperform brute-force scaling, reducing computational costs.
- Democratizing Video Understanding: Smaller models like Apollo-3B provide accessible tools for researchers and developers, accelerating innovation in the field.
- Unified Evaluation: The introduction of ApolloBench standardizes evaluation, enabling more consistent benchmarking for video-LMMs.
Conclusion:
The Apollo models are more than just a technical achievement. They are changing how the AI world thinks about understanding videos. By focusing on efficiency, using the right mix of information, and having a smart design, Apollo is pushing the limits of what’s possible in AI for videos and language.
As this field keeps growing, Apollo is an important milestone and a guide for future research. With its excellent performance and ability to be scaled down, Apollo has the potential to make video-LMM research more accessible and speed up progress in AI that understands videos.
Explore the future of AI video understanding with Apollo, a model family built to be both high-performing and accessible.