Mistral AI recently unveiled its new Mixture-of-Experts model, Mixtral 8x22B, one of the most powerful open-source AI models to date with 141 billion parameters. To run this massive model, the Max variant of Apple's latest M3 chip lineup has gained attention for its robust specifications. In this article, we explore running Mixtral 8x22B on an M3 Max to understand its performance capabilities.
Mixtral 8x22B Model by Mistral AI
Developed by Mistral AI, Mixtral 8x22B is one of the largest openly available language models to date, with roughly 141 billion total parameters. It uses a Mixture-of-Experts architecture with eight expert feed-forward networks, of which a router selects two per token, so only a fraction of the parameters (around 39 billion) is active for any given token. The model is available on Hugging Face at https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1-4bit for anyone to use, and it supports a 64K-token context window. Given its enormous size, traditional GPUs struggle to fit it fully in memory.
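To make the routing concrete, here is a minimal, illustrative sketch of top-2 expert routing in plain Python with NumPy. This is not Mixtral's actual implementation; the layer sizes, the `gate` weights, and the `experts` list are made-up stand-ins chosen only to show the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 8, 2                # toy sizes; Mixtral uses 8 experts, top-2
gate = rng.standard_normal((d_model, n_experts))   # router weights (hypothetical)
experts = [rng.standard_normal((d_model, d_model)) # each "expert" is a toy linear map
           for _ in range(n_experts)]

def moe_layer(x):
    """Route a token vector x through its top-2 experts, weighted by the router."""
    logits = x @ gate                       # one router score per expert
    top = np.argsort(logits)[-top_k:]       # indices of the two highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                # softmax over the selected experts only
    # Only the chosen experts run; the other six are skipped entirely.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)               # (8,) -- same shape as the input
```

Because only two of the eight experts run per token, the compute per token is closer to that of a much smaller dense model, even though all of the weights must still reside in memory. That is exactly why memory capacity, rather than raw compute, is the gating factor here.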
Apple's New M3 Max Chip
Apple's latest M3 Max chip has several capabilities that position it well for such computationally intensive tasks. It features a 40-core GPU that Apple says is up to 50% faster than the M1 Max's, along with a powerful 16-core CPU. Critically, it supports up to 128GB of unified memory, shared between the CPU and GPU, allowing developers to load truly massive models. With 92 billion transistors, the M3 Max is well equipped for demanding AI workloads, which holds promise for running gigantic models like Mixtral 8x22B directly on a laptop.
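A quick back-of-the-envelope calculation shows why 128GB of unified memory matters. The figures below use the parameter count and quantization level from this article; the small overhead factor is an assumption.

```python
params = 141e9          # total parameters (Mixtral 8x22B)
bits_per_weight = 4     # 4-bit quantization, as in the demo below
overhead = 1.1          # assumed ~10% extra for embeddings at higher precision,
                        # KV cache, and runtime buffers (rough guess)

weights_gb = params * bits_per_weight / 8 / 1e9
print(f"4-bit weights: ~{weights_gb:.0f} GB")            # ~71 GB
print(f"with overhead: ~{weights_gb * overhead:.0f} GB") # ~78 GB, fits in 128 GB
```

At 16-bit precision the weights alone would need roughly 282 GB, far beyond any single consumer machine. The 4-bit quantization is what makes this model fit on a 128GB M3 Max at all.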

Mixtral 8x22B Model Performance on Apple M3 Max
To test these capabilities, Reddit user PerformanceRound7913 ran the Mixtral 8x22B model on their M3 Max with 128GB of RAM.

Some key findings from their demo video posted on Reddit include:
- The model ran with 4-bit quantization and an 8K-token prompt context.
- The initial response time for the first prompt was around 7 seconds, but subsequent replies were much faster, around 1 second, likely due to prompt caching.
- Throughput was measured at around 4.5 tokens per second on average during interaction.
- Graphics monitoring tools indicated steady GPU utilization around 50-60% while maintaining low power usage under 30W.
- A split terminal view allowed various metrics to be monitored alongside the interaction.
While the average throughput may seem modest compared to dedicated GPU setups, it is impressive for a general-purpose laptop chip. The unified memory pool lets the full model reside on-device without compromises, and fine-tuning support through Apple's MLX framework is another advantage. Overall, the results establish Apple silicon as a viable platform for deploying state-of-the-art models.
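For readers who want to try something similar, here is a minimal sketch using the mlx-lm package (`pip install mlx-lm`, Apple silicon only). The exact setup the Redditor used is not stated; the repo ID below is the 4-bit community upload linked above, and whether it loads directly with mlx-lm depends on its format, so treat that as an assumption. The timing logic is our own addition to estimate tokens per second.

```python
import time
from mlx_lm import load, generate

# Assumption: this 4-bit community upload loads directly with mlx-lm;
# adjust the repo ID if your copy is in a different format.
model, tokenizer = load("mistral-community/Mixtral-8x22B-v0.1-4bit")

prompt = "Explain mixture-of-experts models in two sentences."

start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=128)
elapsed = time.perf_counter() - start

# Rough throughput estimate: generated tokens divided by wall-clock time.
n_tokens = len(tokenizer.encode(text))
print(text)
print(f"~{n_tokens / elapsed:.1f} tokens/sec over {elapsed:.1f}s")
```

Note that the first call includes model load and prompt-processing time, which matches the slower first response observed in the demo.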
M3 Max Inference Speeds
In the video, the Mixtral 8x22B model appears to run at around 4.5 tokens per second on the M3 Max setup. However, commenters pointed out that the video is sped up 22x, so the actual token throughput is likely much lower, and generation would slow down further as the context grows. Still, for a laptop-class chip, these speeds compare favorably with most alternatives.
Apple M3 Max Alternatives and Trade-offs
For raw maximum performance, multiple high-end GPUs would certainly outpace the M3 Max. Running the same model across several Nvidia RTX 3090s could reach 10-12 tokens per second and would hold up better as the context grows. However, the upfront investment and power consumption are significantly higher. Another option is renting cloud GPU instances, but those give up on-device access.
Overall, Apple silicon delivers a good balance of on-device deployment and optimization for general users looking for an "appliance-like" AI experience. Its power efficiency also makes it suitable for deployments that aren't purely compute-focused.
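One way to frame the trade-off is energy per generated token. The sketch below combines the throughput and power figures quoted in this article with an assumed draw for a multi-3090 rig; the GPU power number is a rough estimate, not a measurement.

```python
# Energy per token = power draw (W) / throughput (tokens/s) = joules per token.
setups = {
    # (watts, tokens_per_sec) -- M3 Max figures from the demo above;
    # the 3090 figures are assumptions for a two-GPU rig under load.
    "M3 Max (demo figures)":  (30, 4.5),
    "2x RTX 3090 (assumed)":  (700, 11.0),
}

for name, (watts, tps) in setups.items():
    print(f"{name}: ~{watts / tps:.1f} J/token")
# M3 Max: ~6.7 J/token; 3090 rig: ~63.6 J/token
```

The GPUs are faster in absolute terms, but on these assumptions the M3 Max uses roughly an order of magnitude less energy per token, which is the efficiency argument in a nutshell.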
Conclusion
This demonstration of running the massive Mixtral 8x22B model on an M3 Max establishes Apple silicon as a practical and accessible platform for deploying large AI models, and performance continues to improve with each chip generation.
The unified memory architecture is well positioned for even larger forthcoming models. As ML tooling is further optimized for these systems, more users will be able to benefit from advanced AI on their own hardware.