Mistral AI, a French startup, has been on a mission to deliver the best open models to the developer community. Their latest release, Mixtral 8x7B, is a testament to this commitment. It is the strongest open-weight model with a permissive license and the best model overall in terms of cost/performance trade-offs. The model was trained on open web data and is licensed under Apache 2.0. It handles a context of 32k tokens and supports multiple languages, including English, French, Italian, German, and Spanish. This high-quality sparse mixture-of-experts (SMoE) model with open weights is a game-changer, promising faster and more efficient performance than existing models.
How the Mixtral 8x7B Model Works
Mixtral 8x7B operates on the principle of a Mixture of Experts (MoE) model, a concept that has gained traction in recent years for its efficiency and scalability. Specifically, Mixtral 8x7B comprises 8 experts, each with roughly 7 billion parameters.
Mixtral 8x7B's architecture routes each token through only 2 of its 8 experts, a strategy that optimizes processing efficiency and speed. This approach is reflected in the model metadata, which provides deeper insight into its configuration.
{
  "dim": 4096,
  "n_layers": 32,
  "head_dim": 128,
  "hidden_dim": 14336,
  "n_heads": 32,
  "n_kv_heads": 8,
  "norm_eps": 1e-05,
  "vocab_size": 32000,
  "moe": {
    "num_experts_per_tok": 2,
    "num_experts": 8
  }
}
This configuration showcases an advanced setup, featuring a high-dimensional embedding space (dim: 4096), multiple layers (n_layers: 32), and a substantial number of attention heads (n_heads: 32). The MoE architecture (moe: {"num_experts_per_tok": 2, "num_experts": 8}) underscores the model's focus on efficient, specialized processing for each token.
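To make the routing concrete, here is a minimal sketch of top-2 expert routing in PyTorch. It uses toy dimensions instead of the real 4096/14336 and a simplified SwiGLU-style expert; it illustrates the mechanism, not Mistral's actual implementation:

import torch
import torch.nn.functional as F

# Toy sizes standing in for dim=4096, hidden_dim=14336 in the real model
DIM, HIDDEN = 64, 128
NUM_EXPERTS, EXPERTS_PER_TOK = 8, 2

class Expert(torch.nn.Module):
    """One SwiGLU-style feed-forward expert, as used in Mistral-family models."""
    def __init__(self):
        super().__init__()
        self.w1 = torch.nn.Linear(DIM, HIDDEN, bias=False)
        self.w2 = torch.nn.Linear(HIDDEN, DIM, bias=False)
        self.w3 = torch.nn.Linear(DIM, HIDDEN, bias=False)
    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

experts = torch.nn.ModuleList(Expert() for _ in range(NUM_EXPERTS))
router = torch.nn.Linear(DIM, NUM_EXPERTS, bias=False)  # the gating network

def moe_layer(x):  # x: (num_tokens, DIM)
    scores = router(x)                                    # score all 8 experts per token
    weights, chosen = scores.topk(EXPERTS_PER_TOK, dim=-1)
    weights = F.softmax(weights, dim=-1)                  # renormalize over the chosen 2
    out = torch.zeros_like(x)
    for t in range(x.size(0)):                            # each token runs only 2 experts
        for w, e in zip(weights[t], chosen[t]):
            out[t] += w * experts[int(e)](x[t])
    return out

print(moe_layer(torch.randn(4, DIM)).shape)  # torch.Size([4, 64])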
Size of Mixtral 8x7B
Mixtral 8x7B has 46.7 billion total parameters but uses only 12.9 billion of them per token. It therefore processes input and generates output at the same speed and cost as a 12.9-billion-parameter model. This yields better inference throughput, at the cost of higher VRAM requirements, since all 46.7 billion parameters must still be held in memory.
The model’s reduced size, compared to a hypothetical 8x scale-up of Mistral 7B, is achieved through shared attention parameters, a clever design choice that significantly reduces the overall model size without compromising its performance.
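These headline numbers can be sanity-checked from the metadata above. A rough back-of-envelope count in Python (ignoring layer norms and the tiny router, so the figures are approximate):

# Approximate parameter count from the model metadata above
dim, hidden_dim, n_layers = 4096, 14336, 32
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab = 32000

# Attention: Q and O projections plus smaller K/V (8 key-value heads)
attn = dim*(n_heads*head_dim) + 2*dim*(n_kv_heads*head_dim) + (n_heads*head_dim)*dim
expert = 3 * dim * hidden_dim   # gate, up, and down projections per expert
embed = 2 * vocab * dim         # input embeddings + output head

total = embed + n_layers * (attn + 8*expert)   # all 8 experts stored
active = embed + n_layers * (attn + 2*expert)  # only 2 experts run per token
print(f"total  ≈ {total/1e9:.1f}B")   # ≈ 46.7B
print(f"active ≈ {active/1e9:.1f}B")  # ≈ 12.9B

Note how the attention parameters appear in both totals: they are shared across experts, which is exactly why the model is far smaller than eight independent 7B models.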
Unique Capabilities of Mixtral 8x7B
Mixtral 8x7B offers a range of capabilities:
1. Context Handling
The model can handle a context of 32k tokens, making it suitable for complex tasks that require a large context.
2. Multi-Language Support
Mixtral 8x7B supports multiple languages, including English, French, Italian, German, and Spanish, making it versatile for diverse applications.
3. Code Generation
The model shows strong performance in code generation, as well as in science and mathematics tasks.
4. Instruction-Following Model
Mixtral 8x7B can be fine-tuned into an instruction-following model that achieves a score of 8.3 on MT-Bench.
How Does Mixtral 8x7B Outperform LLaMa 2 Models (7B to 70B)?
Mixtral 8x7B outperforms its predecessor, Mistral 7B, and rivals LLaMa 2 models (7B to 70B) on most benchmarks while delivering six times faster inference. It beats all LLaMa 2 models on MMLU, math, code, knowledge, and reasoning benchmarks. On comprehension, AGI Eval, and BBH benchmarks it outperforms the LLaMa models up to 34B, and comes close to matching the LLaMa 2 70B model.
Moreover, it also outperforms GPT-3.5 and LLaMa 2 70B on most standard benchmarks (MMLU, ARC Challenge, MBPP, and GSM-8K).
Hallucination Rates and Truthfulness
Mixtral 8x7B displays higher truthfulness (73.9% vs. 50.2% on the TruthfulQA benchmark) and less bias on benchmarks than the LLaMa 2 70B model. This gives it a strong foundation, although fine-tuning can further improve safety.
Overall, Mixtral displays more positive sentiments than LLaMa 2 on BOLD, with similar variances within each dimension.
Language
Mixtral 8x7B also masters French, German, Spanish, Italian, and English, surpassing LLaMa 2 models on multilingual benchmarks (ARC-c, HellaSwag, MMLU).
How to Try Mixtral 8x7B?
Below are some ways to try Mistral AI's Mixtral 8x7B model:
1. Using the Mistral AI Platform
Sign up for Mistral's platform to get early beta access. The platform provides easy model access through an API; for business needs, contact Mistral for accelerated access. The Mixtral 8x7B model is available behind the mistral-small endpoint, and you'll also get access to mistral-medium.
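As a quick illustration, here is a minimal sketch of calling the mistral-small endpoint over Mistral's chat completions API. It assumes you have a MISTRAL_API_KEY from the platform; check Mistral's API documentation for the authoritative request format:

import os
import requests

# Minimal chat completion request against the mistral-small endpoint,
# which serves Mixtral 8x7B; the schema follows Mistral's API docs.
resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "mistral-small",
        "messages": [{"role": "user", "content": "Summarize what an SMoE model is."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])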
2. Directly From the Source
If you’re an experienced researcher or developer, you can access Mixtral 8x7B and Mixtral 8x7B Instruct directly by downloading the provided torrent using Mistral’s magnet link.
magnet:?xt=urn:btih:5546272da9065eddeb6fcd7ffddeef5b75be79a7&dn=mixtral-8x7b-32kseqlen
RELEASE a6bbd9affe0c2725c1b7410d66833e24
— Mistral AI (@MistralAI) December 8, 2023
3. Locally on LM Studio
LM Studio lets you run models offline on Mac, Windows, or Linux, and supports GGML/GGUF model files from Hugging Face, including Llama-family models. Version 0.2.9 adds support for Mixtral 8x7B. After installing LM Studio, search for "Mixtral" to find compatible versions to use in the in-app Chat UI or through its OpenAI-compatible local server.
4. Hugging Face
You can use Hugging Face for Mixtral models, including the base and fine-tuned versions for chat-based interactions:
- Mixtral Base Model Card: mistralai/Mixtral-8x7B-v0.1
- Mixtral Instruct Model Card: mistralai/Mixtral-8x7B-Instruct-v0.1
Download ready-to-use checkpoints via the Hugging Face Hub, or convert raw checkpoints to the Hugging Face format. Detailed instructions, including how to load and run the model with Flash Attention 2, are available on the model cards.
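For example, here is a minimal Transformers sketch for loading the Instruct model. It assumes a recent transformers version and enough GPU memory to hold the full 46.7B parameters (on the order of 90 GB in float16):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory vs. float32
    device_map="auto",          # shard across available GPUs
)

messages = [{"role": "user", "content": "Explain a mixture of experts in one sentence."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))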
5. Using Demos
i. Poe
It is available on Poe as Mixtral-8x7B-Chat and is hosted by Fireworks.ai.
ii. Perplexity Labs
Open Perplexity Labs via the link below and select "mixtral-8x7b-instruct" from the model list to start experimenting with this model.
Link: https://labs.perplexity.ai/
iii. HuggingChat
Open HuggingChat via the link below, click the settings icon, and activate mistralai/Mixtral-8x7B-Instruct-v0.1 as the active model.
Link: https://huggingface.co/chat
iv. Together.AI
If you aren't signed up yet, first sign up, then start using mistralai/Mixtral-8x7B-Instruct-v0.1 through the Together.ai API (see the sketch after this list).
Link: https://api.together.xyz/playground/chat/mistralai/Mixtral-8x7B-Instruct-v0.1
v. Vercel
Moreover, Vercel offers a demo of Mixtral 8x7B that lets users compare its responses with those of other conversational AI models, providing a convenient way to compare Mixtral 8x7B against Meta's LLaMA 2 70B.
Link: https://sdk.vercel.ai/
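For the Together.ai route mentioned above, a minimal sketch using their OpenAI-compatible chat completions endpoint might look like this (it assumes a TOGETHER_API_KEY; consult Together's documentation for the current request format):

import os
import requests

# Chat completion through Together.ai's OpenAI-compatible API
resp = requests.post(
    "https://api.together.xyz/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "messages": [{"role": "user", "content": "What makes Mixtral 8x7B fast?"}],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])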
The Mixtral 8x7B Instruct Model
Mixtral 8x7B Instruct, a companion release to the base Mixtral 8x7B, has been optimized through supervised fine-tuning and Direct Preference Optimization (DPO). It achieves a score of 8.30 on MT-Bench, comparable to GPT-3.5, confirming its position as a leading open-weights model in its class.
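For intuition, DPO trains directly on preference pairs instead of fitting a separate reward model. Here is a minimal sketch of the loss, following Rafailov et al. (2023), not Mistral's actual training code:

import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is the summed log-probability of the preferred ("chosen")
    or dispreferred ("rejected") completion under the policy being trained
    (pi_*) or a frozen reference model (ref_*)."""
    chosen_margin = beta * (pi_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (pi_rejected_logp - ref_rejected_logp)
    # Push the policy to prefer chosen completions more than the reference does
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()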
Availability of Mixtral 8x7B
Mistral AI has made Mixtral 8x7B accessible to the community by contributing changes to the vLLM project, which now supports the model. SkyPilot allows deploying vLLM endpoints on any cloud instance, providing flexibility and scalability.
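As a rough sketch, offline inference with vLLM might look like this (it assumes a machine with enough combined GPU memory for the full 46.7B parameters):

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,  # shard the weights across 2 GPUs; adjust to your hardware
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["[INST] What is a sparse mixture of experts? [/INST]"], params)
print(outputs[0].outputs[0].text)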
Mistral AI currently serves this model behind its mistral-small endpoint, which is available in beta. By registering, developers can get early access to all generative and embedding endpoints, making it easy to experiment with and utilize the model's capabilities.
Final Takeaway
The arrival of Mixtral 8x7B is a big step in AI progress. It proves open models can rival top proprietary ones while staying transparent. With its strong performance, speed, and capabilities, and with developers continuing to push AI boundaries, models like Mixtral 8x7B will be vital.
| Related: Mistral 7B: The Best Tiny Model That Beats Llama 2 Models