Large Language Models (LLMs) have grown rapidly in recent years, from billions to trillions of parameters. Scaling, however, leads to prohibitively large models. Microsoft's previous models, Phi-1 and Phi-2, showed how careful data selection can improve the performance of smaller models. Now Microsoft has developed a new model series, Phi-3, comprising the Phi-3-mini, Phi-3-small, and Phi-3-medium models, which perform close to much larger models such as Mixtral 8x7B and GPT-3.5 despite their small sizes.
Microsoft Phi-3 Models Series
1. Phi-3-mini
Phi-3-mini has 3.8 billion parameters and a context length of 4K tokens (with a 128K long-context variant). It uses an architecture similar to Llama's but with a larger vocabulary. The model was trained on 3.3 trillion tokens in the bfloat16 format for efficiency. Quantized to 4 bits, it can generate over 12 tokens per second running locally on an iPhone.
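As a back-of-the-envelope check on why on-device inference is feasible, the raw weight memory can be estimated from the parameter count and bytes per parameter. This is a rough sketch that ignores activations, the KV cache, and runtime overhead:

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    return n_params * bytes_per_param / 1e9

N = 3.8e9  # Phi-3-mini parameter count
bf16_gb = weight_memory_gb(N, 2.0)  # bfloat16: 2 bytes/param
int4_gb = weight_memory_gb(N, 0.5)  # 4-bit quantized: 0.5 bytes/param
print(f"bfloat16: {bf16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
# bfloat16: 7.6 GB, 4-bit: 1.9 GB
```

At bfloat16 the weights alone exceed a phone's usable RAM, which is why the on-device demo relies on 4-bit quantization.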
To run the Microsoft Phi-3-mini Instruct model, you have the following options:
1: The base instruct checkpoints, in two context-length variants: Phi-3-mini-4k-instruct and Phi-3-mini-128k-instruct.
2: ONNX-optimized versions of Phi-3-mini-128k-instruct and Phi-3-mini-4k-instruct, for accelerated inference with ONNX Runtime.
3: A GGUF version of Phi-3-mini-4k-instruct, for llama.cpp-compatible runtimes.
2. Phi-3-small
Encouraged by Phi-3-mini's success, Microsoft also trained Phi-3-small, a 7-billion-parameter model. It employs a more efficient attention mechanism and uses multilingual data. Its vocabulary size is 100,352 and its default context length is 8K. It follows the standard decoder architecture of the 7B model class, with 32 layers and a hidden size of 4096.
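Those dimensions can be sanity-checked against the stated 7B figure with the standard decoder-only estimate of roughly 12·L·d² for the transformer blocks plus the embedding table. This is an approximation that assumes a conventional 4× MLP expansion, which may not match Phi-3-small's exact layout:

```python
def decoder_params(n_layers: int, d_model: int, vocab_size: int,
                   mlp_mult: int = 4) -> int:
    """Rough parameter count for a standard decoder-only transformer."""
    attn = 4 * d_model**2            # Q, K, V, and output projections
    mlp = 2 * mlp_mult * d_model**2  # up- and down-projections
    return n_layers * (attn + mlp) + vocab_size * d_model

est = decoder_params(n_layers=32, d_model=4096, vocab_size=100352)
print(f"~{est / 1e9:.1f}B parameters")
# ~6.9B parameters
```

The estimate lands just under 7B, consistent with the stated model size.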
3. Phi-3-medium
Microsoft also trained a larger Phi-3-medium model, with 14 billion parameters, to explore how its approach scales. Phi-3-medium uses the same architecture as Phi-3-mini but scaled up to 40 attention heads, 40 layers, and an embedding dimension of 5120.
It was trained on the same data mixture as Phi-3-small for 4.8 trillion tokens. Microsoft views the 14B model's numbers as a preliminary preview while it analyzes how to better scale its approach to even larger models.
Phi-3 Training Methodology
Microsoft's breakthrough is the result of an optimized training methodology. Rather than simply scaling up, the team focused on curating high-quality, educational data. Phi-3 was trained on filtered web data from trusted sources and on synthetic data generated by other models.
The training has two phases: phase 1 teaches general knowledge and language understanding, while phase 2 emphasizes logical reasoning. Microsoft claims this "data optimal" approach yields better performance than generic scaling laws would predict.
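Microsoft's actual filtering pipeline is not public, but the idea of curating "educational-quality" data can be illustrated with a toy heuristic gate. This is purely illustrative, with made-up thresholds, and is not Microsoft's method:

```python
def looks_educational(doc: str, min_words: int = 50,
                      min_alpha_ratio: float = 0.7) -> bool:
    """Toy quality gate: keep documents that are long enough and mostly prose."""
    words = doc.split()
    if len(words) < min_words:
        return False  # too short to carry educational content
    # Fraction of characters that are letters or whitespace (vs symbols/digits)
    alpha = sum(ch.isalpha() or ch.isspace() for ch in doc) / max(len(doc), 1)
    return alpha >= min_alpha_ratio

corpus = [
    "BUY NOW!!! $$$ limited offer $$$",        # junk: short, symbol-heavy
    "Photosynthesis converts light energy " * 20,  # prose-like, long enough
]
kept = [d for d in corpus if looks_educational(d)]
print(len(kept))  # only the prose-like document survives
```

Real pipelines layer many such signals (classifiers, deduplication, source trust), but the principle — filter aggressively for quality rather than accumulate volume — is the same.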
Performance Evaluation of Phi-3 Models
Microsoft thoroughly evaluated the Phi-3 models against several academic benchmarks and compared them with state-of-the-art models like Phi-2, Mistral, Gemma, Llama and Mixtral.
Phi-3-mini shows results comparable to GPT-3.5 and Mixtral on reasoning benchmarks such as MMLU and MT-Bench. Notably, it reaches this level despite being roughly 30X smaller than the models it is compared against, underscoring the impact of the optimized training methodology.
Phi-3-small and Phi-3-medium achieve progressively better performance, validating Microsoft's data-focused approach. However, the benchmark gains from 7B to 14B are smaller than expected, indicating that the data mixture still needs tuning for larger models.
Safety and Limitations
Microsoft aligned Phi-3-mini with its AI safety principles. The model underwent various safety tests and evaluations to reduce the likelihood of harmful, toxic, and unintended behaviors.
While Phi-3-mini achieves strong language skills, its small size limits how much factual knowledge it can store, which hurts it on knowledge-heavy tasks. This can be mitigated by pairing it with a search engine. The model also lacks broad multilingual capability, which Microsoft is working on.
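The search-engine pairing amounts to retrieval-augmented prompting: fetch relevant snippets externally and prepend them to the prompt, so the model reasons over fresh evidence instead of relying on memorized facts. A minimal sketch, where the retrieval step itself is assumed to happen elsewhere:

```python
def augment_with_search(question: str, snippets: list[str]) -> str:
    """Prepend retrieved snippets so the model can ground its answer."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer the question using only the search results below.\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = augment_with_search(
    "When was Phi-3 announced?",
    ["Microsoft introduced the Phi-3 family in April 2024."],
)
print(prompt)
```

The augmented prompt is then sent to the model as the user message; the small model supplies the reasoning while the search engine supplies the facts.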
Conclusion
Microsoft's refined training approach shows that data optimization can trump simple scaling. The resulting Phi-3 models, especially the tiny yet mighty Phi-3-mini, demonstrate how capable AI can be deployed directly on devices while still prioritizing safety. This marks an exciting step toward more accessible and beneficial language models.
| Related: Phi-2: Microsoft Powerful Small Model That Beats Llama 2 and Mistral 7B Models