Microsoft Research has made a significant announcement with the release of Phi-2, a small language model that’s surprisingly powerful. This model, with its 2.7 billion parameters, showcases impressive capabilities in reasoning, language understanding, mathematics, coding, and common sense.
Phi-2’s compact size is its strength: it is small enough to run on a laptop or mobile device, which makes it widely accessible for a range of applications. This accessibility does not come at the cost of capability. Phi-2 matches or exceeds the performance of larger foundation models, including Mistral AI’s Mistral 7B, Meta Platforms Inc.’s Llama 2 13B, and, on certain benchmarks, even Llama 2 70B.
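To make the laptop claim concrete, here is a back-of-the-envelope sketch of how much memory 2.7 billion weights occupy at different numeric precisions. The bytes-per-parameter values are common storage formats chosen for illustration, not formats the article says Phi-2 ships in, and the estimate counts only the weights (no activations, cache, or runtime overhead).

```python
# Rough weight-memory estimate for a 2.7B-parameter model.
# Assumption: only the weights are counted; activations, caches,
# and runtime overhead are ignored.

PHI2_PARAMS = 2.7e9  # parameter count stated in the article

# Bytes per parameter for common storage formats (illustrative choices).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for fmt, width in BYTES_PER_PARAM.items():
        print(f"{fmt}: {weight_memory_gb(PHI2_PARAMS, width):.2f} GB")
```

Under these assumptions the weights take roughly 5.4 GB at 16-bit precision, which is within reach of an ordinary laptop. Whether lower precisions preserve Phi-2’s benchmark quality is not something the article addresses.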
Training of Phi-2
Phi-2 uses a standard transformer architecture and was trained with a next-word prediction objective on 1.4 trillion tokens of synthetic and web data covering natural language and code. Training took 14 days on 96 A100 GPUs.
Knowledge transfer from the 1.3B-parameter Phi-1.5 model reduced the cost of training the larger model from scratch. Phi-2 has not been aligned through reinforcement learning from human feedback (RLHF), yet Microsoft reports that its tailored data curation yielded better toxicity and bias behaviour than existing open-source models that did go through alignment.
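The training figures above imply a rough compute budget and throughput, worked out in the sketch below. It assumes all 96 GPUs ran continuously for the full 14 days and that 1.4 trillion is the total number of tokens processed; neither assumption is confirmed by the article, so treat the results as order-of-magnitude estimates.

```python
# Back-of-the-envelope training budget from the figures in the article.
# Assumptions (not confirmed by the source): every GPU ran for the whole
# 14 days, and 1.4e12 is the total number of tokens processed.

NUM_GPUS = 96
TRAINING_DAYS = 14
TOTAL_TOKENS = 1.4e12

SECONDS_PER_DAY = 24 * 60 * 60


def gpu_hours(num_gpus: int, days: int) -> int:
    """Total GPU-hours if every GPU runs for the whole period."""
    return num_gpus * days * 24


def tokens_per_second(total_tokens: float, days: int) -> float:
    """Average cluster-wide token throughput over the whole run."""
    return total_tokens / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    cluster_rate = tokens_per_second(TOTAL_TOKENS, TRAINING_DAYS)
    print(f"GPU-hours: {gpu_hours(NUM_GPUS, TRAINING_DAYS):,}")
    print(f"Cluster throughput: {cluster_rate:,.0f} tokens/s")
    print(f"Per-GPU throughput: {cluster_rate / NUM_GPUS:,.0f} tokens/s")
```

Under these assumptions the run amounts to 32,256 GPU-hours, with an average of about 1.16 million tokens per second across the cluster, or roughly 12,000 tokens per second per GPU.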
Strategies Behind Phi-2’s Strong Performance
Microsoft scaled up Phi-2 from its earlier 1.3-billion-parameter Phi-1.5 model using two key strategies.
1. Training Data Curation
Phi-2’s training data deliberately focuses on “textbook-style” content that explicitly teaches common sense reasoning and general knowledge across science, daily living, theory of mind, and more. Additional filtered web content bolsters this foundation. This builds on Microsoft’s earlier observations about the high value of textbook-like data.
2. Innovative Scaling Technique
Knowledge from Phi-1.5 was efficiently ported over to the larger Phi-2 during training. This scaled knowledge transfer not only accelerates training convergence but also delivers a clear boost in Phi-2’s benchmark scores.
Phi-2 Evaluation Across Several Benchmarks
Microsoft evaluated Phi-2 on academic benchmarks and compared it with popular language models: the Llama 2 family (7B to 70B) and Mistral 7B.
The academic benchmarks covered in this evaluation span the following categories:
- Big Bench Hard (BBH): 3-shot with chain-of-thought (CoT) prompting
- Commonsense Reasoning: PIQA, WinoGrande, ARC easy and challenge, SIQA
- Language Understanding: HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ
- Math: GSM8k (8-shot)
- Coding: HumanEval, MBPP (3-shot)
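The “k-shot” labels in the list above give the number of worked examples placed in the prompt before the actual test question. The sketch below shows the general idea. The function name `build_few_shot_prompt`, the “Question/Answer” template, and the sample questions are all hypothetical; the article does not describe the exact prompt format Microsoft used.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]], question: str, k: int
) -> str:
    """Build a k-shot prompt: k solved examples followed by the new question.

    The "Question:/Answer:" template is an illustrative assumption,
    not the format used in the Phi-2 evaluation.
    """
    if k > len(examples):
        raise ValueError("not enough examples for the requested k")
    # Each of the first k examples is shown with its answer filled in.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The test question is appended with the answer left for the model.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Made-up arithmetic examples, not taken from any benchmark.
    demo_examples = [
        ("What is 2 + 3?", "5"),
        ("What is 7 - 4?", "3"),
        ("What is 6 * 2?", "12"),
    ]
    print(build_few_shot_prompt(demo_examples, "What is 9 + 1?", k=3))
```

With `k=3` the prompt contains three solved examples and one open question, which corresponds to a 3-shot setting such as the one listed for MBPP.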
Below are the results of the evaluation:
1. Surpassing Llama-2 Models and Mistral 7B
Phi-2 matches or beats the Llama 2 models on BBH, commonsense reasoning, language understanding, coding, and math benchmarks, although against the 70B model this holds only on certain benchmarks. Its lead in coding and mathematical reasoning is especially notable, since these scores equal or exceed those of popular open-source models with far more parameters. Phi-2 also surpasses Mistral 7B in every category except language understanding.
2. Surpassing Google’s Gemini Nano 2
Google recently unveiled Gemini Nano 2, which has 3.2 billion parameters, and Phi-2 outperforms it as well on benchmarks such as BoolQ, MBPP, and MMLU.
In addition to these public benchmarks, Microsoft evaluated Phi-2 on several internal proprietary datasets and tasks, again comparing it with Mistral and Llama 2. It observed similar trends: according to Microsoft, Phi-2 on average outperforms Mistral 7B and the Llama 2 models (7B, 13B, and 70B).
Ideal Foundation for Advancing LM Research
Phi-2’s blend of efficiency and performance makes it an attractive foundation for researchers to build upon. Directions such as mechanistic interpretability, safety improvements, and task-specific fine-tuning represent promising opportunities.
To spur innovation, Microsoft has made Phi-2 publicly available through the Azure AI Studio platform. With Phi-2 as a base, the future looks bright for compact yet mighty language model research!