Meta Research Introduces Revolutionary Self-Rewarding Language Models Capable of GPT-4 Level Performance

Language models have come a long way in recent years, but reaching superhuman capabilities may require models that keep improving and can supply high-quality feedback to themselves. Traditional approaches train a reward model on human preference data. Such a reward model is limited by the level of human performance, and because it is frozen once trained, it cannot improve while the language model is being trained. A new paradigm from Meta, Self-Rewarding Language Models (SRLMs), takes a different approach: instead of relying on a separate reward model, the language model generates its own rewards, so its instruction-following and reward-modelling abilities can improve together.

Meta Self-Rewarding Language Models (SRLMs) in Action

The SRLM approach relies on two key skills: instruction following and self-instruction creation. As an instruction-following model, it generates helpful, high-quality responses to user prompts. At the same time, it can generate and evaluate new instruction-following examples to add to its own training set. This dual capability lets the model align itself and keep improving its performance.
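
Self-instruction creation can be implemented in several ways. A common one, and the one sketched here under that assumption, is few-shot prompting: show the model a handful of prompts sampled from the seed data and ask it to write a new one. The function name `build_self_instruct_prompt`, the `HEADER` wording and the default `k=4` are hypothetical illustrations, not Meta's actual prompt.

```python
import random
from typing import List, Optional

# Hypothetical instruction text; not the prompt used in the research.
HEADER = "Below are some example tasks. Write one new task that is different from all of them."

def build_self_instruct_prompt(
    seed_prompts: List[str],
    k: int = 4,
    rng: Optional[random.Random] = None,
) -> str:
    """Build a few-shot prompt that asks the model to write a new instruction."""
    rng = rng or random.Random()
    # Sample k distinct seed prompts (or all of them if fewer than k exist).
    examples = rng.sample(seed_prompts, k=min(k, len(seed_prompts)))
    lines = [HEADER, ""]
    for i, example in enumerate(examples, start=1):
        lines.append(f"Task {i}: {example}")
    # Leave the next slot open so the model's continuation is the new task.
    lines.append(f"Task {len(examples) + 1}:")
    return "\n".join(lines)
```

Ending the prompt with an empty numbered slot means whatever the model writes next can be taken directly as the new instruction.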

So, instead of relying on a frozen reward model, the model acts as its own judge, providing the feedback and rewards used during training. With this technique, known as LLM-as-a-Judge prompting, the reward model is the language model itself, so it is updated and improved in every training iteration rather than staying fixed.
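
A minimal sketch of LLM-as-a-Judge prompting, assuming the judge is asked to end its answer with a line of the form `Score: N`. `JUDGE_TEMPLATE`, `build_judge_prompt` and `parse_score` are hypothetical names, and the template is a heavily shortened stand-in for the detailed rubric prompt described in the research.

```python
import re
from typing import Optional

# Hypothetical, heavily shortened judge template.
JUDGE_TEMPLATE = (
    "Review the user's question and the corresponding response using an additive "
    "5-point scoring system. Explain your reasoning, then end with the line "
    "'Score: <total points>'.\n\n"
    "User: {prompt}\n\n"
    "Response: {response}\n"
)

# A verdict is 'Score:' followed by a single digit from 0 to 5.
SCORE_PATTERN = re.compile(r"Score:\s*([0-5])\b")

def build_judge_prompt(prompt: str, response: str) -> str:
    """Fill the judge template with the user prompt and the candidate response."""
    return JUDGE_TEMPLATE.format(prompt=prompt, response=response)

def parse_score(judge_output: str) -> Optional[int]:
    """Return the last 'Score: N' value in the judge's output, or None if absent."""
    matches = SCORE_PATTERN.findall(judge_output)
    return int(matches[-1]) if matches else None
```

Taking the last match rather than the first lets the judge mention scores in its reasoning without confusing the parser; an output with no parsable score returns None, so the caller can discard or re-sample it.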

The Iterative DPO Framework

SRLMs are trained with an iterative version of Direct Preference Optimization (DPO), a method that trains a model directly on pairs of preferred and dispreferred responses. Starting from a seed model, training proceeds through multiple iterations, each improving both instruction following and reward modelling. In each iteration, the model generates candidate responses to prompts and evaluates their quality using LLM-as-a-Judge prompting.
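
The generate-then-judge step can be sketched as below. The model is abstracted behind two hypothetical callables, `generate` (samples one response) and `judge` (returns one LLM-as-a-Judge score), so the sketch runs without any particular inference library. Averaging several judge samples, keeping the highest- and lowest-scoring candidates as the chosen/rejected pair, and the defaults of 4 candidates and 3 judge samples are illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def make_preference_pair(
    prompt: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], float],
    n_candidates: int = 4,
    n_judge_samples: int = 3,
) -> Optional[PreferencePair]:
    """Sample candidates, score each with the model-as-judge, and pair best vs. worst."""
    # 1. Sample several candidate responses from the current model.
    candidates = [generate(prompt) for _ in range(n_candidates)]
    # 2. Score each candidate several times and average, to smooth out judge noise.
    scores = [
        mean(judge(prompt, c) for _ in range(n_judge_samples)) for c in candidates
    ]
    # 3. Highest average becomes "chosen", lowest becomes "rejected".
    best = max(range(n_candidates), key=lambda i: scores[i])
    worst = min(range(n_candidates), key=lambda i: scores[i])
    if scores[best] == scores[worst]:
        return None  # all candidates tied: no preference signal for this prompt
    return PreferencePair(prompt, candidates[best], candidates[worst])
```

When every candidate gets the same average score there is nothing to prefer, so the function returns None and the prompt is simply skipped.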

The preference dataset created from this process is then used to train the next iteration of the model, so its response-generation and reward-modelling abilities reinforce each other from one iteration to the next. Unlike prior methods, the reward signal is not capped by a fixed reward model trained on human preferences, which in principle leaves room for the model to push beyond human-level performance.
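
Each chosen/rejected pair then feeds a DPO update. The sketch below computes the standard DPO loss for a single pair from summed token log-probabilities under the model being trained (the policy) and a frozen reference model: loss = -log σ(β · [(log π(chosen) - log π_ref(chosen)) - (log π(rejected) - log π_ref(rejected))]). Computing the log-probabilities themselves is left out, and `beta=0.1` is a commonly used value assumed here, not one taken from the research.

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair (summed log-probs per response)."""
    # How much more (log-)likely each response is under the policy than the reference.
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # -log(sigmoid(margin)) = softplus(-margin), written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss is log 2 (about 0.693); the loss falls as the policy shifts probability toward the chosen response relative to the reference.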

Performance Keeps Increasing Over Three Iterations, Reaching GPT-4 Level

The team started with Llama 2 70B, a powerful pre-trained model, and fine-tuned it on a small seed dataset of human-annotated examples. The seed data taught the model two things: to carry out tasks from prompts, and to judge responses, including its own, on a 5-point scale. With both skills in place, the model could create additional training examples by attempting prompts and scoring its own responses.
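
The two kinds of seed supervision described above can be pictured as two formats of supervised fine-tuning example: a prompt paired with a response, and a judging request paired with a written justification that ends in a score. This is a hypothetical sketch; `SFTExample`, the helper names and the judging text are illustrative, and scores are assumed to be integers from 0 to 5 because an additive rubric can award zero points.

```python
from dataclasses import dataclass

@dataclass
class SFTExample:
    source: str  # text shown to the model
    target: str  # text the model is fine-tuned to produce

def instruction_example(prompt: str, response: str) -> SFTExample:
    """Seed example for instruction following: a prompt and a human-written response."""
    return SFTExample(source=prompt, target=response)

def judge_example(prompt: str, response: str, justification: str, score: int) -> SFTExample:
    """Seed example for self-evaluation: a judging request, then reasoning and a score."""
    if not 0 <= score <= 5:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    source = (
        "Rate the response to the user's question on a 5-point scale, "
        "explain why, and end with 'Score: <points>'.\n\n"
        f"User: {prompt}\n\nResponse: {response}\n"
    )
    return SFTExample(source=source, target=f"{justification}\nScore: {score}")

# Both kinds of example go into one seed fine-tuning set.
seed_set = [
    instruction_example("Name a primary colour.", "Red is a primary colour."),
    judge_example("Name a primary colour.", "Red.", "Correct but very terse.", 3),
]
```

Mixing both formats in one fine-tuning set is what gives a single model both skills before any self-rewarding iteration begins.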

Through an iterative process of self-instruction, response generation, and self-evaluation, the researchers trained an increasingly capable model without any human annotation beyond the seed data. Each round used the previous model's self-generated examples to train the next version via preference learning.
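
The overall loop can be sketched as a function that takes the seed model M0 and returns the sequence M0, M1, ..., MT. `build_preferences` (self-instruction, generation and self-judging with the current model) and `train_dpo` (one round of preference training) are hypothetical callables standing in for the real pipeline; the point is the data flow, in which every model is trained on preferences produced and judged by its predecessor.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")
Pair = TypeVar("Pair")

def self_rewarding_training(
    seed_model: Model,
    num_iterations: int,
    build_preferences: Callable[[Model], List[Pair]],
    train_dpo: Callable[[Model, List[Pair]], Model],
) -> List[Model]:
    """Return [M0, M1, ..., MT], where M(t+1) is trained on data built by M(t)."""
    models = [seed_model]
    for _ in range(num_iterations):
        current = models[-1]
        # The current model writes prompts, answers them and judges its own answers.
        pairs = build_preferences(current)
        # Preference training on those pairs produces the next model.
        models.append(train_dpo(current, pairs))
    return models
```

With `num_iterations=3`, the last element corresponds to the third-iteration model whose results are discussed next.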

Remarkably, after just three iterations, the resulting model outperformed many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro and GPT-4 0613. This suggests that Self-Rewarding Language Models (SRLMs) can reach GPT-4-level performance, at least on this benchmark.

The Potential for Continuous Improvement

By rewarding themselves, Self-Rewarding Language Models break free from the constraints of a fixed reward model. This opens up the possibility of language models that continually improve at both instruction following and reward modelling. The gains will likely level off at some point in realistic settings, but the prospect of obtaining better reward models and language models through this loop is intriguing.


Self-Rewarding Language Models offer a new way to train language models by combining instruction following and reward modelling in a single iterative framework. Because the model creates its own rewards, its abilities as both responder and judge can improve together, and after three iterations it outperformed many existing systems on AlpacaEval 2.0. As the field progresses, SRLMs could help pave the way for agents that keep enhancing their own abilities, possibly beyond human level.



Faizan Ali Naqvi

Research is my hobby and I love to learn new skills. I make sure that every piece of content that you read on this blog is easy to understand and fact-checked!
