Researchers from DeepSeek and Tsinghua University have proposed a new approach to AI reasoning that could change how large language models (LLMs) handle complex problems. The method is a dual strategy that combines generative reward modelling (GRM) with a technique called Self-Principled Critique Tuning (SPCT). According to a paper published on Friday, this approach improves the quality of AI responses without requiring bigger models.
Table of Contents
- How DeepSeek’s New AI Reasoning Method Works
- The Evolution of DeepSeek’s AI Research
- Why DeepSeek’s AI Reasoning Method Matters
- The Technical Foundation
- Performance Evaluation
- The Science Behind Self-Principled Critique Tuning
- How DeepSeek’s Approach Compares to Traditional Methods
- The Impact of DeepSeek’s Dual Approach
- Real-World Benefits of DeepSeek’s New AI Reasoning Method
- Wrapping Up
How DeepSeek’s New AI Reasoning Method Works
DeepSeek’s AI reasoning method tackles one of AI’s biggest challenges: getting better performance without just making models larger. The approach uses something called “inference-time scaling”, which means spending extra computation while the model is answering rather than while it is being trained, and it is pretty clever in its simplicity.
Here’s what it does: instead of judging candidate answers once against a fixed rubric, the system writes custom evaluation principles specifically for that question and uses them to critique and score each answer. It repeats this several times and pools the results to decide which answer is best. This mimics how humans tackle problems: we often consider different angles before settling on our final answer.
“We propose Self-Principled Critique Tuning to foster scalable reward generation behaviours in GRMs through online reinforcement learning,” the researchers explain. This helps the AI generate better evaluation criteria and more accurate assessments.
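A single evaluation pass of this kind can be sketched in a few lines. This is a minimal illustration, not DeepSeek’s implementation: the TextModel type, the judge_once function, the prompt wording and the toy echo_model are all assumptions made for the example.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for any text-generation call (prompt in, text out).
TextModel = Callable[[str], str]


def judge_once(model: TextModel, query: str, responses: Sequence[str]) -> tuple[str, str]:
    """One evaluation pass: write principles for this query, then critique under them."""
    # Step 1: ask for evaluation principles tailored to the query (prompt wording is assumed).
    principles = model(f"List the criteria that matter for answering:\n{query}")
    # Step 2: critique every candidate response against those principles.
    numbered = "\n".join(f"Response {i + 1}: {r}" for i, r in enumerate(responses))
    critique = model(
        f"Query: {query}\nPrinciples:\n{principles}\n{numbered}\n"
        "Critique each response against the principles and score it from 1 to 10."
    )
    return principles, critique


def echo_model(prompt: str) -> str:
    # Toy model with canned replies, used only so the sketch runs without an LLM.
    if prompt.startswith("List the criteria"):
        return "1. Correctness\n2. Clarity"
    return "Response 1 is correct. Response 2 is wrong. Scores: 9, 2"


principles, critique = judge_once(echo_model, "What is 2 + 2?", ["4", "5"])
```

Because sampling from a real model is not deterministic, calling judge_once repeatedly would yield different principles and critiques each time, which is what makes pooling several passes worthwhile.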
The Evolution of DeepSeek’s AI Research
DeepSeek has been steadily advancing its AI capabilities. In March, the company upgraded its V3 model (DeepSeek-V3-0324) with “enhanced reasoning capabilities, optimized front-end web development and upgraded Chinese writing proficiency.”
The company’s R1 reasoning model previously attracted global attention for its cost-efficient performance relative to leading models, and rumours suggest that DeepSeek-R2 could be released soon, though the company hasn’t officially confirmed this.
The company has also demonstrated commitment to open research by open-sourcing five of its code repositories. According to their recent paper, the DeepSeek-GRM models will also be released as open-source, though no specific timeline was provided.
This latest research on inference-time scaling represents a significant advancement in DeepSeek’s approach to AI reasoning.
Why DeepSeek’s AI Reasoning Method Matters
Until now, AI has mostly improved by scaling up – adding more parameters and using more computing power. DeepSeek’s AI reasoning method flips the script by making models smarter rather than just bigger.
In their testing, the researchers showed that their 27 billion parameter model with inference-time scaling could match a model with 671 billion parameters. That’s a massive efficiency gain in model size: comparable results from a model roughly 25 times smaller, paid for with extra computation at answer time.
Their research demonstrated that “direct voting with 32 samples of DeepSeek-GRM-27B could achieve comparable performance with the 671B MoE model.” This suggests we don’t need to keep building ever-larger models to get better AI performance.
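The voting idea in that quote can be illustrated with a small sketch. It assumes each sampled judgment yields one integer score per candidate response and that votes are combined by summing, which is one simple reading of “direct voting”; the function names and the example scores are made up for illustration.

```python
from typing import Sequence


def vote(samples: Sequence[Sequence[int]]) -> list[int]:
    """Sum per-response scores across independently sampled judgments."""
    if not samples:
        raise ValueError("need at least one sampled judgment")
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("every sample must score the same number of responses")
    return [sum(s[j] for s in samples) for j in range(n)]


def best_response(samples: Sequence[Sequence[int]]) -> int:
    """Index of the response with the highest summed score."""
    totals = vote(samples)
    return max(range(len(totals)), key=lambda j: totals[j])


# Three sampled judgments over two candidate responses (illustrative numbers).
samples = [[7, 5], [4, 6], [8, 5]]
totals = vote(samples)            # [19, 16]
winner = best_response(samples)   # 0
```

In this toy data the second judgment on its own would have preferred the second response, but the pooled totals favour the first. Summing also widens the score range: three judgments on a 1 to 10 scale give totals between 3 and 30, so ties become less likely.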
The Technical Foundation
The system relies on two key components:
1. Generative Reward Modeling (GRM)
Unlike traditional reward models that just assign numeric scores, GRMs produce detailed textual feedback explaining their evaluations.
2. Self-Principled Critique Tuning (SPCT)
This novel approach teaches the model to create custom evaluation principles for each specific query and then use these principles to assess different responses.
The combination creates a system that adapts its evaluation criteria based on the question – similar to how humans use different standards when judging different types of problems.
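Since a generative reward model answers in text, the numeric reward has to be read back out of the critique. The sketch below shows one way to do that, assuming a hypothetical output convention in which the critique ends with a line such as “Scores: 9, 2”; the actual format used by DeepSeek-GRM may differ.

```python
import re
from typing import Optional


def extract_scores(critique: str, n_responses: int) -> Optional[list[int]]:
    """Pull one integer score per response out of a textual critique.

    Returns None when the critique does not follow the assumed
    "Scores: a, b, ..." convention or lists the wrong number of scores.
    """
    match = re.search(r"Scores:\s*([\d,\s]+)", critique)
    if match is None:
        return None
    scores = [int(token) for token in re.findall(r"\d+", match.group(1))]
    return scores if len(scores) == n_responses else None


good = extract_scores("Response 1 is correct. Response 2 is wrong. Scores: 9, 2", 2)
missing = extract_scores("Both responses look fine to me.", 2)
```

Returning None for malformed output, rather than guessing, lets a caller discard that sample and draw another one.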
Performance Evaluation
The research team put their approach through rigorous testing against existing methods and models. DeepSeek-GRM-27B outperformed baseline methods and achieved competitive results compared to much larger models like Nemotron-4-340B-Reward and GPT-4o.
What really stands out is how the model performed with inference-time scaling. When allowed to sample multiple sets of principles and critiques (up to 32 samples) and combine them with meta-reward model guidance, DeepSeek-GRM-27B achieved an overall performance score of 72.8%, beating all other tested models.
This shows that DeepSeek’s AI reasoning method can effectively use additional computing at inference time rather than requiring more parameters during training.
The Science Behind Self-Principled Critique Tuning
Self-Principled Critique Tuning represents a shift in how reward models are trained. Instead of using fixed evaluation criteria, SPCT teaches models to generate appropriate evaluation standards for each query.
The training happens in two phases:
- Rejective fine-tuning, which helps the model learn to generate principles and critiques in the correct format
- Rule-based online reinforcement learning that improves the quality of these principles and critiques
“SPCT enables GRMs to learn to adaptively posit principles and critiques based on the input query and responses, leading to better outcome rewards in general domains,” the researchers write.
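The two phases can be pictured with a simplified sketch. It assumes training data where the best response for each query is known, a rule that rewards a judgment with +1 when it scores that response strictly highest and -1 otherwise, and a rejection step that keeps only judgments earning +1. These rules and the function names are assumptions for illustration; the paper defines the exact reward and rejection criteria.

```python
from typing import Sequence


def rule_based_reward(scores: Sequence[int], best_index: int) -> int:
    """+1 if the known-best response gets the strictly highest score, else -1."""
    top = scores[best_index]
    others = [s for i, s in enumerate(scores) if i != best_index]
    return 1 if all(top > s for s in others) else -1


def keep_for_fine_tuning(trajectories: Sequence[dict], best_index: int) -> list[dict]:
    """Rejective step (simplified): drop sampled judgments that disagree with the label."""
    return [t for t in trajectories if rule_based_reward(t["scores"], best_index) == 1]


# Two sampled judgments for a query whose first response is labelled best.
sampled = [{"scores": [9, 2]}, {"scores": [3, 7]}]
kept = keep_for_fine_tuning(sampled, best_index=0)
```

The same rule serves both phases in this sketch: as a filter when collecting fine-tuning data, and as the reward signal during reinforcement learning. A tie counts as a failure here because a tied score does not identify the best response.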
How DeepSeek’s Approach Compares to Traditional Methods
Traditional reward modelling approaches typically fall into three categories:
- Scalar approaches that output simple numerical scores
- Semi-scalar approaches that generate textual judgments plus scalar rewards
- Generative approaches that produce textual feedback from which rewards can be extracted
DeepSeek’s AI reasoning method takes the generative approach but adds a key innovation: the ability to create query-specific principles for evaluation rather than using fixed criteria across all queries.
This overcomes the limitations of previous methods by being more flexible for different input types and producing more diverse, high-quality reward signals through sampling.
The Impact of DeepSeek’s Dual Approach
Combining generative reward modeling with Self-Principled Critique Tuning changes how the reward model arrives at a judgment.
The system generates multiple sets of principles and critiques in parallel and then uses a meta-reward model to guide the voting process, achieving what the researchers call “finer-grained outcome rewards within an expanded value space.”
This means the system can make more nuanced judgments about responses rather than being limited to simplistic evaluations.
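Meta-reward guidance can be sketched as a filter applied before voting. The example assumes the meta-reward model assigns each sampled judgment a quality score and that only the top k_meta judgments are summed; the function name, the parameter k_meta and all numbers are illustrative assumptions.

```python
from typing import Sequence


def meta_guided_vote(
    samples: Sequence[Sequence[int]],
    meta_scores: Sequence[float],
    k_meta: int,
) -> list[int]:
    """Keep the k_meta judgments the meta-reward model trusts most, then sum their scores."""
    if len(samples) != len(meta_scores):
        raise ValueError("need one meta score per sampled judgment")
    # Rank judgments by meta score, highest first, and keep the top k_meta.
    ranked = sorted(range(len(samples)), key=lambda i: meta_scores[i], reverse=True)
    kept = ranked[:k_meta]
    n = len(samples[0])
    return [sum(samples[i][j] for i in kept) for j in range(n)]


# Three sampled judgments over two responses; the second judgment is rated unreliable.
judgments = [[8, 3], [2, 9], [7, 4]]
reliability = [0.9, 0.1, 0.8]
filtered_totals = meta_guided_vote(judgments, reliability, k_meta=2)  # [15, 7]
```

With all three judgments summed, the totals would be 17 and 16, a near tie. Dropping the low-rated judgment separates the two responses clearly, which is the kind of effect the filtering step is meant to produce.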
Real-World Benefits of DeepSeek’s New AI Reasoning Method
The practical benefits of this approach extend beyond technical improvements in model performance. By enabling more effective reasoning with smaller models, this method could:
- Reduce the environmental impact of AI by requiring less computing power
- Make advanced AI capabilities more accessible to organizations with limited resources
- Improve user experiences by providing more thoughtful, well-reasoned responses
- Enable AI systems to handle more complex reasoning tasks that previously required human judgment
Wrapping Up
What makes DeepSeek’s AI reasoning method particularly significant is how it addresses fundamental limitations in current AI approaches. By enabling models to establish their own context-specific evaluation criteria, the system can adapt to a wider range of problems without specialized training.
This adaptability is crucial for creating truly general-purpose AI systems that can reason effectively across domains. The research suggests we’re moving closer to AI systems that can think more like humans – considering multiple perspectives, establishing appropriate evaluation criteria, and making nuanced judgments.
As DeepSeek continues to develop this approach and presumably incorporates it into future models like the rumoured R2, we may see significant advancements in how effectively AI systems can reason about complex problems.