SYNTHETIC-1 is a collaborative project by the Prime Intellect team. Their goal is to create the largest open-source dataset of verified reasoning traces for mathematics, coding, and science. The project builds on DeepSeek AI’s popular DeepSeek-R1 model, which demonstrated the power of high-quality, cold-start synthetic data for reinforcement learning (RL). The Prime Intellect team saw the potential of DeepSeek-R1 and took it further, scaling its application to build open reasoning models. By launching SYNTHETIC-1, they created a comprehensive, high-quality dataset designed to improve AI training.
Overview of SYNTHETIC-1 Dataset
The SYNTHETIC-1 dataset comprises 1.4 million high-quality tasks and verifiers across various domains, including mathematics, coding, and STEM. Each task is carefully curated to meet rigorous standards of verifiability. The structured nature of SYNTHETIC-1 ensures that models trained on it can achieve higher levels of accuracy and reliability.
Prime Intellect’s journey towards building fully open-source reasoning models began with the release of their INTELLECT-MATH model and the DeepSeek-R1 paper. These advancements have provided crucial insights into training powerful reasoning models.
The Role of DeepSeek-R1 in SYNTHETIC-1
At the heart of SYNTHETIC-1 lies the popular DeepSeek-R1 model, which generates the synthetic data for reasoning tasks. The DeepSeek team’s approach involved first training the DeepSeek-R1-Zero model, starting from the DeepSeek-V3 base model, entirely through reinforcement learning with the Group Relative Policy Optimization (GRPO) technique introduced in DeepSeekMath. They then used R1-Zero to generate cold-start long Chain-of-Thought reasoning data, which was used to fine-tune DeepSeek-V3.
Finally, the team applied GRPO training again on the resulting supervised fine-tuned (SFT) model, producing the stronger DeepSeek-R1. Cold-start data for SFT significantly improves model performance, which is why R1 is much stronger than R1-Zero.
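For context, GRPO does away with the separate value model used in PPO: for each question q, a group of G outputs is sampled from the current policy, and each output’s reward is normalized against the group to form its advantage. In the simplified per-output form given in the DeepSeek papers, the objective is:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}\,A_i,\ \operatorname{clip}\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,1-\varepsilon,\,1+\varepsilon\right)A_i\right)\right]-\beta\,\mathbb{D}_{\mathrm{KL}}\left(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\right)$$

$$A_i=\frac{r_i-\operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}$$

Because the baseline comes from group statistics rather than a learned critic, GRPO pairs naturally with the automatically verifiable rewards that tasks like those in SYNTHETIC-1 provide.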
Achieving 2 Million Reasoning Samples with DeepSeek-R1
One of the most remarkable achievements of SYNTHETIC-1 is the generation of over 2 million reasoning samples with DeepSeek-R1. The dataset thereby aims to provide high-quality data for training next-generation AI models. The ability to generate cold-start synthetic data is particularly important: it allows models to learn effectively from scratch, improving their performance in the early stages of development.
SYNTHETIC-1 Tasks and Verifiers
The dataset includes a diverse array of problems:
1. Verifiable Math Problems
These are approximately 777,000 curated mathematical challenges, drawn primarily from high-school competition questions. The problems have undergone LLM-based filtering to ensure their verifiability; a minimal sketch of such an answer check appears after this list.
2. Coding Problems
Around 144,000 coding tasks are sourced from various platforms and rewritten into multiple programming languages to enhance diversity.
3. Software Engineering Problems
The dataset features 70,000 real-world software engineering challenges derived from GitHub commits, designed to test practical coding skills.
4. STEM Questions
Over 313,000 open-ended questions span a wide range of scientific disciplines, ensuring comprehensive coverage of technical inquiries.
5. Synthetic Code Understanding Tasks
This innovative category consists of 61,000 tasks in which models must reason about program behavior, for instance predicting a snippet’s output for a given input. These remain challenging even for state-of-the-art language models.
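To make “verifiable” concrete, the sketch below shows the kind of check a math verifier can perform: extract the final answer from a \boxed{...} expression (a common convention for competition math) and compare it with the stored ground truth. This is an illustrative sketch, not SYNTHETIC-1’s actual verifier code; both function names are invented for the example.

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    """Pull the contents of the last \\boxed{...} in a model response.

    Handles one level of nested braces, enough for answers like
    \\boxed{\\frac{1}{2}}.
    """
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", response)
    return matches[-1].strip() if matches else None

def verify_math_answer(response: str, ground_truth: str) -> bool:
    """Binary reward: True if the extracted answer matches the ground truth."""
    answer = extract_boxed_answer(response)
    return answer is not None and answer == ground_truth.strip()

# A correct response passes verification.
response = "The sum telescopes, so the result is \\boxed{42}."
print(verify_math_answer(response, "42"))  # True
```

Real verifiers would also normalize equivalent forms (for example 0.5 versus \frac{1}{2}), but even a deterministic string check illustrates why such tasks yield clean reward signals for RL.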
The Technical Framework
GENESYS is the open-source library that powers SYNTHETIC-1, providing the infrastructure necessary for synthetic data generation and verification. This framework is designed to be user-friendly and easily extendable, allowing developers to create their own tasks and verifiers with minimal effort. GENESYS incorporates advanced verifiers like LLM judges, which assess the quality and accuracy of generated responses. The library supports containerized code execution environments, enabling efficient processing of tasks without bottlenecks.
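The genesys library’s real interfaces may differ, but an extendable task-and-verifier pairing in this style could look roughly like the following sketch; every class and method name here is an assumption made for illustration:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """A generation task: a prompt plus whatever the verifier needs."""
    prompt: str
    metadata: dict  # e.g. ground-truth answer, unit tests, or a judge rubric

class Verifier:
    """Base interface: score one model response for one task."""
    def verify(self, task: Task, response: str) -> float:
        raise NotImplementedError

class ExactMatchVerifier(Verifier):
    """Deterministic check against a stored ground-truth answer."""
    def verify(self, task: Task, response: str) -> float:
        return float(response.strip() == task.metadata["answer"])

class LLMJudgeVerifier(Verifier):
    """Delegates grading to a judge model, e.g. for open-ended STEM answers."""
    def __init__(self, judge):
        self.judge = judge  # any callable mapping a grading prompt to text

    def verify(self, task: Task, response: str) -> float:
        verdict = self.judge(
            f"Question:\n{task.prompt}\n\nAnswer:\n{response}\n\n"
            f"Rubric:\n{task.metadata['rubric']}\n\nReply PASS or FAIL."
        )
        return float("PASS" in verdict.upper())
```

The appeal of an interface like this is that deterministic checks (math answers, unit tests run in containers) and LLM judges for open-ended questions all sit behind the same verify call, which is what makes such a framework easy to extend with new task types.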
Next Steps for SYNTHETIC-1
The researchers have outlined a clear roadmap for the future of SYNTHETIC-1. The next steps involve generating verified reasoning data and creating the largest dataset of reasoning traces in math, coding, and science. Open-sourcing these datasets is crucial for enabling the development of smaller yet powerful reasoning models.
In addition to data generation, the team aims to conduct training in a globally distributed setting. This approach allows anyone to contribute computing resources for the advancement of AI. By harnessing the collective power of the community, SYNTHETIC-1 seeks to maximize the potential of reinforcement learning.