Hey everyone, buckle up because we’ve got some seriously cool news to share! You know how everyone’s talking about AI that can reason? Not just spit back facts, but actually think things through, like having its own little “aha!” moment? Well, guess what? Now, you can actually train your own reasoning model and you don’t need a massive server farm to do it.
Yep, you read that right. The team here at Unsloth has been cooking up something special, and we’re thrilled to announce that you can now reproduce that “aha!” moment locally, and get this – you only need a minimum of 7GB of VRAM! 🤯 This is a game-changer for anyone wanting to train their own reasoning models.
Table of contents
- Remember DeepSeek’s R1 “Aha!” Moment? Now That Is Within Reach
- Unsloth Steps In: 80% Less VRAM Needed to Train Your Reasoning Model!
- Okay, But What Is GRPO and This “Aha” Moment Thing, Really? (Understanding the Tech Behind Reasoning Model Training)
- Here’s a simplified peek under the hood of how GRPO works:
- Get Your Hands Dirty with Unsloth and GRPO! Start Training Your Reasoning Model Today!
- Supercharged Inference with Unsloth and vLLM: It’s Lightning Fast
- Ready to Unlock Your Model’s “Aha!” Moment? Start to Train Your Own Reasoning Model Now!
Remember DeepSeek’s R1 “Aha!” Moment? Now That Is Within Reach
Let’s rewind a bit. DeepSeek’s incredible R1 research showed us something mind-blowing: AI models could actually learn to think for longer all on their own, without us humans telling them to. They called it the “aha moment,” this point where the model figures out it needs more “thinking time” to get to the right answer. Pretty cool, huh? This was all thanks to something called Group Relative Policy Optimization, or GRPO for short.
The thing is, doing this originally needed a whole lot of computing power. We’re talking serious GPUs, the kind that cost a fortune. But we thought, “There’s gotta be a better way to train your reasoning model without breaking the bank!”
Unsloth Steps In: 80% Less VRAM Needed to Train Your Reasoning Model!
And that’s where we come in! We’ve been working hard to make GRPO way more efficient. Like, way more efficient. We’ve tweaked and optimized the whole process so it now uses 80% less VRAM than before. Let that sink in. 80%! This efficiency leap is huge for anyone looking to train their own reasoning models.

What does that mean for you? It means you can now train your own reasoning model using models like Qwen2.5 (1.5B) with just 7GB of VRAM. Seriously! Just a single, regular GPU. We’ve even got Colab notebooks ready to go for models like Llama 3.1 (8B) so you can see it in action.
Remember Tiny-Zero? They showed that you could get your own “aha” moment with Qwen2.5, but it took four A100 GPUs! That’s a whopping 160GB of VRAM. We’ve brought that down to just 7GB. That’s like going from needing a truck to carry something to being able to just pop it in your backpack. It’s a game-changer for accessibility and makes reasoning model training available to so many more people.
And get this: before, GRPO mostly worked with full fine-tuning, which is quite resource-intensive. But we’ve expanded it to work with QLoRA and LoRA too! This makes things even more flexible and efficient for training your own reasoning models.
With just 15GB of VRAM, you can now transform models with up to 15 billion parameters, think Phi-4 (14B), Llama 3.1 (8B), Mistral (12B), you name it, into powerful reasoning models. Imagine the possibilities when you train your own reasoning model tailored to your needs!
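To get a feel for why LoRA-style training is so much lighter than full fine-tuning, here is a rough back-of-the-envelope sketch. It only counts trainable weights for one square weight matrix. The layer size (4096) and the rank (16) are illustrative assumptions, not settings taken from our notebooks, and real VRAM use also depends on the frozen base weights, activations, and optimizer state.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable weights when the whole matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights when only two low-rank factors are learned.

    LoRA trains B (d_out x rank) and A (rank x d_in) and keeps the
    original matrix frozen.
    """
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d = 4096
rank = 16

full = full_params(d, d)
lora = lora_params(d, d, rank)
ratio = lora / full

print(f"full fine-tuning: {full:,} trainable weights")
print(f"LoRA (rank {rank}): {lora:,} trainable weights")
print(f"LoRA trains {ratio:.2%} of the matrix")
```

For this made-up layer, the low-rank factors hold 1/128 of the weights that full fine-tuning would update, which is where a lot of the savings in gradients and optimizer state comes from.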
Okay, But What Is GRPO and This “Aha” Moment Thing, Really? (Understanding the Tech Behind Reasoning Model Training)
Good question! Let’s break it down a bit. DeepSeek researchers noticed this “aha moment” when they were training their R1-Zero model using something called reinforcement learning. Basically, the model started figuring out on its own that sometimes, it needed to think longer to get the right answer. No human told it to do that; it learned it by itself! It’s like watching a lightbulb go off in the AI’s “head.”
GRPO is the magic behind this. It’s a reinforcement learning algorithm that’s super efficient at optimizing how the model responds. Unlike PPO, it doesn’t need a separate “value function,” which simplifies things quite a bit for reasoning model training.
Think of it like this: Imagine you’re teaching someone to solve puzzles. With GRPO, instead of just telling them if they got the final answer right or wrong, you’re encouraging them to show their work. You’re rewarding the process of reasoning, not just the end result, which is key to successful reasoning model training.
In our notebooks, we’re training models with GRPO, hoping they develop their own ability to double-check their work and explore different solutions: basically, creating their own mini “aha moment.” This is the core of what it means to train your own reasoning model.
Here’s a simplified peek under the hood of how GRPO works:
- The model comes up with a bunch of possible answers (responses).
- Each answer gets a score based on how good it is (correctness, usefulness, etc.). You decide what “good” means with a reward function. It’s not some fancy AI judging it; it’s based on rules you set for your reasoning model training.
- We figure out the average score for all the answers in the group.
- Then, for each answer, we compare its score to that average.
- The model gets “encouraged” to produce more answers that scored higher than average. Think of it as positive reinforcement for good thinking! This reinforcement is how you effectively train your own reasoning model.
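The steps above can be sketched in a few lines of Python. This is a minimal illustration, not our training code: the function name and the `eps` value are assumptions made for the example, and the division by the group’s standard deviation follows the published GRPO formulation rather than anything spelled out in the list above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Score each response relative to the group it was sampled with."""
    baseline = mean(rewards)   # the average score for the group
    spread = pstdev(rewards)   # how much the scores vary within the group
    # Compare each score to the average. Dividing by the spread keeps the
    # scale comparable across prompts; eps avoids division by zero when
    # every response gets the same score.
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to the same prompt.
rewards = [2.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Answers with a positive advantage scored above the group average and get encouraged; answers with a negative advantage get discouraged. Because the baseline is just the group mean, no separate value model is needed.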
Let’s take a super simple example: teaching a model basic math.
Say we want the model to solve:
- What is 1+1? >> We want it to show some “thinking” >> The answer is 2.
- What is 2+2? >> Again, some “thinking” >> The answer is 4.
The old way? You’d need tons of data showing the “thinking” part, the chain of thought. But with GRPO, we can guide the model to automatically develop that reasoning process itself! Instead of needing massive datasets of reasoning steps, we just need to create good “reward functions.” For instance, give it a point if it gets the answer right, maybe subtract a little if it misspells words. You get the idea! You can create a whole bunch of these functions to reward different aspects of the reasoning process when you train your own reasoning model.
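Here is a small sketch of what rule-based reward functions for the 1+1 example could look like. Everything in it is hypothetical: the function names, the point values, and the “The answer is N.” output format are assumptions made for illustration, and the signatures are simplified compared to what a real trainer expects.

```python
import re


def correctness_reward(completion: str, expected: str) -> float:
    """Give a point when the final answer matches the expected one."""
    match = re.search(r"The answer is\s+(-?\d+)", completion)
    return 1.0 if match and match.group(1) == expected else 0.0


def shows_thinking_reward(completion: str) -> float:
    """Give a small bonus when some text comes before the final answer."""
    before, marker, _ = completion.partition("The answer is")
    return 0.5 if marker and before.strip() else 0.0


def total_reward(completion: str, expected: str) -> float:
    """Combine the individual rules into one score."""
    return correctness_reward(completion, expected) + shows_thinking_reward(completion)


# Three made-up completions for "What is 1+1?"
good = "1 plus 1 makes a pair, so that is two. The answer is 2."
bare = "The answer is 2."
wrong = "Let me think. The answer is 3."

scores = [total_reward(c, "2") for c in (good, bare, wrong)]
print(scores)
```

The completion that reasons and lands on the right answer scores highest, the bare correct answer comes second, and the wrong answer only gets the small bonus for showing some work. Splitting the rules into separate functions makes it easy to add or reweight one without touching the others.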

Get Your Hands Dirty with Unsloth and GRPO! Start Training Your Reasoning Model Today!
Ready to give it a whirl? If you’re going to use GRPO with Unsloth locally, just make sure you’ve got “diffusers” installed (pip install diffusers), as it’s needed for some of the background magic. This is your first step to training your own reasoning model.
Now, heads up: you’ll want to let it train for at least 300 steps to really see the rewards start to climb. And make sure you’re using the latest version of vLLM for the best performance. These Colab examples are set up for a quick run (about an hour), so the results there are just a taste of what’s possible. For truly impressive reasoning skills, you’ll want to train for longer: think 12 hours or more. But the cool thing is, you can stop whenever you want and see how your model is progressing in its reasoning model training journey.
We recommend using models with at least 1.5 billion parameters to get those “thinking tokens” to generate properly. Smaller models might struggle with this. And if you’re using a base model, make sure it has a chat template set up before you train your own reasoning model.
Oh, and one more thing baked right into Unsloth: training loss tracking for GRPO! No need for extra tools like wandb anymore; it’s all right there during your reasoning model training.
And guess what? We didn’t stop at GRPO! We’ve also added support for Online DPO, PPO, and RLOO. Big shoutout to Keith and Joey for their awesome work that helped make all of this possible! This expands the toolkit for training your own reasoning models even further.
Supercharged Inference with Unsloth and vLLM: It’s Lightning Fast
But wait, there’s more! We’ve also teamed up with vLLM to give you a massive speed boost. We’re talking 20x more throughput and 50% VRAM savings! This is crucial for making your newly trained reasoning model practical to use.
Now, you can run vLLM directly in your fine-tuning setup. This means way faster processing, and you can even fine-tune and run inference on your model at the same time! On a single A100 40GB GPU, you can expect around 4000 tokens per second with Unsloth’s dynamic 4-bit quantization of Llama 3.2 3B Instruct. Even on a free Colab GPU (Tesla T4 with 16GB VRAM), you can get a solid 300 tokens per second. Fast inference makes your reasoning model training efforts truly worthwhile.
We also did some under-the-hood magic to cut down on memory usage when you use vLLM and Unsloth together. This saves you around 5GB of VRAM for Llama 3.1 8B and 3GB for Llama 3.2 3B. Every bit of VRAM counts, right? Especially when you’re training your own reasoning model on limited resources.
With Unsloth, you can now fine-tune and get the benefits of super-fast inference all in one package, and all within a reasonable VRAM budget. To use this speed boost, just install vllm and tell Unsloth you want fast inference when you load your model:
pip install unsloth vllm

from unsloth import FastLanguageModel

# fast_inference = True asks Unsloth to use vLLM for generation
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    fast_inference = True,
)
model.fast_generate(["Hello!"])
Cool vLLM things we figured out in Unsloth that Enhance Your Reasoning Models:
We haven’t just been focusing on GRPO. We’ve also been digging deep into vLLM and have uncovered some pretty neat tricks to boost performance, especially when it comes to training reasoning models. Here’s the scoop:
- Dynamic Quantization Gets a Thumbs Up from vLLM: vLLM can now load our Dynamic 4-bit quantized models. Building on our earlier 1.58-bit Dynamic R1 GGUF, our research indicates that dynamically adjusting quantization at the layer level significantly improves accuracy while keeping models nice and compact. This is a big win for anyone aiming to train efficient reasoning models.
- Automated Optimization in vLLM for Peak Performance: We’ve implemented smart, automatic settings within vLLM to optimize resource usage and speed. This includes dynamically adjusting parameters like chunking and caching to make the most of your RAM and VRAM. We’ve even turned on vLLM’s most optimized settings by default. The result? A smoother, more streamlined training process for reasoning models.
- LoRA Loading in vLLM Gets a Turbo Boost: We’ve significantly sped up the loading of LoRA adapters in vLLM, with load times now up to 1.5 times faster! And we’re not stopping there. We’re actively investigating ways to directly edit LoRA adapters within vLLM, which could make loading faster still. For those eager to train their own reasoning models, this translates to less waiting and more doing.
- VRAM Spikes in vLLM? Problem Solved: We’ve tackled those pesky VRAM spikes that can sometimes pop up in vLLM, especially during batched generation. We’ve developed a special “batched generate” function to smooth out VRAM usage. This stability is crucial for successful reasoning model training, particularly if you’re working on machines with limited VRAM.
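To make the last point a bit more concrete, here is a generic sketch of the idea behind batching generation: cap how many prompts are in flight at once so memory use stays bounded. This is not the actual “batched generate” function in Unsloth. The helper names, the batch size, and the stand-in `fake_generate` are all hypothetical and exist only so the example runs on its own.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_in_batches(
    prompts: Sequence[str],
    generate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Run generate_fn on small batches instead of all prompts at once."""
    outputs: List[str] = []
    for batch in chunked(prompts, batch_size):
        outputs.extend(generate_fn(list(batch)))
    return outputs


def fake_generate(batch: List[str]) -> List[str]:
    """Stand-in for a real model call, used only to keep the sketch runnable."""
    return [prompt.upper() for prompt in batch]


prompts = [f"question {i}" for i in range(20)]
results = generate_in_batches(prompts, fake_generate, batch_size=8)
print(len(results), results[:2])
```

With 20 prompts and a batch size of 8, the model call sees batches of 8, 8, and 4 prompts, and the outputs come back in the original order.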
Ready to Unlock Your Model’s “Aha!” Moment? Start to Train Your Own Reasoning Model Now!
So, there you have it! Training your own reasoning model is no longer some far-off dream requiring massive resources. With Unsloth and GRPO, you can do it on a single GPU with as little as 7GB of VRAM, experiment, and unlock new levels of intelligence in your models.
Go check out the Colab notebooks, give it a try, and let us know what amazing reasoning models you create!