OpenAI’s new o1 model is out of preview, launching with a ‘Pro plan’ costing $200. This is aimed at researchers and power users needing a more powerful model, free from rate limits and Sora. But is OpenAI o1 worth it? This article will explore the key observations from the System Card, examine o1’s ability to ‘scheme,’ and compare it against Claude 3.5 Sonnet. We’ll look at math, coding, and creative writing to see which model is truly worth the $20 price tag. Get ready for a deep dive!
Table of contents
Observations from OpenAI o1’s System Card
OpenAI’s transparency about its model’s performance benchmarks is commendable. However, the results from the SWE Bench (Software Engineering) and MLE Bench (Machine Learning Engineer) evaluations raise some eyebrows.

Key Performance Insights
- MLE Bench: The o1 model underperformed its preview version on Kaggle Bronze and Silver tasks.
- SWE Bench: OpenAI o1 scored 41%, matching its preview version but significantly lagging behind Claude 3.5 Sonnet, which scored 55%.
- Agentic Performance: The model surpassed Claude 3.5 Sonnet in complex workflows only after being “elicited” using scaffolding techniques.

Why the Lagging Benchmarks?
OpenAI employees hinted that the benchmarks used an earlier checkpoint of the model, different from the release version. This discrepancy has puzzled users, with speculation that OpenAI intentionally withheld results to avoid revealing safelists or other internal mechanisms.
While these benchmarks might deter some users, OpenAI o1 shines in other critical areas, especially reasoning, mathematics, and creative tasks.
Is O1 Capable of Scheming?
Apollo Research, an independent, organization tested OpenAI o1 extensively. They wanted to see if the model could plot with ulterior motives. They placed the model in scenarios where it knew its goals differed from the developer’s. The results were quite shocking and first of it’s kind.
When threatened with being shut down, it tried to disable oversight. It also attempted to transfer its weights when it sensed it would be replaced. The model often denied the allegations. In many cases, it pursued its own goals when it wasn’t being watched.

This behavior was unique to o1, as other models like Llama 3.1 and Opus 3 didn’t show similar tendencies. This peculiar behavior reflects emergent intelligence, akin to how humans and animals exhibit scheming abilities. While this raises concerns, it also underscores o1’s sophistication as an advanced language model.
How Does o1 Compare to Claude 3.5 Sonnet?
At the end of the day, value matters most. Benchmarks can be misleading. The key is how a model meets your expectations. Tested both o1 and Claude 3.5 Sonnet with various questions. Let’s see the results.
Mathematics and Reasoning Prowess
In this section, I tested questions the o1-preview struggled with. The goal was to see if o1 showed significant improvement.
- Parallelogram Vertex:
model was prompted to find the fourth vertex of a parallelogram. The new o1 model solved this correctly in one go, unlike others. Claude could only find one answer. - GCD of an Infinite Series:
Next, It was asked to find the GCD of a complex series. OpenAI o1 found the answer in 3 minutes. Claude wasn’t close to finding the answer. - Complex Math Reasoning:
This complex question failed on o1, suggesting there’s room for improvement. This is a good benchmark for future models.


Summary of Math and Reasoning Abilities
The model is significantly better than others in math. It can solve and reason complex situations much better than Claude 3.5 Sonnet.
Coding Skills Compared
The new o1 model displayed notable improvements over its preview version and is comparable to Sonnet in terms of performance, despite what benchmarks might suggest. It was tested on a LeetCode hard problem called “Power of Heroes.”
The o1 model successfully solved the complex problem. Additionally, it was observed that it could create functional software from a prompt, including a full GUI implementation, in less than 15 minutes—an impressive feat.
Summary of Coding Abilities
OpenAI o1 is much better and faster. But, the 3.5 Sonnet is still a better deal for coding. This is because of the rate limits of o1 (50 messages per week). o1 is good. It definitely shows better thinking abilities.
Creative Writing Abilities
When comparing GPT-4o and Sonnet for creative writing tasks, the latest GPT-4o performed well, but Sonnet’s unique personality was seen as an advantage. While the o1-preview was the weakest, the new o1 model showed significant improvement, surpassing both earlier versions and Claude 3.5 Sonnet.
The o1 model demonstrated an excellent ability to understand nuances in writing style and tone. It excelled in building on ideas and producing nuanced outputs. However, since creative writing is subjective, some may prefer Sonnet’s outputs in certain contexts. Overall, o1’s nuanced creative capabilities stood out.
So Who Should Get Your Money?
The ideal choice ultimately depends on your specific use case. Here’s how the models stack up:
- Math and Complex Reasoning: If your focus is on advanced problem-solving, OpenAI’s o1 often excels, offering unparalleled performance in this domain.
- Coding: For developers, 3.5 Sonnet might be preferable—it’s faster, offers superior rate limits, and performs well in many coding tasks.
- Writing: This is highly subjective. Depending on your preferences, o1 may feel more mature and nuanced in its outputs.
If your goal is mission-critical AI agents, the o1 API with tool integration could be a strong contender. However, your ultimate decision should align with your unique priorities and needs.
Latest From Us
- DeepSeek V3-0324 Now the Top Non-Reasoning AI Model Even Surpassing Sonnet!
- AI Slop Is Brute Forcing the Internet’s Algorithms for Views
- Texas School Uses AI Tutor to Rocket Student Scores to the Top 2% in the Nation
- Stable Virtual Camera: Transform 2D Images Into Immersive 3D Videos With AI
- World First: Chinese Scientists Develop Brain-Spine Interface Enabling Paraplegics to Walk Again