Notebookcheck Logo

OpenAI o1 and o1-mini arrive as AIs that handle STEM questions better than prior models

OpenAI o1 and o1-mini arrive – AI that reason better on STEM questions than prior models. (Image source: AI-generated, Dall-E 3)
OpenAI o1 and o1-mini arrive – AI that reason better on STEM questions than prior models. (Image source: AI-generated, Dall-E 3)
OpenAI o1 and o1-mini have arrived, and these AI LLMs perform much better on coding, math, and science problems and tasks than prior models such as GPT-4o by taking more time to think. The OpenAI o1 models cannot browse the web or accept uploaded files and images as their major limitations.

OpenAI o1 and o1-mini have arrived. These AI LLMs perform much better on coding, math, and science problems and tasks than prior models such as GPT-4o by taking more time to think.

Complex problems in STEM tend to require more than a quick online search for correct answers. By giving the o1 AI more time to think, the AI can reason more carefully and accurately. The o1-mini model has been specifically tuned to answer STEM questions with faster speed and lower demand on computer resources, and it is notably better at coding than the o1 model.

Across a range of standardized AP exams and STEM tests for LLMs, the o1 models perform with high accuracy. Specifically, on the AP Calculus, AP Chemistry, AP Physics 2, LSAT, and SAT evidence-based reading & writing tests, the o1 models perform at or above the B-grade level (~80% or higher). The models answer accurately at the A-grade level on PhD-level physics questions, at the B-grade level on tough 2024 American Invitational Mathematics Examination math questions, and at the high B-grade level on Codeforces coding problems. Because o1 has been tuned for answering STEM questions, its performance on AP English Language and AP English Literature is at or below the C-grade level.

Interestingly, while GPT-4o is dumbfounded by the cryptographic challenge of decoding “oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz” when given the hint “oyfjdnisdr rtqwainr acxz mynzbhhx” means “Think step by step”, o1 had no issues thinking through the problem to come up with the correct answer “There are three r’s in strawberry”. This new power will delight hobby cryptographers at home as well as the NSA.

Closet evil-doers will want to know that while the uncensored o1 models are apt to give troubling replies, OpenAI has neutered these models for release. The o1 models have been tested to resist answering questions about making bioweapons, producing naughty images, jailbreaking itself, and harassing and threatening. Unfortunately, the OpenAI o1 models remain gender and race biased when tested, despite tuning efforts.

ChatGPT Plus and Team users along with API usage tier 5 developers have access to o1 models immediately, and ChatGPT Edu and Enterprise users will gain access on the week of September 16. ChatGPT Free users will gain access to o1-mini in the near future. The o1 models cannot browse the web or accept uploaded files and images to answer questions, so OpenAI recommends users continue using their GPT-4o models for general questions.

Users who want to ask AI questions now have a wide-range of capable LLM models to interact with besides those from OpenAI, including Anthropic Claude, Microsoft CoPilot, Google Gemini, and X Grok. Every AI has specific advantages, so it is worth testing several AI models to find one that best suits individual needs. Some of these AI are built into smart glasses (like these on Amazon) and voice recorders (like this one on Amazon), and some upcoming autonomous humanoid robots use proprietary AI to cook and clean.

Both OpenAI o1 and o1-mini perform slightly worse on writing tasks versus GPT-4o, but much better on technical tasks like math or programming. (Image source: OpenAI)
Both OpenAI o1 and o1-mini perform slightly worse on writing tasks versus GPT-4o, but much better on technical tasks like math or programming. (Image source: OpenAI)
OpenAI o1 series can answer tougher questions correctly that GPT-4o cannot, but only by taking much longer to answer. (Image source: OpenAI)
OpenAI o1 series can answer tougher questions correctly that GPT-4o cannot, but only by taking much longer to answer. (Image source: OpenAI)
By programming OpenAI o1 to think longer before answering, the AI LLM is able to answer hard questions better than prior models, including GPT-4o. (Image source: OpenAI)
By programming OpenAI o1 to think longer before answering, the AI LLM is able to answer hard questions better than prior models, including GPT-4o. (Image source: OpenAI)
Before being neutered for release, OpenAI o1-preview-pre-mitigation loved being naughty. (Image source: OpenAI)
Before being neutered for release, OpenAI o1-preview-pre-mitigation loved being naughty. (Image source: OpenAI)
OpenAI o1 models remain gender and racially biased even after tuning. (Image source: OpenAI)
OpenAI o1 models remain gender and racially biased even after tuning. (Image source: OpenAI)
Although OpenAI o1 series is much better at creating instructions for biohazards, the release versions have such capabilities neutered. (Image source: OpenAI)
Although OpenAI o1 series is much better at creating instructions for biohazards, the release versions have such capabilities neutered. (Image source: OpenAI)
OpenAI hinders job hunters who use AI during programmer interviews by dumbing down the ability of o1-mini and o1-preview to pass a set of OpenAI interview Research Engineer questions on the first try. (Image source: OpenAI)
OpenAI hinders job hunters who use AI during programmer interviews by dumbing down the ability of o1-mini and o1-preview to pass a set of OpenAI interview Research Engineer questions on the first try. (Image source: OpenAI)

September 12, 2024

Introducing OpenAI o1-preview

A new series of reasoning models for solving hard problems. Available starting 9.12

We've developed a new series of AI models designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math.

Today, we are releasing the first of this series in ChatGPT and our API. This is a preview and we expect regular updates and improvements. Alongside this release, we’re also including evaluations for the next update, currently in development.

How it works

We trained these models to spend more time thinking through problems before they respond, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes. 

In our tests, the next model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions. You can read more about this in our technical research post.

As an early model, it doesn't yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases GPT-4o will be more capable in the near term.

But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability. Given this, we are resetting the counter back to 1 and naming this series OpenAI o1.

Safety

As part of developing these new models, we have come up with a new safety training approach that harnesses their reasoning capabilities to make them adhere to safety and alignment guidelines. By being able to reason about our safety rules in context, it can apply them more effectively. 

One way we measure safety is by testing how well our model continues to follow its safety rules if a user tries to bypass them (known as "jailbreaking"). On one of our hardest jailbreaking tests, GPT-4o scored 22 (on a scale of 0-100) while our o1-preview model scored 84. You can read more about this in the system card and our research post.

To match the new capabilities of these models, we’ve bolstered our safety work, internal governance, and federal government collaboration. This includes rigorous testing and evaluations using our Preparedness Framework(opens in a new window), best-in-class red teaming, and board-level review processes, including by our Safety & Security Committee.

To advance our commitment to AI safety, we recently formalized agreements with the U.S. and U.K. AI Safety Institutes. We've begun operationalizing these agreements, including granting the institutes early access to a research version of this model. This was an important first step in our partnership, helping to establish a process for research, evaluation, and testing of future models prior to and following their public release.

Whom it’s for

These enhanced reasoning capabilities may be particularly useful if you’re tackling complex problems in science, coding, math, and similar fields. For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows. 

OpenAI o1-mini

The o1 series excels at accurately generating and debugging complex code. To offer a more efficient solution for developers, we’re also releasing OpenAI o1-mini, a faster, cheaper reasoning model that is particularly effective at coding. As a smaller model, o1-mini is 80% cheaper than o1-preview, making it a powerful, cost-effective model for applications that require reasoning but not broad world knowledge. 

How to use OpenAI o1

ChatGPT Plus and Team users will be able to access o1 models in ChatGPT starting today. Both o1-preview and o1-mini can be selected manually in the model picker, and at launch, weekly rate limits will be 30 messages for o1-preview and 50 for o1-mini. We are working to increase those rates and enable ChatGPT to automatically choose the right model for a given prompt.

An image of the new ChatGPT dropdown that displays the new "o1-preview" model option over a bright yellow and blue abstract background

ChatGPT Enterprise and Edu users will get access to both models beginning next week. 

Developers who qualify for API usage tier 5(opens in a new window) can start prototyping with both models in the API today with a rate limit of 20 RPM. We’re working to increase these limits after additional testing. The API for these models currently doesn't include function calling, streaming, support for system messages, and other features. To get started, check out the API documentation(opens in a new window).

We also are planning to bring o1-mini access to all ChatGPT Free users. 

What’s next

This is an early preview of these reasoning models in ChatGPT and the API. In addition to model updates, we expect to add browsing, file and image uploading, and other features to make them more useful to everyone. 

We also plan to continue developing and releasing models in our GPT series, in addition to the new OpenAI o1 series. 

static version load dynamic
Loading Comments
Comment on this article
Please share our article, every link counts!
Mail Logo
> Expert Reviews and News on Laptops, Smartphones and Tech Innovations > News > News Archive > Newsarchive 2024 09 > OpenAI o1 and o1-mini arrive as AIs that handle STEM questions better than prior models
David Chien, 2024-09-16 (Update: 2024-09-16)