OpenAI o1 and o1-mini arrive as AIs that handle STEM questions better than prior models
OpenAI o1 and o1-mini have arrived. These AI LLMs perform much better on coding, math, and science problems and tasks than prior models such as GPT-4o by taking more time to think.
Complex problems in STEM tend to require more than a quick online search for correct answers. By giving the o1 AI more time to think, the AI can reason more carefully and accurately. The o1-mini model has been specifically tuned to answer STEM questions with faster speed and lower demand on computer resources, and it is notably better at coding than the o1 model.
Across a range of standardized AP exams and STEM tests for LLMs, the o1 models perform with high accuracy. Specifically, on the AP Calculus, AP Chemistry, AP Physics 2, LSAT, and SAT evidence-based reading & writing tests, the o1 models perform at or above the B-grade level (~80% or higher). The models answer accurately at the A-grade level on PhD-level physics questions, at the B-grade level on tough 2024 American Invitational Mathematics Examination math questions, and at the high B-grade level on Codeforces coding problems. Because o1 has been tuned for answering STEM questions, its performance on AP English Language and AP English Literature is at or below the C-grade level.
Interestingly, while GPT-4o is dumbfounded by the cryptographic challenge of decoding “oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz” when given the hint “oyfjdnisdr rtqwainr acxz mynzbhhx” means “Think step by step”, o1 had no issues thinking through the problem to come up with the correct answer “There are three r’s in strawberry”. This new power will delight hobby cryptographers at home as well as the NSA.
Closet evil-doers will want to know that while the uncensored o1 models are apt to give troubling replies, OpenAI has neutered these models for release. The o1 models have been tested to resist answering questions about making bioweapons, producing naughty images, jailbreaking itself, and harassing and threatening. Unfortunately, the OpenAI o1 models remain gender and race biased when tested, despite tuning efforts.
ChatGPT Plus and Team users along with API usage tier 5 developers have access to o1 models immediately, and ChatGPT Edu and Enterprise users will gain access on the week of September 16. ChatGPT Free users will gain access to o1-mini in the near future. The o1 models cannot browse the web or accept uploaded files and images to answer questions, so OpenAI recommends users continue using their GPT-4o models for general questions.
Users who want to ask AI questions now have a wide-range of capable LLM models to interact with besides those from OpenAI, including Anthropic Claude, Microsoft CoPilot, Google Gemini, and X Grok. Every AI has specific advantages, so it is worth testing several AI models to find one that best suits individual needs. Some of these AI are built into smart glasses (like these on Amazon) and voice recorders (like this one on Amazon), and some upcoming autonomous humanoid robots use proprietary AI to cook and clean.
Are you a techie who knows how to write? Then join our Team! Wanted:
- News translator (DE-EN)
Details here
Source(s)
September 12, 2024
Introducing OpenAI o1-preview
A new series of reasoning models for solving hard problems. Available starting 9.12
We've developed a new series of AI models designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
Today, we are releasing the first of this series in ChatGPT and our API. This is a preview and we expect regular updates and improvements. Alongside this release, we’re also including evaluations for the next update, currently in development.
How it works
We trained these models to spend more time thinking through problems before they respond, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes.
In our tests, the next model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions. You can read more about this in our technical research post.
As an early model, it doesn't yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases GPT-4o will be more capable in the near term.
But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability. Given this, we are resetting the counter back to 1 and naming this series OpenAI o1.
Safety
As part of developing these new models, we have come up with a new safety training approach that harnesses their reasoning capabilities to make them adhere to safety and alignment guidelines. By being able to reason about our safety rules in context, it can apply them more effectively.
One way we measure safety is by testing how well our model continues to follow its safety rules if a user tries to bypass them (known as "jailbreaking"). On one of our hardest jailbreaking tests, GPT-4o scored 22 (on a scale of 0-100) while our o1-preview model scored 84. You can read more about this in the system card and our research post.
To match the new capabilities of these models, we’ve bolstered our safety work, internal governance, and federal government collaboration. This includes rigorous testing and evaluations using our Preparedness Framework(opens in a new window), best-in-class red teaming, and board-level review processes, including by our Safety & Security Committee.
To advance our commitment to AI safety, we recently formalized agreements with the U.S. and U.K. AI Safety Institutes. We've begun operationalizing these agreements, including granting the institutes early access to a research version of this model. This was an important first step in our partnership, helping to establish a process for research, evaluation, and testing of future models prior to and following their public release.
Whom it’s for
These enhanced reasoning capabilities may be particularly useful if you’re tackling complex problems in science, coding, math, and similar fields. For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows.
OpenAI o1-mini
The o1 series excels at accurately generating and debugging complex code. To offer a more efficient solution for developers, we’re also releasing OpenAI o1-mini, a faster, cheaper reasoning model that is particularly effective at coding. As a smaller model, o1-mini is 80% cheaper than o1-preview, making it a powerful, cost-effective model for applications that require reasoning but not broad world knowledge.
How to use OpenAI o1
ChatGPT Plus and Team users will be able to access o1 models in ChatGPT starting today. Both o1-preview and o1-mini can be selected manually in the model picker, and at launch, weekly rate limits will be 30 messages for o1-preview and 50 for o1-mini. We are working to increase those rates and enable ChatGPT to automatically choose the right model for a given prompt.
An image of the new ChatGPT dropdown that displays the new "o1-preview" model option over a bright yellow and blue abstract background
ChatGPT Enterprise and Edu users will get access to both models beginning next week.
Developers who qualify for API usage tier 5(opens in a new window) can start prototyping with both models in the API today with a rate limit of 20 RPM. We’re working to increase these limits after additional testing. The API for these models currently doesn't include function calling, streaming, support for system messages, and other features. To get started, check out the API documentation(opens in a new window).
We also are planning to bring o1-mini access to all ChatGPT Free users.
What’s next
This is an early preview of these reasoning models in ChatGPT and the API. In addition to model updates, we expect to add browsing, file and image uploading, and other features to make them more useful to everyone.
We also plan to continue developing and releasing models in our GPT series, in addition to the new OpenAI o1 series.