The final day of OpenAI’s “12 Days of Shipmas” has arrived with the unveiling of o3, a new chain-of-thought “reasoning” model that the company claims is its most advanced yet. The model is not yet available for general use, but safety researchers can sign up for a preview starting today.
OpenAI and others hope that reasoning models will go a long way toward solving the pernicious problem of chatbots frequently producing wrong answers. Chatbots fundamentally do not “think” like humans, so different techniques are needed to create the best possible simulacrum of a human thought process.
When asked a question, reasoning models pause and consider related prompts that could help produce an accurate answer. For example, if you ask the o3 model, “can habaneros be grown in the Pacific Northwest,” the model might lay out a series of questions it will research to come to a conclusion, such as “where do habaneros typically grow,” “what are the ideal conditions for growing habaneros,” and “what type of climate does the Pacific Northwest have.” Anyone who has used a chatbot knows you sometimes have to keep prompting with follow-ups until it finally gets the right result. Reasoning models are supposed to do that additional work for you.
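To make the idea concrete, here is a rough sketch of that kind of decomposition done by hand with the OpenAI Python SDK. This is an illustration only: the model name, prompts, and two-step loop are assumptions chosen for demonstration, not how o3 actually reasons, since its chain of thought happens internally within a single request.

```python
# Illustrative sketch: manually emulating the decomposition a reasoning model
# performs internally, using an ordinary chat model via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_with_decomposition(question: str, model: str = "gpt-4o-mini") -> str:
    # Step 1: ask the model to break the question into research sub-questions,
    # the way a reasoning model does internally before answering.
    plan = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"List three short sub-questions you would research to answer: {question}",
        }],
    ).choices[0].message.content

    # Step 2: answer the original question with those sub-questions as scaffolding.
    final = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": question},
            {"role": "assistant", "content": f"Let me work through these first:\n{plan}"},
            {"role": "user", "content": "Now give a concise final answer."},
        ],
    ).choices[0].message.content
    return final


print(answer_with_decomposition("Can habaneros be grown in the Pacific Northwest?"))
```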
o3 is the successor to o1, OpenAI’s first chain-of-thought reasoning model. Reps said they decided to skip the “o2” name “out of respect” for the British telecommunications company, but it certainly doesn’t hurt that it makes the product sound more advanced. The company says the new model comes with the ability to adjust its reasoning time. Users can choose low, medium, or high reasoning time; the greater the compute, the better o3 is supposed to perform. OpenAI says it will spend time “red-teaming” the new model with researchers to prevent it from producing potentially harmful responses (since, again, it is not a human and does not know right from wrong).
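For those curious what that setting might look like in practice, here is a hedged sketch using the OpenAI Python SDK. The reasoning_effort parameter shown is the one OpenAI exposes for its existing o-series models; whether o3 will use the same parameter name and values is an assumption, since the model is not yet publicly available.

```python
# Sketch, not a confirmed o3 API: OpenAI's o-series models accept a
# "reasoning_effort" setting ("low", "medium", or "high"). More effort means
# more thinking tokens and compute before the model answers.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1",               # stand-in model; o3 is still in safety-researcher preview
    reasoning_effort="high",  # trade latency and cost for (hopefully) better answers
    messages=[{"role": "user", "content": "Can habaneros be grown in the Pacific Northwest?"}],
)
print(response.choices[0].message.content)
```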
Reasoning is the buzzword of the day in the field of generative AI, as industry insiders believe it is the next unlock needed to improve the performance of large language models. Simply throwing more compute at training eventually stops delivering proportional performance gains, so new techniques are needed. Google DeepMind recently unveiled its own reasoning-focused tool, Gemini Deep Research, which can take 5 to 10 minutes to generate a report that analyzes many sources across the web to reach its findings.
OpenAI is confident in o3 and offers impressive benchmarks: it says that in Codeforces testing, which measures competitive coding ability, o3 earned a rating of 2727. For context, a rating of 2400 would put an engineer in the 99th percentile of programmers. The model also scores 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question. We will have to see how the model holds up in real-world use, and it is still generally not a good idea to lean too heavily on AI models for important work where accuracy matters. But optimists are confident that the accuracy problem is being solved. Hopefully so, because as it stands, Google’s AI Overviews in search are still the subject of frequent social media ridicule.
AI model companies like OpenAI and Perplexity are in a race to become the next Google, collecting the world’s knowledge and helping users make sense of it all. Both now offer search products meant to replicate Google more directly, complete with access to real-time web results.
All of these players seem to leapfrog one another with every passing day, however. The feeling is somewhat reminiscent of the late ’90s, when there was a myriad of search engines to choose from: Yahoo, AltaVista, and Ask Jeeves, just to name a few, all hoovering up the internet’s data and presenting it with slightly different UX. Most of them disappeared after one came along that was supremely better than the rest: Google.
OpenAI clearly has a strong lead right now, with hundreds of millions of monthly active users and a partnership with Apple, but Google has received a lot of plaudits recently for advancements in its Gemini models. The Verge reports that the company will soon integrate Gemini more deeply into its search interface.