OpenAI’s newest reasoning models, including o3 and o4-mini, exhibit higher hallucination rates (incorrect or fabricated information) than the company’s previous systems.
For example, on OpenAI’s PersonQA benchmark, o3 hallucinated 33% of the time, more than double the rate of the previous model, o1, while o4-mini reached 48%. On SimpleQA, a more general knowledge test, rates were higher still, with o4-mini hallucinating 79% of the time.
OpenAI acknowledges that further research is needed to understand and address the issue. Similar trends have been observed in reasoning models from other developers, including Google and DeepSeek.