Recent findings suggest that while artificial intelligence systems are becoming more capable, they are also increasingly prone to error.
According to The New York Times, citing several independent studies, the latest large language models (LLMs), including OpenAI's newest model, o3, exhibit higher rates of factual inaccuracy than their predecessors.
These challenges are not limited to OpenAI. Similar trends have been observed in models developed by other major players such as Google and the Chinese AI firm DeepSeek. The core issue lies in what researchers refer to as “hallucinations” — instances where AI systems fabricate information without any basis in fact or source verification.
The issue of hallucinations appears to be endemic across models, despite their improved mathematical and reasoning capabilities. In OpenAI's own internal evaluation on questions about public figures (its PersonQA benchmark), model o3 generated hallucinated content in 33% of its responses, a significant increase from the 16% recorded for the earlier o1 model. The new o4-mini model performed worse still, hallucinating in 48% of such responses.
The problem intensifies when the models are tested on more general knowledge, as in OpenAI's SimpleQA evaluation. Under these conditions, o3 produced hallucinations in 51% of cases, while o4-mini returned fabricated responses in a striking 79% of prompts. For comparison, the older o1 model was found to hallucinate in 44% of general-knowledge responses. The company has acknowledged the trend and states that further investigation is required to understand the root causes.
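For context, the percentages above are simply the share of graded responses that evaluators flagged as containing fabricated claims. The following is a minimal, purely illustrative sketch of how such a rate is computed; the grading labels and sample prompts are hypothetical, not OpenAI's actual benchmark records.

```python
def hallucination_rate(graded_responses: list[dict]) -> float:
    """Fraction of graded responses flagged as containing fabricated claims."""
    if not graded_responses:
        return 0.0
    flagged = sum(1 for r in graded_responses if r["hallucinated"])
    return flagged / len(graded_responses)

# Hypothetical grading results for a small evaluation run.
sample = [
    {"prompt": "Where was the public figure born?", "hallucinated": True},
    {"prompt": "In what year was the company founded?", "hallucinated": False},
    {"prompt": "Who directed the film?", "hallucinated": True},
    {"prompt": "What is the country's capital?", "hallucinated": False},
]

print(f"Hallucination rate: {hallucination_rate(sample):.0%}")  # prints 50%
```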
AI hallucinations can lead to tangible consequences. One such case involved a support chatbot for the Cursor software development tool, which incorrectly stated that the product could only be used on a single device. The information, though untrue, prompted a wave of user complaints and even account deletions. It was later confirmed that the chatbot had generated the restriction on its own; no such change had been implemented by the company.
Amr Awadallah, CEO of Vectara — a firm developing AI tools for enterprise use — has commented that hallucinations are an inherent characteristic of current AI architecture. Despite active efforts by developers to reduce such errors, they appear to be a persistent feature of the technology.
Vectara’s own research supports the prevalence of hallucinations in models developed outside OpenAI. Its study found that models from Google and DeepSeek, both of which incorporate reasoning capabilities, fabricated information in at least 3% of tested cases, with some scenarios pushing the rate to 27%. Across the industry, improvements have been incremental: over the past year, the hallucination rate has decreased by only 1–2%, despite intensive optimisation efforts.
The underlying cause appears to lie in the complexity of balancing model power with accuracy. As LLMs are trained on ever-larger datasets and refined to exhibit reasoning-like behaviours, their capacity to form coherent but inaccurate outputs seems to grow in parallel. This phenomenon suggests that more sophisticated models are not necessarily more reliable.
Moreover, researchers warn that hallucinations are not limited to obscure or technical subjects. In some cases, fabricated claims have involved widely known facts or easily verifiable public information, raising questions about the robustness of such models in high-stakes environments, including healthcare, law, and business operations.
Although some AI developers have introduced mechanisms such as retrieval-augmented generation (RAG) and source citation prompts to reduce hallucinations, these tools are not foolproof. The limitations of LLMs remain particularly evident when systems are deployed in customer-facing roles or used to support decision-making processes requiring factual precision.
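In broad strokes, retrieval-augmented generation fetches relevant reference text at query time and places it in the model's prompt so that answers can be grounded in, and cite, actual sources. The sketch below is a generic, simplified illustration of that idea; the toy corpus, keyword retriever, and generate() stub are placeholder assumptions, not any particular vendor's API.

```python
# Simplified illustration of retrieval-augmented generation (RAG).
# The corpus, retriever, and generate() stub are hypothetical placeholders.

CORPUS = {
    "doc1": "The product licence permits installation and use on multiple devices.",
    "doc2": "Support tickets should include the application version and operating system.",
}

def retrieve(query: str, corpus: dict[str, str], top_k: int = 1) -> list[str]:
    """Naive keyword-overlap retriever; production systems typically use vector search."""
    def overlap(text: str) -> int:
        return len(set(query.lower().split()) & set(text.lower().split()))
    ranked = sorted(corpus.values(), key=overlap, reverse=True)
    return ranked[:top_k]

def generate(prompt: str) -> str:
    """Stand-in for a call to an LLM; in practice this would be a real client call."""
    return f"[model answer grounded in a prompt of {len(prompt)} characters]"

def answer_with_rag(question: str) -> str:
    # Retrieve supporting text and instruct the model to stay within it.
    context = "\n".join(retrieve(question, CORPUS))
    prompt = (
        "Answer using ONLY the sources below. "
        "If the sources do not contain the answer, say you do not know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)

print(answer_with_rag("Can the product be used on more than one device?"))
```

Even with grounding of this kind, a model can still paraphrase its sources inaccurately or answer beyond them, which is why such mechanisms narrow rather than eliminate the problem described above.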
The findings come at a time when regulatory bodies, particularly in the EU and United States, are increasing scrutiny of AI applications. With growing reliance on AI in both public and private sectors, concerns over misinformation and model reliability are likely to remain central to policy discussions.
As AI continues to evolve rapidly, the need for transparency in how these systems are trained and evaluated is becoming more urgent. Industry observers argue that unless the issue of hallucinations is addressed more decisively, public trust in AI-generated content could erode — even as the technology becomes more powerful and widely adopted.
OpenAI and other developers have yet to release detailed plans for mitigating hallucination rates in upcoming models. In the meantime, experts recommend cautious use of generative AI tools in contexts where factual accuracy is critical.