A team of Apple researchers has found that advanced AI models’ alleged ability to “reason” isn’t all it’s cracked up to be.
“Reasoning” is a word that’s thrown around a lot in the AI industry these days, especially when it comes to marketing the advancements of frontier AI language models. OpenAI, for example, recently dropped its “Strawberry” model, which the company billed as its next-level large language model (LLM) capable of advanced reasoning. (That model has since been renamed just “o1.”)
But marketing aside, there's no agreed-upon industrywide definition of what reasoning actually means. Like other AI industry terms such as "consciousness" or "intelligence," reasoning is a slippery, nebulous concept; as it stands, AI reasoning roughly amounts to an LLM's ability to "think" its way through queries and complex problems in a way that resembles human problem-solving.
But that’s a notoriously difficult thing to measure. And according to the Apple scientists’ yet-to-be-peer-reviewed study, frontier LLMs’ alleged reasoning capabilities are way flimsier than we thought.
For the study, the researchers took a closer look at the GSM8K benchmark, a widely used collection of thousands of grade-school-level mathematical word problems that's commonly used to measure AI reasoning skills. Fascinatingly, they found that just slightly altering the given problems, whether by switching out a number or a character's name or by adding an irrelevant detail, caused a massive uptick in AI errors.
In short: when researchers made subtle changes to GSM8K questions that didn’t impact the mechanics of the problem, frontier AI models failed to keep up. And this, the researchers argue, suggests that AI models aren’t actually reasoning like humans, but are instead engaging in more advanced pattern-matching based on existing training data.
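To make the setup concrete, here's a minimal, illustrative sketch of the kind of perturbation the paper describes for a GSM8K-style question. The template, name pool, and helper function are hypothetical stand-ins, not the researchers' actual code or data:

```python
# Illustrative sketch only: NOT the Apple researchers' code, just a toy example
# of the perturbations the study describes (swapping names and numbers, and
# appending an irrelevant clause to a GSM8K-style question).
import random

TEMPLATE = (
    "{name} picks {friday} kiwis on Friday. Then he picks {saturday} kiwis on "
    "Saturday. On Sunday, he picks double the number of kiwis he did on Friday."
    "{distractor} How many kiwis does {name} have?"
)

NAMES = ["Oliver", "Sofia", "Liam"]  # hypothetical name pool
DISTRACTOR = " Five of them were a bit smaller than average."  # irrelevant detail

def make_variant(add_distractor: bool) -> tuple[str, int]:
    """Generate one perturbed question along with its correct answer."""
    friday = random.randint(20, 60)
    saturday = random.randint(20, 60)
    question = TEMPLATE.format(
        name=random.choice(NAMES),
        friday=friday,
        saturday=saturday,
        distractor=DISTRACTOR if add_distractor else "",
    )
    # The distractor never changes the math: total = Friday + Saturday + 2 * Friday.
    answer = friday + saturday + 2 * friday
    return question, answer

if __name__ == "__main__":
    question, answer = make_variant(add_distractor=True)
    print(question)
    print("Correct answer:", answer)
```

The point of such perturbations is that a model that genuinely reasons should be unaffected by them, since the underlying arithmetic never changes.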
“We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning,” the researchers write. “Instead, they attempt to replicate the reasoning steps observed in their training data.”
As the saying goes, fake it till you make it!
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source like Llama, Phi, Gemma, and Mistral and leading closed models, including the… pic.twitter.com/yli5q3fKIT
— Mehrdad Farajtabar (@MFarajtabar) October 10, 2024
A striking example of such an exploit is a mathematical reasoning problem involving kiwis, which reads as follows:
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
Of course, how small or large any of these kiwis are is irrelevant to the task at hand. But as the scientists' work showed, the majority of AI models routinely folded the extraneous detail into their reasoning, ultimately arriving at the wrong answer.
Take this response given by OpenAI’s “o1-mini” model, a “cost-efficient” version of the AI formerly codenamed “Strawberry,” which mistakenly finds that the smaller kiwis should be subtracted from the eventual total:
Sunday: Double the number he picked on Friday, which is 2 × 44 = 88 kiwis.
However, on Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday's kiwis) – 5 (smaller kiwis) = 83 kiwis.
Now, summing up the kiwis from all three days: 44 (Friday) + 58 (Saturday) + 83 (Sunday) = 185 kiwis.
Oliver has a total of 185 kiwis.
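For reference, the size of the kiwis changes nothing: the correct total is simply the sum of all three days. A quick sanity check (an illustrative snippet, not from the paper):

```python
# The smaller kiwis still count toward the total; nothing should be subtracted.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number he did on Friday" = 88
total = friday + saturday + sunday
print(total)  # 190, not the 185 the model arrived at
```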
Overall, the researchers saw accuracy drops ranging from 17.5 percent to a staggering 65.7 percent, depending on the model.
And in an even simpler test, the researchers found that just switching out details like proper nouns or numbers caused a significant decrease in a model's ability to answer the question correctly, with accuracy drops ranging from 0.3 percent to nearly ten percent across 20 top models.
“LLMs remain sensitive to changes in proper names (e.g., people, foods, objects), and even more so when numbers are altered,” lead study author and Apple research scientist Mehrdad Farajtabar wrote last week in a thread on X-formerly-Twitter. “Would a grade-school student’s math test score vary by [about] ten percent if we only changed the names?”
The study's findings call into question not only the intelligence of frontier AI models, but also the accuracy of the current methods we use to grade and market them. After all, if you memorize a few sentences of a language phonetically, you haven't actually learned the language; you just know what a few words are supposed to sound like.
“Understanding LLMs’ true reasoning capabilities is crucial for deploying them in real-world scenarios where accuracy and consistency are non-negotiable — especially in AI safety, alignment, education, healthcare, and decision-making systems,” Farajtabar continued in the X thread. “Our findings emphasize the need for more robust and adaptable evaluation methods.”
“Developing models that move beyond pattern recognition to true logical reasoning,” he added, “is the next big challenge for the AI community.”
More on AI and reasoning: OpenAI’s Strawberry “Thought Process” Sometimes Shows It Scheming to Trick Users