“It’s important to know whether updates to the model…actually hurt its capability.”
Dumb and Dumber
Regardless of what its execs claim, researchers are now saying that yes, OpenAI’s GPT large language model (LLM) appears to be getting dumber.
In a new yet-to-be-peer-reviewed study, researchers at Stanford and Berkeley found that over a period of a few months, both GPT-3.5 and GPT-4 significantly changed their “behavior,” with the accuracy of their responses appearing to go down. The findings lend weight to user anecdotes about the apparent degradation of the latest versions of the software in the months since their releases.
“GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6 percent),” the researchers wrote in their paper’s abstract, “but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4 percent).”
“Both GPT-4 and GPT-3.5,” the abstract continued, “had more formatting mistakes in code generation in June than in March.”
Brain Drain
This study affirms what users have been saying for more than a month now: that as they’ve used the GPT-3.5 and GPT-4-powered ChatGPT over time, they’ve noticed it becoming, well, stupider.
The seeming degradation of its accuracy has become so troublesome that OpenAI vice president of product Peter Welinder attempted to dispel rumors that the change was intentional.
“No, we haven’t made GPT-4 dumber,” Welinder tweeted last week. “Quite the opposite: we make each new version smarter than the previous one.”
He added that changes in user experience could be due to continuous use, saying that it could be that “when you use [ChatGPT] more heavily, you start noticing issues you didn’t see before.”
Class Clown
The Stanford and Berkeley research is a compelling data point against that hypothesis, though. While the researchers don’t posit reasons why these downward “drifts” in accuracy and ability are occurring, they do note that this demonstrable worsening over time challenges OpenAI’s insistence that its models are instead improving.
“We find that the performance and behavior of both GPT-3.5 and GPT-4 vary significantly across these two releases and that their performance on some tasks have gotten substantially worse over time,” the paper noted, adding that it’s “interesting” to question whether GPT-4 is indeed getting stronger.
“It’s important to know whether updates to the model aimed at improving some aspects actually hurt its capability in other dimensions,” the researchers wrote.
Translation: OpenAI’s rapid updates may be doing more harm than good for ChatGPT, which has already become known for its inaccuracies.