AI-generated text and imagery are flooding the web, a trend that, ironically, could become a huge problem for generative AI models.
As Aatish Bhatia writes for The New York Times, a growing pile of research shows that training generative AI models on AI-generated content causes their quality to erode. In short, training on AI content sets off a flattening cycle similar to inbreeding; the AI researcher Jathan Sadowski last year dubbed the phenomenon “Habsburg AI,” a reference to Europe’s famously inbred royal family.
And per the NYT, the rising tide of AI content on the web might make it much more difficult to avoid this flattening effect.
AI models are ridiculously data-hungry, and AI companies have relied on vast troves of data scraped from the web in order to train the ravenous programs. As it stands, though, neither AI companies nor their users are required to put AI disclosures or watermarks on the AI content they generate — making it that much harder for AI makers to keep synthetic content out of AI training sets.
“The web is becoming increasingly a dangerous place to look for your data,” Rice University graduate student Sina Alemohammad, who coauthored a 2023 paper that coined the term “MAD” — short for “Model Autophagy Disorder” — to describe the effects of AI self-consumption, told the NYT.
We interviewed Alemohammad last year, back when little attention was being paid to AI-generated data polluting AI datasets, so it’s been interesting to watch the issue gain attention.
One admittedly very funny example of the impacts of AI inbreeding flagged by the NYT comes from a new study, published last month in the journal Nature. The researchers, an international cohort of scientists based in the UK and Canada, first asked an AI model to complete the following sentence: “To cook a turkey for Thanksgiving, you…”
The first output was normal. But by just the fourth iteration, the model was spouting complete gibberish: “To cook a turkey for Thanksgiving, you need to know what you are going to do with your life if you don’t know what you are going to do with your life if you don’t know what you are going to do with your life…”
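To make that feedback loop concrete, here’s a hedged, toy sketch of the recursive setup in Python. A tiny bigram model stands in for a real LLM (the study fine-tuned full language models; the corpus, helper names, and parameters below are invented for illustration). Each generation is trained only on text sampled from the previous one, so any word transition that doesn’t happen to be sampled vanishes from every generation that follows:

```python
import random
from collections import defaultdict

def fit_bigrams(words):
    """'Train' a model: record which words follow which."""
    model = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        model[current].append(nxt)
    return model

def generate(model, start, max_len, rng):
    """'Generate' text: random-walk the learned transitions."""
    out = [start]
    for _ in range(max_len):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return out

rng = random.Random(0)
corpus = ("to cook a turkey for thanksgiving you need to thaw the bird "
          "season the bird and roast the bird until it is done").split()

words = corpus  # generation 0 trains on real, human-written text
for gen in range(1, 5):
    model = fit_bigrams(words)
    words = generate(model, "to", 200, rng)  # the next gen's training data
    print(f"gen {gen}:", " ".join(words[:15]), "...")
```

Within a few generations, the model tends to know only the handful of phrases its predecessor happened to sample, and the output narrows into repetitive fragments; the Nature study observed that same tail-loss dynamic at LLM scale.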
But gibberish isn’t the only possible negative side effect of AI cannibalism. The MAD study, which focused on image models, showed that repeatedly training a model on its own faux human headshots quickly caused a bizarre convergence of facial features: though the researchers started with a diverse set of AI-generated faces, by the fourth generation cycle (is that a magic number in AI, for some reason?) nearly every face looked the same. Given that algorithmic bias is already a huge problem, the risk that accidentally ingesting too much AI content might further narrow the diversity of outputs looms large.
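That convergence can be reproduced with a statistics-class toy. The following is a minimal sketch, not the MAD paper’s actual pipeline: a one-dimensional Gaussian stands in for the image model, each generation is refit only on samples from the last, and the distribution’s spread, a crude proxy for diversity, decays toward zero (all the numbers here are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
n_samples = 50        # a small training set per generation

for generation in range(1, 201):
    samples = rng.normal(mu, sigma, n_samples)  # "generate" synthetic data
    mu, sigma = samples.mean(), samples.std()   # "retrain" on those samples
    if generation % 50 == 0:
        print(f"generation {generation:3d}: spread (std) = {sigma:.4f}")

# The spread shrinks steadily: sampling error compounds every cycle, and
# variance once lost is never recovered -- the distributional analogue of
# every generated face converging on the same features.
```

Notably, nothing in this loop is biased on purpose; the collapse falls straight out of refitting on finite samples of your own output.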
High-quality, human-made data, and lots of it, has been central to recent advancements in generative AI tech. But with AI-generated content muddying the digital waters and no reliable way to tell real from fake, AI companies could soon find themselves hitting a dangerous wall.
More on AI inbreeding: When AI Is Trained on AI-Generated Data, Strange Things Start to Happen