“…Defendants took a different approach: theft.”
My Data, Your Data
A new lawsuit against ChatGPT creator OpenAI is alleging that the buzzy Silicon Valley firm’s AI training practices violated the privacy and copyright of — well, of pretty much everyone who’s ever posted anything online.
To train its powerful AI language models, OpenAI used an enormous amount of data scraped from various corners of the web. Although OpenAI doesn’t even know exactly what its systems are trained on, those datasets include everything from Wikipedia articles and famous novels to social media posts and incredibly niche erotica, and OpenAI didn’t ask permission for any of it.
The class action suit, filed in California, alleges that failing to follow proper procurement guidelines, including seeking the consent of those who produced that content in the first place, amounts to straight-up data theft.
“Despite established protocols for the purchase and use of personal information, Defendants took a different approach: theft,” reads the filing. “They systematically scraped 300 billion words from the internet, ‘books, articles, websites and posts — including personal information obtained without consent.’”
Not So Free Web
It’s a fair criticism. If you’ve been online at all in the past few decades, your digital output is likely embedded in OpenAI’s datasets, meaning that anything OpenAI’s generative models churn out, for profit, might contain bits and pieces of your silently scraped digital labor.
“All of that information is being taken at scale,” Ryan Clarkson, the managing partner at the firm suing OpenAI, told The Washington Post, “when it was never intended to be utilized by a large language model.”
That said, whether the case actually holds up in court remains to be seen. The internet’s infrastructure is complicated, and what’s largely seen as the free and open web is often neither of those things. Platforms have their own user terms and agreements, and even if we’re the ones who have done the work to pack those sites with content, in many cases those terms give the platform, and not, unfortunately, the users, broad rights over how it gets used.
“When you put content on a social media site or any site, you’re generally granting a very broad license to the site to be able to use your content in any way,” Katherine Gardner, an intellectual-property lawyer, told WaPo. “It’s going to be very difficult for the ordinary end user to claim that they are entitled to any sort of payment or compensation for use of their data as part of the training.”