While we were busy ringing in the New Year, the NYT sued OpenAI and Microsoft for copyright infringement, claiming that the two companies used the Times’s articles to train their large language models. At the heart of the suit is the claim
that AI can serve up the contents of an entire news article in answer to a query, without requiring a user to ever visit the Times’s website. In their defense (which hasn’t gotten much attention yet), the defendants will likely lean on the “fair use” argument; they’ve publicly stated that the use of training materials for LLMs is “a new transformative purpose.”
Hogwash? Copyright infringement is copyright infringement, and using someone’s work without permission or attribution is infringement, regardless of how novel the use is. It’s instructive to read the Times’s filing to understand what’s at stake here.
We’re betting that this copyright story will be one of the biggest stories of 2024 (aside, perhaps, from politics). The Times is the first big media company to sue, but a handful of writers and
artists, including comedian Sarah Silverman and authors George R.R. Martin and John Grisham, have already jumped aboard with similar suits.
Copyright Needs a Digital Facelift
From the moment you could cut and paste, copyright has been an irksome thorn in our digital lives. Copyright law, which incidentally has its origins in the horse-and-buggy days of America’s 1700s, hasn’t changed much to meet the realities of a digital age, never mind the age of AI.
Plus, OpenAI and Microsoft have both been pretty cagey about who’s training what on whom. As longtime colleague Dan Tynan writes, “The results of this case could determine what kind of AI tools are available to the public, and the
business models they will operate under. If the makers of AI models have to pay for access to the public data they used, it will likely kill a lot of them, and make others prohibitively expensive to use. (On the other hand, OpenAI is valued at upwards of $80 billion — it can afford to pay for the materials that helped to make it rich, at least on paper.)”
It’s not the first time users of digital platforms have been pawns in the game. When we collectively realized that our eyeballs and comments were the product that social media platforms really needed to sell to advertisers, it laid the groundwork for what’s about to follow. We’re about to find out that every digital word we write is part of the collective canon of
words that train AIs that will make a lot of money off our sweat and blood. It’s the digital equivalent of a feudal system.
At a minimum, it’s time for AI companies to disclose what materials they use in the training of their models. Proper citations for the origins of the information for each query would be helpful too. And it’s time to start thinking seriously about financial remuneration to those whose words (and images) feed the AI beast.