Technology

“Not for harvesting machines”: Data revolts erupt against AI

July 15, 2023

For more than 20 years, Kit Loffstadt has been writing alternate universe fan fiction for Star Wars heroes and Buffy the Vampire Slayer villains and sharing her stories online for free.

But in May, Ms. Loffstadt stopped publishing her creations after learning that a data company had copied her stories and fed them into the artificial intelligence technology underlying ChatGPT, the viral chatbot. Dismayed, she hid her writing behind a locked account.

Ms. Loffstadt also helped organize a rebellion against AI systems last month. Along with dozens of other fan fiction writers, she published a flood of irreverent stories online to overwhelm and confuse the data collection services that feed writers’ work into AI technology.

“We all need to do what we can to show them that the results of our creativity can’t be left to the machines,” said Ms. Loffstadt, a 42-year-old voice actress from South Yorkshire, UK.

Fan fiction writers are just one group now staging revolts against AI systems as a fever over the technology grips Silicon Valley and the world. In recent months, social media companies like Reddit and Twitter, news organizations like The New York Times and NBC News, and authors like Paul Tremblay and the actress Sarah Silverman have all spoken out against AI sucking up their data without permission.

Their protests have taken different forms. Authors and artists are locking their files to protect their work or boycotting certain websites that publish AI-generated content, while companies like Reddit want to charge for access to their data. At least 10 lawsuits have been filed this year against AI companies, accusing them of training their systems on artists’ creative work without consent. Last week, Ms. Silverman and the writers Christopher Golden and Richard Kadrey sued OpenAI, the maker of ChatGPT, and others over AI’s use of their work.

At the heart of the rebellions is a new understanding that online information (stories, artwork, news articles, message board posts, and photos) can have significant untapped value.

The new wave of AI, known as “generative AI” for the text, images, and other content it generates, relies on complex systems such as large language models, which are capable of producing human-like prose. These models are trained on vast amounts of data of all kinds so that they can answer human questions, mimic writing styles, or produce comedy and poetry.

This has led tech companies to seek even more data for their AI systems. Google, Meta, and OpenAI have essentially used information from across the internet, including large databases of fan fiction, troves of news articles, and book collections, many of which were freely available online. In tech industry parlance, this is known as “scraping” the internet.

GPT-3 by OpenAI, an AI system released in 2020, includes 500 billion “tokens,” each representing parts of words found primarily online. Some AI models span more than a trillion tokens.

The practice of internet scraping has been around for a long time and has largely been disclosed by the companies and non-profit organizations that have practiced it. But it wasn’t well understood or seen as particularly problematic by the companies that owned the data. That changed after ChatGPT launched in November and the public learned more about the underlying AI models behind the chatbots.

“What’s happening here is a fundamental realignment of the value of data,” said Brandon Duderstadt, founder and CEO of Nomic, an AI company. “It used to be thought that you could get value out of data by making it accessible to everyone and showing ads. The thought now is that you lock your data because you get a lot more value out of it when you use it as input to your AI.”

The data protests are unlikely to have any long-term impact. Financially strong tech giants like Google and Microsoft already have mountains of proprietary information and the resources to license more. But as the era of easy-to-extract content comes to an end, smaller AI startups and nonprofits that were hoping to compete with the big players may not be able to get enough content to train their systems.

In a statement, OpenAI said ChatGPT was trained on “licensed content, public domain content, and content created by human AI trainers.” It added: “We respect the rights of creators and writers and look forward to continuing to work with them to protect their interests.”

Google said in a statement it was involved in discussions about how publishers could manage their content in the future. “We believe everyone benefits from a vibrant content ecosystem,” the company said. Microsoft did not respond to a request for comment.

The data revolts erupted last year after ChatGPT became a global phenomenon. In November, a group of programmers filed a proposed class action lawsuit against Microsoft and OpenAI, claiming that the companies infringed their copyright after their code was used to train an AI-powered programming assistant.

In January, Getty Images, which provides stock photos and videos, sued Stability AI, an AI company that creates images from text descriptions, claiming that the start-up had used copyrighted photos to train its systems.

Then, in June, Clarkson, a Los Angeles law firm, filed a 151-page proposed class action lawsuit against OpenAI and Microsoft, describing how OpenAI had collected data from minors and saying web scraping violated copyright law and constituted “theft.” On Tuesday, the firm filed a similar lawsuit against Google.

“The data rebellion we’re seeing across the country is society’s way of fighting back against the idea that big tech simply has the right to take any information from any source and make it their own,” said Ryan Clarkson, the founder of Clarkson.

Santa Clara University Law School professor Eric Goldman said the lawsuit’s arguments are wide-ranging and unlikely to be accepted by the court. But the wave of litigation is just beginning, he said, and there are “second and third waves” coming that will determine the future of AI.

Larger companies are also fighting back against AI scrapers. In April, Reddit said it wanted to charge for access to its application programming interface (API), the method that allows third parties to download and analyze the social network’s extensive database of person-to-person conversations.

Reddit CEO Steve Huffman said at the time that his company “doesn’t have to give all that value away for free to some of the biggest companies in the world.”

That same month, Stack Overflow, a question-and-answer site for computer programmers, announced that it would also ask AI companies to pay for data. The site contains nearly 60 million questions and answers. Its move was previously reported by Wired.

News organizations are also fighting back against AI systems. In an internal memo about the use of generative AI in June, The Times said AI companies should “respect our intellectual property.” A Times spokesman declined to elaborate.

For individual artists and writers, battling AI systems has meant rethinking where they publish.

Nicholas Kole, 35, an illustrator from Vancouver, British Columbia, was concerned about how his distinctive art style could be reproduced by an AI system and suspected the technology had scraped his work. He plans to continue posting his creations on Instagram, Twitter, and other social media sites to attract clients, but has stopped publishing on sites like ArtStation, which post AI-generated content alongside human-made content.

“It just feels like wanton theft from me and other artists,” Mr. Kole said. “It puts a pit of existential dread in my stomach.”

At Archive of Our Own, a fan fiction database with more than 11 million stories, authors are pressuring the site to ban data scraping and AI-generated stories.

In May, when a few Twitter accounts shared examples of ChatGPT mimicking the style of popular fan fiction posted on Archive of Our Own, dozens of writers rose up in protest. They blocked their stories and wrote subversive content to mislead the AI scrapers. They also urged Archive of Our Own officials to stop allowing AI-generated content.

Betsy Rosenblatt, who provides legal counsel for Archive of Our Own and is a professor at the University of Tulsa College of Law, said the site has a policy of “maximum inclusivity” and does not want to be in the position of determining which stories were written with AI.

For Ms. Loffstadt, the fan fiction author, the battle against AI began when she wrote a story about Horizon Zero Dawn, a video game in which humans battle AI-powered robots in a post-apocalyptic world. In the game, she said, some of the robots are good and some are bad.

But in the real world, she said, “they’re lured into doing bad things thanks to hubris and corporate greed.”
