WordPress and Tumblr Plan to Sell User Content to AI Companies

Automattic, the parent company of sites like WordPress and Tumblr, is in talks to sell content from its platforms to AI companies like Midjourney and OpenAI for training purposes, according to a report published Tuesday by 404 Media. And while the details of the deal are still sketchy, Automattic is trying to reassure users that they can opt out at any time.

404 Media reports there’s conflict within Automattic, as some of the content being scraped for the AI companies included private content not intended to be saved by the company. To complicate matters further, advertising content that isn’t even owned by Automattic, including ads from an old Apple Music campaign, has also reportedly made its way into the training data set.

The plans at Automattic have been so controversial internally that a product manager has even started pulling his own photos off Tumblr to make sure they’re not used to train AI, according to 404 Media.

Generative AI has become big business since OpenAI launched ChatGPT in late 2022, alongside text-prompt image generators from a number of companies. The technology works by being “trained” on enormous amounts of data, which allows it to generate videos, images, or text that appear original. But major publishers have complained, with some even filing lawsuits, alleging that much of the data used to train these systems was either pirated or used in ways that don’t constitute “fair use” under existing copyright regimes.

Automattic plans to introduce a new setting as soon as Wednesday that will let users opt out of training AI systems, according to 404 Media, but it’s not clear whether the setting will be toggled on or off by default for most users. WordPress competitor Squarespace introduced a similar setting last year that lets users opt out of having their data used to train AI.

In response to emailed questions on Tuesday, Automattic directed Gizmodo to a new post that more or less confirmed 404 Media’s reporting, while trying to sell the move to consumers as an opportunity to “give you more control over the content you’ve created.”

“AI is rapidly transforming nearly every aspect of our world, including the way we create and consume content. At Automattic, we’ve always believed in a free and open web and individual choice. Like other tech companies, we’re closely following these advancements, including how to work with AI companies in a way that respects our users’ preferences,” the blog post reads.

But the lengthy statement comes across as incredibly defensive, noting that “no law exists that requires crawlers to follow these preferences,” and suggesting that the company is simply following best practices in the industry to give users the option to decide if they want their content used for training AI.

“Regardless of geographic location, we want to provide you tools that grant as much control as possible. Since respectable companies do follow these settings, they’re the best method to enforce how content is crawled on the web,” Automattic’s statement reads.

“Our partnerships will respect all opt-out settings. We also plan to take that a step further and regularly update any partners about people who newly opt-out and ask that their content be removed from past sources and future training.”

Matt Novak