AI companies are being watched closely when it comes to the data they use to train their AI models. OpenAI, specifically, has been facing several lawsuits over allegedly scraping data without consent. The New York Times is getting ahead of the issue by blocking OpenAI's web crawler before it can gather any of the paper's data.
OpenAI Can No Longer Scrape Data from NYT
The publication has taken measures to make sure that its data will not be used to train AI models. It is being especially vigilant toward OpenAI, blocking the company's web crawler, GPTBot, through The New York Times's robots.txt file.
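For context, robots.txt is a plain-text file placed at the root of a website that tells crawlers which parts of the site they may visit. OpenAI has documented that its crawler identifies itself with the user-agent token "GPTBot," so a publisher that wants to shut it out entirely can add two lines like the following (a generic illustration of the mechanism, not a copy of The New York Times's actual file):

User-agent: GPTBot
Disallow: /

A compliant crawler checks this file before fetching pages and skips anything listed under Disallow, though the system ultimately relies on the crawler choosing to honor it.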
Reports say that the restriction has been in place since August 17th, around the time the news company updated its Terms of Service to prohibit other companies from scraping its content to train AI systems and algorithms.
This covers text, photos, videos, and metadata, which is to say, essentially everything. The publication is even considering legal action against the AI company, according to The Verge, which would add to the long list of lawsuits OpenAI is already facing.
Back in July, comedian Sarah Silverman joined a class-action suit against the ChatGPT maker, accusing it of copyright infringement. Silverman claims that OpenAI, along with Meta, "copied and ingested" her content to train AI programs.
The people behind the lawsuit include authors Christopher Golden and Richard Kadrey. The suit argues that ChatGPT's ability to generate summaries of the plaintiffs' works would only be possible if the chatbot had been trained on those works.
According to The New York Times, ChatGPT was able to summarize the comedian's memoir, "The Bedwetter," writing that one of the key topics in its first part is "Silverman's struggle with enuresis, or bedwetting, which extended into her teenage years."
It's Not Just NYT
Given that an AI model needs as much data as possible to answer users' questions effectively, the company developing it is likely to collect data from a wide range of sources, including social networking sites.
Mainstream platforms like X (formerly Twitter) and Reddit have already addressed the issue with new policies, and both have resorted to controversial measures to keep their data from being scraped.
Reddit introduced new API pricing, which prompted a protest led by its moderators, with more than 8,000 subreddits participating in the blackout. In the end, the protest changed little, and many third-party apps decided to shut down.
The platform's CEO, Steve Huffman, said that "the Reddit corpus of data is really valuable. But we don't need to give all of that value to some of the largest companies in the world for free," as reported by Cybernews.
Elon Musk said data scraping was also the reason X suddenly imposed a daily read limit on its users. The Tesla CEO explained that "several entities tried to scrape every tweet ever made in a short period of time," which is why the limits were put in place.