Reddit Limits Internet Archive Access to Curb AI Data Scraping

Reddit has started blocking The Internet Archive (IA) from archiving most of its platform. The company says the move was prompted by concerns that AI developers were pulling Reddit-originated data from archived pages, something IA has not been able to stop.

What is the Internet Archive?

The Internet Archive has been around longer than Google, Facebook, and LinkedIn. Founded in 1996, it was designed to give the public open access to the growing collection of information on the internet.

One of its most recognized features, the Wayback Machine, was launched in 2001. While the IA preserves a wide range of online material, the Wayback Machine specifically allows users to see websites as they looked on certain dates, even if the originals were deleted or changed.

Today, the IA says it has preserved more than 835 billion web pages, along with books, images, videos, apps, and audio files.

Understanding the dispute

The IA regularly archived Reddit content, from original posts and comment threads to user profiles. Even if a post was deleted by its author, a snapshot often remained available on the Wayback Machine, and the same was true of deleted user profiles. Any tech-savvy Reddit user could already find this information; according to Reddit, the real issue arose when automated bots began using those archives to gather large amounts of data for AI training.

Because the platform bans automated scraping without approval, Reddit argues it had little choice but to limit IA’s access when the archive could not reliably block these activities.

Tim Rathschmidt, a spokesperson for Reddit, was recently quoted as saying: “Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors.”

The restriction does not cut off IA entirely; the Wayback Machine can still capture Reddit’s homepage but not full posts, comments, or subreddit pages.

Will other sites follow suit?

The change comes just over a year after Reddit said it would not block “good faith actors” such as the IA from accessing its content.

Reddit insists that the block is a result of AI bots using the IA and the Wayback Machine to scrape data. However, some critics, citing Reddit’s recent plan to introduce paid subreddits, see it as a way for the social media platform to ensure a smooth transition to content monetization in the future. Others see it as a result of Reddit’s recent licensing deals with Google and OpenAI.

Whatever Reddit’s true reason may be, it will be interesting to see if other social media platforms will follow its lead.