Inside the Internet Archive: 99 Petabytes, One Mission

The Internet Archive stores a trillion web pages and 99 petabytes of data — but copyright lawsuits threaten its survival. Here's how it works and why it matters.

Deep Dive · Digital Preservation

A trillion web pages, 99 petabytes of original data, and a legal battle that could erase digital history — the story of the web’s most important library.

April 26, 2026  ·  8 min read  ·  Internet Archive · Copyright · Digital Preservation

Tucked inside a former Christian Science church in San Francisco, a non-profit organisation quietly crawls the web, archives television broadcasts, and preserves out-of-print books. It’s called the Internet Archive, it is arguably the most important digital institution in existence, and some of the world’s largest copyright holders are trying to shut it down.

99 petabytes of original data stored and preserved
1T+ web pages harvested across decades of crawls
Top 200 websites: more than 100 rely on Wayback Machine links

What is the Internet Archive?

Founded in 1996 by Brewster Kahle, the Internet Archive began with a deceptively simple mission: preserve the web before it disappears. Web pages are transient by nature — links rot, companies fold, governments scrub content. The Archive’s Wayback Machine provides a permanent, timestamped record of what the web looked like at any given moment.

Today it extends far beyond web crawling. The Archive preserves television broadcasts, audio recordings, software, and books — functioning as a free, public digital library accessible to anyone with an internet connection.

“It’s petabytes of materials that are now at least safe, so you can actually leverage the public domain.”

The infrastructure behind it

The scale of the operation is staggering. The Archive runs on custom-built server hardware housed in its San Francisco headquarters. One notable feature of its data centre is the cooling: the primary mechanism is a large, basic air fan, with a handwritten note reminding staff not to turn it off.

Data is stored across multiple petabyte-class arrays, fed by automated bots that continually crawl and index the web. Every new snapshot of a web page is deduplicated against existing records before being committed to storage. The organisation operates on a non-profit model, funded almost entirely by donations and grants.
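
The deduplication step can be pictured with a small sketch. This is an illustration only, not the Archive's actual pipeline: it assumes snapshots are compared by a SHA-256 digest of their raw bytes, and the `SnapshotStore` class and its method names are hypothetical.

```python
import hashlib


class SnapshotStore:
    """Hypothetical in-memory store that keeps one copy of each distinct payload."""

    def __init__(self):
        self._payloads = {}  # digest -> raw bytes, stored once per distinct content
        self._captures = []  # (url, timestamp, digest) for every capture seen

    def add(self, url, timestamp, body):
        """Record a capture; return True only if the bytes were new to the store."""
        digest = hashlib.sha256(body).hexdigest()
        is_new = digest not in self._payloads
        if is_new:
            self._payloads[digest] = body
        # The capture is always logged, even when the payload is a duplicate,
        # so the timeline stays complete while storage grows only on change.
        self._captures.append((url, timestamp, digest))
        return is_new

    def stored_payloads(self):
        return len(self._payloads)

    def capture_count(self):
        return len(self._captures)
```

Under these assumptions, two captures of an unchanged page add two timeline entries but only one stored payload. A real system would likely also need to handle near-duplicates, which a plain content hash does not detect.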

The Wayback Machine as infrastructure: More than 100 of the top 200 most-visited websites on the internet actively link to Wayback Machine URLs to restore broken or deleted content. It functions less like a museum and more like load-bearing infrastructure for the modern web.
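
For a sense of how a site might point a dead link at an archived copy, here is a minimal sketch. It relies on the public Wayback Machine URL pattern, `https://web.archive.org/web/<timestamp>/<original URL>`, where the timestamp is written as YYYYMMDDhhmmss in UTC. The `wayback_url` helper is a hypothetical name, and the sketch assumes the caller passes a timezone-aware datetime.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, when):
    """Build a Wayback Machine replay URL for original_url near the moment `when`.

    `when` must be a timezone-aware datetime; it is converted to UTC and
    formatted as the 14-digit YYYYMMDDhhmmss timestamp used in replay URLs.
    """
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


link = wayback_url(
    "http://example.com/page",
    datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(link)
```

The requested timestamp does not have to match a capture exactly; the Wayback Machine generally serves the capture closest to it.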

A brief history

1996: Brewster Kahle founds the Internet Archive with a focus on web crawling and long-term digital preservation
2001: The Wayback Machine launches publicly, making the archive’s web history accessible to anyone
2011: Controlled Digital Lending programme begins, allowing the Archive to loan scanned books like a traditional library
2020: National Emergency Library launched during COVID-19 pandemic, triggering a lawsuit from major publishers
2024: Courts rule against the Archive on Controlled Digital Lending; appeals process ongoing as of 2026

The threats it faces

The Internet Archive is fighting battles on multiple fronts. Publishers and record labels have pursued legal action over its lending and music hosting programmes. The core dispute centres on whether digitising a physical book and lending one copy at a time — mirroring traditional library practice — constitutes fair use or copyright infringement.

Publisher lawsuits: Major publishers sued over Controlled Digital Lending, arguing digital loans require separate licensing deals.
Music rights: Record labels targeted the Archive’s 78rpm music collection, much of which predates modern copyright frameworks.
DMCA takedowns: Copyright holders issue takedown notices against specific web snapshots, creating gaps in the historical record.
Funding pressure: Legal costs strain the non-profit’s budget, which runs entirely on donations, grants, and institutional support.

Why it matters beyond nostalgia

Journalists rely on the Wayback Machine to document when governments or corporations quietly alter or delete public statements. Researchers use it to study how misinformation spreads and evolves. Courts have accepted Wayback Machine snapshots as evidence. Developers use the Archive’s APIs to build tools that surface public domain material.

The Archive’s unsung heroes — the staff and volunteers who digitise physical media, maintain metadata, and process takedown requests — are preserving cultural history that would otherwise be lost to bit-rot and corporate indifference.

For developers: The Archive exposes a robust public API. You can access Wayback Machine availability data, search the full-text index, embed archived pages, and download bulk datasets — all free of charge. See archive.org/developers for documentation.
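
As a starting point, here is a minimal sketch of querying the Wayback Machine availability endpoint (`https://archive.org/wayback/available`) using only the Python standard library. The helper names are hypothetical, and the response shape handled here (an `archived_snapshots` object with an optional `closest` entry) reflects the publicly documented format, so check the current documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the request URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return the closest archived URL from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(url, timestamp=None):
    """Perform the network request and return the closest snapshot URL, if any."""
    with urlopen(availability_query(url, timestamp), timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `lookup("example.com", "20060101")` performs a live request, so results depend on what the Archive holds at the time. Keeping URL construction and response parsing in separate functions makes both easy to test without network access.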

What happens if it disappears?

There is no backup for the Internet Archive. No government body, no corporate entity, no academic institution maintains an equivalent public resource at this scale. If the organisation were forced to shut down by litigation costs or legal injunction, decades of web history, out-of-print books, broadcast recordings, and software would become effectively inaccessible to the public.

That’s not a hypothetical: it’s precisely the outcome some of the plaintiffs appear to be pursuing. The Controlled Digital Lending dispute is a defining test of whether digital libraries can operate under the same principles as physical ones, or whether every act of digital preservation requires a separate commercial licence.
