I recently took a deep dive into web site archival for friends who were worried about losing control over the hosting of their work online in the face of poor system administration or hostile removal. This makes web site archival an essential instrument in the toolbox of any system administrator. As it turns out, some sites are much harder to archive than others. This article goes through the process of archiving traditional web sites and shows how it falls short when confronted with the latest fashions in the single-page applications that are bloating the modern web.
Converting simple sites
For simple or static sites, the venerable Wget program works well. The incantation to mirror a full web site, however, is byzantine:
$ nice wget --mirror --execute robots=off --no-verbose --convert-links \
      --backup-converted --page-requisites --adjust-extension \
      --base=./ --directory-prefix=./ --span-hosts \
      --domains=www.example.com,example.com http://www.example.com/
The above downloads the content of the web page, but also crawls everything within the specified domains. Before you run this against your favorite site, consider the impact such a crawl might have on the site. The above command line deliberately ignores robots.txt rules, as is now common practice for archivists, and hammers the website as fast as it can. Most crawlers have options to pause between hits and limit bandwidth usage to avoid overwhelming the target site.
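As a rough illustration of those throttling options, the sketch below builds a gentler Wget command line. The helper name polite_wget_args and its default values are made up for this example; --wait, --random-wait, and --limit-rate are standard Wget options, but the right values depend entirely on the target site.

```python
import shlex


def polite_wget_args(url, wait_seconds=1, limit_rate="200k"):
    """Build a Wget mirror command line that throttles itself.

    Hypothetical helper: the name and the default values are assumptions
    made for this example, not recommendations from the article.
    """
    return [
        "wget",
        "--mirror",
        "--page-requisites",
        "--convert-links",
        "--adjust-extension",
        f"--wait={wait_seconds}",      # pause between retrievals, in seconds
        "--random-wait",               # vary the pause around that value
        f"--limit-rate={limit_rate}",  # cap the download bandwidth
        url,
    ]


# Print the command so that it can be reviewed before being run by hand.
print(shlex.join(polite_wget_args("http://www.example.com/")))
```

The resulting list can be passed to subprocess.run() as is, or the printed line can be pasted into a shell.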
The mirror command shown earlier will also fetch "page requisites" like style sheets (CSS), images, and scripts. The downloaded page contents are modified so that links point to the local copy as well. Any web server can host the resulting file set, which results in a static copy of the original web site.
That is, when things go well. Anyone who has ever worked with a computer
knows that things seldom go according to plan; all sorts of
things can make the procedure derail in interesting ways. For example,
it was trendy for a while to have calendar blocks in web sites. A CMS
would generate those on the fly and make crawlers go into an infinite
loop trying to retrieve all of the pages. Crafty archivers can resort to regular expressions
(e.g. Wget has a
--reject-regex option) to ignore problematic
resources. Another option, if the administration interface for the
web site is accessible, is to disable calendars, login forms, comment
forms, and other dynamic areas. Once the site becomes static, those
will stop working anyway, so it makes sense to remove such clutter
from the original site as well.
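To give an idea of what such a regular expression might look like, here is a minimal sketch that filters a crawl queue the way a reject pattern would. The pattern and the URLs are hypothetical; a real calendar trap needs a pattern written for the URLs the CMS actually generates.

```python
import re

# Hypothetical pattern: calendar paths and month/year query parameters.
CALENDAR_TRAP = re.compile(r"/calendar/|[?&](month|year)=")


def filter_urls(urls, reject=CALENDAR_TRAP):
    """Return only the URLs that do not match the reject pattern."""
    return [url for url in urls if not reject.search(url)]


queue = [
    "http://www.example.com/",
    "http://www.example.com/about.html",
    "http://www.example.com/calendar/2018/05/",
    "http://www.example.com/events?month=2018-06",
]

# Only the first two URLs survive the filter.
print(filter_urls(queue))
```

An expression of the same shape could then be handed to the crawler's own reject option once it has been checked against a sample of the site's URLs.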
Traditional archival methods sometimes fail in the dumbest way. When
trying to build an offsite backup of a local newspaper
(pamplemousse.ca), I found that
WordPress adds query strings to the URLs of the resources it includes.
This confuses
content-type detection in the web servers that serve the archive, which
rely on the file extension to send the right
Content-Type header. When such an archive is
loaded in a web browser, it fails to load scripts, which breaks dynamic websites.
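The failure is easy to reproduce in a few lines. The sketch below is a simplified model of extension-based content-type detection, not the logic of any particular web server; the file name, the query string, and the tiny extension table are all assumptions made for the example.

```python
import posixpath

# Simplified extension table; real web servers ship a much larger one.
TYPES = {
    ".html": "text/html",
    ".css": "text/css",
    ".js": "application/javascript",
}


def content_type_for(filename):
    """Guess a Content-Type from the file extension alone."""
    extension = posixpath.splitext(filename)[1]
    return TYPES.get(extension, "application/octet-stream")


def strip_query(filename):
    """Drop everything from the first '?' onward."""
    return filename.split("?", 1)[0]


saved_name = "jquery.js?ver=1.0"  # hypothetical file name saved by a crawler

print(content_type_for("jquery.js"))              # application/javascript
print(content_type_for(saved_name))               # application/octet-stream
print(content_type_for(strip_query(saved_name)))  # application/javascript
```

With the query string in place, the "extension" seen by the lookup is ".0", so the script is served with a generic type that browsers may refuse to execute.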
As the web moves toward using the browser as a virtual machine to run arbitrary code, archival methods relying on pure HTML parsing need to adapt. The solution for such problems is to record (and replay) the HTTP headers delivered by the server during the crawl, and indeed professional archivists use just such an approach.
Creating and displaying WARC files
At the Internet Archive, Brewster Kahle and Mike Burner designed the ARC (for "ARChive") file format in 1996 to provide a way to aggregate the millions of small files produced by their archival efforts. The format was eventually standardized as the WARC ("Web ARChive") specification that was released as an ISO standard in 2009 and revised in 2017. The standardization effort was led by the International Internet Preservation Consortium (IIPC), which is an "international organization of libraries and other organizations established to coordinate efforts to preserve internet content for the future", according to Wikipedia; it includes members such as the US Library of Congress and the Internet Archive. The latter uses the WARC format internally in its Java-based Heritrix crawler.
A WARC file aggregates multiple resources like HTTP headers, file
contents, and other metadata in a single compressed
archive. Conveniently, Wget actually supports the file format with
the --warc parameter. Unfortunately, web browsers cannot render WARC
files directly, so a viewer or some conversion is necessary to access
the archive. The simplest such viewer I have found is pywb, a
Python package that runs a simple webserver to offer a
Wayback-Machine-like interface to browse the contents of WARC
files. The following set of commands will serve a WARC file from a local web server:
$ pip install pywb
$ wb-manager init example
$ wb-manager add example crawl.warc.gz
$ wayback
This tool was, incidentally, built by the folks behind the Webrecorder service, which can use a web browser to save dynamic page contents.
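To make the structure of a WARC file more concrete, the sketch below writes and reads back a single, uncompressed "response" record using only the Python standard library. It is deliberately simplified: it leaves out mandatory fields such as WARC-Record-ID and WARC-Date, real archives are usually compressed, and the function names and sample URL are invented for the example. It should not be mistaken for how pywb or Wget handle the format.

```python
import io


def build_record(uri, http_response):
    """Wrap a raw HTTP response in a simplified WARC response record."""
    header_lines = [
        b"WARC/1.0",
        b"WARC-Type: response",
        b"WARC-Target-URI: " + uri.encode("ascii"),
        b"Content-Type: application/http; msgtype=response",
        b"Content-Length: " + str(len(http_response)).encode("ascii"),
    ]
    # Header lines, a blank line, the block, then two CRLFs end the record.
    return b"\r\n".join(header_lines) + b"\r\n\r\n" + http_response + b"\r\n\r\n"


def read_record(stream):
    """Read one record: version line, named fields, then the block."""
    version = stream.readline().strip().decode("ascii")
    fields = {}
    while True:
        line = stream.readline()
        if line in (b"\r\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        fields[name.strip()] = value.strip()
    block = stream.read(int(fields["Content-Length"]))
    stream.read(4)  # skip the two CRLFs that terminate the record
    return version, fields, block


http_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: application/javascript\r\n"
    b"\r\n"
    b"console.log('hi');"
)
record = build_record("http://www.example.com/jquery.js?ver=1.0", http_response)

version, fields, block = read_record(io.BytesIO(record))
http_headers, _, body = block.partition(b"\r\n\r\n")

print(version, fields["WARC-Type"], fields["WARC-Target-URI"])
print(http_headers.decode("ascii").splitlines()[1])
```

Because the server's original Content-Type header is stored inside the record, a replay tool does not have to guess the type from the file name.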
Unfortunately, pywb has trouble loading WARC files generated by Wget because Wget followed an inconsistency in the 1.0 specification, which was fixed in the 1.1 specification. Until Wget or pywb fix those problems, WARC files produced by Wget are not reliable enough for my uses, so I have looked at other alternatives. A crawler that got my attention is simply called crawl. Here is how it is invoked:
$ crawl https://example.com/
(It does say "very simple" in the README.) The program does support
some command-line options, but most of its defaults are sane: it will fetch
page requisites from other domains (unless the corresponding
flag is used), but does not recurse out of the domain. By default, it
fires up ten parallel connections to the remote site, a setting that
can be changed with the
-c flag. But, best of all, the resulting WARC
files load perfectly in pywb.
Future work and alternatives
This article would also not be complete without a nod to the HTTrack project, the "website copier". Working similarly to Wget, HTTrack creates local copies of remote web sites but unfortunately does not support WARC output. Its interactive aspects might be of more interest to novice users unfamiliar with the command line.
In the same vein, during my research I found a full rewrite of Wget called Wget2 that has support for multi-threaded operation, which might make it faster than its predecessor. However, it is missing some features from Wget, most notably reject patterns, WARC output, and FTP support, but it adds RSS, DNS caching, and improved TLS support.
Finally, my personal dream for these kinds of tools would be to have them integrated with my existing bookmark system. I currently keep interesting links in Wallabag, a self-hosted "read it later" service designed as a free-software alternative to Pocket (now owned by Mozilla). But Wallabag, by design, creates only a "readable" version of the article instead of a full copy. In some cases, the "readable version" is actually unreadable and Wallabag sometimes fails to parse the article. Instead, other tools like bookmark-archiver or reminiscence save a screenshot of the page along with full HTML but, unfortunately, no WARC file that would allow an even more faithful replay.
The sad truth of my experiences with mirrors and archival is that data dies. Fortunately, amateur archivists have tools at their disposal to keep interesting content alive online. For those who do not want to go through that trouble, the Internet Archive seems to be here to stay and Archive Team is obviously working on a backup of the Internet Archive itself.