Idea and Concept

Transactional Archiving consists of selectively capturing and storing transactions that take place between a web client (browser) and a web server.

Most existing web archives periodically send out bots to crawl the content of web servers. Each crawl yields an observation of a server's content at the time of crawling. Since the crawling frequency is generally not aligned with the rate at which a server's resources change, this approach typically cannot capture every version of every resource. The resulting archive may provide an acceptable overview of a server's evolution over time, but it will not provide an accurate representation of the server's entire history.

A SiteStory Web Archive, by contrast, captures every version of a resource as it is requested by a browser. The resulting archive is effectively representative of a server's entire history, with one caveat: versions of resources that are never requested by a browser are never archived. Adding SiteStory archiving capabilities to an Apache Web Content Server does not significantly affect its performance.
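The difference between the two approaches can be illustrated with a small simulation. This is only a sketch, not SiteStory code: the timeline, the crawl interval, the request times, and the helper names version_at and captured_versions are all hypothetical, chosen to show which versions each approach would observe.

```python
from bisect import bisect_right


def version_at(change_times, t):
    """Index of the version that is live at time t (0 is the initial version).

    change_times must be sorted; a change at exactly time t counts as applied.
    """
    return bisect_right(change_times, t)


def captured_versions(change_times, observation_times):
    """Set of version indices seen at the given observation times."""
    return {version_at(change_times, t) for t in observation_times}


# Hypothetical timeline in hours: the resource changes at these moments,
# giving 7 versions in total (indices 0 through 6).
change_times = [3, 5, 6, 14, 15, 22]

# Crawl-based archiving: a bot visits every 12 hours.
crawl_times = [0, 12, 24]

# Transactional archiving: every browser request triggers a capture.
# Nobody requests the resource between hour 14 and hour 15.
request_times = [1, 4, 5.5, 7, 16, 23]

crawled = captured_versions(change_times, crawl_times)
transactional = captured_versions(change_times, request_times)

print("crawler saw versions:      ", sorted(crawled))
print("transactional saw versions:", sorted(transactional))
```

In this made-up timeline the crawler records three of the seven versions, while the transactional approach records every version that was served to a browser. Version 4, which was live for an hour without being requested, is missed by both.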

The SiteStory Web Archive offers the following capabilities:

  • Dynamic Archiving of your Apache Web Content Server
  • Archive accessible via the Memento protocol
  • Archival data can be offloaded to WARC files
  • Archival data can be uploaded into an instance of the Internet Archive's Wayback software
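As an illustration of Memento access, the sketch below builds a datetime-negotiation request. In the Memento protocol (RFC 7089), a client sends an Accept-Datetime header to a TimeGate, which redirects to the archived version closest to that time. The TimeGate base URL and the convention of appending the original URL to it are assumptions made for this example; the actual path depends on how a given SiteStory instance is deployed. The helper name build_timegate_request is likewise hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_timegate_request(timegate_base, original_url, when):
    """Build (but do not send) a Memento datetime-negotiation request.

    timegate_base is a hypothetical TimeGate prefix; `when` must be a
    timezone-aware datetime.
    """
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(
        timegate_base + original_url,
        headers={"Accept-Datetime": accept_datetime},
    )


# Assumed values for illustration only.
TIMEGATE = "http://localhost:8080/sitestory/timegate/"
ORIGINAL = "http://www.example.org/index.html"

req = build_timegate_request(
    TIMEGATE,
    ORIGINAL,
    datetime(2011, 3, 1, 12, 0, 0, tzinfo=timezone.utc),
)

print(req.full_url)
# urllib stores header names capitalized as "Accept-datetime".
print(req.get_header("Accept-datetime"))
```

Passing the request to urllib.request.urlopen would send it. A Memento-compliant archive answers with the selected version and reports its capture time in a Memento-Datetime response header.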

Architecture Overview

Figure 1