Adding Transactional Archiving to your Apache Web Server.

This document contains information to get you started with "SiteStory Web Archive", our Transactional Web Archive software. It is aimed primarily at developers and contains simple installation instructions for:

  • "mod_sitestory", our SiteStory Apache Module and
  • a single SiteStory Web Archive server.

This document also contains a few commands to verify that the software is running, as well as sections on the setup for long-term archiving, which involves data distribution between SiteStory and Internet Archive (IA) Wayback archives.

Prerequisites

Supported Platforms

  • GNU/Linux is supported as a development and production platform for both mod_sitestory and SiteStory Web Archive.
  • Other platforms have not been tested, but if the required software (see below) runs on your platform, mod_sitestory and SiteStory Web Archive may run too.

Required Software

  • mod_sitestory, the SiteStory Apache Module, requires the Apache Web Server, version 2.2 or higher, to be the primary web server that serves the content to be archived. In addition, the Apache Extension Tool (apxs) needs to be installed. The Apache website has detailed instructions on how to install both the web server and the extension tool.
  • SiteStory Web Archive runs on Java, version 1.6 or greater (JDK 6 or greater). It runs as a web application under Tomcat 6 or greater.

Download

To get the latest version of the SiteStory Web Archive software, download the stable release from the Maven repository here, or get the latest continuous build here. For mod_sitestory.c, see the source repository.

Installation of SiteStory Apache Module

Data is pushed from your Apache Web Server to the SiteStory Web Archive by an Apache filter, which we call mod_sitestory. More information on Apache filters and mod_sitestory can be found here. Hands-on examples of the mod_sitestory.c installation can be found here: for Fedora Core and Ubuntu Linux. Installing the mod_sitestory filter consists of two steps: building the module with apxs and editing the Apache configuration file. First, execute the following command (-c compiles the module, -i installs it, and -a activates it by adding a LoadModule line to the Apache configuration):

sudo /usr/sbin/apxs -c -i -a mod_sitestory.c

Then add the following lines to the Apache configuration file:

 <IfModule sitestory_module>
    EnableArchiving On
    ArchiveHost www.myarchivename.com
    ArchivePort 8080
    ArchivePath /[appname]/put/
    ArchiveTimeGate http://www.myarchivename.com/[appname]/timegate/
    EnableIP On  
    Excluded /search /test 
 </IfModule>

Here are the meanings for each of the fields:

EnableArchiving
enables or disables archiving. Possible values: On/Off.
ArchiveHost
the host name of the transactional archive.
ArchivePort
the port number of the transactional archive.
ArchivePath
the path to the put interface of the archive. The full put URL of the transactional archive is constructed as http://[archivehost]:[port]/[appname]/put .
ArchiveTimeGate
the base URL of the Memento TimeGate service at the archive.
EnableIP
enables or disables recording of the client’s request IP address. Possible values: On/Off.
Excluded
list of directories excluded from archiving (optional parameter). For example, if you want to exclude http://mycontenthost/search from archiving, specify /search as Excluded. All content of the listed directories, including their child directories, will be excluded.
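
As an illustration of how the fields fit together, the following sketch prints a configuration block for a given archive host, port and application name. The function name render_sitestory_conf and its arguments are made up for this example and are not part of SiteStory. The TimeGate URL follows the sample above, which omits the port; adjust it if your archive is only reachable on the ArchivePort.

```shell
#!/bin/sh
# Print a mod_sitestory <IfModule> block.
# Usage: render_sitestory_conf ARCHIVE_HOST ARCHIVE_PORT APPNAME
# (hypothetical helper, shown only to illustrate the directives above)
render_sitestory_conf() {
    host=$1 port=$2 app=$3
    cat <<EOF
<IfModule sitestory_module>
    EnableArchiving On
    ArchiveHost $host
    ArchivePort $port
    ArchivePath /$app/put/
    ArchiveTimeGate http://$host/$app/timegate/
    EnableIP On
</IfModule>
EOF
}
```

Calling render_sitestory_conf www.myarchivename.com 8080 sitestory prints a block that can be pasted into the Apache configuration file. After restarting Apache, running apachectl -M (where available) lists the loaded modules, which is one way to confirm that sitestory_module is active.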

Please note that only HTTP GET requests with response codes 200, 302, or 303 are archived.
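
The archiving rules can be summarized in a small sketch. It is not the module's code, only a restatement of the documented behavior: the function name would_archive is hypothetical, and the prefix matching used for Excluded directories is an assumption about how "including their child directories" is applied.

```shell
#!/bin/sh
# Return 0 if a request would be archived under the documented rules:
# GET method, response code 200/302/303, path not under an Excluded directory.
# Usage: would_archive METHOD CODE REQUEST_PATH [EXCLUDED_DIR...]
would_archive() {
    method=$1 code=$2 req_path=$3
    shift 3
    [ "$method" = "GET" ] || return 1
    case "$code" in
        200|302|303) ;;
        *) return 1 ;;
    esac
    for dir in "$@"; do
        case "$req_path" in
            # Assumed semantics: the directory itself and everything below it.
            "$dir"|"$dir"/*) return 1 ;;
        esac
    done
    return 0
}
```

With the sample configuration above, would_archive GET 200 /index.html /search /test succeeds, while would_archive GET 200 /search/results /search /test and would_archive POST 200 /index.html /search /test both fail.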

Installation of SiteStory Web Archive under Tomcat

Setting up a SiteStory Web Archive server under Tomcat is straightforward. The server is contained in a single WAR file named 'sitestory.war'. The installation therefore only consists of editing a configuration file. You can rename the sitestory.war to your [appname].war if you want.

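Once you have copied the WAR file into the [TOMCAT-HOME]/webapps directory and Tomcat has unpacked it, you can find the ta.properties file in the [TOMCAT-HOME]/webapps/sitestory/WEB-INF/classes directory (or [TOMCAT-HOME]/webapps/[appname]/WEB-INF/classes if you renamed the WAR file).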
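
The copy step can be sketched as a small helper. The function name deploy_sitestory and its arguments are hypothetical. The sketch assumes Tomcat's default behavior of unpacking WAR files placed in webapps, so the printed path only exists after Tomcat has deployed the application.

```shell
#!/bin/sh
# Copy the SiteStory WAR into Tomcat's webapps directory under the chosen
# application name, then print where ta.properties is expected to appear.
# Usage: deploy_sitestory WAR_FILE TOMCAT_HOME [APPNAME]
deploy_sitestory() {
    war=$1 tomcat_home=$2 app=${3:-sitestory}
    cp "$war" "$tomcat_home/webapps/$app.war" || return 1
    echo "$tomcat_home/webapps/$app/WEB-INF/classes/ta.properties"
}
```

For example, deploy_sitestory sitestory.war /opt/tomcat myarchive copies the file to /opt/tomcat/webapps/myarchive.war, where /opt/tomcat stands in for your [TOMCAT-HOME].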

                 
#rest.min.grizzly.threads=7
#rest.max.grizzly.threads=128
#warcfiles.unload.dir=/storage/twa/db/warcfiles
#ta.warcwriterpool.maxwait=20000
#ta.warcwriterpool.maxactive=3
ta.index=gov.lanl.archive.index.bdb.IndexImplB
ta.storage.basedir=/storage/twa/db
#ta.index.basedir=/storage/twa/db/bdbindex
#put.ip.1=127.0.0.1
#put.ip.2=192.12.184.6

Set ta.storage.basedir to an existing directory, which should be empty to start with. All commented-out parameters are optional. Here are the meanings for each of the fields:

ta.storage.basedir
This is the directory for the storage of HTTP response stream files. By default Berkeley DB indexes will be created here as well unless ta.index.basedir is specified. Make sure that Tomcat has permissions to create subdirectories in this directory.
ta.index.basedir
The location to store Berkeley DB indexes, which contain the transaction log of every page accessed by the content Apache server.
warcfiles.unload.dir
The directory into which data is offloaded in WARC format for migration to the Wayback archive.
put.ip.1
IP address of the Apache content server. Use it to prevent other servers from putting content into your archive. A list of IPs can be specified if needed, as with put.ip.1 and put.ip.2 in the sample above.
ta.index
Class name of the index store implementation. Currently only one implementation exists, based on Berkeley DB.
ta.warcwriterpool.maxactive
Parameter for the Internet Archive WARCWriter. Specifies how many WARCWriters to run in parallel.
ta.warcwriterpool.maxwait
Parameter for the Internet Archive WARCWriter. Specifies how long to wait for a WARCWriter from the pool of WARCWriters.
rest.min.grizzly.threads
Minimum number of threads for a standalone installation as a Grizzly server. Not used under a Tomcat installation.
rest.max.grizzly.threads
Maximum number of threads for a standalone installation as a Grizzly server. Not used under a Tomcat installation.
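
A minimal configuration needs only the two uncommented keys from the sample. The following sketch creates the storage directory and writes such a file. The function name write_ta_properties is hypothetical, and the output path is whatever location ta.properties has in your installation.

```shell
#!/bin/sh
# Create the storage directory and write a minimal ta.properties containing
# only the two keys that are uncommented in the sample above.
# Usage: write_ta_properties STORAGE_DIR OUTPUT_FILE
write_ta_properties() {
    basedir=$1 out=$2
    mkdir -p "$basedir" || return 1
    cat > "$out" <<EOF
ta.index=gov.lanl.archive.index.bdb.IndexImplB
ta.storage.basedir=$basedir
EOF
}
```

Remember that the user running Tomcat needs write permission on the storage directory, as noted for ta.storage.basedir.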

These changes require a restart of the Tomcat server. To check that the PUT service is working, access any page of your content Apache server and look at the SiteStory Tomcat logs. You can also check that the directory structure has been created under ta.storage.basedir:

                                                                                                                                                                          
[user@host db]$ pwd
/storage/ta/db                                                                                                                                                                  
[user@host db]$ ls                                                                                                                                                                    
bdbindex  storage                                                                                                                                                                         
[user@host db]$ cd bdbindex                                                                                                                                                           
[user@host bdbindex]$ ls                                                                                                                                                                  
00000005.jdb  00000006.jdb  je.info.0  je.info.0.1  je.info.0.lck  je.lck  je.properties                                                                                                  
[user@host bdbindex]$ cd /storage/ta/db/storage                                                                                                                                 
[user@host storage]$ ls                                                                                                                                                                   
00  06  0c  12  18  1e  24  2a  30  36  3c  42  48  4e  54  5a  60  66  6c  72  78  7e  84  8a  90  96  9c  a2  a8  ae  b4  ba  c0  c6  cc  d2  d8  de  e4  ea  f0  f6  fc                
01  07  0d  13  19  1f  25  2b  31  37  3d  43  49  4f  55  5b  61  67  6d  73  79  7f  85  8b  91  97  9d  a3  a9  af  b5  bb  c1  c7  cd  d3  d9  df  e5  eb  f1  f7  fd                
02  08  0e  14  1a  20  26  2c  32  38  3e  44  4a  50  56  5c  62  68  6e  74  7a  80  86  8c  92  98  9e  a4  aa  b0  b6  bc  c2  c8  ce  d4  da  e0  e6  ec  f2  f8  fe                
03  09  0f  15  1b  21  27  2d  33  39  3f  45  4b  51  57  5d  63  69  6f  75  7b  81  87  8d  93  99  9f  a5  ab  b1  b7  bd  c3  c9  cf  d5  db  e1  e7  ed  f3  f9  ff                
04  0a  10  16  1c  22  28  2e  34  3a  40  46  4c  52  58  5e  64  6a  70  76  7c  82  88  8e  94  9a  a0  a6  ac  b2  b8  be  c4  ca  d0  d6  dc  e2  e8  ee  f4  fa                    
05  0b  11  17  1d  23  29  2f  35  3b  41  47  4d  53  59  5f  65  6b  71  77  7d  83  89  8f  95  9b  a1  a7  ad  b3  b9  bf  c5  cb  d1  d7  dd  e3  e9  ef  f5  fb                    
[user@host storage]$ cd 00                                                                                                                                                                
[user@host 00]$ ls                                                                                                                                                                        
01  0c  12  17  1d  25  2d  35  42  49  51  58  5e  62  66  74  7a  84  88  8c  91  98  9e  a2  ad  b1  b7  be  c3  c9  cf  d3  d8  e2  e7  f0  f9                                        
04  0d  13  1a  1e  26  30  37  45  4c  53  5b  5f  64  67  75  7e  85  8a  8d  95  9a  a0  a9  ae  b4  ba  c1  c4  ca  d1  d5  dc  e4  e9  f2  fa                                        
05  0e  14  1c  21  2b  31  3b  48  4f  57  5c  60  65  69  76  81  87  8b  8f  97  9d  a1  aa  af  b5  bc  c2  c7  ce  d2  d7  e1  e5  ed  f3  fc                                        
[user@host]$ cd 01                                                                                                                                                                        
[user@host]$ ls                                                                                                                                                                           
3da4-350d-4896-8fa8-9772c63a914d.body                                                                                                                                                     
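
The manual check above can be scripted along these lines. The subdirectory names bdbindex and storage and the .body suffix are taken from the listing; the function name check_storage_layout is made up for this example. If ta.index.basedir points elsewhere, the bdbindex check would need to be adapted.

```shell
#!/bin/sh
# Verify that the archive has created its on-disk layout under the storage
# base directory and print the number of stored response bodies.
# Usage: check_storage_layout STORAGE_BASEDIR
check_storage_layout() {
    basedir=$1
    for sub in bdbindex storage; do
        if [ ! -d "$basedir/$sub" ]; then
            echo "missing: $basedir/$sub" >&2
            return 1
        fi
    done
    # Count files ending in .body anywhere below storage/.
    find "$basedir/storage" -type f -name '*.body' | wc -l | tr -d ' '
}
```

A non-zero count after requesting a page from the content server suggests that responses are reaching the archive.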

Memento Interface to SiteStory Web Archive

The SiteStory Web Archive provides REST-style services to access and manage content. To let Memento clients access archived content from the SiteStory Web Archive, the following services are provided. Please refer to the Memento protocol for additional information.

TimeGate
To access the TimeGate from SiteStory Web Archive (installed as sitestory.war) for the resource with URI [original_url] that is served by the Content Server:
curl -D headers.txt -H 'Accept-Datetime: Mon, 29 Sep 2008 12:00:04 GMT' \
http://[host]:[port]/sitestory/timegate/[original_url]
     
Memento
To retrieve a Memento (archived version of a resource) from SiteStory for the resource with URI [original_url] that is served by the Content Server:
curl -D headers.txt \
http://[host]:[port]/sitestory/memento/20110311000508/[original_url] 
In this URI, 20110311000508 represents the archival date/time of the resource with the URI [original_url] expressed in the form YYYYMMDDHHMMSS.
TimeMap
To retrieve a link-value formatted TimeMap from SiteStory for the resource with URI [original_url] that is served by the Content Server:
curl \
http://[host]:[port]/sitestory/timemap/link/[original_url]
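
The three URL patterns above can be assembled with small helpers such as the following. The function names are hypothetical; BASE stands for http://[host]:[port]/[appname]. The timestamp check only verifies the 14-digit YYYYMMDDHHMMSS shape, not that the date itself is valid.

```shell
#!/bin/sh
# Build SiteStory Memento service URLs from a base URL and an original URL.
# Usage: timegate_url BASE ORIGINAL_URL
timegate_url() { echo "$1/timegate/$2"; }

# Usage: timemap_url BASE ORIGINAL_URL
timemap_url() { echo "$1/timemap/link/$2"; }

# Usage: memento_url BASE TIMESTAMP ORIGINAL_URL
# TIMESTAMP must consist of exactly 14 digits (YYYYMMDDHHMMSS).
memento_url() {
    case "$2" in
        *[!0-9]*|'') echo "invalid timestamp: $2" >&2; return 1 ;;
    esac
    [ "${#2}" -eq 14 ] || { echo "invalid timestamp: $2" >&2; return 1; }
    echo "$1/memento/$2/$3"
}
```

The result can be passed straight to curl, for example: curl -D headers.txt "$(memento_url http://localhost:8080/sitestory 20110311000508 http://example.org/)", where the host, port and original URL are placeholders.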

Exporting and Deleting services of SiteStory Web Archive

The SiteStory Web Archive provides a service to offload HTTP request/response data into the WARC file format. This procedure can serve as a long-term data management tool for offloading data to Wayback archives.

WARCUnload
curl \
'http://[archivehost]:[port]/sitestory/warcunload/20120329000508/*'
The archived data before the 20120329000508 date will be written to WARC files for the Wayback archive, in the directory specified by warcfiles.unload.dir in ta.properties. We decided to omit revisit (by digest) records from the WARC files to avoid stress on the Wayback archive. To compensate, we added special metadata records describing the time interval during which the recorded response body bitstream stayed the same (based on a digest calculated over the bitstream), as well as the number of page hits during that period. See a sample WARC file here.
Delete
curl \
'http://[archivehost]:[port]/sitestory/delete/20120410000508/*'
The archived data before the 20120410000508 date will be deleted from SiteStory.
curl \
'http://[archivehost]:[port]/sitestory/delete/20120410000508/http://myhost/test'
The archived data with url="http://myhost/test" before the 20120410000508 date will be deleted from SiteStory.
curl \
'http://[archivehost]:[port]/sitestory/delete/*/http://myhost/test'
All archived data with url="http://myhost/test" will be deleted from SiteStory. Please use this with caution to avoid breaking the referential integrity of the archive.
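
Because a delete cannot be undone from the client side, it can help to build and inspect the URL before sending it. The following sketch does that for both services. The function name manage_url is hypothetical, and it accepts only the combinations shown above: a 14-digit cutoff for either action, or the * wildcard cutoff for delete.

```shell
#!/bin/sh
# Build a warcunload or delete URL after validating the arguments.
# Usage: manage_url BASE ACTION CUTOFF TARGET
#   BASE   is http://[archivehost]:[port]/[appname]
#   ACTION is warcunload or delete
#   CUTOFF is a 14-digit timestamp, or '*' (delete only)
#   TARGET is an original URL or '*'
manage_url() {
    base=$1 action=$2 cutoff=$3 target=$4
    case "$action" in
        warcunload|delete) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    if [ "$cutoff" = '*' ]; then
        # The wildcard cutoff is only shown for delete in the text above.
        [ "$action" = delete ] || { echo "wildcard cutoff needs delete" >&2; return 1; }
    else
        case "$cutoff" in
            *[!0-9]*|'') echo "invalid cutoff: $cutoff" >&2; return 1 ;;
        esac
        [ "${#cutoff}" -eq 14 ] || { echo "invalid cutoff: $cutoff" >&2; return 1; }
    fi
    printf '%s\n' "$base/$action/$cutoff/$target"
}
```

Always quote the * arguments so that the shell does not expand them, and print the URL first: manage_url http://localhost:8080/sitestory delete 20120410000508 '*'. Once it looks right, pass it to curl inside double quotes.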

PUT protocol

Optimization

connectionTimeout of Tomcat
If your Apache content server has pages that take a long time to load, we recommend raising Tomcat's connectionTimeout parameter in server.xml, for example:
        <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="60000" maxKeepAliveRequests="15" maxThreads="300"
               redirectPort="8443" />  
       
     
A timeout value that is too low may affect the completion of the HTTP PUT requests to the SiteStory Web Archive (used by the mod_sitestory filter).
je.properties
Copy je.properties to your Berkeley DB index directory (../bdbindex). You can add the following parameters:
 # Size of each db file (default is 10M); a larger value gives more compact storage.
 je.log.fileMax=500000000
 # Memory tuning parameters
 je.evictor.useMemoryFloor=65
 je.maxMemory=120000000
 # Clean logs periodically
 je.env.runCleaner=true
 je.env.runEvictor=true
 je.env.recovery=true
 je.env.sharedLatches=false