This document contains information to get you started with "SiteStory Web Archive", our Transactional Web Archive software. It is aimed primarily at developers and contains simple installation instructions for the mod_sitestory Apache filter and the SiteStory Web Archive server running under Tomcat.
This document also contains a few commands to verify that the software is running, as well as sections on the setup for long-term archiving, which involves data distribution between SiteStory and Internet Archive (IA) Wayback archives.
To get the latest version of the SiteStory Web Archive software, download the stable release from the Maven repository here or get the latest continuous build here. For mod_sitestory.c, see the source repository.
Data is pushed from your Apache Web Server to the SiteStory Web Archive by a special Apache filter, which we call mod_sitestory. More information on Apache filters and mod_sitestory can be found here. A hands-on example of the mod_sitestory.c installation can be found here: for Fedora Core and Ubuntu Linux. Installing the mod_sitestory filter consists of executing the following command and then editing the Apache configuration file:
sudo /usr/sbin/apxs -c -i -a mod_sitestory.c
Add the following lines to the Apache configuration file:
<IfModule sitestory_module>
    EnableArchiving On
    ArchiveHost www.myarchivename.com
    ArchivePort 8080
    ArchivePath /[appname]/put/
    ArchiveTimeGate http://www.myarchivename.com/[appname]/timegate/
    EnableIP On
    Excluded /search /test
</IfModule>
Here are the meanings for each of the fields:
Please note that only HTTP GET requests whose responses have status code 200, 302, or 303 are archived.
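The archiving rule in the note above can be written as a small predicate. This is only a sketch of the rule as stated in this document, not code taken from mod_sitestory.c; the name `would_be_archived` is hypothetical, and the sketch ignores the `EnableArchiving` and `Excluded` directives.

```python
# Sketch of the documented rule: GET requests with a 200, 302, or 303 response.
# Hypothetical helper; it does not reproduce mod_sitestory's actual logic.
ARCHIVED_STATUS_CODES = frozenset({200, 302, 303})


def would_be_archived(method: str, status: int) -> bool:
    """Return True if a request/response pair matches the documented rule."""
    return method.upper() == "GET" and status in ARCHIVED_STATUS_CODES


if __name__ == "__main__":
    for method, status in [("GET", 200), ("GET", 404), ("POST", 200), ("GET", 303)]:
        print(method, status, would_be_archived(method, status))
```

A predicate like this can help when reading the Apache access log to estimate which requests should have reached the archive.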
Setting up a SiteStory Web Archive server under Tomcat is straightforward. The server is contained in a single WAR file named 'sitestory.war', so installation consists only of deploying the WAR file and editing a configuration file. You can rename sitestory.war to [appname].war if you want.
Once you have copied the WAR file into the [TOMCAT-HOME]/webapps directory, you can find the ta.properties file in the [TOMCAT-HOME]/webapps/sitestory/WEB-INF/classes directory (or under [appname] instead of sitestory if you renamed the WAR file).
#rest.min.grizzly.threads=7
#rest.max.grizzly.threads=128
#warcfiles.unload.dir=/storage/twa/db/warcfiles
#ta.warcwriterpool.maxwait=20000
#ta.warcwriterpool.maxactive=3
ta.index=gov.lanl.archive.index.bdb.IndexImplB
ta.storage.basedir=/storage/twa/db
#ta.index.basedir=/storage/twa/db/bdbindex
#put.ip.1=127.0.0.1
#put.ip.2=192.12.184.6
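A file like the one above can be sanity-checked before Tomcat is restarted. The sketch below parses simple key=value lines, skips commented-out entries, and reports whether `ta.storage.basedir` points to an existing, empty directory. The helper names `load_properties` and `check_basedir` are hypothetical, and the parser handles only the simple syntax shown here, not the full Java properties format.

```python
from pathlib import Path


def load_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_basedir(props: dict[str, str]) -> str:
    """Describe the state of the directory named by ta.storage.basedir."""
    basedir = props.get("ta.storage.basedir")
    if basedir is None:
        return "ta.storage.basedir is not set"
    path = Path(basedir)
    if not path.is_dir():
        return f"{basedir} does not exist or is not a directory"
    if any(path.iterdir()):
        return f"{basedir} exists but is not empty"
    return f"{basedir} exists and is empty"


if __name__ == "__main__":
    sample = "#put.ip.1=127.0.0.1\nta.storage.basedir=/storage/twa/db\n"
    print(check_basedir(load_properties(sample)))
```

To check a real installation, the contents of ta.properties can be read with `Path(...).read_text()` and passed to `load_properties`.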
Change the value of ta.storage.basedir to point to an existing directory, which should be empty to start with. All commented-out parameters are optional. Here are the meanings for each of the fields:
These changes require a restart of the Tomcat server. To check that the PUT service is working, access any page on your content Apache server and then look at the SiteStory Tomcat logs. You can also check that the directory structure has been created under ta.storage.basedir.
[user@host ta]$ pwd
/storage/ta/db
[user@host db]$ ls
bdbindex storage
[user@host db]$ cd bdbindex
[user@host bdbindex]$ ls
00000005.jdb 00000006.jdb je.info.0 je.info.0.1 je.info.0.lck je.lck je.properties
[user@host bdbindex]$ cd /storage/ta/db/storage
[user@host storage]$ ls
00 06 0c 12 18 1e 24 2a 30 36 3c 42 48 4e 54 5a 60 66 6c 72 78 7e 84 8a 90 96 9c a2 a8 ae b4 ba c0 c6 cc d2 d8 de e4 ea f0 f6 fc
01 07 0d 13 19 1f 25 2b 31 37 3d 43 49 4f 55 5b 61 67 6d 73 79 7f 85 8b 91 97 9d a3 a9 af b5 bb c1 c7 cd d3 d9 df e5 eb f1 f7 fd
02 08 0e 14 1a 20 26 2c 32 38 3e 44 4a 50 56 5c 62 68 6e 74 7a 80 86 8c 92 98 9e a4 aa b0 b6 bc c2 c8 ce d4 da e0 e6 ec f2 f8 fe
03 09 0f 15 1b 21 27 2d 33 39 3f 45 4b 51 57 5d 63 69 6f 75 7b 81 87 8d 93 99 9f a5 ab b1 b7 bd c3 c9 cf d5 db e1 e7 ed f3 f9 ff
04 0a 10 16 1c 22 28 2e 34 3a 40 46 4c 52 58 5e 64 6a 70 76 7c 82 88 8e 94 9a a0 a6 ac b2 b8 be c4 ca d0 d6 dc e2 e8 ee f4 fa
05 0b 11 17 1d 23 29 2f 35 3b 41 47 4d 53 59 5f 65 6b 71 77 7d 83 89 8f 95 9b a1 a7 ad b3 b9 bf c5 cb d1 d7 dd e3 e9 ef f5 fb
[user@host storage]$ cd 00
[user@host 00]$ ls
01 0c 12 17 1d 25 2d 35 42 49 51 58 5e 62 66 74 7a 84 88 8c 91 98 9e a2 ad b1 b7 be c3 c9 cf d3 d8 e2 e7 f0 f9
04 0d 13 1a 1e 26 30 37 45 4c 53 5b 5f 64 67 75 7e 85 8a 8d 95 9a a0 a9 ae b4 ba c1 c4 ca d1 d5 dc e4 e9 f2 fa
05 0e 14 1c 21 2b 31 3b 48 4f 57 5c 60 65 69 76 81 87 8b 8f 97 9d a1 aa af b5 bc c2 c7 ce d2 d7 e1 e5 ed f3 fc
[user@host]$ cd 01
[user@host]$ ls
3da4-350d-4896-8fa8-9772c63a914d.body
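The same check can be scripted. The sketch below counts the two levels of two-character hexadecimal directories and the `.body` files beneath them. The layout is inferred only from the listing above, so treat it as an assumption; the function name `summarize_storage` is hypothetical.

```python
import re
from pathlib import Path

# Directory names in the listing above are two lowercase hex characters.
HEX_DIR = re.compile(r"^[0-9a-f]{2}$")


def summarize_storage(storage_dir: str) -> dict[str, int]:
    """Count hex-named directories (two levels) and .body files under storage_dir."""
    root = Path(storage_dir)
    first = [d for d in root.iterdir() if d.is_dir() and HEX_DIR.match(d.name)]
    second = [
        s for d in first for s in d.iterdir() if s.is_dir() and HEX_DIR.match(s.name)
    ]
    bodies = [
        f for s in second for f in s.iterdir() if f.is_file() and f.suffix == ".body"
    ]
    return {
        "first_level": len(first),
        "second_level": len(second),
        "body_files": len(bodies),
    }
```

For the installation shown above, this would be called as `summarize_storage("/storage/ta/db/storage")`; a non-zero `body_files` count after browsing a few pages suggests that the PUT service is storing responses.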
The SiteStory Web Archive provides REST-style services to access and manage content. The following services let Memento clients access archived content in the SiteStory Web Archive. Please refer to the Memento protocol for additional information.
curl -D headers.txt -H 'Accept-Datetime: Mon, 29 Sep 2008 12:00:04 GMT' \
  http://[host]:[port]/sitestory/timegate/[original_url]
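The Accept-Datetime value must be an HTTP date whose weekday matches the calendar date, which is easy to get wrong by hand. The sketch below builds the same TimeGate request without sending it, letting the standard library format the date. The function name `build_timegate_request` and the base URL `http://localhost:8080/sitestory` are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_timegate_request(base_url: str, original_url: str, when: datetime) -> Request:
    """Build (but do not send) a TimeGate request for original_url at 'when'."""
    # usegmt=True needs an aware UTC datetime; a naive 'when' is treated as
    # local time by astimezone().
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    url = f"{base_url.rstrip('/')}/timegate/{original_url}"
    return Request(url, headers={"Accept-Datetime": http_date})


if __name__ == "__main__":
    request = build_timegate_request(
        "http://localhost:8080/sitestory",  # hypothetical archive location
        "http://myhost/test",
        datetime(2008, 9, 29, 12, 0, 4, tzinfo=timezone.utc),
    )
    print(request.full_url)
    print(request.header_items())
```

Passing the request to `urllib.request.urlopen` would send it; `urlopen` follows redirects by default, so the response would normally come from whichever archived copy the TimeGate selects.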
curl -D headers.txt \
  http://[host]:[port]/sitestory/memento/20110311000508/[original_url]

curl \
  http://[host]:[port]/sitestory/timemap/link/[original_url]
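The `link` segment in the TimeMap URL suggests that the response uses the link format described by the Memento protocol. Under that assumption, the sketch below extracts the entries from such a response. The sample text is invented for illustration and is not real SiteStory output, and the names `parse_timemap` and `memento_links` are hypothetical.

```python
import re

# One link-value: <uri> followed by ;name="value" or ;name=value parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,"]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,"]+))')

# Hypothetical TimeMap body, written by hand for this example.
SAMPLE = (
    '<http://myhost/test>;rel="original",\n'
    '<http://archive.example/sitestory/timemap/link/http://myhost/test>'
    ';rel="self";type="application/link-format",\n'
    '<http://archive.example/sitestory/memento/20110311000508/http://myhost/test>'
    ';rel="first memento";datetime="Fri, 11 Mar 2011 00:05:08 GMT"\n'
)


def parse_timemap(text: str) -> list[dict[str, str]]:
    """Return one dict per link, with the target under 'uri' plus its parameters."""
    links = []
    for uri, params in LINK_RE.findall(text):
        entry = {"uri": uri}
        for name, quoted, bare in PARAM_RE.findall(params):
            entry[name] = quoted or bare
        links.append(entry)
    return links


def memento_links(text: str) -> list[dict[str, str]]:
    """Keep only links whose rel value includes the 'memento' relation."""
    return [link for link in parse_timemap(text) if "memento" in link.get("rel", "").split()]


if __name__ == "__main__":
    for link in memento_links(SAMPLE):
        print(link["datetime"], link["uri"])
```

Quoted parameter values are matched as a unit, so the commas inside a `datetime` value do not split an entry.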
The SiteStory Web Archive provides a service to offload HTTP request/response data into WARC files. This can serve as a long-term data management tool for moving data to Wayback archives.
curl \
  "http://[archivehost]:[port]/sitestory/warcunload/20120329000508/*"

curl \
  "http://[archivehost]:[port]/sitestory/delete/20120410000508/*"

curl \
  "http://[archivehost]:[port]/sitestory/delete/20120410000508/http://myhost/test"

curl \
  "http://[archivehost]:[port]/sitestory/delete/*/http://myhost/test"
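The URLs above share one pattern: an action, a 14-digit timestamp or `*`, and a target URL or `*`. The sketch below only builds such URLs so that they can be reviewed before anything is sent. The names `archive_timestamp` and `management_url` are hypothetical, and the time zone the archive applies to the timestamp is not stated here, so the datetime is formatted as given.

```python
from datetime import datetime
from typing import Optional


def archive_timestamp(when: datetime) -> str:
    """Format a datetime as the 14-digit YYYYMMDDhhmmss form used in the URLs."""
    return when.strftime("%Y%m%d%H%M%S")


def management_url(
    base_url: str, action: str, when: Optional[datetime] = None, target: str = "*"
) -> str:
    """Build a warcunload or delete URL; when=None yields '*' for the timestamp."""
    if action not in ("warcunload", "delete"):
        raise ValueError(f"unsupported action: {action}")
    # Only delete is shown with '*' in place of the timestamp in the examples above.
    if action == "warcunload" and when is None:
        raise ValueError("warcunload is only shown with an explicit timestamp")
    stamp = "*" if when is None else archive_timestamp(when)
    return f"{base_url.rstrip('/')}/{action}/{stamp}/{target}"


if __name__ == "__main__":
    base = "http://archive.example:8080/sitestory"  # hypothetical archive location
    print(management_url(base, "warcunload", datetime(2012, 3, 29, 0, 5, 8)))
    print(management_url(base, "delete", datetime(2012, 4, 10, 0, 5, 8)))
    print(management_url(base, "delete", None, "http://myhost/test"))
```

Because delete requests remove archived data, printing and checking the URL first is a cheap safeguard.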
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="60000" maxKeepAliveRequests="15" maxThreads="300" redirectPort="8443" />
# You can alter the size of the db file (default is 10M) to have more compact storage.
je.log.fileMax=500000000
# You can tune the memory parameters.
je.evictor.useMemoryFloor=65
je.maxMemory=120000000
# Set to clean logs periodically.
je.env.runCleaner=true
je.env.runEvictor=true
je.env.recovery=true
je.env.sharedLatches=false