Proposals for Uniform Access to Raw Mementos

Last updated: May 11, 2017

Please contribute further to this proposal by continuing the discussion at the Memento RFC Extensions GitHub repository. This document will be updated as this proposal matures.

Abstract

Most web archives augment Mementos when presenting them to the user, often for usability or legal purposes. Research efforts and software projects need access to the original captured "raw" Mementos. So that users and software do not need to resort to archive-specific solutions, this document describes proposals for uniform methods of discovering raw Mementos in a web archive. This document is based on two prior blog posts that explored access to raw Mementos in web archives [1], [2].

Most web archives augment Mementos when presenting them to the user, often for usability or legal purposes, as shown in Figures 1 and 2. Additionally, some archives rewrite links to allow navigation within an archive, shown in Figure 3. This way the end user can visit other pages within the same archive from the same time period. Smaller archives, because of the size of their collections, do not benefit as much from these rewritten links. Of course, for Memento users, these rewritten links are not really required.

Figure 1: An example of the PRONI archive augmenting a Memento for usability purposes. These augmentations, outlined in red, were not in the original resource.

Figure 2: An example of augmented content from the UK National Web Archive, showing that the titles of pages are augmented with the string [ARCHIVED CONTENT] so that users will not confuse them with live search results.

Figure 3: An example of rewritten links. The Memento on the left has been augmented by the web archive and thus its links are altered to intra-archive URIs. The Memento on the right has no such augmentations.

In many cases, access to the original, unaltered "raw" content is needed. This is, for example, the case for some research studies, like those at Archives Unleashed, that require the original HTTP response headers and the original unaltered content. Unaltered content is also needed to replay the original web content in various software projects, like TimeTravel Reconstruct.

Some web archives, such as the Internet Archive, preserve the original content and original headers that existed at the time of capture of a Memento. This content is currently available, but requires knowledge of the software and configuration of each archive in advance. Here we discuss a proposal for permitting uniform access to this content, regardless of archive configuration or software. Our proposal relies upon the Prefer header introduced in RFC 7240.

As of this writing, some web archives provide the following dimensions of raw content for a Memento:
  • original-content
  • original-links
  • echo-original-headers
In this section, we describe these dimensions in terms of what the client desires of the Memento.

A client desiring original-content seeks a Memento that contains the same content as existed on the web at the moment it was captured by the web archive. The client's needs are satisfied if the Memento contains no additional text or code, including banners, navigational elements, branding, JavaScript, CSS, or other additional embedded content. The text within the page must match the text that existed at the time of capture. Some web archives convert captured content from one content type to another (e.g., creating screenshots of HTML pages and storing them as PNG). To satisfy the client, the resulting Memento must have the same content-type as was encountered at the time of capture.

In the case of original-links, the client wants the original URIs in the returned Memento as they existed at the time of that Memento's capture. The term links here refers to links encountered anywhere in the document, including embedded resources such as JavaScript, images, and CSS. As such, the client does not want the web archive to rewrite any of those URIs. Some web archives change the links in a Memento to intra-archive URIs. Others alter the links to point to live web resources if the archive does not have a Memento for the linked content.

Clients may also desire the content of the original HTTP headers that existed at the time of capture of the Memento. To avoid issues with presenting stale headers to HTTP clients, servers can expose these headers in the Memento's HTTP response with a prefix, such as X-Archive-Orig-, followed by the name of the original header (e.g., X-Archive-Orig-Content-Type corresponds to the original Content-Type from the time of capture). In this case, the client prefers that the server echo-original-headers.
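
As a minimal sketch of how a client might consume such prefixed headers, the function below collects every response header that starts with the X-Archive-Orig- prefix and strips the prefix. The function name and the choice to lowercase header names are assumptions made for illustration; an archive may use a different prefix, which is why it is a parameter.

```python
PREFIX = "x-archive-orig-"  # prefix named in the text; other archives may differ


def original_headers(response_headers, prefix=PREFIX):
    """Return the echoed capture-time headers, keyed by lowercase original name."""
    originals = {}
    for name, value in response_headers.items():
        lowered = name.lower()  # HTTP header names are case-insensitive
        if lowered.startswith(prefix):
            originals[lowered[len(prefix):]] = value
    return originals
```

Given a response that carries both Content-Type and X-Archive-Orig-content-type, only the second is returned, under the key content-type, so the capture-time value cannot be confused with the header the archive itself sent.
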
The original-content and original-links dimensions can currently be satisfied by OpenWayback and pywb's id_ and im_ URI patterns (e.g., http://web.archive.org/web/20170415072537im_/bbc.com). The echo-original-headers dimension is satisfied by im_, but not consistently by id_. In some versions of OpenWayback, the id_ URI pattern produces the actual original headers, which are stale, and, as noted, may create issues with HTTP clients. In others, the id_ URI pattern acts like the im_ URI pattern. Not all web archives utilize the same playback engine or even the same version of a given engine. For example, the behavior of im_ and id_ at the Internet Archive is identical, but the two URI patterns produce different behaviors at Archive-It. The goal of this document is to propose a uniform method that does not require a user or software developer to create separate cases for each archive or playback engine.
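
To make the archive-specific workaround concrete, here is a sketch of the kind of URI rewriting a client has to do today. It assumes a URI-M whose path contains a 14-digit timestamp followed by an optional two-letter modifier, as in the example above; the function name is hypothetical, and whether a given archive honors the resulting URI is exactly the per-archive knowledge this proposal tries to remove.

```python
import re

# A 14-digit capture timestamp, an optional existing modifier such as "id_"
# or "im_", and the slash that precedes the original URI.
_TIMESTAMP = re.compile(r"/(\d{14})(?:[a-z]{2}_)?/")


def with_modifier(urim, modifier):
    """Return urim with modifier (e.g. "id_") attached to its timestamp."""
    result, count = _TIMESTAMP.subn(
        lambda m: f"/{m.group(1)}{modifier}/", urim, count=1
    )
    if count == 0:
        raise ValueError(f"no 14-digit timestamp found in {urim!r}")
    return result
```

Calling with_modifier on http://web.archive.org/web/20170415072537/bbc.com with "id_" yields the id_ form of that URI-M; an existing modifier is replaced rather than doubled.
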

The Prefer header, specified in RFC 7240, provides HTTP clients with the ability to specify preferences to influence the server's response. We propose that Memento clients use the Prefer header to specify which dimensions of rawness they desire.
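
The sketch below shows one way to split a Prefer (or Preference-Applied) header value into the dimension names used in this document. It is deliberately simplified: it discards any "=value" or ";parameter" part and does not handle quoted strings containing commas, which RFC 7240 permits. The function name is an assumption for illustration.

```python
def parse_preferences(header_value):
    """Split a Prefer or Preference-Applied value into lowercase tokens.

    Simplification: "=value" and ";parameter" parts are dropped, and quoted
    strings containing commas are not supported. The dimensions proposed in
    this document are bare tokens, so this is sufficient for them.
    """
    tokens = []
    for item in header_value.split(","):
        token = item.split(";", 1)[0].split("=", 1)[0].strip().lower()
        if token and token not in tokens:  # keep first occurrence only
            tokens.append(token)
    return tokens
```
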

There are two components of the Memento infrastructure that can produce a Memento that satisfies the user's preferences:
  • the TimeGate
  • the Memento itself
The following sections discuss how each of these options would operate as well as their pros and cons.

Figure 4 shows a simplified interaction between a client, a TimeGate, and a Memento. This section explains how the Prefer header could be used against a TimeGate to obtain a raw Memento.

Figure 4: Simplified diagram of using the HTTP Prefer Header Against a TimeGate
With the Prefer header, a client can issue the following request to a URI-G if they prefer a raw Memento in any of these dimensions. If a client desires a Memento with the echo-original-headers, then they would include Prefer: echo-original-headers in the HTTP request, as shown in Example 1.

Example 1: Request headers sent to a TimeGate asking for a Memento with only echo-original-headers

GET /timegate/http://www.example.com/ HTTP/1.1
Host: an.archive.org
Accept-Datetime: Sat, 24 Apr 2010 13:00:05 GMT
Prefer: echo-original-headers
Connection: close
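
A client could assemble the request headers of Example 1 along the following lines. This is a sketch only: the function name is hypothetical, no request is actually sent, and a datetime without timezone information would be interpreted as local time by astimezone.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_request_headers(when, preferences):
    """Build headers for a TimeGate request that asks for raw dimensions."""
    headers = {
        # Accept-Datetime carries an RFC 1123 date expressed in GMT.
        "Accept-Datetime": format_datetime(
            when.astimezone(timezone.utc), usegmt=True
        ),
    }
    if preferences:  # clients not seeking raw Mementos omit Prefer entirely
        headers["Prefer"] = ", ".join(preferences)
    return headers


example = timegate_request_headers(
    datetime(2010, 4, 24, 13, 0, 5, tzinfo=timezone.utc),
    ["echo-original-headers"],
)
```

Here example holds the Accept-Datetime and Prefer values shown in Example 1.
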
    
The TimeGate would then use the Preference-Applied header to indicate which dimensions are satisfied by the URI-M in the Location header of its response. Seen in the HTTP response of Example 2, the Preference-Applied header contains the value echo-original-headers, indicating that the URI listed in the Location header satisfies this dimension. A client can then make a decision based on (1) what it requested, and (2) what preferences are actually satisfied in the response.

Example 2: Response headers from a TimeGate for the request from Example 1

HTTP/1.1 302 Found
Date: Mon, 08 May 2017 17:07:16 GMT
Location: http://an.archive.org/all/20100414235211/http://www.example.com/
Vary: Accept-Datetime, Prefer
Last-Modified: Mon, 08 May 2017 17:06:44 GMT
Link: <http://www.example.com/>;rel="original",
  <http://an.archive.org/timemap/link/http://www.example.com/>;rel="timemap"; type="application/link-format"
Preference-Applied: echo-original-headers
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
Connection: close
    
A client can also indicate multiple preferences, separated by commas. In Example 3, a client issues the following request to a URI-G, stating their preference for all of these dimensions, using Prefer: original-content, original-links, echo-original-headers.

Example 3: Request headers sent to a TimeGate asking for a Memento with only original-content, original-links, and echo-original-headers

GET /timegate/http://www.example.com/ HTTP/1.1
Host: an.archive.org
Accept-Datetime: Sat, 24 Apr 2010 13:00:05 GMT
Prefer: original-content, original-links, echo-original-headers
Connection: close
    
Just as with one preference, the TimeGate would then use the Preference-Applied header to indicate which dimensions are satisfied by the URI-M in the Location header of its response. Seen in the HTTP response of Example 4, the Preference-Applied header contains the value original-content, original-links, echo-original-headers, indicating that the URI listed in the Location header satisfies these three dimensions.

Example 4: Response headers from a TimeGate for the request from Example 3

HTTP/1.1 302 Found
Date: Mon, 08 May 2017 17:07:16 GMT
Location: http://an.archive.org/all/20100414235211/http://www.example.com/
Vary: Accept-Datetime, Prefer
Last-Modified: Mon, 08 May 2017 17:06:44 GMT
Link: <http://www.example.com/>;rel="original",
  <http://an.archive.org/timemap/link/http://www.example.com/>;rel="timemap"; type="application/link-format"
Preference-Applied: original-content, original-links, echo-original-headers
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
Connection: close
    
Note that the values in the Preference-Applied header will depend upon the dimensions available for the given Memento. Because they reflect the preferences supported by a given Memento, these values may be a subset of the preferences specified in the request or may include additional preferences not requested by the client.
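
Because the applied preferences can be a subset or a superset of what was requested, a client has to compare the two header values before deciding how to proceed. The following sketch does that comparison with sets; the function name and the three-part return value are assumptions for illustration, not part of the proposal.

```python
def compare_preferences(requested, applied):
    """Compare a Prefer request value with a Preference-Applied response value.

    Returns three sets: preferences that were honored, preferences that were
    requested but not applied, and preferences applied without being requested.
    """
    def tokens(value):
        # A missing header (None) is treated as an empty list of preferences.
        return {t.strip().lower() for t in (value or "").split(",") if t.strip()}

    wanted, got = tokens(requested), tokens(applied)
    return wanted & got, wanted - got, got - wanted


honored, missing, extra = compare_preferences(
    "original-content, original-links",
    "original-links, echo-original-headers",
)
```

In this usage, honored contains original-links, missing contains original-content, and extra contains echo-original-headers. A client that strictly requires original-content would treat a non-empty missing set as a reason to look elsewhere.
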

The Vary header would also contain the Prefer value in addition to the accept-datetime value in order to indicate that clients can influence the TimeGate's response by using the Prefer header. This allows responses to be cached for requests that share the same options in their request headers.
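
To illustrate why Vary matters for caching, the sketch below derives a cache key from the request method, the URI, and the values of whichever request headers the response's Vary header names. It is a simplification made for this document: header values are compared as exact strings, and the special value "*" is not handled. The function name is hypothetical.

```python
def cache_key(method, uri, request_headers, vary):
    """Build a cache key from the request line plus the headers named in Vary."""
    lowered = {name.lower(): value.strip() for name, value in request_headers.items()}
    varying = sorted(name.strip().lower() for name in vary.split(",") if name.strip())
    # A header that is absent from the request contributes an empty string.
    selected = tuple((name, lowered.get(name, "")) for name in varying)
    return (method.upper(), uri, selected)
```

Two requests to the same TimeGate with the same Accept-Datetime but different Prefer values produce different keys under Vary: Accept-Datetime, Prefer, so a cache would not hand a raw redirect to a client that asked for the default one.
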

The next request would be a normal GET to the URI-M in the Location header.

Memento clients not seeking raw Mementos will just submit requests as usual, not including the Prefer header.

Pros:
  • Use of Prefer against a TimeGate allows for all negotiation to be handled at the same resource.

Cons:
  • This option tightly couples the handling of raw Mementos to the existing Memento framework. Archives will need to expose the dimensions of rawness that they support so that Memento aggregators can then evaluate the Mementos offered by each archive.
  • Because the TimeGate must redirect to a URI-M that satisfies the preferred dimensions of rawness, Mementos containing different dimensions of rawness will need to be identified by different URI-Ms.
  • With negotiation at the TimeGate, this option increases the complexity of the TimeMaps that supply data for aggregators and TimeGates. What if an archive only supports some of the dimensions of rawness for a subset of its holdings?
  • RFC 7089 states "It is the TimeGate server's responsibility to honor (or not) such content negotiation, and in doing so it MUST always first select a Memento that meets the user agent's datetime preference, and then consider honoring regular content negotiation for it." What about the dimensions of rawness? If a TimeGate receives a preference for echo-original-headers and other elements of content negotiation, which takes precedence and how should the TimeGate respond? Among Mementos from different archives, which answer should an aggregator provide?
  • Is negotiation needed for TimeMaps? If each dimension exists at a different URI-M, how does an aggregator supply a TimeMap with a certain subset of dimensions? Is it sufficient to mix dimensions if not all are present? Should there be strict TimeMaps that only list URI-Ms that specifically adhere to the stated preferences, even if there are raw Mementos that provide more preferences? Do TimeMaps now need to specify dimensions of rawness alongside each Memento entry?

Figure 5 shows a simplified interaction between a client, a TimeGate, and a Memento. This section explains how the Prefer header could be used directly against a Memento to obtain a raw representation of that Memento.

Figure 5: Simplified diagram of using the HTTP Prefer Header Against a Memento

Instead of asking the TimeGate to redirect to a raw Memento, a client could issue the following request directly to a URI-M if they prefer a raw Memento in any of these dimensions. In this scenario, a TimeGate plays no role in generating the response. If a client desires a Memento that echoes the original headers and contains original-content and original-links, then they would include Prefer: original-content, original-links, echo-original-headers in the HTTP request, as shown in Example 5. As with Option 1, the client can supply a single preference or several, separated by commas.

Example 5: Request headers sent directly to a URI-M asking for a Memento with original-content, original-links, and echo-original-headers

GET /web/20160721152544/http://www.example.com/ HTTP/1.1
Host: an.archive.org
Prefer: original-content, original-links, echo-original-headers
Connection: close
    

A server would then use the Preference-Applied header in its response to indicate which dimensions it had satisfied from the request of this URI-M. Seen in the HTTP response of Example 6, the Preference-Applied header contains the value original-content, original-links, echo-original-headers, indicating that the response satisfies these three dimensions.

Example 6: Response headers for a Memento located at http://an.archive.org/web/20160721152544/http://www.example.com/

HTTP/1.1 200 OK
Server: Tengine/2.1.0
Date: Thu, 21 Jul 2016 17:34:15 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 109672
Connection: keep-alive
Memento-Datetime: Thu, 21 Jul 2016 15:25:44 GMT
Content-Location: /web/20160721152544_raw/http://www.example.com/
Vary: prefer
Preference-Applied: original-content, original-links, echo-original-headers
Link: <http://www.example.com/>; rel="original", 
  <http://an.archive.org/web/timemap/link/http://www.example.com/>; rel="timemap"; type="application/link-format",
  <http://an.archive.org/web/http://www.example.com/>; rel="timegate"
X-Archive-Orig-content-type: text/html; charset=utf-8
X-Archive-Orig-vary: Accept-Encoding
X-Archive-Orig-connection: close
X-Archive-Orig-date: Thu, 21 Jul 2016 15:25:44 GMT
X-Archive-Orig-content-length: 109672
    
As before, the values in the Preference-Applied header may be a subset or superset of the preferences specified by the client, depending on the availability of a Memento satisfying those preferences in the web archive.

To satisfy the echo-original-headers preference, the server also includes the X-Archive-Orig-* headers that contain the values of the original headers at the time of capture.

The inclusion of the header Vary: prefer indicates that clients can influence the Memento's response by using the Prefer header. The response can then be cached for requests that have the same options in the request headers.

Optionally, a web archive may include the Content-Location response header to indicate the location of the specific URI-M that satisfies these preferences, if it exists.

If a server can only apply some of the preferences, then the server lists only the preferences that have actually been applied to the Memento in the Preference-Applied header.
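
On the server side, this rule could be sketched as follows. The function and parameter names are hypothetical: supported stands for the dimensions an archive can satisfy for a particular Memento, and defaults for any dimensions it satisfies even when the client sends no Prefer header. A real playback engine would also have to produce the matching representation, which is outside the scope of this sketch.

```python
def preference_applied(prefer_value, supported, defaults=()):
    """Return the Preference-Applied value for one Memento response, or None.

    supported: dimensions the archive can satisfy for this Memento.
    defaults: dimensions the archive satisfies even when not asked.
    """
    requested = [
        t.strip().lower() for t in (prefer_value or "").split(",") if t.strip()
    ]
    applied = list(defaults)
    for token in requested:
        # Unsupported or unknown preferences are simply left out.
        if token in supported and token not in applied:
            applied.append(token)
    return ", ".join(applied) if applied else None
```

For an archive that supports only original-links and echo-original-headers, a request for all three dimensions yields "original-links, echo-original-headers", and the client can see from the absence of original-content that the content may still be augmented.
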

If the client issues no Prefer header in the request, then the server can still use the Preference-Applied header to indicate which preferences are met by default. Again, the Vary header indicates that clients can influence the response via the use of the Prefer request header. The Content-Location header indicates the URI-M of the Memento. The response headers for such a default Memento from an archive like the Internet Archive are shown in Example 7. Because this web archive will echo-original-headers on default requests, we see that value used in the Preference-Applied header.

Example 7: Default response headers for a Memento located at http://an.archive.org/web/20160721152544/http://www.example.com/

HTTP/1.1 200 OK
Server: Tengine/2.1.0
Date: Thu, 21 Jul 2016 16:17:09 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 127383
Connection: keep-alive
Memento-Datetime: Thu, 21 Jul 2016 15:25:44 GMT
Content-Location: /web/20160721152544/http://www.example.com/
Vary: prefer
Preference-Applied: echo-original-headers
Link: <http://www.example.com/>; rel="original", 
  <http://an.archive.org/web/timemap/link/http://www.example.com/>; rel="timemap"; type="application/link-format",
  <http://an.archive.org/web/http://www.example.com/>; rel="timegate"
X-Archive-Orig-content-type: text/html; charset=utf-8
X-Archive-Orig-vary: Accept-Encoding
X-Archive-Orig-connection: close
X-Archive-Orig-date: Thu, 21 Jul 2016 15:25:44 GMT
X-Archive-Orig-content-length: 109672

Pros:
  • This solution is decoupled from the Memento Protocol.
  • If an archive adds a new dimension at a later date, then it only needs to be supported by the specific Mementos with this new dimension.
  • Requests to a single URI-M can result in responses containing different dimensions of rawness, thus the archive does not need to provide special TimeMaps for different dimensions, or combinations of dimensions.
  • This option saves on the number of requests if iterating through a TimeMap. For each URI-M, a client has one HTTP transaction instead of two. With option 1, the client would need to negotiate with the TimeGate for each URI-M, and then request the URI-M. With option 2, the client only needs to issue a request to each URI-M.
  • Archives control what information on rawness they expose.
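
The request savings mentioned in the list above can be made concrete with a small calculation. This sketch counts only the transactions described in the text (two per URI-M under option 1, one under option 2) and ignores the request for the TimeMap itself and any additional redirects; the function name is hypothetical.

```python
def requests_needed(memento_count, option):
    """Count HTTP requests for fetching a raw version of every URI-M in a TimeMap."""
    # Option 1: one TimeGate negotiation plus one Memento request per URI-M.
    # Option 2: one Memento request per URI-M.
    per_memento = {1: 2, 2: 1}[option]
    return memento_count * per_memento
```

For a TimeMap listing 500 URI-Ms, this gives 1000 requests under option 1 and 500 under option 2.
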
Cons:
  • Clients will need to separately send the Prefer header to the Memento after negotiating with the TimeGate.
  • The systems used by web archives must support this Prefer feature for all Mementos.