From: Sverre B. <sv...@us...> - 2005-10-20 13:28:22
Update of /cvsroot/archive-access/archive-access/projects/wera/src/articles
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv2004

Added Files:
    what-is-wera.xml
Log Message:
New file

--- NEW FILE: what-is-wera.xml ---
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN" "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd">
<article>
  <title>What is Wera?</title>
  <articleinfo>
    <releaseinfo>$Id$</releaseinfo>
    <author>
      <surname>Bang</surname>
      <firstname>Sverre</firstname>
    </author>
  </articleinfo>
  <section>
    <title>Introduction</title>
    <para>WERA (WEb ARchive Access) is an archive viewer application that provides Internet Archive Wayback Machine-like access to web archive collections, together with full-text search and easy navigation between different versions of a web page.</para>
    <para>The Wera search interface is shown below.</para>
    <figure>
      <title>Wera Search</title>
      <mediaobject>
        <imageobject>
          <imagedata fileref="images/searchresult1.png" />
        </imageobject>
      </mediaobject>
    </figure>
    <para>When the user clicks the Timeline link of a specific hit, the Timeline View (shown below) appears. Each version (timestamp) of the given URL is marked along the timeline. The user may navigate between versions by clicking directly in the timeline or by clicking the first, previous, next and last arrows.</para>
    <figure>
      <title>Wera Timeline View</title>
      <mediaobject>
        <imageobject>
          <imagedata fileref="images/timeline1.png" />
        </imageobject>
      </mediaobject>
    </figure>
  </section>
  <section>
    <title>Wera simple setup</title>
    <para>In the simplest setup, Wera, NutchWax, the web archive (ARC files) and the interface to the web archive (arcretriever) are all installed on the same machine.
This is illustrated in the figure below.</para>
    <figure>
      <title>Wera overview</title>
      <mediaobject>
        <imageobject>
          <imagedata fileref="images/wera1.png" />
        </imageobject>
      </mediaobject>
    </figure>
    <para>Explanation of the figure above:</para>
    <itemizedlist>
      <listitem>
        <para>The user submits a query in the Wera search WUI.</para>
      </listitem>
      <listitem>
        <para>Based on the submitted query, Wera constructs a search request and sends it (1) to NutchWax (an HTTP GET request, e.g. http://localhost:8082/nutchwax/opensearch?query=lux&amp;start=0&amp;hitsPerPage=10&amp;hitsPerDup=1&amp;dedupField=exacturl)</para>
      </listitem>
      <listitem>
        <para>NutchWax constructs an <ulink url="http://opensearch.a9.com/">A9 OpenSearch RSS</ulink> (XML) formatted result set and sends it as a reply to Wera (2).</para>
      </listitem>
      <listitem>
        <para>Wera formats the result set for output to the user. For each hit, Wera sends two new queries to NutchWax (1, 2) to determine the number of versions matching the query and the total number of versions (this functionality may be disabled in the Wera configuration to reduce the query load on NutchWax).</para>
      </listitem>
      <listitem>
        <para>When the user clicks the Timeline link of a hit, two things happen:</para>
        <itemizedlist>
          <listitem>
            <para>Wera executes a search on <emphasis>exacturl</emphasis> in order to display the timeline with all the available versions (timestamps) of the given URL marked along the line.</para>
          </listitem>
          <listitem>
            <para>Wera executes searches on <emphasis>exacturl</emphasis> to find the version closest to the timestamp submitted as a parameter to the timeline view script (1, 2). For that particular version, Wera constructs a request to the arcretriever containing the name of the ARC file where the version resides and the offset within that file where the version is stored (the ARC name and offset are stored in the index).
Wera then requests and receives the archived resource (3, 4) from the arcretriever (request example: http://localhost:8082/arcretriever/arcretriever?reqtype=getfile&amp;aid=5902508/IAH-20051004171809-00000-test). If the resource is of type text/html (according to the result set from NutchWax), a JavaScript link rewriter is inserted into the resource to ensure that links point to Wera rather than out to the Internet. Before Wera delivers the resource to the user's browser, header information on content type and encoding is set according to the values received in the NutchWax result set. This ensures that the user's browser renders the resource correctly.</para>
            <note>
              <para>A resource of type text/html will often contain inline references to images etc. Provided the JavaScript link rewriter does its job on these, the step above is repeated for each of them.</para>
            </note>
          </listitem>
        </itemizedlist>
      </listitem>
    </itemizedlist>
    <section>
      <title>Practical use</title>
      <para>The figure below shows a more likely setup of Wera and its surroundings: the web archive, NutchWax and Wera are located on different machines. All the ARC files reside on hosts A1 to An, and all of them have been indexed by NutchWax beforehand (see the NutchWax documentation for details on indexing).</para>
      <figure>
        <title>Wera interfacing several archive nodes - Currently unsupported</title>
        <mediaobject>
          <imageobject>
            <imagedata fileref="images/wera2.png" />
          </imageobject>
        </mediaobject>
      </figure>
      <para>So how do we make Wera aware of which arcretriever to fetch a given resource from before displaying it in the timeline view? Each resource in an ARC collection has to be marked with the collection name in the index. For example, in the figure above all resources in ARC files on A1 would be tagged A1, resources on A2 would be tagged A2, etc.
In the Wera configuration, each collection has to be mapped to a given arcretriever.</para>
      <note>
        <para>Wera does not currently support direct mapping between collection and retriever. Such mapping will be added in a later release. However, it does support mapping between a collection and another Wera installation. See below for background and details.</para>
      </note>
      <para>The original vision for the NwaToolset (the predecessor of Wera) was to enable search across the different Nordic Web Archives and provide seamless navigation within the different archives. Searching across the different indexes was solved by using <ulink url="http://fastsearch.com/">Fast Search &amp; Transfer</ulink>'s multi-node architecture. To enable Wera to retrieve a particular document with a given aid from the right archive, the collection field was introduced. The Wera config file would hold the mapping from collection to archive (or rather, Wera installation).</para>
      <para>Another reason to include the collection field was to ensure that the actual link rewriting was done by the owner of the document. Each archive holder would have to set up their own NwaToolset Access Module. When one Access Module requested a document from a remote archive, the remote Access Module would make the necessary changes to the document before delivering it to the calling Access Module. This ensured that the owner had full control over what was delivered to the calling site and could thus treat the document in accordance with local policies rather than the policies of the calling site.
The figure below illustrates the currently supported use of mapping between collections and archive nodes.</para>
      <figure>
        <title>Wera interfacing several archive nodes - Currently supported</title>
        <mediaobject>
          <imageobject>
            <imagedata fileref="images/wera3.png" />
          </imageobject>
        </mediaobject>
      </figure>
      <para>In the Wera installation at W1, the different collections indexed in NutchWax are mapped to the corresponding Wera installations at W2 to Wn. When the timeline view on W1 encounters a resource located on a different node (e.g. the collection mapping points to the Wera installation at W2), it requests that resource from the Wera installation at W2. Wera at W2 fetches the resource from its retriever and makes the necessary changes to the file before delivering it to Wera at W1 (e.g. it inserts the JavaScript link rewriter or rewrites the links server side). When Wera at W1 receives the file, it does an additional rewrite so that the links point to itself rather than to W2's Wera.</para>
      <para>Of course, all the Wera installations might reside on the same host, which in effect gives the same solution as described in the previous example (direct mapping between collection and retriever). It will, however, introduce some extra overhead in the form of unnecessary HTTP traffic and file parsing.</para>
    </section>
  </section>
</article>
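
The request flow described in the article can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Wera's actual code. The two base URLs and the query parameters are copied from the example requests in the article. The function names, the reading of "aid" as "offset/ARC file name" (inferred from the single example request), and the collection-to-installation mapping with its example.org hosts are all hypothetical.

```python
from urllib.parse import urlencode

# Base URLs copied from the example requests in the article.
NUTCHWAX_OPENSEARCH = "http://localhost:8082/nutchwax/opensearch"
ARCRETRIEVER = "http://localhost:8082/arcretriever/arcretriever"

# Hypothetical mapping from collection name to the Wera installation
# that serves it. The host names are placeholders, not real endpoints.
COLLECTION_TO_WERA = {
    "A1": "http://w1.example.org/wera",
    "A2": "http://w2.example.org/wera",
}


def search_url(query, start=0, hits_per_page=10, hits_per_dup=1,
               dedup_field="exacturl"):
    """Build the HTTP GET URL for a NutchWax OpenSearch query."""
    params = {
        "query": query,
        "start": start,
        "hitsPerPage": hits_per_page,
        "hitsPerDup": hits_per_dup,
        "dedupField": dedup_field,
    }
    return NUTCHWAX_OPENSEARCH + "?" + urlencode(params)


def retrieve_url(offset, arc_name):
    """Build the arcretriever URL for one archived resource.

    Assumption: the aid is "<offset>/<ARC file name>", as the single
    example request in the article suggests.
    """
    aid = f"{offset}/{arc_name}"
    # safe="/" leaves the slash inside the aid unescaped, as in the example.
    return ARCRETRIEVER + "?" + urlencode(
        {"reqtype": "getfile", "aid": aid}, safe="/")


def wera_for_collection(collection):
    """Return the Wera installation mapped to a collection.

    Raises KeyError when the collection has no mapping configured.
    """
    return COLLECTION_TO_WERA[collection]


if __name__ == "__main__":
    print(search_url("lux"))
    print(retrieve_url(5902508, "IAH-20051004171809-00000-test"))
    print(wera_for_collection("A2"))
```

With the defaults shown, search_url("lux") and retrieve_url(5902508, "IAH-20051004171809-00000-test") reproduce the two example requests quoted in the article.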