[Archive-access-cvs] archive-access/projects/wera/src/articles what-is-wera.xml,NONE,1.1

Update of /cvsroot/archive-access/archive-access/projects/wera/src/articles
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv2004

Added Files:
	what-is-wera.xml 
Log Message:
New file

--- NEW FILE: what-is-wera.xml ---
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
"http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd">
<article>
  <title>What is Wera?</title>

  <articleinfo>
    <releaseinfo>$Id$</releaseinfo>

    <author>
      <surname>Bang</surname>

      <firstname>Sverre</firstname>
    </author>
  </articleinfo>

  <section>
    <title>Introduction</title>

    <para>WERA (WEb ARchive Access) is an archive viewer application that
    provides Internet Archive Wayback Machine-like access to web archive
    collections, together with full-text search and easy navigation between
    different versions of a web page.</para>

    <para>The Wera search interface is shown below.</para>

    <figure>
      <title>Wera Search</title>

      <mediaobject>
        <imageobject>
          <imagedata fileref="images/searchresult1.png" />
        </imageobject>
      </mediaobject>
    </figure>

    <para>When the user clicks the Timeline link of a specific hit, the
    Timeline View appears (shown below). Each version (timestamp) of the
    given URL is marked along the timeline. The user can navigate between the
    different versions by clicking directly in the timeline or by clicking
    the first, previous, next and last arrows.</para>

    <figure>
      <title>Wera Timeline View</title>

      <mediaobject>
        <imageobject>
          <imagedata fileref="images/timeline1.png" />
        </imageobject>
      </mediaobject>
    </figure>
  </section>

  <section>
    <title>Wera simple setup</title>

    <para>In the simplest setup, Wera, NutchWax, the web archive (ARC files)
    and the interface to the web archive (arcretriever) are all installed on
    the same machine. This is illustrated in the figure below.</para>

    <figure>
      <title>Wera overview</title>

      <mediaobject>
        <imageobject>
          <imagedata fileref="images/wera1.png" />
        </imageobject>
      </mediaobject>
    </figure>

    <para>Explanation of the figure above:</para>

    <itemizedlist>
      <listitem>
        <para>The user submits a query in the Wera search WUI.</para>
      </listitem>

      <listitem>
        <para>Based on the submitted query, Wera constructs a search request
        and sends it (1) to NutchWax as an HTTP GET request, e.g.
        http://localhost:8082/nutchwax/opensearch?query=lux&amp;start=0&amp;hitsPerPage=10&amp;hitsPerDup=1&amp;dedupField=exacturl</para>
      </listitem>

      <listitem>
        <para>NutchWax constructs an <ulink url="http://opensearch.a9.com/">A9
        OpenSearch RSS</ulink> (XML) formatted result set and sends it as a
        reply to Wera (2).</para>
      </listitem>

      <listitem>
        <para>Wera formats the result set for output to the user. For each
        hit, Wera sends two additional queries to NutchWax (1, 2) to determine
        the number of versions matching the query and the total number of
        versions (this functionality can be disabled in the Wera
        configuration to reduce the query load on NutchWax).</para>
      </listitem>

      <listitem>
        <para>When the user clicks the Timeline link of a hit, two things
        happen:</para>

        <itemizedlist>
          <listitem>
            <para>Wera executes a search on <emphasis>exacturl</emphasis> in
            order to display the timeline with all the available versions
            (timestamps) of the given URL marked along the line.</para>
          </listitem>

          <listitem>
            <para>Wera executes searches on <emphasis>exacturl</emphasis> to
            find the version closest to the timestamp submitted as a
            parameter to the timeline view script (1, 2). For that particular
            version, Wera constructs a request to the arcretriever containing
            the name of the ARC file where the version resides and the offset
            within that file where the version is stored (the ARC name and
            offset are stored in the index). Wera then requests and receives
            the archived resource (3, 4) from the arcretriever (request
            example:
            http://localhost:8082/arcretriever/arcretriever?reqtype=getfile&amp;aid=5902508/IAH-20051004171809-00000-test).
            If the resource is of type text/html (as reported in the result
            set from NutchWax), a JavaScript link rewriter is inserted into
            the resource to ensure that links point to Wera rather than out to
            the Internet. Before Wera delivers the resource to the user's
            browser, it sets the content type and encoding headers according
            to the values received in the NutchWax result set. This ensures
            that the user's browser renders the resource correctly.</para>

            <note>
              <para>A resource of type text/html will often contain inline
              references to images and other resources. Provided the
              JavaScript link rewriter does its job on these, the step above
              is repeated for each of them.</para>
            </note>
          </listitem>
        </itemizedlist>
      </listitem>
    </itemizedlist>
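
    <para>As a rough illustration of the request flow described above, the
    following Python sketch builds the two example requests and picks the
    version closest to a requested timestamp. The base URLs and parameter
    names are copied from the examples above. The function names, the reading
    of the aid value as offset followed by ARC name, and the 14-digit
    timestamp format are assumptions made for illustration; this is not Wera's
    actual code.</para>

    <programlisting language="python"><![CDATA[
```python
from datetime import datetime
from urllib.parse import urlencode

# Base URLs taken from the request examples in the text.
NUTCHWAX = "http://localhost:8082/nutchwax/opensearch"
ARCRETRIEVER = "http://localhost:8082/arcretriever/arcretriever"


def search_url(query, start=0, hits_per_page=10):
    """Build the NutchWax search request (step 1)."""
    params = [
        ("query", query),
        ("start", start),
        ("hitsPerPage", hits_per_page),
        ("hitsPerDup", 1),
        ("dedupField", "exacturl"),
    ]
    return NUTCHWAX + "?" + urlencode(params)


def retrieve_url(offset, arc_name):
    """Build the arcretriever request (step 3).

    Assumption: the aid value has the form "<offset>/<ARC name>".
    """
    params = [("reqtype", "getfile"), ("aid", f"{offset}/{arc_name}")]
    # safe="/" keeps the slash inside aid unescaped, as in the example URL.
    return ARCRETRIEVER + "?" + urlencode(params, safe="/")


def closest_version(timestamps, wanted):
    """Return the timestamp nearest to `wanted`.

    Assumption: timestamps are 14-digit YYYYMMDDHHMMSS strings.
    """
    def parse(ts):
        return datetime.strptime(ts, "%Y%m%d%H%M%S")

    target = parse(wanted)
    return min(timestamps, key=lambda ts: abs(parse(ts) - target))
```
]]></programlisting>

    <para>Calling the hypothetical <literal>search_url("lux")</literal>
    reproduces the example search request shown in the list above.</para>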

    <section>
      <title>Practical use</title>

      <para>The figure below shows a more likely setup of Wera and its
      surroundings: the web archive, NutchWax and Wera are located on
      different machines. All the ARC files reside on hosts A1 to An, and all
      of them have been indexed by NutchWax beforehand (see the NutchWax
      documentation for details on indexing).</para>

      <figure>
        <title>Wera interfacing several archive nodes - Currently
        unsupported</title>

        <mediaobject>
          <imageobject>
            <imagedata fileref="images/wera2.png" />
          </imageobject>
        </mediaobject>
      </figure>

      <para>So how do we make Wera aware of which arcretriever to fetch a
      given resource from before displaying it in the timeline view? Each
      resource in an ARC collection has to be marked with its collection name
      in the index. For example, in the figure all resources in ARC files on
      A1 would be tagged A1, resources on A2 would be tagged A2, and so on. In
      the Wera configuration each collection has to be mapped to a given
      arcretriever.</para>
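
      <para>The idea of mapping collections to retrievers can be sketched in
      Python as a simple lookup table. The host names, the
      <literal>RETRIEVERS</literal> table and the
      <literal>retriever_for</literal> function are hypothetical and only
      illustrate the concept; they do not reflect the format of the Wera
      configuration file.</para>

      <programlisting language="python"><![CDATA[
```python
# Hypothetical mapping from collection name (as tagged in the index)
# to the arcretriever that serves the ARC files of that collection.
RETRIEVERS = {
    "A1": "http://a1.example.org/arcretriever/arcretriever",
    "A2": "http://a2.example.org/arcretriever/arcretriever",
}


def retriever_for(collection, retrievers=RETRIEVERS):
    """Return the retriever URL configured for a collection."""
    if collection not in retrievers:
        raise ValueError(f"no retriever configured for collection {collection!r}")
    return retrievers[collection]
```
]]></programlisting>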

      <note>
        <para>Wera does not currently support direct mapping between
        collection and retriever. Such a mapping will be added in a later
        release. However, Wera does support mapping between collections and
        other Wera installations. See below for background and details.</para>
      </note>

      <para>The original vision for the NwaToolset (the predecessor of Wera)
      was to enable search across the different Nordic Web Archives and
      provide seamless navigation within the different archives. Searching
      across the different indexes was solved by using <ulink
      url="http://fastsearch.com/">Fast Search &amp; Transfer</ulink>'s
      multi-node architecture. To enable Wera to retrieve a particular
      document with a given aid from the right archive, the collection field
      was introduced. The Wera config file would hold the mapping from
      collection to archive (or rather, to Wera installation).</para>

      <para>Another reason for including the collection field was to ensure
      that the actual link rewriting was done by the owner of the document.
      Each archive holder would have to set up their own NwaToolset Access
      Module. When one Access Module requested a document from a remote
      archive, the remote Access Module would make the necessary changes to
      the document before delivering it to the calling Access Module. This
      gave the owner full control over what was delivered to the calling
      site, so that the document could be treated in accordance with local
      policies rather than the policies of the calling site. The figure below
      illustrates the currently supported use of mapping between collection
      and archive nodes.</para>

      <figure>
        <title>Wera interfacing several archive nodes - Currently
        supported</title>

        <mediaobject>
          <imageobject>
            <imagedata fileref="images/wera3.png" />
          </imageobject>
        </mediaobject>
      </figure>

      <para>In the Wera installation at W1, the different collections indexed
      in NutchWax are mapped to the corresponding Wera installations at W2 to
      Wn. When the timeline view on W1 encounters a resource located on a
      different node (e.g. the collection mapping points to the Wera
      installation at W2), it requests that resource from the Wera
      installation at W2. Wera at W2 fetches the resource from its retriever
      and makes the necessary changes to the file before delivering it to
      Wera at W1 (e.g. inserts the JavaScript link rewriter or rewrites the
      links server side). When Wera at W1 receives the file, it performs an
      additional rewrite so that the links point to itself rather than to
      W2's Wera.</para>
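
      <para>The additional rewrite at W1 can be pictured, in a much simplified
      form, as replacing the base URL of the remote Wera installation with the
      local one. The Python sketch below assumes that rewritten links start
      with the base URL of the Wera installation that produced them; the base
      URLs, the <literal>result.php</literal> path and the
      <literal>repoint_links</literal> function are hypothetical.</para>

      <programlisting language="python"><![CDATA[
```python
# Hypothetical base URLs of two Wera installations.
W1 = "http://w1.example.org/wera/"
W2 = "http://w2.example.org/wera/"


def repoint_links(html, remote_base, local_base):
    """Replace every occurrence of the remote base URL with the local one."""
    return html.replace(remote_base, local_base)


# A fragment as it might be delivered by Wera at W2 (hypothetical path).
page_from_w2 = '<a href="http://w2.example.org/wera/result.php?url=x">next</a>'
page_for_user = repoint_links(page_from_w2, W2, W1)
```
]]></programlisting>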

      <para>Of course, all the Wera installations might reside on the same
      host, which in effect gives the same solution as described in the
      previous example (direct mapping between collection and retriever).
      It will, however, introduce some extra overhead in the form of
      unnecessary HTTP traffic and file parsing.</para>
    </section>
  </section>
</article>