[Archive-access-cvs] archive-access/projects/wera/src/articles manual.xml,NONE,1.1

Update of /cvsroot/archive-access/archive-access/projects/wera/src/articles
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23369/wera/src/articles

Added Files:
	manual.xml 
Log Message:
First time add of wera.  Moved here from nwa.nb.no.


--- NEW FILE: manual.xml ---
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<article>
  <title>WERA Manual</title>

  <articleinfo>
    <releaseinfo>$Id: manual.xml,v 1.3 2005/07/15 17:12:14 sverreb Exp
    $</releaseinfo>

    <copyright>
      <year>2003, 2004</year>

      <holder>Royal Library in Stockholm</holder>

      <holder>Royal Library in Copenhagen</holder>

      <holder>Helsinki University Library in Finland</holder>

      <holder>National Library of Norway</holder>

      <holder>National and University Library of Iceland</holder>
    </copyright>

    <author>
      <surname>Bang</surname>

      <firstname>Sverre</firstname>
    </author>
  </articleinfo>

  <section>
    <title>Introduction</title>

    <para>WERA (Web ARchive Access) is a freely available solution for
    searching and navigating archived web document collections.</para>

    <para>A web archive may contain not only a large number of web documents
    but also several versions of the same web document (i.e. documents that
    were downloaded from the same URL). Potential users of WERA include
    anyone who has a web archive. Examples of such users may be:</para>

    <itemizedlist>
      <listitem>
        <para>National Libraries or other organisations collecting parts of
        the internet for long term preservation.</para>
      </listitem>

      <listitem>
        <para>Companies or organisations keeping a historical collection of
        their own web site and/or intranet.</para>
      </listitem>

      <listitem>
        <para>Private persons keeping a historical collection of their own web
        site.</para>
      </listitem>
    </itemizedlist>

    <para>Note that in the following text an archived file and a web page are
    not necessarily the same thing. What the user experiences as one web
    document may consist of several archived files (e.g. a web page that
    comprises the HTML file and its inline images).</para>

    <section>
      <title>Overview</title>

      <para>In order to use WERA for searching, browsing and navigating your
      archived web documents you will need some additional components. These
      are:</para>

      <itemizedlist>
        <listitem>
          <para>A Search Engine which holds a full-text index of the archived
          web documents. Currently the NutchWAX search engine is
          supported.</para>
        </listitem>

        <listitem>
          <para>A Document Retriever which serves as the interface between the
          Access module and the web archive. The Document Retriever delivers
          archived files and associated metadata to WERA upon request.</para>
        </listitem>
      </itemizedlist>

      <para>A key requirement for the web archive is that the contents of the
      web documents are stored unaltered and that a metadata set consisting of
      at least the original URL, timestamp and MIME type of the archived files
      is available.</para>

      <section>
        <title>NutchWAX</title>

        <para>Currently the Jakarta Lucene based NutchWAX search engine is
        supported. NutchWAX must (at the moment) be downloaded and installed
        separately. See http://archive-access.sourceforge.net/projects/nutch/
        for further information.</para>
      </section>

      <section>
        <title>WERA</title>

        <para>WERA provides the user with interfaces for searching, browsing
        and navigating the archived web pages.</para>

        <para>When the user submits a query, WERA uses the search engine to
        find the archived files containing text that satisfies the query.
        When the user asks for a specific URL, WERA returns the archived
        file with that particular URL (e.g. the archived file originally
        downloaded from the URL http://www.nb.no/index.html). Before the file
        is delivered to the user's browser, a piece of JavaScript is inserted
        into the file so that the inline links and references are altered by
        the browser to point into the archive rather than out to the
        Internet.</para>

        <para>The resulting web page is presented with a timeline at the top
        and the web document below it. The timeline queries the index for all
        archived versions of the web page and displays the timestamps
        graphically along the line.<figure>
            <title>Access</title>

            <mediaobject>
              <imageobject>
                <imagedata fileref="images/access.jpg" />
              </imageobject>
            </mediaobject>
          </figure></para>

        <para>Searching a web archive through WERA resembles using an Internet
        search engine like Google. An example of the WERA search interface is
        shown below.<figure>
            <title>Search result</title>

            <mediaobject>
              <imageobject>
                <imagedata fileref="images/searchresult.png" />
              </imageobject>
            </mediaobject>
          </figure></para>

        <para>Clicking the Overview link of a specific hit displays the dates
        of all versions found for the chosen URL (the overview does not
        indicate which of these versions actually satisfied the original
        query).</para>

        <figure>
          <title>Overview</title>

          <mediaobject>
            <imageobject>
              <imagedata fileref="images/overview.png" />
            </imageobject>
          </mediaobject>
        </figure>

        <para>Clicking one of the links in the Overview page takes the user
        to the Timeline view with the chosen version displayed. Clicking the
        Timeline link in the Search result page also takes the user to the
        Timeline page, but the version displayed is the most recent of the
        versions satisfying the query.</para>

        <figure>
          <title>Timeline</title>

          <mediaobject>
            <imageobject>
              <imagedata fileref="images/timeline.png" />
            </imageobject>
          </mediaobject>
        </figure>

        <para>When navigating from the Overview or the result list of a search
        interface, the URL of the chosen version is passed along and shown in
        the URL field. A URL may also be entered manually.</para>

        <para>Navigation between the different versions is done by directly
        clicking a specific point on the timeline, or by using the arrows
        first, previous, next and last.</para>

        <para>When entering the timeline view the resolution is set to auto.
        This means that the timeline automatically drills down to the
        resolution needed to display single versions along the line. The Auto
        checkbox may be unchecked in order to manually choose the resolution
        (choosing a different resolution when in auto also disables auto
        resolution).</para>
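
        <para>The auto behaviour described above can be pictured with a small
        Python sketch. It is not taken from the WERA source; it merely assumes
        that each version carries a 14-digit timestamp (YYYYMMDDhhmmss, as in
        the archival_time metadata field) and that the timeline offers
        resolutions from years down to seconds. The resolution names and the
        function <emphasis>auto_resolution</emphasis> are hypothetical.</para>

        <programlisting><![CDATA[
```python
# Hypothetical resolution levels: (name, number of leading timestamp digits).
# Timestamps are assumed to look like 20041102080756 (YYYYMMDDhhmmss).
RESOLUTIONS = [
    ("years", 4),
    ("months", 6),
    ("days", 8),
    ("hours", 10),
    ("minutes", 12),
    ("seconds", 14),
]


def auto_resolution(timestamps):
    """Return the coarsest resolution at which every version is distinct."""
    distinct = set(timestamps)
    for name, width in RESOLUTIONS:
        # Truncating a timestamp to `width` digits gives its slot on the line.
        buckets = {ts[:width] for ts in distinct}
        if len(buckets) == len(distinct):
            return name
    # Nothing separated the versions: fall back to the finest resolution.
    return RESOLUTIONS[-1][0]


versions = ["20041102080756", "20041105120000", "20050101000000"]
print(auto_resolution(versions))
```
]]></programlisting>

        <para>In this sketch two versions from the same month force the
        resolution down to days, which mirrors the idea of drilling down until
        single versions can be told apart.</para>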

        <para>It is also possible to perform a search from the Timeline by
        typing in a query term and pressing <emphasis>Go</emphasis>.</para>
      </section>
    </section>
  </section>

  <section>
    <title>Installation</title>

    <para>This chapter describes how to obtain, install and configure
    WERA.</para>

    <section>
      <title>Obtaining WERA</title>

      <para>The latest version of WERA may be downloaded from the WERA <ulink
      url="http://nwatoolset.sourceforge.net">home page</ulink> at
      SourceForge.</para>
    </section>

    <section>
      <title>Installing</title>

      <para>Information on how to distribute the different components of WERA
      on different hosts will be provided in a later version of WERA.</para>

      <section>
        <title>System Requirements</title>

        <para>WERA and the NWA Adapted Lucene search engine have been tested
        on different builds of <emphasis>RedHat</emphasis> (7.3, 8, AS2 etc.),
        <emphasis>Fedora</emphasis> and <emphasis>Suse</emphasis> Linux. There
        is no reason to believe that the system will not work on other
        Linux/Unix distributions. The following components are
        required:</para>

        <itemizedlist>
          <listitem>
            <para>A JVM</para>
          </listitem>

          <listitem>
            <para>Apache HTTP server with PHP 4.3 or 4.4 (make sure that XML
            support is enabled; see below for details). WERA will NOT work
            properly with PHP 5, because of the new object model in
            PHP 5.</para>

            <para>If PHP is not installed, the quickest solution may be to
            install XAMPP, http://www.apachefriends.org/en/xampp.html</para>

            <para><emphasis role="bold">PHP XML support:</emphasis></para>

            <para>XML support is needed by WERA to handle the search results
            returned from the NutchWAX search engine. To verify that XML
            support is enabled in PHP, store the following text in a PHP
            file (e.g. info.php) and save it in the Apache DocumentRoot
            directory:</para>

            <para><userinput>&lt;?php phpinfo(); ?&gt;</userinput></para>

            <para>Open
            <userinput>http://&lt;yourhost&gt;/info.php</userinput> in a
            browser and check that PHP has <emphasis
            role="bold">not</emphasis> been compiled with --disable-xml.</para>
          </listitem>

          <listitem>
            <para>Tomcat servlet container
            (http://jakarta.apache.org/tomcat/index.html). The ArcRetriever
            web app has been tested on v.5.0.27 and 5.0.28</para>
          </listitem>

          <listitem>
            <para>NutchWAX, a bundling of Nutch and extensions for searching
            Web Archive Collections (WACs):
            http://archive-access.sourceforge.net/projects/nutch/</para>
          </listitem>
        </itemizedlist>
      </section>

      <section>
        <title>Java Based Installer</title>

        <para>To install WERA do the following:</para>

        <itemizedlist>
          <listitem>
            <para>Download wera-x-y-z-installer.tar.gz from
            sourceforge.</para>
          </listitem>

          <listitem>
            <para>Unpack the gzipped tarball in a temporary directory on the
            host where you want WERA installed.</para>
          </listitem>

          <listitem>
            <para>Invoke the installer using <userinput>java -jar
            wera-x-y-z-installer.jar</userinput>.</para>
          </listitem>

          <listitem>
            <para>Follow the on-screen instructions.</para>
          </listitem>
        </itemizedlist>

        <para>The installer will configure WERA in accordance with the input
        you provide during the installation process. See the section on
        manual installation in order to view and change these settings (e.g.
        if NutchWAX and/or your ARC file collection reside on different hosts
        than WERA).</para>

        <para>If the machine you are installing on does not have X installed,
        or if you are invoking the installer over ssh and X port forwarding is
        not working properly, the installer should fall back to text mode. If
        this fails, try the manual install procedure.</para>
      </section>

      <section>
        <title>Manual installation</title>

        <para>To install WERA manually do the following:</para>

        <itemizedlist>
          <listitem>
            <para>Download wera-x-y-z-manual-install.tar.gz from
            sourceforge.</para>
          </listitem>

          <listitem>
            <para>Unpack the gzipped tarball into the Apache document root
            directory on the host where you want WERA installed.</para>
          </listitem>

          <listitem>
            <para>Move the file ArcRetriever.war from
            &lt;apacheWebRootDir&gt;/wera/ to the webapps directory of the
            Tomcat installation on the host where your ARC files
            reside.</para>
          </listitem>

          <listitem>
            <para>Edit the file &lt;apacheWebRootDir&gt;/wera/lib/config.inc
            (see below for details).</para>
          </listitem>
        </itemizedlist>

        <section>
          <title>Settings</title>

          <para>Settings for WERA can be found in the file
          &lt;apacheWebRootDir&gt;/wera/lib/config.inc. Edit this file in
          order to configure WERA for your environment. Parameters to
          adapt:</para>

          <table>
            <title>Settings in config.inc</title>

            <tgroup cols="2">
              <tbody>
                <row>
                  <entry>$conf_rootpath = "/opt/lampp/htdocs/wera";</entry>

                  <entry>Change this so that it corresponds with your
                  environment i.e. &lt;apacheWebRootDir&gt;/wera (you may of
                  course rename the extracted wera directory to something
                  else, and even choose to place it further down in the
                  directory structure)</entry>
                </row>

                <row>
                  <entry>$conf_searchengine_url =
                  "http://localhost:8080/nutchwax/opensearch";</entry>

                  <entry>Open the URL
                  http://&lt;nutchwaxhost&gt;:&lt;port&gt;/nutchwax/ and click
                  the RSS icon. The URL of the resulting page is the URL to
                  enter as conf_searchengine_url (do not include the query
                  part, i.e. the ? and everything following it). If NutchWAX
                  is installed on the same host as WERA and Tomcat is serving
                  on port 8080, the default setting should work.</entry>
                </row>

                <row>
                  <entry>$conf_aid_prefix = "/var/arcs/";</entry>

                  <entry>The current version of the ArcRetriever needs to know
                  where the ARC files are located. All the ARC files that you
                  indexed with NutchWAX should be placed in one directory. The
                  path of that directory goes into this parameter.</entry>
                </row>

                <row>
                  <entry>$conf_aid_suffix = ".arc.gz";</entry>

                  <entry>The suffix of the ARC files in the above
                  directory.</entry>
                </row>

                <row>
                  <entry>$document_retriever =
                  "http://localhost:8080/ArcRetriever/ArcRetriever";</entry>

                  <entry>Change the host name and port to point to the Tomcat
                  installation on the host where your ARC files
                  reside.</entry>
                </row>

                <row>
                  <entry>$conf_http_host = "http://localhost/wera";</entry>

                  <entry>Change <emphasis>localhost</emphasis> to the host
                  name of the machine where you are installing WERA. Add the
                  port number if different from 80
                  (&lt;hostname&gt;:&lt;port&gt;). If you renamed the wera
                  directory or unpacked it further down relative to
                  ApacheWebRoot, update this parameter accordingly.</entry>
                </row>
              </tbody>
            </tgroup>
          </table>
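
          <para>The rule for conf_searchengine_url (keep only the part of the
          URL in front of the ?) can be illustrated with a short Python sketch
          based on the standard urllib.parse module. The function name
          <emphasis>base_url</emphasis> and the query parameters in the sample
          URL are assumptions made for illustration; they are not taken from
          NutchWAX.</para>

          <programlisting><![CDATA[
```python
from urllib.parse import urlsplit, urlunsplit


def base_url(url):
    """Drop the query string and fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical URL as it might be copied from the browser's address bar.
copied = "http://localhost:8080/nutchwax/opensearch?query=wera&start=0"
print(base_url(copied))
```
]]></programlisting>

          <para>For the sample URL the result equals the default value of
          $conf_searchengine_url listed in the table.</para>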

          <para>There are other parameters to tweak as well, but for a simple
          setup of WERA the above settings should do. Information on setting
          other parameters will be prepared in later releases.</para>
        </section>
      </section>
    </section>
  </section>

  <section>
    <title>Using WERA</title>

    <para>After installing WERA you should go through the following
    steps.</para>

    <orderedlist>
      <listitem>
        <para>Test that the ArcRetriever is functioning correctly</para>
      </listitem>

      <listitem>
        <para>Check the search and navigate functionality</para>
      </listitem>
    </orderedlist>

    <para>These steps are described in more detail below.</para>

    <section>
      <title>Testing the Retriever</title>

      <para>In order to test the Retriever try accessing the following urls in
      a browser (or use <command>wget [URL]</command> from the command
      line):</para>

      <itemizedlist>
        <listitem>
          <para>http://&lt;hostname&gt;.&lt;domainname&gt;[:port]/&lt;retriever&gt;?reqtype=&lt;reqtype&gt;&amp;aid=&lt;aid&gt;</para>
        </listitem>
      </itemizedlist>

      <para>Where <emphasis>retriever</emphasis> is the retriever script doing
      the retrieval, <emphasis>reqtype</emphasis> is the request type and the
      <emphasis>aid</emphasis> is the unique identifier (within the archive)
      for a harvested file. The <emphasis>getmeta</emphasis> request will
      return archived technical metadata for the file in question and the
      <emphasis>getfile</emphasis> request will return the archived file
      itself.</para>

      <para>To find the aid of one particular document in your archive, open
      the URL http://&lt;nutchwaxhost&gt;:&lt;port&gt;/nutchwax/ and execute
      a query. Scroll down to the RSS icon and click it. For one particular
      result, copy the values of nutch:arcoffset and nutch:arcname and build
      the aid:
      &lt;arcoffset&gt;/&lt;conf_aid_prefix&gt;&lt;arcname&gt;&lt;conf_aid_suffix&gt;</para>
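
      <para>The aid pattern and the retriever request pattern can be assembled
      as in the following Python sketch. The function names are illustrative
      only. The sample values are those of the getmeta example in this
      section, assuming that conf_aid_prefix is /home/wera/arcs/ and that the
      slashes in the aid are sent unescaped, as that example suggests.</para>

      <programlisting><![CDATA[
```python
from urllib.parse import urlencode


def build_aid(arcoffset, arcname, aid_prefix, aid_suffix):
    """Compose <arcoffset>/<conf_aid_prefix><arcname><conf_aid_suffix>."""
    return f"{arcoffset}/{aid_prefix}{arcname}{aid_suffix}"


def retriever_url(retriever, aid, reqtype):
    """Build a retriever request; reqtype is 'getmeta' or 'getfile'."""
    # safe="/" leaves the slashes of the ARC path unescaped (an assumption
    # based on the example request in the manual).
    query = urlencode({"aid": aid, "reqtype": reqtype}, safe="/")
    return f"{retriever}?{query}"


aid = build_aid(
    arcoffset=5160509,
    arcname="IAH-20041102080031-00007-utvikling1.nb.no",
    aid_prefix="/home/wera/arcs/",
    aid_suffix=".arc.gz",
)
url = retriever_url(
    "http://localhost:8080/ArcRetriever/ArcRetriever", aid, "getmeta"
)
print(url)
```
]]></programlisting>

      <para>Because the prefix starts with a slash, the aid contains a double
      slash after the offset; this is expected.</para>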

      <para>An example of the result of the getmeta request
      http://localhost:8080/ArcRetriever/ArcRetriever?aid=5160509//home/wera/arcs/IAH-20041102080031-00007-utvikling1.nb.no.arc.gz&amp;reqtype=getmeta
      is given below.</para>

      <screen format="linespecific">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
  &lt;retrievermessage&gt;
  &lt;head&gt;
  &lt;reqtype&gt;getmeta&lt;/reqtype&gt;
  &lt;aid&gt;5160509//home/wera/arcs/IAH-20041102080031-00007-utvikling1.nb.no.arc.gz&lt;/aid&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;metadata&gt;
      &lt;url&gt;http://www.nla.gov.au/raam/&lt;/url&gt;
      &lt;archival_time&gt;20041102080756&lt;/archival_time&gt;
      &lt;last_modified_time&gt;20041102080756&lt;/last_modified_time&gt;
      &lt;content_length&gt;&lt;/content_length&gt;
      &lt;contenttype&gt;
        &lt;type&gt;text/html&lt;/type&gt;
        &lt;charset&gt;&lt;/charset&gt;
      &lt;/contenttype&gt;
      &lt;filestatus&gt;online&lt;/filestatus&gt;
      &lt;filestatus_long&gt;&lt;/filestatus_long&gt;
      &lt;content_checksum&gt;ZBYZIFD6PK5ZHCUQGTKZSZ2LJMZUD554&lt;/content_checksum&gt;
      &lt;http-header&gt;HTTP/1.1 200 OK
       Date: Tue, 02 Nov 2004 08:07:57 GMT
       Server: Apache/1.3.29 (Unix) PHP/4.1.2 mod_perl/1.27 mod_jk/1.2.0 mod_ssl/2.8.16 OpenSSL/0.9.6l
       X-Powered-By: PHP/4.1.2
       Connection: close
       Content-Type: text/html&lt;/http-header&gt;
    &lt;/metadata&gt;
  &lt;/body&gt;
&lt;/retrievermessage&gt;</screen>
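
      <para>A response of this shape can be checked programmatically. The
      sketch below uses Python's standard xml.etree.ElementTree module on a
      shortened copy of the response above; the function name
      <emphasis>parse_getmeta</emphasis> and the keys of the returned
      dictionary are illustrative choices, not part of WERA or the
      ArcRetriever.</para>

      <programlisting><![CDATA[
```python
import xml.etree.ElementTree as ET

# Shortened copy of the getmeta response shown above.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<retrievermessage>
  <head>
    <reqtype>getmeta</reqtype>
    <aid>5160509//home/wera/arcs/IAH-20041102080031-00007-utvikling1.nb.no.arc.gz</aid>
  </head>
  <body>
    <metadata>
      <url>http://www.nla.gov.au/raam/</url>
      <archival_time>20041102080756</archival_time>
      <contenttype>
        <type>text/html</type>
        <charset></charset>
      </contenttype>
      <filestatus>online</filestatus>
    </metadata>
  </body>
</retrievermessage>"""


def parse_getmeta(xml_bytes):
    """Extract the URL, timestamp and MIME type from a getmeta response."""
    root = ET.fromstring(xml_bytes)
    meta = root.find("body/metadata")
    return {
        "aid": root.findtext("head/aid"),
        "url": meta.findtext("url"),
        "archival_time": meta.findtext("archival_time"),
        "mime_type": meta.findtext("contenttype/type"),
        "filestatus": meta.findtext("filestatus"),
    }


info = parse_getmeta(SAMPLE)
print(info["url"], info["archival_time"], info["mime_type"])
```
]]></programlisting>

      <para>The three values printed correspond to the minimum metadata set
      (original URL, timestamp and mime-type) that WERA expects from the
      archive.</para>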
    </section>

    <section>
      <title>Indexing</title>

      <para>When indexing, make sure you invoke the NutchWAX indexer
      (indexarcs.sh) with the <emphasis>-n</emphasis> option. Otherwise
      NutchWAX will remove all duplicate URLs from the index, and using WERA
      against such an index will show only one version per URL on the WERA
      timeline.</para>
    </section>

    <section>
      <title>Searching</title>

      <para>Open a browser and type in the URL http://localhost/wera (or the
      URL where you installed WERA).</para>
    </section>
  </section>
</article>