From: Michael S. <sta...@us...> - 2005-10-04 22:59:39
Update of /cvsroot/archive-access/archive-access/projects/wera/src/articles
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23369/wera/src/articles

Added Files:
  manual.xml
Log Message:
First time add of wera. Moved here from nwa.nb.no.

--- NEW FILE: manual.xml ---
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN" "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<article>
  <title>WERA Manual</title>
  <articleinfo>
    <releaseinfo>$Id: manual.xml,v 1.3 2005/07/15 17:12:14 sverreb Exp $</releaseinfo>
    <copyright>
      <year>2003, 2004</year>
      <holder>Royal Library in Stockholm</holder>
      <holder>Royal Library in Copenhagen</holder>
      <holder>Helsinki University Library in Finland</holder>
      <holder>National Library of Norway</holder>
      <holder>National and University Library of Iceland</holder>
    </copyright>
    <author>
      <surname>Bang</surname>
      <firstname>Sverre</firstname>
    </author>
  </articleinfo>
  <section>
    <title>Introduction</title>
    <para>WERA (Web ARchive Access) is a freely available solution for searching and navigating archived web document collections.</para>
    <para>A web archive may contain a large number of web documents, and also several versions of the same web document (i.e. documents that were downloaded from the same URL). Potential users of WERA include anyone who has a web archive. Examples of such users are:</para>
    <itemizedlist>
      <listitem>
        <para>National libraries or other organisations collecting parts of the Internet for long-term preservation.</para>
      </listitem>
      <listitem>
        <para>Companies or organisations keeping a historical collection of their own web site and/or intranet.</para>
      </listitem>
      <listitem>
        <para>Private persons keeping a historical collection of their own web site.</para>
      </listitem>
    </itemizedlist>
    <para>Note that in the following text an archived file and a web page are not necessarily the same thing.
What the user experiences as one web document may consist of several archived files (e.g. a web page that comprises the HTML file and its inline images).</para>
    <section>
      <title>Overview</title>
      <para>In order to use WERA for searching, browsing and navigating your archived web documents you will need some additional components. These are:</para>
      <itemizedlist>
        <listitem>
          <para>A Search Engine which holds a full-text index of the archived web documents. Currently the NutchWAX search engine is supported.</para>
        </listitem>
        <listitem>
          <para>A Document Retriever which serves as the interface between the Access module and the web archive. The Document Retriever delivers archived files and associated metadata to WERA upon request.</para>
        </listitem>
      </itemizedlist>
      <para>A key requirement for the web archive is that the contents of the web documents are stored unaltered and that a metadata set consisting of at least the original URL, timestamp and MIME type of the archived files is available.</para>
      <section>
        <title>NutchWAX</title>
        <para>Currently the Jakarta Lucene based NutchWAX search engine is supported. NutchWAX must (at the moment) be downloaded and installed separately. See http://archive-access.sourceforge.net/projects/nutch/ for further information.</para>
      </section>
      <section>
        <title>WERA</title>
        <para>WERA provides the user with interfaces for searching, browsing and navigating the archived web pages.</para>
        <para>When the user submits a query, WERA uses the search engine to find the archived files containing the text(s) satisfying the query. When the user asks for a specific URL, WERA returns the archived file with that particular URL (e.g. the archived file originally downloaded from the URL http://www.nb.no/index.html).
Before the file is delivered to the user's browser, a piece of JavaScript is inserted into the file so that the inline links and references are altered by the browser to point into the archive rather than out to the Internet.</para>
        <para>The resulting web page is presented with a timeline at the top and the web document below it. The timeline queries the index for all archived versions of the web page and displays the timestamps graphically along the line.<figure>
          <title>Access</title>
          <mediaobject>
            <imageobject>
              <imagedata fileref="images/access.jpg" />
            </imageobject>
          </mediaobject>
        </figure></para>
        <para>Searching a web archive through WERA resembles using an Internet search engine like Google. An example of the WERA search interface is shown below.<figure>
          <title>Search result</title>
          <mediaobject>
            <imageobject>
              <imagedata fileref="images/searchresult.png" />
            </imageobject>
          </mediaobject>
        </figure></para>
        <para>Clicking the Overview link of a specific hit displays the dates of all versions found for the chosen URL (the overview does not indicate which of these versions actually satisfied the query term given in the first place).</para>
        <figure>
          <title>Overview</title>
          <mediaobject>
            <imageobject>
              <imagedata fileref="images/overview.png" />
            </imageobject>
          </mediaobject>
        </figure>
        <para>Clicking one of the links in the Overview page takes the user to the Timeline view with the chosen version displayed. Clicking the Timeline link in the Search result page also takes the user to the Timeline page, but the version displayed is the most recent of the versions satisfying the query term given.</para>
        <figure>
          <title>Timeline</title>
          <mediaobject>
            <imageobject>
              <imagedata fileref="images/timeline.png" />
            </imageobject>
          </mediaobject>
        </figure>
        <para>When navigating from the Overview or the result list of a search interface, the URL of the chosen version is passed along and shown in the URL field.
A URL may also be entered manually.</para>
        <para>Navigation between the different versions is done by clicking a specific point on the timeline, or by using the arrows first, previous, next and last.</para>
        <para>When entering the timeline view the resolution is set to auto. This means that the timeline automatically drills down to the resolution needed to display single versions along the line. The Auto checkbox may be unchecked in order to choose the resolution manually (choosing a different resolution while in auto mode also disables auto resolution).</para>
        <para>It is also possible to perform a search from the Timeline by typing in a query term and pressing <emphasis>Go</emphasis>.</para>
      </section>
    </section>
  </section>
  <section>
    <title>Installation</title>
    <para>This section describes how to obtain, install and configure WERA.</para>
    <section>
      <title>Obtaining WERA</title>
      <para>The latest version of WERA may be downloaded from the WERA <ulink url="http://nwatoolset.sourceforge.net">home page</ulink> at SourceForge.</para>
    </section>
    <section>
      <title>Installing</title>
      <para>Information on how to distribute the different components of WERA on different hosts will be provided in a later version of WERA.</para>
      <section>
        <title>System Requirements</title>
        <para>WERA and the NWA Adapted Lucene search engine have been tested on different builds of <emphasis>RedHat</emphasis> (7.3, 8, AS2 etc.), <emphasis>Fedora</emphasis> and <emphasis>Suse</emphasis> Linux. There is no reason to believe that the system will not work on other Linux/Unix distributions.</para>
        <itemizedlist>
          <listitem>
            <para>A JVM</para>
          </listitem>
          <listitem>
            <para>Apache HTTP server with PHP 4.3 or 4.4 (make sure that XML support is enabled; see below for details).
WERA will NOT work properly with PHP 5, because of the new object model in PHP 5.</para>
            <para>If PHP is not installed, the quickest solution may be to install XAMPP, http://www.apachefriends.org/en/xampp.html</para>
            <para><emphasis role="bold">PHP XML support:</emphasis></para>
            <para>XML support is needed by WERA to handle the search results returned from the NutchWAX search engine. To verify that XML support is enabled in PHP, store the following text in a PHP file (e.g. info.php) in the Apache DocumentRoot directory:</para>
            <para><userinput><?php phpinfo(); ?></userinput></para>
            <para>Open <userinput>http://<yourhost>/info.php</userinput> in a browser and check that PHP has <emphasis role="bold">not</emphasis> been compiled with --disable-xml.</para>
          </listitem>
          <listitem>
            <para>Tomcat servlet container (http://jakarta.apache.org/tomcat/index.html). The ArcRetriever web app has been tested on versions 5.0.27 and 5.0.28.</para>
          </listitem>
          <listitem>
            <para>NutchWAX, a bundling of Nutch and extensions for searching Web Archive Collections (WACs): http://archive-access.sourceforge.net/projects/nutch/</para>
          </listitem>
        </itemizedlist>
      </section>
      <section>
        <title>Java Based Installer</title>
        <para>To install WERA do the following:</para>
        <itemizedlist>
          <listitem>
            <para>Download wera-x-y-z-installer.tar.gz from SourceForge.</para>
          </listitem>
          <listitem>
            <para>Unpack the gzipped tarball in a temporary directory on the host where you want WERA installed.</para>
          </listitem>
          <listitem>
            <para>Invoke the installer using <userinput>java -jar wera-x-y-z-installer.jar</userinput>.</para>
          </listitem>
          <listitem>
            <para>Follow the on-screen instructions.</para>
          </listitem>
        </itemizedlist>
        <para>The installer will configure WERA in accordance with the input you provide during the installation process.
See the section on manual installation in order to view and change these settings (e.g. if NutchWAX and/or your ARC file collection reside on different hosts than WERA).</para>
        <para>If the machine you are installing on does not have X installed, or if you are invoking the installer over SSH and X port forwarding is not working properly, the installer should fall back to text mode. If this fails, try the manual installation procedure.</para>
      </section>
      <section>
        <title>Manual installation</title>
        <para>To install WERA manually do the following:</para>
        <itemizedlist>
          <listitem>
            <para>Download wera-x-y-z-manual-install.tar.gz from SourceForge.</para>
          </listitem>
          <listitem>
            <para>Unpack the gzipped tarball into the Apache document root directory on the host where you want WERA installed.</para>
          </listitem>
          <listitem>
            <para>Move the file ArcRetriever.war from <apacheWebRootDir>/wera/ to the webapps directory of the Tomcat installation on the host where your ARC files reside.</para>
          </listitem>
          <listitem>
            <para>Edit the file <apacheWebRootDir>/wera/lib/config.inc (see below for details).</para>
          </listitem>
        </itemizedlist>
        <section>
          <title>Settings</title>
          <para>Settings for WERA can be found in the file <apacheWebRootDir>/wera/lib/config.inc. Edit this file in order to configure WERA for your environment. Parameters to adapt:</para>
          <table>
            <title>Settings in config.inc</title>
            <tgroup cols="2">
              <tbody>
                <row>
                  <entry>$conf_rootpath = "/opt/lampp/htdocs/wera";</entry>
                  <entry>Change this so that it corresponds with your environment, i.e. <apacheWebRootDir>/wera (you may of course rename the extracted wera directory to something else, and even choose to place it further down in the directory structure).</entry>
                </row>
                <row>
                  <entry>$conf_searchengine_url = "http://localhost:8080/nutchwax/opensearch";</entry>
                  <entry>Open the URL http://<nutchwaxhost>:<port>/nutchwax/ and click the RSS icon.
The URL of this page is the URL to enter as conf_searchengine_url (do not include the query part, i.e. the ? and everything following it). If NutchWAX is installed on the same host as WERA and Tomcat is serving on port 8080, the default setting should work.</entry>
                </row>
                <row>
                  <entry>$conf_aid_prefix = "/var/arcs/";</entry>
                  <entry>The current version of the ArcRetriever needs to know where the ARC files are located. All the ARC files that you indexed with Nutch should be placed in one directory. The path to that directory goes into this parameter.</entry>
                </row>
                <row>
                  <entry>$conf_aid_suffix = ".arc.gz";</entry>
                  <entry>The suffix of the ARC files in the above directory.</entry>
                </row>
                <row>
                  <entry>$document_retriever = "http://localhost:8080/ArcRetriever/ArcRetriever";</entry>
                  <entry>Change the host name and port to point to the Tomcat installation on the host where your ARC files reside.</entry>
                </row>
                <row>
                  <entry>$conf_http_host = "http://localhost/wera";</entry>
                  <entry>Change <emphasis>localhost</emphasis> to the host name of the machine where you are installing WERA. Add the port number if different from 80 (<hostname>:<port>). If you renamed the wera directory or unpacked it further down relative to the Apache document root, update this parameter accordingly.</entry>
                </row>
              </tbody>
            </tgroup>
          </table>
          <para>There are other parameters to tweak as well, but for a simple setup of WERA the above settings should do.
Information on setting other parameters will be prepared in later releases.</para>
        </section>
      </section>
    </section>
  </section>
  <section>
    <title>Using WERA</title>
    <para>After installing WERA you should go through the following steps.</para>
    <orderedlist>
      <listitem>
        <para>Test that the ArcRetriever is functioning correctly</para>
      </listitem>
      <listitem>
        <para>Check the search and navigate functionality</para>
      </listitem>
    </orderedlist>
    <para>These steps are described in more detail below.</para>
    <section>
      <title>Testing the Retriever</title>
      <para>In order to test the Retriever, try accessing a URL of the following form in a browser (or use <command>wget [URL]</command> from the command line):</para>
      <itemizedlist>
        <listitem>
          <para>http://<hostname>.<domainname>[:port]/<retriever>?reqtype=<reqtype>&aid=<aid></para>
        </listitem>
      </itemizedlist>
      <para>Here <emphasis>retriever</emphasis> is the retriever script doing the retrieval, <emphasis>reqtype</emphasis> is the request type and <emphasis>aid</emphasis> is the unique identifier (within the archive) of a harvested file. The <emphasis>getmeta</emphasis> request returns the archived technical metadata for the file in question and the <emphasis>getfile</emphasis> request returns the archived file itself.</para>
      <para>To find the aid of one particular document in your archive, open the URL http://<nutchwaxhost>:<port>/nutchwax/ and execute a query. Scroll down to the RSS icon and click it.
For one particular result, copy the values of nutch:arcoffset and nutch:arcname and build the aid: <arcoffset>/<conf_aid_prefix><arcname><conf_aid_suffix></para>
      <para>An example of the result of the getmeta request http://localhost:8080/ArcRetriever/ArcRetriever?aid=5160509//home/wera/arcs/IAH-20041102080031-00007-utvikling1.nb.no.arc.gz&reqtype=getmeta is given below.</para>
      <screen format="linespecific"><?xml version="1.0" encoding="UTF-8"?>
<retrievermessage>
  <head>
    <reqtype>getmeta</reqtype>
    <aid>5160509//home/wera/arcs/IAH-20041102080031-00007-utvikling1.nb.no.arc.gz</aid>
  </head>
  <body>
    <metadata>
      <url>http://www.nla.gov.au/raam/</url>
      <archival_time>20041102080756</archival_time>
      <last_modified_time>20041102080756</last_modified_time>
      <content_length></content_length>
      <contenttype>
        <type>text/html</type>
        <charset></charset>
      </contenttype>
      <filestatus>online</filestatus>
      <filestatus_long></filestatus_long>
      <content_checksum>ZBYZIFD6PK5ZHCUQGTKZSZ2LJMZUD554</content_checksum>
      <http-header>HTTP/1.1 200 OK
Date: Tue, 02 Nov 2004 08:07:57 GMT
Server: Apache/1.3.29 (Unix) PHP/4.1.2 mod_perl/1.27 mod_jk/1.2.0 mod_ssl/2.8.16 OpenSSL/0.9.6l
X-Powered-By: PHP/4.1.2
Connection: close
Content-Type: text/html</http-header>
    </metadata>
  </body>
</retrievermessage></screen>
    </section>
    <section>
      <title>Indexing</title>
      <para>When indexing, make sure you invoke the NutchWAX indexer (indexarcs.sh) with the <emphasis>-n</emphasis> option. If you do not, NutchWAX will remove all duplicate URLs from the index. Using WERA against such an index will give only one version per URL on the WERA timeline.</para>
    </section>
    <section>
      <title>Searching</title>
      <para>Open a browser and type in the URL http://localhost/wera (or the URL where you installed WERA).</para>
    </section>
  </section>
</article>