From: <bra...@us...> - 2008-08-09 02:36:36
Revision: 2532
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2532&view=rev
Author:   bradtofel
Date:     2008-08-09 02:36:45 +0000 (Sat, 09 Aug 2008)

Log Message:
-----------
DOC update for 1.4 features.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml

Added Paths:
-----------
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_store.xml

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-08-09 01:20:56 UTC (rev 2531)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-08-09 02:36:45 UTC (rev 2532)
@@ -53,7 +53,7 @@
         <p>
           Once you have downloaded the .tar.gz file from sourceforge,
           you will need to unpack the file to access the
-          webapp file, <b>wayback.war</b>.
+          webapp file, <b>wayback-webapp-1.4.0.war</b>.
         </p>
         <p>
           Installation and configuration of this software involves the
@@ -110,177 +110,24 @@
     class="org.archive.wayback.webapp.WaybackCollection">
   <property name="resourceStore" ... />
   <property name="resourceIndex" ... />
+  <property name="shutdownables" ... />
 </bean>
           </pre>
         </p>
         <p>
-          The resourceStore property refers to a bean implementing org.archive.wayback.ResourceStore.
+          The resourceStore property refers to a bean implementing
+          <a href="resource_store.html">org.archive.wayback.ResourceStore</a>.
         </p>
         <p>
-          The resourceIndex property refers to a bean implementing org.archive.wayback.ResourceIndex.
+          The resourceIndex property refers to a bean implementing
+          <a href="resource_index.html">org.archive.wayback.ResourceIndex</a>.
         </p>
-    </section>
+        <p>
+          The shutdownables property refers to a list of beans implementing org.archive.wayback.Shutdownable, typically worker Threads performing automatic updates of the Collection.
+        </p>
+    </section>
-
-
-    <section name="org.archive.wayback.ResourceStore implementations">
-
-
-      <subsection name="LocalResourceStore">
-        <p>
-          This implementation works well for small
-          collections, where all the ARC/WARC files can be placed in a single
-          directory on the same computer running the wayback application.
-          Using NFS or another network filesystem technology and symbolic
-          links can allow this implementation to deal with files in
-          multiple directories, or across multiple storage nodes. This
-          implementation also includes the capability to run a background
-          thread to automatically notice new ARC/WARC files appearing, index
-          those files, and hand off the index data for merging with
-          a BDBResourceIndex. When using automatic indexing, any files added to
-          the 'dataDir' will automatically be indexed and queued for merging
-          with the ResourceIndex. Please see documentation for the
-          BDBResourceIndex for information on configuring automatic merging of
-          indexed data with a BDBResourceIndex.
-        </p>
-        <p>
-          The XML configuration template for a LocalResourceStore follows:
-          <pre>
-
-<property name="resourceStore">
-  <bean class="org.archive.wayback.resourcestore.LocalResourceStore"
-    init-method="init">
-
-    <property name="dataDir" value="/tmp/wayback/arcs/" />
-
-    <property name="indexThread">
-      <bean class="org.archive.wayback.resourcestore.AutoIndexThread">
-        <property name="queuedDir" value="/tmp/wayback/arc-indexer/queued" />
-        <property name="workDir" value="/tmp/wayback/arc-indexer/work" />
-        <property name="runInterval" value="10000" />
-        <property name="indexClient">
-          <bean class="org.archive.wayback.resourceindex.indexer.IndexClient">
-            <property name="tmpDir" value="/tmp/wayback/arc-indexer/tmp" />
-            <property name="target" value="/tmp/wayback/index-data/incoming" />
-          </bean>
-        </property>
-      </bean>
-    </property>
-  </bean>
-</property>
-
-          </pre>
-        </p>
-        <p>
-          Required configuration:
-          <ul>
-            <li>
-              <b>
-                dataDir
-              </b>
-              is the local directory where ARC files will be
-              located.
-            </li>
-          </ul>
-        </p>
-        <p>
-          Optional configuration (only needed if the indexThread property-bean
-          is specified, for automatic indexing)
-          <ul>
-            <li>
-              <b>
-                queuedDir
-              </b>
-              names a local directory where the indexer will maintain state
-              about ARC files that have already been indexed.
-            </li>
-            <li>
-              <b>
-                workDir
-              </b>
-              names a local directory where the indexer will maintain state
-              about ARC files that are about to be indexed.
-            </li>
-            <li>
-              <b>
-                runInterval
-              </b>
-              indicates the number of milliseconds between polling arcDir
-              for newly created ARC files. Default is 10000.
-            </li>
-            <li>
-              <b>
-                tmpDir
-              </b>
-              names a local directory where index data will be stored
-              temporarily before handing off to <b>target</b>.
-            </li>
-            <li>
-              <b>
-                target
-              </b>
-              names:
-              <ol>
-                <li>
-                  a local directory where an BDBIndexUpdater is configured to
-                  look for new index data to be merged with a BDBIndex.
-                </li>
-                <li>
-                  a remote http:// URL where index data should be PUT, for
-                  merging with a remote BDBIndex.
-                </li>
-              </ol>
-            </li>
-          </ul>
-        </p>
-        <p>
-          <b>Note:</b> upgrading from Wayback 1.0 to 1.2 requires changing
-          ResourceStore implementations from <b>LocalARCResourceStore</b> to
-          <b>LocalResourceStore</b>. <b>LocalARCResourceStore</b> is now
-          deprecated.
-        </p>
-      </subsection>
-
-
-      <subsection name="Http11ResourceStore">
-        <p>
-          This implementation allows the wayback application to access
-          documents in remote ARC/WARC files via HTTP 1.1, and scales to
-          millions of ARC/WARC files.
-        </p>
-        <p>
-          The XML configuration template for an Http11ResourceStore follows:
-          <pre>
-
-<property name="resourceStore">
-  <bean class="org.archive.wayback.resourcestore.Http11ResourceStore">
-    <property name="urlPrefix" value="http://localhost:8080/arcproxy/" />
-  </bean>
-</property>
-
-          </pre>
-        </p>
-        <p>
-          Required configuration:
-          <ul>
-            <li>
-              <b>
-                urlPrefix
-              </b>
-              this is the http:// prefix where ARC/WARC files are exported with
-              an ArcProxy installation. See elsewhere in this document for
-              information about setting up an ArcProxy.
-            </li>
-          </ul>
-        </p>
-      </subsection>
-
-
-    </section>
-
-
-
     <section name="org.archive.wayback.ResourceIndex implementations">

Added: trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_store.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_store.xml (rev 0)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_store.xml 2008-08-09 02:36:45 UTC (rev 2532)
@@ -0,0 +1,143 @@
+<?xml version="1.0" encoding="ISO-8859-1"?>
+<document>
+  <properties>
+    <title>Resource Store Configuration</title>
+    <author email="brad at archive dot org">Brad Tofel</author>
+    <revision>$$Id$$</revision>
+  </properties>
+
+  <body>
+    <section name="ResourceStore configuration options">
+      <subsection name="FileLocationDB">
+        <p>
+          The Location Database provides a mapping between ARC/WARC file names
+          and the absolute location of those ARC/WARC files. Absolute
+          location, in this case, can refer to either HTTP URLs or absolute
+          paths to files on the local file system.
+        </p>
+        <p>
+          Whenever locations are added for a new filename that was not
+          previously present in the location database, a record (in this case a
+          line) is added to a log file. This log file can then be used to
+          determine which files have been seen by the location database. The
+          ResourceFileLocationDatabase interface includes methods to retrieve
+          the current length of this log file, and to return an iterator with
+          all records between two points in the log. This interface allows an
+          observer to poll the location database to create events when new files
+          are added to the underlying database.
+        </p>
+      </subsection>
+      <subsection name="Automatic Indexing Components">
+        <p>
+          Wayback includes 5 Thread/Worker classes to enable automatic indexing
+          of new content:
+          <img src="images/AutoIndexing.png" />
+          <ul>
+            <li>
+              <b>ResourceFileSourceUpdater</b> is responsible for repeatedly
+              scanning one or more ResourceFileSource instances, creating
+              manifests of the files seen in each, and handing the manifests
+              off to the ResourceFileLocationDBUpdater. In the future, for
+              larger installations, with 100s to 1000s of machines holding
+              ARC/WARC files, multiple instances of this component may run in
+              parallel.
+            </li>
+            <li>
+              <b>ResourceFileLocationDBUpdater</b> is responsible for noticing
+              new manifests appearing in an incoming directory, and merging
+              the contents of those manifests with the actual location database,
+              which is currently implemented using a BDBJE database.
+            </li>
+            <li>
+              <b>IndexQueueUpdater</b> is responsible for polling the location
+              database log, and adding newly discovered ARC/WARC files to the
+              IndexQueue.
+            </li>
+            <li>
+              <b>IndexWorker</b> is responsible for polling the IndexQueue, and
+              when file names are present in the queue, creating an index of
+              all resources in the ARC/WARC file, and handing the results to
+              the LocalResourceIndexUpdater. In the future, for larger
+              installations, multiple instances of this component may run in
+              parallel on multiple hosts, or this entire component may be
+              replaced by a distributed Hadoop indexing implementation.
+            </li>
+            <li>
+              <b>LocalResourceIndexUpdater</b> is responsible for noticing new
+              index result files appearing in an incoming directory, and merging
+              those results with an existing LocalResourceIndex. Currently the
+              only provided LocalResourceIndex that can be updated is based on an
+              underlying BDBJE database, but future implementations may maintain
+              a set of sorted CDX files, or a combination of CDX files and a
+              BDBJE database.
+            </li>
+          </ul>
+        </p>
+      </subsection>
+    </section>
+
+    <section name="org.archive.wayback.ResourceStore implementations">
+      <p>
+        Wayback allows for several configurations enabling diverse collection
+        sizes and distribution of ARC/WARC files across many local directories
+        or across many servers. For most configurations, the default
+        LocationDBResourceStore will suffice, but Wayback is distributed with
+        2 additional classes, FileProxy and SimpleResourceStore, which
+        provide an opportunity to insert a single HTTP caching server between
+        the Wayback service and an ARC/WARC storage cluster.
+      </p>
+
+      <subsection name="LocationDBResourceStore">
+        <p>
+          This implementation uses a LocationDB to convert ARC/WARC filenames
+          into absolute paths, or HTTP URLs. The underlying LocationDB can be
+          managed by the automatic indexing threads as described above, or it
+          can be manually managed with the <i>location-client</i> command line
+          tool. Be sure to enable the
+          org.archive.wayback.resourcestore.locationdb.FileProxyServlet
+          if you plan to manage the LocationDB manually.
+        </p>
+      </subsection>
+      <subsection name="SimpleResourceStore">
+        <p>
+          This configuration depends on all ARC/WARC files appearing within a
+          single HTTP 1.1 exported root directory, or within a single local
+          directory. ARC/WARC file names are appended to a common prefix, either
+          a local directory on the host running Wayback, or under a single
+          remote directory.
+        </p>
+        <p>
+          The FileProxyServlet can be used to make all ARC/WARC files accessible
+          within a single HTTP directory, acting as a reverse proxy to the
+          actual host holding the ARC/WARC files. The FileProxyServlet uses a
+          LocationDB to translate requested ARC/WARC filenames into the actual
+          location of each file.
+        </p>
+      </subsection>
+    </section>
+    <section name="Telling Wayback where to look for your ARC/WARC files">
+      <p>
+        When using the automatic indexing functionality, you need to provide a
+        list of ResourceFileSource objects to the ResourceFileSourceUpdater
+        class. Wayback currently contains 2 ResourceFileSource implementations:
+        <ul>
+          <li>
+            <b>DirectoryResourceFileSource</b> will recursively scan a local
+            directory for ARC/WARC files (ending with: .arc, .arc.gz, .warc,
+            or .warc.gz). The 'name' property of each
+            DirectoryResourceFileSource must be unique, and consist of valid
+            filename characters.
+          </li>
+          <li>
+            <b>JspUrlResourceFileSource</b> is a highly experimental
+            implementation which executes a local .jsp file, passing the 'url'
+            parameter to the .jsp. The local .jsp is expected to produce output
+            of the form (NAME URL) for all ARC/WARC files appearing under the
+            argument url prefix, presumably by parsing the directory index HTML
+            from the server hosting 'url'.
+          </li>
+        </ul>
+      </p>
+    </section>
+  </body>
+</document>
\ No newline at end of file
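
Note on the FileLocationDB log described in resource_store.xml: the polling contract (read the current length of the log, then iterate over the records between two points) can be illustrated with a small sketch. This is not Wayback code. Wayback is written in Java, and every name below (LocationLog, LogPoller, add_name, current_length, names_between, poll) is hypothetical, chosen only to show how an observer might turn log growth into "new file" events. Python is used purely to keep the sketch short.

```python
import os
import tempfile


class LocationLog:
    """Hypothetical append-only log: one line per newly seen file name."""

    def __init__(self, path):
        self.path = path
        # Create the log file if it does not exist yet.
        open(self.path, "ab").close()

    def add_name(self, name):
        # Append one record (a line) for a file name not seen before.
        with open(self.path, "ab") as log:
            log.write(name.encode("utf-8") + b"\n")

    def current_length(self):
        # Current length of the log in bytes; observers keep it as a mark.
        return os.path.getsize(self.path)

    def names_between(self, start, end):
        # Yield every record stored between two byte offsets in the log.
        with open(self.path, "rb") as log:
            log.seek(start)
            data = log.read(end - start)
        for line in data.splitlines():
            yield line.decode("utf-8")


class LogPoller:
    """Hypothetical observer that reports names added since its last poll."""

    def __init__(self, log):
        self.log = log
        self.mark = 0  # offset up to which records were already reported

    def poll(self):
        end = self.log.current_length()
        new_names = list(self.log.names_between(self.mark, end))
        self.mark = end
        return new_names


# Small demonstration using a temporary directory.
demo_log = LocationLog(os.path.join(tempfile.mkdtemp(), "location.log"))
demo_poller = LogPoller(demo_log)
demo_log.add_name("example-1.arc.gz")
first = demo_poller.poll()
demo_log.add_name("example-2.warc.gz")
second = demo_poller.poll()
third = demo_poller.poll()
```

Because the observer stores only an offset, it can be restarted or run on another host without the location database tracking who has seen what. That is presumably why the IndexQueueUpdater is described as polling the log rather than being called back.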