From: <bra...@us...> - 2008-08-09 02:36:36
Revision: 2532 http://archive-access.svn.sourceforge.net/archive-access/?rev=2532&view=rev Author: bradtofel Date: 2008-08-09 02:36:45 +0000 (Sat, 09 Aug 2008) Log Message: ----------- DOC update for 1.4 features. Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml Added Paths: ----------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_store.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-08-09 01:20:56 UTC (rev 2531) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-08-09 02:36:45 UTC (rev 2532) @@ -53,7 +53,7 @@ <p> Once you have downloaded the .tar.gz file from sourceforge, you will need to unpack the file to access the - webapp file, <b>wayback.war</b>. + webapp file, <b>wayback-webapp-1.4.0.war</b>. </p> <p> Installation and configuration of this software involves the @@ -110,177 +110,24 @@ class="org.archive.wayback.webapp.WaybackCollection"> <property name="resourceStore" ... /> <property name="resourceIndex" ... /> + <property name="shutdownables" ... /> </bean> </pre> </p> <p> - The resourceStore property refers to a bean implementing org.archive.wayback.ResourceStore. + The resourceStore property refers to a bean implementing + <a href="resource_store.html">org.archive.wayback.ResourceStore</a>. </p> <p> - The resourceIndex property refers to a bean implementing org.archive.wayback.ResourceIndex. + The resourceIndex property refers to a bean implementing + <a href="resource_index.html">org.archive.wayback.ResourceIndex</a>. </p> - </section> + <p> + The shutdownables property refers to a list of beans implementing org.archive.wayback.Shutdownable, typically worker Threads performing automatic updates of the Collection. 
+ </p> + </section> - - - <section name="org.archive.wayback.ResourceStore implementations"> - - - <subsection name="LocalResourceStore"> - <p> - This implementation works well for small - collections, where all the ARC/WARC files can be placed in a single - directory on the same computer running the wayback application. - Using NFS or another network filesystem technology and symbolic - links can allow this implementation to deal with files in - multiple directories, or across multiple storage nodes. This - implementation also includes the capability to run a background - thread to automatically notice new ARC/WARC files appearing, index - those files, and hand off the index data for merging with - a BDBResourceIndex. When using automatic indexing, any files added to - the 'dataDir' will automatically be indexed and queued for merging - with the ResourceIndex. Please see documentation for the - BDBResourceIndex for information on configuring automatic merging of - indexed data with a BDBResourceIndex. 
- </p> - <p> - The XML configuration template for a LocalResourceStore follows: - <pre> - -<property name="resourceStore"> - <bean class="org.archive.wayback.resourcestore.LocalResourceStore" - init-method="init"> - - <property name="dataDir" value="/tmp/wayback/arcs/" /> - - <property name="indexThread"> - <bean class="org.archive.wayback.resourcestore.AutoIndexThread"> - <property name="queuedDir" value="/tmp/wayback/arc-indexer/queued" /> - <property name="workDir" value="/tmp/wayback/arc-indexer/work" /> - <property name="runInterval" value="10000" /> - <property name="indexClient"> - <bean class="org.archive.wayback.resourceindex.indexer.IndexClient"> - <property name="tmpDir" value="/tmp/wayback/arc-indexer/tmp" /> - <property name="target" value="/tmp/wayback/index-data/incoming" /> - </bean> - </property> - </bean> - </property> - </bean> -</property> - - </pre> - </p> - <p> - Required configuration: - <ul> - <li> - <b> - dataDir - </b> - is the local directory where ARC files will be - located. - </li> - </ul> - </p> - <p> - Optional configuration (only needed if the indexThread property-bean - is specified, for automatic indexing) - <ul> - <li> - <b> - queuedDir - </b> - names a local directory where the indexer will maintain state - about ARC files that have already been indexed. - </li> - <li> - <b> - workDir - </b> - names a local directory where the indexer will maintain state - about ARC files that are about to be indexed. - </li> - <li> - <b> - runInterval - </b> - indicates the number of milliseconds between polling arcDir - for newly created ARC files. Default is 10000. - </li> - <li> - <b> - tmpDir - </b> - names a local directory where index data will be stored - temporarily before handing off to <b>target</b>. - </li> - <li> - <b> - target - </b> - names: - <ol> - <li> - a local directory where an BDBIndexUpdater is configured to - look for new index data to be merged with a BDBIndex. 
- </li> - <li> - a remote http:// URL where index data should be PUT, for - merging with a remote BDBIndex. - </li> - </ol> - </li> - </ul> - </p> - <p> - <b>Note:</b> upgrading from Wayback 1.0 to 1.2 requires changing - ResourceStore implementations from <b>LocalARCResourceStore</b> to - <b>LocalResourceStore</b>. <b>LocalARCResourceStore</b> is now - deprecated. - </p> - </subsection> - - - <subsection name="Http11ResourceStore"> - <p> - This implementation allows the wayback application to access - documents in remote ARC/WARC files via HTTP 1.1, and scales to - millions of ARC/WARC files. - </p> - <p> - The XML configuration template for an Http11ResourceStore follows: - <pre> - -<property name="resourceStore"> - <bean class="org.archive.wayback.resourcestore.Http11ResourceStore"> - <property name="urlPrefix" value="http://localhost:8080/arcproxy/" /> - </bean> -</property> - - </pre> - </p> - <p> - Required configuration: - <ul> - <li> - <b> - urlPrefix - </b> - this is the http:// prefix where ARC/WARC files are exported with - an ArcProxy installation. See elsewhere in this document for - information about setting up an ArcProxy. 
- </li> - </ul> - </p> - </subsection> - - - </section> - - - <section name="org.archive.wayback.ResourceIndex implementations"> Added: trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_store.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_store.xml (rev 0) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_store.xml 2008-08-09 02:36:45 UTC (rev 2532) @@ -0,0 +1,143 @@ +<?xml version="1.0" encoding="ISO-8859-1"?> +<document> + <properties> + <title>Resource Store Configuration</title> + <author email="brad at archive dot org">Brad Tofel</author> + <revision>$$Id$$</revision> + </properties> + + <body> + <section name="ResourceStore configuration options"> + <subsection name="FileLocationDB"> + <p> + The Location Database provides a mapping between ARC/WARC file names + and the absolute location of those ARC/WARC files. Absolute + location, in this case, can refer to either HTTP URLs or absolute + paths to files on the local file system. + </p> + <p> + Whenever locations are added for a new filename that was not + previously present in the location database, a record (in this case a + line) is added to a log file. This log file can then be used to + determine which files have been seen by the location database. The + ResourceFileLocationDatabase interface includes methods to retrieve + the current length of this log file, and to return an iterator with + all records between two points in the log. This interface allows an + observer to poll the location database to create events when new files + are added to the underlying database. 
+ </p> + </subsection> + <subsection name="Automatic Indexing Components"> + <p> + Wayback includes 5 Thread/Worker classes to enable automatic indexing + of new content: + <img src="images/AutoIndexing.png" /> + <ul> + <li> + <b>ResourceFileSourceUpdater</b> is responsible for repeatedly + scanning one or more ResourceFileSource instances, creating + manifests of the files seen in each, and handing the manifests + off to the ResourceFileLocationDBUpdater. In the future, for + larger installations, with 100s to 1000s of machines holding + ARC/WARC files, multiple instances of this component may run in + parallel. + </li> + <li> + <b>ResourceFileLocationDBUpdater</b> is responsible for noticing + new manifests appearing in an incoming directory, and merging + the contents of those manifests with the actual location database, + which is currently implemented using a BDBJE database. + </li> + <li> + <b>IndexQueueUpdater</b> is responsible for polling the location + database log, and adding newly discovered ARC/WARC files to the + IndexQueue. + </li> + <li> + <b>IndexWorker</b> is responsible for polling the IndexQueue, and + when file names are present in the queue, creating an index of + all resources in the ARC/WARC file, and handing the results to + the LocalResourceIndexUpdater. In the future, for larger + installations, multiple instances of this component may run in + parallel on multiple hosts, or this entire component may be + replaced by a distributed Hadoop indexing implementation. + </li> + <li> + <b>LocalResourceIndexUpdater</b> is responsible for noticing new + index result files appearing in an incoming directory, and merging + those results with an existing LocalResourceIndex. Currently the + only provided LocalResourceIndex that can be updated is based on an + underlying BDBJE database, but future implementations may maintain + a set of sorted CDX files, or a combination of CDX files and a + BDBJE database. 
+ </li> + </ul> + </p> + </subsection> + </section> + + <section name="org.archive.wayback.ResourceStore implementations"> + <p> + Wayback allows for several configurations enabling diverse collection + sizes and distribution of ARC/WARC files across many local directories + or across many servers. For most configurations, the default + LocationDBResourceStore will suffice, but Wayback is distributed with + 2 additional classes, FileProxy and SimpleResourceStore, which + provide an opportunity to insert a single HTTP caching server between + the Wayback service and an ARC/WARC storage cluster. + </p> + + <subsection name="LocationDBResourceStore"> + <p> + This implementation uses a LocationDB to convert ARC/WARC filenames + into absolute paths, or HTTP URLs. The underlying LocationDB can be + managed by the automatic indexing threads as described above, or it + can be manually managed with the <i>location-client</i> command line + tool. Be sure to enable the + org.archive.wayback.resourcestore.locationdb.FileProxyServlet + if you plan to manage the LocationDB manually. + </p> + </subsection> + <subsection name="SimpleResourceStore"> + <p> + This configuration depends on all ARC/WARC files appearing within a + single HTTP 1.1 exported root directory, or within a single local + directory. ARC/WARC file names are appended to a common prefix, either + a local directory on the host running Wayback, or under a single + remote directory. + </p> + <p> + The FileProxyServlet can be used to make all ARC/WARC files accessible + within a single HTTP directory, acting as a reverse proxy to the + actual host holding the ARC/WARC files. The FileProxyServlet uses a + LocationDB to translate requested ARC/WARC filenames into the actual + location of each file. 
+ </p> + </subsection> + </section> + <section name="Telling Wayback where to look for your ARC/WARC files"> + <p> + When using the automatic indexing functionality, you need to provide a + list of ResourceFileSource objects to the ResourceFileSourceUpdater + class. Wayback currently contains 2 ResourceFileSource implementations: + <ul> + <li> + <b>DirectoryResourceFileSource</b> will recursively scan a local + directory for ARC/WARC files (ending with: .arc, .arc.gz, .warc, + or .warc.gz). The 'name' property of each + DirectoryResourceFileSource must be unique, and consist of valid + filename characters. + </li> + <li> + <b>JspUrlResourceFileSource</b> is a highly experimental + implementation which executes a local .jsp file, passing the 'url' + parameter to the .jsp. The local .jsp is expected to produce output + of the form (NAME URL) for all ARC/WARC files appearing under the + argument url prefix, presumably by parsing the directory index HTML + from the server hosting 'url'. + </li> + </ul> + </p> + </section> + </body> +</document> \ No newline at end of file This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
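Editorial sketch for the FileLocationDB text added in revision 2532: the log-polling idea (read the current log length, then iterate over the records between two points) can be illustrated roughly as follows. This is a minimal Python sketch under stated assumptions, not Wayback's Java ResourceFileLocationDatabase interface: the class names LocationLog and LogObserver, the method names, and the one-file-name-per-line log layout are all hypothetical.

```python
import os
import tempfile


class LocationLog:
    """Hypothetical append-only log: one ARC/WARC file name per line."""

    def __init__(self, path):
        self.path = path

    def append(self, name):
        # Record a file name the first time the location database sees it.
        with open(self.path, "a", encoding="utf-8", newline="\n") as f:
            f.write(name + "\n")

    def current_length(self):
        # Length of the log in bytes; 0 if nothing has been logged yet.
        return os.path.getsize(self.path) if os.path.exists(self.path) else 0

    def records_between(self, start, end):
        # Iterator over the records stored between two byte offsets.
        with open(self.path, "rb") as f:
            f.seek(start)
            data = f.read(end - start)
        return iter(data.decode("utf-8").splitlines())


class LogObserver:
    """Polls the log and reports only the names added since the last poll."""

    def __init__(self, log):
        self.log = log
        self.mark = 0  # byte offset already processed

    def poll(self):
        end = self.log.current_length()
        if end <= self.mark:
            return []
        new_names = list(self.log.records_between(self.mark, end))
        self.mark = end
        return new_names


def demo():
    with tempfile.TemporaryDirectory() as d:
        log = LocationLog(os.path.join(d, "db.log"))
        observer = LogObserver(log)
        first = observer.poll()
        log.append("a.arc.gz")
        log.append("b.warc.gz")
        second = observer.poll()
        third = observer.poll()
        return first, second, third
```

In this sketch the observer keeps only a byte offset as state, which is presumably why the documented interface exposes the log length and a range iterator rather than a callback.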
From: <bra...@us...> - 2008-08-19 02:50:12
Revision: 2563 http://archive-access.svn.sourceforge.net/archive-access/?rev=2563&view=rev Author: bradtofel Date: 2008-08-19 02:50:22 +0000 (Tue, 19 Aug 2008) Log Message: ----------- DOCS: updated in prep for 1.4. Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-08-19 02:49:55 UTC (rev 2562) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-08-19 02:50:22 UTC (rev 2563) @@ -552,7 +552,7 @@ </p> <p> Additionally, there is an experimental Firefox-specific plugin - developed by Oskar Grenholm, which sends a provides a novel interface + developed by Oskar Grenholm, which provides a novel interface to navigate between different captured versions of a page within Proxy mode, and also sends a special HTTP header which allows Wayback to uniquely associate the correct date with browsers, even those @@ -927,14 +927,19 @@ <subsection name="Limiting Access based on HTTP BASIC Authentication"> Access can be restricted to a particular Access Point using Tomcat's built-in configuration options. 
By adding the following configuration to - the web.xml, which assumes an Access Point named "8080:secure" (or + the web.xml, which assumes an Access Point named "8080:usersecure" (or really for any port): <pre> +<security-role> + <description>Secured-Wayback</description> + <role-name>wayback</role-name> +</security-role> + <security-constraint> <web-resource-collection> <web-resource-name>Secured-Wayback</web-resource-name> - <url-pattern>/secure/*</url-pattern> + <url-pattern>/usersecure/*</url-pattern> </web-resource-collection> <auth-constraint> <role-name>wayback</role-name> @@ -953,7 +958,7 @@ <pre> <role rolename="wayback"/> -<user password="changeM3" roles="wayback" username="brad"/> +<user password="changeM3" roles="wayback" name="brad"/> </pre> </subsection> Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml 2008-08-19 02:49:55 UTC (rev 2562) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml 2008-08-19 02:50:22 UTC (rev 2563) @@ -41,17 +41,17 @@ <p> The following configuration is required for a LocalResourceIndex: <ul> - <li>source - a bean implementing SearchResultSource, which can be - one of the following: + <li><b>source</b> - a bean implementing SearchResultSource, which + can be one of the following: <ul> - <li>BDBIndex - a BDBJE database holding records for all + <li><b>BDBIndex</b> - a BDBJE database holding records for all documents within the WaybackCollection. This implementation allows for fast incremental updates to the index, and is required when using automatic indexing. This implementation scales well to 10's of millions of records.</li> - <li>CDXIndex - a sorted flat file containing one line per - document within the WaybackCollection. 
This + <li><b>CDXIndex</b> - a sorted flat file containing one line + per document within the WaybackCollection. This implementation requires that the CDX file be manually maintained, but scales to very large sizes, limited primarily by the size of file you can build and store. @@ -59,9 +59,9 @@ <b>arc-indexer</b> or <b>warc-indexer</b>, and the standard UNIX <b>sort</b> tool. </li> - <li>CompositeSearchResultSource - an implementation allowing - aggregation of multiple SearchResultSources into a single - logical SearchResultSource. Use of BDBIndex + <li><b>CompositeSearchResultSource</b> - an implementation + allowing aggregation of multiple SearchResultSources into + a single logical SearchResultSource. Use of BDBIndex SearchResultSources within this class is experimental, but this implementation has been used successfully in production installations to serve results from several @@ -84,23 +84,23 @@ <p> The following configurations are optional for LocalResourceIndexes: <ul> - <li>maxRecords - integer maximum number of records to process for a - single request. Useful to prevent a single request from using - too much Disk and CPU resources.</li> - <li>dedupeRecords - boolean value that should be set to <i>true</i> - when using deduplicated WARC records. This causes Wayback to - modify search results as they are read from the index, so - records indicating a resource was inspected but not saved are - accessible within the Wayback. Please see the + <li><b>maxRecords</b> - integer maximum number of records to process + for a single request. Useful to prevent a single request from + using too much Disk and CPU resources.</li> + <li><b>dedupeRecords</b> - boolean value that should be set to + <i>true</i> when using deduplicated WARC records. This causes + Wayback to modify search results as they are read from the + index, so records indicating a resource was inspected but not + saved are accessible within the Wayback. 
Please see the <a href="#Duplicate_Reduction">Duplicate Reduction</a> section below for more information.</li> - <li>annotater - experimental hook for modifying or omitting records - as they are read from the index. For example, additional + <li><b>annotater</b> - experimental hook for modifying or omitting + records as they are read from the index. For example, additional metadata could be associated with each record from an external datasource, and this extra metadata could then be exposed to end users through a .jsp customization.</li> - <li>canonicalizer - an implementation of UrlCanonicalizer. See the - section labeled URL Canonicalization below for more + <li><b>canonicalizer</b> - an implementation of UrlCanonicalizer. + See the section labeled URL Canonicalization below for more information.</li> </ul> </p>
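Editorial sketch for the LocalResourceIndex text in revision 2563: the idea behind a CompositeSearchResultSource (several sorted sources presented as one) combined with a maxRecords cap can be illustrated as below. This is a minimal Python sketch, not Wayback's implementation: the function name composite_search, the "urlkey timestamp" line layout, and the linear prefix scan are assumptions made for illustration.

```python
import heapq
from itertools import islice


def composite_search(sources, prefix, max_records=None):
    """Merge already-sorted line sources and return lines matching a prefix.

    sources     -- iterables of lines, each sorted in ascending order
    prefix      -- key prefix to match (hypothetical "urlkey " layout)
    max_records -- optional cap on the number of records returned
    """
    # heapq.merge lazily interleaves sorted inputs into one sorted stream.
    merged = heapq.merge(*sources)
    matches = (line for line in merged if line.startswith(prefix))
    # islice with None as the stop value returns every match.
    return list(islice(matches, max_records))


def demo():
    a = [
        "example.com/ 20080101000000",
        "example.com/ 20080301000000",
        "example.org/ 20080101000000",
    ]
    b = [
        "example.com/ 20080201000000",
        "example.net/ 20080101000000",
    ]
    return composite_search([a, b], "example.com/ ", max_records=2)
```

The merge only produces correct output if every source is sorted the same way, which matches the manual's requirement that CDX files be sorted before use.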
From: <bra...@us...> - 2008-11-06 22:51:27
Revision: 2630 http://archive-access.svn.sourceforge.net/archive-access/?rev=2630&view=rev Author: bradtofel Date: 2008-11-06 22:51:24 +0000 (Thu, 06 Nov 2008) Log Message: ----------- DOC: clarified dependency on using url-client with -identity option on arc/warc-indexer Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-10-29 00:01:33 UTC (rev 2629) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-11-06 22:51:24 UTC (rev 2630) @@ -1110,8 +1110,11 @@ </p> <p> The <b>-identity</b> option causes the tools to skip canonicalization - of URLs. See the documentation for the <b>url-client</b> tool, and - the <a href="resource_index.html#URL_Canonicalization"> + of URLs. When using this option, you will need to pass the CDX + records through the url-client tool before sorting them into a + production CDX index. See the documentation for the + <b>url-client</b> tool, and the + <a href="resource_index.html#URL_Canonicalization"> URL Canonicalization </a> section for more information. </p> @@ -1182,15 +1185,19 @@ canonicalization function is applied to requested URLs. This tool will read space(" ") delimited lines from STDIN, and output the same lines on STDOUT, but with one column - altered. The column that is changed is assumed to be a URL, + altered. The column that is changed is assumed to be an URL, and the version output is the canonicalized form of the input URL. 
</p> <p> - This tool is mostly useful for debugging the - canonicalization function, but can also be used, if the - canonicalization function is altered, to update an existing - CDX index, without recreating CDX files from original ARCs. See the + This tool is required when using the <b>arc-indexer</b> or + <b>warc-indexer</b> tools with the <b>-identity</b> option. Typical + usage involves generating an <i>identity</i> CDX index, then + passing the lines in that index through this tool to canonicalize the + record URL key for queries. If the <i>identity</i> CDX files are + kept, then canonicalization schemes can be swapped without + reindexing the original ARC/WARC content. This tool can also be + useful for debugging the canonicalization function. See the section <a href="resource_index.html#URL_Canonicalization"> URL Canonicalization Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml 2008-10-29 00:01:33 UTC (rev 2629) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml 2008-11-06 22:51:24 UTC (rev 2630) @@ -275,10 +275,14 @@ </li> <li> <b>user info removal</b> - http://us...@ex... => example.com, - http://user:pas...@ex... => example.com, + http://us...@ex... => example.com, + http://user:pas...@ex... => example.com, </li> <li> + <b>default port removal</b> + http://example.com:80 => example.com, + </li> + <li> <b>session ID removal</b> http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx => @@ -313,12 +317,12 @@ <p> At the IA, we have recently switched to building CDX files using the <b>-identity</b> option on the <b>arc-indexer</b> and - <b>warc-indexer</b> tools, and have added an additional step in our - CDX creation processes which uses the <b>url-client</b> tool before - sorting and merging CDX files. 
By keeping the original "identity" CDX - files, we have been able to test various URL canonicalization - strategies without the overhead of re-processing all the source - materials. + <b>warc-indexer</b> tools. The <b>-identity</b> option + <b>requires</b> passing records through the <b>url-client</b> + tool before sorting and merging into production CDX files. By keeping + the original "identity" CDX files, we have been able to test various + URL canonicalization strategies without the overhead of + re-processing all the ARC/WARC source materials. </p> </subsection> <subsection name="Future Directions within Wayback">
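Editorial sketch for the identity-CDX workflow described in revision 2630: read space-delimited lines, canonicalize the URL column, then sort. The Python below is a minimal illustration and not the url-client tool or Wayback's canonicalizer. It covers only the user info removal and default port removal rules listed in the diff, and it also lowercases the host as a side effect of urlsplit. The function names and the choice of column 0 for the URL are assumptions.

```python
from urllib.parse import urlsplit

# Default ports that the sketch drops from the canonical key.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Return a simplified lookup key for a URL (illustrative rules only)."""
    parts = urlsplit(url)
    # .hostname drops any user:password@ prefix and lowercases the host.
    key = parts.hostname or ""
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        key += ":%d" % port
    key += parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def canonicalize_column(lines, column=0):
    """Rewrite one space-delimited column of each line, leaving the rest."""
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        fields[column] = canonicalize(fields[column])
        yield " ".join(fields)


def build_index(identity_lines, column=0):
    # Canonicalize first, then sort, mirroring the order given in the docs.
    return sorted(canonicalize_column(identity_lines, column))
```

Because the identity lines are kept unchanged, build_index can be rerun with a different canonicalize function without touching the original ARC/WARC files, which is the benefit the documentation describes.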
From: <bra...@us...> - 2008-12-05 22:24:46
Revision: 2647 http://archive-access.svn.sourceforge.net/archive-access/?rev=2647&view=rev Author: bradtofel Date: 2008-12-05 22:24:43 +0000 (Fri, 05 Dec 2008) Log Message: ----------- oops. forgot to commit site updates within the 1.4.1 branch... this is deployed ad-hoc anyways at the moment, so we'll leave it committed here under 1.5.0. Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml 2008-12-05 22:21:42 UTC (rev 2646) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml 2008-12-05 22:24:43 UTC (rev 2647) @@ -74,6 +74,16 @@ </p> </section> <section name="News"> + <subsection name="Maintenance Release - 1.4.1, 11/10/2008"> + <p> + Release 1.4.1 fixes several problems discovered in the 1.4.0 + release, and most notably disables by default the AnchorDate and + AnchorWindow features which generated some confusion. Please + see the <a href="release_notes.html">release notes</a> for + a detailed list of changes. + </p> + </subsection> + <subsection name="New Release - 1.4.0, 8/20/2008"> <p> Release 1.4.0 has several new features, as well as several Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml 2008-12-05 22:21:42 UTC (rev 2646) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml 2008-12-05 22:24:43 UTC (rev 2647) @@ -14,6 +14,61 @@ to release 1.2.0. 
</p> </section> + <section name="Release 1.4.1"> + <subsection name="Features"> + <ul> + <li> + Index filter which allows including/excluding records based on HTTP + response code field.(<i>ACC-43</i>) + </li> + <li> + Outputs log message instead of stack dump when failing to access + a Resource. + </li> + </ul> + </subsection> + <subsection name="Bug Fixes"> + <ul> + <li> + Some redirect records were not being located in index due to bad + logic in Duplicate record filter.(<i>ACC-30</i>) + </li> + <li> + Wayback was not throwing a NotInArchiveException when + Self-Redirect replay filter removes all records. (unreported) + </li> + <li> + Location HTTP header values were not being escaped before + placing in CDX, causing some records to have too many columns. + (<i>ACC-31</i>) + </li> + <li> + Search Result summary counts were incorrect in Url Prefix + searches.(<i>ACC-33</i>) + </li> + <li> + Implemented NoCache.jsp, a replay insert which adds a + <b>Cache-Control: no-cache</b> HTTP header to all replayed + documents.(<i>ACC-34</i>) + </li> + <li> + Timeline.jsp was using Request Date, not Capture date, which + caused Proxy Mode Timeline to show the wrong date. + (<i>ACC-36</i>) + </li> + <li> + Advanced Search reference implementation .jsp was broken. + (<i>ACC-37</i>) + </li> + <li> + AnchorDate and AnchorWindow functionality is now disabled by + default, and can be enabled via configuration on an AccessPoint. + (<i>ACC-46</i>) + </li> + </ul> + </subsection> + </section> + + <section name="Release 1.4.0"> <subsection name="Features"> <ul>
From: <bra...@us...> - 2009-11-11 00:33:38
Revision: 2927 http://archive-access.svn.sourceforge.net/archive-access/?rev=2927&view=rev Author: bradtofel Date: 2009-11-11 00:20:58 +0000 (Wed, 11 Nov 2009) Log Message: ----------- website: now includes notes on 1.4.2 release Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml 2009-11-11 00:18:08 UTC (rev 2926) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml 2009-11-11 00:20:58 UTC (rev 2927) @@ -74,6 +74,13 @@ </p> </section> <section name="News"> + <subsection name="Maintenance Release - 1.4.2, 7/17/2009"> + <p> + Release 1.4.2 fixes several problems discovered in the 1.4.1 + release. Please see the <a href="release_notes.html">release notes</a> for + a detailed list of changes. + </p> + </subsection> <subsection name="Maintenance Release - 1.4.1, 11/10/2008"> <p> Release 1.4.1 fixes several problems discovered in the 1.4.0 Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml 2009-11-11 00:18:08 UTC (rev 2926) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml 2009-11-11 00:20:58 UTC (rev 2927) @@ -14,6 +14,86 @@ to release 1.2.0. 
</p> </section> + <section name="Release 1.4.2"> + <subsection name="Features"> + <ul> + <li> + Added exactSchemeOnly configuration to AccessPoint, allowing + explicit distinction between http:// and https://(<i>ACC-32</i>) + </li> + <li> + Now times out requests to a slow/non-responsive RemoteResourceIndex + and remote(HTTP 1.1) ResourceStore nodes.(<i>ACC-38</i>) + </li> + <li> + experimental OpenSearchQuery .jsp implementations(<i>ACC-56</i>) + </li> + <li> + FileProxyServlet now accepts /OFFSET trailing path in addition to + Content-Range HTTP header.(<i>ACC-74</i>) + </li> + <li> + warc-indexer now has -all option to produce a CDX line for ALL + records, not just captures and revisits(<i>ACC-75</i>) + </li> + <li> + now includes file+offset for all records, keying off mime-type of + warc/revisit to determine revisits at query time.(<i>ACC-76</i>) + </li> + <li> + Allow prefixing of original HTTP headers with a fixed string. + (<i>ACC-77</i>) + </li> + <li> + Now Wayback rewrites Content-Base HTTP headers.(<i>ACC-78</i>) + </li> + <li> + Timeline.jsp improvements which prevent Timeline from being severely + distorted on some pages. + </li> + <li> + Improvement to ArchivalUrl client-rewrite.js to preserve link text, + working around a bug in Internet Explorer. + </li> + </ul> + </subsection> + <subsection name="Bug Fixes"> + <ul> + <li> + Now all mime-types are escaped to prevent spaces from getting into + the CDX files.(<i>ACC-45</i>) + </li> + <li> + Some CSS URLs were being rewritten twice. (<i>ACC-53</i>) + </li> + <li> + No longer writing original pages Content-Length HTTP header to + output, which caused original pages with Lower-Case "L" in + "Content-length" to return wrong length, truncating replayed + documents. This caused some replayed pages to not have embedded + disclaimers, nor javascript rewriting of links and images. 
+ (<i>ACC-60</i>) + </li> + <li> + Fixed severe problem with live web robots.txt retrieval where wrong + offset was being written into the live web ResourceIndex. + (<i>ACC-62</i>) + </li> + <li> + Charset extraction from HTTP headers is now case-insensitive. + (<i>ACC-63</i>) + </li> + <li> + No longer adding content to HTML pages with FrameSet tags, as they + were being broken.(<i>ACC-65</i>) + </li> + <li> + No longer set GMT as default timezone for entire JVM.(<i>ACC-70</i>) + </li> + </ul> + </subsection> + </section> + <section name="Release 1.4.1"> <subsection name="Features"> <ul>
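Editorial sketch for two of the 1.4.2 fixes listed above (ACC-60 and ACC-63): both come down to treating HTTP header names and the charset parameter case-insensitively. The Python below is a minimal illustration under that assumption and is not Wayback's Java code. The function name charset_from_headers and the regular expression are hypothetical.

```python
import re

# Matches charset=VALUE or charset="VALUE", ignoring case and spacing.
CHARSET_RE = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)


def charset_from_headers(headers):
    """Return the charset named in a Content-Type header, or None.

    headers -- mapping of header name to value, in any letter case
    """
    for name, value in headers.items():
        # Header names are compared case-insensitively.
        if name.lower() == "content-type":
            match = CHARSET_RE.search(value)
            if match:
                return match.group(1)
    return None
```

For example, a header map of {"content-TYPE": "text/html; CHARSET=ISO-8859-1"} yields "ISO-8859-1" in this sketch, while a Content-Type value with no charset parameter yields None.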
From: <bra...@us...> - 2010-10-21 23:02:13
Revision: 3296 http://archive-access.svn.sourceforge.net/archive-access/?rev=3296&view=rev Author: bradtofel Date: 2010-10-21 23:02:07 +0000 (Thu, 21 Oct 2010) Log Message: ----------- RELEASE DOCS Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/requirements.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml 2010-10-21 23:01:21 UTC (rev 3295) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml 2010-10-21 23:02:07 UTC (rev 3296) @@ -14,6 +14,162 @@ to release 1.2.0. </p> </section> + <section name="Release 1.6.0"> + <subsection name="Major Features"> + <ul> + <li> + <a href="http://www.mementoweb.org/guide/quick-intro/">Memento</a> integration. + </li> + <li> + Improved live-web fetching, enabling simpler external caching of + robots.txt documents, or other arbitrary content used to improve + function of a replay session. + </li> + <li> + Customizable logging, via a logging.properties configuration file. + </li> + <li> + Vastly improved Server-side HTML rewriting capabilities, including + customizable rewriting of specific tags and attributes, rewriting + of (some easily recognizable) URLs within JavaScript and CSS. + </li> + <li> + Snazzy embedded toolbar with "sparkline" indicating the distribution + of captures for a given HTML page, control elements enabling + navigation between various versions of the current page, and a + search box to navigate to other URLs directly from a replay session. + </li> + <li> + Improved hadoop CDX generation capabilities for large scale indexes. 
+ </li> + <li> + SWF (Flash) rewriting, to contextualize URLs embedded within flash + content. + </li> + <li> + ArchivalUrl mode now accepts identity ("id_") flag to indicate + transparent replaying of original content. + </li> + <li> + NotInArchive can now optionally trigger an attempt to fill in + content from the live web, on the fly. + </li> + <li> + Updated license to Apache 2. + </li> + </ul> + </subsection> + <subsection name="Major Bug Fixes"> + <ul> + <li> + More robust handling of chunk encoded resources. + </li> + <li> + Fixed problem with improperly resolving path-relative URLs found + in HTML, CSS, JavaScript, SWF content. + </li> + <li> + Fixed problem with improperly escaping URLs within HTML when + rewriting them. + </li> + <li> + Fixed problem where a misconfigured or missing administrative + exclusion file was allowing results to be returned, instead of + returning an appropriate error. + </li> + <li> + No longer extracts resources from the ResourceStore before + redirecting to the closest version, which was a major inefficiency. + </li> + </ul> + </subsection> + <subsection name="Minor Features"> + <ul> + <li> + Now provides a closeMatches list of search results which were not + applicable given the user's request, but that may be useful for + follow-up requests. + </li> + <li> + Archival Url mode now allows rotating through several character + encoding detection schemes. + </li> + <li> + Proxy Replay mode now accepts ArchivalURL format requests, allowing + dates to be explicitly requested via proxy mode. + </li> + <li> + AccessPoints can now be configured to optionally require strict host + matching for queries and replay requests. + </li> + <li> + Now filters URLs which contain user-info (USER:PAS...@ex...) + from the ResourceIndex. + </li> + <li> + ArchivalURL mode requests without a datespec are now interpreted as + a request for the most recent capture of the URL. 
+ </li> + <li> + Improvements in mapping incoming requests to AccessPoints, to allow + virtual hosts to target specific AccessPoints. + </li> + <li> + ResourceNotAvailable exceptions now include other close search + results, allowing the UI to offer other versions which may be + available. + </li> + <li> + ArchivalURL mode now forwards request flags (cs_, js_, im_, etc) + when redirecting to a closer date. + </li> + <li> + ResourceStore implementation now allows retrying when confronted + with possibly-transient HTTP 502 errors. + </li> + </ul> + </subsection> + <subsection name="Minor Bug Fixes"> + <ul> + <li> + cdx-indexer (replacement for arc-indexer and warc-indexer) tool now + returns accurate error code on failure. + </li> + <li> + No longer sets JVM-wide default timezone to GMT - now it is set + appropriately on Calendars when needed. + </li> + <li> + Hostname comparison is now case-insensitive. + </li> + <li> + Server-relative archival url redirects now include query arguments + when redirecting. + </li> + <li> + Server-relative archival url redirects now include a Vary HTTP + header, to fix problems when a cache is used between clients and + the Wayback service. + </li> + <li> + Fixed problem with robots.txt caching within a single request, + which caused serious inefficiency. + </li> + <li> + Fixed problem with resources redirecting to alternate HTTP/HTTPS + version of themselves. + </li> + <li> + Fixed problem with accurately converting 14-digit Timestamps into + Date objects for later comparison. + </li> + <li> + Automatically remaps the oft-misused charset "iso-8859-1" to the + superset "cp1252". 
+ </li> + </ul> + </subsection> + </section> <section name="Release 1.4.2"> <subsection name="Features"> <ul> Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/requirements.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/requirements.xml 2010-10-21 23:01:21 UTC (rev 3295) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/requirements.xml 2010-10-21 23:02:07 UTC (rev 3296) @@ -10,16 +10,16 @@ <section name="Runtime Requirements"> <subsection name="JAVA"> <p> - Tested working with SUN v1.5.0_01. + Tested working with SUN v1.6. It is highly recommended you + use the latest version available for your operating system. </p> </subsection> <subsection name="Tomcat"> <p> - Tested working with Apache Tomcat 5.5, which can be - <a href="http://tomcat.apache.org/download-55.cgi"> - downloaded here - </a> - . + Tested working with Apache Tomcat + <a href="http://tomcat.apache.org/download-55.cgi">5.5</a>, + and + <a href="http://tomcat.apache.org/download-60.cgi">6.0</a>. </p> </subsection> Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml 2010-10-21 23:01:21 UTC (rev 3295) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml 2010-10-21 23:02:07 UTC (rev 3296) @@ -102,6 +102,10 @@ <li><b>canonicalizer</b> - an implementation of UrlCanonicalizer. 
See the section labeled URL Canonicalization below for more information.</li> + <li><b>filter</b> - an implementation of + ObjectFilter<CaptureSearchResult> which will remove + records at query time from the index.</li> + </ul> </p> <p> @@ -153,6 +157,7 @@ </ul> </p> </subsection> + <!-- <subsection name="NutchResourceIndex configuration options"> <p> This implementation, similar to the RemoteResourceIndex, accesses @@ -189,6 +194,7 @@ </ul> </p> </subsection> + --> </section> <section name="URL Canonicalization"> <subsection name="Introduction and Concepts">
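For readers following the resource_index.xml change in revision 3296, the new filter option might be wired into the Spring configuration roughly as sketched here. This is an illustration under assumptions, not text from the commit: the diff hunk does not name the index class the option list belongs to, so org.archive.wayback.resourceindex.LocalResourceIndex is assumed, and com.example.wayback.ExcludeHostFilter is a hypothetical ObjectFilter<CaptureSearchResult> implementation that does not ship with Wayback. The bean id is likewise made up.

```xml
<!-- Sketch only: the bean id and both class names are assumptions for illustration. -->
<bean id="localindex"
      class="org.archive.wayback.resourceindex.LocalResourceIndex">
  <!-- other properties (for example canonicalizer) omitted -->
  <property name="filter">
    <!-- hypothetical implementation of ObjectFilter<CaptureSearchResult>;
         records it rejects would be dropped from query results -->
    <bean class="com.example.wayback.ExcludeHostFilter" />
  </property>
</bean>
```

Because the documented filter removes records at query time, a configuration along these lines would leave the index files themselves untouched.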
From: <bra...@us...> - 2010-12-02 05:13:46
|
Revision: 3352 http://archive-access.svn.sourceforge.net/archive-access/?rev=3352&view=rev Author: bradtofel Date: 2010-12-02 05:13:40 +0000 (Thu, 02 Dec 2010) Log Message: ----------- DOC tweaks Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/downloads.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2010-12-02 05:12:57 UTC (rev 3351) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2010-12-02 05:13:40 UTC (rev 3352) @@ -177,7 +177,7 @@ </li> <li> <b>livewebPrefix</b> - a String URL prefix indicating the host, - port, and path to the correct Replay AccessPoint. + port, and path to an AccessPoint configured with Live Web fetching. </li> <li><b>locale</b> - A specific Locale to use for all requests within this AccessPoint, overriding the users preferred Locale Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/downloads.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/downloads.xml 2010-12-02 05:12:57 UTC (rev 3351) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/downloads.xml 2010-12-02 05:13:40 UTC (rev 3352) @@ -11,7 +11,7 @@ <subsection name="Releases"> <p>All releases are available off the - <a href="http://sourceforge.net/project/showfiles.php?group_id=118427">Sourceforge Downloads</a> page. 
Release notes can be found here, + <a href="http://sourceforge.net/project/showfiles.php?group_id=118427">Sourceforge Downloads</a> page. Full <a href="release_notes.html">Release notes</a> are available for releases beyond 1.2.0. </p> </subsection> Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml 2010-12-02 05:12:57 UTC (rev 3351) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml 2010-12-02 05:13:40 UTC (rev 3352) @@ -10,7 +10,7 @@ <body> <section name="Overview"> <p> - Wayback is distributed with an .jar file that + Wayback is distributed with a .jar file that simplifies creation of large-scale CDX files using hadoop. This code is experimental, and will primarily be useful only if your CDX files are very large - more than a few hundred GB (or more, depending on your Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml 2010-12-02 05:12:57 UTC (rev 3351) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml 2010-12-02 05:13:40 UTC (rev 3352) @@ -68,8 +68,8 @@ In the local, standalone mode, this software includes the capability to scan for new archived content in a specified location, and to automatically index and serve the new content as it appears. Directing - the Wayback to look for ARC files in the directory where an instance of - the Heritrix web crawler is writing ARC output should provide the + the Wayback to look for W/ARC files in the directory where an instance of + the Heritrix web crawler is writing W/ARC output should provide the capability to browse content archived by Heritrix as it is crawled. 
</p> </section> Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml 2010-12-02 05:12:57 UTC (rev 3351) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml 2010-12-02 05:13:40 UTC (rev 3352) @@ -43,8 +43,8 @@ Improved hadoop CDX generation capabilities for large scale indexes. </li> <li> - SWF (Flash) rewriting, to contextualize URLs embedded within flash - content. + SWF (Flash) rewriting, to contextualize absolute URLs embedded + within flash content. </li> <li> ArchivalUrl mode now accepts identity ("id_") flag to indicate