From: <bra...@us...> - 2009-10-15 22:27:30
Revision: 2805
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2805&view=rev
Author:   bradtofel
Date:     2009-10-15 22:27:18 +0000 (Thu, 15 Oct 2009)

Log Message:
-----------
Added getFilters(), setFilters(), addFilters() methods.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/ObjectFilterChain.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/ObjectFilterChain.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/ObjectFilterChain.java	2009-09-19 02:57:12 UTC (rev 2804)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/ObjectFilterChain.java	2009-10-15 22:27:18 UTC (rev 2805)
@@ -25,6 +25,7 @@
 package org.archive.wayback.util;
 
 import java.util.ArrayList;
+import java.util.Collection;
 
 /**
  * ObjectFilterChain implements AND logic to chain together multiple
@@ -48,6 +49,20 @@
     }
 
     /**
+     * @return the filters
+     */
+    public ArrayList<ObjectFilter<E>> getFilters() {
+        return filters;
+    }
+
+    /**
+     * @param filters the filters to set
+     */
+    public void setFilters(ArrayList<ObjectFilter<E>> filters) {
+        this.filters = filters;
+    }
+
+    /**
      * @param filter to be added to the chain. filters are processed in the
      * order they are added to the chain.
      */
@@ -55,6 +70,11 @@
         filters.add(filter);
     }
 
+    public void addFilters(Collection<ObjectFilter<E>> list) {
+        filters.addAll(list);
+    }
+
+
     /* (non-Javadoc)
      * @see org.archive.wayback.cdx.filter.RecordFilter#filterRecord(org.archive.wayback.cdx.CDXRecord)
      */

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
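For readers following the rev 2805 change, here is a minimal sketch of how the new accessors might be used. The ObjectFilter interface itself is not shown in the diff, so SketchFilter and its include() method are hypothetical stand-ins, and SketchFilterChain only mirrors the accessor shapes from the commit. It is not the Wayback implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

// Hypothetical stand-in for org.archive.wayback.util.ObjectFilter; the real
// interface is not shown in the diff, so this shape is an assumption.
interface SketchFilter<E> {
    boolean include(E o);
}

// Minimal AND-chain that mirrors the accessors added in rev 2805.
class SketchFilterChain<E> implements SketchFilter<E> {
    private ArrayList<SketchFilter<E>> filters = new ArrayList<SketchFilter<E>>();

    public ArrayList<SketchFilter<E>> getFilters() {
        return filters;
    }

    public void setFilters(ArrayList<SketchFilter<E>> filters) {
        this.filters = filters;
    }

    public void addFilter(SketchFilter<E> filter) {
        filters.add(filter);
    }

    public void addFilters(Collection<SketchFilter<E>> list) {
        filters.addAll(list);
    }

    // AND logic: an object passes only if every filter in the chain accepts it.
    public boolean include(E o) {
        for (SketchFilter<E> f : filters) {
            if (!f.include(o)) {
                return false;
            }
        }
        return true;
    }
}

class SketchFilterChainDemo {
    // Builds a chain from a collection in one call, as addFilters() allows.
    static SketchFilterChain<String> build() {
        SketchFilterChain<String> chain = new SketchFilterChain<String>();
        SketchFilter<String> nonEmpty = s -> !s.isEmpty();
        SketchFilter<String> isHttp = s -> s.startsWith("http");
        chain.addFilters(Arrays.asList(nonEmpty, isHttp));
        return chain;
    }

    public static void main(String[] args) {
        SketchFilterChain<String> chain = build();
        System.out.println(chain.include("http://example.org/"));
        System.out.println(chain.include("ftp://example.org/"));
    }
}
```

Note that getFilters() in the commit returns the internal list directly, so callers can modify the chain through it. The sketch keeps that behaviour rather than returning a copy.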
From: <bi...@us...> - 2009-09-19 02:57:20
Revision: 2804 http://archive-access.svn.sourceforge.net/archive-access/?rev=2804&view=rev Author: binzino Date: 2009-09-19 02:57:12 +0000 (Sat, 19 Sep 2009) Log Message: ----------- Updated documents for 0.12.8 release. Modified Paths: -------------- tags/nutchwax-0_12_8/archive/BUILD-NOTES.txt tags/nutchwax-0_12_8/archive/HOWTO.txt tags/nutchwax-0_12_8/archive/INSTALL.txt tags/nutchwax-0_12_8/archive/README.txt tags/nutchwax-0_12_8/archive/RELEASE-NOTES.txt Modified: tags/nutchwax-0_12_8/archive/BUILD-NOTES.txt =================================================================== --- tags/nutchwax-0_12_8/archive/BUILD-NOTES.txt 2009-08-27 23:56:42 UTC (rev 2803) +++ tags/nutchwax-0_12_8/archive/BUILD-NOTES.txt 2009-09-19 02:57:12 UTC (rev 2804) @@ -1,6 +1,6 @@ BUILD-NOTES.txt -2009-07-24 +2009-09-18 Aaron Binns ====================================================================== @@ -79,7 +79,7 @@ ---------------------------------------------------------------------- The file - /opt/nutchwax-0.12.7/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.8/conf/tika-mimetypes.xml contains two errors: one where a mimetype is referenced before it is defined; and a second where a definition has an illegal character. 
@@ -110,11 +110,11 @@ You can either apply these patches yourself, or copy an already-patched copy from: - /opt/nutchwax-0.12.7/contrib/archive/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.8/contrib/archive/conf/tika-mimetypes.xml to - /opt/nutchwax-0.12.7/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.8/conf/tika-mimetypes.xml ---------------------------------------------------------------------- Modified: tags/nutchwax-0_12_8/archive/HOWTO.txt =================================================================== --- tags/nutchwax-0_12_8/archive/HOWTO.txt 2009-08-27 23:56:42 UTC (rev 2803) +++ tags/nutchwax-0_12_8/archive/HOWTO.txt 2009-09-19 02:57:12 UTC (rev 2804) @@ -1,6 +1,6 @@ HOWTO.txt -2009-07-24 +2009-09-18 Aaron Binns Table of Contents @@ -26,7 +26,7 @@ This HOWTO assumes it is installed in - /opt/nutchwax-0.12.7 + /opt/nutchwax-0.12.8 2. ARC/WARC files. @@ -68,10 +68,10 @@ $ mkdir crawl $ cd crawl - $ /opt/nutchwax-0.12.7/bin/nutchwax import ../manifest - $ /opt/nutchwax-0.12.7/bin/nutch updatedb crawldb -dir segments - $ /opt/nutchwax-0.12.7/bin/nutch invertlinks linkdb -dir segments - $ /opt/nutchwax-0.12.7/bin/nutch index indexes crawldb linkdb segments/* + $ /opt/nutchwax-0.12.8/bin/nutchwax import ../manifest + $ /opt/nutchwax-0.12.8/bin/nutch updatedb crawldb -dir segments + $ /opt/nutchwax-0.12.8/bin/nutch invertlinks linkdb -dir segments + $ /opt/nutchwax-0.12.8/bin/nutch index indexes crawldb linkdb segments/* $ ls -F1 crawldb/ indexes/ @@ -96,7 +96,7 @@ $ cd ../ $ ls -F1 crawl/ - $ /opt/nutchwax-0.12.7/bin/nutchwax search computer + $ /opt/nutchwax-0.12.8/bin/nutchwax search computer This calls the NutchWaxBean to execute a simple keyword search for "computer". Use whatever query term you think appears in the @@ -109,7 +109,7 @@ The Nutch(WAX) web application is bundled with NutchWAX as - /opt/nutchwax-0.12.7/nutch-1.0-dev.war + /opt/nutchwax-0.12.8/nutch-1.0-dev.war Simply deploy that web application in the same fashion as with Nutch. 
Modified: tags/nutchwax-0_12_8/archive/INSTALL.txt =================================================================== --- tags/nutchwax-0_12_8/archive/INSTALL.txt 2009-08-27 23:56:42 UTC (rev 2803) +++ tags/nutchwax-0_12_8/archive/INSTALL.txt 2009-09-19 02:57:12 UTC (rev 2804) @@ -1,6 +1,6 @@ INSTALL.txt -2009-07-24 +2009-09-18 Aaron Binns Table of Contents @@ -63,10 +63,10 @@ ------------------ As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Although the Nutch project released 1.0 in early 2009, there were so -many changes that NutchWAX 0.12.7 is still built against pre-1.0 +many changes that NutchWAX 0.12.8 is still built against pre-1.0 codebase. -The specific SVN revision that NutchWAX 0.12.7 is built against is: +The specific SVN revision that NutchWAX 0.12.8 is built against is: 701524 @@ -81,14 +81,14 @@ SVN: NutchWAX ------------- -Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.7 +Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.8 source into Nutch's "contrib" directory. $ cd contrib - $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_7/archive + $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_8/archive This will create a sub-directory named "archive" containing the -NutchWAX 0.12.7 sources. +NutchWAX 0.12.8 sources. 
Build and install ----------------- @@ -115,7 +115,7 @@ $ cd /opt $ tar xvfz nutch-1.0-dev.tar.gz - $ mv nutch-1.0-dev nutchwax-0.12.7 + $ mv nutch-1.0-dev nutchwax-0.12.8 ====================================================================== @@ -128,24 +128,24 @@ Install it simply by untarring it, for example: $ cd /opt - $ tar xvfz nutchwax-0.12.7.tar.gz + $ tar xvfz nutchwax-0.12.8.tar.gz ====================================================================== Install start-up scripts ====================================================================== -NutchWAX 0.12.7 comes with a Unix init.d script which can be used to +NutchWAX 0.12.8 comes with a Unix init.d script which can be used to automatically start the searcher slaves for a multi-node search configuration. Assuming you installed NutchWAX as - /opt/nutchwax-0.12.7 + /opt/nutchwax-0.12.8 the script is found at - /opt/nutchwax-0.12.7/contrib/archive/etc/init.d/searcher-slave + /opt/nutchwax-0.12.8/contrib/archive/etc/init.d/searcher-slave This script can be placed in /etc/init.d then added to the list of startup scripts to run at bootup by using commands appropriate to your Modified: tags/nutchwax-0_12_8/archive/README.txt =================================================================== --- tags/nutchwax-0_12_8/archive/README.txt 2009-08-27 23:56:42 UTC (rev 2803) +++ tags/nutchwax-0_12_8/archive/README.txt 2009-09-19 02:57:12 UTC (rev 2804) @@ -1,6 +1,6 @@ README.txt -2009-07-24 +2009-09-18 Aaron Binns Table of Contents @@ -13,7 +13,7 @@ Introduction ====================================================================== -Welcome to NutchWAX 0.12.7! +Welcome to NutchWAX 0.12.8! NutchWAX is a set of add-ons to Nutch in order to index and search archived web data. 
Modified: tags/nutchwax-0_12_8/archive/RELEASE-NOTES.txt =================================================================== --- tags/nutchwax-0_12_8/archive/RELEASE-NOTES.txt 2009-08-27 23:56:42 UTC (rev 2803) +++ tags/nutchwax-0_12_8/archive/RELEASE-NOTES.txt 2009-09-19 02:57:12 UTC (rev 2804) @@ -1,9 +1,9 @@ RELEASE-NOTES.TXT -2009-07-24 +2009-09-18 Aaron Binns -Release notes for NutchWAX 0.12.7 +Release notes for NutchWAX 0.12.8 For the most recent updates and information on NutchWAX, please visit the project wiki at: @@ -15,26 +15,42 @@ Overview ====================================================================== -Aside from the bugs listen in the following section, the main -feature enhancement in Nutchwax 0.12.7 is the addition of a tool -to update the boost values in an index. A new command has been -added to the 'nutchwax' command-line driver: +The main enhancement in NutchWAX 0.12.8 is the ability to configure +HTTP headers to support caching. - nutchwax reboost +The Archive is starting to use Squid to cache the HTTP responses from +NutchWAX and some explicit HTTP response headers were needed to enable +this. Rather than relying on the servlet container (Tomcat/Jetty) to +add the response headers, we added a servlet filter to NutchWAX. -This command takes a pagerank.txt file and an index, calculates (new) -boost values based on the pagerank.txt file and applies them to the -index. The boost values are modified in-place in the index. +Right now the filter is very basic, in the web.xml file we now have -This feaure is used by the Archive in our deployments where web data -is continuously crawled and archived over time, with pagerank values -updating accordingly. 
+ <filter> + <filter-name>Cache Settings</filter-name> + <filter-class>org.archive.nutchwax.CacheSettingsFilter</filter-class> + <init-param> + <param-name>max-age</param-name> + <param-value>259200</param-value> <!-- 72 hours (in seconds) --> + </init-param> + </filter> -Before NW 0.12.7, content had to be re-indexed to take into account -updated pagerank information -- an expensive operation for large -indexes. Now, the pagerank-based boost values can be updated in -place. + <filter-mapping> + <filter-name>Cache Settings</filter-name> + <servlet-name>OpenSearch</servlet-name> + </filter-mapping> +which configures the filter to add a 'max-age' header with a 72 hour +limit. This filter is then applied to all instances of the OpenSearch +servlet. + +This allows browsers to cache the OpenSearch response for up to 72 +hours. It also enables any proxies between the browser and server to +cache the response as well. With the addition of Squid into our +deployment, we let Squid serve cached responses to repeat queries. + +Since our deployment updates every 4 days, a 72-hour expiration works +well. + ====================================================================== Issues ====================================================================== @@ -45,12 +61,10 @@ Issues resolved in this release: -WAX-55 NutchWaxBean's command-line searching should emit title along with other document metadata. +WAX-61 Change mime-type of OpenSearch XML response from text/xml to + application/xml. -WAX-56 Date-adder allows for duplicate dates to be added to a record. +WAX-62 Add ability to configure HTTP headers to support caching. -WAX-57 nutchwax command-driver doesn't properly enclose arguments in quotes. - -WAX-58 Need tool to update an existing index's norms based on pagerank information. - -WAX-59 Wrong log() function used in PageRankScoringFilter. +WAX-63 LengthNormUpdater returning error code if no fields in index + have norms is inconvenient. 
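The 0.12.8 release notes above quote a max-age of 259200 seconds for a 72-hour limit and mention a 4-day update cycle. A small sketch checking that arithmetic with java.time follows. The class and method names are illustrative only and do not come from NutchWAX.

```java
import java.time.Duration;

class MaxAgeSketch {
    // Seconds in the 72-hour cache lifetime quoted in the release notes.
    static long maxAgeSeconds() {
        return Duration.ofHours(72).getSeconds();
    }

    // Seconds in the 4-day update cycle mentioned in the release notes.
    static long updateIntervalSeconds() {
        return Duration.ofDays(4).getSeconds();
    }

    public static void main(String[] args) {
        // The header value the filter is configured to emit.
        System.out.println("max-age=" + maxAgeSeconds());
        // True when a cached response expires before the next index update.
        System.out.println(maxAgeSeconds() < updateIntervalSeconds());
    }
}
```

The 72-hour lifetime is shorter than the 4-day cycle, so a cached response expires before the deployment's next update.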
From: <bi...@us...> - 2009-08-27 23:56:50
Revision: 2803
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2803&view=rev
Author:   binzino
Date:     2009-08-27 23:56:42 +0000 (Thu, 27 Aug 2009)

Log Message:
-----------
Changed situation where no fields have norms from error to warning.

Modified Paths:
--------------
    tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java

Modified: tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java
===================================================================
--- tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java	2009-08-20 23:55:25 UTC (rev 2802)
+++ tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java	2009-08-27 23:56:42 UTC (rev 2803)
@@ -166,8 +166,8 @@
 
     if ( fieldNames.isEmpty( ) )
       {
-        System.out.println( "No fields with norms to update" );
-        System.exit( 2 );
+        System.out.println( "Warning: No fields with norms to update" );
+        System.exit( 0 );
       }
 
     Map<String,Integer> ranks = getPageRanks( pagerankFile );
From: <bi...@us...> - 2009-08-20 23:55:39
Revision: 2802 http://archive-access.svn.sourceforge.net/archive-access/?rev=2802&view=rev Author: binzino Date: 2009-08-20 23:55:25 +0000 (Thu, 20 Aug 2009) Log Message: ----------- Additional clean-up: remove obsolete elements, ensure output is UTF-8. Modified Paths: -------------- tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java Modified: tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java =================================================================== --- tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java 2009-08-20 21:40:39 UTC (rev 2801) +++ tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java 2009-08-20 23:55:25 UTC (rev 2802) @@ -92,9 +92,8 @@ // get parameters from request request.setCharacterEncoding("UTF-8"); String queryString = request.getParameter("query"); - if (queryString == null) - queryString = ""; - String urlQuery = URLEncoder.encode(queryString, "UTF-8"); + if (queryString == null) queryString = ""; + //String urlQuery = URLEncoder.encode(queryString, "UTF-8"); // the query language String queryLang = request.getParameter("lang"); @@ -133,12 +132,6 @@ } } - // Make up query string for use later drawing the 'rss' logo. - String params = "&hitsPerPage=" + hitsPerPage + - (queryLang == null ? "" : "&lang=" + queryLang) + - (sort == null ? "" : "&sort=" + sort + (reverse? "&reverse=true": "") + - (dedupField == null ? 
"" : "&dedupField=" + dedupField)); - Query query = Query.parse(queryString, queryLang, this.conf); if (NutchBean.LOG.isInfoEnabled()) { NutchBean.LOG.info("query: " + queryString); @@ -183,9 +176,6 @@ HitDetails[] details = bean.getDetails(show); Summary[] summaries = bean.getSummary(details, query); - String requestUrl = request.getRequestURL().toString(); - String base = requestUrl.substring(0, requestUrl.lastIndexOf('/')); - try { DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); factory.setNamespaceAware(true); @@ -200,14 +190,8 @@ Element channel = addNode(doc, rss, "channel"); addNode(doc, channel, "title", "Nutch: " + queryString); - addNode(doc, channel, "description", "Nutch search results for query: " - + queryString); - addNode(doc, channel, "link", - base+"/search.jsp" - +"?query="+urlQuery - +"&start="+start - +"&hitsPerDup="+hitsPerDup - +params); + addNode(doc, channel, "description", "Nutch search results for query: " + queryString); + addNode(doc, channel, "link", "" ); addNode(doc, channel, "opensearch", "totalResults", ""+totalResults); addNode(doc, channel, "opensearch", "startIndex", ""+start); @@ -232,30 +216,6 @@ } } - // Hmm, we should indicate whether or not the "totalResults" - // number as being exact some other way; perhaps just have a - // <nutch:totalIsExact>true</nutch:totalIsExact> element. - /* - if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show - || (!hits.totalIsExact() && (hits.getLength() > start+hitsPerPage))){ - addNode(doc, channel, "nutch", "nextPage", requestUrl - +"?query="+urlQuery - +"&start="+end - +"&hitsPerDup="+hitsPerDup - +params); - } - */ - - // Same here, this seems odd. 
- /* - if ((!hits.totalIsExact() && (hits.getLength() <= start+hitsPerPage))) { - addNode(doc, channel, "nutch", "showAllHits", requestUrl - +"?query="+urlQuery - +"&hitsPerDup="+0 - +params); - } - */ - for (int i = 0; i < length; i++) { Hit hit = show[i]; HitDetails detail = details[i]; @@ -274,24 +234,8 @@ addNode(doc, item, "description", summaries[i].toString() ); } addNode(doc, item, "link", url); - addNode(doc, item, "nutch", "site", hit.getDedupValue()); - addNode(doc, item, "nutch", "cache", base+"/cached.jsp?"+id); - addNode(doc, item, "nutch", "explain", base+"/explain.jsp?"+id - +"&query="+urlQuery+"&lang="+queryLang); - - // Probably don't need this as the XML processor/front-end can - // easily do this themselves. - if (hit.moreFromDupExcluded()) { - addNode(doc, item, "nutch", "moreFromSite", requestUrl - +"?query=" - +URLEncoder.encode("site:"+hit.getDedupValue() - +" "+queryString, "UTF-8") - +"&hitsPerSite="+0 - +params); - } - for (int j = 0; j < detail.getLength(); j++) { // add all from detail String field = detail.getField(j); if (!SKIP_DETAILS.contains(field)) @@ -304,9 +248,9 @@ DOMSource source = new DOMSource(doc); TransformerFactory transFactory = TransformerFactory.newInstance(); Transformer transformer = transFactory.newTransformer(); - transformer.setOutputProperty("indent", "yes"); + transformer.setOutputProperty( javax.xml.transform.OutputKeys.ENCODING, "UTF-8" ); StreamResult result = new StreamResult(response.getOutputStream()); - response.setContentType("application/xml"); + response.setContentType("application/rss+xml"); transformer.transform(source, result); } catch (javax.xml.parsers.ParserConfigurationException e) { This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2009-08-20 21:40:50
Revision: 2801
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2801&view=rev
Author:   binzino
Date:     2009-08-20 21:40:39 +0000 (Thu, 20 Aug 2009)

Log Message:
-----------
Added configuration of CacheSettingsServlet. Removed obsolete mappings for Nutch CacheServlet.

Modified Paths:
--------------
    tags/nutchwax-0_12_8/archive/src/nutch/src/web/web.xml

Modified: tags/nutchwax-0_12_8/archive/src/nutch/src/web/web.xml
===================================================================
--- tags/nutchwax-0_12_8/archive/src/nutch/src/web/web.xml	2009-08-20 21:03:30 UTC (rev 2800)
+++ tags/nutchwax-0_12_8/archive/src/nutch/src/web/web.xml	2009-08-20 21:40:39 UTC (rev 2801)
@@ -30,21 +30,11 @@
 </listener>
 
 <servlet>
-  <servlet-name>Cached</servlet-name>
-  <servlet-class>org.apache.nutch.servlet.Cached</servlet-class>
-</servlet>
-
-<servlet>
   <servlet-name>OpenSearch</servlet-name>
   <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class>
 </servlet>
 
 <servlet-mapping>
-  <servlet-name>Cached</servlet-name>
-  <url-pattern>/servlet/cached</url-pattern>
-</servlet-mapping>
-
-<servlet-mapping>
   <servlet-name>OpenSearch</servlet-name>
   <url-pattern>/opensearch</url-pattern>
 </servlet-mapping>
@@ -55,11 +45,25 @@
 </servlet-mapping>
 
 <filter>
+  <filter-name>Cache Settings</filter-name>
+  <filter-class>org.archive.nutchwax.CacheSettingsFilter</filter-class>
+  <init-param>
+    <param-name>max-age</param-name>
+    <param-value>259200</param-value> <!-- 72 hours (in seconds) -->
+  </init-param>
+</filter>
+
+<filter-mapping>
+  <filter-name>Cache Settings</filter-name>
+  <servlet-name>OpenSearch</servlet-name>
+</filter-mapping>
+
+<filter>
   <filter-name>XSLT Filter</filter-name>
   <filter-class>org.archive.nutchwax.XSLTFilter</filter-class>
   <init-param>
     <param-name>xsltUrl</param-name>
-    <param-value>webapps/nutchwax-0.12.4/search.xsl</param-value>
+    <param-value>webapps/nutchwax-0.12.8/search.xsl</param-value>
   </init-param>
 </filter>
 
From: <bi...@us...> - 2009-08-20 21:03:38
Revision: 2800
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2800&view=rev
Author:   binzino
Date:     2009-08-20 21:03:30 +0000 (Thu, 20 Aug 2009)

Log Message:
-----------
Fix WAX-62. Add CacheSettingsFilter class to set "Cache-Control" HTTP response header, as well as set the "Date" header.

Added Paths:
-----------
    tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/CacheSettingsFilter.java

Added: tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/CacheSettingsFilter.java
===================================================================
--- tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/CacheSettingsFilter.java	(rev 0)
+++ tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/CacheSettingsFilter.java	2009-08-20 21:03:30 UTC (rev 2800)
@@ -0,0 +1,92 @@
+/*
+ * Copyright (C) 2008 Internet Archive.
+ *
+ * This file is part of the archive-access tools project
+ * (http://sourceforge.net/projects/archive-access).
+ *
+ * The archive-access tools are free software; you can redistribute them and/or
+ * modify them under the terms of the GNU Lesser Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or any
+ * later version.
+ *
+ * The archive-access tools are distributed in the hope that they will be
+ * useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser
+ * Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser Public License along with
+ * the archive-access tools; if not, write to the Free Software Foundation,
+ * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+package org.archive.nutchwax;
+
+import java.io.IOException;
+import java.io.OutputStream;
+import java.io.PrintWriter;
+import java.io.ByteArrayInputStream;
+import java.io.ByteArrayOutputStream;
+
+import javax.servlet.Filter;
+import javax.servlet.FilterChain;
+import javax.servlet.FilterConfig;
+import javax.servlet.ServletContext;
+import javax.servlet.ServletException;
+import javax.servlet.ServletOutputStream;
+import javax.servlet.ServletRequest;
+import javax.servlet.ServletResponse;
+import javax.servlet.http.HttpServletResponse;
+import javax.servlet.http.HttpServletResponseWrapper;
+
+import javax.xml.transform.Source;
+import javax.xml.transform.stream.StreamSource;
+import javax.xml.transform.Templates;
+import javax.xml.transform.TransformerFactory;
+import javax.xml.transform.Transformer;
+import javax.xml.transform.stream.StreamResult;
+
+
+public class CacheSettingsFilter implements Filter
+{
+  private String maxAge;
+
+  public void init( FilterConfig config )
+    throws ServletException
+  {
+    this.maxAge = config.getInitParameter( "max-age" );
+
+    if ( this.maxAge != null )
+      {
+        this.maxAge = this.maxAge.trim( );
+
+        if ( this.maxAge.length( ) == 0 )
+          {
+            this.maxAge = null;
+          }
+        else
+          {
+            this.maxAge = "max-age=" + this.maxAge;
+          }
+      }
+  }
+
+  public void doFilter( ServletRequest request, ServletResponse response, FilterChain chain )
+    throws IOException, ServletException
+  {
+    HttpServletResponse res = (HttpServletResponse) response;
+
+    res.setDateHeader( "Date", System.currentTimeMillis( ) );
+
+    if ( this.maxAge != null )
+      {
+        res.addHeader( "Cache-Control", this.maxAge );
+      }
+
+    chain.doFilter( request, res );
+  }
+
+  public void destroy()
+  {
+
+  }
+
+}
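The init() logic in CacheSettingsFilter reduces to a small string transformation, sketched below without the servlet API so it can be run on its own. CacheControlSketch and cacheControlValue() are illustrative names, not part of NutchWAX. The behaviour is taken from the diff above: a missing or blank max-age parameter disables the header, and anything else is trimmed and prefixed with "max-age=".

```java
class CacheControlSketch {
    // Mirrors the init() logic in the diff: null or blank input means
    // "do not send a Cache-Control header"; otherwise trim and prefix.
    static String cacheControlValue(String rawMaxAge) {
        if (rawMaxAge == null) {
            return null;
        }
        String trimmed = rawMaxAge.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        return "max-age=" + trimmed;
    }

    public static void main(String[] args) {
        // Value as configured in web.xml, with stray whitespace added.
        System.out.println(cacheControlValue(" 259200 "));
        // A blank parameter disables the header.
        System.out.println(cacheControlValue("   "));
    }
}
```

As in the committed filter, the parameter is not checked to be numeric. A deployment that wants stricter handling could parse it with Long.parseLong before building the header value.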
From: <bi...@us...> - 2009-08-20 21:02:48
Revision: 2799
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2799&view=rev
Author:   binzino
Date:     2009-08-20 21:02:32 +0000 (Thu, 20 Aug 2009)

Log Message:
-----------
Fix WAX-61. Changed content-type to be "application/xml".

Modified Paths:
--------------
    tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java

Modified: tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java
===================================================================
--- tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java	2009-08-20 20:36:42 UTC (rev 2798)
+++ tags/nutchwax-0_12_8/archive/src/java/org/archive/nutchwax/OpenSearchServlet.java	2009-08-20 21:02:32 UTC (rev 2799)
@@ -306,7 +306,7 @@
       Transformer transformer = transFactory.newTransformer();
       transformer.setOutputProperty("indent", "yes");
       StreamResult result = new StreamResult(response.getOutputStream());
-      response.setContentType("text/xml");
+      response.setContentType("application/xml");
       transformer.transform(source, result);
 
     } catch (javax.xml.parsers.ParserConfigurationException e) {
From: <bi...@us...> - 2009-08-20 20:36:56
Revision: 2798
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2798&view=rev
Author:   binzino
Date:     2009-08-20 20:36:42 +0000 (Thu, 20 Aug 2009)

Log Message:
-----------
Creation of 0.12.8 tag/branch.

Added Paths:
-----------
    tags/nutchwax-0_12_8/
Revision: 2797
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2797&view=rev
Author:   bradtofel
Date:     2009-08-18 21:54:46 +0000 (Tue, 18 Aug 2009)

Log Message:
-----------
BUGFIX(unreported...) Need to set exit code to 1 on exception, not on no exception...

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java	2009-08-12 01:02:39 UTC (rev 2796)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java	2009-08-18 21:54:46 UTC (rev 2797)
@@ -136,10 +136,10 @@
                 pw.println(lines.next());
             }
             pw.close();
-            System.exit(1);
 
         } catch (Exception e) {
             e.printStackTrace();
+            System.exit(1);
         }
     }
 
Revision: 2796
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2796&view=rev
Author:   bradtofel
Date:     2009-08-12 01:02:39 +0000 (Wed, 12 Aug 2009)

Log Message:
-----------
BUGFIX(ACC-80): was not setting exit status to non-zero when exceptions were caught... Uhg.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java	2009-07-24 20:51:49 UTC (rev 2795)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java	2009-08-12 01:02:39 UTC (rev 2796)
@@ -136,6 +136,8 @@
                 pw.println(lines.next());
             }
             pw.close();
+            System.exit(1);
+
         } catch (Exception e) {
             e.printStackTrace();
         }
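The two WarcIndexer commits (revs 2796 and 2797) concern which code path sets a non-zero exit status. Rev 2796 placed System.exit(1) on the success path, and rev 2797 moved it into the catch block. One way to make that kind of slip harder is to compute the exit code in a single method and call System.exit() once, as in the sketch below. ExitStatusSketch and exitCodeFor() are hypothetical names and this is not how WarcIndexer is structured.

```java
import java.util.concurrent.Callable;

class ExitStatusSketch {
    // Returns 0 when the task completes and 1 when it throws, which is
    // the behaviour rev 2797 restores for the indexer's main method.
    static int exitCodeFor(Callable<Void> task) {
        try {
            task.call();
            return 0;
        } catch (Exception e) {
            e.printStackTrace();
            return 1;
        }
    }

    public static void main(String[] args) {
        // Simulated failure; a real caller would pass the indexing work here.
        int code = exitCodeFor(() -> {
            throw new IllegalStateException("simulated indexing failure");
        });
        // Single exit point, so success and failure cannot be swapped silently.
        System.exit(code);
    }
}
```

Keeping the decision in a method that returns an int also lets the success and failure paths be checked in a unit test without terminating the JVM.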
From: <bi...@us...> - 2009-07-24 20:52:02
Revision: 2795 http://archive-access.svn.sourceforge.net/archive-access/?rev=2795&view=rev Author: binzino Date: 2009-07-24 20:51:49 +0000 (Fri, 24 Jul 2009) Log Message: ----------- Updated documentation for NutchWAX 0.12.7 release. Modified Paths: -------------- tags/nutchwax-0_12_7/archive/BUILD-NOTES.txt tags/nutchwax-0_12_7/archive/HOWTO.txt tags/nutchwax-0_12_7/archive/INSTALL.txt tags/nutchwax-0_12_7/archive/README.txt tags/nutchwax-0_12_7/archive/RELEASE-NOTES.txt Modified: tags/nutchwax-0_12_7/archive/BUILD-NOTES.txt =================================================================== --- tags/nutchwax-0_12_7/archive/BUILD-NOTES.txt 2009-07-24 19:13:40 UTC (rev 2794) +++ tags/nutchwax-0_12_7/archive/BUILD-NOTES.txt 2009-07-24 20:51:49 UTC (rev 2795) @@ -1,6 +1,6 @@ BUILD-NOTES.txt -2009-07-09 +2009-07-24 Aaron Binns ====================================================================== @@ -79,7 +79,7 @@ ---------------------------------------------------------------------- The file - /opt/nutchwax-0.12.6/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.7/conf/tika-mimetypes.xml contains two errors: one where a mimetype is referenced before it is defined; and a second where a definition has an illegal character. 
@@ -110,11 +110,11 @@ You can either apply these patches yourself, or copy an already-patched copy from: - /opt/nutchwax-0.12.6/contrib/archive/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.7/contrib/archive/conf/tika-mimetypes.xml to - /opt/nutchwax-0.12.6/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.7/conf/tika-mimetypes.xml ---------------------------------------------------------------------- Modified: tags/nutchwax-0_12_7/archive/HOWTO.txt =================================================================== --- tags/nutchwax-0_12_7/archive/HOWTO.txt 2009-07-24 19:13:40 UTC (rev 2794) +++ tags/nutchwax-0_12_7/archive/HOWTO.txt 2009-07-24 20:51:49 UTC (rev 2795) @@ -1,6 +1,6 @@ HOWTO.txt -2009-07-09 +2009-07-24 Aaron Binns Table of Contents @@ -26,7 +26,7 @@ This HOWTO assumes it is installed in - /opt/nutchwax-0.12.6 + /opt/nutchwax-0.12.7 2. ARC/WARC files. @@ -68,10 +68,10 @@ $ mkdir crawl $ cd crawl - $ /opt/nutchwax-0.12.6/bin/nutchwax import ../manifest - $ /opt/nutchwax-0.12.6/bin/nutch updatedb crawldb -dir segments - $ /opt/nutchwax-0.12.6/bin/nutch invertlinks linkdb -dir segments - $ /opt/nutchwax-0.12.6/bin/nutch index indexes crawldb linkdb segments/* + $ /opt/nutchwax-0.12.7/bin/nutchwax import ../manifest + $ /opt/nutchwax-0.12.7/bin/nutch updatedb crawldb -dir segments + $ /opt/nutchwax-0.12.7/bin/nutch invertlinks linkdb -dir segments + $ /opt/nutchwax-0.12.7/bin/nutch index indexes crawldb linkdb segments/* $ ls -F1 crawldb/ indexes/ @@ -96,7 +96,7 @@ $ cd ../ $ ls -F1 crawl/ - $ /opt/nutchwax-0.12.6/bin/nutchwax search computer + $ /opt/nutchwax-0.12.7/bin/nutchwax search computer This calls the NutchWaxBean to execute a simple keyword search for "computer". Use whatever query term you think appears in the @@ -109,7 +109,7 @@ The Nutch(WAX) web application is bundled with NutchWAX as - /opt/nutchwax-0.12.6/nutch-1.0-dev.war + /opt/nutchwax-0.12.7/nutch-1.0-dev.war Simply deploy that web application in the same fashion as with Nutch. 
Modified: tags/nutchwax-0_12_7/archive/INSTALL.txt =================================================================== --- tags/nutchwax-0_12_7/archive/INSTALL.txt 2009-07-24 19:13:40 UTC (rev 2794) +++ tags/nutchwax-0_12_7/archive/INSTALL.txt 2009-07-24 20:51:49 UTC (rev 2795) @@ -1,6 +1,6 @@ INSTALL.txt -2009-07-09 +2009-07-24 Aaron Binns Table of Contents @@ -63,10 +63,10 @@ ------------------ As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Although the Nutch project released 1.0 in early 2009, there were so -many changes that NutchWAX 0.12.6 is still built against pre-1.0 +many changes that NutchWAX 0.12.7 is still built against pre-1.0 codebase. -The specific SVN revision that NutchWAX 0.12.6 is built against is: +The specific SVN revision that NutchWAX 0.12.7 is built against is: 701524 @@ -81,14 +81,14 @@ SVN: NutchWAX ------------- -Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.6 +Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.7 source into Nutch's "contrib" directory. $ cd contrib - $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_6/archive + $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_7/archive This will create a sub-directory named "archive" containing the -NutchWAX 0.12.6 sources. +NutchWAX 0.12.7 sources. 
Build and install ----------------- @@ -115,7 +115,7 @@ $ cd /opt $ tar xvfz nutch-1.0-dev.tar.gz - $ mv nutch-1.0-dev nutchwax-0.12.6 + $ mv nutch-1.0-dev nutchwax-0.12.7 ====================================================================== @@ -128,24 +128,24 @@ Install it simply by untarring it, for example: $ cd /opt - $ tar xvfz nutchwax-0.12.6.tar.gz + $ tar xvfz nutchwax-0.12.7.tar.gz ====================================================================== Install start-up scripts ====================================================================== -NutchWAX 0.12.6 comes with a Unix init.d script which can be used to +NutchWAX 0.12.7 comes with a Unix init.d script which can be used to automatically start the searcher slaves for a multi-node search configuration. Assuming you installed NutchWAX as - /opt/nutchwax-0.12.6 + /opt/nutchwax-0.12.7 the script is found at - /opt/nutchwax-0.12.6/contrib/archive/etc/init.d/searcher-slave + /opt/nutchwax-0.12.7/contrib/archive/etc/init.d/searcher-slave This script can be placed in /etc/init.d then added to the list of startup scripts to run at bootup by using commands appropriate to your Modified: tags/nutchwax-0_12_7/archive/README.txt =================================================================== --- tags/nutchwax-0_12_7/archive/README.txt 2009-07-24 19:13:40 UTC (rev 2794) +++ tags/nutchwax-0_12_7/archive/README.txt 2009-07-24 20:51:49 UTC (rev 2795) @@ -1,6 +1,6 @@ README.txt -2009-07-09 +2009-07-24 Aaron Binns Table of Contents @@ -13,7 +13,7 @@ Introduction ====================================================================== -Welcome to NutchWAX 0.12.6! +Welcome to NutchWAX 0.12.7! NutchWAX is a set of add-ons to Nutch in order to index and search archived web data. 
Modified: tags/nutchwax-0_12_7/archive/RELEASE-NOTES.txt =================================================================== --- tags/nutchwax-0_12_7/archive/RELEASE-NOTES.txt 2009-07-24 19:13:40 UTC (rev 2794) +++ tags/nutchwax-0_12_7/archive/RELEASE-NOTES.txt 2009-07-24 20:51:49 UTC (rev 2795) @@ -1,9 +1,9 @@ RELEASE-NOTES.TXT -2009-07-09 +2009-07-24 Aaron Binns -Release notes for NutchWAX 0.12.6 +Release notes for NutchWAX 0.12.7 For the most recent updates and information on NutchWAX, please visit the project wiki at: @@ -15,44 +15,26 @@ Overview ====================================================================== -NutchWAX 0.12.6 contains a few convenient enhancements to 0.12.5 +Aside from the bugs listen in the following section, the main +feature enhancement in Nutchwax 0.12.7 is the addition of a tool +to update the boost values in an index. A new command has been +added to the 'nutchwax' command-line driver: - o Addition of 'search' and 'merge' commands to the 'nutchwax' - command-line driver. Now one can do + nutchwax reboost - nutchwax search foo +This command takes a pagerank.txt file and an index, calculates (new) +boost values based on the pagerank.txt file and applies them to the +index. The boost values are modified in-place in the index. - instead of +This feaure is used by the Archive in our deployments where web data +is continuously crawled and archived over time, with pagerank values +updating accordingly. - nutch org.archive.nutchwax.NutchWaxBean foo +Before NW 0.12.7, content had to be re-indexed to take into account +updated pagerank information -- an expensive operation for large +indexes. Now, the pagerank-based boost values can be updated in +place. - Similarly, the new NutchWAX index merging, which supports - parallel indexes, can be invoked via - - nutchwax merge output-index input-index... - - o Merging of parallel indexes into a single index. 
- - NutchWAX has a copy/paste/enhanced version of the Nutch index - merger that now supports parallel indexes. This allows parallel - indexes to be merged into a single index. To use this feature, - add the "-p" option to the NutchWAX 'merge' command indicating the - input index directories contain parallel index sub-dirs. - - nutchwax merge -p output-index input-pindexes... - - o Option to specify the directory where the index(es) and segments - live when doing a command-line search. - - Previously the directory was obtained from the nutch-default.xml - configuration file. This is inconvenient when testing different - indexes as one would have to edit the config file each time to - specify a different index to search. - - Now, the directory can be specified on the command line: - - nutchwax search -d <dir> <query> - ====================================================================== Issues ====================================================================== @@ -63,9 +45,12 @@ Issues resolved in this release: -WAX-51 Enhance index merging to combine parallel indexes. +WAX-55 NutchWaxBean's command-line searching should emit title along with other document metadata. -WAX-52 Add option to NutchWaxBean to specify directory where - index+segments are to be found. +WAX-56 Date-adder allows for duplicate dates to be added to a record. -WAX-53 IndexMerging parallel indexes fails when index is empty. +WAX-57 nutchwax command-driver doesn't properly enclose arguments in quotes. + +WAX-58 Need tool to update an existing index's norms based on pagerank information. + +WAX-59 Wrong log() function used in PageRankScoringFilter. This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2009-07-24 19:13:50
Revision: 2794
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2794&view=rev
Author:   binzino
Date:     2009-07-24 19:13:40 +0000 (Fri, 24 Jul 2009)

Log Message:
-----------
Oops, had changed this to use the NutchBean for some temporary testing and forgot to set it back to NutchWaxBean.

Modified Paths:
--------------
    tags/nutchwax-0_12_7/archive/bin/nutchwax

Modified: tags/nutchwax-0_12_7/archive/bin/nutchwax
===================================================================
--- tags/nutchwax-0_12_7/archive/bin/nutchwax 2009-07-24 18:45:29 UTC (rev 2793)
+++ tags/nutchwax-0_12_7/archive/bin/nutchwax 2009-07-24 19:13:40 UTC (rev 2794)
@@ -76,7 +76,7 @@
     ;;
   search)
     shift
-    ${NUTCH_HOME}/bin/nutch org.apache.nutch.searcher.NutchBean "$@"
+    ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.NutchWaxBean "$@"
     ;;
   *)
     echo ""
From: <bi...@us...> - 2009-07-24 18:45:51
Revision: 2793 http://archive-access.svn.sourceforge.net/archive-access/?rev=2793&view=rev Author: binzino Date: 2009-07-24 18:45:29 +0000 (Fri, 24 Jul 2009) Log Message: ----------- WAX-55. Added 'title' to the output. Modified Paths: -------------- tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/NutchWaxBean.java Modified: tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/NutchWaxBean.java =================================================================== --- tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-07-24 18:43:16 UTC (rev 2792) +++ tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2009-07-24 18:45:29 UTC (rev 2793) @@ -215,6 +215,8 @@ * <pre> * <listener> * <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> + * </listener> + * <listener> * <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class> * </listener> * </pre> @@ -329,6 +331,8 @@ + java.util.Arrays.asList( details[i].getValues( "digest" ) ) + " " + java.util.Arrays.asList( details[i].getValues( "date" ) ) + + " " + + java.util.Arrays.asList( details[i].getValues( "title" ) ) + "\n" + summaries[i] ); } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2009-07-24 18:43:35
Revision: 2792 http://archive-access.svn.sourceforge.net/archive-access/?rev=2792&view=rev Author: binzino Date: 2009-07-24 18:43:16 +0000 (Fri, 24 Jul 2009) Log Message: ----------- WAX-56. Dates from all sources, including the text file are put into a Set before being added to the output directory. Modified Paths: -------------- tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/DateAdder.java Modified: tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/DateAdder.java =================================================================== --- tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/DateAdder.java 2009-07-22 21:36:23 UTC (rev 2791) +++ tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/DateAdder.java 2009-07-24 18:43:16 UTC (rev 2792) @@ -76,7 +76,7 @@ recordsStream = new FileInputStream( recordsFile ); } - // Read date-addition records from stdin. + // Read date-addition records. Map<String,String> dateRecords = new HashMap<String,String>( ); BufferedReader br = new BufferedReader( new InputStreamReader( recordsStream, "UTF-8" ) ); String line; @@ -89,7 +89,7 @@ continue; } - // Key is hash+url, value is String which is a " "-separated list of dates + // Key is url+hash, value is String which is a " "-separated list of dates String key = fields[0] + fields[1]; String dates = dateRecords.get( key ); if ( dates != null ) @@ -113,6 +113,7 @@ } IndexWriter writer = new IndexWriter( destIndexDir, new WhitespaceAnalyzer( ), true ); + writer.setUseCompoundFile(false); UrlCanonicalizer canonicalizer = getCanonicalizer( this.getConf( ) ); @@ -132,23 +133,20 @@ Collections.addAll( uniqueDates, dates ); } - for ( String date : uniqueDates ) - { - newDoc.add( new Field( NutchWax.DATE_KEY, date, Field.Store.YES, Field.Index.UN_TOKENIZED ) ); - } // Obtain the new dates for the document. 
- String newDates = null; try { // First, apply URL canonicalization from Wayback String canonicalizedUrl = canonicalizer.urlStringToKey( oldDoc.get( NutchWax.URL_KEY ) ); - // Now, get the digest+URL of the document, look for it in - // the updateRecords and if found, add the date. + // As above, they key is hash+url, value will bea a String which is a " "-separated list of dates String key = canonicalizedUrl + oldDoc.get( NutchWax.DIGEST_KEY ); - newDates = dateRecords.get( key ); + String newDates = dateRecords.get( key ); + + // If there are any new dates, add them to the set. + if ( newDates != null ) Collections.addAll( uniqueDates, newDates.split( "\\s+" ) ); } catch ( Exception e ) { @@ -157,13 +155,10 @@ System.err.println( "WARN: Not adding dates on malformed URI: " + oldDoc.get( NutchWax.URL_KEY ) ); } - // If there are any new dates, add them to the new document. - if ( newDates != null ) + // Add the updated list of uniqueDates, which the new (unique) ones. + for ( String date : uniqueDates ) { - for ( String date : newDates.split("\\s+") ) - { - newDoc.add( new Field( NutchWax.DATE_KEY, date, Field.Store.YES, Field.Index.UN_TOKENIZED ) ); - } + newDoc.add( new Field( NutchWax.DATE_KEY, date, Field.Store.YES, Field.Index.NO_NORMS ) ); } // Finally, add the new document to the new index. This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
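The WAX-56 change above collects every date for a record into a Set before writing the date fields, so a date that appears both in the existing document and in the date-addition records is emitted only once. A minimal sketch of that merge step, outside of Lucene, might look like the following. The names DateMergeSketch and mergeDates are hypothetical, and LinkedHashSet is used here only to keep the output order stable; the diff does not show which Set implementation DateAdder actually uses.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper illustrating the de-duplication idea from WAX-56.
// It is not part of NutchWAX.
final class DateMergeSketch {

    // existing: dates already stored on the document (may be null).
    // newDates: a " "-separated list of dates from the records file (may be null).
    static Set<String> mergeDates(String[] existing, String newDates) {
        Set<String> uniqueDates = new LinkedHashSet<String>();

        if (existing != null) {
            Collections.addAll(uniqueDates, existing);
        }

        // Guard against null and blank input so no empty token is added.
        if (newDates != null && !newDates.trim().isEmpty()) {
            Collections.addAll(uniqueDates, newDates.trim().split("\\s+"));
        }

        return uniqueDates;
    }

    public static void main(String[] args) {
        String[] existing = { "20090101", "20090202" };
        // "20090202" is present on both sides and should appear only once.
        System.out.println(mergeDates(existing, "20090202 20090303"));
    }
}
```

Each element of the returned set would then be written as one date field, which mirrors the single loop over uniqueDates in the patched code.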
From: <bi...@us...> - 2009-07-22 21:36:24
Revision: 2791 http://archive-access.svn.sourceforge.net/archive-access/?rev=2791&view=rev Author: binzino Date: 2009-07-22 21:36:23 +0000 (Wed, 22 Jul 2009) Log Message: ----------- WAX-58. Initial revision. Added Paths: ----------- tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java Added: tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java =================================================================== --- tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java (rev 0) +++ tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/LengthNormUpdater.java 2009-07-22 21:36:23 UTC (rev 2791) @@ -0,0 +1,333 @@ +package org.archive.nutchwax.tools; + +/** + * Copyright 2006 The Apache Software Foundation + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +import java.io.BufferedReader; +import java.io.InputStreamReader; +import java.io.FileInputStream; +import java.io.IOException; +import java.util.HashMap; +import java.util.Map; +import java.util.Set; +import java.util.Collection; +import java.util.HashSet; + +import org.apache.lucene.document.Document; +import org.apache.lucene.index.Term; +import org.apache.lucene.index.TermEnum; +import org.apache.lucene.index.TermDocs; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.search.Similarity; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.FSDirectory; + + +import org.apache.nutch.indexer.NutchSimilarity; + +/** + * This is heavily cribbed from org.apache.lucene.misc.LengthNormModifier + */ +public class LengthNormUpdater +{ + private static final String USAGE = + "Usage: LengthNormUpdater [OPTIONS] <pageranks> <index> [field1]...\n" + + "\n" + + "Update the norms of <index> with boosts based on values from <pageranks>\n" + + "\n" + + "Options:\n" + + "\t-s <classname> similarity implementation to use\n" + + "\t-v increase verbosity\n" + + "\n" + + "Reads the pagerank values from the <pageranks> file and calculates new\n" + + "norms for the documents based on the formula:\n" + + "\n" + + "\tnorm = similarity.lengthNorm * log10(pagerank)\n" + + "\n" + + "If fields are specified on the command-line, only they will be updated.\n" + + "If a specified field does not have norms, an error message is given and\n" + + "the program terminates without performing any updates.\n" + + "\n" + + "If no fields are given, all the fields in the index that have norms will\n" + + "be updated.\n" + + "\n" + + "The default similarity implementation is NutchSimilarity\n" + + "\n" + + "Examples:\n" + + "\n" + + "\tLengthNormUpdater pagerank.txt index\n" + + "\tLengthNormUpdater -v -v pagerank.txt index title content\n" + + "\n" + ; + + private static int VERBOSE = 0; + + /** + * + */ + public static void main( String[] args ) throws 
IOException + { + if ( args.length < 1 ) + { + System.err.print( USAGE ); + System.exit(1); + } + + Similarity s = new NutchSimilarity( ); + + int pos = 0; + for ( ; (pos < args.length) && args[pos].startsWith( "-" ) ; pos++ ) + { + if ( "-h".equals( args[pos] ) ) + { + System.out.println( USAGE ); + System.exit( 0 ); + } + else if ( "-v".equals( args[pos] ) ) + { + VERBOSE++; + } + else if ( "-s".equals( args[pos] ) ) + { + pos++; + + if ( pos == args.length ) + { + System.err.println( "Error: missing argument to option -s" ); + System.exit( 1 ); + } + + try + { + Class simClass = Class.forName(args[pos]); + s = (Similarity)simClass.newInstance(); + } + catch (Exception e) + { + System.err.println( "Couldn't instantiate similarity with empty constructor: " + args[pos] ); + e.printStackTrace(System.err); + System.exit( 1 ); + } + } + } + + if ( (pos + 2) > args.length ) + { + System.out.println( USAGE ); + System.exit( 1 ); + } + + String pagerankFile = args[pos++]; + + IndexReader reader = IndexReader.open( args[pos++] ); + + try + { + Set<String> fieldNames = new HashSet<String>( ); + if ( pos == args.length ) + { + // No fields specified on command-line, get a list of all + // fields in the index that have norms. + for ( String fieldName : (Collection<String>) reader.getFieldNames( IndexReader.FieldOption.ALL ) ) + { + if ( reader.hasNorms( fieldName ) ) + { + fieldNames.add( fieldName ); + } + } + } + else + { + // Verify all explicitly specified fields have norms. + for ( int i = pos ; i < args.length ; i++ ) + { + if ( ! 
reader.hasNorms( args[i] ) ) + { + System.err.println( "Error: No norms for field: " + args[i] ); + System.exit( 1 ); + } + + fieldNames.add( args[i] ); + } + } + + if ( fieldNames.isEmpty( ) ) + { + System.out.println( "No fields with norms to update" ); + System.exit( 2 ); + } + + Map<String,Integer> ranks = getPageRanks( pagerankFile ); + + for ( String fieldName : fieldNames ) + { + reSetNorms( reader, fieldName, ranks, s ); + } + + } + finally + { + if ( reader != null ) + { + reader.close( ); + } + + } + } + + + /** + * + */ + public static void reSetNorms( IndexReader reader, + String fieldName, + Map<String,Integer> ranks, + Similarity sim ) throws IOException + { + if ( VERBOSE > 0 ) System.out.println( "Updating field: " + fieldName ); + + int[] termCounts = new int[0]; + + TermEnum termEnum = null; + TermDocs termDocs = null; + + termCounts = new int[reader.maxDoc()]; + try + { + termEnum = reader.terms(new Term(fieldName,"")); + try + { + termDocs = reader.termDocs(); + do + { + Term term = termEnum.term(); + if (term != null && term.field().equals(fieldName)) + { + termDocs.seek(termEnum.term()); + while (termDocs.next()) + { + termCounts[termDocs.doc()] += termDocs.freq(); + } + } + } + while (termEnum.next()); + } + finally + { + if (null != termDocs) termDocs.close(); + } + } + finally + { + if (null != termEnum) termEnum.close(); + } + + for (int d = 0; d < termCounts.length; d++) + { + if ( ! 
reader.isDeleted(d) ) + { + Document doc = reader.document( d ); + + String url = doc.get( "url" ); + + if ( url != null ) + { + Integer rank = ranks.get( url ); + if ( rank == null ) continue; + + float originalNorm = sim.lengthNorm(fieldName, termCounts[d]); + byte encodedOrig = sim.encodeNorm(originalNorm); + float rankedNorm = originalNorm * (float) ( Math.log10( rank ) + 1 ); + byte encodedRank = sim.encodeNorm(rankedNorm); + + if ( VERBOSE > 1 ) System.out.println( fieldName + "\t" + d + "\t" + originalNorm + "\t" + encodedOrig + "\t" + rankedNorm + "\t" + encodedRank ); + + reader.setNorm(d, fieldName, encodedRank); + } + } + } + } + + /** + * Utility function to read a list of page-rank records from a file + * specified in the configuration. + */ + public static Map<String,Integer> getPageRanks( String filename ) + { + if ( VERBOSE > 0 ) System.out.println( "Reading pageranks from: " + filename ); + + Map<String,Integer> pageranks = new HashMap<String,Integer>( ); + + BufferedReader reader = null; + try + { + reader = new BufferedReader( new InputStreamReader( new FileInputStream( filename), "UTF-8" ) ); + + String line; + while ( (line = reader.readLine()) != null ) + { + String fields[] = line.split( "\\s+" ); + + if ( fields.length < 2 ) + { + System.err.println( "Malformed pagerank, not enough fields ("+fields.length+"): " + line ); + continue ; + } + + try + { + int rank = Integer.parseInt( fields[0] ); + String url = fields[1]; + + if ( rank < 0 ) + { + System.err.println( "Malformed pagerank, rank less than 0: " + line ); + } + + pageranks.put( url, rank ); + } + catch ( NumberFormatException nfe ) + { + System.err.println( "Malformed pagerank, rank not an integer: " + line ); + continue ; + } + } + } + catch ( IOException e ) + { + // Umm, what to do? + throw new RuntimeException( e ); + } + finally + { + try + { + if ( reader != null ) + { + reader.close( ); + } + } + catch ( IOException e ) + { + // Ignore it. 
+ } + } + + return pageranks; + } + + +}
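The getPageRanks() method in the commit above reads one record per line, with an integer rank followed by a URL, separated by whitespace. A rough sketch of that parsing step is given below, working on in-memory lines rather than a file. The class name PageRankParseSketch and the method parse are hypothetical. The committed code reports a negative rank but still stores it; this sketch instead skips any rank below 1, which is an assumption of the example, made because log10 of zero or of a negative number does not yield a usable boost factor.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of parsing "rank url" records. It is not the
// NutchWAX implementation.
final class PageRankParseSketch {

    static Map<String, Integer> parse(Iterable<String> lines) {
        Map<String, Integer> pageranks = new HashMap<String, Integer>();

        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");

            // Need at least a rank and a URL.
            if (fields.length < 2) {
                continue;
            }

            try {
                int rank = Integer.parseInt(fields[0]);

                // Assumption of this sketch: ranks below 1 are dropped.
                if (rank < 1) {
                    continue;
                }

                pageranks.put(fields[1], rank);
            } catch (NumberFormatException nfe) {
                // Rank is not an integer: skip the record.
                continue;
            }
        }

        return pageranks;
    }

    public static void main(String[] args) {
        Map<String, Integer> ranks = parse(Arrays.asList(
            "3 http://example.org/",
            "x http://example.org/not-a-number",
            "-1 http://example.org/negative",
            "missing-url"));
        System.out.println(ranks);
    }
}
```

Only the first record survives in this example; the other three are rejected for a non-integer rank, a rank below 1, and a missing URL respectively.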
From: <bi...@us...> - 2009-07-22 21:35:10
Revision: 2790 http://archive-access.svn.sourceforge.net/archive-access/?rev=2790&view=rev Author: binzino Date: 2009-07-22 21:35:08 +0000 (Wed, 22 Jul 2009) Log Message: ----------- WAX-57. Fixed, added quotes around $@. Modified Paths: -------------- tags/nutchwax-0_12_7/archive/bin/nutchwax Modified: tags/nutchwax-0_12_7/archive/bin/nutchwax =================================================================== --- tags/nutchwax-0_12_7/archive/bin/nutchwax 2009-07-22 21:22:02 UTC (rev 2789) +++ tags/nutchwax-0_12_7/archive/bin/nutchwax 2009-07-22 21:35:08 UTC (rev 2790) @@ -40,39 +40,43 @@ case "$1" in import) shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.Importer $@ + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.Importer "$@" ;; pagerankdb) shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.PageRankDb $@ + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.PageRankDb "$@" ;; pagerankdbmerger) shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.PageRankDbMerger $@ + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.PageRankDbMerger "$@" ;; pageranker) shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.PageRanker $@ + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.PageRanker "$@" ;; parsetextmerger) shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.ParseTextCombiner $@ + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.ParseTextCombiner "$@" ;; add-dates) shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DateAdder $@ + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DateAdder "$@" ;; merge) shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.IndexMerger $@ + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.IndexMerger "$@" ;; + reboost) + shift + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.LengthNormUpdater "$@" + ;; dumpindex) shift - ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DumpParallelIndex $@ + ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DumpParallelIndex "$@" ;; search) shift - ${NUTCH_HOME}/bin/nutch 
org.archive.nutchwax.NutchWaxBean $@ + ${NUTCH_HOME}/bin/nutch org.apache.nutch.searcher.NutchBean "$@" ;; *) echo "" @@ -85,6 +89,7 @@ echo " parsetextmerger Merge segement parse_text/part-nnnnn directories." echo " add-dates Add dates to a parallel index" echo " merge Merge indexes or parallel indexes" + echo " reboost Update document boosts based on pagerank info" echo " dumpindex Dump an index or set of parallel indices to stdout" echo " search Query a search index" echo "" This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2789
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2789&view=rev
Author:   binzino
Date:     2009-07-22 21:22:02 +0000 (Wed, 22 Jul 2009)

Log Message:
-----------
WAX-59. Fixed to use log10() instead of log().

Modified Paths:
--------------
    tags/nutchwax-0_12_7/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java

Modified: tags/nutchwax-0_12_7/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java
===================================================================
--- tags/nutchwax-0_12_7/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java 2009-07-22 21:19:31 UTC (rev 2788)
+++ tags/nutchwax-0_12_7/archive/src/plugin/scoring-nutchwax/src/java/org/archive/nutchwax/scoring/PageRankScoringFilter.java 2009-07-22 21:22:02 UTC (rev 2789)
@@ -196,7 +196,7 @@
       return initScore;
     }

-    float newScore = initScore * (float) ( Math.log( rank ) + 1 );
+    float newScore = initScore * (float) ( Math.log10( rank ) + 1 );

     LOG.info( "PageRankScoringFilter: initScore = " + newScore + " ; key = " + key );
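To make the effect of the WAX-59 fix concrete, the sketch below evaluates the boost expression from the diff with both logarithms. The formula initScore * (log(rank) + 1) is taken from the changed line; the class name BoostSketch and the two method names are hypothetical and exist only for this illustration.

```java
// Hypothetical comparison of the pre-fix and post-fix boost expressions
// from PageRankScoringFilter. It is not the NutchWAX class itself.
final class BoostSketch {

    // Post-fix form: base-10 logarithm.
    static float boostLog10(float initScore, int rank) {
        return initScore * (float) (Math.log10(rank) + 1);
    }

    // Pre-fix form: natural logarithm.
    static float boostLn(float initScore, int rank) {
        return initScore * (float) (Math.log(rank) + 1);
    }

    public static void main(String[] args) {
        // Print rank, natural-log boost, and log10 boost side by side.
        for (int rank : new int[] { 1, 10, 100, 1000 }) {
            System.out.println(rank + "\t" + boostLn(1.0f, rank) + "\t" + boostLog10(1.0f, rank));
        }
    }
}
```

For a rank of 100 the log10 form multiplies the score by 3, while the natural-log form multiplies it by roughly 5.6, so the earlier code boosted highly ranked pages considerably more than intended. Both forms leave the score unchanged for a rank of 1.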
From: <bi...@us...> - 2009-07-22 21:19:34
Revision: 2788 http://archive-access.svn.sourceforge.net/archive-access/?rev=2788&view=rev Author: binzino Date: 2009-07-22 21:19:31 +0000 (Wed, 22 Jul 2009) Log Message: ----------- Totally subsumed by DumpParallelIndex. This one has just been lying around for a long time, so I finally decided to remove it. Removed Paths: ------------- tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/DumpIndex.java Deleted: tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/DumpIndex.java =================================================================== --- tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/DumpIndex.java 2009-07-22 20:07:16 UTC (rev 2787) +++ tags/nutchwax-0_12_7/archive/src/java/org/archive/nutchwax/tools/DumpIndex.java 2009-07-22 21:19:31 UTC (rev 2788) @@ -1,105 +0,0 @@ -/* - * Copyright (C) 2008 Internet Archive. - * - * This file is part of the archive-access tools project - * (http://sourceforge.net/projects/archive-access). - * - * The archive-access tools are free software; you can redistribute them and/or - * modify them under the terms of the GNU Lesser Public License as published by - * the Free Software Foundation; either version 2.1 of the License, or any - * later version. - * - * The archive-access tools are distributed in the hope that they will be - * useful, but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser - * Public License for more details. 
- * - * You should have received a copy of the GNU Lesser Public License along with - * the archive-access tools; if not, write to the Free Software Foundation, - * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - */ -package org.archive.nutchwax.tools; - -import java.io.File; -import java.util.Iterator; - -import org.apache.lucene.index.IndexReader; - -public class DumpIndex -{ - public static void main(String[] args) throws Exception - { - String option = ""; - String indexDir = ""; - - if (args.length == 1) - { - indexDir = args[0]; - } - else if (args.length == 2) - { - option = args[0]; - indexDir = args[1]; - } - - if (! (new File(indexDir)).exists()) - { - usageAndExit(); - } - - if (option.equals("-f")) - { - listFields(indexDir); - } - else - { - dumpIndex(indexDir); - } - } - - private static void dumpIndex(String indexDir) throws Exception - { - IndexReader reader = IndexReader.open(indexDir); - - Object[] fieldNames = reader.getFieldNames(IndexReader.FieldOption.ALL).toArray(); - - for (int i = 0; i < fieldNames.length; i++) - { - System.out.print(fieldNames[i] + "\t"); - } - - System.out.println(); - - int numDocs = reader.numDocs(); - - for (int i = 0; i < numDocs; i++) - { - for (int j = 0; j < fieldNames.length; j++) - { - System.out.print(reader.document(i).get((String) fieldNames[j]) + "\t"); - } - - System.out.println(); - } - } - - private static void listFields(String indexDir) throws Exception - { - IndexReader reader = IndexReader.open(indexDir); - - Iterator it = reader.getFieldNames(IndexReader.FieldOption.ALL).iterator(); - - while (it.hasNext()) - { - System.out.println(it.next()); - } - - reader.close(); - } - - private static void usageAndExit() - { - System.out.println("Usage: DumpIndex [-f] index"); - System.exit(1); - } -} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2009-07-22 20:07:24
Revision: 2787
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2787&view=rev
Author:   binzino
Date:     2009-07-22 20:07:16 +0000 (Wed, 22 Jul 2009)

Log Message:
-----------
Initial creation of NutchWAX 0.12.7 branch off of 0.12.6.

Added Paths:
-----------
    tags/nutchwax-0_12_7/
From: <bra...@us...> - 2009-07-18 01:13:44
Revision: 2786 http://archive-access.svn.sourceforge.net/archive-access/?rev=2786&view=rev Author: bradtofel Date: 2009-07-18 01:13:24 +0000 (Sat, 18 Jul 2009) Log Message: ----------- updating versions in POM to 1.4.2 Modified Paths: -------------- branches/wayback-1_4_2/dist/pom.xml branches/wayback-1_4_2/pom.xml branches/wayback-1_4_2/wayback-core/pom.xml branches/wayback-1_4_2/wayback-mapreduce/pom.xml branches/wayback-1_4_2/wayback-mapreduce-prereq/pom.xml branches/wayback-1_4_2/wayback-webapp/pom.xml Modified: branches/wayback-1_4_2/dist/pom.xml =================================================================== --- branches/wayback-1_4_2/dist/pom.xml 2009-07-18 01:02:39 UTC (rev 2785) +++ branches/wayback-1_4_2/dist/pom.xml 2009-07-18 01:13:24 UTC (rev 2786) @@ -3,7 +3,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.1</version> + <version>1.4.2</version> </parent> <modelVersion>4.0.0</modelVersion> @@ -54,13 +54,13 @@ <dependency> <groupId>org.archive.wayback</groupId> <artifactId>wayback-webapp</artifactId> - <version>1.4.1</version> + <version>1.4.2</version> <type>war</type> </dependency> <dependency> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce</artifactId> - <version>1.4.1</version> + <version>1.4.2</version> </dependency> </dependencies> Modified: branches/wayback-1_4_2/pom.xml =================================================================== --- branches/wayback-1_4_2/pom.xml 2009-07-18 01:02:39 UTC (rev 2785) +++ branches/wayback-1_4_2/pom.xml 2009-07-18 01:13:24 UTC (rev 2786) @@ -17,9 +17,9 @@ <groupId>org.archive</groupId> <artifactId>wayback</artifactId> <properties> - <globalVersion>1.4.1</globalVersion> + <globalVersion>1.4.2</globalVersion> </properties> - <version>1.4.1</version> + <version>1.4.2</version> <packaging>pom</packaging> <name>Wayback</name> Modified: branches/wayback-1_4_2/wayback-core/pom.xml =================================================================== 
--- branches/wayback-1_4_2/wayback-core/pom.xml 2009-07-18 01:02:39 UTC (rev 2785) +++ branches/wayback-1_4_2/wayback-core/pom.xml 2009-07-18 01:13:24 UTC (rev 2786) @@ -17,7 +17,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.1</version> + <version>1.4.2</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-core</artifactId> Modified: branches/wayback-1_4_2/wayback-mapreduce/pom.xml =================================================================== --- branches/wayback-1_4_2/wayback-mapreduce/pom.xml 2009-07-18 01:02:39 UTC (rev 2785) +++ branches/wayback-1_4_2/wayback-mapreduce/pom.xml 2009-07-18 01:13:24 UTC (rev 2786) @@ -12,7 +12,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.1</version> + <version>1.4.2</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce</artifactId> Modified: branches/wayback-1_4_2/wayback-mapreduce-prereq/pom.xml =================================================================== --- branches/wayback-1_4_2/wayback-mapreduce-prereq/pom.xml 2009-07-18 01:02:39 UTC (rev 2785) +++ branches/wayback-1_4_2/wayback-mapreduce-prereq/pom.xml 2009-07-18 01:13:24 UTC (rev 2786) @@ -10,7 +10,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.1</version> + <version>1.4.2</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce-prereq</artifactId> Modified: branches/wayback-1_4_2/wayback-webapp/pom.xml =================================================================== --- branches/wayback-1_4_2/wayback-webapp/pom.xml 2009-07-18 01:02:39 UTC (rev 2785) +++ branches/wayback-1_4_2/wayback-webapp/pom.xml 2009-07-18 01:13:24 UTC (rev 2786) @@ -3,7 +3,7 @@ <parent> <artifactId>wayback</artifactId> <groupId>org.archive</groupId> - <version>1.4.1</version> + <version>1.4.2</version> </parent> <modelVersion>4.0.0</modelVersion> 
<groupId>org.archive.wayback</groupId> |
From: <bra...@us...> - 2009-07-18 01:02:50
|
Revision: 2785 http://archive-access.svn.sourceforge.net/archive-access/?rev=2785&view=rev Author: bradtofel Date: 2009-07-18 01:02:39 +0000 (Sat, 18 Jul 2009) Log Message: ----------- DOC: updated release notes with the blow-by-blow set of changes. Modified Paths: -------------- branches/wayback-1_4_2/dist/src/site/xdoc/release_notes.xml Modified: branches/wayback-1_4_2/dist/src/site/xdoc/release_notes.xml =================================================================== --- branches/wayback-1_4_2/dist/src/site/xdoc/release_notes.xml 2009-07-18 00:48:09 UTC (rev 2784) +++ branches/wayback-1_4_2/dist/src/site/xdoc/release_notes.xml 2009-07-18 01:02:39 UTC (rev 2785) @@ -14,6 +14,141 @@ to release 1.2.0. </p> </section> + <section name="Release 1.4.2"> + <subsection name="Features"> + <ul> + <li> + Added exactSchemeOnly configuration to AccessPoint, allowing + explicit distinction between http:// and https://(<i>ACC-32</i>) + </li> + <li> + Now times out requests to a slow/non-responsive RemoteResourceIndex + and remote(HTTP 1.1) ResourceStore nodes.(<i>ACC-38</i>) + </li> + <li> + experimental OpenSearchQuery .jsp implementations(<i>ACC-56</i>) + </li> + <li> + FileProxyServlet now accepts /OFFSET trailing path in addition to + Content-Range HTTP header.(<i>ACC-74</i>) + </li> + <li> + warc-indexer now has -all option to produce a CDX line for ALL + records, not just captures and revisits(<i>ACC-75</i>) + </li> + <li> + now includes file+offset for all records, keying off mime-time of + warc/revist to determine revisits at query time.(<i>ACC-76</i>) + </li> + <li> + Allow prefixing of original HTTP headers with a fixed string. + (<i>ACC-77</i>) + </li> + <li> + Now Wayback rewrites Content-Base HTTP headers.(<i>ACC-78</i>) + </li> + <li> + Timeline.jsp improvements which prevent Timeline from being severely + distorted on some pages. + </li> + <li> + Improvement to ArchivalUrl client-rewrite.js to preserve link text, + working around a bug in Internet Explorer. 
+ </li> + </ul> + </subsection> + <subsection name="Bug Fixes"> + <ul> + <li> + Now all mime-types are escaped to prevent spaces from getting into + the CDX files.(<i>ACC-45</i>) + </li> + <li> + Some CSS URLs were being rewritten twice. (<i>ACC-53</i>) + </li> + <li> + No longer writing original pages Content-Length HTTP header to + output, which caused original pages with Lower-Case "L" in + "Content-length" to return wrong length, truncating replayed + documents. This caused some replayed pages to not have embedded + disclaimers, nor javascript rewriting of links and images. + (<i>ACC-60</i>) + </li> + <li> + Fixed severe problem with live web robots.txt retrieval where wrong + offset was being writting into the live web ResourceIndex. + (<i>ACC-62</i>) + </li> + <li> + Charset extraction from HTTP headers is now case-insensitive. + (<i>ACC-63</i>) + </li> + <li> + No longer adding content to HTML pages with FrameSet tags, as they + were being broken.(<i>ACC-65</i>) + </li> + <li> + No longer set GMT as default timezone for entire JVM.(<i>ACC-70</i>) + </li> + </ul> + </subsection> + </section> + + <section name="Release 1.4.1"> + <subsection name="Features"> + <ul> + <li> + Index filter which allows including/excluding records based on HTTP + response code field.(<i>ACC-43</i>) + </li> + <li> + Outputs log message instead of stack dump when failing to access + a Resource. + </li> + </ul> + </subsection> + <subsection name="Bug Fixes"> + <ul> + <li> + Some redirect records were not being located in index due to bad + logic in Duplicate record filter.(<i>ACC-30</i>) + </li> + <li> + Wayback was not throwing a NotInArchiveException when + Self-Redirect replay filter removes all records. (unreported) + </li> + <li> + Location HTTP header values were not being escaped before + placing in CDX, causing some records to have too many columns. 
+ (<i>ACC-31</i>) + </li> + <li> + Search Result summary counts were incorrect in Url Prefix + searches.(<i>ACC-33</i>) + </li> + <li> + Implemented NoCache.jsp, a replay insert which adds a + <b>Cache-Control: no-cache</b> HTTP header to all replayed + documents.(<i>ACC-34</i>) + </li> + <li> + Timeline.jsp was using Request Date, not Capture date, which + caused Proxy Mode Timeline to show the wrong date. + (<i>ACC-36</i>) + </li> + <li> + Advanced Search reference implementation .jsp was broken. + (<i>ACC-37</i>) + </li> + <li> + AnchorDate and AnchorWindow functionality is now disabled by + default, and can be enabled via configuration on an AccessPoint. + (<i>ACC-46</i>) + </li> + </ul> + </subsection> + </section> + <section name="Release 1.4.0"> <subsection name="Features"> <ul> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
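Two of the 1.4.2 fixes listed in the release notes above (ACC-60 and ACC-63) come down to treating HTTP header names and the charset parameter case-insensitively. The following is only a minimal sketch of that idea, not the code Wayback actually uses: the class HeaderSketch and its methods extractCharset and filterHeaders are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the Wayback code base.
class HeaderSketch {

    // Matches "charset=..." in any letter case, with optional quotes around the value.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none is present. */
    static String extractCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Copies the original headers into a map whose keys compare
     * case-insensitively, then drops Content-Length however it was spelled.
     */
    static Map<String, String> filterHeaders(Map<String, String> original) {
        Map<String, String> out = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        out.putAll(original);
        out.remove("Content-Length"); // also removes "Content-length", "content-length", ...
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-length", "1234");
        headers.put("Content-Type", "text/html; Charset=ISO-8859-1");

        Map<String, String> filtered = filterHeaders(headers);
        System.out.println(filtered.containsKey("Content-Length"));
        System.out.println(extractCharset(filtered.get("content-type")));
    }
}
```

A comparator-based map is used here so that every lookup and removal ignores case without normalising the stored header names, which keeps the original spelling available for output.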
From: <bra...@us...> - 2009-07-18 00:48:17
|
Revision: 2784 http://archive-access.svn.sourceforge.net/archive-access/?rev=2784&view=rev Author: bradtofel Date: 2009-07-18 00:48:09 +0000 (Sat, 18 Jul 2009) Log Message: ----------- FEATURE(ACC-56): new OpenSearch RSS query .jsp implementations. Modified Paths: -------------- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/wayback.xml Modified: branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/wayback.xml =================================================================== --- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/wayback.xml 2009-07-18 00:41:19 UTC (rev 2783) +++ branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/wayback.xml 2009-07-18 00:48:09 UTC (rev 2784) @@ -103,6 +103,24 @@ </bean> + <!-- + Experimental OpenSearch query .jsps: + --> + <bean name="8080:opensearch" parent="8080:wayback"> + <property name="urlRoot" value="http://localhost.archive.org:8080/opensearch/" /> + <property name="query"> + <bean class="org.archive.wayback.query.Renderer"> + <property name="captureJsp" value="/WEB-INF/query/OpenSearchCaptureResults.jsp" /> + <property name="urlJsp" value="/WEB-INF/query/OpenSearchUrlResults.jsp" /> + </bean> + </property> + <property name="exception"> + <bean class="org.archive.wayback.exception.BaseExceptionRenderer"> + <property name="xmlErrorJsp" value="/WEB-INF/exception/OpenSearchError.jsp" /> + <property name="errorJsp" value="/WEB-INF/exception/OpenSearchError.jsp" /> + </bean> + </property> + </bean> <!-- The following AccessPoint inherits all configuration from the 8080:wayback This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2009-07-18 00:41:21
|
Revision: 2783 http://archive-access.svn.sourceforge.net/archive-access/?rev=2783&view=rev Author: bradtofel Date: 2009-07-18 00:41:19 +0000 (Sat, 18 Jul 2009) Log Message: ----------- Improvement to ArchivalUrl client-rewrite.js to preserve link text, working around a bug in Internet Explorer. Modified Paths: -------------- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/js/client-rewrite.js Modified: branches/wayback-1_4_2/wayback-webapp/src/main/webapp/js/client-rewrite.js =================================================================== --- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/js/client-rewrite.js 2009-07-18 00:40:05 UTC (rev 2782) +++ branches/wayback-1_4_2/wayback-webapp/src/main/webapp/js/client-rewrite.js 2009-07-18 00:41:19 UTC (rev 2783) @@ -4,28 +4,41 @@ image.src = url; return image.src; } +var xWaybackIsIE = (navigator.appName=="Microsoft Internet Explorer"); function xLateUrl(aCollection, sProp) { var i = 0; for(i = 0; i < aCollection.length; i++) { if(aCollection[i].getAttribute(sProp) && (aCollection[i].getAttribute(sProp).length > 0) && - (typeof(aCollection[i][sProp]) == "string")) { + (typeof(aCollection[i][sProp]) == "string") && + (aCollection[i][sProp].indexOf("mailto:") == -1) && + (aCollection[i][sProp].indexOf("javascript:") == -1)) { - if(aCollection[i][sProp].indexOf("mailto:") == -1 && - aCollection[i][sProp].indexOf("javascript:") == -1) { - var wmSpecial = aCollection[i].getAttribute("wmSpecial"); if(wmSpecial && wmSpecial.length > 0) { } else { - if(aCollection[i][sProp].indexOf(sWayBackCGI) == -1) { - if(aCollection[i][sProp].indexOf("http") == 0) { - aCollection[i][sProp] = sWayBackCGI + aCollection[i][sProp]; - } else { - aCollection[i][sProp] = sWayBackCGI + xResolveUrl(aCollection[i][sProp]); - } - } + var newUrl; + if(aCollection[i][sProp].indexOf("http") == 0) { + newUrl = sWayBackCGI + aCollection[i][sProp]; + } else { + newUrl = sWayBackCGI + xResolveUrl(aCollection[i][sProp]); + } + 
if(navigator.appName=="Microsoft Internet Explorer") { + var inTmp = aCollection[i].innerHTML; + aCollection[i][sProp] = newUrl; + + if(inTmp && + ( (inTmp.indexOf("@") > 0) + || (inTmp.indexOf("www.") == 0) + || (inTmp.indexOf("http://") == 0) + ) + ) { + aCollection[i].innerHTML = inTmp; + } + } else { + aCollection[i][sProp] = newUrl; + } } - } } } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
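The decision logic in xLateUrl above (leave mailto: and javascript: links alone, prefix absolute http URLs directly, resolve everything else against the page first) can be restated outside the browser. This is a rough Java sketch under stated assumptions, not a port of client-rewrite.js: the class RewriteSketch, the method rewrite, and the example prefix are hypothetical, replayPrefix stands in for sWayBackCGI, and java.net.URI.resolve stands in for xResolveUrl. The Internet Explorer innerHTML workaround is deliberately left out.

```java
import java.net.URI;

// Hypothetical illustration of the rewrite decision; not part of Wayback.
class RewriteSketch {

    /**
     * @param replayPrefix prefix to prepend (plays the role of sWayBackCGI)
     * @param pageUrl      URL of the page the link appears on
     * @param link         attribute value to rewrite
     * @return the rewritten link, or the original if it must not be touched
     */
    static String rewrite(String replayPrefix, String pageUrl, String link) {
        if (link == null || link.isEmpty()) {
            return link;
        }
        // Same exclusions as the script: these are never rewritten.
        if (link.contains("mailto:") || link.contains("javascript:")) {
            return link;
        }
        if (link.startsWith("http")) {
            // Already absolute: just prepend the replay prefix.
            return replayPrefix + link;
        }
        // Relative: resolve against the page URL first.
        // URI.create throws IllegalArgumentException on malformed input.
        return replayPrefix + URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String prefix = "http://localhost:8080/wayback/20090718004119/"; // assumed value
        String page = "http://example.com/a/b.html";
        System.out.println(rewrite(prefix, page, "c.html"));
        System.out.println(rewrite(prefix, page, "mailto:someone@example.com"));
    }
}
```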
From: <bra...@us...> - 2009-07-18 00:40:11
|
Revision: 2782 http://archive-access.svn.sourceforge.net/archive-access/?rev=2782&view=rev Author: bradtofel Date: 2009-07-18 00:40:05 +0000 (Sat, 18 Jul 2009) Log Message: ----------- FEATURE: Timeline.jsp improvements which prevent Timeline from being severely distorted on some pages. Modified Paths: -------------- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/replay/JSLessTimeline.jsp branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp Modified: branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/replay/JSLessTimeline.jsp =================================================================== --- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/replay/JSLessTimeline.jsp 2009-07-18 00:38:09 UTC (rev 2781) +++ branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/replay/JSLessTimeline.jsp 2009-07-18 00:40:05 UTC (rev 2782) @@ -21,8 +21,8 @@ StringFormatter fmt = wbRequest.getFormatter(); CaptureSearchResults cResults = results.getCaptureResults(); -String exactDateStr = wbRequest.getReplayTimestamp(); -Date exactDate = wbRequest.getReplayDate(); +String exactDateStr = results.getResult().getCaptureTimestamp(); +Date exactDate = results.getResult().getCaptureDate(); String searchUrl = wbRequest.getRequestUrl(); String resolution = wbRequest.getTimelineResolution(); @@ -149,7 +149,7 @@ <table cellspacing="0" border="0" cellpadding="0" width="100%"> <tr> <td width="48%" nowrap><span><%= firstDate %></span></td> - <td align="center" valign="bottom" nowrap><img wmSpecial="1" src="<%= contextRoot %>/images/mark.jpg"></td> + <td align="center" valign="bottom" nowrap><img style="display: inline;" wmSpecial="1" src="<%= contextRoot %>/images/mark.jpg"></td> <td width="48%" nowrap align="right"><span><%= lastDate %></span></td> </tr> </table> @@ -165,7 +165,7 @@ first.getCaptureDate()) + "\""; %><a wmSpecial="1" href="<%= results.resultToReplayUrl(first) %>"><% } - %><img <%= titleString %> 
wmSpecial="1" border=0 width=19 height=20 src="<%= contextRoot %>/images/first.jpg"><% + %><img style="display: inline;" <%= titleString %> wmSpecial="1" border=0 width=19 height=20 src="<%= contextRoot %>/images/first.jpg"><% if(first != null) { %></a><% } @@ -176,7 +176,7 @@ prev.getCaptureDate()) + "\""; %><a wmSpecial="1" href="<%= results.resultToReplayUrl(prev) %>"><% } - %><img <%= titleString %> wmSpecial="1" border=0 width=13 height=20 src="<%= contextRoot %>/images/prev.jpg"><% + %><img style="display: inline;" <%= titleString %> wmSpecial="1" border=0 width=13 height=20 src="<%= contextRoot %>/images/prev.jpg"><% if(first != null) { %></a><% } @@ -205,17 +205,17 @@ } if((i > 0) && (i < numPartitions)) { -%><img wmSpecial="1" border=0 width=1 height=16 src="<%= contextRoot %>/images/linemark.jpg"><% +%><img style="display: inline;" wmSpecial="1" border=0 width=1 height=16 src="<%= contextRoot %>/images/linemark.jpg"><% } if(replayUrl == null) { -%><img wmSpecial="1" border=0 width=7 height=16 src="<%= imageUrl %>"><% +%><img style="display: inline;" wmSpecial="1" border=0 width=7 height=16 src="<%= imageUrl %>"><% } else { -%><a wmSpecial="1" href="<%= replayUrl %>"><img wmSpecial="1" border=0 width=7 height=16 title="<%= prettyDateTime %>" src="<%= imageUrl %>"></a><% +%><a wmSpecial="1" href="<%= replayUrl %>"><img style="display: inline;" wmSpecial="1" border=0 width=7 height=16 title="<%= prettyDateTime %>" src="<%= imageUrl %>"></a><% } } @@ -229,7 +229,7 @@ next.getCaptureDate()) + "\""; %><a wmSpecial="1" href="<%= results.resultToReplayUrl(next) %>"><% } - %><img wmSpecial="1" <%= titleString %> border=0 width=13 height=20 src="<%= contextRoot %>/images/next.jpg"><% + %><img style="display: inline;" wmSpecial="1" <%= titleString %> border=0 width=13 height=20 src="<%= contextRoot %>/images/next.jpg"><% if(first != null) { %></a><% } @@ -240,7 +240,7 @@ last.getCaptureDate()) + "\""; %><a wmSpecial="1" href="<%= results.resultToReplayUrl(last) 
%>"><% } - %><img wmSpecial="1" <%= titleString %> border=0 width=19 height=20 src="<%= contextRoot %>/images/last.jpg"><% + %><img style="display: inline;" wmSpecial="1" <%= titleString %> border=0 width=19 height=20 src="<%= contextRoot %>/images/last.jpg"><% if(first != null) { %></a><% } @@ -283,7 +283,7 @@ %></a> </td> <td> - <img wmSpecial="1" alt='' height='1' src='<%= contextRoot %>/images/1px.gif' width='5'> + <img style="display: inline;" wmSpecial="1" alt='' height='1' src='<%= contextRoot %>/images/1px.gif' width='5'> </td> </tr> </table> Modified: branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp =================================================================== --- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp 2009-07-18 00:38:09 UTC (rev 2781) +++ branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp 2009-07-18 00:40:05 UTC (rev 2782) @@ -171,7 +171,7 @@ <table cellspacing="0" border="0" cellpadding="0" width="100%"> <tr> <td width="48%" nowrap><span><%= firstDate %></span></td> - <td align="center" valign="bottom" nowrap><img wmSpecial="1" src="<%= contextRoot %>/images/mark.jpg"></td> + <td align="center" valign="bottom" nowrap><img style="display: inline;" wmSpecial="1" src="<%= contextRoot %>/images/mark.jpg"></td> <td width="48%" nowrap align="right"><span><%= lastDate %></span></td> </tr> </table> @@ -187,7 +187,7 @@ first.getCaptureDate()) + "\""; %><a wmSpecial="1" onclick="SetAnchorDate('<%= first.getCaptureTimestamp() %>');" href="<%= results.resultToReplayUrl(first) %>"><% } - %><img <%= titleString %> wmSpecial="1" border=0 width=19 height=20 src="<%= contextRoot %>/images/first.jpg"><% + %><img style="display: inline;" <%= titleString %> wmSpecial="1" border=0 width=19 height=20 src="<%= contextRoot %>/images/first.jpg"><% if(first != null) { %></a><% } @@ -198,7 +198,7 @@ prev.getCaptureDate()) + "\""; %><a wmSpecial="1" 
onclick="SetAnchorDate('<%= prev.getCaptureTimestamp() %>');" href="<%= results.resultToReplayUrl(prev) %>"><% } - %><img <%= titleString %> wmSpecial="1" border=0 width=13 height=20 src="<%= contextRoot %>/images/prev.jpg"><% + %><img style="display: inline;" <%= titleString %> wmSpecial="1" border=0 width=13 height=20 src="<%= contextRoot %>/images/prev.jpg"><% if(first != null) { %></a><% } @@ -230,17 +230,17 @@ } if((i > 0) && (i < numPartitions)) { -%><img wmSpecial="1" border=0 width=1 height=16 src="<%= contextRoot %>/images/linemark.jpg"><% +%><img style="display: inline;" wmSpecial="1" border=0 width=1 height=16 src="<%= contextRoot %>/images/linemark.jpg"><% } if(replayUrl == null) { -%><img wmSpecial="1" border=0 width=7 height=16 src="<%= imageUrl %>"><% +%><img style="display: inline;" wmSpecial="1" border=0 width=7 height=16 src="<%= imageUrl %>"><% } else { -%><a wmSpecial="1" onclick="SetAnchorDate('<%= ts %>');" href="<%= replayUrl %>"><img wmSpecial="1" border=0 width=7 height=16 title="<%= prettyDateTime %>" src="<%= imageUrl %>"></a><% +%><a wmSpecial="1" onclick="SetAnchorDate('<%= ts %>');" href="<%= replayUrl %>"><img style="display: inline;" wmSpecial="1" border=0 width=7 height=16 title="<%= prettyDateTime %>" src="<%= imageUrl %>"></a><% } } @@ -254,7 +254,7 @@ next.getCaptureDate()) + "\""; %><a wmSpecial="1" onclick="SetAnchorDate('<%= next.getCaptureTimestamp() %>');" href="<%= results.resultToReplayUrl(next) %>"><% } - %><img wmSpecial="1" <%= titleString %> border=0 width=13 height=20 src="<%= contextRoot %>/images/next.jpg"><% + %><img style="display: inline;" wmSpecial="1" <%= titleString %> border=0 width=13 height=20 src="<%= contextRoot %>/images/next.jpg"><% if(next != null) { %></a><% } @@ -265,7 +265,7 @@ last.getCaptureDate()) + "\""; %><a wmSpecial="1" onclick="SetAnchorDate('<%= last.getCaptureTimestamp() %>');" href="<%= results.resultToReplayUrl(last) %>"><% } - %><img wmSpecial="1" <%= titleString %> border=0 width=19 
height=20 src="<%= contextRoot %>/images/last.jpg"><% + %><img style="display: inline;" wmSpecial="1" <%= titleString %> border=0 width=19 height=20 src="<%= contextRoot %>/images/last.jpg"><% if(last != null) { %></a><% } @@ -308,7 +308,7 @@ %></a> </td> <td> - <img wmSpecial="1" alt='' height='1' src='<%= contextRoot %>/images/1px.gif' width='5'> + <img style="display: inline;" wmSpecial="1" alt='' height='1' src='<%= contextRoot %>/images/1px.gif' width='5'> </td> </tr> </table> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2009-07-18 00:38:12
|
Revision: 2781
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2781&view=rev
Author:   bradtofel
Date:     2009-07-18 00:38:09 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
FEATURE(ACC-56): new OpenSearch RSS query .jsp implementations.

Added Paths:
-----------
    branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/exception/OpenSearchError.jsp

Added: branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/exception/OpenSearchError.jsp
===================================================================
--- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/exception/OpenSearchError.jsp	(rev 0)
+++ branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/exception/OpenSearchError.jsp	2009-07-18 00:38:09 UTC (rev 2781)
@@ -0,0 +1,29 @@
+<?xml version="1.0" encoding="UTF-8"?><%@
+  page language="java" pageEncoding="utf-8" contentType="text/xml;charset=utf-8"
+%><%@
+  page import="org.archive.wayback.exception.WaybackException"
+%><%@
+  page import="org.archive.wayback.core.UIResults"
+%><%@
+  page import="org.archive.wayback.util.StringFormatter"
+%><%
+
+UIResults results = UIResults.extractException(request);
+WaybackException e = results.getException();
+StringFormatter fmt = results.getWbRequest().getFormatter();
+
+%>
+<rss version="2.0" xmlns:openSearch="http://a9.com/-/spec/opensearch/1.1/">
+  <channel>
+    <title>Wayback OpenSearch Error</title>
+    <link>http://archive-access.sourceforge.net/projects/wayback</link>
+    <description>OpenSearch Error</description>
+    <openSearch:totalResults>1</openSearch:totalResults>
+    <openSearch:startIndex>1</openSearch:startIndex>
+    <openSearch:itemsPerPage>1</openSearch:itemsPerPage>
+    <item>
+      <title><%= UIResults.encodeXMLContent(fmt.format(e.getTitleKey())) %></title>
+      <description><%= UIResults.encodeXMLContent(fmt.format(e.getMessageKey())) %></description>
+    </item>
+  </channel>
+  </rss>
\ No newline at end of file
|
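OpenSearchError.jsp above passes the exception title and message through UIResults.encodeXMLContent before writing them into the RSS item, so that characters such as & and < cannot break the XML response. The implementation of that method is not shown in this commit; the following is only a sketch of what such escaping typically involves. The class XmlEscapeSketch and the method escapeXml are hypothetical names and may differ from what Wayback does.

```java
// Hypothetical illustration of XML text escaping; not UIResults.encodeXMLContent itself.
class XmlEscapeSketch {

    /** Replaces the five XML special characters with their predefined entities. */
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An error message containing markup-significant characters.
        System.out.println(escapeXml("Bad request: url=<a&b>"));
    }
}
```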