From: <bi...@us...> - 2008-12-16 03:00:15
Revision: 2671
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2671&view=rev
Author:   binzino
Date:     2008-12-16 03:00:10 +0000 (Tue, 16 Dec 2008)

Log Message:
-----------
Updated documentation for 0.12.3 release.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt
    trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
    trunk/archive-access/projects/nutchwax/archive/README.txt
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt    2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -157,62 +157,36 @@

======================================================================
-Index
+Index and Index merging
======================================================================

-The only chage we make to the indexing step is the destination of the
-index directory.
+Perform the index step as normal, yielding an 'indexes' directory.

-By default, Nutch expects the per-segment index directory to live in a
-sub-directory called 'indexes' and the index command is accordingly
+E.g.

  $ nutch index indexes crawldb linkdb segments/*

-Resulting in an index directory structure of the form
+Then, merge the 'indexes' directory into a single Lucene index by
+invoking the Nutch 'merge' command

-  indexes/part-00000
+  $ nutch merge index indexes

-For de-duplication, we use a slightly different directory structure,
-which will be used by a de-duplication-aware NutchWaxBean at
-search-time. The directory structure we use is:
-  pindexes/<segment>/part-00000
-
-Using the segment name is not strictly required, but it is a good
-practice and is strongly recommended. This way the segment and its
-corresponding index directory are easily matched.
-
-Let's assume that the segment directory created during the import is
-named
-
-  segments/20080703050349
-
-In that case, our index command becomes:
-
-  $ nutch index pindexes/20080703050349 crawldb linkdb segments/20080703050349
-
-Upon completion, the Lucene index is created in
-
-  pindexes/20080703050349/part-0000
-
-This index is exactly the same as one normally created by Nutch, the
-only difference is the location.
-
-
======================================================================
Add Revisit Dates
======================================================================

-Now that we have the Nutch index, we add the revisit dates to it.
+Now that we have a single, merged index, we create a "parallel" index
+directory which contains the additional revisit dates.
Examine the "all.dup" file again, it has lines of the form

-  example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
-  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
-  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
-  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
-  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
-  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911
+  example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
+  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
+  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
+  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
+  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
+  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911

These are the revisit dates that need to be added to the records in the Lucene index. When we generated the index, only the date of the
@@ -220,35 +194,47 @@

As explained in README-dedup.txt, modifying the Lucene index to actually add these dates is infeasible. What we do is create a
-parallel index next to the main index (the part-00000 created above)
-that contains all the dates for each record.
+parallel index next to the merged index that contains all the dates
+for each record.

The NutchWAX 'add-dates' command creates this parallel index for us.

-  $ nutchwax add-dates pindexes/20080703050349/part-0000 \
-                       pindexes/20080703050349/part-0000 \
-                       pindexes/20080703050349/dates \
+  $ nutchwax add-dates index \
+                       index \
+                       dates \
                        all.dup

-Yes, the part-0000 argument does appear twice. This is beacuse it is
+Yes, the 'index' argument does appear twice. This is because it is
both the "key" index and the "source" index.
-
Suppose we did another crawl and had even more dates to add to the existing index. In that case we would run

-  $ nutchwax add-dates pindexes/20080703050349/part-0000 \
-                       pindexes/20080703050349/dates \
-                       pindexes/20080703050349/new-dates \
+  $ nutchwax add-dates index \
+                       dates \
+                       new-dates \
                        new-crawl.dup

-  $ rm -r pindexes/20080703050349/dates
-  $ mv pindexes/20080703050349/new-dates pindexes/20080703050349/dates
+  $ rm -r dates
+  $ mv new-dates dates

This copies the existing dates from "dates" to "new-dates" and adds additional ones from "new-crawl.dup" along the way. Then we replace the previous "dates" index with the new one.

+Now, Nutch doesn't know what to do with the extra 'dates' parallel
+index, but NutchWAX does and it requires them to be arranged
+in a directory structure of the following form:
+  pindexes/<name>/dates
+                 /index
+
+Where "name" is any name of your choosing. For example,
+
+  $ mkdir -p pindexes/200812180000
+  $ mv dates pindexes/200812180000/
+  $ mv index pindexes/200812180000/
+
+
WARC
----
This step is the same for ARCs and WARCs.
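As a rough sketch of the parallel-index idea described above (this is not NutchWAX source code; at search-time the de-duplication-aware NutchWaxBean combines the indexes itself), the Java snippet below shows how a main index and its 'dates' index could be opened together with Lucene 2.x's ParallelReader. The pindexes/200812180000 paths simply reuse the example layout above, and ParallelReader assumes both indexes hold the same documents in the same order.

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.ParallelReader;
  import org.apache.lucene.search.IndexSearcher;

  public class ParallelIndexSketch
  {
    public static void main( String[] args ) throws Exception
    {
      // Example layout from above; substitute your own <name>.
      String main  = "pindexes/200812180000/index";
      String dates = "pindexes/200812180000/dates";

      // Open the merged index and the parallel 'dates' index together.
      ParallelReader reader = new ParallelReader( );
      reader.add( IndexReader.open( main  ) );   // url, title, content, ...
      reader.add( IndexReader.open( dates ) );   // the extra revisit-date field(s)

      // Queries against this searcher see the fields of both indexes
      // as if they were stored in a single Lucene index.
      IndexSearcher searcher = new IndexSearcher( reader );
      System.out.println( "documents: " + searcher.maxDoc( ) );
      searcher.close( );
    }
  }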
@@ -318,6 +304,8 @@

  <listener>
    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
+  </listener>
+  <listener>
    <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
  </listener>

Added: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt    (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -0,0 +1,129 @@
+
+HOWTO-pagerank.txt
+2008-12-18
+Aaron Binns
+
+Table of Contents
+  o Prerequisites
+  o Overview
+  o Generate PageRank
+  o PageRank Scoring and Boosting
+  o Configuration and Indexing
+
+
+======================================================================
+Prerequisites
+======================================================================
+
+This HOWTO assumes you've already read the main NutchWAX HOWTO and are
+familiar with importing and indexing archive files with NutchWAX.
+
+Also, we assume that you are familiar with deploying the Nutch(WAX)
+web application into a servlet container such as Tomcat.
+
+
+======================================================================
+Overview
+======================================================================
+
+NutchWAX provides a pair of tools for extracting and utilizing
+simplistic "page rank" information for scoring and sorting documents
+in the full-text search index.
+
+Nutch's 'invertlinks' step inverts links and stores them in the
+'linkdb' directory. We use the inlinks to boost the Lucene score of
+documents in proportion to the number of inlinks.
+
+
+======================================================================
+Generate PageRank
+======================================================================
+
+After the Nutch 'invertlinks' step is performed, run the NutchWAX
+'pagerank' command to extract inlink information from the 'linkdb'.
+
+For example
+
+  $ nutch invertlinks linkdb -dir segments
+  $ nutchwax pagerank pagerank.txt linkdb
+
+The resulting "pagerank.txt" file is a simple text file containing
+a count of the number of inlinks followed by the URL.
+
+  $ sort -n pagerank.txt | tail
+  367762 http://informe.presidencia.gob.mx/
+  367809 http://comovamos.presidencia.gob.mx/
+  367852 http://ocho.presidencia.gob.mx/
+  372681 http://www.gob.mx/
+  398073 http://pnd.presidencia.gob.mx/
+  399321 http://zedillo.presidencia.gob.mx/
+  496993 http://www.google-analytics.com/urchin.js
+  702448 http://www.elbalero.gob.mx/
+  703517 http://www.mexicoenlinea.gob.mx/
+  764195 http://www.brasil.gov.br
+
+In the above example, the most linked-to URL has 764195 inlinks.
+
+
+======================================================================
+PageRank Scoring and Boosting
+======================================================================
+
+During indexing, the NutchWAX PageRankScoringFilter uses the page rank
+information to boost the Lucene document scores in proportion to the
+number of inlinks.
+
+The formula used for boosting the Lucene document score is a simple
+log10()-based calculation
+
+  boost = log10( # inlinks ) + 1
+
+In Lucene, the boost is a multiplier where a boost of 1.0 means "no
+change" or "no boost" for the document score. By default, all
+documents have a boost of 1.0 unless a scoring filter changes it.
+
+Thus, we add 1 to the log10() value so that our boost scores start at
+1.0 and go up from there.
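As a quick, illustrative check of the formula above (plain Java, not part of NutchWAX), the snippet below computes the same boost values that appear in the table which follows.

  public class BoostDemo
  {
    // boost = log10( # inlinks ) + 1
    static double boost( long inlinks )
    {
      return Math.log10( (double) inlinks ) + 1.0;
    }

    public static void main( String[] args )
    {
      long[] samples = { 1, 10, 82, 100, 532, 1000, 14892 };
      for ( long n : samples )
      {
        // e.g. "    1000 inlinks -> boost 4.00"
        System.out.printf( "%8d inlinks -> boost %.2f%n", n, boost( n ) );
      }
    }
  }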
+
+The use of log10() gives us a linear boost based on the order of
+magnitude of the number of inlinks. Consider the following boost
+scores as determined by our formula:
+
+  # inlinks   boost
+          1    1.00
+         10    2.00
+         82    2.91
+        100    3.00
+        532    3.72
+       1000    4.00
+      14892    5.17
+
+A document with 1000 inlinks will have its score boosted 4x compared
+to a document with 1 inlink.
+
+
+======================================================================
+Configuration and Indexing
+======================================================================
+
+To use the PageRankScoringFilter during indexing, replace the Nutch
+OPIC scoring filter in the Nutch(WAX) configuration:
+
+nutch-site.xml
+  <property>
+    <name>plugin.includes</name>
+    <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value>
+  </property>
+
+Where we change 'scoring-opic' to 'scoring-nutchwax'.
+
+Then, when we invoke the Nutch(WAX) 'index' command, we specify the
+location of the page rank file. For example,
+
+  $ nutch index \
+      -Dnutchwax.scoringfilter.pagerank.ranks=pagerank.txt \
+      indexes \
+      linkdb \
+      crawldb \
+      segments/*
+

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt    2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -1,13 +1,15 @@

HOWTO-xslt.txt
-2008-07-25
+2008-12-18
Aaron Binns

Table of Contents
  o Prerequisites
    - NutchWAX HOWTO.txt
  o Overview
+  o NutchWAX OpenSearchServlet
  o XSLTFilter and web.xml
+  o Sample


======================================================================
@@ -31,9 +33,10 @@
Servlet : OpenSearchServlet

If you read the OpenSearchServlet.java source code and the search.jsp
-page, you'll notice a lot of similarity, if not duplication of code.
+page, you'll notice a lot of similarity, if not outright duplication
+of code.

-The Internet Archive Web Team plans to improve and expand upon the
+The Internet Archive Web Team has improved and expanded upon the
existing OpenSearchServlet interface as well as adding more XML-based
capabilities, including replacements for the existing JSP pages. In
short, moving away from JSP and toward XML.
@@ -48,6 +51,21 @@

======================================================================
+NutchWAX OpenSearchServlet
+======================================================================
+
+NutchWAX contains an enhanced OpenSearch servlet which is a drop-in
+replacement for the default Nutch OpenSearch servlet. To use the
+NutchWAX implementation, modify the 'web.xml'
+
+from:
+  <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class>
+
+to:
+  <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class>
+
+
+======================================================================
XSLTFilter and web.xml
======================================================================
@@ -55,11 +73,11 @@
OpenSearchServlet is straightforward. Simply add the XSLTFilter to
the servlet's path and specify the XSL transform to apply.
-For example, consider the default Nutch web.xml
+For example, consider the default NutchWAX web.xml

  <servlet>
    <servlet-name>OpenSearch</servlet-name>
-    <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class>
+    <servlet-class>org.archive.nutchwax.OpenSearchServlet</servlet-class>
  </servlet>

  <servlet-mapping>
@@ -68,13 +86,13 @@
  </servlet-mapping>

Let's say we want to retain the '/opensearch' path for the XML output,
-and add the human-friendly HTML page at '/coolsearch'
+and add the human-friendly HTML page at '/search'

First, we add an additional 'servlet-mapping' for our new path:

  <servlet-mapping>
    <servlet-name>OpenSearch</servlet-name>
-    <url-pattern>/coolsearch</url-pattern>
+    <url-pattern>/search</url-pattern>
  </servlet-mapping>

Then, we add the XSLTFilter, passing it a URL to the XSLT file
@@ -93,7 +111,7 @@

  <filter-mapping>
    <filter-name>XSLT Filter</filter-name>
-    <url-pattern>/coolsearch</url-pattern>
+    <url-pattern>/search</url-pattern>
  </filter-mapping>

This way, we have two URLs, which run the exact same
@@ -101,11 +119,11 @@
output whereas the other produces human-friendly HTML output.

  OpenSearch XML      : http://someserver/opensearch?query=foo
-  Human-friendly HTML : http://someserver/coolsearch?query=foo
+  Human-friendly HTML : http://someserver/search?query=foo


======================================================================
-Samples
+Sample
======================================================================

You can find sample 'web.xml' and 'search.xsl' files in

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt    2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -1,6 +1,6 @@

INSTALL.txt
-2008-10-01
+2008-12-18
Aaron Binns

This installation guide assumes the reader is already familiar with
@@ -43,7 +43,7 @@
-------------
As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.2 is
+Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.3 is
built against is:

  701524

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt    2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -1,9 +1,9 @@

README.txt
-2008-10-01
+2008-12-18
Aaron Binns

-Welcome to NutchWAX 0.12.2!
+Welcome to NutchWAX 0.12.3!

NutchWAX is a set of add-ons to Nutch in order to index and search
archived web data.
@@ -60,6 +60,15 @@
    Filtering plugin which can be used to exclude URLs from import. It
    can be used as part of a NutchWAX de-duplication scheme.

+  plugins/scoring-nutchwax
+
+    Scoring plugin for use at index-time which reads from an external
+    "pagerank.txt" file for scoring documents based on the log10 of the
+    number of inlinks to a document.
+
+    The use of this plugin is optional but can improve the quality of
+    search results, especially for very large collections.
+
  conf/nutch-site.xml

    Sample configuration properties file showing suggested settings for
@@ -131,6 +140,4 @@

    contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() );
    ...
-  ======================================================================
-

Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt    2008-12-16 02:59:10 UTC (rev 2670)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt    2008-12-16 03:00:10 UTC (rev 2671)
@@ -1,9 +1,9 @@

RELEASE-NOTES.TXT
-2008-10-13
+2008-12-18
Aaron Binns

-Release notes for NutchWAX 0.12.2
+Release notes for NutchWAX 0.12.3

For the most recent updates and information on NutchWAX, please visit
the project wiki at:
@@ -15,9 +15,14 @@
Overview
======================================================================

-NutchWAX 0.12.2 contains some minor enhancements and fixes to NutchWAX
-0.12.1.
+NutchWAX 0.12.3 contains numerous enhancements and fixes to 0.12.2
+  o PageRank calculation and scoring
+  o Enhanced OpenSearchServlet
+  o Improved XSLT sample for OpenSearch
+  o System init.d script for searcher slaves
+  o Enhanced searcher slave aware of NutchWAX extensions
+

======================================================================
Issues
======================================================================
@@ -28,23 +33,6 @@

Issues resolved in this release:

-WAX-19
-  Add strict/loose option to DateAdder for revisit lines with extra
-  data on end.
-
-WAX-21
-  Allow for blank lines and comment lines in manifest file.
-
-WAX-22
-  Various code clean-ups based on code review using PMD tool.
-
-WAX-23
-  Add a "field setter" filter to set a field to a static value in the
-  Lucene document during indexing.
-
-WAX-24
-  DateAdder fails due to uncaught exception in URL canonicalization
-
-WAX-25
-  Add utility/tool to dump unique values of a field in an index.
-
+WAX-26
+  Add XML elements containing all search URL params for self-link
+  generation
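To make the page-rank handling a little more concrete, the sketch below shows one way a ranks file in the format produced by 'nutchwax pagerank' (an inlink count followed by a URL on each line, as in HOWTO-pagerank.txt above) could be loaded and turned into a log10-based boost, mirroring what the scoring-nutchwax plugin is described as doing. It is an illustration only; the PageRankTable class and its methods are made up for this example and are not part of NutchWAX.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.HashMap;
  import java.util.Map;

  public class PageRankTable
  {
    private final Map<String,Long> inlinkCounts = new HashMap<String,Long>( );

    // Each non-blank line of the ranks file is: "<inlink count> <url>"
    public PageRankTable( String path ) throws Exception
    {
      BufferedReader in = new BufferedReader( new FileReader( path ) );
      String line;
      while ( ( line = in.readLine( ) ) != null )
      {
        line = line.trim( );
        if ( line.length( ) == 0 ) continue;

        String[] fields = line.split( "\\s+", 2 );
        inlinkCounts.put( fields[1], Long.parseLong( fields[0] ) );
      }
      in.close( );
    }

    // Same log10-based formula as in HOWTO-pagerank.txt; URLs that are
    // not listed in the ranks file get the default boost of 1.0.
    public float getBoost( String url )
    {
      Long count = inlinkCounts.get( url );
      if ( count == null || count.longValue( ) < 1 ) return 1.0f;
      return (float) ( Math.log10( count.doubleValue( ) ) + 1.0 );
    }
  }

For example, new PageRankTable( "pagerank.txt" ).getBoost( "http://www.brasil.gov.br" ) would return roughly 6.88 for the 764195-inlink entry shown earlier.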