From: Michael S. <sta...@us...> - 2005-11-29 21:43:53
Update of /cvsroot/archive-access/archive-access/projects/nutch/conf
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv30555/conf

Modified Files:
  nutch-site.xml nutch-site.xml.template
Added Files:
  ia-parse-plugins.xml
Log Message:
Merge 'mapred' branch into HEAD.
* .classpath
* project.properties
  Update to point at new 0.8 nutch.
* build.xml
  Merge in 'mapred'. Add job target.
* conf/nutch-site.xml
  Cleanup. Removed unused properties or properties that have same values
  as nutch-default.xml (Except 'searcher.dir' -- keeping that here because
  we'll usually want to change it). Reordered so archive properties are
  towards the end. Brought forward descriptions from nutch-default where
  missing.
* conf/nutch-site.xml.template
  Copy of nutch-site.xml but with the nutchwax defaults turned on.
* src/plugin/build.xml
  Commented out parse-default.
* src/plugin/parse-ext/plugin.xml
  Changed path to parse-pdf.sh.
* src/web/search.jsp
  'mapred' update.
* bin/indexArcs.sh
* conf/ia-parse-plugins.xml
* lib/commons-codec-1.3.jar
* src/java/org/archive/access/nutch/ImportArcs.java
* src/java/org/archive/access/nutch/IndexArcs.java
  Added.
* bin/arc2seg.sh
* src/java/org/archive/access/nutch/Arc2Segment.java
  Removed.

Index: nutch-site.xml.template
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml.template,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** nutch-site.xml.template 28 Nov 2005 21:23:22 -0000 1.5
--- nutch-site.xml.template 29 Nov 2005 21:43:42 -0000 1.6
***************
*** 1,17 ****
  <?xml version="1.0"?>
! <!--Internet Archive Nutch configuration. This config. is what gets built into
! nutchwax. Overrides a few Nutch defaults and adds nutchwax specific
! config (Such config. options have an 'archive' prefix).
! -->
! <nutch-conf>
! <!-- Enable parse-ext (parse-ext is a parser that calls the 'ext'ernal program
! xpdf to parse pdf files). Also enable parse-default and the ia plugins.
! -->
  <property>
  <name>plugin.includes</name>
! <value>urlfilter-regex|parse-(text|html|ext|default)|index-(basic|ia)|query-(basic|site|url|ia)</value>
  </property>
--- 1,24 ----
  <?xml version="1.0"?>
! <!--Internet Archive Nutch(WAX) configuration. Bulk of below is overrides
! for nutch-default.xml but on the end we add a few new properties with
! 'archive' prefix.
! This conf file is picked up and built into nutchwax distribution. Is
! mostly same as the nutch-site.xml but with nutchwax specific configurations
! added: i.e. we index all content rather than just a subset.
! -->
! <nutch-conf>
  <property>
  <name>plugin.includes</name>
! <value>urlfilter-regex|parse-(text|html|js|ext)|index-(basic|ia)|query-(basic|site|url|ia)</value>
! <description>Regular expression naming plugin directory names to
! include. Any plugin not matching this expression is excluded.
! In any case you need at least include the nutch-extensionpoints plugin.
!
! Override Nutch defaults to add nutchwax/ia (Internet Archive) plugins.
! The parse-ext is used to call the native pdftotext parsing application/pdf.
! </description>
  </property>
***************
*** 27,40 ****
  <property>
! <name>indexer.boost.by.link.count</name>
! <value>true</value>
! <description>Use in-degree as poor-man's link analysis.</description>
  </property>
  <property>
! <name>indexer.max.tokens</name>
! <value>100000</value>
! <description>Don't truncate documents as much as by default.
! </description>
  </property>
--- 34,62 ----
  <property>
! <name>indexer.max.tokens</name>
! <value>100000</value>
! <description>
! The maximum number of tokens that will be indexed for a single field
! in a document. This limits the amount of memory required for
! indexing, so that collections with very large files will not crash
! the indexing process by running out of memory.
!
! Note that this effectively truncates large documents, excluding
! from the index tokens that occur further in the document. If you
! know your source documents are large, be sure to set this value
! high enough to accomodate the expected size. If you set it to
! Integer.MAX_VALUE, then the only limit is your memory, but you
! should anticipate an OutOfMemoryError.
!
! Make it ten times default.
! </description>
  </property>
  <property>
! <name>parse.plugin.file</name>
! <value>ia-parse-plugins.xml</value>
! <description>The name of the file that defines the associations between
! content-types and parsers.
! </description>
  </property>
***************
*** 42,45 ****
--- 64,74 ----
  <name>http.content.limit</name>
  <value>10000000</value>
+
+ <description>The length limit for downloaded content, in bytes.
+ If this value is nonnegative (>=0), content longer than it will be truncated;
+ otherwise, no truncation at all.
+
+ Used limiting amount of an ARC Record indexing during ARC ingest time.
+ </description>
  </property>
***************
*** 47,54 ****
  <name>io.map.index.skip</name>
  <value>7</value>
! <description>Use less RAM. Index files get read into memory. This config.
! says read 1/7th only in at a time. Random access is slower but use more
! memory.
! </description>
  </property>
--- 76,83 ----
  <name>io.map.index.skip</name>
  <value>7</value>
! <description>Use less RAM.
! Index files get read into memory. This config. says read 1/7th only in
! at a time. Random access is slower but use more memory.
! </description>
  </property>
***************
*** 60,80 ****
  more memory but make searches somewhat faster. Larger values use less
  memory but make searches somewhat slower.
! For lucene indexes, normally. The default is 128.
! Write every 1024 entries rather than every 128, the default.
  </description>
  </property>
  <property>
! <name>indexer.maxMergeDocs</name>
! <value>2147483647</value>
! <description>This number determines the maximum number of Lucene
! Documents to be merged into a new Lucene segment. Larger values
! increase indexing speed and reduce the number of Lucene segments,
! which reduces the number of open file handles; however, this also
! increases RAM usage during indexing.
! Doug says: "There was a bogus value for indexer.maxMergeDocs in
! nutch-default.xml which made indexing really slow. The correct
! value is something really big (like Integer.MAX_VALUE)."
  </description>
  </property>
--- 89,108 ----
  more memory but make searches somewhat faster. Larger values use less
  memory but make searches somewhat slower.
!
! Default is 128.
  </description>
  </property>
  <property>
! <name>indexer.mergeFactor</name>
! <value>30</value>
! <description>The factor that determines the frequency of Lucene segment
! merges. This must not be less than 2, higher values increase indexing
! speed but lead to increased RAM usage, and increase the number of
! open file handles (which may lead to "Too many open files" errors).
! NOTE: the "segments" here have nothing to do with Nutch segments, they
! are a low-level data unit used by Lucene.
! Default is 50.
  </description>
  </property>
***************
*** 86,90 ****
  The number of context terms to display preceding and following
  matching terms in a hit summary.
! Make summaries a little longer than the default.
  </description>
  </property>
--- 114,119 ----
  The number of context terms to display preceding and following
  matching terms in a hit summary.
!
! Make summaries a little longer than the default.
  </description>
  </property>
***************
*** 95,98 ****
--- 124,143 ----
  <description>
  The total number of terms to display in a hit summary.
+
+ Make summaries a little longer than the default.
+ </description>
+ </property>
+
+ <property>
+ <name>searcher.dir</name>
+ <value>crawl</value>
+ <description>
+ Path to root of crawl. This directory is searched (in
+ order) for either the file search-servers.txt, containing a list of
+ distributed search servers, or the directory "index" containing
+ merged indexes, or the directory "segments" containing segment
+ indexes.
+
+ Included here for convenience.
  </description>
  </property>
***************
*** 101,140 ****
  <name>collections.host</name>
  <value>collections.example.org</value>
! <description>The name of the server hosting collections.
! </description>
  </property>
- <!-- The name of this archive collection.
- DEPRECATED. Now search.jsp uses the 'collection' returned by the search
- result drawing up the wayback URL and at index time, use the
- command-line 'collection' option.
-
  <property>
  <name>archive.collection</name>
! <value>be05</value>
! </property>
! -->
! <!--
! <property>
! <name>searcher.dir</name>
! <value>/home/stack/workspace/nutch-datadir</value>
! <description>Optionally, hardcode the nutch datadir location rather
! than rely on tomcat startup location.
! </description>
  </property>
- -->
  <property>
  <name>archive.index.all</name>
  <value>true</value>
- <description>If set to true, all contenttypes are indexed.
- Otherwise we only index text/* and application/*
- </description>
  </property>
  <property>
  <name>archive.skip.big.html</name>
! <value>-1</value>
  <description>If text/html is larger than value, just skip it completely.
  Use this setting to bypass problematic massive text/html (We were seeing
--- 146,175 ----
  <name>collections.host</name>
  <value>collections.example.org</value>
! <description>The name of the server hosting collections.
! Used by the webapp conjuring URLs that point to page renderor (e.g. wayback).
! </description>
  </property>
  <property>
  <name>archive.collection</name>
! <value>CHANGEME</value>
! <description>Name of collection being searched. Used at ARC ingest time to
! add a 'collection' field to the indexed document.
! Set this before starting an indexing.
! </description>
  </property>
+ <!--If set to true, all contenttypes are indexed. Otherwise we only
+ index text/* and application/*
+ -->
  <property>
  <name>archive.index.all</name>
  <value>true</value>
  </property>
  <property>
  <name>archive.skip.big.html</name>
! <value>10000000</value>
  <description>If text/html is larger than value, just skip it completely.
  Use this setting to bypass problematic massive text/html (We were seeing
***************
*** 142,152 ****
  value is -1 which says don't skip text/html docs.</description>
  </property>
  <property>
  <name>archive.index.redirects</name>
! <value>-false</value>
  <description>If true, we index redirects (status code 30x).
  </description>
  </property>
-
  </nutch-conf>
--- 177,187 ----
  value is -1 which says don't skip text/html docs.</description>
  </property>
+
  <property>
  <name>archive.index.redirects</name>
! <value>-true</value>
  <description>If true, we index redirects (status code 30x).
  </description>
  </property>
  </nutch-conf>

--- NEW FILE: ia-parse-plugins.xml ---
<?xml version="1.0" encoding="UTF-8"?>
<!--
  Copyright 2005 The Apache Software Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.

  Author : mattmann
  Description: This xml file represents a natural ordering for which parsing
  plugin should get called for a particular mimeType.
-->
<parse-plugins>
  <!-- by default if the mimeType is set to *, or can't be determined, use parse-text -->
  <mimeType name="*">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/java">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/msword">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/pdf">
    <plugin id="parse-ext" />
  </mimeType>
  <mimeType name="application/rss+xml">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/vnd.ms-excel">
    <plugin id="parse-msexcel" />
  </mimeType>
  <mimeType name="application/vnd.ms-powerpoint">
    <plugin id="parse-mspowerpoint" />
  </mimeType>
  <mimeType name="application/vnd.wap.wbxml">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/vnd.wap.wmlc">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/vnd.wap.wmlscriptc">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-bzip2">
    <!-- try and parse it with the zip parser -->
    <plugin id="parse-zip" />
  </mimeType>
  <mimeType name="application/x-csh">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-gzip">
    <!-- try and parse it with the zip parser -->
    <plugin id="parse-zip" />
  </mimeType>
  <mimeType name="application/x-javascript">
    <plugin id="parse-js" />
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-kword">
    <!-- try and parse it with the word parser -->
    <plugin id="parse-msword" />
  </mimeType>
  <mimeType name="application/x-kspread">
    <!-- try and parse it with the msexcel parser -->
    <plugin id="parse-msexcel" />
  </mimeType>
  <mimeType name="application/x-latex">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-netcdf">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-sh">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-tcl">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-tex">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-texinfo">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-troff">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-troff-man">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-troff-me">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/x-troff-ms">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="application/zip">
    <plugin id="parse-zip" />
  </mimeType>
  <mimeType name="message/news">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="message/rfc822">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="text/css">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="text/plain">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="text/richtext">
    <plugin id="parse-rtf" />
    <plugin id="parse-msword" />
  </mimeType>
  <mimeType name="text/rtf">
    <plugin id="parse-rtf" />
    <plugin id="parse-msword" />
  </mimeType>
  <mimeType name="text/sgml">
    <plugin id="parse-html" />
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="text/tab-separated-values">
    <plugin id="parse-msexcel" />
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="text/vnd.wap.wml">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="text/vnd.wap.wmlscript">
    <plugin id="parse-text" />
  </mimeType>
  <mimeType name="text/xml">
    <plugin id="parse-text" />
    <plugin id="parse-html" />
    <plugin id="parse-rss" />
  </mimeType>
  <mimeType name="text/x-setext">
    <plugin id="parse-text" />
  </mimeType>
  <!-- Types for parse-ext plugin: required for unit tests to pass. -->
  <mimeType name="application/vnd.nutch.example.cat">
    <plugin id="parse-ext" />
  </mimeType>
  <mimeType name="application/vnd.nutch.example.md5sum">
    <plugin id="parse-ext" />
  </mimeType>
</parse-plugins>

Index: nutch-site.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/conf/nutch-site.xml,v
retrieving revision 1.27
retrieving revision 1.28
diff -C2 -d -r1.27 -r1.28
*** nutch-site.xml 30 Sep 2005 21:07:07 -0000 1.27
--- nutch-site.xml 29 Nov 2005 21:43:42 -0000 1.28
***************
*** 1,40 ****
  <?xml version="1.0"?>
! <!-- Internet Archive Nutch configuration -->
  <nutch-conf>
-
- <!-- Override a few Nutch defaults -->
-
-
- <!-- Enable parse-ext (parse-ext is a parser that calls the 'ext'ernal program
- xpdf to parse pdf files. Also enable parse-default and the ia plugins.
- -->
  <property>
  <name>plugin.includes</name>
! <value>urlfilter-regex|parse-(text|html|ext|default)|index-(basic|ia)|query-(basic|site|url|ia)</value>
  </property>
- <!-- keep all links, not just inter-host -->
- <!-- db updates will be FASTER if set to true.
- Downside is that link text from same site won't be included.
- (More valuable to take anchor text from other hosts). Use this
- if wide variety of sites to index.
- -->
  <property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  </property>
- <!-- use in-degree as poor-man's link analysis -->
  <property>
! <name>indexer.boost.by.link.count</name>
! <value>true</value>
  </property>
- <!-- don't truncate documents as much as by default -->
  <property>
! <name>indexer.max.tokens</name>
! <value>100000</value>
  </property>
--- 1,58 ----
  <?xml version="1.0"?>
! <!--Internet Archive Nutch(WAX) configuration. Bulk of below is overrides
! for nutch-default.xml but on the end we add a few new properties with
! 'archive' prefix.
! -->
  <nutch-conf>
  <property>
  <name>plugin.includes</name>
! <value>urlfilter-regex|parse-(text|html|js|ext)|index-(basic|ia)|query-(basic|site|url|ia)</value>
! <description>Regular expression naming plugin directory names to
! include. Any plugin not matching this expression is excluded.
! In any case you need at least include the nutch-extensionpoints plugin.
!
! Override Nutch defaults to add nutchwax/ia (Internet Archive) plugins.
! The parse-ext is used to call the native pdftotext parsing application/pdf.
! </description>
  </property>
  <property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
+ <description>Keep all links, not just inter-host. db updates will be
+ FASTER if set to true. Downside is that link text from same site won't
+ be included (More valuable to take anchor text from other hosts). Use
+ this if wide variety of sites to index.
+ </description>
  </property>
  <property>
! <name>indexer.max.tokens</name>
! <value>100000</value>
! <description>
! The maximum number of tokens that will be indexed for a single field
! in a document. This limits the amount of memory required for
! indexing, so that collections with very large files will not crash
! the indexing process by running out of memory.
!
! Note that this effectively truncates large documents, excluding
! from the index tokens that occur further in the document. If you
! know your source documents are large, be sure to set this value
! high enough to accomodate the expected size. If you set it to
! Integer.MAX_VALUE, then the only limit is your memory, but you
! should anticipate an OutOfMemoryError.
!
! Make it ten times default.
! </description>
  </property>
  <property>
! <name>parse.plugin.file</name>
! <value>ia-parse-plugins.xml</value>
! <description>The name of the file that defines the associations between
! content-types and parsers.
! </description>
  </property>
***************
*** 42,60 ****
  <name>http.content.limit</name>
  <value>10000000</value>
  </property>
- <!-- use less RAM -->
- <!-- Index files get read into memory. This config. says read 1/7th only in
- at a time. Random access is slower but use more memory. -->
  <property>
  <name>io.map.index.skip</name>
  <value>7</value>
  </property>
-
-
- <!-- For lucene indexes, normally. The default is 128.
- Write every 1024 entries rather than every 128, the default.
- -->
  <property>
  <name>indexer.termIndexInterval</name>
--- 60,81 ----
  <name>http.content.limit</name>
  <value>10000000</value>
+
+ <description>The length limit for downloaded content, in bytes.
+ If this value is nonnegative (>=0), content longer than it will be truncated;
+ otherwise, no truncation at all.
+
+ Used limiting amount of an ARC Record indexing during ARC ingest time.
+ </description>
  </property>
  <property>
  <name>io.map.index.skip</name>
  <value>7</value>
+ <description>Use less RAM.
+ Index files get read into memory. This config. says read 1/7th only in
+ at a time. Random access is slower but use more memory.
+ </description>
  </property>
  <property>
  <name>indexer.termIndexInterval</name>
***************
*** 64,87 ****
  more memory but make searches somewhat faster. Larger values use less
  memory but make searches somewhat slower.
  </description>
  </property>
  <property>
! <name>indexer.maxMergeDocs</name>
! <value>2147483647</value>
! <description>This number determines the maximum number of Lucene
! Documents to be merged into a new Lucene segment. Larger values
! increase indexing speed and reduce the number of Lucene segments,
! which reduces the number of open file handles; however, this also
! increases RAM usage during indexing.
! Doug says: "There was a bogus value for indexer.maxMergeDocs in
! nutch-default.xml which made indexing really slow. The correct
! value is something really big (like Integer.MAX_VALUE)."
  </description>
  </property>
-
- <!-- make summaries a little longer than the default -->
  <property>
  <name>searcher.summary.context</name>
--- 85,107 ----
  more memory but make searches somewhat faster. Larger values use less
  memory but make searches somewhat slower.
+
+ Default is 128.
  </description>
  </property>
  <property>
! <name>indexer.mergeFactor</name>
! <value>30</value>
! <description>The factor that determines the frequency of Lucene segment
! merges. This must not be less than 2, higher values increase indexing
! speed but lead to increased RAM usage, and increase the number of
! open file handles (which may lead to "Too many open files" errors).
! NOTE: the "segments" here have nothing to do with Nutch segments, they
! are a low-level data unit used by Lucene.
! Default is 50.
  </description>
  </property>
  <property>
  <name>searcher.summary.context</name>
***************
*** 90,93 ****
--- 110,115 ----
  The number of context terms to display preceding and following
  matching terms in a hit summary.
+
+ Make summaries a little longer than the default.
  </description>
  </property>
***************
*** 98,128 ****
  <description>
  The total number of terms to display in a hit summary.
  </description>
  </property>
- <!-- the name of the server hosting collections.-->
  <property>
  <name>collections.host</name>
  <value>collections.example.org</value>
  </property>
- <!-- The name of this archive collection.
- DEPRECATED. Now search.jsp uses the 'collection' returned by the search
- result drawing up the wayback URL and at index time, use the
- command-line 'collection' option.
-
  <property>
  <name>archive.collection</name>
! <value>be05</value>
! </property>
! -->
! <!--Optionally, hardcode the nutch datadir location rather
! than rely on tomcat startup location.
! <property>
! <name>searcher.dir</name>
! <value>/home/stack/workspace/nutch-datadir</value>
  </property>
- -->
  <!--If set to true, all contenttypes are indexed. Otherwise we only
--- 120,159 ----
  <description>
  The total number of terms to display in a hit summary.
+
+ Make summaries a little longer than the default.
+ </description>
+ </property>
+
+ <property>
+ <name>searcher.dir</name>
+ <value>crawl</value>
+ <description>
+ Path to root of crawl. This directory is searched (in
+ order) for either the file search-servers.txt, containing a list of
+ distributed search servers, or the directory "index" containing
+ merged indexes, or the directory "segments" containing segment
+ indexes.
+
+ Included here for convenience.
  </description>
  </property>
  <property>
  <name>collections.host</name>
  <value>collections.example.org</value>
+ <description>The name of the server hosting collections.
+ Used by the webapp conjuring URLs that point to page renderor (e.g. wayback).
+ </description>
  </property>
  <property>
  <name>archive.collection</name>
! <value>CHANGEME</value>
! <description>Name of collection being searched. Used at ARC ingest time to
! add a 'collection' field to the indexed document.
! Set this before starting an indexing.
! </description>
  </property>
  <!--If set to true, all contenttypes are indexed. Otherwise we only
***************
*** 131,135 ****
  <property>
  <name>archive.index.all</name>
! <value>true</value>
  </property>
  </nutch-conf>
--- 162,183 ----
  <property>
  <name>archive.index.all</name>
! <value>false</value>
! </property>
!
! <property>
! <name>archive.skip.big.html</name>
! <value>-1</value>
! <description>If text/html is larger than value, just skip it completely.
! Use this setting to bypass problematic massive text/html (We were seeing
! the text/html parser hang for hours in bad, big html docs). Default
! value is -1 which says don't skip text/html docs.</description>
! </property>
!
! <property>
! <name>archive.index.redirects</name>
! <value>-false</value>
! <description>If true, we index redirects (status code 30x).
! </description>
  </property>
+
  </nutch-conf>
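
The `plugin.includes` change in both config files swaps `default` for `js` in the `parse-(...)` group. As a rough illustration of what that does to plugin selection, here is a minimal Python sketch. It assumes the expression has to match the whole plugin directory name, which is how the property description reads but is not confirmed by this commit. The directory listing `PLUGIN_DIRS` and the helper `included` are hypothetical names used only for this example.

```python
import re

# Values copied from the diff: the old and new plugin.includes settings.
OLD_INCLUDES = r"urlfilter-regex|parse-(text|html|ext|default)|index-(basic|ia)|query-(basic|site|url|ia)"
NEW_INCLUDES = r"urlfilter-regex|parse-(text|html|js|ext)|index-(basic|ia)|query-(basic|site|url|ia)"


def included(pattern, plugin_dirs):
    """Return the directory names that the whole pattern matches.

    Assumption: a plugin is included only when the expression matches
    its entire directory name (full match, not a substring search).
    """
    compiled = re.compile(pattern)
    return [name for name in plugin_dirs if compiled.fullmatch(name)]


# Hypothetical plugin directory listing, for illustration only.
PLUGIN_DIRS = [
    "urlfilter-regex", "parse-text", "parse-html", "parse-js", "parse-ext",
    "parse-default", "index-basic", "index-ia", "query-ia", "parse-pdf",
]

old_set = set(included(OLD_INCLUDES, PLUGIN_DIRS))
new_set = set(included(NEW_INCLUDES, PLUGIN_DIRS))

print("dropped:", sorted(old_set - new_set))  # dropped: ['parse-default']
print("added:", sorted(new_set - old_set))    # added: ['parse-js']
```

Under that assumption, the only effect of the edit on this sample listing is that `parse-default` stops being selected and `parse-js` starts being selected, which lines up with the "Commented out parse-default" note for src/plugin/build.xml in the log message.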