From: Michael S. <sta...@us...> - 2005-09-15 22:18:15

Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv32472

Modified Files:
    warc_file_format.html warc_file_format.txt warc_file_format.xml
Log Message:
* warc_file_format.xml
    Added Appendix C of collection ABNF (Needs work still).
* warc_file_format.html
* warc_file_format.txt
    Generated from warc_file_format.xml

Index: warc_file_format.html
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** warc_file_format.html  28 Aug 2005 18:55:30 -0000  1.7
--- warc_file_format.html  15 Sep 2005 22:18:05 -0000  1.8
***************
*** 263,266 ****
--- 263,268 ----
  <a href="#anchor40">Appendix B.8.</a>
  Example of 'continuation' Record<br />
+ <a href="#anchor41">Appendix C.</a>
+ Collected BNF for WARC<br />
  <a href="#rfc.references1">14.</a>
  References<br />
***************
*** 1531,1534 ****
--- 1533,1579 ----
  the set, the one with the "Segment-Number: 1" named field.
  </p>
+ <a name="anchor41"></a><br /><hr />
+ <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table>
+ <a name="rfc.section.C"></a><h3>Appendix C.  Collected BNF for WARC</h3>
+ <pre>
+ warc-file = 1*warc-record
+ warc-record = header block CRLF CRLF
+ header = header-line CRLF *anvl-field CRLF
+ block = *OCTET
+
+ header-line = warc-id tsp data-length tsp record-type tsp
+               subject-uri tsp creation-date tsp
+               content-type tsp record-id
+ tsp = 1*WSP
+
+ warc-id = "warc/" DIGIT "." DIGIT
+ data-length = 1*DIGIT
+ record-type = "warcinfo" / "response" / "request" / "metadata" /
+               "revisit" / "conversion" / "continuation" /
+               future-type
+ future-type = 1*VCHAR
+ subject-uri = uri
+ uri = <'URI' per RFC3986>
+ creation-date = timestamp
+ timestamp = <date per below>
+ content-type = type "/" subtype
+ type = <'type' per RFC2045>
+ subtype = <'subtype' per RFC2045>
+ record-id = uri
+
+ anvl-field = field-name ":" [ field-body ] CRLF
+ field-name = 1*<any CHAR, excluding control-chars and ":">
+ field-body = text [CRLF LWSP-char field-body]
+ text = 1*<any UTF-8 character, including bare
+        CR and bare LF, but NOT including CRLF>
+ ; (Octal, Decimal.)
+ CHAR = <any ASCII/UTF-8 character> ; (0-177, 0.-127.)
+ CR = <ASCII CR, carriage return> ; ( 15, 13.)
+ LF = <ASCII LF, linefeed> ; ( 12, 10.)
+ SPACE = <ASCII SP, space> ; ( 40, 32.)
+ HTAB = <ASCII HT, horizontal-tab> ; ( 11, 9.)
+ CRLF = CR LF
+ LWSP-char = SPACE / HTAB ; semantics = SPACE
+ </pre>
  <a name="rfc.references1"></a><br /><hr />
  <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table>

Index: warc_file_format.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** warc_file_format.xml  28 Aug 2005 18:55:30 -0000  1.11
--- warc_file_format.xml  15 Sep 2005 22:18:06 -0000  1.12
***************
*** 17,21 ****
  <!ENTITY rfc2540 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2540.xml'>
  <!ENTITY rfc4027 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.4027.xml'>
- 
  ]>
  <?rfc symrefs="yes"?>
--- 17,20 ----
***************
*** 1397,1401 ****
--- 1396,1452 ----
  </appendix>
+ </appendix>
+
+ <appendix title="Collected BNF for WARC">
+ <!--
+ TODO: Bring in the definitions for OCTET, etc., from RFC2234.
+ TODO: Whats the slash mean?  Others have |.
+ TODO: Timestamp, mimetype.
+ TODO: The dot after in ANVL zero?
+ TODO: Do all abnf as entity includes so not repeated.
+ -->
+ <figure>
+ <artwork>
+ warc-file = 1*warc-record
+ warc-record = header block CRLF CRLF
+ header = header-line CRLF *anvl-field CRLF
+ block = *OCTET
+
+ header-line = warc-id tsp data-length tsp record-type tsp
+               subject-uri tsp creation-date tsp
+               content-type tsp record-id
+ tsp = 1*WSP
+
+ warc-id = "warc/" DIGIT "." DIGIT
+ data-length = 1*DIGIT
+ record-type = "warcinfo" / "response" / "request" / "metadata" /
+               "revisit" / "conversion" / "continuation" /
+               future-type
+ future-type = 1*VCHAR
+ subject-uri = uri
+ uri = <'URI' per RFC3986>
+ creation-date = timestamp
+ timestamp = <date per below>
+ content-type = type "/" subtype
+ type = <'type' per RFC2045>
+ subtype = <'subtype' per RFC2045>
+ record-id = uri
+
+ anvl-field = field-name ":" [ field-body ] CRLF
+ field-name = 1*<any CHAR, excluding control-chars and ":">
+ field-body = text [CRLF LWSP-char field-body]
+ text = 1*<any UTF-8 character, including bare
+        CR and bare LF, but NOT including CRLF>
+ ; (Octal, Decimal.)
+ CHAR = <any ASCII/UTF-8 character> ; (0-177, 0.-127.)
+ CR = <ASCII CR, carriage return> ; ( 15, 13.)
+ LF = <ASCII LF, linefeed> ; ( 12, 10.)
+ SPACE = <ASCII SP, space> ; ( 40, 32.)
+ HTAB = <ASCII HT, horizontal-tab> ; ( 11, 9.)
+ CRLF = CR LF
+ LWSP-char = SPACE / HTAB ; semantics = SPACE
+ </artwork>
+ </figure>
  </appendix>

Index: warc_file_format.txt
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** warc_file_format.txt  28 Aug 2005 18:55:30 -0000  1.6
--- warc_file_format.txt  15 Sep 2005 22:18:06 -0000  1.7
***************
*** 157,164 ****
  Appendix B.7.  Example of 'conversion' Record . . . . . . . . . . . 32
  Appendix B.8.  Example of 'continuation' Record . . . . . . . . . . 32
! 14.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 33
! Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 35
! Intellectual Property and Copyright Statements . . . . . . . . . . 36
! 
--- 157,164 ----
  Appendix B.7.  Example of 'conversion' Record . . . . . . . . . . . 32
  Appendix B.8.  Example of 'continuation' Record . . . . . . . . . . 32
! Appendix C.  Collected BNF for WARC . . . . . . . . . . . . . . . 34
! 14.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 34
! Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 37
! Intellectual Property and Copyright Statements . . . . . . . . . . 38
***************
*** 1812,1815 ****
--- 1812,1895 ----
  set, the one with the "Segment-Number: 1" named field.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Kunze, et al.          Expires January 2, 2006               [Page 33]
+
+ Internet-Draft        WARC File Format, 0.8revB              July 2005
+
+
+ Appendix C.  Collected BNF for WARC
+
+ warc-file = 1*warc-record
+ warc-record = header block CRLF CRLF
+ header = header-line CRLF *anvl-field CRLF
+ block = *OCTET
+
+ header-line = warc-id tsp data-length tsp record-type tsp
+               subject-uri tsp creation-date tsp
+               content-type tsp record-id
+ tsp = 1*WSP
+
+ warc-id = "warc/" DIGIT "." DIGIT
+ data-length = 1*DIGIT
+ record-type = "warcinfo" / "response" / "request" / "metadata" /
+               "revisit" / "conversion" / "continuation" /
+               future-type
+ future-type = 1*VCHAR
+ subject-uri = uri
+ uri = <'URI' per RFC3986>
+ creation-date = timestamp
+ timestamp = <date per below>
+ content-type = type "/" subtype
+ type = <'type' per RFC2045>
+ subtype = <'subtype' per RFC2045>
+ record-id = uri
+
+ anvl-field = field-name ":" [ field-body ] CRLF
+ field-name = 1*<any CHAR, excluding control-chars and ":">
+ field-body = text [CRLF LWSP-char field-body]
+ text = 1*<any UTF-8 character, including bare
+        CR and bare LF, but NOT including CRLF>
+ ; (Octal, Decimal.)
+ CHAR = <any ASCII/UTF-8 character> ; (0-177, 0.-127.)
+ CR = <ASCII CR, carriage return> ; ( 15, 13.)
+ LF = <ASCII LF, linefeed> ; ( 12, 10.)
+ SPACE = <ASCII SP, space> ; ( 40, 32.)
+ HTAB = <ASCII HT, horizontal-tab> ; ( 11, 9.)
+ CRLF = CR LF
+ LWSP-char = SPACE / HTAB ; semantics = SPACE
+
+
  14.  References
***************
*** 1818,1821 ****
--- 1898,1909 ----
  [ARC]  Burner, M. and B. Kahle, "The ARC File Format",
+
+
+
+ Kunze, et al.          Expires January 2, 2006               [Page 34]
+
+ Internet-Draft        WARC File Format, 0.8revB              July 2005
+
+
  September 1996.
***************
*** 1842,1853 ****
  [RFC1884]  Hinden, R. and S. Deering, "IP Version 6 Addressing
- 
- 
- 
- Kunze, et al.          Expires January 2, 2006               [Page 33]
- 
- Internet-Draft        WARC File Format, 0.8revB              July 2005
- 
- 
  Architecture", RFC 1884, December 1995.
--- 1930,1933 ----
***************
*** 1874,1877 ****
--- 1954,1965 ----
  [RFC2540]  Eastlake, D., "Detached Domain Name System (DNS)
+
+
+
+ Kunze, et al.          Expires January 2, 2006               [Page 35]
+
+ Internet-Draft        WARC File Format, 0.8revB              July 2005
+
+
  Information", RFC 2540, March 1999.
***************
*** 1901,1905 ****
! Kunze, et al.          Expires January 2, 2006               [Page 34]
  Internet-Draft        WARC File Format, 0.8revB              July 2005
--- 1989,2017 ----
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! Kunze, et al.          Expires January 2, 2006               [Page 36]
  Internet-Draft        WARC File Format, 0.8revB              July 2005
***************
*** 1957,1961 ****
! Kunze, et al.          Expires January 2, 2006               [Page 35]
  Internet-Draft        WARC File Format, 0.8revB              July 2005
--- 2069,2073 ----
! Kunze, et al.          Expires January 2, 2006               [Page 37]
  Internet-Draft        WARC File Format, 0.8revB              July 2005
***************
*** 2013,2016 ****
! Kunze, et al.          Expires January 2, 2006               [Page 36]
--- 2125,2128 ----
! Kunze, et al.          Expires January 2, 2006               [Page 38]
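To make the collected grammar concrete, here is a sketch of a single record laid out per the header-line production. Every value is invented for illustration (version, data-length, URI, date, and record-id), the ANVL field is a made-up example of the anvl-field shape, and the block is elided:

    warc/0.8 1321 response http://example.com/ 20050915221805 text/html uuid:c951b2a0-1fe6-11da-8344-cb5b254522a8
    Checksum: md5:9e107d9d372bb6826bd81d3542a419d6

    [block: *OCTET, e.g. the raw HTTP response]

The seven header-line fields are separated by tsp (1*WSP); each anvl-field sits on its own CRLF-terminated line, a further CRLF closes the header, and the record is terminated by CRLF CRLF after the block.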
From: Michael S. <sta...@us...> - 2005-09-15 21:57:35

Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs/oswir
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25329

Added Files:
    2005-oswir-wacsearch.ppt
Log Message:
* 2005-oswir-wacsearch.ppt
    Slides.

--- NEW FILE: 2005-oswir-wacsearch.ppt ---
(This appears to be a binary file; contents omitted.)
From: Michael S. <sta...@us...> - 2005-09-15 18:23:06

Update of /cvsroot/archive-access/archive-access/projects/nutch/bin
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9560/bin

Modified Files:
    nutch
Log Message:
* bin/nutch
    Call the nutchwax merge.
* src/java/org/archive/access/nutch/NutchwaxIndexMerger.java
    Adds being able to pass dir of segments (For Dan).

Index: nutch
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/bin/nutch,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** nutch  5 Sep 2005 20:04:31 -0000  1.1
--- nutch  15 Sep 2005 18:22:53 -0000  1.2
***************
*** 145,149 ****
    CLASS=org.apache.nutch.indexer.IndexSegment
  elif [ "$COMMAND" = "merge" ] ; then
!   CLASS=org.apache.nutch.indexer.IndexMerger
  elif [ "$COMMAND" = "dedup" ] ; then
    CLASS=org.apache.nutch.indexer.DeleteDuplicates
--- 145,152 ----
    CLASS=org.apache.nutch.indexer.IndexSegment
  elif [ "$COMMAND" = "merge" ] ; then
!   # Use the nutchwax merger. It adds being able to take a dir of segments.
!   # TODO: Make this a subclass rather than a copy. Looks like I can. But
!   # am in a bit of a hurry at the moment.
!   CLASS=org.archive.access.nutch.NutchwaxIndexMerger
  elif [ "$COMMAND" = "dedup" ] ; then
    CLASS=org.apache.nutch.indexer.DeleteDuplicates
***************
*** 153,159 ****
    CLASS=org.apache.nutch.tools.UpdateSegmentsFromDb
  elif [ "$COMMAND" = "mergesegs" ] ; then
!   # Copy over the nutchwax version of segment merge.
!   # It will work w/ segments made by nutchwax. Also
!   # does not do a merge.
    CLASS=org.archive.access.nutch.NutchwaxSegmentMergeTool
  elif [ "$COMMAND" = "readdb" ] ; then
--- 156,161 ----
    CLASS=org.apache.nutch.tools.UpdateSegmentsFromDb
  elif [ "$COMMAND" = "mergesegs" ] ; then
!   # Use the merge from nutchwax. It doesn't expect content to be in place
!   # and it disables deduping.
    CLASS=org.archive.access.nutch.NutchwaxSegmentMergeTool
  elif [ "$COMMAND" = "readdb" ] ; then
From: Michael S. <sta...@us...> - 2005-09-15 18:23:01

Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9560/src/java/org/archive/access/nutch

Added Files:
    NutchwaxIndexMerger.java
Log Message:
* bin/nutch
    Call the nutchwax merge.
* src/java/org/archive/access/nutch/NutchwaxIndexMerger.java
    Adds being able to pass dir of segments (For Dan).

--- NEW FILE: NutchwaxIndexMerger.java ---
/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.archive.access.nutch;

import java.io.*;
import java.text.*;
import java.util.*;
import java.util.logging.*;

import org.apache.nutch.fs.*;
import org.apache.nutch.util.*;
import org.apache.nutch.indexer.*;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.index.IndexWriter;

/*************************************************************************
 * NutchwaxIndexMerger creates an index for the output corresponding to a
 * single fetcher run.
 *
 * Based on the nutch IndexMerger.  Adds being able to pass a directory that
 * holds Segments.  St.Ack on 09/14/2005.
 *
 * @author Doug Cutting
 * @author Mike Cafarella
 *************************************************************************/
public class NutchwaxIndexMerger {
  public static final Logger LOG =
    LogFormatter.getLogger("org.apache.nutch.indexer.NutchwaxIndexMerger");

  public static final String DONE_NAME = "merge.done";

  private int MERGE_FACTOR = NutchConf.get().getInt("indexer.mergeFactor",
    IndexWriter.DEFAULT_MERGE_FACTOR);
  private int MIN_MERGE_DOCS = NutchConf.get().getInt("indexer.minMergeDocs",
    IndexWriter.DEFAULT_MIN_MERGE_DOCS);
  private int MAX_MERGE_DOCS = NutchConf.get().getInt("indexer.maxMergeDocs",
    IndexWriter.DEFAULT_MAX_MERGE_DOCS);
  private int TERM_INDEX_INTERVAL =
    NutchConf.get().getInt("indexer.termIndexInterval",
      IndexWriter.DEFAULT_TERM_INDEX_INTERVAL);

  private NutchFileSystem nfs;
  private File outputIndex;
  private File localWorkingDir;
  private File[] segments;

  /**
   * Merge all of the segments given
   */
  public NutchwaxIndexMerger(NutchFileSystem nfs, File[] segments,
      File outputIndex, File localWorkingDir) throws IOException {
    this.nfs = nfs;
    this.segments = segments;
    this.outputIndex = outputIndex;
    this.localWorkingDir = localWorkingDir;
  }

  /**
   * Load all input segment indices, then add to the single output index
   */
  public void merge() throws IOException {
    //
    // Open local copies of NFS indices
    //
    Directory[] dirs = new Directory[segments.length];
    File[] localSegments = new File[segments.length];
    for (int i = 0; i < segments.length; i++) {
      File tmpFile = new File(localWorkingDir, "indexmerge-" +
        new SimpleDateFormat("yyyMMddHHmmss").format(
          new Date(System.currentTimeMillis())));
      localSegments[i] =
        nfs.startLocalInput(new File(segments[i], "index"), tmpFile);
      dirs[i] = FSDirectory.getDirectory(localSegments[i], false);
    }

    //
    // Get local output target
    //
    File tmpLocalOutput = new File(localWorkingDir, "merge-output");
    File localOutput = nfs.startLocalOutput(outputIndex, tmpLocalOutput);

    //
    // Merge indices
    //
    IndexWriter writer = new IndexWriter(localOutput, null, true);
    writer.mergeFactor = MERGE_FACTOR;
    writer.minMergeDocs = MIN_MERGE_DOCS;
    writer.maxMergeDocs = MAX_MERGE_DOCS;
    writer.setTermIndexInterval(TERM_INDEX_INTERVAL);
    writer.infoStream = LogFormatter.getLogStream(LOG, Level.FINE);
    writer.setUseCompoundFile(false);
    writer.setSimilarity(new NutchSimilarity());
    writer.addIndexes(dirs);
    writer.close();

    //
    // Put target back
    //
    nfs.completeLocalOutput(outputIndex, tmpLocalOutput);

    //
    // Delete all local inputs, if necessary
    //
    for (int i = 0; i < localSegments.length; i++) {
      nfs.completeLocalInput(localSegments[i]);
    }
    localWorkingDir.delete();
  }

  /**
   * Create an index for the input files in the named directory.
   */
  public static void main(String[] args) throws Exception {
    String usage = "NutchwaxIndexMerger (-local | -ndfs <nameserver:port>) [-workingdir <workingdir>] outputIndex (-dir <input_segments_dir> | segments...)";
    if (args.length < 2) {
      System.err.println("Usage: " + usage);
      return;
    }

    //
    // Parse args, read all segment directories to be processed
    //
    NutchFileSystem nfs = NutchFileSystem.parseArgs(args, 0);
    try {
      File workingDir = new File(new File("").getCanonicalPath());
      Vector segments = new Vector();
      int i = 0;
      if ("-workingdir".equals(args[i])) {
        i++;
        workingDir = new File(new File(args[i++]).getCanonicalPath());
      }
      File outputIndex = new File(args[i++]);
      if (args[i].equals("-dir")) {
        // We've been passed a directory to look into.
        i++; // Move past the '-dir'
        File dir = new File(args[i]);
        File [] segs = dir.listFiles(new FileFilter() {
          public boolean accept(final File f) {
            // Only accept directories.  Assume all dirs are segments.
            return f.isDirectory();
          }
        });
        for (int j = 0; j < segs.length; j++) {
          segments.add(segs[j]);
        }
      } else {
        for (; i < args.length; i++) {
          if (args[i] != null) {
            segments.add(new File(args[i]));
          }
        }
      }

      workingDir = new File(workingDir, "indexmerger-workingdir");

      //
      // Merge the indices
      //
      File[] segmentFiles =
        (File[]) segments.toArray(new File[segments.size()]);
      LOG.info("merging segment indexes to: " + outputIndex);
      if (workingDir.exists()) {
        FileUtil.fullyDelete(workingDir);
      }
      workingDir.mkdirs();
      NutchwaxIndexMerger merger =
        new NutchwaxIndexMerger(nfs, segmentFiles, outputIndex, workingDir);
      merger.merge();
      LOG.info("done merging");
      FileUtil.fullyDelete(workingDir);
    } finally {
      nfs.close();
    }
  }
}
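Given the usage string above, and with bin/nutch now routing the merge command to this class, a run over a whole directory of segments might look like this (paths invented for illustration):

    bin/nutch merge -local /2/katrina/merged-index -dir /2/katrina/nutch-data/segments

The -dir form picks up every subdirectory as a segment, which is the point of this variant: you don't have to list each segment on the command line.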
From: Michael S. <sta...@us...> - 2005-09-15 18:23:01

Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9560/xdocs

Added Files:
    2005-oswir-wacsearch.sxi
Log Message:
* bin/nutch
    Call the nutchwax merge.
* src/java/org/archive/access/nutch/NutchwaxIndexMerger.java
    Adds being able to pass dir of segments (For Dan).

--- NEW FILE: 2005-oswir-wacsearch.sxi ---
(This appears to be a binary file; contents omitted.)
From: stack <st...@ar...> - 2005-09-14 21:52:42

Lukas Matejka wrote:

> i downloaded new version of nutch from cvs and i think that script
> indexarcs.sh still doesn't work well.
>
> (in previous version i had to use absolute paths and no links in directories)

Links should be fine. Works for me.

> with relative paths same result...
>
> in dir archive are symlinks to arcs.

The below looks like it's not finding any arcs in /home/nwa/nutchwax/archive.
Are there files with a '.arc.gz' ending in /home/nwa/nutchwax/archive? We're
just skipping through the segmenting step w/o indexing anything. We then get
to the update-from-db step, but no segments were created at the indexing
stage.

St.Ack

> ./bin/indexarcs.sh -s /home/nwa/nutchwax/archive -d /home/nwa/nutchwax/data -c test
> St zář 14 23:12:36 CEST 2005 Checking environment variables.
> St zář 14 23:12:36 CEST 2005 Cleaning up all /home/nwa/nutchwax/data content.
> St zář 14 23:12:36 CEST 2005 Creating new queue, and segments.
> St zář 14 23:12:36 CEST 2005 Started segmenting.
> St zář 14 23:12:36 CEST 2005 Started build of link database.
> 050914 231237 parsing file:/home/nwa/nutchwax/conf/nutch-default.xml
> 050914 231238 parsing file:/home/nwa/nutchwax/conf/nutch-site.xml
> 050914 231238 No FS indicated, using default:local
> 050914 231238 Created webdb at LocalFS,/home/nwa/nutchwax/data/db
> 050914 231239 parsing file:/home/nwa/nutchwax/conf/nutch-default.xml
> 050914 231240 parsing file:/home/nwa/nutchwax/conf/nutch-site.xml
> 050914 231240 No FS indicated, using default:local
> 050914 231240 Updating /home/nwa/nutchwax/data/db
> 050914 231240 Updating for /home/nwa/nutchwax/data/segments/*
> Exception in thread "main"
> java.io.FileNotFoundException: /home/nwa/nutchwax/data/segments/*/fetcher/data
>     at org.apache.nutch.fs.LocalFileSystem.open(LocalFileSystem.java:93)
>     at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:194)
>     at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:187)
>     at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:190)
>     at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:179)
>     at org.apache.nutch.io.ArrayFile$Reader.<init>(ArrayFile.java:50)
>     at org.apache.nutch.tools.UpdateDatabaseTool.updateForSegment(UpdateDatabaseTool.java:92)
>     at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:366)
> 050914 231242 parsing file:/home/nwa/nutchwax/conf/nutch-default.xml
>
> l.
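A quick way to answer the question above, i.e. whether anything under the archive directory actually matches the script's default '*.arc.gz' name filter, symlinks included, is a one-liner like this (a sketch using the path from the report, not part of the scripts):

    find /home/nwa/nutchwax/archive/ -name '*.arc.gz'

If the symlinks were created without the .arc.gz suffix in their names, nothing matches, the segmenting step has no input, and the update-from-db step then fails on the empty segments directory exactly as in the log quoted above.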
From: Lukas M. <mat...@ce...> - 2005-09-14 21:17:04

i downloaded new version of nutch from cvs and i think that script
indexarcs.sh still doesn't work well.

(in previous version i had to use absolute paths and no links in directories)

with relative paths same result...

in dir archive are symlinks to arcs.

./bin/indexarcs.sh -s /home/nwa/nutchwax/archive -d /home/nwa/nutchwax/data -c test
St zář 14 23:12:36 CEST 2005 Checking environment variables.
St zář 14 23:12:36 CEST 2005 Cleaning up all /home/nwa/nutchwax/data content.
St zář 14 23:12:36 CEST 2005 Creating new queue, and segments.
St zář 14 23:12:36 CEST 2005 Started segmenting.
St zář 14 23:12:36 CEST 2005 Started build of link database.
050914 231237 parsing file:/home/nwa/nutchwax/conf/nutch-default.xml
050914 231238 parsing file:/home/nwa/nutchwax/conf/nutch-site.xml
050914 231238 No FS indicated, using default:local
050914 231238 Created webdb at LocalFS,/home/nwa/nutchwax/data/db
050914 231239 parsing file:/home/nwa/nutchwax/conf/nutch-default.xml
050914 231240 parsing file:/home/nwa/nutchwax/conf/nutch-site.xml
050914 231240 No FS indicated, using default:local
050914 231240 Updating /home/nwa/nutchwax/data/db
050914 231240 Updating for /home/nwa/nutchwax/data/segments/*
Exception in thread "main"
java.io.FileNotFoundException: /home/nwa/nutchwax/data/segments/*/fetcher/data
    at org.apache.nutch.fs.LocalFileSystem.open(LocalFileSystem.java:93)
    at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:194)
    at org.apache.nutch.io.SequenceFile$Reader.<init>(SequenceFile.java:187)
    at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:190)
    at org.apache.nutch.io.MapFile$Reader.<init>(MapFile.java:179)
    at org.apache.nutch.io.ArrayFile$Reader.<init>(ArrayFile.java:50)
    at org.apache.nutch.tools.UpdateDatabaseTool.updateForSegment(UpdateDatabaseTool.java:92)
    at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:366)
050914 231242 parsing file:/home/nwa/nutchwax/conf/nutch-default.xml

l.
From: Michael S. <sta...@us...> - 2005-09-09 23:43:18

Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs/iwaw
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5861/iwaw

Added Files:
    figure1.jpg figure2.jpg iwaw-wacsearch-tables.doc iwaw-wacsearch.doc
    iwaw-wacsearch.pdf
Log Message:
* faq.fml
    Point to new location.
* iwaw/figure1.jpg iwaw/figure2.jpg
* iwaw/iwaw-wacsearch-tables.doc iwaw/iwaw-wacsearch.doc
* iwaw/iwaw-wacsearch.pdf oswir/wacs-oswir.pdf
* oswir/wacs-oswir3.doc
    Added submitted versions of papers.
* google_ratzinger.jpg nutch_ratzinger.jpg oswir.html
* wacs-oswir.doc wacs-oswir.pdf web-collection-search.html
* web-collection-search2.doc
    Replaced by above final versions.

--- NEW FILE: figure1.jpg ---
(This appears to be a binary file; contents omitted.)

--- NEW FILE: iwaw-wacsearch.doc ---
(This appears to be a binary file; contents omitted.)

--- NEW FILE: figure2.jpg ---
(This appears to be a binary file; contents omitted.)

--- NEW FILE: iwaw-wacsearch.pdf ---
(This appears to be a binary file; contents omitted.)

--- NEW FILE: iwaw-wacsearch-tables.doc ---
(This appears to be a binary file; contents omitted.)
From: Michael S. <sta...@us...> - 2005-09-09 23:43:18

Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5861

Modified Files:
    faq.fml
Removed Files:
    google_ratzinger.jpg nutch_ratzinger.jpg oswir.html wacs-oswir.doc
    wacs-oswir.pdf web-collection-search.html web-collection-search2.doc
Log Message:
* faq.fml
    Point to new location.
* iwaw/figure1.jpg iwaw/figure2.jpg
* iwaw/iwaw-wacsearch-tables.doc iwaw/iwaw-wacsearch.doc
* iwaw/iwaw-wacsearch.pdf oswir/wacs-oswir.pdf
* oswir/wacs-oswir3.doc
    Added submitted versions of papers.
* google_ratzinger.jpg nutch_ratzinger.jpg oswir.html
* wacs-oswir.doc wacs-oswir.pdf web-collection-search.html
* web-collection-search2.doc
    Replaced by above final versions.

--- wacs-oswir.pdf DELETED ---

--- web-collection-search2.doc DELETED ---

--- nutch_ratzinger.jpg DELETED ---

Index: faq.fml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/faq.fml,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** faq.fml  2 Aug 2005 18:34:02 -0000  1.6
--- faq.fml  9 Sep 2005 23:43:10 -0000  1.7
***************
*** 15,20 ****
  are known issues running against large collections).
  </p>
! <p>See <a href="web-collection-search.html">Full Text Searching of
! Web Archive Collections Using Nutch</a> for a fuller treatment of
  the problems this project addresses.</p>
  </answer>
--- 15,20 ----
  are known issues running against large collections).
  </p>
! <p>See <a href="iwaw/iwaw-wacsearch.pdf">Full Text Search of
! Web Archive Collections</a> for a fuller treatment of
  the problems this project addresses.</p>
  </answer>

--- web-collection-search.html DELETED ---

--- oswir.html DELETED ---

--- google_ratzinger.jpg DELETED ---

--- wacs-oswir.doc DELETED ---
From: Michael S. <sta...@us...> - 2005-09-09 23:43:18

Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs/oswir
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5861/oswir

Added Files:
    wacs-oswir.pdf wacs-oswir3.doc
Log Message:
* faq.fml
    Point to new location.
* iwaw/figure1.jpg iwaw/figure2.jpg
* iwaw/iwaw-wacsearch-tables.doc iwaw/iwaw-wacsearch.doc
* iwaw/iwaw-wacsearch.pdf oswir/wacs-oswir.pdf
* oswir/wacs-oswir3.doc
    Added submitted versions of papers.
* google_ratzinger.jpg nutch_ratzinger.jpg oswir.html
* wacs-oswir.doc wacs-oswir.pdf web-collection-search.html
* web-collection-search2.doc
    Replaced by above final versions.

--- NEW FILE: wacs-oswir3.doc ---
(This appears to be a binary file; contents omitted.)

--- NEW FILE: wacs-oswir.pdf ---
(This appears to be a binary file; contents omitted.)
From: Michael S. <sta...@us...> - 2005-09-09 23:39:20

Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs/oswir
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv5281/oswir

Log Message:
Directory /cvsroot/archive-access/archive-access/projects/nutch/xdocs/oswir
added to the repository
From: Michael S. <sta...@us...> - 2005-09-09 23:32:56

Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs/iwaw
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv3982/iwaw

Log Message:
Directory /cvsroot/archive-access/archive-access/projects/nutch/xdocs/iwaw
added to the repository
From: Michael S. <sta...@us...> - 2005-09-08 17:48:56

Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23127

Added Files:
    steps_indexing_katrina.txt
Log Message:
* steps_indexing_katrina.txt
    Added. Notes on how I did indexing of Katrina.

--- NEW FILE: steps_indexing_katrina.txt ---
$Id: steps_indexing_katrina.txt,v 1.1 2005/09/08 17:48:48 stack-sf Exp $

Two crawls of Hurricane Katrina: 00 and 01. Will start by indexing part
of 00.

Here are all of the backup hosts w/ katrina crawl 00 ARCs on them:

    $ ~webcrawl/crawl-arc-cfg/db-arc-info \
        -like HURRICANE-KATRINA-2005-00%arc.gz | \
        awk '{print $2$4}' | grep -e -bu | sort | uniq
    crawldata0034a-bu.archive.org/1
    crawldata0035a-bu.archive.org/3
    crawldata0036a-bu.archive.org/0
    crawldata0037a-bu.archive.org/0

Now to mount these hosts. Here's a little script to do it:

    #!/bin/sh
    # Pass name of file that lists hosts and name of collection to use as
    # dir under /mnt.
    if [ $# != 2 ]
    then
        echo "Usage: $0 HOSTS_FILE DIR_UNDER_MNT"
        exit 1
    fi
    for i in `cat $1`
    do
        mntpoint="/mnt/$2/$i"
        mkdir -p $mntpoint
        dev=`echo $i|sed -n -e 's/\//:\//p'`
        mount -t nfs -o ro,rsize=8192,wsize=8192,intr,nfsvers=2 $dev $mntpoint
    done

Counting ARCs:

    $ ~webcrawl/crawl-arc-cfg/db-arc-info \
        -like HURRICANE-KATRINA-2005-00%arc.gz | \
        awk '{print $2 " " $6}' | grep -e -bu | uniq | wc -l

There are 1010 in crawl 00 (uniq'ing, there are 1008).

Here is how I got a list of all files sorted:

    $ ~webcrawl/crawl-arc-cfg/db-arc-info \
        -like HURRICANE-KATRINA-2005-00%arc.gz | \
        awk '{print $2 " " $6}' | grep -e -bu | \
        awk '{print $2}' | sort | uniq > 00arcs.txt

I'll do first 100 for now (One segment).

    $ head -100 00arcs.txt > 00arcs.0-99.txt

I then made a directory to hold symlinks to the first 100:

    $ mkdir 00arcs.0-99
    $ for i in `cat ../00arcs.0-99.txt`; do find /mnt/katrina/ -type f \
        -name $i -exec ln -s {} \; ; done

Don't forget to edit the parse-ext plugin.xml so it points to the pdf
parser wrapper script.

I ran the indexing like this:

    $ nohup ./bin/indexarcs.sh -c katrina -s ~/katrina/00arcs.0-99/ \
        -d /2/katrina/nutch-data &> /2/katrina/indexing`date +%FT%H:%M`.log \
        < /dev/null &
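For concreteness, if the host list above were saved to a file and the inline script to a file of its own (both names invented here), the mount step would be:

    $ ./mount-bu-hosts.sh katrina-00-hosts.txt katrina

This mounts each crawldata host read-only under /mnt/katrina/, which matches the /mnt/katrina/ tree the later symlink-building find command walks.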
From: Michael S. <sta...@us...> - 2005-09-07 23:05:41

Update of /cvsroot/archive-access/archive-access/projects/nutch/src/web
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv15852/src/web

Modified Files:
    search.jsp
Log Message:
* src/web/search.jsp
    Fix for 'On Search results, hit range is not updating properly as you
    move to next page'. We weren't passing the 'start' to the rss servlet.

Index: search.jsp
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/web/search.jsp,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** search.jsp  7 Sep 2005 15:51:59 -0000  1.19
--- search.jsp  7 Sep 2005 23:05:29 -0000  1.20
***************
*** 92,96 ****
  String rss = request.getContextPath() + "/opensearch?query=" +
!   htmlQueryString + "&hitsPerDup=" + hitsPerDup+params;

  %><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
--- 92,98 ----
  String rss = request.getContextPath() + "/opensearch?query=" +
!   htmlQueryString + "&hitsPerDup=" + hitsPerDup +
!   ((start != 0)? "&start=" + start: "") + params;
!
  %><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
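With the fix, the RSS link for a later results page carries the offset through to the opensearch servlet; for example (context path, query, and values invented):

    /nutchwax/opensearch?query=katrina&hitsPerDup=2&start=10

Before the fix the start parameter was dropped from this URL, so the servlet always reported the first page's hit range no matter which page was being viewed.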
From: Michael S. <sta...@us...> - 2005-09-07 15:52:08

Update of /cvsroot/archive-access/archive-access/projects/nutch/src/web
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv6397/src/web

Modified Files:
    search.jsp
Log Message:
Fix for '[ 1281697 ] searching czech words not working'. Patch from Lukas.
* src/web/search.jsp
    Convert parameter string to utf-8.

Index: search.jsp
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/src/web/search.jsp,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** search.jsp  17 Aug 2005 21:47:24 -0000  1.18
--- search.jsp  7 Sep 2005 15:51:59 -0000  1.19
***************
*** 35,38 ****
--- 35,42 ----
    queryString = "";
  }
+ // Why do we have to do this? We've set the character encoding for the
+ // request above with request.setCharacterEncoding? But Lukas and Oskar
+ // say this works.
+ queryString = new String(queryString.getBytes("ISO-8859-1"), "UTF-8");
  String htmlQueryString = Entities.encode(queryString);
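The comment asks why the re-decode is needed. A likely answer: servlet containers of that era commonly decoded GET query strings as ISO-8859-1 regardless of setCharacterEncoding (in Tomcat this was governed by the connector's URIEncoding setting), so each UTF-8 byte came back as one wrong Latin-1 character. Because ISO-8859-1 maps characters 0-255 one-to-one back to bytes, the round-trip recovers the original text. A standalone sketch of the trick (class name and sample word are mine, not from the patch):

    import java.io.UnsupportedEncodingException;

    public class RecodeDemo {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            // Bytes the browser actually sent for a Czech word, UTF-8 encoded.
            byte[] wire = "žluťoučký".getBytes("UTF-8");
            // What getParameter() hands back when the container decoded those
            // bytes as ISO-8859-1: one (wrong) char per byte.
            String mangled = new String(wire, "ISO-8859-1");
            // The patch's fix: re-encode as ISO-8859-1 to get the original
            // bytes back, then decode them as the UTF-8 they really are.
            String fixed = new String(mangled.getBytes("ISO-8859-1"), "UTF-8");
            System.out.println(fixed); // prints: žluťoučký
        }
    }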
From: <mat...@ce...> - 2005-09-07 11:19:33

_____________________________________________________________
> From: st...@ar...
> To: mat...@ce...
> CC: arc...@li...
> Date: 07.09.2005 00:33
> Subject: Re: [Archive-access-cvs] searching special characters
>
> Lukas Matejka wrote:
>
> > Searching of czech word doesn't work in WERA and in NutchWax too.
> > i put on...
> > https://sourceforge.net/tracker/index.php?func=detail&aid=1281697&group_id=118427&atid=681137
> >
> > I fixed this problem in previous version of WERA(NWA) by changing file
> > ParameterUtils.java (which i send to St.ack). Maybe it would help. (i hope:))
>
> Where's ParameterUtils Lukas? Is it in the ARC Retriever?
> St.Ack

I checked it and ParameterUtils was used by nwa (WERA), but the problem is
the same. I changed file search.jsp and it works
(harvester.nkp.cz:8080/nutchwax), just a conversion from ISO-8859-1 to
UTF-8:

String parameter = request.getParameter("query");
if (parameter == null)
    parameter = "";
String queryString = new String(parameter.getBytes("ISO-8859-1"), "UTF-8");

-lm
From: stack <st...@ar...> - 2005-09-06 22:16:49

Lukas Matejka wrote:

> Searching of czech word doesn't work in WERA and in NutchWax too.
> i put on...
> https://sourceforge.net/tracker/index.php?func=detail&aid=1281697&group_id=118427&atid=681137
>
> I fixed this problem in previous version of WERA(NWA) by changing file
> ParameterUtils.java (which i send to St.ack). Maybe it would help. (i hope:))

Where's ParameterUtils Lukas? Is it in the ARC Retriever?

St.Ack

> -lm
From: Michael S. <sta...@us...> - 2005-09-05 20:04:39

Update of /cvsroot/archive-access/archive-access/projects/nutch/bin
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv13610/bin

Added Files:
    nutch
Log Message:
* bin/nutch
    Version that calls the nutchwax merge segments tool instead of native
    nutch's.

--- NEW FILE: nutch ---
#!/bin/sh
#
# The Nutch command script
#
# Environment Variables
#
#   NUTCH_JAVA_HOME The java implementation to use.  Overrides JAVA_HOME.
#
#   NUTCH_HEAPSIZE  The maximum amount of heap to use, in MB.
#                   Default is 1000.
#
#   NUTCH_OPTS      Extra Java runtime options.
#

# resolve links - $0 may be a softlink
THIS="$0"
while [ -h "$THIS" ]; do
  ls=`ls -ld "$THIS"`
  link=`expr "$ls" : '.*-> \(.*\)$'`
  if expr "$link" : '.*/.*' > /dev/null; then
    THIS="$link"
  else
    THIS=`dirname "$THIS"`/"$link"
  fi
done

# if no args specified, show usage
if [ $# = 0 ]; then
  echo "Usage: nutch COMMAND"
  echo "where COMMAND is one of:"
  echo "  crawl        one-step crawler for intranets"
  echo "  admin        database administration, including creation"
  echo "  inject       inject new urls into the database"
  echo "  generate     generate new segments to fetch"
  echo "  fetchlist    print the fetchlist of a segment"
  echo "  fetch        fetch a segment's pages"
  echo "  parse        parse a segment's pages"
  echo "  index        run the indexer on a segment's fetcher output"
  echo "  merge        merge several segment indexes"
  echo "  dedup        remove duplicates from a set of segment indexes"
  echo "  updatedb     update db from segments after fetching"
  echo "  updatesegs   update segments with link data from the db"
  echo "  mergesegs    merge multiple segments into a single segment"
  echo "  readdb       examine arbitrary fields of the database"
  echo "  analyze      adjust database link-analysis scoring"
  echo "  prune        prune segment index(es) of unwanted content"
  echo "  segread      read, fix and dump segment data"
  echo "  segslice     append, join and slice segment data"
  echo "  server       run a search server"
  echo "  namenode     run the NDFS namenode"
  echo "  datanode     run an NDFS datanode"
  echo "  ndfs         run an NDFS admin client"
  echo "  jobtracker   run the MapReduce job Tracker node"
  echo "  tasktracker  run a MapReduce task Tracker node"
  echo " or"
  echo "  CLASSNAME    run the class named CLASSNAME"
  echo "Most commands print help when invoked w/o parameters."
  exit 1
fi

# get arguments
COMMAND=$1
shift

# some directories
THIS_DIR=`dirname "$THIS"`
NUTCH_HOME=`cd "$THIS_DIR/.." ; pwd`

# some Java parameters
if [ "$NUTCH_JAVA_HOME" != "" ]; then
  echo "run java in $NUTCH_JAVA_HOME"
  JAVA_HOME=$NUTCH_JAVA_HOME
fi

if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi

JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m

# check envvars which might override default args
if [ "$NUTCH_HEAPSIZE" != "" ]; then
  echo "run with heapsize $NUTCH_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
  echo $JAVA_HEAP_MAX
fi

# CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}

# for developers, add Nutch classes to CLASSPATH
if [ -d "$NUTCH_HOME/build/classes" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
fi
if [ -d "$NUTCH_HOME/build/plugins" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
fi
if [ -d "$NUTCH_HOME/build/test/classes" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
fi

# so that filenames w/ spaces are handled correctly in loops below
IFS=

# for releases, add Nutch jar to CLASSPATH
for f in $NUTCH_HOME/nutch-*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

# add plugins to classpath
if [ -d "$NUTCH_HOME/plugins" ]; then
  CLASSPATH=${CLASSPATH}:$NUTCH_HOME
fi

# add libs to CLASSPATH
for f in $NUTCH_HOME/lib/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done
for f in $NUTCH_HOME/lib/jettyext/*.jar; do
  CLASSPATH=${CLASSPATH}:$f;
done

# restore ordinary behaviour
unset IFS

# figure out which class to run
if [ "$COMMAND" = "crawl" ] ; then
  CLASS=org.apache.nutch.tools.CrawlTool
elif [ "$COMMAND" = "admin" ] ; then
  CLASS=org.apache.nutch.tools.WebDBAdminTool
elif [ "$COMMAND" = "inject" ] ; then
  CLASS=org.apache.nutch.db.WebDBInjector
elif [ "$COMMAND" = "generate" ] ; then
  CLASS=org.apache.nutch.tools.FetchListTool
elif [ "$COMMAND" = "fetchlist" ] ; then
  CLASS=org.apache.nutch.pagedb.FetchListEntry
elif [ "$COMMAND" = "fetch" ] ; then
  CLASS=org.apache.nutch.fetcher.Fetcher
elif [ "$COMMAND" = "parse" ] ; then
  CLASS=org.apache.nutch.tools.ParseSegment
elif [ "$COMMAND" = "index" ] ; then
  CLASS=org.apache.nutch.indexer.IndexSegment
elif [ "$COMMAND" = "merge" ] ; then
  CLASS=org.apache.nutch.indexer.IndexMerger
elif [ "$COMMAND" = "dedup" ] ; then
  CLASS=org.apache.nutch.indexer.DeleteDuplicates
elif [ "$COMMAND" = "updatedb" ] ; then
  CLASS=org.apache.nutch.tools.UpdateDatabaseTool
elif [ "$COMMAND" = "updatesegs" ] ; then
  CLASS=org.apache.nutch.tools.UpdateSegmentsFromDb
elif [ "$COMMAND" = "mergesegs" ] ; then
  # Copy over the nutchwax version of segment merge.
  # It will work w/ segments made by nutchwax. Also
  # does not do a merge.
  CLASS=org.archive.access.nutch.NutchwaxSegmentMergeTool
elif [ "$COMMAND" = "readdb" ] ; then
  CLASS=org.apache.nutch.db.WebDBReader
elif [ "$COMMAND" = "prune" ] ; then
  CLASS=org.apache.nutch.tools.PruneIndexTool
elif [ "$COMMAND" = "segread" ] ; then
  CLASS=org.apache.nutch.segment.SegmentReader
elif [ "$COMMAND" = "segslice" ] ; then
  CLASS=org.apache.nutch.segment.SegmentSlicer
elif [ "$COMMAND" = "analyze" ] ; then
  CLASS=org.apache.nutch.tools.LinkAnalysisTool
elif [ "$COMMAND" = "server" ] ; then
  CLASS='org.apache.nutch.searcher.DistributedSearch$Server'
elif [ "$COMMAND" = "namenode" ] ; then
  CLASS='org.apache.nutch.ndfs.NDFS$NameNode'
elif [ "$COMMAND" = "datanode" ] ; then
  CLASS='org.apache.nutch.ndfs.NDFS$DataNode'
elif [ "$COMMAND" = "ndfs" ] ; then
  CLASS=org.apache.nutch.fs.TestClient
elif [ "$COMMAND" = "jobtracker" ] ; then
  CLASS=org.apache.nutch.mapReduce.JobTracker
elif [ "$COMMAND" = "tasktracker" ] ; then
  CLASS=org.apache.nutch.mapReduce.TaskTracker
else
  CLASS=$COMMAND
fi

# cygwin path translation
if expr `uname` : 'CYGWIN*' > /dev/null; then
  CLASSPATH=`cygpath -p -w "$CLASSPATH"`
fi

# run it
exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS -classpath "$CLASSPATH" $CLASS "$@"
From: Lukas M. <mat...@ce...> - 2005-09-04 20:48:55

Searching of czech word doesn't work in WERA and in NutchWax too.
i put on...
https://sourceforge.net/tracker/index.php?func=detail&aid=1281697&group_id=118427&atid=681137

I fixed this problem in previous version of WERA(NWA) by changing file
ParameterUtils.java (which i send to St.ack). Maybe it would help. (i hope:))

-lm
From: Michael S. <sta...@us...> - 2005-09-02 01:08:34

Update of /cvsroot/archive-access/archive-access/projects/nutch/bin
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25769/bin

Modified Files:
    arcs2segs.sh indexarcs.sh
Log Message:
Make mergesegs work with our segments by providing our own version of
SegmentMergeTool and our own version of nutch script that invokes our tool
instead of standard nutch's.
* .classpath
    Changed the nutch jar to refer to 0.7 release.
* maven.xml
    Copy over the nutch bins first then ours. Overwrite. This way our
    version of nutch script sits on top of theirs.
* project.properties
* project.xml
    Reference lucene.
* bin/arcs2segs.sh
* bin/indexarcs.sh
    Add in setting of logging level.

Index: arcs2segs.sh
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/bin/arcs2segs.sh,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** arcs2segs.sh  9 Aug 2005 01:00:25 -0000  1.4
--- arcs2segs.sh  2 Sep 2005 01:08:18 -0000  1.5
***************
*** 2,17 ****

  # Check that we got right arguments.
! usage="$0 DIR_OF_ARCS DIR_FOR_SEGMENTS COLLECTION_NAME [#ARCS]"
! if [ $# -lt 3 ]
  then
      echo $usage
      exit 1
  fi
! if [ $# -gt 4 ]
  then
      echo $usage
      exit 1
  fi
! queue=$1
  if [ ! -d $queue ]
  then
--- 2,18 ----

  # Check that we got right arguments.
! usage="$0 LOG_LEVEL DIR_OF_ARCS DIR_FOR_SEGMENTS COLLECTION_NAME [#ARCS]"
! if [ $# -lt 4 ]
  then
      echo $usage
      exit 1
  fi
! if [ $# -gt 5 ]
  then
      echo $usage
      exit 1
  fi
! level=$1
! queue=$2
  if [ ! -d $queue ]
  then
***************
*** 20,29 ****
      exit 1
  fi
! segments=$2
! collection_name=$3
  arc_count=100
! if [ ! -z "$4" ]
  then
!     arc_count="$4"
  fi
  if [ ! -d $segments ]
--- 21,30 ----
      exit 1
  fi
! segments=$3
! collection_name=$4
  arc_count=100
! if [ ! -z "$5" ]
  then
!     arc_count="$5"
  fi
  if [ ! -d $segments ]
***************
*** 42,46 ****
  fi
  seg=$segments/${hostname_prefix}`/bin/date +%F-%H%M%S`
! $arc2seg $seg $collection_name $arcs
  mkdir -p $seg/arcs
  mv $arcs $seg/arcs
--- 43,47 ----
  fi
  seg=$segments/${hostname_prefix}`/bin/date +%F-%H%M%S`
! $arc2seg -logLevel ${level} $seg $collection_name $arcs
  mkdir -p $seg/arcs
  mv $arcs $seg/arcs

Index: indexarcs.sh
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/bin/indexarcs.sh,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** indexarcs.sh  9 Aug 2005 01:00:25 -0000  1.9
--- indexarcs.sh  2 Sep 2005 01:08:18 -0000  1.10
***************
*** 56,59 ****
--- 56,60 ----
      echo " (Does not turn-off cmdline checking). Optional."
      echo " -a How many arcs to do per segment. Default is 100."
+     echo " -l Java logging level. Default: info. Options: info, warning, etc."
      echo "This runs through all steps nutch indexing ARCs so their content is"
      echo "searchable by nutch. This script is for use against small collections"
***************
*** 143,147 ****
          return
      fi
!     ${BASEDIR}/bin/arcs2segs.sh ${DATADIR}/queue/ \
          ${DATADIR}/segments ${COLLECTION_NAME} ${arcs_per_segment}
  }
--- 144,148 ----
          return
      fi
!     ${BASEDIR}/bin/arcs2segs.sh ${level} ${DATADIR}/queue/ \
          ${DATADIR}/segments ${COLLECTION_NAME} ${arcs_per_segment}
  }
***************
*** 198,204 ****
  noop=
  expert=
  arcname_filter="*.arc.gz"
  arcs_per_segment=100
! while getopts "hnte:m:s:d:c:f:a:" opt
  do
      if [ "$opt" = "?" ]
--- 199,206 ----
  noop=
  expert=
+ level="info"
  arcname_filter="*.arc.gz"
  arcs_per_segment=100
! while getopts "hnte:m:s:d:c:f:a:l:" opt
  do
      if [ "$opt" = "?" ]
***************
*** 216,224 ****
          's')
              ARCSDIR=${OPTARG}
!             if [ ! -e ${arcsdir} ]
              then
                  echo "ERROR: ${arcsdir} does not exist."
                  usage
              fi
              ;;
          'd')
--- 218,230 ----
          's')
              ARCSDIR=${OPTARG}
!             if [ ! -e ${ARCSDIR} ]
              then
                  echo "ERROR: ${arcsdir} does not exist."
                  usage
              fi
+             if [ `dirname ${ARCSDIR}` = '.' ]
+             then
+                 ARCSDIR=`pwd`/`basename ${ARCSDIR}`
+             fi
              ;;
          'd')
***************
*** 229,232 ****
--- 235,242 ----
                  usage
              fi
+             if [ `dirname ${DATADIR}` = '.' ]
+             then
+                 DATADIR=`pwd`/`basename ${DATADIR}`
+             fi
              ;;
          'c')
***************
*** 249,252 ****
--- 259,266 ----
              arcs_per_segment=${OPTARG}
              ;;
+         'l')
+             # Java logging level.
+             level=${OPTARG}
+             ;;
          *)
              usage
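With the new -l flag wired through, a quieter indexing run might look like this (reusing the paths from the earlier report; a sketch, not from the commit):

    ./bin/indexarcs.sh -l warning -s /home/nwa/nutchwax/archive \
        -d /home/nwa/nutchwax/data -c test

The level string is handed to arcs2segs.sh as its first argument and then on to the arc2seg tool's -logLevel option, so it governs the Java-side logging, not the shell scripts' own echo output.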
From: Michael S. <sta...@us...> - 2005-09-02 01:08:34

Update of /cvsroot/archive-access/archive-access/projects/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25769

Modified Files:
    .classpath maven.xml project.properties project.xml
Log Message:
Make mergesegs work with our segments by providing our own version of
SegmentMergeTool and our own version of nutch script that invokes our tool
instead of standard nutch's.
* .classpath
    Changed the nutch jar to refer to 0.7 release.
* maven.xml
    Copy over the nutch bins first then ours. Overwrite. This way our
    version of nutch script sits on top of theirs.
* project.properties
* project.xml
    Reference lucene.
* bin/arcs2segs.sh
* bin/indexarcs.sh
    Add in setting of logging level.

Index: maven.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/maven.xml,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** maven.xml  29 Jul 2005 22:12:23 -0000  1.8
--- maven.xml  2 Sep 2005 01:08:18 -0000  1.9
***************
*** 91,100 ****
      file="${basedir}/conf/nutch-site.xml.all" filtering="true" />
! <!--Fill the bin dir.-->
! <copy todir="${maven.dist.bin.assembly.dir}/bin" filtering="true">
!     <fileset dir="${basedir}/bin">
          <include name="*"/>
      </fileset>
!     <fileset dir="${nutch.dir}/bin">
          <include name="*"/>
      </fileset>
--- 91,103 ----
      file="${basedir}/conf/nutch-site.xml.all" filtering="true" />
! <!--Fill the bin dir. Fill from nutch first. Then from nutchwax
! because we want to overwrite the nutch script with the nutchwax
! version.-->
! <copy todir="${maven.dist.bin.assembly.dir}/bin"
!     filtering="true" overwrite="true" >
!     <fileset dir="${nutch.dir}/bin">
          <include name="*"/>
      </fileset>
!     <fileset dir="${basedir}/bin">
          <include name="*"/>
      </fileset>

Index: project.properties
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/project.properties,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** project.properties  19 Aug 2005 21:38:36 -0000  1.12
--- project.properties  2 Sep 2005 01:08:18 -0000  1.13
***************
*** 18,22 ****
  # Local jars to add to classpath.
  maven.jar.override = on
! maven.jar.corenutch = ${basedir}/nutch/build/nutch-0.7-dev.jar
  maven.jar.arc = ${basedir}/lib/arc-1.5.1-200508191341.jar
  maven.jar.servlet-api = ${basedir}/nutch/lib/servlet-api.jar
--- 18,23 ----
  # Local jars to add to classpath.
  maven.jar.override = on
! maven.jar.corenutch = ${basedir}/nutch/build/nutch-0.7.jar
! maven.jar.lucene = ${basedir}/nutch/lib/lucene-1.9-rc1-dev.jar
  maven.jar.arc = ${basedir}/lib/arc-1.5.1-200508191341.jar
  maven.jar.servlet-api = ${basedir}/nutch/lib/servlet-api.jar

Index: project.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/project.xml,v
retrieving revision 1.18
retrieving revision 1.19
diff -C2 -d -r1.18 -r1.19
*** project.xml  2 Aug 2005 22:11:08 -0000  1.18
--- project.xml  2 Sep 2005 01:08:18 -0000  1.19
***************
*** 174,177 ****
--- 174,189 ----
      </dependency>
      <dependency>
+       <id>lucene</id>
+       <version>1_9-rc1-dev</version>
+       <url>http://nutch.org/</url>
+       <properties>
+         <war.bundle>true</war.bundle>
+         <description>Search library from nutch.
+         </description>
+         <license>Apache 2.0
+         http://www.apache.org/licenses/LICENSE-2.0</license>
+       </properties>
+     </dependency>
+     <dependency>
        <id>servlet-api</id>
        <version>2.3</version>

Index: .classpath
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/.classpath,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** .classpath  19 Aug 2005 21:38:35 -0000  1.10
--- .classpath  2 Sep 2005 01:08:18 -0000  1.11
***************
*** 7,11 ****
  <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
  <classpathentry kind="lib" path="nutch/lib/lucene-1.9-rc1-dev.jar"/>
! <classpathentry kind="lib" path="nutch/build/nutch-0.7-dev.jar"/>
  <classpathentry kind="lib" path="lib/arc-1.5.1-200508191341.jar"/>
  <classpathentry kind="lib" path="lib/commons-httpclient-3.0-alpha2.jar"/>
--- 7,11 ----
  <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
  <classpathentry kind="lib" path="nutch/lib/lucene-1.9-rc1-dev.jar"/>
! <classpathentry kind="lib" path="nutch/build/nutch-0.7.jar"/>
  <classpathentry kind="lib" path="lib/arc-1.5.1-200508191341.jar"/>
  <classpathentry kind="lib" path="lib/commons-httpclient-3.0-alpha2.jar"/>
***************
*** 14,17 ****
--- 14,19 ----
  <classpathentry kind="lib" path="nutch/lib/servlet-api.jar"/>
  <classpathentry sourcepath="ECLIPSE_HOME/plugins/org.eclipse.jdt.source_3.1.0/src/org.junit_3.8.1/junitsrc.zip" kind="var" path="JUNIT_HOME/junit.jar"/>
+ <classpathentry kind="lib" path="nutch/conf"/>
+ <classpathentry kind="lib" path="nutch/lib/jakarta-oro-2.0.7.jar"/>
  <classpathentry kind="output" path="target"/>
  </classpath>
From: Michael S. <sta...@us...> - 2005-09-02 01:08:34

Update of /cvsroot/archive-access/archive-access/projects/nutch/src/java/org/archive/access/nutch
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25769/src/java/org/archive/access/nutch

Added Files:
    NutchwaxSegmentMergeTool.java
Log Message:
Make mergesegs work with our segments by providing our own version of
SegmentMergeTool and our own version of nutch script that invokes our tool
instead of standard nutch's.
* .classpath
    Changed the nutch jar to refer to 0.7 release.
* maven.xml
    Copy over the nutch bins first then ours. Overwrite. This way our
    version of nutch script sits on top of theirs.
* project.properties
* project.xml
    Reference lucene.
* bin/arcs2segs.sh
* bin/indexarcs.sh
    Add in setting of logging level.

--- NEW FILE: NutchwaxSegmentMergeTool.java ---
/**
 * Copyright 2005 The Apache Software Foundation
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.archive.access.nutch;

import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Vector;
import java.util.logging.Logger;

import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.indexer.IndexSegment;
import org.apache.nutch.io.MD5Hash;
import org.apache.nutch.fs.*;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.segment.SegmentReader;
import org.apache.nutch.segment.SegmentWriter;
import org.apache.nutch.util.LogFormatter;
import org.apache.nutch.util.NutchConf;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.DateField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

/**
 * This class cleans up accumulated segments data, and merges them into a
 * single (or optionally multiple) segment(s), with no duplicates in it.
 *
 * <p>
 * COPIED FROM NUTCH SO I CAN PUT IN PLACE ALTERNATE TOOLS.
 * St.Ack
 * <p>
 * There are no prerequisites for its correct operation except for a set of
 * already fetched segments (they don't have to contain parsed content, only
 * fetcher output is required). This tool does not use DeleteDuplicates, but
 * creates its own "master" index of all pages in all segments. Then it walks
 * sequentially through this index and picks up only most recent versions of
 * pages for every unique value of url or hash.
 * </p>
 * <p>If some of the input segments are corrupted, this tool will attempt to
 * repair them, using
 * {@link org.apache.nutch.segment.SegmentReader#fixSegment(NutchFileSystem, File, boolean, boolean, boolean, boolean)} method.</p>
 * <p>Output segment can be optionally split on the fly into several segments
 * of fixed length.</p>
 * <p>
 * The newly created segment(s) can be then optionally indexed, so that it can
 * be either merged with more new segments, or used for searching as it is.
 * </p>
 * <p>
 * Old segments may be optionally removed, because all needed data has already
 * been copied to the new merged segment. NOTE: this tool will remove also all
 * corrupted input segments, which are not useable anyway - however, this
 * option may be dangerous if you inadvertently included non-segment
 * directories as input...</p>
 * <p>
 * You may want to run SegmentMergeTool instead of following the manual
 * procedures, with all options turned on, i.e. to merge segments into the
 * output segment(s), index it, and then delete the original segments data.
 * </p>
 *
 * @author Andrzej Bialecki <ab...@ge...>
 */
public class NutchwaxSegmentMergeTool implements Runnable {
  public static final Logger LOG =
    LogFormatter.getLogger("org.apache.nutch.tools.NutchwaxSegmentMergeTool");

  /** Log progress update every LOG_STEP items. */
  public static int LOG_STEP = 20000;
  /** Temporary de-dup index size. Larger indexes tend to slow down indexing.
   * Too many indexes slow down the subsequent index merging. It's a tradeoff
   * value... */
  public static int INDEX_SIZE = 250000;
  public static int INDEX_MERGE_FACTOR = 30;
  public static int INDEX_MIN_MERGE_DOCS = 100;

  private boolean boostByLinkCount =
    NutchConf.get().getBoolean("indexer.boost.by.link.count", false);
  private float scorePower =
    NutchConf.get().getFloat("indexer.score.power", 0.5f);
  private NutchFileSystem nfs = null;
  private File[] segments = null;
  private int stage = SegmentMergeStatus.STAGE_OPENING;
  private long totalRecords = 0L;
  private long processedRecords = 0L;
  private long start = 0L;
  private long maxCount = Long.MAX_VALUE;
  private File output = null;
  private List segdirs = null;
  private List allsegdirs = null;
  private boolean runIndexer = false;
  private boolean delSegs = false;
  private HashMap readers = new HashMap();

  /**
   * Create a NutchwaxSegmentMergeTool.
   * @param nfs filesystem
   * @param segments list of input segments
   * @param output output directory, where output segments will be created
   * @param maxCount maximum number of records per output segment. If this
   * value is 0, then the default value {@link Long#MAX_VALUE} is used.
   * @param runIndexer run indexer on output segment(s)
   * @param delSegs delete input segments when finished
   * @throws Exception
   */
  public NutchwaxSegmentMergeTool(NutchFileSystem nfs, File[] segments,
      File output, long maxCount, boolean runIndexer, boolean delSegs)
      throws Exception {
    this.nfs = nfs;
    this.segments = segments;
    this.runIndexer = runIndexer;
    this.delSegs = delSegs;
    if (maxCount > 0) this.maxCount = maxCount;
    allsegdirs = Arrays.asList(segments);
    this.output = output;
    if (nfs.exists(output)) {
      if (!nfs.isDirectory(output))
        throw new Exception("Output is not a directory: " + output);
    } else nfs.mkdirs(output);
  }

  public static class SegmentMergeStatus {
    public static final int STAGE_OPENING = 0;
    public static final int STAGE_MASTERIDX = 1;
    public static final int STAGE_MERGEIDX = 2;
    public static final int STAGE_DEDUP = 3;
    public static final int STAGE_WRITING = 4;
    public static final int STAGE_INDEXING = 5;
    public static final int STAGE_DELETING = 6;
    public static final String[] stages = {
      "opening input segments",
      "creating master index",
      "merging sub-indexes",
      "deduplicating",
      "writing output segment(s)",
      "indexing output segment(s)",
      "deleting input segments"
    };
    public int stage;
    public File[] inputSegments;
    public long startTime, curTime;
    public long totalRecords;
    public long processedRecords;

    public SegmentMergeStatus() {};

    public SegmentMergeStatus(int stage, File[] inputSegments, long startTime,
        long totalRecords, long processedRecords) {
      this.stage = stage;
      this.inputSegments = inputSegments;
      this.startTime = startTime;
      this.curTime = System.currentTimeMillis();
      this.totalRecords = totalRecords;
      this.processedRecords = processedRecords;
    }
  }

  public SegmentMergeStatus getStatus() {
    SegmentMergeStatus status = new SegmentMergeStatus(stage, segments, start,
      totalRecords, processedRecords);
    return status;
  }

  /** Run the tool, periodically reporting progress. */
  public void run() {
    start = System.currentTimeMillis();
    stage = SegmentMergeStatus.STAGE_OPENING;
    long delta;
    LOG.info("* Opening " + allsegdirs.size() + " segments:");
    try {
      segdirs = new ArrayList();
      // open all segments
      for (int i = 0; i < allsegdirs.size(); i++) {
        File dir = (File) allsegdirs.get(i);
        SegmentReader sr = null;
        try {
          // try to autofix it if corrupted...
          sr = new SegmentReader(nfs, dir, false, true, true, false);
        } catch (Exception e) {
          // this segment is hosed beyond repair, don't use it
          continue;
        }
        segdirs.add(dir);
        totalRecords += sr.size;
        LOG.info(" - segment " + dir.getName() + ": " + sr.size +
          " records.");
        readers.put(dir.getName(), sr);
      }
      long total = totalRecords;
      LOG.info("* TOTAL " + total + " input records in " + segdirs.size() +
        " segments.");
      LOG.info("* Creating master index...");
      stage = SegmentMergeStatus.STAGE_MASTERIDX;
      // XXX Note that Lucene indexes don't work with NutchFileSystem for now.
      // XXX For now always assume LocalFileSystem here...
      Vector masters = new Vector();
      File fsmtIndexDir = new File(output, ".fastmerge_index");
      File masterDir = new File(fsmtIndexDir, "0");
      if (!masterDir.mkdirs()) {
        LOG.severe("Could not create a master index dir: " + masterDir);
        return;
      }
      masters.add(masterDir);
      IndexWriter iw =
        new IndexWriter(masterDir, new WhitespaceAnalyzer(), true);
      iw.setUseCompoundFile(false);
      iw.mergeFactor = INDEX_MERGE_FACTOR;
      iw.minMergeDocs = INDEX_MIN_MERGE_DOCS;
      long s1 = System.currentTimeMillis();
      Iterator it = readers.values().iterator();
      processedRecords = 0L;
      delta = System.currentTimeMillis();
      while (it.hasNext()) {
        SegmentReader sr = (SegmentReader) it.next();
        String name = sr.segmentDir.getName();
        FetcherOutput fo = new FetcherOutput();
        for (long i = 0; i < sr.size; i++) {
          try {
            if (!sr.get(i, fo, null, null, null)) break;

            Document doc = new Document();

            // compute boost
            float boost = IndexSegment.calculateBoost(
              fo.getFetchListEntry().getPage().getScore(),
              scorePower, boostByLinkCount, fo.getAnchors().length);
            doc.add(new Field("sd", name + "|" + i, true, false, false));
            doc.add(new Field("uh",
              MD5Hash.digest(fo.getUrl().toString()).toString(),
              true, true, false));
            doc.add(new Field("ch", fo.getMD5Hash().toString(),
              true, true, false));
            doc.add(new Field("time",
              DateField.timeToString(fo.getFetchDate()),
              true, false, false));
            doc.add(new Field("score", boost + "", true, false, false));
            doc.add(new Field("ul", fo.getUrl().toString().length() + "",
              true, false, false));
            iw.addDocument(doc);
            processedRecords++;
            if (processedRecords > 0 && (processedRecords % LOG_STEP == 0)) {
              LOG.info(" Processed " + processedRecords + " records (" +
                (float)(LOG_STEP * 1000)/(float)(System.currentTimeMillis() - delta) +
                " rec/s)");
              delta = System.currentTimeMillis();
            }
            if (processedRecords > 0 && (processedRecords % INDEX_SIZE == 0)) {
              iw.optimize();
              iw.close();
              LOG.info(" - creating next subindex...");
              masterDir = new File(fsmtIndexDir, "" + masters.size());
              if (!masterDir.mkdirs()) {
                LOG.severe("Could not create a master index dir: " +
                  masterDir);
                return;
              }
              masters.add(masterDir);
              iw = new IndexWriter(masterDir, new WhitespaceAnalyzer(), true);
              iw.setUseCompoundFile(false);
              iw.mergeFactor = INDEX_MERGE_FACTOR;
              iw.minMergeDocs = INDEX_MIN_MERGE_DOCS;
            }
          } catch (Throwable t) {
            // we can assume the data is invalid from now on - break here
            LOG.info(" - segment " + name + " truncated to " + (i + 1) +
              " records");
            break;
          }
        }
      }
      iw.optimize();
      LOG.info("* Creating index took " + (System.currentTimeMillis() - s1) +
        " ms");
      s1 = System.currentTimeMillis();
      // merge all other indexes using the latest IndexWriter (still open):
      if (masters.size() > 1) {
        LOG.info(" - merging subindexes...");
        stage = SegmentMergeStatus.STAGE_MERGEIDX;
        IndexReader[] ireaders = new IndexReader[masters.size() - 1];
        for (int i = 0; i < masters.size() - 1; i++)
          ireaders[i] = IndexReader.open((File)masters.get(i));
        iw.addIndexes(ireaders);
        for (int i = 0; i < masters.size() - 1; i++) {
          ireaders[i].close();
          FileUtil.fullyDelete((File)masters.get(i));
        }
      }
      iw.close();
      LOG.info("* Optimizing index took " + (System.currentTimeMillis() - s1) +
        " ms");
      LOG.info("* Skipping deduplicate step...");
      // LOG.info("* Removing duplicate entries...");
      // stage = SegmentMergeStatus.STAGE_DEDUP;
      IndexReader ir = IndexReader.open(masterDir);
      // int i = 0;
      // long cnt = 0L;
      // processedRecords = 0L;
      // s1 = System.currentTimeMillis();
      // delta = s1;
      // TermEnum te = ir.terms();
      // while(te.next()) {
      //   Term t = te.term();
      //   if (t == null) continue;
      //   if
(!(t.field().equals("ch") || t.field().equals("uh"))) continue; // cnt++; // processedRecords = cnt / 2; // if (cnt > 0 && (cnt % (LOG_STEP * 2) == 0)) { // LOG.info(" Processed " + processedRecords + " records (" + // (float)(LOG_STEP * 1000)/(float)(System.currentTimeMillis() - delta) + " rec/s)"); // delta = System.currentTimeMillis(); // } // // Enumerate all docs with the same URL hash or content hash // TermDocs td = ir.termDocs(t); // if (td == null) continue; // if (t.field().equals("uh")) { // // Keep only the latest version of the document with // // the same url hash. Note: even if the content // // hash is identical, other metadata may be different, so even // // in this case it makes sense to keep the latest version. // int id = -1; // String time = null; // Document doc = null; // while (td.next()) { // int docid = td.doc(); // if (!ir.isDeleted(docid)) { // doc = ir.document(docid); // if (time == null) { // time = doc.get("time"); // id = docid; // continue; // } // String dtime = doc.get("time"); // // "time" is a DateField, and can be compared lexicographically // if (dtime.compareTo(time) > 0) { // if (id != -1) { // ir.delete(id); // } // time = dtime; // id = docid; // } else { // ir.delete(docid); // } // } // } // } else if (t.field().equals("ch")) { // // Keep only the version of the document with // // the highest score, and then with the shortest url. // int id = -1; // int ul = 0; // float score = 0.0f; // Document doc = null; // while (td.next()) { // int docid = td.doc(); // if (!ir.isDeleted(docid)) { // doc = ir.document(docid); // if (ul == 0) { // try { // ul = Integer.parseInt(doc.get("ul")); // score = Float.parseFloat(doc.get("score")); // } catch (Exception e) {}; // id = docid; // continue; // } // int dul = 0; // float dscore = 0.0f; // try { // dul = Integer.parseInt(doc.get("ul")); // dscore = Float.parseFloat(doc.get("score")); // } catch (Exception e) {}; // int cmp = Float.compare(dscore, score); // if (cmp == 0) { // // equal scores, select the one with shortest url // if (dul < ul) { // if (id != -1) { // ir.delete(id); // } // ul = dul; // id = docid; // } else { // ir.delete(docid); // } // } else if (cmp < 0) { // ir.delete(docid); // } else { // if (id != -1) { // ir.delete(id); // } // ul = dul; // id = docid; // } // } // } // } // } // // // // keep the IndexReader open... 
// // // // LOG.info("* Deduplicating took " + (System.currentTimeMillis() - s1) + " ms"); stage = SegmentMergeStatus.STAGE_WRITING; processedRecords = 0L; Vector outDirs = new Vector(); File outDir = new File(output, SegmentWriter.getNewSegmentName()); outDirs.add(outDir); LOG.info("* Merging all segments into " + output.getName()); s1 = System.currentTimeMillis(); delta = s1; nfs.mkdirs(outDir); SegmentWriter sw = new SegmentWriter(nfs, outDir, false, true, false, true, true); LOG.fine(" - opening first output segment in " + outDir.getName()); FetcherOutput fo = new FetcherOutput(); Content co = new Content(); ParseText pt = new ParseText(); ParseData pd = new ParseData(); int outputCnt = 0; for (int n = 0; n < ir.maxDoc(); n++) { if (ir.isDeleted(n)) { //System.out.println("-del"); continue; } Document doc = ir.document(n); String segDoc = doc.get("sd"); int idx = segDoc.indexOf('|'); String segName = segDoc.substring(0, idx); String docName = segDoc.substring(idx + 1); SegmentReader sr = (SegmentReader) readers.get(segName); long docid; try { docid = Long.parseLong(docName); } catch (Exception e) { continue; } try { // get data from the reader sr.get(docid, fo, co, pt, pd); } catch (Throwable thr) { // don't break the loop, because only one of the segments // may be corrupted... LOG.fine(" - corrupt record no. " + docid + " in segment " + sr.segmentDir.getName() + " - skipping."); continue; } sw.append(fo, co, pt, pd); outputCnt++; processedRecords++; if (processedRecords > 0 && (processedRecords % LOG_STEP == 0)) { LOG.info(" Processed " + processedRecords + " records (" + (float)(LOG_STEP * 1000)/(float)(System.currentTimeMillis() - delta) + " rec/s)"); delta = System.currentTimeMillis(); } if (processedRecords % maxCount == 0) { sw.close(); outDir = new File(output, SegmentWriter.getNewSegmentName()); LOG.fine(" - starting next output segment in " + outDir.getName()); nfs.mkdirs(outDir); sw = new SegmentWriter(nfs, outDir, true); outDirs.add(outDir); } } LOG.info("* Merging took " + (System.currentTimeMillis() - s1) + " ms"); ir.close(); sw.close(); FileUtil.fullyDelete(fsmtIndexDir); for (Iterator iter = readers.keySet().iterator(); iter.hasNext();) { SegmentReader sr = (SegmentReader) readers.get(iter.next()); sr.close(); } if (runIndexer) { stage = SegmentMergeStatus.STAGE_INDEXING; totalRecords = outDirs.size(); processedRecords = 0L; LOG.info("* Creating new segment index(es)..."); File workingDir = new File(output, "indexsegment-workingdir"); for (int k = 0; k < outDirs.size(); k++) { processedRecords++; if (workingDir.exists()) { FileUtil.fullyDelete(workingDir); } IndexSegment indexer = new IndexSegment(nfs, Integer.MAX_VALUE, (File)outDirs.get(k), workingDir); indexer.indexPages(); FileUtil.fullyDelete(workingDir); } } if (delSegs) { // This deletes also all corrupt segments, which are // unusable anyway stage = SegmentMergeStatus.STAGE_DELETING; totalRecords = allsegdirs.size(); processedRecords = 0L; LOG.info("* Deleting old segments..."); for (int k = 0; k < allsegdirs.size(); k++) { processedRecords++; FileUtil.fullyDelete((File) allsegdirs.get(k)); } } delta = System.currentTimeMillis() - start; float eps = (float) total / (float) (delta / 1000); LOG.info("Finished NutchwaxSegmentMergeTool: INPUT: " + total + " -> OUTPUT: " + outputCnt + " entries in " + ((float) delta / 1000f) + " s (" + eps + " entries/sec)."); } catch (Exception e) { e.printStackTrace(); LOG.severe(e.getMessage()); } } public static void main(String[] args) throws Exception { if (args.length < 1) { 
System.err.println("Too few arguments.\n"); usage(); System.exit(-1); } NutchFileSystem nfs = NutchFileSystem.parseArgs(args, 0); boolean runIndexer = false; boolean delSegs = false; long maxCount = Long.MAX_VALUE; String segDir = null; File output = null; Vector dirs = new Vector(); for (int i = 0; i < args.length; i++) { if (args[i] == null) continue; if (args[i].equals("-o")) { if (args.length > i + 1) { output = new File(args[++i]); continue; } else { LOG.severe("Required value of '-o' argument missing.\n"); usage(); return; } } else if (args[i].equals("-i")) { runIndexer = true; } else if (args[i].equals("-cm")) { LOG.warning("'-cm' option obsolete - ignored."); } else if (args[i].equals("-max")) { String cnt = args[++i]; try { maxCount = Long.parseLong(cnt); } catch (Exception e) { LOG.warning("Invalid count '" + cnt + "', setting to Long.MAX_VALUE."); } } else if (args[i].equals("-ds")) { delSegs = true; } else if (args[i].equals("-dir")) { segDir = args[++i]; } else dirs.add(new File(args[i])); } if (segDir != null) { File sDir = new File(segDir); if (!sDir.exists() || !sDir.isDirectory()) { LOG.warning("Invalid path: " + sDir); } else { File[] files = sDir.listFiles(new FileFilter() { public boolean accept(File f) { return f.isDirectory(); } }); if (files != null && files.length > 0) { for (int i = 0; i < files.length; i++) dirs.add(files[i]); } } } if (dirs.size() == 0) { LOG.severe("No input segments."); return; } if (output == null) output = ((File)dirs.get(0)).getParentFile(); NutchwaxSegmentMergeTool st = new NutchwaxSegmentMergeTool(nfs, (File[])dirs.toArray(new File[0]), output, maxCount, runIndexer, delSegs); st.run(); } private static void usage() { System.err.println("NutchwaxSegmentMergeTool (-local | -nfs ...) (-dir <input_segments_dir> | seg1 seg2 ...) [-o <output_segments_dir>] [-max count] [-i] [-ds]"); System.err.println("\t-dir <input_segments_dir>\tpath to directory containing input segments"); System.err.println("\tseg1 seg2 seg3\t\tindividual paths to input segments"); System.err.println("\t-o <output_segment_dir>\t(optional) path to directory which will\n\t\t\t\tcontain output segment(s).\n\t\t\tNOTE: If not present, the original segments path will be used."); System.err.println("\t-max count\t(optional) output multiple segments, each with maximum 'count' entries"); System.err.println("\t-i\t\t(optional) index the output segment when finished merging."); System.err.println("\t-ds\t\t(optional) delete the original input segments when finished."); System.err.println(); } } |
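The tool above is normally launched through the NutchWAX copy of the nutch script mentioned in the log message, which dispatches to a fully-qualified class name. As a rough sketch of a run matching the usage() text — the segment paths here are hypothetical, and the assumption that bin/nutch can dispatch to this class is mine, not something the commit states:

    % ./bin/nutch org.archive.access.nutch.NutchwaxSegmentMergeTool -local \
        -dir ${HOME}/nutch-data/segments \
        -o ${HOME}/nutch-data/segments-merged -i

Per the option descriptions in usage(), this would merge every segment directory found under -dir into new segment(s) under -o and index the result (-i); the input segments would be left in place because -ds is not given.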
From: Michael S. <sta...@us...> - 2005-09-01 21:22:18
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv11377/xdocs

Modified Files:
	requirements.xml
Log Message:
* xdocs/requirements.xml
    Removed nutch. Belongs in src doc.

Index: requirements.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/requirements.xml,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** requirements.xml	1 Sep 2005 21:19:27 -0000	1.4
--- requirements.xml	1 Sep 2005 21:22:09 -0000	1.5
***************
*** 31,38 ****
  </section>
  <section name="Build from src Requirements" >
- <subsection name="Nutch">
- <p>Nutch 0.7 src
- </p>
- </subsection>
  <subsection name="Ant">
  <p>Tested working with version 1.6.2.
--- 31,34 ----
|
From: Michael S. <sta...@us...> - 2005-09-01 21:19:37
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv9693/xdocs

Modified Files:
	requirements.xml
Log Message:
* xdocs/requirements.xml
    Added nutch 0.7 to requirements.

Index: requirements.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/requirements.xml,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** requirements.xml	29 Jul 2005 22:12:23 -0000	1.3
--- requirements.xml	1 Sep 2005 21:19:27 -0000	1.4
***************
*** 8,24 ****
  <body>
! <section name="System Runtime Requirements">
  <subsection name="JAVA">
  <p>Tested working with SUN v1.5.0_01 and 1.4.2_03.
  </p>
  </subsection>
! <subsection name="Ant">
! <p>Tested working with version 1.6.2.
! </p>
! </subsection>
  <subsection name="Tomcat">
  <p>Tested working with version 5.0.28.
  </p>
  </subsection>
  <subsection name="xpdf: pdftotext">
  <p>If parsing PDFs, you'll need <a href="http://www.foolabs.com/xpdf/">xpdf</a>
--- 8,23 ----
  <body>
! <section name="Runtime Requirements">
  <subsection name="JAVA">
  <p>Tested working with SUN v1.5.0_01 and 1.4.2_03.
  </p>
  </subsection>
!
! <subsection name="Tomcat">
  <p>Tested working with version 5.0.28.
  </p>
  </subsection>
+
  <subsection name="xpdf: pdftotext">
  <p>If parsing PDFs, you'll need <a href="http://www.foolabs.com/xpdf/">xpdf</a>
***************
*** 31,34 ****
--- 30,47 ----
  </subsection>
  </section>
+ <section name="Build from src Requirements" >
+ <subsection name="Nutch">
+ <p>Nutch 0.7 src
+ </p>
+ </subsection>
+ <subsection name="Ant">
+ <p>Tested working with version 1.6.2.
+ </p>
+ </subsection>
+ <subsection name="Maven">
+ <p>If you want to build distributions and the website, you'll need Maven.
+ </p>
+ </subsection>
+ </section>
  </body>
|
From: Michael S. <sta...@us...> - 2005-09-01 20:58:32
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26986/xdocs

Modified Files:
	gettingstarted.xml
Log Message:
* xdocs/gettingstarted.xml
    Edits from Sverre.

Index: gettingstarted.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/gettingstarted.xml,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** gettingstarted.xml	29 Jul 2005 22:12:23 -0000	1.8
--- gettingstarted.xml	1 Sep 2005 20:58:24 -0000	1.9
***************
*** 28,33 ****
  indexing step. It takes a bunch of options. To do the most basic indexing
  operation, point it at a few ARC files and let it run:
! <pre>% ./bin/indexarcs -s ${HOME}/arcs/ -d ${HOME}/nutch-data</pre>
! This will build an index for you in <code>${HOME}/nutch-data</code>.
  </p>
  <p>
--- 28,36 ----
  indexing step. It takes a bunch of options. To do the most basic indexing
  operation, point it at a few ARC files and let it run:
! <pre>% ./bin/indexarcs.sh -s ${HOME}/arcs/ -d ${HOME}/nutch-data -c COLLECTION_NAME</pre>
! This will build an index for you in <code>${HOME}/nutch-data</code> ('-c'
! names the collection the indexed content will belong to; the optional '-n'
! flag says do not run the deduplication step, which is necessary if you are
! using nutchwax with wera).
  </p>
  <p>
|
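Combining the flags described in the parenthetical above, a run intended for a WERA-backed installation might look like the following sketch; the collection name is a placeholder, and the exact spelling of the dedup-skipping flag should be confirmed against bin/indexarcs.sh, since the example command in the diff does not show it:

    % ./bin/indexarcs.sh -s ${HOME}/arcs/ -d ${HOME}/nutch-data -c mycollection -n

Here '-c mycollection' names the collection the indexed documents will belong to, and '-n' skips the deduplication step, which the text above says is necessary when using nutchwax with wera.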