From: <st...@ar...> - 2005-10-28 15:20:58
arc...@li... wrote:

>Message: 1
>Date: Thu, 27 Oct 2005 08:52:10 +0300 (EEST)
>From: Kaisa Kaunonen <kau...@cs...>
>To: arc...@li...
>Subject: Re: [Archive-access-discuss] Path to parse-pdf.sh
>
>
>Yes, editing plugin.xml is of course the right thing to do..
>just suggesting that default value of this path should
>vary according to the local $NUTCHWAX variable, if people install
>indexer out of the box.
>

Yes Kaisa, it should just work. It shouldn't be necessary to tinker with this one path before you go about indexing (There is an FAQ on this -- http://archive-access.sourceforge.net/projects/nutch/faq.html#pdf -- but I should have put something on this into the setup instructions... I'll fix this).

Setting the path is a little tough. It's a Java variable, so it's awkward to exploit environment settings such as NUTCHWAX. I should likely pass the NUTCHWAX setting into Java as a system property and then make use of that when composing the path to parse-pdf.sh.

I was also thinking, though, of redoing the pdf parser, since the one we have has a couple of issues:

1. If the pdf is > 10 megs, it is not indexed; and
2. If the http content-length header does not exactly match the actual content length, we skip the document (this happens quite frequently).

I was thinking of doing a parse-xpdf plugin to use in place of parse-ext. It wouldn't do things like try to find an external script -- parse-pdf.sh -- to run, but would just use the environment's xpdf (though you could override this, of course), and it would try to do a better job with big pdfs and perhaps incomplete pdfs (to be explored).

St.Ack
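
For illustration only, here is a rough sketch of the system-property idea mentioned above. It is not the actual NutchWAX code. The property name nutchwax.home, the class name ParsePdfPath, and the bin/parse-pdf.sh location under the install directory are all assumptions made up for this example. The launch script would be expected to hand the environment setting to the JVM with something like java -Dnutchwax.home="$NUTCHWAX" ...

```java
import java.io.File;

public class ParsePdfPath {
    // Hypothetical property name, set by the launch script via -Dnutchwax.home="$NUTCHWAX".
    static final String HOME_PROPERTY = "nutchwax.home";

    // Assumed location of the script relative to the install directory.
    static final String SCRIPT_RELATIVE = "bin/parse-pdf.sh";

    /**
     * Compose the script path from the install directory. If no install
     * directory is known, fall back to the value configured elsewhere
     * (for example, the path written in plugin.xml).
     */
    static String resolveScript(String home, String configured) {
        if (home == null || home.isEmpty()) {
            return configured;
        }
        return new File(home, SCRIPT_RELATIVE).getPath();
    }

    public static void main(String[] args) {
        // Prefer the system property; otherwise read the NUTCHWAX environment variable.
        String home = System.getProperty(HOME_PROPERTY, System.getenv("NUTCHWAX"));
        // The fallback path here is a placeholder, not a real default.
        System.out.println(resolveScript(home, "./bin/parse-pdf.sh"));
    }
}
```

With something along these lines, an out-of-the-box install would only need NUTCHWAX set correctly, and editing plugin.xml would remain available as an override.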