arc...@li... wrote:
>Message: 1
>Date: Thu, 27 Oct 2005 08:52:10 +0300 (EEST)
>From: Kaisa Kaunonen <kau...@cs...>
>To: arc...@li...
>Subject: Re: [Archive-access-discuss] Path to parse-pdf.sh
>
>
>Yes, editing plugin.xml is of course the right thing to do..
>just suggesting that default value of this path should
>vary according to the local $NUTCHWAX variable, if people install
>indexer out of the box.
>
Yes Kaisa, it should just work. You shouldn't need to tinker
with this one path before you start indexing (There is an FAQ on this
-- http://archive-access.sourceforge.net/projects/nutch/faq.html#pdf --
but I should have put something about it into the setup instructions...
I'll fix this).
Setting the path is a little tough. It's a Java variable, so it's awkward
to exploit environment settings such as NUTCHWAX. I should probably
pass the NUTCHWAX setting into Java as a system property and then
use it to compose the path to parse-pdf.sh.
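
A minimal sketch of that idea, not the actual NutchWAX code. The property
name nutchwax.home, the bin/ subdirectory and the class name are all
assumptions made for illustration; the launcher script would pass the
setting with something like java -Dnutchwax.home="$NUTCHWAX" ...

```java
import java.io.File;

public class ParsePdfPath {
    // Hypothetical property name; set by the launcher via -Dnutchwax.home=$NUTCHWAX.
    static final String HOME_PROPERTY = "nutchwax.home";

    /**
     * Compose the path to parse-pdf.sh from the NUTCHWAX install directory.
     * Falls back to the configured default (e.g. the value from plugin.xml)
     * when the property was not passed in.
     */
    static String resolve(String nutchwaxHome, String configuredDefault) {
        if (nutchwaxHome == null || nutchwaxHome.trim().length() == 0) {
            return configuredDefault;
        }
        // Assumption: the script lives under <NUTCHWAX>/bin.
        return new File(new File(nutchwaxHome, "bin"), "parse-pdf.sh").getPath();
    }

    public static void main(String[] args) {
        String path = resolve(System.getProperty(HOME_PROPERTY), "./bin/parse-pdf.sh");
        System.out.println(path);
    }
}
```

Keeping the plugin.xml value as the fallback means an existing hand-edited
setup keeps working when the property is absent.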
I was also thinking of redoing the pdf parser, though, since the one we
have has a couple of issues: 1. If a pdf is > 10 MB, it is not indexed; and 2. If
the http content-length header does not exactly match the actual content length,
we skip the document (this happens quite frequently). I was thinking of
doing a parse-xpdf plugin to use in place of parse-ext. It wouldn't
try to find an external script -- parse-pdf.sh -- to run but
would just use the environment's xpdf (though you could override this, of
course), and it would try to do a better job with big pdfs and perhaps
incomplete pdfs (to be explored).
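
Roughly, the core of such a plugin might look like the sketch below. This
is only an illustration under assumptions, not a design: the class name and
the parse.xpdf.binary override property are made up, and it assumes xpdf's
pdftotext is on the PATH and that the pdf has already been written to a
local file. It does not address the big or incomplete pdf cases.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

public class XpdfTextExtractor {
    // Hypothetical property for overriding which pdftotext binary is used.
    static final String BINARY_PROPERTY = "parse.xpdf.binary";

    /** Build the command line; "-" as the output file makes pdftotext write to stdout. */
    static List<String> buildCommand(String binaryOverride, File pdf) {
        String binary = (binaryOverride == null || binaryOverride.length() == 0)
                ? "pdftotext"      // resolved from the environment's PATH
                : binaryOverride;
        return Arrays.asList(binary, "-enc", "UTF-8", pdf.getPath(), "-");
    }

    /** Run pdftotext on a local file and return the extracted text. */
    static String extractText(File pdf) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                buildCommand(System.getProperty(BINARY_PROPERTY), pdf));
        // Send the tool's error output to our stderr so it cannot fill a pipe and block.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        InputStream in = p.getInputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with status " + exit);
        }
        return out.toString("UTF-8");
    }
}
```

Since the text comes from whatever bytes were actually fetched, a
content-length mismatch need not, by itself, be a reason to skip the document.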
St.Ack