[Archive-access-discuss] Re: Archive-access-discuss digest, Vol 1 #15 - 1 msg


arc...@li... wrote:

>Message: 1
>Date: Thu, 27 Oct 2005 08:52:10 +0300 (EEST)
>From: Kaisa Kaunonen <kau...@cs...>
>To: arc...@li...
>Subject: Re: [Archive-access-discuss] Path to parse-pdf.sh
>
>
>Yes, editing plugin.xml is of course the right thing to do..
>just suggesting that default value of this path should
>vary according to the local $NUTCHWAX variable, if people install
>indexer out of the box.
>
Yes Kaisa, it should just work.  It shouldn't be necessary to tinker
with this one path before you go about indexing.  (There is an FAQ on
this -- http://archive-access.sourceforge.net/projects/nutch/faq.html#pdf
-- but I should have put something on this into the setup instructions.
I'll fix this.)

Setting the path is a little tough.  It's a Java variable, so it's
awkward to exploit environment settings such as NUTCHWAX.  I should
likely pass the NUTCHWAX setting into Java as a system property and
then make use of that when composing the path to parse-pdf.sh.
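
A minimal sketch of that idea, not the actual NutchWAX code.  The
property name nutchwax.home, the class name, and the bin/ subdirectory
are all assumptions made for illustration; the launcher script would
need to pass something like -Dnutchwax.home=$NUTCHWAX to the JVM.

```java
import java.io.File;

public class ParsePdfPath {
    // Hypothetical property name; nothing in NutchWAX is known to define it.
    static final String HOME_PROPERTY = "nutchwax.home";

    /** Composes the path to parse-pdf.sh under the given NUTCHWAX home. */
    static String compose(String nutchwaxHome) {
        if (nutchwaxHome == null || nutchwaxHome.isEmpty()) {
            throw new IllegalStateException(
                "NUTCHWAX home not set; start the JVM with -D"
                + HOME_PROPERTY + "=$NUTCHWAX");
        }
        // Assumption: the script lives in a bin/ directory under the home.
        return new File(new File(nutchwaxHome, "bin"), "parse-pdf.sh").getPath();
    }

    /** Prefers the system property, falling back to the environment variable. */
    static String resolve() {
        String home = System.getProperty(HOME_PROPERTY);
        if (home == null) {
            home = System.getenv("NUTCHWAX");
        }
        return compose(home);
    }

    public static void main(String[] args) {
        // Example value only; pass a real install directory as the first argument.
        String home = args.length > 0 ? args[0] : "/opt/nutchwax";
        System.out.println(compose(home));
    }
}
```

Failing loudly when neither setting is present seems preferable to
silently falling back to a hard-coded default path.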

I was also thinking, though, of redoing the PDF parser, since the one we
have has a couple of issues: 1. If a PDF is larger than 10 MB, it is not
indexed; and 2. If the HTTP Content-Length header does not exactly match
the actual content length, we skip the document (this happens quite
frequently).  I was thinking of doing a parse-xpdf plugin to use in
place of parse-ext.  It wouldn't do things like try to find an external
script -- parse-pdf.sh -- to run but would just use the environment's
xpdf (though you could override this, of course), and it would try to do
a better job with big PDFs and perhaps incomplete PDFs (to be explored).
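
A rough sketch of how such a plugin might call xpdf directly, assuming
the xpdf text extractor pdftotext is on the PATH.  The class name and
the parse.xpdf.binary override property are hypothetical, and none of
the Nutch plugin wiring is shown; this only illustrates the "use the
environment's xpdf unless overridden" part.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class XpdfTextExtractor {
    // Hypothetical override property; by default the binary is found via PATH.
    static final String BINARY_PROPERTY = "parse.xpdf.binary";

    /** Builds the pdftotext command line; "-" sends the text to stdout. */
    static List<String> command(String binaryOverride, File pdf) {
        String binary = (binaryOverride == null || binaryOverride.isEmpty())
            ? "pdftotext" : binaryOverride;
        return Arrays.asList(binary, "-enc", "UTF-8", pdf.getPath(), "-");
    }

    /** Runs pdftotext on the file and returns the extracted text. */
    static String extract(File pdf) throws IOException, InterruptedException {
        ProcessBuilder pb =
            new ProcessBuilder(command(System.getProperty(BINARY_PROPERTY), pdf));
        // Keep warnings about damaged PDFs out of the extracted text.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        int status = process.waitFor();
        if (status != 0) {
            throw new IOException("pdftotext exited with status " + status);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Because the text is streamed from the process rather than held to a
fixed buffer size, a 10 MB cap is not inherent to this approach; how
well pdftotext copes with truncated files would still need testing.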

St.Ack