[Archive-access-discuss] On nutchwax not indexing images


Below is a snippet from the mail Charlie Foetz sent to this list last week.

Comments inline.

> ==================================================================
> PART 2 - INDEXED WITH THE CVS HEAD OF NUTCHWAX (BUILT FROM SOURCE)
> ==================================================================
> 
> 8) No images indexed?
> =====================
> 

I just downloaded HEAD and it seems to be indexing images fine.

....

> 
> So I look in the indexarcs output file and notice I have plenty of entries
> like this:
> 
> (...)
> 050929 115748 adding 4223 bytes of mimetype image/jpeg
> http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg
> 050929 115748 Failed parse: Content-Type not text/html: image/jpeg
> (...)
> 
When I read the above, it makes me think that your configuration is incorrect. It's tricky getting it right. The above seems to imply that the html parser is the last parser plugin to run, whereas HEAD goes out of its way to run the default-parser last. (It looks like the config is the default nutch config rather than the nutchwax config.)
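
To see how many documents are hitting this, a small script along the following lines could tally the "Failed parse" entries in the indexarcs output by mimetype. This is only a sketch: the pattern is taken from the single log line quoted above, and the function name and sample data are made up for illustration, not part of nutchwax.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted in this thread;
# other message formats are not handled (assumption).
FAILED_PARSE = re.compile(r"Failed parse: Content-Type not text/html: (\S+)")


def count_failed_parses(lines):
    """Count 'Failed parse' log entries per mimetype (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = FAILED_PARSE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the quoted indexarcs output.
sample = [
    "050929 115748 adding 4223 bytes of mimetype image/jpeg",
    "http://www.greng.lu/files/images/20050610-Bouton-MeyersRene.jpg",
    "050929 115748 Failed parse: Content-Type not text/html: image/jpeg",
]

for mimetype, n in count_failed_parses(sample).items():
    print(mimetype, n)
```

If every non-html mimetype shows up in the tally, that would be consistent with the html parser running last instead of the default-parser.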

Check out this FAQ: http://archive-access.sourceforge.net/projects/nutch/faq.html#default_parser

Try using one of the bundles from our continuous build server. It has the most recent builds of nutchwax on it. Look under the 'build artifacts' link on this page: http://crawltools.archive.org:8080/cruisecontrol/buildresults/HEAD-archive-access.

(I'm adding a link to the continuous build server on the nutchwax site.)

St.Ack