From: Demian K. <dem...@vi...> - 2012-07-03 13:15:11
I tried manually running Aperture on the problem PDF. I switched to the "bin" subdirectory of my Aperture installation and typed:

./webcrawler.sh -x -v -o tmp.xml -depth 0 http://americanjewisharchives.org/journal/PDF/2009_61_01_00_Kalman.pdf

This creates a "tmp.xml" file containing details about the PDF; this is what the indexer uses to load full text. The "-depth 0" tells it to index only the requested file, not to crawl deeper into its links. Sometimes running a manual crawl like this helps identify a problem... though in this particular case, Aperture seemed to succeed for me. That indicates one of two possible problems:

1.) Intermittent network issues on your end that are causing links to occasionally fail.
2.) A strange character in the full text that is preventing VuFind from parsing the XML from Aperture during indexing.

If you have access to your web logs, you might want to check for errors that could be related to #1. Or you might use some kind of external tool to poll your web server periodically and see if it occasionally fails to serve certain content. Unfortunately, it appears that Aperture isn't too particular about server errors - for example, if I give it an invalid URL that returns a 404 error, it just treats that as an empty page but otherwise reports success (even if I enable the "--validate" parameter). Perhaps we can address this particular problem by switching to Tika or a different tool (see http://vufind.org/jira/browse/VUFIND-600), but for now, the burden is on your server to perform correctly.

If the problem is #2, you might learn more about the issue by using the "--test-only" parameter of the XSL importer. Go to the import directory of your VuFind installation and run:

php import-xsl.php --test-only sitemap.xml sitemap.properties | more

(where sitemap.xml is a file containing the URL(s) you wish to test). This will output the Solr document generated by the import procedure.
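If the import output looks wrong, one quick low-level check is to scan the Aperture XML for NULL bytes, which are illegal in XML 1.0 and can break parsing. A minimal command-line sketch - the "tmp.xml" name matches the manual crawl above, but the printf line just creates a stand-in sample so the commands can be tried anywhere:

```shell
# Create a stand-in file containing a NULL byte (in real use, tmp.xml
# would be the output of the webcrawler.sh command above):
printf 'full\000text' > tmp.xml

# Count NULL bytes (tr -cd deletes everything EXCEPT NULLs):
tr -cd '\000' < tmp.xml | wc -c

# Strip NULL bytes out as a workaround before importing:
tr -d '\000' < tmp.xml > tmp.clean.xml
tr -cd '\000' < tmp.clean.xml | wc -c
```

A nonzero first count means the file contains bytes that an XML parser will reject; the stripped copy should count zero.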
You can see whether or not the proper full text has been inserted. When I tried this with the problem PDF, it appears that the full text is being truncated. This seems to be caused by a NULL byte in the Aperture output. As a workaround, I tried changing the line that reads in the file in import/xsl/VuFindSitemap.php to this:

// Extract and decode the full text from the XML:
$xml = preg_replace('/\x00/', '', file_get_contents($xmlFile));

That seemed to help...

- Demian

From: Nathan Tallman [mailto:nta...@gm...]
Sent: Monday, July 02, 2012 5:00 PM
To: Demian Katz
Cc: vufind-tech
Subject: Re: [VuFind-Tech] File Not Indexed by Aperture in Website Indexing

I created a test sitemap with three URLs: two problem HTML pages and one problem PDF. The two HTML files were picked up fine, but the PDF still wasn't indexed. Not sure where to proceed from here. My sitemap is already broken up into 50-URL chunks and there's definitely enough memory. The sitemap index is located at http://americanjewisharchives.org/sitemaps/sitemap.xml (be warned, there are about 8750 URLs listed!). Two of the problem URLs are http://www.americanjewisharchives.org/aja/FindingAids/ms0778/ms0778.html and http://americanjewisharchives.org/aja/FindingAids/ms0603.html; the problem PDF is at http://americanjewisharchives.org/journal/PDF/2009_61_01_00_Kalman.pdf.

Thanks Demian!
Nathan

On Fri, Jun 29, 2012 at 12:55 PM, Demian Katz <dem...@vi...> wrote:

A good starting point for troubleshooting would be to create a small sitemap.xml file that only lists a couple of the known problem files -- it would be interesting to see whether indexing them on their own works better than indexing them in the context of the full sitemap. If the URLs are public, you could send me the sitemap file so I can check for different results on my test server.
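For reference, a minimal test sitemap of the kind described above might look something like this (a sketch following the standard sitemaps.org schema, listing only the problem PDF from this thread):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://americanjewisharchives.org/journal/PDF/2009_61_01_00_Kalman.pdf</loc>
  </url>
</urlset>
```

Additional problem URLs can be tested by adding more &lt;url&gt; entries to the same file.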
- Demian

________________________________
From: Nathan Tallman [nta...@gm...]
Sent: Friday, June 29, 2012 12:18 PM
To: vufind-tech
Subject: [VuFind-Tech] File Not Indexed by Aperture in Website Indexing

I'm using VUFIND-454 <http://vufind.org/jira/browse/VUFIND-454> to index our institutional website. There are some webpages (HTML) and PDFs that are listed in the sitemap yet not getting indexed. The HTML is standard, and pages with identical coding but different text get indexed fine. The PDFs are generated from InDesign and are searchable in Acrobat/Reader, so the text should be easy to scrape. Again, it indexes similar files without a problem. Aperture isn't outputting anything that looks like it's missing files, and there are no PHP fatal errors about memory (which was once a problem, now solved). Any ideas on what might be causing this or how to troubleshoot?

Thanks,
Nathan