If you can build a PDFBox.jar from the tip of the trunk, you could simply try substituting it for the old jar. I would like to know whether that solves the problem.
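A minimal sketch of that substitution, in case it helps. It assumes you know where the old PDFBox jar sits in your GSearch webapp and where your trunk build wrote the new one; both paths are passed on the command line and are placeholders, not the actual GSearch layout. The class name JarSwap and the ".bak" suffix are made up for illustration. The new jar is copied over the old file name, so that only one PDFBox version ends up on the classpath, and the old jar is kept as a backup that Tomcat should not load because it no longer ends in ".jar". You will most likely have to restart Tomcat afterwards.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class JarSwap {

    /**
     * Backs up oldJar as "<name>.bak" in the same directory, then copies
     * newJar over oldJar, keeping the old file name.
     * Returns the path of the backup so the swap can be reverted by hand.
     */
    public static Path swapJar(Path oldJar, Path newJar) throws IOException {
        if (!Files.isRegularFile(oldJar)) {
            throw new IOException("Old jar not found: " + oldJar);
        }
        if (!Files.isRegularFile(newJar)) {
            throw new IOException("New jar not found: " + newJar);
        }
        // The backup name is a hypothetical convention, not something GSearch expects.
        Path backup = oldJar.resolveSibling(oldJar.getFileName() + ".bak");
        Files.copy(oldJar, backup, StandardCopyOption.REPLACE_EXISTING);
        Files.copy(newJar, oldJar, StandardCopyOption.REPLACE_EXISTING);
        return backup;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java JarSwap <old-jar-in-webapp> <new-jar-from-trunk>");
            System.exit(1);
        }
        Path backup = swapJar(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Replaced " + args[0] + "; backup kept at " + backup);
    }
}
```

To revert, copy the ".bak" file back over the jar and restart Tomcat again.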
There is no fixed schedule for the next GSearch release. The current one also works with Fedora 3.2.1. The intention is to release again when Lucene and Solr publish their next releases, and to include the newest PDFBox and probably other updated libraries, together with a few minor improvements.
> -----Original Message-----
> From: Ben Ranker [mailto:branker@...]
> Sent: Friday, August 21, 2009 9:51 PM
> To: fedora-commons-users@...
> Subject: [Fedora-commons-users] GSearch: NPE in getDatastreamText (from
> I’m using Fedora 3.2.1 and GSearch 2.2 to index some PDF documents. The
> documents were recently created with a new version of Acrobat Pro.
> In my GSearch index, my defaultUpdateIndexDocXslt calls
> getDatastreamText() to get the text from my PDF. Whenever GSearch calls
> that function on my PDF streams I get a NullPointerException (no
> backtrace) in my tomcat catalina.log file, and the resultant solr XML
> (which is valid somehow) simply has no output from getDatastreamText().
> There’s no interesting information in my fedoragsearch.log.
> I was able to track the problem down to a bug in PDFBox. The bug is
> still present in 0.7.4 (the most recently released version AFAICT), but
> it is fixed in the tip of PDFBox trunk. I strongly suspect that it’s
> the following
> Has anyone else run into this problem? Does anyone have a patch handy?
> Are there plans to release a version of GSearch incorporating newer
> versions of PDFBox? Are there plans to release a version of GSearch
> with official support for Fedora 3.2.1? (GSearch 2.2 advertises only
> Fedora 3.1 support.)
> Thanks in advance. Unless I hear a solution pretty soon I’m going to
> start backporting the referenced PDFBox bugfix to 0.7.2 (the version
> incorporated in GSearch 2.2). I’ll post a patch when I have one.
> Ben Ranker <branker@...>
> Emory University Libraries