We've seen these errors too, on several hundred PDFs.  There are two ways to treat this:

1.  Use the "skip list" to skip the items in question when running filter media.

2.  Work on the PDFs themselves.  We are seeing problems with Acrobat 8 and Acrobat 9:
those versions add internal tagging that PDFBox.jar cannot handle.  So far, we have
done a "save as" with the settings changed to strip nearly all internal tagging and
other features from the document.  Next week (after the Thanksgiving holiday) we will
continue experimenting with and studying this issue so that we can document it for our
staff and they can make sure all PDFs extract correctly.
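
For option 1, the skip list can be built from the filter media output rather than by hand. The following is only a rough sketch: it assumes the errors are logged in the same "ERROR filtering, skipping bitstream:" / "Item Handle:" layout quoted further down in this thread, and that your DSpace version's filter media command accepts a comma-separated list of handles to skip (check its help output for the skip option first). The function names are made up for illustration.

```python
import re

# Matches lines such as "        Item Handle: 10394/1886"
HANDLE_RE = re.compile(r"^\s*Item Handle:\s*(\S+)\s*$")


def failed_handles(log_text):
    """Return the unique item handles found in 'ERROR filtering' blocks,
    in the order they first appear."""
    handles = []
    in_error_block = False
    for line in log_text.splitlines():
        if line.strip().startswith("ERROR filtering, skipping bitstream"):
            in_error_block = True
            continue
        match = HANDLE_RE.match(line)
        if in_error_block and match:
            handle = match.group(1)
            if handle not in handles:
                handles.append(handle)
            # Only the first handle after each error header belongs to it.
            in_error_block = False
    return handles


def skip_list(log_text):
    """Join the failing handles into one comma-separated string."""
    return ",".join(failed_handles(log_text))


if __name__ == "__main__":
    sample = (
        "ERROR filtering, skipping bitstream:\n"
        "        Item Handle: 10394/1886\n"
        "        Bundle Name: ORIGINAL\n"
    )
    print(skip_list(sample))
```

The resulting string can then be saved to a file or pasted into the skip option, so that later runs pass over the items that are known to fail.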


Jeffrey Trimble
System Librarian
William F.  Maag Library
Youngstown State University
330.941.2483 (Office)
"I must not fear.  Fear is the mind-killer.
I will permit it to pass over me and through me..."
--Litany against fear....

On Nov 25, 2009, at 5:31 AM, Louw Venter wrote:

Anyone have any ideas please?


>>> On 03 November 2009 at 12:40 PM, "Louw Venter" <Louw.Venter@nwu.ac.za> wrote:
Hello all,
I made a bit of a mess.
A while back I uploaded some PDF documents to DSpace and ran filter media to extract the text. Recently the creators of the PDF files sent me new ones (with numbers etc.) to replace the existing ones already on the server. So I simply removed the items and added new bitstreams.
Now when I run the filter media process again the text doesn't get extracted - could this be because the checksums don't match or because the original was located in one assetstore and the new one in another?
Thank you in advance for any help in this regard,
ERROR filtering, skipping bitstream:
        Item Handle: 10394/1886
        Bundle Name: ORIGINAL
        File Size: 287223
        Checksum: 6de2597a7cabd6ca3a995c355d9301f1 (MD5)
        Asset Store: 1
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:194)
        at org.pdfbox.pdmodel.PDPageNode.getAllKids(PDPageNode.java:182)
        at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages(PDDocumentCatalog.java:226)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:141)
        at org.dspace.app.mediafilter.MediaFilterManager.processBitstream(MediaFilterManager.java:668)
        at org.dspace.app.mediafilter.MediaFilterManager.filterBitstream(MediaFilterManager.java:570)
        at org.dspace.app.mediafilter.MediaFilterManager.filterItem(MediaFilterManager.java:520)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersItem(MediaFilterManager.java:488)
        at org.dspace.app.mediafilter.MediaFilterManager.applyFiltersAllItems(MediaFilterManager.java:427)
        at org.dspace.app.mediafilter.MediaFilterManager.main(MediaFilterManager.java:359)
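
On the checksum question: one way to check whether the bitstream named in the error is really the replacement file is to compute the MD5 of the local copy and compare it with the "Checksum:" value in the output above. This is a minimal sketch using only Python's standard hashlib module; the function names and the file path are placeholders, not anything DSpace provides.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks so that
    large PDFs do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reported(path, reported_checksum):
    """True if the local file's MD5 equals the checksum printed in the log."""
    return md5_of_file(path) == reported_checksum.strip().lower()


# Hypothetical usage: "replacement.pdf" is a placeholder for the local copy.
# matches_reported("replacement.pdf", "6de2597a7cabd6ca3a995c355d9301f1")
```

If the two values match, the filter is at least reading the new file, which would point away from a checksum mismatch and toward the PDF content itself.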
Louw Venter
Dspace-general mailing list