We have seen these errors too, on several hundred of our PDFs. There are two ways to handle this:
1. Add the affected items to the "skip list" so that filter media skips them.
2. Re-save the affected PDFs. We are finding that Acrobat 8 and Acrobat 9 add internal
tagging that PDFBox.jar cannot handle. So far, we have done a "Save As" and changed
the settings to strip almost all of the internal tagging and other features from the
document. Next week (after the Thanksgiving holiday) we will continue experimenting
with this issue so that we can document it for our staff and they can make sure
all PDFs extract correctly.
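
For option 1, here is a rough sketch of how the handles could be collected from the filter media output rather than copied by hand. It assumes the errors look like the "ERROR filtering, skipping bitstream:" block quoted below, and that the skip list takes one handle per line; please check the filter-media documentation for your DSpace version before relying on that. The function names are made up for this example.

```python
import re

# Matches lines such as "Item Handle: 10394/1886" in the filter media output.
HANDLE_LINE = re.compile(r"^\s*Item Handle:\s*(\S+)\s*$")


def failed_handles(log_text):
    """Return the handles reported after each 'ERROR filtering' line.

    Order is preserved and duplicates are dropped.
    """
    handles = []
    in_error = False
    for line in log_text.splitlines():
        if "ERROR filtering" in line:
            in_error = True
            continue
        match = HANDLE_LINE.match(line)
        if in_error and match:
            if match.group(1) not in handles:
                handles.append(match.group(1))
            in_error = False
    return handles


def skip_list_text(handles):
    """Format handles one per line (assumed skip-list layout)."""
    return "".join(handle + "\n" for handle in handles)
```

The text returned by skip_list_text can be saved to a file and passed to filter media as the skip list.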
William F. Maag Library
Youngstown State University
"I must not fear. Fear is the mind-killer.
I will permit it to pass over me and through me..."
--Litany against fear....
On Nov 25, 2009, at 5:31 AM, Louw Venter wrote:
Anyone have any ideas please?
I made a bit of a mess.
A while back I uploaded some PDF documents to DSpace and ran filter media to extract the text. Recently the creators of the PDF files sent me a [...] numbers etc to replace the existing ones already on the server. So I simply removed the items and added new bitstreams.
Now when I run the filter media process again the text doesn't get extracted - could this be because the checksums don't match or because the original was located in one assetstore and the new one in another?
Thank you in advance for any help in this regard,
ERROR filtering, skipping bitstream:
Item Handle: 10394/1886
Bundle Name: ORIGINAL
File Size: 287223
Checksum: 6de2597a7cabd6ca3a995c355d9301f1 (MD5)
Asset Store: 1
at org
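
To rule the checksum question in or out, one option is to compute the MD5 of the file you uploaded and compare it with the value filter media reports. This is only a sketch: the function names are invented for illustration, and it says nothing about why the extraction itself fails.

```python
import hashlib


def md5_hex(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reported(path, reported):
    """True if the file's MD5 equals the checksum shown in the error output."""
    return md5_hex(path) == reported.strip().lower()
```

For the item above, the reported value to compare against would be 6de2597a7cabd6ca3a995c355d9301f1.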
Dspace-general mailing list