From: Keith B. <kbe...@bb...> - 2007-11-07 16:22:58
Antoni -

Unfortunately, the documents are proprietary. If I encounter the problem
again with other documents, I'll be happy to send them on. However,
it looks like all of the documents that had problems were specified
as being created in version 4 format. That is, if I open the
document in Acrobat Reader, and I go to the Documents menu, then
select Security Settings, then the Description tab, the "PDF Version"
field says "(Acrobat 4.x)".

Other documents in version 4 format work ok, as do other documents
with version 3 through 8. We haven't tested enough documents to
verify that all formats other than 4 always work, though.

Do you know of any commercial PDF parsers that I could wrap in
Aperture that might work better?

Thanks,
Keith

On Nov 6, 2007, at 5:44 PM, Antoni Myłka wrote:

> Keith Bennett wrote:
>> Hello, all. We have a strange problem parsing a PDF file. In some
>> cases, when there are two of the same letters together, or even
>> separated by a space, one of the letters is dropped from the parsed
>> text. Some of the double letters come through ok, but others don't.
>> We don't yet know how many documents this happens on, or if it's only
>> the one we noticed. I looked at the document but could not find
>> anything unusual about it.
>>
>> Have you seen anything like this before? Do you have any idea what
>> might be going on? I realize whatever is happening is probably not
>> caused by Aperture itself, but by the document, or PDFBox, or
>> something we're doing wrong, but I was hoping you could help us
>> understand this.
>>
>> Thanks,
>> Keith Bennett
>>
>
> We've had (and actually still have :) problems with OutOfMemoryErrors
> and weird unicode non-ascii characters in the PdfExtractor. Eating up
> double letters is something new. I'd be grateful if you could send the
> document to me if it doesn't contain any sensitive information.
>
> Antoni Myłka