Re: [Aperture-devel] PDF Parse Dropping Double Letters


Antoni -

Unfortunately, the documents are proprietary.  If I encounter it
again with other documents, I'll be happy to send them on. However,
it looks like all of the documents that had problems were specified
as being created in version 4 format.  That is, if I open the
document in Acrobat Reader, and I go to the Documents menu, then
select Security Settings, then the Description tab, the "PDF Version"
field says "(Acrobat 4.x)".

Other documents in version 4 format work ok, as do other documents
with version 3 through 8.  We haven't tested enough documents to
verify that all formats other than 4 always work, though.
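
To check a larger batch without opening each file in Acrobat Reader, the version can usually be read from the file header, which starts with a marker such as "%PDF-1.3". Below is a minimal sketch of that idea, not anything taken from Aperture or PDFBox. The class name PdfVersionSniffer and its methods are hypothetical, and it assumes the header is authoritative (newer files may override it in the document catalog). The mapping used is PDF 1.2 through 1.7 for Acrobat 3.x through 8.x.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports the PDF header version of each file given
// on the command line, so problem documents can be grouped by version.
public class PdfVersionSniffer {

    // A PDF file begins with a header such as "%PDF-1.3".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");

    /** Returns the header version (e.g. "1.3"), or null if none is found. */
    public static String headerVersion(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so binary data is safe here.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Maps a header version to the Acrobat generation that introduced it. */
    public static String acrobatLabel(String version) {
        if (version == null) {
            return "unknown";
        }
        switch (version) {
            case "1.2": return "Acrobat 3.x";
            case "1.3": return "Acrobat 4.x";
            case "1.4": return "Acrobat 5.x";
            case "1.5": return "Acrobat 6.x";
            case "1.6": return "Acrobat 7.x";
            case "1.7": return "Acrobat 8.x";
            default:    return "unknown";
        }
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            byte[] buf = new byte[1024];
            int n;
            try (InputStream in = Files.newInputStream(Paths.get(path))) {
                n = in.read(buf); // may be -1 for an empty file
            }
            String version = headerVersion(Arrays.copyOf(buf, Math.max(n, 0)));
            System.out.println(path + "\t" + version + "\t" + acrobatLabel(version));
        }
    }
}
```

Running it over both the failing and the working documents would show whether the problem really tracks the version, or whether something else (for example the producing application) separates the two groups.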

Do you know of any commercial PDF parsers that I could wrap in
Aperture that might work better?

Thanks,
Keith

On Nov 6, 2007, at 5:44 PM, Antoni Myłka wrote:

> Keith Bennett wrote:
>> Hello, all.  We have a strange problem parsing a PDF file.  In some
>> cases, when there are two of the same letters together, or even
>> separated by a space, one of the letters is dropped from the parsed
>> text.  Some of the double letters come through ok, but others don't.
>> We don't yet know how many documents this happens on, or if it's only
>> the one we noticed.  I looked at the document but could not find
>> anything unusual about it.
>>
>> Have you seen anything like this before?  Do you have any idea what
>> might be going on?  I realize whatever is happening is probably not
>> caused by Aperture itself, but by the document, or PDFBox, or
>> something we're doing wrong, but I was hoping you could help us
>> understand this.
>>
>> Thanks,
>> Keith Bennett
>>
>
> We've had (and actually still have :) problems with OutOfMemoryErrors
> and weird Unicode non-ASCII characters in the PdfExtractor. Eating up
> double letters is something new. I'd be grateful if you could send the
> document to me if it doesn't contain any sensitive information.
>
> Antoni Myłka