Re: [Aperture-devel] PDF Parse Dropping Double Letters

Keith Bennett wrote:
> Hello, all.  We have a strange problem parsing a PDF file.  In some
> cases, when there are two of the same letters together, or even
> separated by a space, one of the letters is dropped from the parsed
> text.  Some of the double letters come through ok, but others don't.
> We don't yet know how many documents this happens on, or if it's only
> the one we noticed.  I looked at the document but could not find
> anything unusual about it.
> 
> Have you seen anything like this before?  Do you have any idea what
> might be going on?  I realize whatever is happening is probably not
> caused by Aperture itself, but by the document, or PDFBox, or
> something we're doing wrong, but I was hoping you could help us
> understand this.
> 
> Thanks,
> Keith Bennett
> 

We've had (and actually still have :) problems with OutOfMemoryErrors
and weird non-ASCII Unicode characters in the PdfExtractor. Dropped
double letters are something new. I'd be grateful if you could send me
the document, provided it doesn't contain any sensitive information.

Antoni Myłka
ant...@gm...