Re: [gscan2pdf-help] Unicode in PDF


On 29 August 2010 21:11, John Fingerhut <and...@gm...> wrote:
> Did my earlier attempt at sending an email get through, with an attached
> slightly modified version of your Perl script, with a few Greek characters
> added to the string, and my comments about how the text is visible, but not
> searchable or pdftotext-able?

To be honest, I didn't try it, because I had already done something
similar myself, with identical results. The only additional
information I gleaned was that evince (or more probably poppler)
complains on the command line that the PDFs are corrupt.

> Are you thinking of trying to fix whatever limitations exist in PDF::API2
> that make the text unable to be searched?  Without that capability, there
> isn't much point in using that method in gscan2pdf.

Given that the Unicode text is displayed correctly, I am hoping that
it won't require too much work to patch PDF::API2 to create valid PDF
that pdftotext can read.

Note that it doesn't seem to be a problem with Unicode itself, but
with the handling of the DejaVu font (or maybe all TTF fonts). When I
tried plain ASCII text in the same manner, I also got a corrupt PDF.

Regards

Jeff