I have another problem: when I extract text from DjVu files, diacritic (country-specific) characters come out wrong. It looks like these characters are encoded in two bytes (UTF?), but during extraction the two bytes are treated as two separate characters. Is it possible to define the charset/text encoding that will be used during text extraction?
The DjVu specification requires the text layer to be encoded in UTF-8. Unfortunately, this specification is not always followed. In those cases the text is returned one byte per character. If you know which locale the text was written for, you can convert it with the Java API by copying it to a ByteStream and reading it back in the respective encoding.
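A minimal sketch of that conversion in plain Java, assuming the extracted text arrives as a String in which every char holds exactly one original byte (0..255). The class name DjvuTextRecode and the method redecode are hypothetical names for illustration, not part of any DjVu library, and the sketch uses String.getBytes instead of an explicit byte stream, which should have the same effect.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DjvuTextRecode {

    /**
     * Re-decodes text that was extracted one byte per character.
     * Hypothetical helper: assumes every char in 'extracted' is in the
     * range U+0000..U+00FF and stands for one raw byte of the text layer.
     */
    static String redecode(String extracted, Charset actual) {
        // ISO-8859-1 maps chars U+0000..U+00FF to identical byte values,
        // so this recovers the raw bytes. Chars above U+00FF would be
        // replaced by '?', so the assumption above matters.
        byte[] raw = extracted.getBytes(StandardCharsets.ISO_8859_1);
        // Decode the raw bytes with the encoding the text really uses.
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        // "ą" (U+0105) is the two bytes C4 85 in UTF-8. Read one byte per
        // character, it shows up as the two chars U+00C4 and U+0085.
        String garbled = "\u00C4\u0085";
        String fixed = redecode(garbled, StandardCharsets.UTF_8);
        System.out.println(fixed.equals("\u0105")); // prints "true"
    }
}
```

If the text layer is in a locale-specific single-byte encoding instead of UTF-8, pass that charset instead, for example Charset.forName("windows-1250") for Central European text, provided your Java runtime supports it.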