#25 PDFTextStripper not handling some Japanese

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2007-11-29
Created: 2007-11-29
Creator: sflaumen
Private: No

Using this code sequence:

PDDocument document = PDDocument.load(stream);
PDFTextStripper stripper = new PDFTextStripper();
String contents = stripper.getText(document);

some Japanese documents are handled properly, as shown by inspecting the characters in the String "contents".
However, other Japanese documents produce garbage, non-Japanese characters in the same String.
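One way to make the "garbage versus Japanese" check repeatable, rather than eyeballing the String, is to measure how many of the extracted characters fall in Unicode blocks commonly used for Japanese. The following is only a minimal sketch using the standard JDK; the class name JapaneseTextCheck, the choice of blocks, and any pass/fail threshold are assumptions for illustration and are not part of PDFBox.

```java
// Sketch: estimate how much of an extracted string is actually Japanese text.
// JapaneseTextCheck is a hypothetical helper, not a PDFBox class.
final class JapaneseTextCheck {

    /** Returns true if the code point lies in a Unicode block commonly used for Japanese. */
    static boolean isJapanese(int codePoint) {
        // UnicodeBlock.of may return null for unassigned code points; == handles that safely.
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.HIRAGANA
            || block == Character.UnicodeBlock.KATAKANA
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    /** Fraction of non-whitespace code points in those blocks (0.0 for empty input). */
    static double japaneseRatio(String text) {
        int total = 0;
        int japanese = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // ignore spaces and line breaks
            }
            total++;
            if (isJapanese(cp)) {
                japanese++;
            }
        }
        return total == 0 ? 0.0 : (double) japanese / total;
    }
}
```

Calling JapaneseTextCheck.japaneseRatio(contents) on the output of stripper.getText(document) should give a value near 1.0 for a correctly extracted Japanese document and a much lower value for the failing ones, although documents that legitimately mix Latin text and Japanese will land somewhere in between.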

The documents that PDFTextStripper does not handle properly display a prompt when opened in Acrobat Reader, saying that a Japanese language support pack needs to be installed to view the document properly. The documents that are handled properly display Japanese characters fine in Acrobat Reader. Installing the language support pack is not a solution, since it would only resolve the display in Acrobat Reader. This code needs to run on a Unix server, so even if the support pack helped on a PC (unlikely), it would have no effect on the task when run on Unix.

This appears to be an encoding issue; however, unlike similar issues that have been reported, the above code completes successfully. Only the extracted text is wrong, as described above.

Attached is an example of a PDF file that is not handled properly by PDFTextStripper and requires a Japanese language pack to view in Acrobat Reader.

Discussion

  • sflaumen

    sflaumen - 2007-11-29

    Not handled properly by PDFTextStripper

     
  • sflaumen

    sflaumen - 2007-12-01

    Logged In: YES
    user_id=1948467
    Originator: YES

    After looking over the PDFBox code, I would like to suggest that this problem is caused by outdated cmap files in the PDFBox cmap folder. The folder contains cmap files only through the Adobe-Japan1-4 character collection, but Adobe has since added the Adobe-Japan1-5 and Adobe-Japan1-6 collections. See Adobe Technical Note #5078.

    Also, I downloaded the Japanese font support pack for Acrobat Reader 8.0, which did resolve the display issue for this PDF document. After the download, the list of cmap files can be found in Acrobat's Resources folder. However, copying these into the PDFBox cmap folder did not solve the problem. I think this is because the identity cmap files, which are needed to do the conversion, are missing. See the 00_ReadMe.pdf in the PDFBox cmaps folder. Please let me know if I'm on the right track. This technology is new to me. Thanks, Steve
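To test the "missing cmap files" theory, it may help to list which expected files are actually absent from a cmap folder. The sketch below uses only the standard JDK. The class name CMapFolderCheck is hypothetical, the folder path and the list of required names are supplied by the caller, and nothing here reflects how PDFBox itself locates cmap files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: report which expected cmap file names are absent from a folder.
// CMapFolderCheck is a hypothetical helper, not a PDFBox class.
final class CMapFolderCheck {

    /** Returns the names from 'required' that are not in 'present', in the original order. */
    static List<String> missing(List<String> required, Set<String> present) {
        List<String> result = new ArrayList<>();
        for (String name : required) {
            if (!present.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    /** Collects the names of regular files directly inside the given folder. */
    static Set<String> fileNames(Path folder) throws IOException {
        Set<String> names = new HashSet<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(folder)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    names.add(entry.getFileName().toString());
                }
            }
        }
        return names;
    }
}
```

For example, passing a required list containing Identity-H and Identity-V together with fileNames(...) for the PDFBox cmap folder would show whether the identity files mentioned above are present; the exact set of names worth checking is an assumption that has to come from the Adobe documentation.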

     
