#25 PDFTextStripper not handling some Japanese

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2007-11-29

Created: 2007-11-29

Creator: sflaumen

Private: No

Using this code sequence:

PDDocument document = PDDocument.load(stream);
PDFTextStripper stripper = new PDFTextStripper();
String contents = stripper.getText(document);

some Japanese documents are handled properly, as can be seen by inspecting the characters in the String "contents".
Other Japanese documents, however, produce garbage non-Japanese characters in "contents".
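For anyone trying to reproduce this without eyeballing the output, here is a minimal sketch of a check that could be run on the extracted String. It is not part of PDFBox; the class and method names (JapaneseTextCheck, isJapanese, japaneseRatio) are hypothetical, and the choice of Unicode blocks is an assumption about what counts as "Japanese" text. It only uses java.lang.Character.

```java
// Hypothetical helper, not part of PDFBox: estimates how much of a string is Japanese text.
class JapaneseTextCheck {

    /** Returns true if the code point lies in a Unicode block commonly used for Japanese text. */
    static boolean isJapanese(int codePoint) {
        // Assumption: these five blocks are a reasonable approximation of Japanese content.
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.HIRAGANA
                || block == Character.UnicodeBlock.KATAKANA
                || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                || block == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    /** Fraction of non-whitespace code points in those blocks (0.0 for empty input). */
    static double japaneseRatio(String text) {
        int total = 0;
        int japanese = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // ignore spaces and line breaks
            }
            total++;
            if (isJapanese(cp)) {
                japanese++;
            }
        }
        return total == 0 ? 0.0 : (double) japanese / total;
    }

    public static void main(String[] args) {
        // A correctly extracted Japanese string versus a typical mis-decoded one.
        System.out.println(japaneseRatio("\u65E5\u672C\u8A9E\u306E\u30C6\u30B9\u30C8"));
        System.out.println(japaneseRatio("\u00C8\u00FA\u00B1\u00BE"));
    }
}
```

Calling JapaneseTextCheck.japaneseRatio(contents) after getText should give a ratio close to 1.0 for a mostly Japanese document that was extracted correctly, and a ratio near 0.0 for the garbage case described here.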

The documents that PDFTextStripper does not handle properly display a prompt when opened in Acrobat Reader, saying that a Japanese language support pack needs to be installed to view the document properly. The documents that are handled properly display their Japanese characters fine in Acrobat Reader. Installing the language support pack is not a solution, since it would only fix the display in Acrobat Reader. This code needs to run on a Unix server, so even if the support pack helped on a PC (unlikely), it would have no effect on the task when run on Unix.

This appears to be an encoding issue. However, unlike similar issues that have been reported, the above code completes successfully; it is only the results that are wrong, as described above.

Attached is an example of a PDF file that is not handled properly by PDFTextStripper and requires a Japanese language pack to view in Acrobat Reader.

Discussion

sflaumen - 2007-11-29

Not handled properly by PDFTextStripper

JS51ZX3PWT1G.pdf

sflaumen - 2007-12-01

After looking over the code in PDFBox, I would like to suggest that this problem is caused by not having the latest cmap files in the PDFBox cmap folder. That folder contains cmap files through the Adobe-Japan1-4 character collection, but Adobe has since added the Adobe-Japan1-5 and Adobe-Japan1-6 collections. See Adobe Technical Note #5078.
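To make the supplement comparison concrete, here is a rough sketch of how a collection name such as Adobe-Japan1-6 could be split into its Registry, Ordering and Supplement parts and compared against the newest supplement that the available cmap files cover. This is only an illustration, not how PDFBox does it; the class CharacterCollection and its methods are hypothetical names.

```java
// Hypothetical helper, not part of PDFBox: models a Registry-Ordering-Supplement name.
class CharacterCollection {
    final String registry;   // e.g. "Adobe"
    final String ordering;   // e.g. "Japan1"
    final int supplement;    // e.g. 6

    CharacterCollection(String registry, String ordering, int supplement) {
        this.registry = registry;
        this.ordering = ordering;
        this.supplement = supplement;
    }

    /** Parses a name of the form Registry-Ordering-Supplement, such as "Adobe-Japan1-6". */
    static CharacterCollection parse(String name) {
        String[] parts = name.split("-");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "Expected Registry-Ordering-Supplement but got: " + name);
        }
        return new CharacterCollection(parts[0], parts[1], Integer.parseInt(parts[2]));
    }

    /** True if this collection is the same family as, and no newer than, the newest available one. */
    boolean isCoveredBy(CharacterCollection newestAvailable) {
        return registry.equals(newestAvailable.registry)
                && ordering.equals(newestAvailable.ordering)
                && supplement <= newestAvailable.supplement;
    }

    public static void main(String[] args) {
        CharacterCollection available = parse("Adobe-Japan1-4");
        CharacterCollection required = parse("Adobe-Japan1-6");
        System.out.println(required.isCoveredBy(available));
    }
}
```

If a document's fonts declare a supplement higher than the one the bundled cmap files were built for, a check like this would flag it before any text is extracted. Whether that is really what goes wrong with the attached file is still only my guess.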

Also, I downloaded the Japanese font support pack for Acrobat Reader 8.0, which did resolve the display issue when viewing this PDF document. After the download, you can find the list of cmap files in the Resources folder for Acrobat. However, copying these into the PDFBox cmap folder did not solve the problem. I think this is because the identity cmap files, which are needed to do the conversion, are missing. See the 00_ReadMe.pdf in the PDFBox cmaps folder. Please let me know if I'm on the right track. This technology is new to me. Thanks, Steve
