Hi Ben,
FontBox cannot correctly parse some CMap files that contain a "cidrange" property (e.g. 90ms-RKSJ-H). When PDFBox extracts text from a PDF that uses such an encoding, it fails as follows:
java.io.IOException: Unknown encoding for '90ms-RKSJ-H'
    at org.pdfbox.encoding.EncodingManager.getEncoding(EncodingManager.java:82)
    at org.pdfbox.pdmodel.font.PDFont.getEncoding(PDFont.java:618)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:471)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
    at TempTestMain.main(TempTestMain.java:56)
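For reference, here is a minimal sketch of how the entries in a begincidrange ... endcidrange block could be read. It is not FontBox's actual code: the class and method names (CidRangeSketch, CidRange, parse, lookup) are hypothetical, and the sample values in main are illustrative rather than copied from 90ms-RKSJ-H. It assumes each entry has the form <startCode> <endCode> firstCid and ignores everything else a real CMap file contains (comments, codespace ranges, usecmap).

```java
import java.util.ArrayList;
import java.util.List;

public class CidRangeSketch {

    /** One cidrange entry: codes start..end map to CIDs beginning at firstCid. */
    static final class CidRange {
        final int start;
        final int end;
        final int firstCid;

        CidRange(int start, int end, int firstCid) {
            this.start = start;
            this.end = end;
            this.firstCid = firstCid;
        }
    }

    /** Parses a hex token such as "<8140>" into its integer value. */
    static int parseHex(String token) {
        return Integer.parseInt(token.substring(1, token.length() - 1), 16);
    }

    /** Collects every entry found between begincidrange and endcidrange. */
    static List<CidRange> parse(String cmapText) {
        List<CidRange> ranges = new ArrayList<>();
        String[] tokens = cmapText.trim().split("\\s+");
        boolean inBlock = false;
        int i = 0;
        while (i < tokens.length) {
            String t = tokens[i];
            if (t.equals("begincidrange")) {
                inBlock = true;
                i++;
            } else if (t.equals("endcidrange")) {
                inBlock = false;
                i++;
            } else if (inBlock && i + 2 < tokens.length) {
                // Assumed entry layout: <startCode> <endCode> firstCid
                int start = parseHex(tokens[i]);
                int end = parseHex(tokens[i + 1]);
                int firstCid = Integer.parseInt(tokens[i + 2]);
                ranges.add(new CidRange(start, end, firstCid));
                i += 3;
            } else {
                i++; // skip tokens outside a cidrange block (e.g. the entry count)
            }
        }
        return ranges;
    }

    /** Returns the CID for a character code, or -1 if no range covers it. */
    static int lookup(List<CidRange> ranges, int code) {
        for (CidRange r : ranges) {
            if (code >= r.start && code <= r.end) {
                return r.firstCid + (code - r.start);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from a real CMap file.
        String sample = "2 begincidrange\n<20> <7e> 1\n<8140> <817e> 633\nendcidrange";
        List<CidRange> ranges = parse(sample);
        System.out.println(lookup(ranges, 0x41));
    }
}
```

A parser along these lines would let a code such as 0x8141 resolve to a CID instead of the whole encoding being rejected as unknown.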
Regards,
Yuta