The current RtfExtractor uses javax.swing.text.rtf.RTFEditorKit to build a data structure from an RTF stream and extract the text from it. This RTF parser seems to be very buggy: in practice we get lots of exceptions, e.g.
java.lang.NullPointerException
at java.util.Hashtable.put(Unknown Source)
at javax.swing.text.rtf.RTFReader$AttributeTrackingDestination.handleKeyword(Unknown Source)
at javax.swing.text.rtf.RTFReader.handleKeyword(Unknown Source)
at javax.swing.text.rtf.RTFParser.write(Unknown Source)
at javax.swing.text.rtf.RTFParser.writeSpecial(Unknown Source)
at javax.swing.text.rtf.AbstractFilter.write(Unknown Source)
at javax.swing.text.rtf.AbstractFilter.readFromStream(Unknown Source)
at javax.swing.text.rtf.RTFEditorKit.read(Unknown Source)
at ...
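For reference, the RTFEditorKit-based extraction described above can be sketched roughly as follows. This is a minimal illustration and not the actual RtfExtractor code: the class name SwingRtfSketch and the choice to wrap parser failures in an IOException are assumptions made for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical sketch, not the real RtfExtractor.
public final class SwingRtfSketch {

    public static String extractText(InputStream rtfStream) throws IOException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        try {
            // Parse the RTF stream into the Swing document model.
            kit.read(rtfStream, doc, 0);
            // Pull the plain text back out of the model.
            return doc.getText(0, doc.getLength());
        } catch (BadLocationException | RuntimeException e) {
            // The parser can fail with unchecked exceptions such as the
            // NullPointerException shown above, so those are caught here too.
            throw new IOException("RTF parsing failed", e);
        }
    }
}
```

Catching RuntimeException only keeps a single bad document from aborting a larger crawl; it does not recover any text from the documents the parser rejects.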
For a proprietary system that needed to extract the text from RTF fragments, we used a different approach based on (hopefully robust) regular expressions. For those RTF fragments it seemed to work flawlessly. We need to see whether this approach also works for other RTF documents: is the implementation reliable and does it scale to larger documents? If so, it can replace the current implementation.
I have attached the static utility method for extracting the text from the RTF string. It can probably be optimized a little, e.g. work on an InputStream, prevent unnecessary trims, etc.
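To make the idea of regex-based extraction concrete, here is a minimal sketch. It is not the attached utility method: the class name RtfTextSketch, the token pattern, and the list of skipped destinations are all assumptions chosen for illustration. It treats \'hh escapes as Latin-1 code points and ignores \u Unicode escapes, so it would need more work before being tried on arbitrary documents.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of regex-driven RTF text extraction.
public final class RtfTextSketch {

    // One alternative per RTF token type; the order matters.
    private static final Pattern TOKEN = Pattern.compile(
        "\\\\([a-zA-Z]+)(-?[0-9]+)? ?"   // control word, optional parameter, one delimiter space
        + "|\\\\'([0-9a-fA-F]{2})"       // hex escape such as \'e9
        + "|\\\\([^a-zA-Z])"             // control symbol such as \{ or \*
        + "|([{}])"                      // group start or end
        + "|[\\r\\n]+"                   // raw line breaks carry no meaning in RTF
        + "|([^\\\\{}\\r\\n]+)");        // plain text run

    // Destinations whose content is not document text (assumed, incomplete list).
    private static final List<String> SKIPPED =
        Arrays.asList("fonttbl", "colortbl", "stylesheet", "info", "pict");

    public static String extractText(String rtf) {
        StringBuilder out = new StringBuilder();
        Deque<Boolean> saved = new ArrayDeque<>(); // skip state of enclosing groups
        boolean skipping = false;
        Matcher m = TOKEN.matcher(rtf);
        while (m.find()) {
            String word = m.group(1), hex = m.group(3);
            String symbol = m.group(4), brace = m.group(5), text = m.group(6);
            if (brace != null) {
                if (brace.equals("{")) {
                    saved.push(skipping);
                } else if (!saved.isEmpty()) {
                    skipping = saved.pop();
                }
            } else if (word != null) {
                if (SKIPPED.contains(word)) {
                    skipping = true;
                } else if (!skipping) {
                    if (word.equals("par") || word.equals("line")) out.append('\n');
                    else if (word.equals("tab")) out.append('\t');
                }
            } else if (symbol != null) {
                if (symbol.equals("*")) skipping = true; // \* marks an ignorable destination
                else if (!skipping && "\\{}".contains(symbol)) out.append(symbol);
            } else if (!skipping) {
                if (hex != null) out.append((char) Integer.parseInt(hex, 16));
                else if (text != null) out.append(text);
            }
        }
        return out.toString();
    }
}
```

A call such as RtfTextSketch.extractText(rtfString) returns the text with paragraph breaks mapped to newlines. Because the matcher walks the input once and keeps only a small stack of group states, the work grows linearly with document size, which bears on the scaling question raised above.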
utility class for extracting text from RTF
This age-old issue was fixed by the latest Tika-powered RTFExtractor improvements committed in rev2578.