#78 improved RtfExtractor

1.6.0 - features

The current RtfExtractor uses javax.swing.text.rtf.RTFEditorKit to build a data structure from an RTF stream and extract the text from it. This RTF parser seems to be very buggy: in practice we get lots of exceptions, e.g.

at java.util.Hashtable.put(Unknown Source)
at javax.swing.text.rtf.RTFReader$AttributeTrackingDestination.handleKeyword(Unknown Source)
at javax.swing.text.rtf.RTFReader.handleKeyword(Unknown Source)
at javax.swing.text.rtf.RTFParser.write(Unknown Source)
at javax.swing.text.rtf.RTFParser.writeSpecial(Unknown Source)
at javax.swing.text.rtf.AbstractFilter.write(Unknown Source)
at javax.swing.text.rtf.AbstractFilter.readFromStream(Unknown Source)
at javax.swing.text.rtf.RTFEditorKit.read(Unknown Source)
at ...
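
For reference, the Swing-based approach boils down to something like the sketch below. This is not the actual RtfExtractor source; the class name SwingRtfText and the method extract are made up for illustration, and only the javax.swing calls are real API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper, not the Aperture class: shows the RTFEditorKit round trip.
final class SwingRtfText {

    static String extract(byte[] rtf) {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        try {
            // The whole RTF stream is parsed into a Swing Document first ...
            kit.read(new ByteArrayInputStream(rtf), doc, 0);
            // ... and only then is the plain text read back out of it.
            return doc.getText(0, doc.getLength());
        } catch (IOException | BadLocationException e) {
            // Checked failures are wrapped; runtime exceptions thrown inside
            // the parser propagate to the caller unchanged.
            throw new IllegalStateException("RTF parsing failed", e);
        }
    }
}
```

Because the text only becomes available after the complete document model has been built, any exception inside the parser means no text at all is extracted for that document.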

For a proprietary system that needed to extract the text from RTF fragments, we used a different approach based on (hopefully robust) regular expressions. For these RTF fragments it seemed to work flawlessly. We need to see whether this approach also works for other RTF documents: is the implementation reliable, and does it scale to larger documents? If so, it can replace the current implementation.

I have attached the static utility method for extracting the text from the RTF string. It can probably be optimized a little, e.g. by working on an InputStream, avoiding unnecessary trims, etc.
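
To give an idea of what a regular-expression-based extractor can look like, here is a minimal sketch. It is not the attached utility class: the class name RegexRtfText, the method extractText and all patterns are assumptions made for illustration. It assumes that \'hh escapes can be read as Latin-1 code points, that non-text groups such as the font table nest at most one level deep, and it ignores Unicode escapes (\uN) and control symbols such as \~.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a regex-based RTF-to-text pass, not the attached class.
final class RegexRtfText {

    // Metadata groups (font table, colors, styles, info, pictures, \* destinations),
    // allowing one level of nested groups inside them.
    private static final Pattern NON_TEXT_GROUP = Pattern.compile(
            "\\{\\\\(?:fonttbl|colortbl|stylesheet|info|pict|\\*)(?:[^{}]++|\\{[^{}]*+\\})*+\\}");
    private static final Pattern PARAGRAPH = Pattern.compile("\\\\(?:par|line)(?![a-zA-Z]) ?");
    private static final Pattern TAB = Pattern.compile("\\\\tab(?![a-zA-Z]) ?");
    // Any other control word: backslash, letters, optional numeric argument, optional space.
    private static final Pattern CONTROL_WORD = Pattern.compile("\\\\[a-zA-Z]+-?\\d* ?");
    private static final Pattern UNESCAPED_BRACE = Pattern.compile("(?<!\\\\)[{}]");
    private static final Pattern ESCAPED_SYMBOL = Pattern.compile("\\\\([\\\\{}])");
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\'([0-9a-fA-F]{2})");

    static String extractText(String rtf) {
        String s = rtf.replaceAll("[\\r\\n]", "");        // raw line breaks are not text in RTF
        s = NON_TEXT_GROUP.matcher(s).replaceAll("");     // drop font table and similar groups
        s = PARAGRAPH.matcher(s).replaceAll("\n");        // \par and \line become newlines
        s = TAB.matcher(s).replaceAll("\t");              // \tab becomes a tab character
        s = CONTROL_WORD.matcher(s).replaceAll("");       // strip remaining formatting keywords
        s = UNESCAPED_BRACE.matcher(s).replaceAll("");    // strip group delimiters
        s = ESCAPED_SYMBOL.matcher(s).replaceAll("$1");   // \\ \{ \} become literal characters
        return decodeHexEscapes(s).trim();
    }

    // Turns \'hh into the character with that code (assumption: Latin-1 code page).
    private static String decodeHexEscapes(String s) {
        Matcher m = HEX_ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A call such as RegexRtfText.extractText(rtfString) runs a fixed number of passes over the input and never builds a document model, so malformed markup degrades into stray characters in the output rather than an exception. The order of the passes matters: the paragraph and tab keywords must be translated before the generic control-word pattern removes them.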


  • Christiaan Fluit

    utility class for extracting text from RTF

  • Antoni Mylka

    Antoni Mylka - 2011-11-28
    • milestone: --> 1.6.0 - features
  • Antoni Mylka

    Antoni Mylka - 2011-11-28
    • status: open --> closed
  • Antoni Mylka

    Antoni Mylka - 2011-11-28

    This age-old issue got fixed with the latest Tika-powered RTFExtractor improvements committed in rev2578.

