Aperture / Feature Requests / #78 improved RtfExtractor

#78 improved RtfExtractor

Milestone: 1.6.0 - features

Status: closed

Owner: Antoni Mylka

Labels: None

Priority: 5

Updated: 2011-11-28

Created: 2009-09-03

Creator: Christiaan Fluit

Private: No

The current RtfExtractor uses javax.swing.text.rtf.RTFEditorKit to build a data structure from an RTF stream and extract the text from it. This RTF parser seems to be very buggy: in practice we get lots of exceptions, e.g.

java.lang.NullPointerException
at java.util.Hashtable.put(Unknown Source)
at javax.swing.text.rtf.RTFReader$AttributeTrackingDestination.handleKeyword(Unknown Source)
at javax.swing.text.rtf.RTFReader.handleKeyword(Unknown Source)
at javax.swing.text.rtf.RTFParser.write(Unknown Source)
at javax.swing.text.rtf.RTFParser.writeSpecial(Unknown Source)
at javax.swing.text.rtf.AbstractFilter.write(Unknown Source)
at javax.swing.text.rtf.AbstractFilter.readFromStream(Unknown Source)
at javax.swing.text.rtf.RTFEditorKit.read(Unknown Source)
at ...
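
For reference, the RTFEditorKit-based approach looks roughly like the sketch below. This is not the actual RtfExtractor source; the class name SwingRtfSketch and the method extractText are made up for illustration. It catches RuntimeException as well, because the parser can fail with unchecked exceptions such as the NullPointerException in the trace.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical illustration of Swing-based RTF text extraction.
final class SwingRtfSketch {

    // Parses the stream into a Swing Document and returns its plain text.
    static String extractText(InputStream in) {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        try {
            kit.read(in, doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (IOException | BadLocationException | RuntimeException e) {
            // The Swing parser may throw unchecked exceptions on input it
            // cannot handle, so all failures are wrapped in one place.
            throw new IllegalStateException("RTF parsing failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] rtf = "{\\rtf1\\ansi Hello, RTF!}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(extractText(new ByteArrayInputStream(rtf)));
    }
}
```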

For a proprietary system that needed to extract the text from RTF fragments, we used a different approach based on (hopefully robust) regular expressions. For these RTF fragments it seemed to work flawlessly. We need to see whether this approach also works for other RTF documents: is the implementation reliable and does it scale to larger documents? If so, it can replace the current implementation.

I have attached the static utility method for extracting the text from the RTF string. It can probably be optimized a little, e.g. by working on an InputStream, avoiding unnecessary trim() calls, etc.
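
To give an idea of what a regular-expression approach can look like, here is a minimal sketch. It is not the attached RTFTextExtractor.java; the class name RtfRegexSketch, the patterns, and the order of the steps are all assumptions chosen for illustration. It treats \'hh escapes as Latin-1, handles only one level of nesting inside skipped groups such as \fonttbl, and does not decode \uN Unicode escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based RTF-to-text sketch; not the attached utility.
final class RtfRegexSketch {

    // Groups that carry no body text; allows one level of nested braces.
    private static final Pattern DESTINATION = Pattern.compile(
            "\\{\\\\(?:\\*|fonttbl|colortbl|stylesheet|info)(?:[^{}]++|\\{[^{}]*+\\})*+\\}");
    private static final Pattern HEX = Pattern.compile("\\\\'([0-9a-fA-F]{2})");
    private static final Pattern PAR = Pattern.compile("\\\\(?:par|line)(?![a-zA-Z]) ?");
    private static final Pattern TAB = Pattern.compile("\\\\tab(?![a-zA-Z]) ?");
    private static final Pattern CONTROL_WORD = Pattern.compile("\\\\[a-zA-Z]+-?\\d* ?");
    private static final Pattern CONTROL_SYMBOL = Pattern.compile("\\\\[^a-zA-Z]");

    static String extractText(String rtf) {
        // Protect escaped literals (\\, \{, \}) with private-use placeholders.
        String s = rtf.replace("\\\\", "\uE000")
                      .replace("\\{", "\uE001")
                      .replace("\\}", "\uE002");
        // Raw line breaks in the RTF source are not part of the text.
        s = s.replaceAll("[\\r\\n]+", "");
        s = DESTINATION.matcher(s).replaceAll("");

        // Decode \'hh escapes (assumption: Latin-1 code points).
        Matcher m = HEX.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        s = sb.toString();

        s = PAR.matcher(s).replaceAll("\n");
        s = TAB.matcher(s).replaceAll("\t");
        s = CONTROL_WORD.matcher(s).replaceAll("");
        s = CONTROL_SYMBOL.matcher(s).replaceAll("");
        s = s.replaceAll("[{}]", "");

        // Restore the protected literals.
        s = s.replace('\uE000', '\\').replace('\uE001', '{').replace('\uE002', '}');
        return s.trim();
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0\\fs24 "
                + "Hello, {\\b RTF} world!\\par Caf\\'e9 \\{ok\\}\\par}";
        System.out.println(extractText(rtf));
    }
}
```

A sketch like this holds the whole document in memory and makes several passes over it, so the question of how it scales to larger documents applies here as well.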

Discussion

Christiaan Fluit - 2009-09-03

utility class for extracting text from RTF

RTFTextExtractor.java

Antoni Mylka - 2011-11-28

milestone: --> 1.6.0 - features

Antoni Mylka - 2011-11-28

status: open --> closed

Antoni Mylka - 2011-11-28

This age-old issue got fixed with the latest Tika-powered RtfExtractor improvements committed in rev2578.
