Thanks for the info. I will look into this. I also found Jon Iles' RTF Parser Kit on GitHub and will look at that too.


- Chris

Chris Bamford m: +44 7860 405292
Senior Developer p: +44 207 847 8700

On 22 Mar 2014, at 22:34, P Foomer wrote:

I did this (a MediaWiki crawler) by copying an existing source (the web crawler) and calling the library to extract the wiki data elements.

At a guess, I would look at either copying the existing RTF extractor and making a new extractor, perhaps using Tika, or modifying the existing RTF extractor, again possibly using Tika. Modifying is probably easier, as the Aperture model is quite convoluted without a road map.

There were some discussions about using Tika instead of Aperture, but Aperture does two things: it extracts and then does the RDF stuff (I think!!), whereas Tika does only the first.

I feel that in an ideal world the extraction should be divorced from Aperture, but at the time Tika wasn't readily available.

It's not a perfect solution, but it is possible. You will have to study the code, though, and it's not just Java: look down the resource path as well, there be stuff lurking there!!
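
As a rough starting point for the "modify the RTF extractor" route: RTF stores embedded objects as hex text inside `{\object ... {\*\objdata ...}}` groups, so a first step is to find those groups and decode the hex. The sketch below does only that, using plain Java with no Aperture or Tika calls. The class and method names (`RtfObjData`, `extract`) are made up for illustration, and it assumes each `\objdata` group contains only hex digits and whitespace up to its closing brace. The decoded bytes are normally still wrapped in an OLE structure, so further unpacking would be needed to get the original file back.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Aperture, Tika or RTF Parser Kit.
final class RtfObjData {

    private static final String MARKER = "\\objdata";

    /** Returns the decoded bytes of every \objdata group found in the RTF text. */
    static List<byte[]> extract(String rtf) {
        List<byte[]> payloads = new ArrayList<>();
        int pos = rtf.indexOf(MARKER);
        while (pos >= 0) {
            int i = pos + MARKER.length();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int high = -1; // pending high nibble, or -1 if none
            // Assumption: hex digits run until the group closes with '}'.
            for (; i < rtf.length() && rtf.charAt(i) != '}'; i++) {
                int nibble = Character.digit(rtf.charAt(i), 16);
                if (nibble < 0) {
                    continue; // skip whitespace and line breaks
                }
                if (high < 0) {
                    high = nibble;
                } else {
                    out.write((high << 4) | nibble);
                    high = -1;
                }
            }
            payloads.add(out.toByteArray());
            pos = rtf.indexOf(MARKER, i);
        }
        return payloads;
    }
}
```

Calling `RtfObjData.extract(rtfText)` on the document text gives one byte array per embedded object. A real extractor would need a proper RTF tokenizer rather than a string search, which is where Tika or RTF Parser Kit might help.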

[feature-requests:#118] Advice please on extracting embedded files from RTF docs

Status: open
Group: 1.6.0 - features
Created: Fri Mar 21, 2014 09:52 AM UTC by Chris Bamford
Last Updated: Sat Mar 22, 2014 09:05 PM UTC
Owner: nobody


I would like to extract embedded files from RTF docs. Looking through the codebase, I have found an example of how to extract text, but nothing more.
Please can you point me in the right direction?


  • Chris
