Thanks for the info. I will look into this. I also found Jon Iles' RTF Parser Kit on GitHub and will look at that too.


I did something similar (a MediaWiki crawler) by copying an existing source (the web crawler) and calling the library to extract the wiki data elements.

At a guess, I would either copy the existing RTF extractor and make a new extractor from it, perhaps using Tika, or modify the existing RTF extractor, again possibly using Tika. Modifying the existing one is probably easier, as the Aperture model is quite convoluted without a road map.
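To give an idea of what the core of such an extractor has to do, here is a minimal sketch in plain Java. It is not Aperture or Tika code, and the class and method names are made up for illustration. It assumes embedded objects are stored as hex text in \objdata groups, and that the payload contains nothing but hex digits and whitespace up to the closing brace. The decoded bytes are normally still wrapped in an OLE container, so getting back the original file needs a further unpacking step that is not shown here.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of Aperture or Tika. */
final class RtfObjDataSketch {

    private static final String MARKER = "\\objdata";

    /** Returns the hex-decoded bytes of every \objdata group found in the RTF text. */
    static List<byte[]> extractObjData(String rtf) {
        List<byte[]> result = new ArrayList<>();
        int pos = rtf.indexOf(MARKER);
        while (pos >= 0) {
            int i = pos + MARKER.length();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int high = -1; // pending high nibble, or -1 if none is pending
            // Assumption: the payload runs until the group closes with '}'.
            while (i < rtf.length() && rtf.charAt(i) != '}') {
                int digit = Character.digit(rtf.charAt(i), 16);
                if (digit >= 0) {
                    if (high < 0) {
                        high = digit;
                    } else {
                        out.write((high << 4) | digit);
                        high = -1;
                    }
                }
                // Non-hex characters (spaces, line breaks) are skipped.
                i++;
            }
            result.add(out.toByteArray());
            pos = rtf.indexOf(MARKER, i);
        }
        return result;
    }
}
```

Each returned byte array could then be handed to whatever unpacks the OLE wrapper, or to Tika if you go that route.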

There were some discussions about using Tika instead of Aperture, but Aperture does two things: it extracts the content and then does the RDF stuff (I think!!), whereas Tika does only the first.

I feel that in an ideal world the extraction should be divorced from Aperture, but at the time Tika wasn't readily available.

Not a perfect solution, but it is possible. You will have to study the code, though, and it's not just Java: look down the resource path as well, there be stuff lurking there!!

I would like to extract embedded files from RTF docs. Looking through the codebase I have found an example of how to extract text, but nothing more.
Please can you point me in the right direction?


