I did something similar for the MediaWiki crawler: I copied an existing source (the web crawler) and called the library to extract the wiki data elements.
At a guess, I would either copy the existing RTF extractor and make a new extractor, perhaps using Tika, or modify the existing RTF extractor, again possibly using Tika. Modifying is probably easier, as the Aperture model is quite convoluted without a road map.
There were some discussions about using Tika instead of Aperture, but Aperture does two things: it extracts and then does the RDF stuff (I think!), whereas Tika does only the first.
I feel that in an ideal world the extraction would be divorced from Aperture, but at the time Tika wasn't readily available.
It's not a perfect solution, but it is possible. You will have to study the code, though, and it's not just Java: look down the resource path as well, there be stuff lurking there!
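To give a feel for what a new or modified extractor would need to dig out, here is a rough, library-free sketch. It is not how Aperture or Tika does it. It only relies on the RTF convention that an embedded object's payload sits as hex digits inside an \objdata group. The class and method names (RtfObjDataSketch, extractObjData) are made up for illustration, and the sketch ignores nested groups and \bin runs inside the payload.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Aperture or Tika.
class RtfObjDataSketch {

    /** Returns the decoded bytes of every \objdata group found in the RTF text. */
    static List<byte[]> extractObjData(String rtf) {
        List<byte[]> results = new ArrayList<>();
        String marker = "\\objdata";
        int pos = rtf.indexOf(marker);
        while (pos >= 0) {
            int i = pos + marker.length();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int high = -1; // pending high nibble, or -1 if none
            // Collect hex digits until the group closes; whitespace and
            // line breaks between digits are skipped.
            while (i < rtf.length() && rtf.charAt(i) != '}') {
                int nibble = Character.digit(rtf.charAt(i), 16);
                if (nibble >= 0) {
                    if (high < 0) {
                        high = nibble;
                    } else {
                        out.write((high << 4) | nibble);
                        high = -1;
                    }
                }
                i++;
            }
            results.add(out.toByteArray());
            pos = rtf.indexOf(marker, i);
        }
        return results;
    }
}
```

The bytes you get back are usually still wrapped in an OLE container rather than being the original file, so a second unpacking step would be needed. That is the part where a library such as Tika is likely to save the most work.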
Advice please on extracting embedded files from RTF docs
Group: 1.6.0 - features
Created: Fri Mar 21, 2014 09:52 AM UTC by Chris Bamford
Last Updated: Sat Mar 22, 2014 09:05 PM UTC
I would like to extract embedded files from RTF docs. Looking through the codebase I have found an example of how to extract text, but nothing more.
Please can you point me in the right direction?