I see. I am in the half that doesn't want RDF, as my priority is text and embedded file extraction only, so I guess Tika is the way forward for me!
(I originally evaluated Tika vs. Aperture some 4 years ago and plumped for Aperture, as its text extraction was on the whole more accurate.)

I don't need crawlers either, as I deal with files locally, so really I just need the best extractor technology.

Thanks guys



Chris Bamford m: +44 7860 405292 w: www.mimecast.com
Senior Developer p: +44 207 847 8700


On 25 Mar 2014, at 00:09, "Antoni Mylka" <mylka@users.sf.net> wrote:


Tika graduated from the Apache Incubator to a Top-Level Project a long time ago. The homepage is now http://tika.apache.org

Aperture was founded in 2005. At that time there was no Tika. It was founded by Leo Sauermann from DFKI and Christiaan Fluit from Aduna. Leo left DFKI sometime in 2009. His successor, Christian Reuschling, used Aperture in his projects until I left DFKI at the end of 2011. Afterwards he decided he'd rather use Tika and develop some crawling functionality on top of it. RDF wasn't his priority. The result of this is the Leech Crawler http://leechcrawler.github.io/leech/. Christiaan Fluit and Aduna changed their focus completely.

So, the logical continuation of Aperture is Tika for text extraction and Leech Crawler for crawling. The Leech Crawler hasn't been updated for a year though, so I don't know how active it is at the moment. I haven't used it myself. It may work though, as long as RDF is not your priority.

If RDF is your priority, then I'm afraid there is no successor. You could try any23.apache.org if you want to extract RDF triples from various sources (my personal experience with any23 output is rather bad though; I would recommend evaluating it carefully before you commit to it), or you could develop your own code to convert Tika's Metadata objects into RDF triples. You could also develop your own bridge between Tika and Aperture. Back in the day we discussed a TikaSubCrawler or something similar. Nothing tangible came of it though.
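
To give a rough idea of the "convert Tika's Metadata objects into RDF triples" option, here is a minimal sketch in plain Java. It is only an illustration, not anything Aperture or Tika ships. It assumes you have already copied the Tika metadata into a Map (for example by looping over Metadata.names() and calling Metadata.getValues(name)), that the subject URI is already a valid IRI, and that every value is written as a plain string literal. The class name MetadataToNTriples and the namespace http://example.org/tika-metadata# are made up for the example; you would substitute your own ontology.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns name/value metadata into N-Triples lines.
final class MetadataToNTriples {

    // Made-up predicate namespace; replace it with your own ontology.
    static final String EXAMPLE_NS = "http://example.org/tika-metadata#";

    // One triple per (property name, value) pair. The map is assumed to be
    // filled from Tika beforehand, e.g.:
    //   for (String n : tikaMetadata.names())
    //       map.put(n, Arrays.asList(tikaMetadata.getValues(n)));
    static List<String> toNTriples(String subjectUri,
                                   Map<String, List<String>> metadata,
                                   String predicateNs) {
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            String predicate = "<" + predicateNs + encode(entry.getKey()) + ">";
            for (String value : entry.getValue()) {
                triples.add("<" + subjectUri + "> " + predicate
                        + " \"" + escapeLiteral(value) + "\" .");
            }
        }
        return triples;
    }

    // Percent-encodes a property name such as "dc:title" so that it can be
    // appended to the namespace without producing an invalid IRI.
    static String encode(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Escapes the characters that may not appear raw in an N-Triples literal.
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", Arrays.asList("application/rtf"));
        metadata.put("dc:title", Arrays.asList("Quarterly report"));
        for (String triple : toNTriples("file:///tmp/report.rtf", metadata, EXAMPLE_NS)) {
            System.out.println(triple);
        }
    }
}
```

Mapping every property name into a single namespace is the crudest possible choice. Anyone with their own ontology would replace the predicate construction with a lookup table from Tika property names to their own predicate URIs.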

In hindsight, the biggest issue with Aperture was that half of the world did not want RDF. As for the other half, everyone had their own ontologies and the Aperture output wasn't configurable. We tried to push for a "standard" ontology, but that's not an easy task (see http://www.semanticdesktop.org/ontologies/ for the current status).

[feature-requests:#118] Advice please on extracting embedded files from RTF docs

Status: open
Group: 1.6.0 - features
Created: Fri Mar 21, 2014 09:52 AM UTC by Chris Bamford
Last Updated: Mon Mar 24, 2014 09:43 PM UTC
Owner: nobody


I would like to extract embedded files from RTF docs. Looking through the codebase I have found an example of how to extract text, but nothing more.
Please can you point me in the right direction?


- Chris

Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/aperture/feature-requests/118/
