Thanks Leo, we’ll take a look.

FYI, one of the goals of Tika is to be extremely light-weight and to provide a canonical metadata representation, independent of any particular “view” of metadata. In my mind, RDF is as much a view as RSS, FGDC, JSON, or any of the myriad others out there. Sure, it comes with inference and all of the other promised goodies, but in my experience I’ve seen little real use of those in data management systems. I’ve seen RDF used more as a nice, compact XML format for representing and interchanging metadata than as anything else. I’d be opposed to making it the standard in Tika, though, because, as I said, to me it’s just a view.
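
To make the “view” idea concrete, here is a minimal sketch in Python. It is not Tika code: the record, the function names, and the `http://example.org/meta#` namespace are all hypothetical, chosen only to illustrate one canonical key/value record being rendered into two different views (JSON and RDF serialized as N-Triples).

```python
import json

# Hypothetical canonical record: plain key/value metadata with no
# view-specific structure. The keys and values are illustrative only.
canonical = {
    "title": "Quarterly report",
    "author": "A. Author",
    "content-type": "application/pdf",
}


def to_json_view(meta):
    """Render the canonical record as a JSON document."""
    return json.dumps(meta, sort_keys=True)


def to_ntriples_view(meta, subject, ns="http://example.org/meta#"):
    """Render the same record as N-Triples, one triple per key.

    The namespace is an assumption; a real mapping would pick an
    established vocabulary for each key.
    """
    lines = []
    for key in sorted(meta):
        # Escape backslashes, quotes, and newlines in the literal.
        literal = (
            meta[key]
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'<{subject}> <{ns}{key}> "{literal}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_json_view(canonical))
    print(to_ntriples_view(canonical, "file:///tmp/report.pdf"))
```

Neither renderer changes the canonical record, so adding another view (RSS, FGDC, and so on) would mean adding one more function rather than changing the core representation.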

Regardless, thanks for reaching out. I have a number of downstream ideas for making Tika more useful for showing different metadata “views,” as I call them, and I plan to start implementing and contributing some of them in the coming year, as soon as this book [1] starts to wrap up :) I think a number of other Tika community members have been doing a fantastic job of keeping the metadata capabilities in Tika simple, light-weight, and feature-rich, and I expect that to continue.



On 11/14/10 1:13 AM, "Leo Sauermann" <> wrote:

Hi Tika,
(cc Aperture, just fyi)

I stumbled upon

The problems don't stop there,
if you think it through you end up with zip-files containing zip-files
containing .pst and email files containing attached word documents
containing embedded excel.

In the SourceForge project "Aperture" (it's similar to Tika) the solution
was to use the W3C standard RDF, which allows information to be nested
to arbitrary depth. This was also used in the NEPOMUK-KDE Linux
implementation, but there in C++ and with a slightly different angle.

it may be useful to check out their documentation and their status of

the data model:

this is the specific model of stacking things into each other:

the stacking/recursive problem was solved using "subcrawlers":

general structure of things coming together:

From my experience (I am co-author and was initiator of most of the
above) there is only a limited short-term benefit to adopting this
thinking, but a bigger long-term benefit, as being compatible with
RDF/W3C will in the long run make Tika compatible with what happens in
HTML5 and other standardization efforts.
Looking at this material could serve as a guideline for decisions in Tika.

So - Could anyone please think about it for a minute and add these links
and some ideas how to deal with it to

Leo Sauermann, Dr.
CEO and Founder

There used to be a much closer tie between Tika and Aperture in 2007,
but as Aperture development is more or less finished (it's in production
now at some places, and fixes are only made when needed) it seems
communication between them has dropped off a bit. Does anyone know why?

mobile: +43 6991 gnowsis

helping people remember,

so join our newsletter

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA