Thanks Leo, appreciate the discussion.

Cheers,
Chris


On 11/15/10 7:15 AM, "Leo Sauermann" <leo.sauermann@gnowsis.com> wrote:

Hi,

ok, good feedback, thanks for taking the time to answer.

I feel an urge to take an "RDF vs JSON/..." discussion off-list, as I
have seen this discussion since 1999. btw, RSS originally meant "RDF
Site Summary"... so RDF>RSS

but it's an important discussion, so here is more input:

RDF is the only cross-format standard out there: there are standardized
representations in XML, JSON, HTML, and databases. That would make it a
good fit for frameworks such as Tika.

of course, the 120 minutes it takes to learn RDF are longer than the 10
minutes it takes to learn JSON. My experience is that for data
integration projects, the extra 110 minutes pay off.

I guess that's the reason why Facebook and Google dig RDF now... it is
the only proper way to let data flow from databases out to the web and
back into other databases.
(that's what Google now supports with price databases and the RDF-based
"GoodRelations" ecommerce SEO format)


if the consensus within Tika is - "rdf is too complex for us, we don't
need it", that's fine.
It took Sebastian Trüg about a year of discussion on the KDE
mailing lists to explain why RDF is better suited for data integration
in document indexing before the KDE people were convinced to switch the
system search engine to RDF.

some points:
Inference - please ignore this, you don't need it.

Field definition - you will soon have a problem in Tika when you want to
crawl vCard and iCal files and extract the full richness of ALL data
embedded in those formats. Here RDF helped Aperture a lot.
So for the whole area of types with their fields, subfields, and
hierarchical fields, RDF could help.
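
To make the "fields and subfields" point concrete, here is a minimal
sketch (plain Python, no RDF library) of how a nested contact record can
be flattened into subject/predicate/object triples. The field names, the
`urn:example:` identifier, and the `to_triples` helper are all made up
for illustration; this is not how Aperture or Tika actually do it.

```python
from itertools import count


def to_triples(subject, fields, ids=None):
    """Flatten nested fields into (subject, predicate, object) triples.

    A nested dict becomes a fresh intermediate node (similar in spirit to
    an RDF blank node), so subfields stay grouped under their parent field.
    A list yields one triple per item, so repeated fields are not lost.
    """
    if ids is None:
        ids = count(1)  # per-call counter for intermediate node names
    triples = []
    for name, value in fields.items():
        if isinstance(value, dict):
            node = f"_:b{next(ids)}"
            triples.append((subject, name, node))
            triples.extend(to_triples(node, value, ids))
        elif isinstance(value, list):
            for item in value:
                triples.extend(to_triples(subject, {name: item}, ids))
        else:
            triples.append((subject, name, value))
    return triples


# Hypothetical vCard-like record with a repeated field and a structured field.
contact = {
    "fullName": "Jane Example",
    "email": ["jane@example.org", "j.example@example.com"],
    "address": {"locality": "Vienna", "country": "Austria"},
}

triples = to_triples("urn:example:contact1", contact)
for triple in triples:
    print(triple)
```

The structured address ends up as its own node with two triples hanging
off it, which is roughly what a flat key/value metadata map cannot
express without inventing naming conventions for keys.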

XML - whatever, RDF is serialization-agnostic. It works best in internal
APIs I guess, where data should flow from one component to another
without being reformatted.
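
A small sketch of what "serialization-agnostic" means in practice: the
same in-memory triples are written out once as N-Triples-style lines and
once as a JSON grouping. The document URL and the two writer functions
are hypothetical, all objects are treated as plain string literals, and
the JSON layout is just one possible grouping, not a standardized
RDF/JSON format.

```python
import json

# The data itself: a list of (subject, predicate, object) tuples.
triples = [
    ("http://example.org/doc1",
     "http://purl.org/dc/elements/1.1/title", "Quarterly report"),
    ("http://example.org/doc1",
     "http://purl.org/dc/elements/1.1/creator", "Jane Example"),
]


def to_ntriples(triples):
    """Write one statement per line in N-Triples style (literals only)."""
    lines = []
    for s, p, o in triples:
        # Escape backslashes and double quotes inside the literal.
        literal = o.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{literal}" .')
    return "\n".join(lines)


def to_json(triples):
    """Group the same triples by subject and predicate and dump as JSON."""
    grouped = {}
    for s, p, o in triples:
        grouped.setdefault(s, {}).setdefault(p, []).append(o)
    return json.dumps(grouped, indent=2, sort_keys=True)


print(to_ntriples(triples))
print(to_json(triples))
```

Components that pass the tuple list around never need to care which of
the two writers is used at the edge of the system.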



Let's see it the other way round:

if you need info on why RDF is better than anything else (ho ho ho), ask
on the Aperture-dev mailing list; people there are eager to help, I guess.

Grant Ingersoll used to hang out over at the Aperture-dev mailing list.

if this is ok, I would end this thread from my side now and say: if
the question pops up, get in touch with the Aperture or KDE people.

if there is a need to get inspired, the Aperture people are there to help.

I would guess the same can be said for the developers of the KDE Linux
desktop indexing. They also use RDF as their format, and there is an
overarching standardization effort (OSCAF.org) amongst all of us... that
could also be a place to discuss; around a million EUR was spent just
discussing those RDF data formats (ontologies) that are now running ;-)

I cc Sebastian Trüg in this mail, he is the main developer and
boss-of-ontologies at KDE. I guess that Tika people are welcome to check
out what happens on the KDE/Gnome side on the "Xesam" mailing list.
There is (not enough) documentation here on whom to ask in case of
questions:
http://sourceforge.net/apps/trac/oscaf/wiki/Communication
http://sourceforge.net/apps/trac/oscaf/wiki/Ontologies

best
Leo


It was Mattmann, Chris A (388J) who said at the right time 14.11.2010
17:48 the following words:
> Thanks Leo, we'll take a look.
>
> FYI, one of the goals of Tika is to be extremely light-weight, and to
> provide a canonical metadata representation, independent of any
> particular "view" of metadata. In my mind RDF is just as much one of
> those views as, e.g., RSS, FGDC, JSON, or any one of the myriad views
> out there. Sure it comes with inference, and all of the other promised
> goodies, but in my experience, I've seen little real use of those in
> data management systems. I've seen more use of RDF as a nice, compact
> XML format to represent metadata and allow interchange than anything
> else. I'd be opposed to making it the standard in Tika though, as I
> said b/c to me it's just a view.
>
> Regardless, thanks for reaching out and I have a number of downstream
> ideas for helping Tika become more useful for showing different
> metadata "views" as I call them and plan on starting to
> implement/contribute some of them in the coming year, as soon as this
> book [1] starts to wrap up :) I think a number of other Tika
> community members have been doing a fantastic job at keeping the
> metadata capabilities in Tika simple, light-weight, and feature-rich,
> and I expect it to continue down that path.
>
> Cheers, Chris
>
> [1] http://www.manning.com/mattmann/
>
> On 11/14/10 1:13 AM, "Leo Sauermann" <leo.sauermann@gnowsis.com>
> wrote:
>
> Hi Tika, (cc Aperture, just fyi)
>
> I stumbled upon http://wiki.apache.org/tika/MetadataDiscussion and
> http://wiki.apache.org/tika/RecursiveMetadata
>
>
> The problems don't stop there, if you think it through you end up
> with zip-files containing zip-files containing .pst and email files
> containing attached word documents containing embedded excel.
>
> In the SourceForge project "Aperture" (it's similar to Tika) the
> solution was to use the W3C standard RDF, which allows nesting
> information inside other information to any depth. This was also
> used in the NEPOMUK-KDE Linux implementation, but there in C++ and
> with a slightly different angle to it.
>
> it may be useful to check out their documentation and the status
> of their discussion:
>
> the data model: http://www.semanticdesktop.org/ontologies/
>
> this is the specific model of stacking things into each other:
> http://www.semanticdesktop.org/ontologies/2007/01/19/nie/
>
> the stacking/recursive problem was solved using "subcrawlers":
> http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers
>
> general structure of things coming together:
> http://sourceforge.net/apps/trac/aperture/wiki/GeneralStructure
>
>
> From my experience (I am co-author and was initiator of most of the
> above) there is only a limited short-term benefit of adopting this
> thinking, but a bigger long-term benefit, as being compatible with
> RDF/W3C will in the long run make Tika compatible with what happens
> in HTML5 and other standardization efforts. Looking at this stuff
> could help as a guideline for decisions in Tika.
>
>
> So - Could anyone please think about it for a minute and add these
> links and some ideas how to deal with it to
> http://wiki.apache.org/tika/MetadataDiscussion and
> http://wiki.apache.org/tika/RecursiveMetadata ?
>
>
> best
> Leo Sauermann, Dr.
> CEO and Founder
>
> p.s. There used to be a much closer tie between Tika and Aperture in
> 2007, but as Aperture development is kind of finished (it's in
> production now at some places and fixes are only done when needed) it
> seems communication between them has dropped off a bit. Does anyone
> know why?
>
>
> mail: leo.sauermann@gnowsis.com
> mobile: +43 6991 gnowsis
> http://www.gnowsis.com
>
> helping people remember,
>
> so join our newsletter
> http://www.gnowsis.com/about/content/newsletter
> ____________________________________________________
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: Chris.Mattmann@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>

