From: Leo S. <leo...@gn...> - 2010-11-14 09:13:53
|
Hi Tika, (cc Aperture, just FYI)

I stumbled upon http://wiki.apache.org/tika/MetadataDiscussion and http://wiki.apache.org/tika/RecursiveMetadata

The problems don't stop there: if you think it through, you end up with zip files containing zip files containing .pst and email files, which contain attached Word documents, which in turn contain embedded Excel sheets.

In the SourceForge project "Aperture" (it's similar to Tika) the solution was to use the W3C standard RDF, which allows nesting information to arbitrary depth. RDF was also used in the NEPOMUK-KDE Linux implementation, though there in C++ and with a slightly different angle.

It may be useful to check out their documentation and the status of their discussion:

the data model: http://www.semanticdesktop.org/ontologies/
the specific model for nesting things inside each other: http://www.semanticdesktop.org/ontologies/2007/01/19/nie/
the nesting/recursion problem was solved using "subcrawlers": http://sourceforge.net/apps/trac/aperture/wiki/SubCrawlers
the general structure of how things come together: http://sourceforge.net/apps/trac/aperture/wiki/GeneralStructure

From my experience (I am co-author and was the initiator of most of the above), adopting this thinking brings only a limited short-term benefit but a bigger long-term one: being compatible with RDF/W3C will, in the long run, make Tika compatible with what happens in HTML5 and other standardization efforts. Looking at this material could serve as a guideline for decisions in Tika.

So: could anyone please think about it for a minute and add these links, plus some ideas on how to deal with the problem, to http://wiki.apache.org/tika/MetadataDiscussion and http://wiki.apache.org/tika/RecursiveMetadata ?

best
Leo Sauermann, Dr.
CEO and Founder

p.s. There used to be a much closer tie between Tika and Aperture in 2007, but as Aperture development is more or less finished (it's in production at some places, and fixes are only made when needed), communication between the projects seems to have dropped off. Does anyone know why?

mail: leo...@gn...
mobile: +43 6991 gnowsis
http://www.gnowsis.com
helping people remember, so join our newsletter http://www.gnowsis.com/about/content/newsletter
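To make the nesting idea in the message above concrete, here is a minimal sketch of how containment could be written down as RDF-style statements. It assumes the `nie:isPartOf` property from the NIE ontology linked in the message; the `Triple` class, the `urn:example:` identifiers, and the helper names are hypothetical and only for illustration. This is not Aperture's or Tika's actual code, just plain Java with no RDF library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NestedContainment {
    // Property URI taken from the NIE ontology namespace.
    static final String NIE_IS_PART_OF =
        "http://www.semanticdesktop.org/ontologies/2007/01/19/nie#isPartOf";

    /** Hypothetical minimal statement type: subject, predicate, object. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Emits one isPartOf statement per nesting level; the path lists the outermost container first. */
    static List<Triple> containmentTriples(List<String> pathOutermostFirst) {
        List<Triple> triples = new ArrayList<>();
        for (int i = 1; i < pathOutermostFirst.size(); i++) {
            triples.add(new Triple(pathOutermostFirst.get(i), NIE_IS_PART_OF,
                                   pathOutermostFirst.get(i - 1)));
        }
        return triples;
    }

    /** Follows isPartOf links upward and counts the containers around a resource (assumes no cycles). */
    static int depthOf(String resource, List<Triple> triples) {
        int depth = 0;
        String current = resource;
        boolean found = true;
        while (found) {
            found = false;
            for (Triple t : triples) {
                if (t.subject.equals(current) && t.predicate.equals(NIE_IS_PART_OF)) {
                    current = t.object;
                    depth++;
                    found = true;
                    break;
                }
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        List<String> path = Arrays.asList(
            "urn:example:outer.zip", "urn:example:inner.zip",
            "urn:example:mail.eml", "urn:example:report.doc", "urn:example:sheet.xls");
        List<Triple> triples = containmentTriples(path);
        for (Triple t : triples) {
            System.out.println("<" + t.subject + "> <" + t.predicate + "> <" + t.object + "> .");
        }
        System.out.println("nesting depth of sheet.xls: "
            + depthOf("urn:example:sheet.xls", triples));
    }
}
```

The point of the sketch is that each nesting level adds one more statement of the same shape, so the depth of the container hierarchy never changes the data model.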
From: Mattmann, C. A (388J) <chr...@jp...> - 2010-11-14 17:17:01
|
Thanks Leo, we'll take a look.

FYI, one of the goals of Tika is to be extremely lightweight and to provide a canonical metadata representation, independent of any particular "view" of metadata. In my mind, RDF is as much a view as, e.g., RSS, FGDC, JSON, or any one of the myriad views out there. Sure, it comes with inference and all of the other promised goodies, but in my experience I've seen little real use of those in data management systems. I've seen more use of RDF as a nice, compact XML format to represent metadata and allow interchange than anything else. I'd be opposed to making it the standard in Tika though because, as I said, to me it's just a view.

Regardless, thanks for reaching out. I have a number of downstream ideas for helping Tika become more useful for showing different metadata "views", as I call them, and I plan on starting to implement/contribute some of them in the coming year, as soon as this book [1] starts to wrap up :) I think a number of other Tika community members have been doing a fantastic job at keeping the metadata capabilities in Tika simple, lightweight, and feature-rich, and I expect it to continue down that path.

Cheers,
Chris

[1] http://www.manning.com/mattmann/

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chr...@jp...
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
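One way to read the "views" argument above is that a single canonical set of key/value pairs can be rendered into several output formats on demand. The sketch below illustrates that reading only. It deliberately does not use Tika's real Metadata class; the `MetadataViews` class, its method names, and the choice of the Dublin Core element namespace for the RDF-style view are assumptions made for the example, and the string escaping is simplified (quotes and backslashes only).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataViews {
    // Assumed namespace for the RDF-style view: Dublin Core elements.
    static final String DC = "http://purl.org/dc/elements/1.1/";

    /** Simplified escaping: handles backslashes and double quotes only. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** JSON view of the canonical key/value pairs. */
    static String toJson(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (!first) {
                sb.append(",");
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append("}").toString();
    }

    /** N-Triples-style view: one statement per key, about the given subject URI. */
    static String toNTriples(String subjectUri, Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append('<').append(subjectUri).append("> <")
              .append(DC).append(e.getKey()).append("> \"")
              .append(escape(e.getValue())).append("\" .\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The canonical representation: plain keys and values, no format attached.
        Map<String, String> canonical = new LinkedHashMap<>();
        canonical.put("title", "Quarterly Report");
        canonical.put("creator", "A. Author");

        System.out.println(toJson(canonical));
        System.out.print(toNTriples("urn:example:report.doc", canonical));
    }
}
```

In this reading, neither JSON nor RDF is privileged: both are produced from the same map, which is the position the reply argues for.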
From: Leo S. <leo...@gn...> - 2010-11-15 15:15:59
|
Hi,

OK, good feedback, thanks for taking the time to answer.

I feel an urge to take an "RDF vs. JSON/..." discussion off-list, as I have seen this discussion since 1999 (btw, RSS originally meant "RDF Site Summary", so RDF > RSS). But it's an important discussion, so here is some more input.

RDF is the only cross-format standard out there: there are standardized representations in XML, JSON, HTML, and databases. That would make it a good fit for frameworks such as Tika.

Of course, the 120 minutes it takes to learn RDF are longer than the 10 minutes it takes to learn JSON. My experience is that for data integration projects, the extra 110 minutes pay off. I guess that's the reason why Facebook and Google dig RDF now: it is the only proper way to let data flow from databases out to the web and back into other databases. (That's what Google now supports with price databases and the RDF-based "GoodRelations" e-commerce SEO format.)

If the consensus within Tika is "RDF is too complex for us, we don't need it", that's fine. It took Sebastian Trüg about a year of discussion on the KDE mailing lists to explain why RDF is better suited for data integration in document indexing before the KDE people were convinced to switch the system search engine to RDF.

Some points:

Inference: please ignore this, you don't need it.

Field definition: you will soon have a problem in Tika when you want to crawl vCard and iCal files and extract the full richness of ALL data embedded in those formats. Here RDF helped Aperture a lot. So for the whole area of types and their fields, subfields, and hierarchical fields, RDF could help.

XML: whatever, RDF is serialization-agnostic. It works best in internal APIs, I guess, where data should flow from one component to another without being reformatted.

Let's look at it the other way round: if you need info on why RDF is better than anything else (ho ho ho), ask on the Aperture-dev mailing list; people there are eager to help, I guess. Grant Ingersoll used to hang out there.

If this is OK, I would end this thread from my side and say: if the question pops up, get in touch with the Aperture or KDE people. If there is a need to get inspired, the Aperture people are there to help, and I would guess the same goes for the authors of the KDE Linux desktop indexing. They also use RDF as their format, and there is an overarching standardization effort (OSCAF.org) among all of us. That could also be a place to discuss; around a million EUR was spent just discussing the RDF data formats (ontologies) that are now running ;-)

I cc Sebastian Trüg on this mail; he is the main developer and boss-of-ontologies at KDE. I guess that Tika people are welcome to check out what happens on the KDE/GNOME side on the "Xesam" mailing list.

There is (not enough) documentation here on whom to ask in case of questions:
http://sourceforge.net/apps/trac/oscaf/wiki/Communication
http://sourceforge.net/apps/trac/oscaf/wiki/Ontologies

best
Leo
From: Mattmann, C. A (388J) <chr...@jp...> - 2010-11-15 16:59:12
|
Thanks Leo, appreciate the discussion.

Cheers,
Chris
From: Jukka Z. <jzi...@ad...> - 2010-11-15 19:39:38
|
Hi,

From: Leo Sauermann [mailto:leo...@gn...]
> RDF is the only cross-format standard out there, there are standardized
> representations in XML, JSON, HTML, and databases. That would make it a
> good fit for frameworks, such as Tika.

Agreed. The idea of using XMP (a metadata model based on RDF) has come up every now and then on dev@tika (see the archives), and I think that's what we should be working towards.

Note however that the scope of Tika has at least so far been intentionally smaller than that of Aperture. For example, we explicitly don't try to preserve the full structural or semantic details of parsed documents. Thus the points about mapping VCARD or ICAL data to RDF are somewhat irrelevant for Tika, as we'd just map such data to semi-structured XHTML whose main purpose is to support full text indexing or other unstructured text processing applications. In other words, Tika is lossy by design.

Another point, more related to recursive metadata, is that we make no attempt at defining a representation for compound documents. The rationale for this is that such representations are necessarily application- or domain-specific. Tika avoids making those design choices by having the Parser API only recognize singular documents, but allowing programmatic access to subdocuments through the EmbeddedDocumentExtractor (or the more general ParseContext) mechanism. A client application can use these tools to construct any kind of hierarchical metadata structures.

To summarize: yes, I think RDF is a good idea for Tika, but only in terms of extending our metadata model to XMP. I don't see how RDF would be more useful than XHTML in representing the full text content of a document; at least as long as we're not looking at radically extending the scope of Tika.

BR,

Jukka Zitting
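The division of labour described above (the parser sees one document at a time, and the client decides what to do with embedded parts) can be sketched roughly as follows. To stay self-contained, the sketch does not call Tika: `Doc`, `EmbeddedHandler`, `parse`, and `Node` are hypothetical stand-ins, with `EmbeddedHandler` playing the role the message ascribes to EmbeddedDocumentExtractor. The real Tika interfaces have different signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EmbeddedCallbackSketch {
    /** Hypothetical stand-in for a compound input: a name plus its embedded parts. */
    static final class Doc {
        final String name;
        final List<Doc> parts;

        Doc(String name, Doc... parts) {
            this.name = name;
            this.parts = Arrays.asList(parts);
        }
    }

    /** Hypothetical callback through which the parser hands embedded parts to the client. */
    interface EmbeddedHandler {
        void onEmbedded(Doc embedded);
    }

    /** The "parser" handles a single document and defines no representation for the hierarchy. */
    static void parse(Doc doc, EmbeddedHandler handler) {
        for (Doc part : doc.parts) {
            handler.onEmbedded(part);
        }
    }

    /** Client-side structure; the application, not the parser, chooses this shape. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        /** Renders the tree as name[child,child,...]. */
        String render() {
            if (children.isEmpty()) {
                return name;
            }
            StringBuilder sb = new StringBuilder(name).append('[');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(children.get(i).render());
            }
            return sb.append(']').toString();
        }
    }

    /** The client recurses from inside the callback to build its own hierarchy. */
    static Node buildTree(Doc doc) {
        Node node = new Node(doc.name);
        parse(doc, embedded -> node.children.add(buildTree(embedded)));
        return node;
    }

    public static void main(String[] args) {
        Doc input = new Doc("outer.zip",
            new Doc("mail.eml", new Doc("report.doc", new Doc("sheet.xls"))),
            new Doc("notes.txt"));
        System.out.println(buildTree(input).render());
    }
}
```

A different client could plug in another handler and produce a flat list, RDF statements, or nothing at all for embedded parts, which is the flexibility the message argues for.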