Support special XML extractors based on DTD/XSD/NS
We live in a world of XML formats, hooray!
Note that this issue is being worked on. This wiki page will be updated as new developments arrive. See the corresponding ticket:
Aperture does not convert XML formats to proper semantic RDF counterparts. To do so, we would need to:
- make a special type of Extractor that works on XML FileDataObjects (not necessarily taking SAX events as input, but at least declaring support for XML parsing)
- Chris: I think these Extractors should still work on an InputStream: the choice whether to use SAX, DOM, XPATH, regular expressions or black magic is still best made on a case-by-case basis.
- analyse .xml files to detect what data is inside: identify the XML document type by its Document Type Definition (DTD), its XML Schema definition (XSD), or the namespaces declared on the root element
- Chris: this requires considerable work in the current MIME type identifier, or perhaps even a different MimeTypeIdentifier implementation that is invoked when the MagicMimeTypeIdentifier identifies a document as text/xml or application/xml. Right now the former tries to handle specific cases of XML-based document types using file extensions, which clearly is very brittle. I think Tika also has support for detecting XML-based document types; we should take a look at it.
- invoke XML Extractors that are registered by their xml document type
This could also improve the handling when crawling RDF files :-)
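A rough sketch of how the detection step could look, using only the SAX API that ships with the JDK. The class name XmlTypeSniffer and its fields are made up for this page and are not part of Aperture. It reads the stream only up to the root element, records the DOCTYPE identifiers, the root namespace and any xsi:schemaLocation, and then aborts the parse.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical helper (not an Aperture class): inspects an XML stream up to its root element. */
final class XmlTypeSniffer extends DefaultHandler2 {
    private static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

    String doctypePublicId;   // PUBLIC identifier from the DOCTYPE, or null
    String doctypeSystemId;   // DTD location from the DOCTYPE, or null
    String rootNamespace;     // namespace URI of the root element ("" if none)
    String rootLocalName;     // local name of the root element
    String schemaLocation;    // xsi:schemaLocation on the root element, or null
    private boolean rootSeen;

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        doctypePublicId = publicId;
        doctypeSystemId = systemId;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        rootNamespace = uri;
        rootLocalName = localName;
        schemaLocation = atts.getValue(XSI, "schemaLocation");
        rootSeen = true;
        // Abort the parse: everything needed for type detection is known by now.
        throw new SAXException("root element reached");
    }

    static XmlTypeSniffer sniff(InputStream in)
            throws IOException, SAXException, ParserConfigurationException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        XmlTypeSniffer sniffer = new XmlTypeSniffer();
        reader.setContentHandler(sniffer);
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", sniffer);
        // Never fetch external DTDs or entities; hand the parser an empty document instead.
        reader.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        try {
            reader.parse(new InputSource(in));
        } catch (SAXException e) {
            if (!sniffer.rootSeen) {
                throw e; // a genuine parse error before the root element
            }
        }
        return sniffer;
    }
}
```

Because the parse stops at the root element, this stays cheap even for very large files. Whether the result is then mapped to a mimetype or used to pick an Extractor directly is left open here.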
What is the URI of the reported data object for XML data?
When crawling complex XML files, such as RDF files, what is the reported URI in the single returned DataObject?
- Chris: the DataObject representing the entire file should always have the URI of that file IMO. When a file needs to be broken down in multiple DataObjects, we should always use a SubCrawler, also because the AccessData needs to include the URIs of these sub-DataObjects (necessary for incremental crawling).
When the file gets too big, where is the border to SubCrawlers?
Some XML files may contain MANY documents/DataObjects; in that case a SubCrawler would be better (see below). What is a good rule of thumb to determine when to use a SubCrawler and when to use an XML extractor?
- Chris: as said above, I would let SubCrawler be the only implementation that produces additional DataObjects, Extractor should just populate a single DataObject (or its RDFContainer, actually). I can't think of a definitive guideline of when to use a SubCrawler or an Extractor. I think that you want a SubCrawler when the parts to represent as individual DataObjects each have their own "life cycle" (are added/deleted/edited independently) and/or when end-user apps are likely to represent these DataObjects as first-class citizens.
Proposed solutions for XML:
- extend the MimetypeRegistry to include xml types. Maybe most xml types can be represented as mimetypes
- build a new registry for XML types
New mime-type detection needed?
- create a new MimetypeIdentifier implementation that performs both mimetype identification and XML type detection, either added to the magic-number-based mimetype identifiers or as a separate class. The XML detection could be hidden in a class referenced from the MimetypeIdentifier.
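One possible shape for this, sketched as a decorator. The TypeIdentifier interface below is a simplified stand-in invented for this page, not Aperture's actual MimeTypeIdentifier interface, and XmlAwareIdentifier is equally hypothetical. The idea is only that the XML detection runs when the wrapped identifier reports a generic XML type, and that the generic type is kept when nothing more specific is found.

```java
import java.util.Optional;
import java.util.function.Function;

/** Simplified stand-in for a MIME type identifier; NOT Aperture's actual interface. */
interface TypeIdentifier {
    String identify(byte[] firstBytes, String fileName);
}

/** Hypothetical decorator: refines generic XML results of another identifier. */
final class XmlAwareIdentifier implements TypeIdentifier {
    private final TypeIdentifier magic;
    private final Function<byte[], Optional<String>> xmlDetector;

    XmlAwareIdentifier(TypeIdentifier magic, Function<byte[], Optional<String>> xmlDetector) {
        this.magic = magic;
        this.xmlDetector = xmlDetector;
    }

    @Override
    public String identify(byte[] firstBytes, String fileName) {
        String type = magic.identify(firstBytes, fileName);
        boolean genericXml = "text/xml".equals(type) || "application/xml".equals(type);
        if (!genericXml) {
            return type; // not XML (or already specific): leave the result alone
        }
        // Ask the XML detector for a more specific type; fall back to the generic one.
        return xmlDetector.apply(firstBytes).orElse(type);
    }
}
```

Keeping the XML detection behind a plain function makes it easy to hide it as a class referenced from the identifier, and to test both parts separately.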
New registry for special XML handlers needed?
- we base our registry on mimetypes and use the existing mimetype registry
- for additional cases we make a registry that maps the URIs of DTDs, namespaces and XSD types to the mimetypes they correspond to. This is not part of the core API.
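The additional registry could be as small as a map from identifier strings to mimetypes. A minimal sketch, assuming a class name (XmlTypeRegistry) and method names that do not exist in Aperture; the mimetypes in the usage lines are the registered ones for Atom and RDF/XML.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical registry: maps DTD identifiers, namespace URIs and XSD URIs to mimetypes. */
final class XmlTypeRegistry {
    private final Map<String, String> mimeTypeByIdentifier = new HashMap<>();

    /** Associates a DTD public/system ID, namespace URI or XSD URI with a mimetype. */
    void register(String identifier, String mimeType) {
        mimeTypeByIdentifier.put(identifier, mimeType);
    }

    /**
     * Returns the mimetype of the first identifier that is registered.
     * Callers pass the candidates in order of preference; null entries are skipped.
     */
    Optional<String> lookup(String... identifiers) {
        for (String identifier : identifiers) {
            if (identifier == null) {
                continue;
            }
            String mimeType = mimeTypeByIdentifier.get(identifier);
            if (mimeType != null) {
                return Optional.of(mimeType);
            }
        }
        return Optional.empty();
    }
}
```

Usage would look roughly like this:

```java
XmlTypeRegistry registry = new XmlTypeRegistry();
registry.register("http://www.w3.org/2005/Atom", "application/atom+xml");
registry.register("http://www.w3.org/1999/02/22-rdf-syntax-ns#", "application/rdf+xml");
Optional<String> type = registry.lookup(null, "http://www.w3.org/2005/Atom");
```

Since the result is an ordinary mimetype, the existing mimetype-based Extractor lookup can stay as it is.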
No new interface is needed for the XML extractors (we may implement an abstract XML extractor that does SAX or something)
implement the extractors "in the wild" and look for abstractions later.
- Chris to look into how to extend the MimetypeIdentifier (extending the MagicMimeTypeIdentifier).
- Antoni & Leo: find example of XML files that we want to handle and put them in the test documents folder