#21 Support special XML extractors based on DTD/XSD/NS

1.6.0 - features

We live in a world of XML formats - hooray!

Aperture does not convert XML formats to proper semantic RDF counterparts. To do so, we would need to:

make a special type of Extractor that works on XML FileDataObjects (not necessarily getting SAX events as input, but at least declaring that it supports XML parsing)
Chris: I think these Extractors should still work on an InputStream: the choice whether to use SAX, DOM, XPATH, regular expressions or black magic is still best made on a case-by-case basis.

analyse .xml files to detect what data is inside: identify xml document type by Document Type Definition (DTD), XML Schema definitions (XSD), or declared Namespaces in the root element
Chris: this requires considerable work in the current MIME type identifier, or perhaps even a different MimeTypeIdentifier implementation that is invoked when the MagicMimeTypeIdentifier identifies a document as text/xml or application/xml. Right now the former tries to handle specific cases of XML-based document types using file extensions, which clearly is very brittle. I think Tika also has support for detecting XML-based document types, we should take a look at it.
invoke XML Extractors that are registered by their xml document type
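
The detection step above could be sketched roughly as follows. This is only an illustration, not Aperture or Tika code: the class `XmlRootSniffer` and its `XmlType` holder are hypothetical names, and the sketch assumes that looking at the root element's namespace and its `xsi:schemaLocation` attribute is enough. DOCTYPE-based detection is left out. It works on a plain InputStream and uses the JDK's StAX parser, so it stops reading as soon as the root element is reached.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper, not part of Aperture or Tika. */
final class XmlRootSniffer {

    /** What the root element reveals about the XML document type. */
    static final class XmlType {
        final String rootNamespace;   // may be null when the root has no namespace
        final String rootLocalName;
        final String schemaLocation;  // value of xsi:schemaLocation, or null if absent

        XmlType(String rootNamespace, String rootLocalName, String schemaLocation) {
            this.rootNamespace = rootNamespace;
            this.rootLocalName = rootLocalName;
            this.schemaLocation = schemaLocation;
        }
    }

    private static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    /** Returns information about the root element, or null if the stream cannot be parsed as XML. */
    static XmlType sniff(InputStream in) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Sniffing should never process DTDs or fetch external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

        XMLStreamReader reader = null;
        try {
            reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                // The first START_ELEMENT event is the root element; nothing after it is read.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return new XmlType(
                            reader.getNamespaceURI(),
                            reader.getLocalName(),
                            reader.getAttributeValue(XSI_NS, "schemaLocation"));
                }
            }
            return null;
        } catch (XMLStreamException e) {
            return null; // not well-formed XML, so no XML type can be reported
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (XMLStreamException ignored) {
                    // nothing useful to do when closing fails
                }
            }
        }
    }
}
```

A MimeTypeIdentifier could call something like this only after the document has already been identified as text/xml or application/xml, and then pick the XML Extractor registered for the reported namespace.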


what is the URI of the reported data object when having XML data?
When crawling complex XML files, such as RDF files, what is the reported URI in the single returned DataObject?
Chris: the DataObject representing the entire file should always have the URI of that file IMO. When a file needs to be broken down in multiple DataObjects, we should always use a SubCrawler, also because the AccessData needs to include the URIs of these sub-DataObjects (necessary for incremental crawling).
When the file gets too big, where is the border to SubCrawlers?
Some XML files may contain MANY documents/dataobjects, in which case a subcrawler would be better (see below). What is a good rule of thumb to determine when to use a subcrawler and when to use an XML extractor?
Chris: as said above, I would let SubCrawler be the only implementation that produces additional DataObjects, Extractor should just populate a single DataObject (or its RDFContainer, actually). I can't think of a definitive guideline of when to use a SubCrawler or an Extractor. I think that you want a SubCrawler when the parts to represent as individual DataObjects each have their own "life cycle" (are added/deleted/edited independently) and/or when end-user apps are likely to represent these DataObjects as first-class citizens.

Proposed solutions:


extend the MimetypeRegistry to include xml types. Maybe most xml types can be represented as mimetypes
build a new registry for XML types
new mime-type detection needed?

create a new MimetypeIdentifier implementation that performs both mimetype identification and XML detection, maybe added to the magic-number-based mimetype identifiers or as a separate class. Maybe hide the XML detection in a class referenced from the MimetypeIdentifier

new registry for special XML-handlers needed?
we base our registry on mimetypes and use the existing mimetype registry

for additional cases we make a registry that maps the URIs of DTDs, namespaces and XSD types to the mimetypes they correspond to. This is not part of the core API.

no new interface is needed for the XML extractors (we may implement an abstract XML extractor that does SAX or something)

implement the extractors "in the wild" and look for abstractions later.
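
The registry proposed above, which maps the URIs of DTDs, namespaces and XSD types to mimetypes, might look roughly like this. It is a minimal sketch under assumptions: `XmlTypeRegistry` and its methods are hypothetical names and not part of the Aperture API, and the identifiers are treated as opaque strings that must match exactly.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry, not part of the Aperture core API. */
final class XmlTypeRegistry {

    // Key: DTD identifier, namespace URI or XSD location. Value: the mimetype it stands for.
    private final Map<String, String> mimeTypeByIdentifier = new ConcurrentHashMap<>();

    /** Associates an XML type identifier with a mimetype, replacing any earlier mapping. */
    void register(String identifier, String mimeType) {
        mimeTypeByIdentifier.put(identifier, mimeType);
    }

    /**
     * Returns the mimetype of the first identifier that is registered.
     * Callers pass whatever they found in the document, in order of preference.
     */
    Optional<String> lookup(String... identifiers) {
        for (String identifier : identifiers) {
            if (identifier == null) {
                continue; // the document did not declare this kind of identifier
            }
            String mimeType = mimeTypeByIdentifier.get(identifier);
            if (mimeType != null) {
                return Optional.of(mimeType);
            }
        }
        return Optional.empty();
    }
}
```

With a mapping such as `register("http://www.w3.org/1999/02/22-rdf-syntax-ns#", "application/rdf+xml")`, the existing mimetype-based extractor registry can stay as it is: the XML detection only has to translate what it found into a mimetype.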


Chris to look into how to extend the MimetypeIdentifier, e.g. by extending the MagicMimeTypeIdentifier.

Antoni & Leo: find example of XML files that we want to handle and put them in the test documents folder


  • Antoni Mylka

    Antoni Mylka - 2007-11-26

    Logged In: YES
    Originator: YES

    This issue is a never-ending one. I've created a wiki page for it, where it will have better visibility.


    So the task is:
    1. Think about the architecture, how to extend the MagicMimeTypeIdentifier
    2. Write these thoughts on the wiki page
    3. Change/create the appropriate interfaces and document them (if necessary)
    4. Write one or two example implementations of XML extractors
    5. Update the wiki page (if anything changed)
    6. Close the ticket.

  • Antoni Mylka

    Antoni Mylka - 2011-11-28
    • milestone: --> 1.6.0 - features
  • Antoni Mylka

    Antoni Mylka - 2011-11-28

    This issue has been solved with the expanded XML detection in TikaMimeTypeIdentifier and processing with the X2RSubCrawler. I hereby declare it closed.

  • Antoni Mylka

    Antoni Mylka - 2011-11-28
    • status: open --> closed
