#6 Improve processing of XHTML docs

Milestone: 1.6.0 - features
Status: closed
Labels: general
Priority: 5
Updated: 2011-11-28
Created: 2007-02-02
Private: No

XHTML documents are currently not handled in an ideal manner.

First of all, the MagicMimeTypeIdentifier classifies them as text/html instead of the more appropriate application/xhtml+xml. Note that many XHTML files out there still use a .html or .htm extension, or even no extension at all (e.g. website root pages), so simply defining a description with text/xml as parent type and .xhtml as extension does not work.

Second, because of this inappropriate MIME type, the XmlExtractor is used instead of the HtmlExtractor (which I hope is able to process XHTML). This has two drawbacks:

- The XmlExtractor just gathers and concatenates all PCDATA; it has no knowledge of HTML semantics such as the title, body and META tags.

- The XmlExtractor uses a SAX parser, meaning that it will fail when the XHTML doc is not well-formed. And trust me, I have already encountered such docs that Firefox displays without any problem.

The latter problem would evaporate (for XHTML docs at least) if the MIME type identifier did a better job.
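
To illustrate the second drawback, here is a minimal sketch using only the standard JAXP SAX API (not Aperture code): a fragment that browsers render without complaint makes a SAX parser abort.

    import java.io.StringReader;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxStrictness {
        public static void main(String[] args) throws Exception {
            // An unclosed <br> tag: browsers render this, a SAX parser rejects it.
            String notWellFormed = "<html><body>Hello<br></body></html>";
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                        new InputSource(new StringReader(notWellFormed)),
                        new DefaultHandler());
                System.out.println("parsed");
            } catch (SAXParseException e) {
                // This branch is taken: SAX aborts on the first error.
                System.out.println("SAX failed: " + e.getMessage());
            }
        }
    }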

Discussion

  • Christiaan Fluit

    The website on which this problem occurred: http://www.dashandyforum.de/.

  • Christiaan Fluit

    Because of this problem I have added text/xml to the set of supported MIME types for the HtmlLinkExtractor, so that you are at least able to crawl XHTML sites. This is unlikely to cause problems: there is no other LinkExtractor for text/xml, and applying HTML link extraction to other kinds of XML documents should be harmless.
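
    A sketch of the kind of registration meant here (the map and the way the extractors are registered are illustrative, not Aperture's actual API):

        // Hypothetical registry mapping MIME types to link extractors.
        Map<String, LinkExtractor> linkExtractors =
                new HashMap<String, LinkExtractor>();
        linkExtractors.put("text/html", new HtmlLinkExtractor());
        linkExtractors.put("application/xhtml+xml", new HtmlLinkExtractor());
        // The workaround: let the HTML link extractor also handle text/xml,
        // so XHTML pages misidentified as text/xml can still be crawled.
        linkExtractors.put("text/xml", new HtmlLinkExtractor());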

  • Christiaan Fluit

    I further alleviated the problem by adding a child descriptor for text/xml that checks for (X)HTML file name extensions. Consequently, files that start with <?xml version=... and have a .htm, .html or .xhtml file extension are now classified as application/xhtml+xml rather than text/xml, meaning that Extractor selection will return an HtmlExtractor rather than an XmlExtractor, and non-well-formed XHTML files are no longer a (big) problem.

    You're still out of luck though when your files do not have one of these file name extensions.
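
    Roughly, the added check behaves like this sketch (the class and method are illustrative; the real mechanism is a declarative child descriptor in the MagicMimeTypeIdentifier):

        // Illustrative sketch, not the real descriptor code: once magic
        // number detection yields text/xml, the file name extension can
        // narrow the result to application/xhtml+xml.
        class XhtmlRefiner {
            static String refine(String fileName) {
                String name = fileName == null ? "" : fileName.toLowerCase();
                if (name.endsWith(".htm") || name.endsWith(".html")
                        || name.endsWith(".xhtml")) {
                    return "application/xhtml+xml";
                }
                return "text/xml";
            }
        }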

    Related problem on the Aduna AutoFocus forum: http://www.aduna-software.net/forum/posts/list/114.page

  • Antoni Mylka - 2008-06-06
    • milestone: --> 533940
  • Leo Sauermann - 2008-07-23

    I don't understand - if the MIME type identifier returns text/html, why is the XmlExtractor used?

    Surely, we should use the HtmlExtractor for XHTML for the time being (this will give good results, I hope).

  • Christiaan Fluit

    Hmmm, I think my bug description is less than optimal... (I had trouble understanding it myself :) )

    There are actually several subproblems. First, magic numbers overrule file type extensions in the MagicMimeTypeIdentifier. I.e., a file starting with "<?xml ..." is seen as text/xml, even when its file extension says something else. This is problematic for XHTML docs as they can start with this header, whereas you really want the HtmlExtractor to handle them: it knows the HTML semantics and is able to degrade gracefully on XHTML docs that are not well-formed.

    The (less than optimal) solution used in the MagicMimeTypeIdentifier is to have application/xslt+xml defined as a subtype of text/xml with some known XHTML file extensions. When a file is classified as text/xml, the MMTI will check whether it matches any of the subtypes before returning text/xml.

    In practice this does not solve all cases, as web pages often have no file name, or one that ends with .jsp, .php, etc. In those cases they are classified as text/xml and the XmlExtractor is selected.

    The only way to solve this thoroughly is to have a post-processing step in the MMTI for text/xml (and application/xml) that has knowledge of specific document types. That idea has been mentioned before but never implemented.
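
    Such a step could, for instance, peek at the document's root element (a sketch of the idea, not existing MMTI code):

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.io.Reader;

        class XmlPostProcessor {
            // After magic detection yields text/xml, inspect the first
            // characters for an XHTML root element or the XHTML namespace.
            static String postProcess(Reader in) throws IOException {
                char[] buf = new char[512];
                int n = new BufferedReader(in).read(buf);
                String head = new String(buf, 0, Math.max(n, 0));
                if (head.contains("<html")
                        || head.contains("http://www.w3.org/1999/xhtml")) {
                    return "application/xhtml+xml";
                }
                return "text/xml";
            }
        }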

  • Christiaan Fluit

    Correction to my previous comment: application/xslt+xml should read application/xhtml+xml.

  • Antoni Mylka - 2008-09-28
    • labels: 827280 --> general
    • milestone: 533940 -->
  • Antoni Mylka - 2011-11-28
    • milestone: --> 1.6.0 - features
  • Antoni Mylka - 2011-11-28

    This has been fixed with the TikaMimeTypeIdentifier, which makes the HtmlExtractor work on XHTML files.
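
    For reference, detection with Tika's facade API looks roughly like this (the calls below are standard Tika; the exact result still depends on the file name and content):

        import java.io.File;
        import org.apache.tika.Tika;

        public class DetectXhtml {
            public static void main(String[] args) throws Exception {
                Tika tika = new Tika();
                // Tika combines magic numbers, the XML root element and
                // the file name, so XHTML is no longer reduced to text/xml.
                String type = tika.detect(new File("page.xhtml"));
                System.out.println(type); // e.g. application/xhtml+xml
            }
        }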

  • Antoni Mylka - 2011-11-28
    • status: open --> closed
