#6 Improve processing of XHTML docs

Milestone: 1.6.0 - features
Status: closed
Labels: general
Priority: 5
Updated: 2011-11-28
Created: 2007-02-02
Private: No

XHTML documents are currently not handled in an ideal manner.

First of all, the MagicMimeTypeIdentifier classifies them as text/html instead of the more appropriate application/xhtml+xml. Note that many XHTML files out there still use a .html or .htm extension, or even no extension at all (e.g. website root pages), so simply defining a description with text/xml as parent type and .xhtml as extension does not work.

Second, because of this inappropriate MIME type, the XmlExtractor is used instead of the HtmlExtractor (which I hope is able to process XHTML). This has two drawbacks:

- The XmlExtractor just gathers and concatenates all PCDATA; it has no knowledge of HTML semantics such as the title, body and META tags.

- The XmlExtractor uses a SAX parser, meaning that it will fail when the XHTML doc is not well-formed. And trust me, I have already encountered such docs that Firefox displays without any problem.

The latter problem would evaporate (for XHTML docs at least) if the MIME type identifier did a better job.
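
To illustrate the second drawback, here is a minimal sketch using only the standard JAXP SAX API (not Aperture code): a fragment that browsers render without complaint makes a SAX parser abort.

    import java.io.StringReader;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxStrictness {
        public static void main(String[] args) throws Exception {
            // An unclosed <br> tag: browsers render this, a SAX parser rejects it.
            String notWellFormed = "<html><body>Hello<br></body></html>";
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(
                        new InputSource(new StringReader(notWellFormed)),
                        new DefaultHandler());
                System.out.println("parsed");
            } catch (SAXParseException e) {
                // This branch is taken: SAX aborts on the first error.
                System.out.println("SAX failed: " + e.getMessage());
            }
        }
    }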

Discussion

  • Christiaan Fluit

    The website on which this problem occurred: http://www.dashandyforum.de/.

  • Christiaan Fluit

    Because of this problem I have added text/xml to the set of supported MIME types for the HtmlLinkExtractor, so that you are at least able to crawl XHTML sites. This is unlikely to cause problems: there is no other LinkExtractor for text/xml, and applying HTML link extraction to other kinds of XML documents should be harmless.
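
    A sketch of the kind of registration meant here (the map and the way the extractors are registered are illustrative, not Aperture's actual API):

        // Hypothetical registry mapping MIME types to link extractors.
        Map<String, LinkExtractor> linkExtractors =
                new HashMap<String, LinkExtractor>();
        linkExtractors.put("text/html", new HtmlLinkExtractor());
        linkExtractors.put("application/xhtml+xml", new HtmlLinkExtractor());
        // The workaround: let the HTML link extractor also handle text/xml,
        // so XHTML pages misidentified as text/xml can still be crawled.
        linkExtractors.put("text/xml", new HtmlLinkExtractor());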

  • Christiaan Fluit

    I further alleviated the problem by adding a child descriptor for text/xml that checks for (X)HTML file name extensions. Consequently, files that start with <?xml version=... and have a .htm, .html or .xhtml file extension are now classified as application/xhtml+xml rather than text/xml, meaning that Extractor selection will return an HtmlExtractor rather than an XmlExtractor, and non-well-formed XHTML files are no longer a (big) problem.

    You're still out of luck though when your files do not have one of these file name extensions.
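
    Roughly, the added check behaves like this sketch (the class and method are illustrative; the real mechanism is a declarative child descriptor in the MagicMimeTypeIdentifier):

        // Illustrative sketch, not the real descriptor code: once magic
        // number detection yields text/xml, the file name extension can
        // narrow the result to application/xhtml+xml.
        class XhtmlRefiner {
            static String refine(String fileName) {
                String name = fileName == null ? "" : fileName.toLowerCase();
                if (name.endsWith(".htm") || name.endsWith(".html")
                        || name.endsWith(".xhtml")) {
                    return "application/xhtml+xml";
                }
                return "text/xml";
            }
        }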

    Related problem on the Aduna AutoFocus forum: http://www.aduna-software.net/forum/posts/list/114.page

  • Antoni Mylka - 2008-06-06
    • milestone: --> 533940
  • Leo Sauermann - 2008-07-23

    I don't understand - if the MIME type identifier returns text/html, why is the XmlExtractor used?

    Surely, we should use the HtmlExtractor for XHTML for the time being (this will give good results, I hope).

  • Christiaan Fluit

    Hmmm, I think my bug description is less than optimal... (I had trouble understanding it myself :) )

    There are actually several subproblems. First, magic numbers overrule file type extensions in the MagicMimeTypeIdentifier. I.e., a file starting with "<?xml ..." is seen as text/xml, even when its file extension says something else. This is problematic for XHTML docs as they can start with this header, whereas you really want the HtmlExtractor to handle them: it knows the HTML semantics and is able to degrade gracefully on XHTML docs that are not well-formed.

    The (less than optimal) solution used in the MagicMimeTypeIdentifier is to have application/xslt+xml defined as a subtype of text/xml with some known XHTML file extensions. When a file is classified as text/xml, the MMTI will check whether it matches any of the subtypes before returning text/xml.

    In practice this does not solve all cases, as web pages often have no file name, or one that ends with .jsp, .php, etc. In those cases they are classified as text/xml and the XmlExtractor is selected.

    The only way to solve this thoroughly is to have a post-processing step in the MMTI for text/xml (and application/xml) that has knowledge of specific document types. That idea has been mentioned before but never implemented.
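
    Such a step could, for instance, peek at the document's root element (a sketch of the idea, not existing MMTI code):

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.io.Reader;

        class XmlPostProcessor {
            // After magic detection yields text/xml, inspect the first
            // characters for an XHTML root element or the XHTML namespace.
            static String postProcess(Reader in) throws IOException {
                char[] buf = new char[512];
                int n = new BufferedReader(in).read(buf);
                String head = new String(buf, 0, Math.max(n, 0));
                if (head.contains("<html")
                        || head.contains("http://www.w3.org/1999/xhtml")) {
                    return "application/xhtml+xml";
                }
                return "text/xml";
            }
        }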

  • Christiaan Fluit

    Correction to my previous comment: application/xslt+xml should read application/xhtml+xml.

  • Antoni Mylka - 2008-09-28
    • labels: 827280 --> general
    • milestone: 533940 -->
  • Antoni Mylka - 2011-11-28
    • milestone: --> 1.6.0 - features
  • Antoni Mylka - 2011-11-28

    This has been fixed with the TikaMimeTypeIdentifier, which makes the HtmlExtractor work on XHTML files.
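
    For reference, detection with Tika's facade API looks roughly like this (the calls below are standard Tika; the exact result still depends on the file name and content):

        import java.io.File;
        import org.apache.tika.Tika;

        public class DetectXhtml {
            public static void main(String[] args) throws Exception {
                Tika tika = new Tika();
                // Tika combines magic numbers, the XML root element and
                // the file name, so XHTML is no longer reduced to text/xml.
                String type = tika.detect(new File("page.xhtml"));
                System.out.println(type); // e.g. application/xhtml+xml
            }
        }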

  • Antoni Mylka - 2011-11-28
    • status: open --> closed
