XHTML documents are currently not handled in an ideal manner.
First of all, the MagicMimeTypeIdentifier classifies them as text/xml instead of the more appropriate application/xhtml+xml. Note that a lot of XHTML files out there still use a .html or .htm extension, or even no extension at all (e.g. website root pages), so simply adding a MIME type description with text/xml as parent type and .xhtml as extension does not work.
Second, because of this inappropriate MIME type, the XmlExtractor is used instead of the HtmlExtractor (which I hope is able to process XHTML). This has two drawbacks:
- XmlExtractor just gathers and concatenates all PCDATA; it has no knowledge of e.g. the title, body or META tags.
- XmlExtractor uses a SAX parser, meaning that it will fail when the XHTML doc is not well-formed. And trust me, I have already encountered such docs, which Firefox just displays without any problem.
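To make both drawbacks concrete, here is a minimal sketch using only the JDK's SAX API. It is not the actual XmlExtractor code: the class PcdataDemo and its methods are hypothetical, and they only mimic the behaviour described above (concatenate all character data, give up when the input is not well-formed).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the project discussed here.
class PcdataDemo {

    /**
     * Concatenates all character data found by a SAX parser, ignoring element names.
     * Returns an empty Optional when the markup is not well-formed.
     */
    static Optional<String> concatenatedText(String markup) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Title text and body text end up in the same buffer.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(markup)), handler);
            return Optional.of(text.toString());
        } catch (SAXException e) {
            // Raised for any well-formedness error, e.g. an unclosed <br>.
            return Optional.empty();
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String wellFormed = "<html><head><title>Title</title></head><body><p>Body</p></body></html>";
        String tagSoup = "<html><body><p>Hello<br></p></body></html>";
        System.out.println(concatenatedText(wellFormed)); // Optional[TitleBody]
        System.out.println(concatenatedText(tagSoup));    // Optional.empty
    }
}
```

The first call shows that the title is indistinguishable from the body text once everything is concatenated; the second shows that a single unclosed tag is enough to lose the whole document.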
The latter problem would evaporate (for XHTML docs at least) if the MIME type identifier did a better job.
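One possible way for the identifier to do a better job is to look at the content instead of the extension. The following is only a sketch of that idea, not how MagicMimeTypeIdentifier works: the class XhtmlSniffer, its method and the fallbackType parameter are hypothetical, and the check is a heuristic that assumes the XHTML namespace or an XHTML DOCTYPE public identifier shows up in the first bytes of the file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the project discussed here.
class XhtmlSniffer {

    // Namespace URI that XHTML root elements are expected to declare.
    static final String XHTML_NAMESPACE = "http://www.w3.org/1999/xhtml";

    // Common prefix of the W3C public identifiers, e.g. "-//W3C//DTD XHTML 1.0 Strict//EN".
    static final String XHTML_DOCTYPE_PREFIX = "-//W3C//DTD XHTML";

    /**
     * Returns application/xhtml+xml when the start of the document looks like XHTML,
     * otherwise the type that was determined by other means (fallbackType).
     */
    static String guessMimeType(byte[] firstBytes, String fallbackType) {
        // ISO-8859-1 maps every byte to one char, which is enough to find ASCII markers.
        // Assumption: the document uses an ASCII-compatible encoding (not UTF-16).
        String head = new String(firstBytes, StandardCharsets.ISO_8859_1);
        if (head.contains(XHTML_NAMESPACE) || head.contains(XHTML_DOCTYPE_PREFIX)) {
            return "application/xhtml+xml";
        }
        return fallbackType;
    }

    public static void main(String[] args) {
        byte[] xhtml = "<?xml version=\"1.0\"?><html xmlns=\"http://www.w3.org/1999/xhtml\">"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] plainXml = "<?xml version=\"1.0\"?><catalog><item/></catalog>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guessMimeType(xhtml, "text/xml"));    // application/xhtml+xml
        System.out.println(guessMimeType(plainXml, "text/xml")); // text/xml
    }
}
```

Since this does not depend on the file name, it would also cover .html, .htm and extension-less files. It can give false positives for non-XHTML documents that merely mention the namespace early on, so a real implementation would probably need to inspect the root element.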