Hence, files beginning with DOCTYPE HTML are classified as XML, not HTML. As a result they are handled by the XmlExtractor, which is notoriously intolerant of any syntactic error in the file. I personally wonder what we have the XmlExtractor for anyway.
Fixed in r2615 and r2616.
Reopening: the magic added in r2615 and r2616 should be extended with a variant that also matches when a BOM sits at the front of the file.
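To illustrate the BOM case, here is a minimal sketch in Python of a sniffer that skips a leading byte order mark before testing for the DOCTYPE. It is not the project's actual magic definition. The function name `sniff_mime_type`, the `BOMS` table and the returned MIME strings are all assumptions made for this example.

```python
import codecs

# Hypothetical BOM table (an assumption for this sketch). The UTF-32 entries
# come first because the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_mime_type(head: bytes) -> str:
    """Guess a MIME type from the first bytes of a file (illustrative only)."""
    encoding = "utf-8"
    for bom, name in BOMS:
        if head.startswith(bom):
            # Drop the BOM so the textual checks below see the real content.
            head = head[len(bom):]
            encoding = name
            break
    # Undecodable or truncated bytes are replaced rather than raising.
    text = head.decode(encoding, errors="replace").lstrip().lower()
    if text.startswith("<!doctype html") or text.startswith("<html"):
        return "text/html"
    if text.startswith("<?xml"):
        return "text/xml"
    return "application/octet-stream"
```

With this ordering, a file that opens with a BOM followed by a DOCTYPE HTML declaration is reported as HTML rather than falling through to another type. A document that starts with an XML declaration is still reported as XML, even if an HTML DOCTYPE follows it.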
Fixed in r2621