#150 option not to relocate markup in HTML->XML conversion


Below is the content of my post to the html-tidy list,
which was admittedly not received very enthusiastically
by Bjoern, but I thought the idea might be worth
posting here anyway, at least "for the future".

Converting an HTML file to XML, I noticed that Tidy
relocates bits of markup trying to be HTML-compliant.
In the case at hand, it removes <meta> elements from
within <td> elements and puts the former at the top of
the output file. The comment is: "Warning: <meta> isn't
allowed in <td> elements".

Bjoern says that Tidy's job is to fix invalid markup,
but my point is that in the absence of an XML DTD, Tidy
has no way of knowing whether the markup in
question is invalid XML or not, and its provenance
should not matter much (I return to this last bit
further below).

Given that Tidy offers the option of translating HTML
into XML, which I'm sure is the reason many people use
this tool nowadays, "fixing invalid markup" should not
mean the same as "making markup compliant with *some*
version of HTML". This is why I'd like to suggest an
option that would be active with -asxml and that would
prevent Tidy from moving stuff around in such cases.

This may be considered as bringing Tidy closer to the
"Unix way" in this respect, i.e. to proceed by small
specialized steps rather than attempting to force
happiness on people by performing lots of jobs at once.
If someone wanted their XML to be HTML-compliant, they
could first use Tidy to act on their HTML source,
outputting valid HTML, and then pipe this through Tidy
again, this time making sure that the result is
well-formed XML (I guess Tidy could make the second
step unnecessary, but that's another matter). I would
be happy with *not* performing the first step and just
making sure that all elements are closed, etc.

Naturally, consulting the HTML DTDs is (or may be)
necessary to create the right sort of well-formed XML,
because I admit that I would like, e.g., the missing
</table> tag to appear after the last </tr> tag and not
somewhere else. However, when we have, say:
<meta> .... </meta>
oh gosh, I'm invalid now

It seems to me (without knowing what Tidy's internal
workings are in this case, I admit) that it might be
possible for Tidy to only emit a warning about "<meta>
not allowed inside <td> elements" but crucially, do
nothing more about it.
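
To make the request concrete, here is a rough sketch of the behaviour I have in mind, written with Python's standard html.parser module rather than with Tidy itself. It says nothing about how Tidy works internally; the class name WellFormer, the helper to_well_formed and the VOID set are my own inventions for illustration. The sketch closes whatever is left open, drops stray end tags, and records a warning for <meta> inside <td> while leaving the element where it was found. Comments, doctypes and processing instructions are simply ignored here.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML elements that never have content; emitted as self-closing tags.
# (Hypothetical helper set for this sketch, not taken from Tidy.)
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class WellFormer(HTMLParser):
    """Emit well-formed XML without relocating any element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # output fragments, in document order
        self.stack = []      # names of currently open elements
        self.warnings = []   # diagnostics only; nothing gets moved

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and "td" in self.stack:
            # Warn, but crucially do nothing more about it.
            self.warnings.append("<meta> isn't allowed in <td> elements")
        # Valueless attributes (e.g. "checked") get their own name as value.
        attr_text = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.stack:
            return  # stray end tag: drop it
        # Close everything still open inside this element, then the element.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()  # flush any buffered text first
        while self.stack:  # then close whatever is still open
            self.out.append(f"</{self.stack.pop()}>")


def to_well_formed(html):
    """Return (xml_text, warnings) for an HTML fragment."""
    parser = WellFormer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.warnings


if __name__ == "__main__":
    text, notes = to_well_formed(
        '<table><tr><td><meta name="a" content="b">x</td></tr>'
    )
    print(text)   # <table><tr><td><meta name="a" content="b"/>x</td></tr></table>
    print(notes)
```

Note that the missing </table> ends up after the last </tr>, and the <meta> stays inside its <td>. The list of empty elements is the one piece of HTML knowledge the sketch still needs, which is in line with what I admitted above about the DTDs.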

