Trejkaz - 2004-10-25

If an HTML file is actually a valid XHTML file, it may have the XML declaration at the top:

<?xml version="1.0" encoding="some_encoding" ?>

HTMLParser apparently returns this as a string node.  Is that the expected behaviour?

Also, I wonder what happens in general when this sort of PI is encountered in the middle of a document.

e.g.:

<p>The value is <?php ... ?></p>

I assume that this shouldn't be text, either.  But if I do a search and replace through the text for the pattern "<\?.*\?>", I might intercept some cases I'm not supposed to:

<p>The value is &lt;?php ... ?&gt;</p>

Because I receive that text with the entities already decoded (i.e. as "<?php ... ?>"), it will match the same expression.
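
To make the problem concrete, here is a minimal sketch in Python using only the standard library. It does not use HTMLParser at all; the two input strings are my assumption of what the text would look like in each case, and strip_pis is a hypothetical helper. I have also used a non-greedy variant of the pattern (".*?" rather than ".*"), since the greedy form would run from the first "<?" to the last "?>" in the text.

```python
import html
import re

# Non-greedy variant of the pattern from the post; DOTALL lets a PI span lines.
PI_PATTERN = re.compile(r"<\?.*?\?>", re.DOTALL)

# Assumed inputs: text containing a real PI, and text where the author
# escaped the angle brackets so the PI-like string is meant to be displayed.
real_pi_text = "The value is <?php ... ?>"
escaped_text_raw = "The value is &lt;?php ... ?&gt;"

# Once entities are decoded, the escaped case looks identical to the real PI.
escaped_text_decoded = html.unescape(escaped_text_raw)


def strip_pis(text):
    """Hypothetical helper: remove anything that looks like a PI."""
    return PI_PATTERN.sub("", text)


if __name__ == "__main__":
    print(repr(strip_pis(real_pi_text)))          # real PI is removed
    print(repr(strip_pis(escaped_text_raw)))      # still encoded: left alone
    print(repr(strip_pis(escaped_text_decoded)))  # decoded: wrongly removed
```

The pattern leaves the still-encoded text alone, but once the entities have been decoded the two cases are the same string, so no regular expression applied afterwards can tell them apart. The distinction would have to be made before decoding, or by the parser itself.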