Sam Tregar <sam@...> writes:
> As far as I know the character-set conversions are not necessary to
> achieve this goal, so they weren't included.
You are correct. No general-purpose HTML quoting function handles
internationalization, for two reasons:
* It's not necessary to achieve the primary purpose of quoting, which
is to prevent HTML metacharacters from being interpreted as markup.
* It's extremely hard to implement without making simplistic
assumptions. Handling of I18N text is highly context-dependent.
For example, it may seem "correct" to change the character 220 to
"&Uuml;". But what if the target template is in a different
charset, where 220 has a wholly different meaning?
For example, in Latin 1, the character 169 is the copyright sign,
with entities "&copy;" and "&#169;". But in a Latin 2 HTML
document, exactly the same code represents the "S with caron"
character, with entities "&Scaron;" and "&#352;". In UTF-8, a lone
byte with that code is not a valid character at all.
How is a quoting function to know whether to convert code 169 to
"&copy;" or to "&Scaron;"?
A quoting function that tried to fully handle I18N would have to know
everything about charsets and HTML and the surrounding context. Doing
that kind of work for no gain is pointless. Doing the simple thing
and assuming Latin 1 is actually *harmful* for non-Latin 1 users.
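
To make the harm concrete, here is a rough sketch in Python (my choice
of language for illustration, not anything from this thread). The
function names quote_metachars, quote_assuming_latin1 and
decode_or_none are hypothetical; only html.escape and the codec names
come from the Python standard library. It shows a Latin 2 string being
mangled by a quoting function that assumes Latin 1, next to one that
touches only the metacharacters.

```python
import html


def quote_metachars(text):
    """Escape only the HTML metacharacters; leave every other character alone."""
    return html.escape(text, quote=True)


def quote_assuming_latin1(raw):
    """Hypothetical 'helpful' quoting: treat every byte as Latin 1 and
    turn each non-ASCII character into a numeric character reference."""
    text = html.escape(raw.decode("iso-8859-1"), quote=True)
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def decode_or_none(raw, charset):
    """Decode raw bytes under the given charset; return None if invalid."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None


# A Latin 2 author stores the S with caron as the single byte 169.
latin2_bytes = "\u0160koda <b>".encode("iso-8859-2")

# Assuming Latin 1 turns that byte into a reference to the copyright sign.
wrong = quote_assuming_latin1(latin2_bytes)

# Quoting only the metacharacters leaves the author's text intact.
right = quote_metachars(latin2_bytes.decode("iso-8859-2"))

# The same byte on its own is not decodable as UTF-8 at all.
lone_utf8 = decode_or_none(bytes([169]), "utf-8")

print(wrong)
print(right)
print(lone_utf8)
```

The second function gets the right answer only because the caller
told it the charset when decoding. The quoting step itself never
needed to know.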