On Fri, Mar 10, 2006 at 10:46:56AM +0100, Moof wrote:
-> On 3/9/06, Titus Brown <titus@...> wrote:
-> > Hi all,
-> >
-> > I'm facing a minor revolt on the twill list over the default use of
-> > 'latin-1' for form parsing. I'm going to be switching to UTF-8, because
-> > that appears to be a sensible default:
-> >
-> > http://www.alanwood.net/unicode/htmlunicode.html
-> >
-> > The true situation seems to be confused, as usual:
-> >
-> > http://www.w3.org/TR/REC-html40/charset.html
-> >
-> > (I'd try to reference older wwwsearch-general posts, but right now the
-> > search on SF is broken!)
->
-> In my experience, pages in the western world are normally declared to
-> be latin-1, and are actually iso-8859-15 (latin-1 plus the euro sign),
-> or more usually cp1252. The rationale for this is the fact that the
-> standard Apache config will automatically add the default character
-> set of latin-1 to web headers unless explicitly modified either by the
-> dynamic script running the website (which many programmers seem not to
-> bother doing), or in the appropriate .htaccess file or apache config
-> file in the case of static files.
->
-> I believe the various RFCs and TSs end up specifying that latin-1
-> should be the default encoding for HTML if none is specified in the
-> HTTP headers or in the page's META tag, unless a unicode encoding's
-> Byte Order Mark is encountered. The default encoding for XML (and
-> hence XHTML) is, I believe, UTF-8.
Here's something from http://www.alanwood.net/unicode/htmlunicode.html:
"""
The only detail about character encodings that a writer needs to know is
that some character encodings (for example UTF-8) allow any of the
characters in the document character set to be included, while others
(for example ISO-8859-1 or SHIFT_JIS) only allow for subsets. However,
characters that are not allowed for in a character encoding can still be
included in an HTML document by using character references. UTF-8 is the
normal character encoding for any HTML file that contains text in two or
more non-Latin scripts, but it can be used for any document.
"""
This appears to be the problem twill users (4 separate ones...) are
running into: latin-1 simply can't encode the characters in some of
the documents. A simple example is a page containing only
this:
"""
›
"""
If you try to parse the forms on a page including that character, you'll
get a Unicode error.
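Here is a minimal sketch of the underlying failure, assuming Python 3. It is
not twill's actual code path; the helper name can_encode is made up for
illustration. It checks whether the character above (U+203A) can be
represented in each of the encodings mentioned in this thread:

```python
# U+203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK, the character
# from the example page above.
ch = "\u203a"


def can_encode(text, encoding):
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical check over the encodings discussed in this thread.
results = {
    enc: can_encode(ch, enc)
    for enc in ("latin-1", "iso-8859-15", "cp1252", "utf-8")
}
print(results)
```

On a standard CPython install, only cp1252 and utf-8 should come back True,
which fits Moof's point that pages declared as latin-1 are often really
cp1252.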
cheers,
--titus