From: William S. <sp...@rh...> - 2009-09-03 22:58:27
I wrote a proposal that I think will be more to the group's liking. What I am proposing now is a new tag, similar to "binary", which I called "utf8". I'm afraid my proposal is rather long-winded; anybody who wants to shorten or clarify it, go ahead!

Basically, invalid UTF-8 is written like this. In this example the string contains both a correct UTF-8 Aacute and a 1-byte error:

  - !!utf8 "Aacute = Á, UTF-8 error = %80"

Valid UTF-8 multi-byte characters can also be written with %nn. This may be useful for getting UTF-8 out of an editor that insists on producing some other encoding, so that ASCII letters are the only ones that work. For a valid string there are many equivalent ways of writing it, though the last here is preferred:

  - !!utf8 "Aacute = %C3%81"
  - !!utf8 "Aacute = Á"
  - !!utf8 "Aacute = \xC1"
  - "Aacute = \xC1"
  - "Aacute = Á"

The main goal is to allow lossless storage of arbitrary byte streams without discouraging the use of UTF-8 in these streams. Users should be able to read any valid UTF-8 and insert valid UTF-8 using a Unicode-aware text editor. Not allowing this causes users to treat the source as being in some other encoding, such as ASCII only, and prevents them from ever switching to UTF-8.

I changed our software to use this % encoding, although I am currently using the fact that the text is double-quoted, rather than the tag, to indicate whether decoding is needed.

I need some agreement on the questionable aspects of my design before I continue:

1. The exact name of the tag. I chose "utf8" because there is no guarantee the string is invalid.

2. My idea that only %25 and %80-%FF are interpreted. %20, for instance, is not a space but instead '%','2','0'. This is so that %-escaped URLs get mangled less.

3. Any case requirements on the hex letters (I made it accept both, just like URL encoding and the \x in YAML).

4. Exactly how to escape a '%', though I used %25 just like URL encoding.
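
To make the rules above concrete, here is a rough Python sketch of how the % decoding could work, assuming the YAML parser has already resolved the quoted scalar to a Unicode string. The function names decode_utf8_tag and encode_utf8_tag are made up for this illustration and are not part of any YAML library. The encoder's choice to escape a '%' only where it would otherwise be read as an escape is one possible reading of point 4, not something the proposal settles.

```python
import re

# Only %25 and %80-%FF are interpreted, in either hex case (points 2 and 3).
_ESCAPE = re.compile(rb"%(25|[89A-Fa-f][0-9A-Fa-f])")

# A '%' in the raw bytes that would be misread as one of those escapes.
_AMBIGUOUS_PERCENT = re.compile(rb"%(?=25|[89A-Fa-f][0-9A-Fa-f])")

# Lone surrogates that Python's "surrogateescape" handler produces
# for bytes that are not part of valid UTF-8.
_BAD_BYTE = re.compile("[\udc80-\udcff]")


def decode_utf8_tag(text: str) -> bytes:
    """Turn the scalar text of a hypothetical !!utf8 node into raw bytes."""
    raw = text.encode("utf-8")
    # Other sequences such as %20 are left untouched as '%', '2', '0'.
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def encode_utf8_tag(data: bytes) -> str:
    """Turn raw bytes into scalar text, keeping valid UTF-8 readable."""
    # Assumption: escape '%' as %25 only where it would be ambiguous.
    protected = _AMBIGUOUS_PERCENT.sub(b"%25", data)
    text = protected.decode("utf-8", errors="surrogateescape")
    # Each invalid byte 0xNN arrives as U+DCNN; write it as %NN.
    return _BAD_BYTE.sub(lambda m: "%%%02X" % (ord(m.group()) - 0xDC00), text)


if __name__ == "__main__":
    sample = b"Aacute = \xc3\x81, UTF-8 error = \x80"
    written = encode_utf8_tag(sample)
    print(written)
    print(decode_utf8_tag(written) == sample)
```

With this reading, a %-escaped URL containing only sequences such as %20 passes through both directions unchanged, while a URL that contains %25 or %80-%FF would still need its '%' rewritten as %25 on output.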