From: Ingy d. N. <in...@in...> - 2011-10-29 02:43:28
William, I put the content of this email here:
https://github.com/yaml2/YAML2/wiki/Unicode-Strictness

Oren, I started listing some bad parts. I am making a page for each bad
point so that they can each have their own long discussion.

On Fri, Oct 28, 2011 at 3:49 PM, William Spitzak <sp...@rh...> wrote:
> PLEASE!!! This is the main reason we cannot use unaltered YAML:
>
> SUPPORT FOR INVALID UTF-8 AND UTF-16
>
> There needs to be a way to put an arbitrary byte sequence into a scalar
> without losing the ability to make valid byte sequences human-readable.
>
> Currently YAML is limited to only putting byte sequences that are valid
> UTF-8 into scalars, unless some transformation is done that makes some
> (often all) Unicode unreadable in the YAML input. This has the
> counter-productive effect of *discouraging* use of Unicode on any backend
> that uses bytes where there is no guarantee that the backend limits the byte
> sequences to valid UTF-8. Examples are all byte-based file formats, most
> internet protocols, Unix filenames, and Windows resource identifiers.
>
> My recommendation is here. However, any solution that allows an arbitrary
> byte stream to be produced, while allowing valid UTF-8 bytes to be
> represented by the correct Unicode character in the YAML source, would be
> acceptable:
>
> 1. The backslash escape of \xNN represents a "raw UTF-8 byte" with the
> given value. This is only different from current YAML for 0x80-0xFF. The
> sequence \u00NN must be used for actual Unicode code points in this range.
>
> 2. An API that requests YAML scalars as UTF-8 gets these bytes, inserted
> between the UTF-8 encoding of all other characters, as raw data.
>
> 3. An API that requests scalars in some other form, such as UTF-16, gets
> these bytes as unchanged code units. This makes \xNN work identically to
> current YAML/JSON when the UTF-16 API is used. It may also allow invalid
> forms of other encodings to be supported.
>
> In addition, invalid UTF-16 must also be supported. Support of invalid
> UTF-16 is more common, due to its use on Windows and therefore the
> realization by otherwise ignorant programmers of the inability to work
> without supporting it. Technically the YAML spec does not allow invalid
> UTF-16, but my proposal here formalizes the actual support that is in most
> (all?) YAML and JSON implementations:
>
> 1. The backslash escape of \uNNNN for NNNN in the range 0xD800..0xDFFF
> represents a "raw UTF-16 code unit".
>
> 2. An API that requests UTF-16 or other 16-bit code units will get these
> codes unchanged.
>
> 3. An API that requests bytes will get three bytes for each of these; the
> three bytes match the encoding you get from UTF-8 if you extend it to
> these invalid code points.
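
To make the two numbered lists in the quoted proposal concrete, here is a minimal Python sketch of how a parser might treat the escapes. It is an illustration under stated assumptions, not an existing YAML API: the names parse_scalar, as_utf8 and as_utf16_units and the token model are hypothetical, only \xNN and \uNNNN escapes are recognized (no escaped backslashes or other YAML escapes), and joining an adjacent high/low surrogate pair into one character is an assumption the proposal does not spell out.

```python
import re

# Hypothetical token model (not part of any YAML library):
#   ("char", cp)    an ordinary Unicode code point
#   ("byte", b)     a raw byte written as \xNN
#   ("unit16", u)   a lone surrogate written as \uD800..\uDFFF
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})")


def parse_scalar(text):
    """Split scalar source text into tokens; handles only \\xNN and \\uNNNN."""
    tokens = []
    pos = 0
    for m in _ESCAPE.finditer(text):
        tokens.extend(("char", ord(c)) for c in text[pos:m.start()])
        if m.group(1) is not None:
            tokens.append(("byte", int(m.group(1), 16)))
        else:
            value = int(m.group(2), 16)
            prev = tokens[-1] if tokens else None
            if (0xDC00 <= value <= 0xDFFF and prev is not None
                    and prev[0] == "unit16" and 0xD800 <= prev[1] <= 0xDBFF):
                # Assumption: an adjacent high/low pair is one real character.
                high = tokens.pop()[1]
                cp = 0x10000 + ((high - 0xD800) << 10) + (value - 0xDC00)
                tokens.append(("char", cp))
            elif 0xD800 <= value <= 0xDFFF:
                tokens.append(("unit16", value))
            else:
                tokens.append(("char", value))
        pos = m.end()
    tokens.extend(("char", ord(c)) for c in text[pos:])
    return tokens


def as_utf8(tokens):
    """Byte-oriented view: raw bytes pass through, lone surrogates get the
    3-byte form that UTF-8 would give them if it were extended to them."""
    out = bytearray()
    for kind, value in tokens:
        if kind == "byte":
            out.append(value)
        elif kind == "unit16":
            out += chr(value).encode("utf-8", "surrogatepass")
        else:
            out += chr(value).encode("utf-8")
    return bytes(out)


def as_utf16_units(tokens):
    """16-bit view: raw bytes and lone surrogates are unchanged code units."""
    units = []
    for kind, value in tokens:
        if kind == "char" and value > 0xFFFF:
            value -= 0x10000
            units.append(0xD800 + (value >> 10))
            units.append(0xDC00 + (value & 0x3FF))
        else:
            units.append(value)
    return units


tokens = parse_scalar(r"caf\xE9 \u00E9 \uD800")
print(as_utf8(tokens))         # the \xE9 byte stays raw; \u00E9 becomes C3 A9
print(as_utf16_units(tokens))  # both \xE9 and \u00E9 become the unit 0x00E9
```

Note that in the byte view \xE9 and \u00E9 produce different output, while in the 16-bit view they are indistinguishable, which is the behaviour the first list describes for the UTF-16 API.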