From: William S. <sp...@rh...> - 2011-10-28 23:02:39
PLEASE!!! This is the main reason we cannot use unaltered YAML:

SUPPORT FOR INVALID UTF-8 AND UTF-16

There needs to be a way to put an arbitrary byte sequence into a scalar without losing the ability to make valid byte sequences human-readable. Currently YAML can only put byte sequences that are valid UTF-8 into scalars, unless some transformation is applied that makes some (often all) Unicode unreadable in the YAML input.

This has the counter-productive effect of *discouraging* use of Unicode on any byte-based backend that does not guarantee its byte sequences are valid UTF-8. Examples are all byte-based file formats, most internet protocols, Unix filenames, and Windows resource identifiers.

My recommendation follows. However, any solution that allows an arbitrary byte stream to be produced, while allowing valid UTF-8 bytes to be represented by the correct Unicode character in the YAML source, would be acceptable:

1. The backslash escape \xNN represents a "raw UTF-8 byte" with the given value. This differs from current YAML only for 0x80-0xFF. The sequence \u00NN must be used for actual Unicode code points in this range.

2. An API that requests YAML scalars as UTF-8 gets these bytes as raw data, inserted between the UTF-8 encoding of all other characters.

3. An API that requests scalars in some other form, such as UTF-16, gets these bytes as unchanged code units. This makes \xNN work identically to current YAML/JSON when the UTF-16 API is used. It may also allow invalid forms of other encodings to be supported.

In addition, invalid UTF-16 must also be supported. Support for invalid UTF-16 is more common, due to its use on Windows, where otherwise ignorant programmers have realized that they cannot work without supporting it. Technically the YAML spec does not allow invalid UTF-16, but my proposal here formalizes the actual support that is in most (all?) YAML and JSON implementations:

1. The backslash escape \uNNNN for NNNN in the range 0xD800..0xDFFF represents a "raw UTF-16 code unit".

2. An API that requests UTF-16 or other 16-bit code units will get these codes unchanged.

3. An API that requests bytes will get three bytes for each of these. The three bytes match the encoding you get from UTF-8 if you extend it to these invalid code points.
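
To make the two lists above concrete, here is a rough Python sketch of how a loader might behave under this proposal. It is not taken from any existing YAML library: the names scalar_to_bytes and scalar_to_utf16_units are hypothetical, and it only handles the \xNN and \uNNNN escapes (it ignores \\ and every other escape, so treat it as an illustration and not a parser).

```python
import re

# Hypothetical sketch of the proposal; not the API of any real YAML library.
# Only \xNN and \uNNNN are recognized. Other escapes (including \\) are
# deliberately ignored to keep the example short.
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})")


def scalar_to_bytes(text: str) -> bytes:
    """What a byte-oriented (UTF-8) API would return for a scalar."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        # Ordinary characters are encoded as normal UTF-8.
        out += text[pos:m.start()].encode("utf-8")
        if m.group(1) is not None:
            # \xNN: one raw byte, even if the result is invalid UTF-8.
            out.append(int(m.group(1), 16))
        else:
            # \uNNNN: encode the code point. "surrogatepass" extends UTF-8
            # to 0xD800..0xDFFF, producing three bytes for each of them.
            out += chr(int(m.group(2), 16)).encode("utf-8", "surrogatepass")
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


def _literal_units(s: str) -> list:
    """UTF-16 code units for ordinary (unescaped) text."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def scalar_to_utf16_units(text: str) -> list:
    """What a UTF-16 API would return: a list of 16-bit code units."""
    units = []
    pos = 0
    for m in _ESCAPE.finditer(text):
        units += _literal_units(text[pos:m.start()])
        # Both \xNN and \uNNNN pass through as a single unchanged code unit,
        # including unpaired surrogates.
        units.append(int(m.group(1) or m.group(2), 16))
        pos = m.end()
    units += _literal_units(text[pos:])
    return units
```

With this sketch, the source text caf\xE9 yields the bytes 63 61 66 E9 (not valid UTF-8), while caf\u00E9 and a literal "café" both yield 63 61 66 C3 A9. A lone \uD800 yields ED A0 80 from the byte API and the single code unit 0xD800 from the UTF-16 API.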