From: Devin J. <jea...@gm...> - 2011-10-29 02:54:50
> There needs to be a way to put an arbitrary byte sequence into a scalar
> without losing the ability to make valid byte sequences human-readable.

Mixing byte data into what is ostensibly Unicode seems like a bad idea.
Either a particular part of the document is Unicode or it is bytes, not
both.

That said, if you mix Unicode and bytes in the same file, it can no longer
be read exactly in a standard text editor. So I don't like that either.

Maybe two separate types of YAML file (text and binary/compressed)? More
than one protocol has had a "binary" version made of it.

Devin

On Fri, Oct 28, 2011 at 6:49 PM, William Spitzak <sp...@rh...> wrote:
> PLEASE!!! This is the main reason we cannot use unaltered YAML:
>
> SUPPORT FOR INVALID UTF-8 AND UTF-16
>
> There needs to be a way to put an arbitrary byte sequence into a scalar
> without losing the ability to make valid byte sequences human-readable.
>
> Currently YAML is limited to only putting byte sequences that are valid
> UTF-8 into scalars, unless some transformation is done that makes some
> (often all) Unicode unreadable in the YAML input. This has the
> counter-productive effect of *discouraging* use of Unicode on any
> backend that uses bytes where there is no guarantee that the backend
> limits the byte sequences to valid UTF-8. Examples are all byte-based
> file formats, most internet protocols, Unix filenames, and Windows
> resource identifiers.
>
> My recommendation is here. However, any solution that allows an arbitrary
> byte stream to be produced, while allowing valid UTF-8 bytes to be
> represented by the correct Unicode character in the YAML source, would
> be acceptable:
>
> 1. The backslash escape \xNN represents a "raw UTF-8 byte" with the
> given value. This is only different from current YAML for 0x80-0xFF. The
> sequence \u00NN must be used for actual Unicode code points in this range.
>
> 2. An API that requests YAML scalars as UTF-8 gets these bytes, inserted
> between the UTF-8 encoding of all other characters, as raw data.
>
> 3. An API that requests scalars in some other form, such as UTF-16, gets
> these bytes as unchanged code units. This makes \xNN work identically to
> current YAML/JSON when the UTF-16 API is used. It may also allow invalid
> forms of other encodings to be supported.
>
> In addition, invalid UTF-16 must also be supported. Support of invalid
> UTF-16 is more common, due to its use on Windows and therefore the
> realization by otherwise ignorant programmers of the inability to work
> without supporting it. Technically the YAML spec does not allow
> invalid UTF-16, but my proposal here formalizes the actual support that
> is in most (all?) YAML and JSON implementations:
>
> 1. The backslash escape \uNNNN for NNNN in the range 0xD800..0xDFFF
> represents a "raw UTF-16 code unit".
>
> 2. An API that requests UTF-16 or other 16-bit code units will get these
> codes unchanged.
>
> 3. An API that requests bytes will get three bytes for each of these;
> the three bytes match the encoding you get from UTF-8 if you extend it
> to these invalid code points.
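
For reference, the two sets of rules quoted above can be sketched in Python roughly as follows. This is only an illustration of the proposal as worded in the message, not the behaviour of any existing YAML library. The names scalar_to_bytes and scalar_to_utf16_units are hypothetical, and the sketch assumes the input is the body of a double-quoted scalar containing only \xNN and \uNNNN escapes (it ignores escaped backslashes and every other YAML escape).

```python
import re

# Hypothetical helpers illustrating the proposal; not part of any YAML library.
# Only \xNN and \uNNNN are recognised; other escapes are out of scope here.
_ESCAPE = re.compile(r'\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})')


def scalar_to_bytes(text: str) -> bytes:
    """Byte-oriented view: what an API requesting UTF-8 would receive."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        # Literal characters between escapes are emitted as ordinary UTF-8.
        out += text[pos:m.start()].encode("utf-8")
        if m.group(1) is not None:
            # \xNN is a raw byte, including the range 0x80-0xFF.
            out.append(int(m.group(1), 16))
        else:
            # \uNNNN is a code point. A lone surrogate gets the three-byte
            # form that results from extending UTF-8 to D800..DFFF.
            out += chr(int(m.group(2), 16)).encode("utf-8", "surrogatepass")
        pos = m.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


def _utf16_units(s: str) -> list:
    """Split a string of valid characters into 16-bit code units."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def scalar_to_utf16_units(text: str) -> list:
    """16-bit view: \\xNN and \\uNNNN both pass through as one code unit."""
    units = []
    pos = 0
    for m in _ESCAPE.finditer(text):
        units += _utf16_units(text[pos:m.start()])
        units.append(int(m.group(1) or m.group(2), 16))
        pos = m.end()
    units += _utf16_units(text[pos:])
    return units
```

Under these assumptions, the source text caf\xe9 yields the single raw byte 0xE9 after "caf", while caf\u00e9 and a literal "café" both yield the two-byte UTF-8 encoding of U+00E9, which is the distinction the first rule draws. In the 16-bit view, \xe9 and \u00e9 both come out as the code unit 0x00E9, matching the claim that \xNN then behaves as it does in current YAML/JSON.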