There is one notable disadvantage to using Base64 encoding: it makes it impossible for a human reading the file to tell what the value of the Nth byte is. Using \uXXXX notation, this is trivial. Of course, using \uXXXX notation for this is also inefficient and an abuse of the notation.

Would it be "useful" (in the real world) to have an alternative presentation for !!binary blobs, in which the data is presented as XX XX XX XX XX bytes? Would this address the need that led to the request for arbitrary \xXX and \uXXXX?

Something like:

blob: !!binary <prefix-TBD> A0B3 C4 DA BC12

(Hexadecimal, two characters per byte; whitespace is ignored.)

If this is the case, it is trivial to add it.
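To make the idea concrete, here is a rough sketch in Python (standard library only) of how a loader might read such a hex presentation. The function name decode_hex_blob is made up for illustration, and the <prefix-TBD> marker is left out because it is still undecided. It also prints the Base64 form of the same bytes for comparison.

```python
import base64

def decode_hex_blob(text: str) -> bytes:
    """Decode hex text, two characters per byte, ignoring all whitespace."""
    # Drop every whitespace run first, so "A0B3 C4" and "A0 B3 C4" are equal.
    compact = "".join(text.split())
    return bytes.fromhex(compact)

# The example value from this message (hypothetical data).
blob = decode_hex_blob("A0B3 C4 DA BC12")

# In the hex form a reader can find the Nth byte by counting pairs.
print(hex(blob[2]))

# The same six bytes in Base64, where byte boundaries are not visible.
as_base64 = base64.b64encode(blob).decode("ascii")
print(as_base64)
```

Reading the third byte out of the hex text is a matter of counting pairs; doing the same from the Base64 text requires decoding it first.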

Have fun,

    Oren Ben-Kiki

On Sat, Oct 29, 2011 at 9:43 AM, Oren Ben-Kiki <oren@ben-kiki.org> wrote:
Just to be clear - we are talking about allowing \xXX and \uXXXX with arbitrary XX values, regardless of whether the result is a valid Unicode code point; as opposed to allowing arbitrary unescaped bytes in the YAML stream itself (which would make it unreadable/uneditable).

If so, then Peter has a good point - what is wrong with Base64 encoding? That is, what is the use case where most of the data is valid Unicode, but it is sprinkled with the occasional piece of arbitrary binary data? I'll grant you that if you need such data, it is way more readable to use \x and/or \u escape sequences.
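A small illustration of the readability difference, as a sketch in Python (standard library only). The sample bytes are made up: a mostly valid UTF-8 name with one stray 0xFF byte. Python's "backslashreplace" error handler is used here only as a stand-in for the kind of \xNN escaping under discussion; it is not a YAML feature.

```python
import base64

# Made-up sample: valid UTF-8 text with a single invalid byte (0xFF) inside.
data = "café".encode("utf-8") + b"\xff" + b".txt"

# Base64 hides everything, including the parts that were readable text.
as_base64 = base64.b64encode(data).decode("ascii")

# Escaping only the undecodable byte keeps the valid text readable.
as_escaped = data.decode("utf-8", errors="backslashreplace")

print(as_base64)
print(as_escaped)
```

The escaped form keeps "café" and ".txt" legible and marks only the offending byte, which is the case where escapes win over Base64.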

But what kind of data is this? Does such data get loaded into a normal string type in the application, or into some sort of a binary buffer type?

Note that YAML makes it possible for you to use a tagged unquoted scalar containing any escape mechanism you want, e.g.:

    foo: !bar Baz \xXX \uUUUU #XXXX or whatever.

You may also come up with a way to use implicit tagging to avoid the need for an explicit tag. I'm still unclear what this would be used for...
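For what it is worth, the application side of such a tag could look roughly like the sketch below (Python, standard library only). Everything here is an assumption for illustration: the function name decode_tagged_scalar, the choice that \xNN means one raw byte, and the choice that \uNNNN means a code point written out as UTF-8, with lone surrogates passed through as three bytes. It does not handle an escaped backslash or literal lone surrogates in the text.

```python
import re

# Hypothetical escape syntax for an application-defined tag such as !bar:
#   \xNN   -> one raw byte with that value
#   \uNNNN -> that code point as UTF-8 (lone surrogates become three bytes)
_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})")

def decode_tagged_scalar(text: str) -> bytes:
    """Turn scalar text with \\xNN / \\uNNNN escapes into a byte string."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Ordinary text between escapes is encoded as normal UTF-8.
        out += text[pos:match.start()].encode("utf-8")
        if match.group(1) is not None:
            # \xNN: append the byte as-is, even if it breaks UTF-8 validity.
            out.append(int(match.group(1), 16))
        else:
            # \uNNNN: "surrogatepass" lets D800..DFFF through unchanged.
            code_point = chr(int(match.group(2), 16))
            out += code_point.encode("utf-8", "surrogatepass")
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)

raw = decode_tagged_scalar(r"Baz \xFF \u00E9 \uD800")
print(raw)
```

Note that the result is a byte buffer rather than a normal string, which is exactly the question raised above about what type the application loads such data into.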

Have fun,

    Oren Ben-Kiki.  


On Sat, Oct 29, 2011 at 7:53 AM, Peter Murphy <peterkmurphy@gmail.com> wrote:
William,

There's already a binary tag in YAML, which allows arbitrary binary
data to be encoded as base64.

http://yaml.org/type/binary.html

Is there any reason why that's not good enough for your needs?

Best regards,
Peter


On Sat, Oct 29, 2011 at 3:03 PM, Ingy dot Net <ingy@ingy.net> wrote:
> Hi Devin,
>
> I added your comments to
> https://github.com/yaml2/YAML2/wiki/Unicode-Strictness
>
> This topic seems like it would be more pointed if people actually wrote
> some test cases about what they were talking about. It's so easy to be
> misunderstood on this type of issue. But with actual test files, it's much
> less so.
>
> The best way to do this would be to simply create a repo on github and push
> some files up. Even if it's not completely running code, it would be
> helpful. Of course, running code in something like pyyaml would be even
> better.
>
> I really don't want to have YAML2 discussions, without actual tests to make
> people's points. William, it would be great if you could post some files
> that elegantly show off your concerns. Otherwise it just feels like
> conjecture.
>
> Ingy
>
> On Fri, Oct 28, 2011 at 7:54 PM, Devin Jeanpierre <jeanpierreda@gmail.com>
> wrote:
>>
>> > There needs to be a way to put an arbitrary byte sequence into a scalar
>> > without losing the ability to make valid byte sequences human-readable.
>>
>> Mixing byte data into what is ostensibly unicode seems like a bad
>> idea. Either have a particular part of the document be unicode, or
>> bytes, not both.
>>
>> That said, if you mix unicode and bytes in the same file, it ceases to
>> be exactly readable in a standard text editor. So I don't like that
>> either. Maybe two separate types of YAML file (text /
>> binary/compressed)? More than one protocol has had a "binary" version
>> made of it.
>>
>> Devin
>>
>> On Fri, Oct 28, 2011 at 6:49 PM, William Spitzak <spitzak@rhythm.com>
>> wrote:
>> > PLEASE!!! This is the main reason we cannot use unaltered YAML:
>> >
>> > SUPPORT FOR INVALID UTF-8 AND UTF-16
>> >
>> > There needs to be a way to put an arbitrary byte sequence into a scalar
>> > without losing the ability to make valid byte sequences human-readable.
>> >
>> > Currently YAML is limited to only putting byte sequences that are valid
>> > UTF-8 into scalars, unless some transformation is done that makes some
>> > (often all) Unicode unreadable in the YAML input. This has the
>> > counter-productive effect of *discouraging* use of Unicode on any
>> > backend that uses bytes where there is no guarantee that the backend
>> > limits the byte sequences to valid UTF-8. Examples are all byte-based
>> > file formats, most internet protocols, Unix filenames, and Windows
>> > resource identifiers.
>> >
>> > My recommendation is here. However any solution that allows an arbitrary
>> > byte stream to be produced, while allowing valid UTF-8 bytes to be
>> > represented by the correct Unicode character in the YAML source, would
>> > be acceptable:
>> >
>> > 1. The backslash escape of \xNN represents a "raw UTF-8 byte" with the
>> > given value. This is only different from current YAML for 0x80-0xFF. The
>> > sequence \u00NN must be used for actual Unicode code points in this
>> > range.
>> >
>> > 2. An api that requests YAML scalars as UTF-8 gets these bytes, inserted
>> > between the UTF-8 encoding of all other characters, as raw data.
>> >
>> > 3. An api that requests scalars in some other form, such as UTF-16, gets
>> > these bytes as unchanged code units. This makes \xNN work identically to
>> > current YAML/JSON when the UTF-16 api is used. It may also allow invalid
>> > forms of other encodings to be supported.
>> >
>> > In addition invalid UTF-16 must also be supported. Support of invalid
>> > UTF-16 is more common, due to its use on Windows and therefore the
>> > realization by otherwise ignorant programmers of the inability to work
>> > without supporting them. Technically the YAML spec does not allow
>> > invalid UTF-16, but my proposal here formalizes the actual support that
>> > is in most (all?) YAML and JSON implementations:
>> >
>> > 1. The backslash escape of \uNNNN for NNNN in the range 0xD800..0xDFFF
>> > represents a "raw UTF-16 code unit".
>> >
>> > 2. An api that requests UTF-16 or other 16-bit code units will get these
>> > codes unchanged.
>> >
>> > 3. An api that requests bytes will get three bytes for each of these,
>> > matching the encoding you get from UTF-8 if you extend it to these
>> > invalid code points.
>> >
>> >
>> > _______________________________________________
>> > Yaml-core mailing list
>> > Yaml-core@lists.sourceforge.net
>> > https://lists.sourceforge.net/lists/listinfo/yaml-core
>> >
>>
>>
>
>
>
>



--
Email: peterkmurphy@gmail.com
WWW: http://www.pkmurphy.com.au/