Re: [Yaml-core] utf8u tag proposal

Osamu TAKEUCHI wrote:

> I point out that no YAML program will ignore an unknown tag. They will 
> just reject a document with an unknown node. So, I think
> it will not be so beneficial to have !!utf8u as close as
> possible to the original string.

The reason for keeping it close to the original string is so that a 
user with a text editor can "fix" a string by deleting the tag.

> In addition, it seems easier for me to convert all "%25" to "%" with a 
> text editor than to distinguish "%cat" from "%dog" by my eyes.

Yes, only a program should try to do this. But it does mean that a % 
the user forgot to convert when typing will very likely be 
interpreted correctly.

> So, I vote one to the form "!!utf8u a%b%c" as the canonical form.

I believe you meant "a%25b%25c". It was also suggested that the 
canonical form replace *all* bytes with %xx, so it might be 
"%61%25%62%25%63".
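To make the two candidate canonical forms concrete, here is a minimal Python sketch. The function names are hypothetical, and the rule that bytes which are not valid UTF-8 are also written as %XX is my assumption about the proposal, not something settled in this thread.

```python
def encode_minimal(data: bytes) -> str:
    """Escape only '%' and bytes that are not valid UTF-8 (assumed rule)."""
    out = []
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN,
    # which lets us spot the invalid bytes after decoding.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if ch == "%":
            out.append("%25")
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append(f"%{cp - 0xDC00:02X}")
        else:
            out.append(ch)
    return "".join(out)


def encode_all(data: bytes) -> str:
    """Escape every byte as %XX, the other suggested canonical form."""
    return "".join(f"%{b:02X}" for b in data)


print(encode_minimal(b"a%b%c"))  # a%25b%25c
print(encode_all(b"a%b%c"))      # %61%25%62%25%63
```

The first form stays readable and editable by hand; the second is unambiguous but opaque to a person with a text editor.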

> At the 
> same time, I think that a parser can also accept "!!utf8u a%b%c" as 
> "a%b%c" because there is no uncertainty about such a flexible 
> interpretation.

Absolutely. Also this avoids having to define an "error", which I 
absolutely do not want!
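A decoder with no error case could look roughly like the Python sketch below. The name `decode_utf8u` is hypothetical, and the rule (a % followed by two hex digits is a byte, any other % is a literal %) is my reading of the discussion rather than agreed wording.

```python
import re

# A '%' followed by exactly two hex digits is an escaped byte.
_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def decode_utf8u(text: str) -> bytes:
    """Turn a !!utf8u scalar into bytes; never raises on a stray '%'."""
    raw = text.encode("utf-8")
    # Any '%' that the pattern does not match is left in place untouched.
    return _ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")), raw
    )


print(decode_utf8u("a%25b%25c"))  # b'a%b%c'
print(decode_utf8u("a%b%c"))      # b'a%b%c'
```

This also shows the "%cat" versus "%dog" problem: "%ca" happens to be valid hex, so "%cat" decodes to the byte 0xCA followed by "t", while "%dog" comes back unchanged.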

> Imagine my ruby application is using the YAML library...

I'm guessing Ruby is one of the many programs that use UTF-16, or maybe 
UCS-2, and call it "Unicode". You are basically saying "the code must 
not return something that will make my Ruby program throw an error". The 
problem is that I think this is a bug in Ruby. Programmers and users 
think of the conversion as a "cast" and do not expect errors. You are 
attempting to patch this by changing every other API in the world that 
can produce input data, artificially limiting them to the subset that 
won't throw errors. This is very bad programming practice and will never 
work, for the simple reason that programs can have bugs, so it is 
impossible to assume the output stays within a given subset. The real 
solution is to fix Ruby/Python so the "cast" really is just that. A 
second call can be added to check whether the cast was lossy.

I can tell you what I think a proper API for YAML is, though I know it 
is hopeless to convince people here:

1. One call to return UTF-8. "UTF-8" means an array of bytes and 
therefore this can return *any* array of bytes. This will return exactly 
the byte stream in the file if the file is encoded in UTF-8, except for 
the few ASCII characters that are part of YAML syntax.

2. *Another* call to return UTF-16. The main purpose is to provide data 
that will not make Ruby/Python throw an error, though this may be 
implemented more efficiently if the YAML file is UTF-16 encoded. If the 
file is UTF-8, invalid bytes are decoded to 0xDCxx. The UTF-8 can still 
be accessed with the other call. This is necessary because this call is 
lossy.

3. A call to return "errors" with the current string, although I think 
you will be surprised at how little this will be used! Since the YAML 
parser has already scanned the string in order to parse the file, it can 
often detect these errors almost for free. One is invalid UTF-8, another 
is invalid 
UTF-16. There can also be indicators for non-characters, control 
characters, wrong canonical form, and all the other things that can be 
"wrong" about a string.
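
The three calls above can be sketched in Python as follows. Everything here is illustrative: the `Scalar` class, its method names, and the `StringIssue` flags are hypothetical, only a few of the possible checks are shown, and I am using Python's "surrogateescape" handler as a stand-in for the 0xDCxx mapping.

```python
from dataclasses import dataclass
from enum import Flag, auto


class StringIssue(Flag):
    NONE = 0
    INVALID_UTF8 = auto()
    CONTROL_CHARS = auto()
    NONCHARACTERS = auto()


@dataclass(frozen=True)
class Scalar:
    raw: bytes  # bytes of the scalar as found in a UTF-8 encoded file

    def as_utf8(self) -> bytes:
        """Call 1: the bytes exactly as stored; never fails."""
        return self.raw

    def as_text(self) -> str:
        """Call 2 (as code points): invalid byte 0xNN becomes U+DCNN."""
        return self.raw.decode("utf-8", errors="surrogateescape")

    def as_utf16(self) -> bytes:
        """Call 2 (as UTF-16LE code units), lone surrogates included."""
        return self.as_text().encode("utf-16-le", errors="surrogatepass")

    def issues(self) -> StringIssue:
        """Call 3: report what is 'wrong' without rejecting the string."""
        found = StringIssue.NONE
        for ch in self.as_text():
            cp = ord(ch)
            if 0xDC80 <= cp <= 0xDCFF:
                found |= StringIssue.INVALID_UTF8
            elif (cp < 0x20 and ch not in "\t\n\r") or cp == 0x7F:
                found |= StringIssue.CONTROL_CHARS
            elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
                found |= StringIssue.NONCHARACTERS
        return found


s = Scalar(b"a\xffb")
print(s.as_utf8())   # b'a\xffb'
print(s.issues())    # reports INVALID_UTF8
```

The point of the split is that none of the accessors can fail: a caller that cares about validity asks `issues()`, and everyone else just gets their string.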