From: William S. <sp...@rh...> - 2009-09-28 19:17:00
Osamu TAKEUCHI wrote:
> I point out that no YAML program will ignore any unknown Tag. They will
> just reject a document with unknown node. So, I think
> it will not be so beneficial to have !!utf8u as close as
> possible to the original string.

The reason is so a user with a text editor can "fix" a string by deleting the tag.

> In addition, it seems easier for me to convert all "%25" to "%" with a
> text editor than to distinguish "%cat" from "%dog" by my eyes.

Yes, only a program should try to do this. But it does mean it is very likely that a % the user missed converting when typing will be interpreted correctly.

> So, I vote one to the form "!!utf8u a%b%c" as the canonical form.

I believe you meant "a%25b%25c". It was also suggested that the canonical form replace *all* bytes with %xx, so it might be "%61%25%62%25%63".

> At the
> same time, I think that a parser can also accept "!!utf8u a%b%c" as
> "a%b%c" because there is no uncertainty about such a flexible
> interpretation.

Absolutely. Also this avoids having to define an "error", which I absolutely do not want!

> Imagine my ruby application is using the YAML library...

I'm guessing Ruby is one of the many programs that use UTF-16 or maybe UCS-2 and call it "Unicode". You are basically saying "the code must not return something that will make my Ruby program throw an error". The problem is that I think this is a bug in Ruby. Programmers and users think of the conversion as a "cast" and do not expect errors. You are attempting to patch this by changing every other API in the world that can produce input data and artificially limiting them to the subset that won't throw errors. This is very bad programming practice and will never work, for the simple reason that programs can have bugs, so assuming the output is a given subset is impossible. The real solution is to fix Ruby/Python so the "cast" really is just that. A second call to see if the cast is lossy can be added.
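To make the %xx discussion above concrete, here is a rough Python sketch of the lenient reading: a % followed by two hex digits is an escape, and any other % is taken literally, so no "error" case needs to be defined. The function names are made up for illustration, and the exact !!utf8u rules are an assumption drawn from this thread, not from any YAML specification.

```python
import re

# Assumption: an escape is '%' followed by exactly two hex digits.
_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def decode_utf8u(text: str) -> bytes:
    """Lenient decode: unescape %xx, leave any other '%' untouched."""
    raw = text.encode("utf-8")
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


def encode_minimal(data: bytes) -> str:
    """Escape only '%' and bytes that are not valid UTF-8."""
    out = []
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if ch == "%":
            out.append("%25")
        elif 0xDC80 <= cp <= 0xDCFF:
            out.append(f"%{cp - 0xDC00:02X}")
        else:
            out.append(ch)
    return "".join(out)


def encode_all(data: bytes) -> str:
    """The other suggested canonical form: every byte written as %xx."""
    return "".join(f"%{b:02X}" for b in data)
```

With this reading, "a%25b%25c", "a%b%c" and "%61%25%62%25%63" all decode to the same five bytes. It also shows the "%cat" versus "%dog" trap: "%ca" is a valid escape and "%do" is not, which is why only a program should be making that call.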
I can tell you what I think a proper API for YAML is, though I know it is hopeless to convince people here:

1. One call to return UTF-8. "UTF-8" means an array of bytes and therefore this can return *any* array of bytes. This will return exactly the byte stream in the file if the file is encoded in UTF-8, except for the few ASCII characters that are part of YAML syntax.

2. *Another* call to return UTF-16. The main purpose is to provide data that will not make Ruby/Python throw an error, though this may be implemented more efficiently if the YAML file is UTF-16 encoded. If the file is UTF-8, invalid bytes are decoded to 0xDCxx. The UTF-8 can still be accessed with the other call; this is necessary as this call is lossy.

3. A call to return "errors" with the current string. Although I think you will be surprised at how little this will be used! Since YAML scanned the string in order to parse the file, it can often detect these errors almost for free. One is invalid UTF-8, another is invalid UTF-16. There can also be indicators for non-characters, control characters, wrong canonical form, and all the other things that can be "wrong" about a string.
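As a rough illustration of the three calls above, a Python version might look something like the following. All names here (Scalar, StringIssue, as_bytes, as_text, issues) are hypothetical and do not come from any existing YAML library. A Python str stands in for the "UTF-16" call, and the standard surrogateescape error handler is used because it maps each invalid byte 0xNN to U+DCNN. Only three of the possible indicators are shown.

```python
import unicodedata
from enum import Flag, auto


class StringIssue(Flag):
    """Hypothetical indicators for things that can be "wrong" with a string."""
    NONE = 0
    INVALID_UTF8 = auto()
    CONTROL_CHAR = auto()
    NONCHARACTER = auto()


class Scalar:
    """Sketch of a scalar node that keeps the raw bytes from the file."""

    def __init__(self, raw: bytes):
        self._raw = raw

    def as_bytes(self) -> bytes:
        # Call 1: the bytes exactly as stored; never fails.
        return self._raw

    def as_text(self) -> str:
        # Call 2: a "cast" that never throws; invalid bytes become U+DCxx.
        return self._raw.decode("utf-8", errors="surrogateescape")

    def issues(self) -> StringIssue:
        # Call 3: report problems separately instead of raising.
        found = StringIssue.NONE
        for ch in self.as_text():
            cp = ord(ch)
            if 0xDC80 <= cp <= 0xDCFF:
                found |= StringIssue.INVALID_UTF8
            elif unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
                found |= StringIssue.CONTROL_CHAR
            elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
                found |= StringIssue.NONCHARACTER
        return found
```

A caller that just wants the data uses as_bytes() or as_text() and never sees an exception; a caller that cares whether the string is clean checks issues() afterwards.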