From: Osamu T. <os...@bi...> - 2009-09-22 23:33:03
Hi William,

Don't be shocked by my possible misunderstanding. To be honest, I myself have no need for utf8u in my real applications at this moment, so I was too lazy to study it and Unicode encoding thoroughly. That is simply my fault.

I will not discuss whether utf8u really defines a new data type, or how it should be compared to !!str *by YAML*, because I don't feel I am the right person for that. What I am saying is that if utf8u does not define a new data type, we should stop and think before adding it to the tag repository. In that case, having a "Node Encoding" property might work well for it. If utf8u does define a new data type, I have no objection to having it in the tag repository.

Best,
Osamu TAKEUCHI

> Osamu TAKEUCHI wrote:
>
>> Since utf8u data can not be safely stored in a utf8 string variable,
>> the raw data is stored in a byte array.
>
> I think you said "utf8" when you meant "Unicode". At least I am going to
> hope so!
>
> You seem to have this idea that it is not "Unicode" when it is stored in
> a byte array, but somehow storing it in a 16-bit word array makes it
> "Unicode", even though it is still variable length and can still contain
> invalid sequences!
>
> "utf8u" data CAN be "safely" stored in a "utf8" string, because it *IS*
> a UTF-8 string! They are arrays of bytes! Claiming that "invalid
> sequences" somehow make it "not be UTF-8" is like claiming that
> misspelled words make it not be UTF-8.
>
> The ONLY reason for elevating "invalid sequences" to this magical
> importance is the ulterior motive of making sure it is impossible to use
> UTF-8, perhaps to validate your previous decision to use "wide
> characters" and an unwillingness to admit that it was a mistake.
>
>> This class is only used temporarily after the data is input from the
>> user and before it is validated to be a valid utf8 string for future
>> use.
>
> NO! This class is PERMANENT. I will convert all strings ***TO*** this
> "utf8u" form.
> Translation to glyphs is the "temporary" storage and is the one I do
> NOT want in the file and will NOT put in my data structure!!!
> Translation to glyphs is lossy (because I will not throw an error but
> instead use replacement characters), so it cannot be used to store
> anything!
>
> I have to say I am absolutely floored and shocked at the failure of
> obviously intelligent people to understand this. For some reason UTF-8
> turns geniuses into morons: they suddenly act as though everything that
> works with byte arrays is forgotten, or as though there is circuitry
> that will crash their program the moment an invalid byte sequence is
> stored, when in fact invalid byte sequences are trivial to detect and
> can be stored losslessly.
>
> THINK! Imagine it is binary data. Would you go through such elaborate
> difficulty trying to make sure that the data, when read or written, was
> in some legal form? Or would you defer this until after the binary data
> is loaded in memory and let the code that interprets it figure this
> out? What is so magical about UTF-8 that this cannot be done?
>
> Or pretend the characters are words in ASCII text, and that invalid
> byte sequences are misspelled words. Imagine how you would write the
> program if some strings contained misspelled words, and try applying
> the same ideas to UTF-8.
>
>> This indeed declares a different type from !!str.
>
> NO NO NO NO NO NO!!!! I will convert all strings TO "utf8u" so they are
> IDENTICAL!!!
>
>> ns-tag-char seems to contain "#".
>
> The libyaml source code is incorrect, then. Somebody should fix it.
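[Editor's note: the quoted claim that invalid byte sequences are "trivial to detect and can be stored losslessly", while translation to glyphs via replacement characters is lossy, can be illustrated with a small sketch. This is not code from the thread; it uses Python's standard codec machinery (the example byte string is made up) purely to demonstrate the three behaviors being argued about.]

```python
# Raw bytes containing a valid UTF-8 sequence (0xC3 0xA9 = 'é')
# followed by bytes (0xFF 0xFE) that are invalid in UTF-8.
data = b"valid \xc3\xa9 then invalid \xff\xfe bytes"

# 1. Detection is trivial: a strict decode raises on the first
#    invalid sequence, so validity is a simple yes/no check.
try:
    data.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)  # False

# 2. Lossless storage: the raw bytes can simply be kept as bytes;
#    if a string type is required, the "surrogateescape" error
#    handler round-trips the invalid bytes exactly.
text = data.decode("utf-8", errors="surrogateescape")
assert text.encode("utf-8", errors="surrogateescape") == data

# 3. Translation to glyphs IS lossy: replacement characters (U+FFFD)
#    overwrite the invalid bytes, so the original data is unrecoverable.
glyphs = data.decode("utf-8", errors="replace")
assert glyphs.encode("utf-8") != data
```

This mirrors the distinction being made in the email: the byte array (or its escaped equivalent) is the faithful permanent form, while the replacement-character rendering is only suitable for display.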