From: William S. <sp...@rh...> - 2009-08-26 19:39:56
Oren Ben-Kiki wrote:

> I have no special interest in "helping Unicode". All I want is the
> security of knowing that a YAML file is portable across systems,

My files contain the valid UTF-8 encodings of '\', 'X', and two hex
digits. They are valid Unicode and are therefore "portable across
systems".

> What encoding is used for file names?

UTF-8. The fact that it is physically possible for the byte arrays
naming the file to contain an invalid UTF-8 sequence does not change
this, any more than the possibility of a misspelled word means that we
cannot say a text file is in English.

> When you type "ls" in a shell inside an xterm window, and it
> displays the file names, somewhere along the chain there _must_ be an
> assumption about the encoding used for the file names.

The assumption is made inside the code in xterm that decides how to
*DISPLAY* the string on the screen (try "glob" instead of "ls" if you
are confused by ls's columnizing of the filenames, which is itself a
*display* function).

> The POSIX standard solves this by mandating that filenames must be
> encoded according to the LC_CTYPE locale setting (I don't recall
> chapter and verse off the top of my head; it has been a while).

Wrong. The current convention is "you must set the locale to something
using UTF-8 so that programs that still pay attention to it will not
misinterpret the UTF-8 filenames." If you do not do this, then UTF-16
filesystems such as NTFS are unreadable and filenames will not match
between platforms.

> What happens if the bytes in the filename are not valid according to
> the LC_CTYPE locale setting?

Nothing happens until there is a need to interpret the filename as
Unicode code points, such as when an attempt is made to render it on
the screen. At that point the display routine should figure out a
useful method of showing the erroneous bytes to the user. Lossy
methods such as treating the bytes as CP1252 are harmless here (this
is in fact the solution used by web browsers).

> For example, what would a Java program (that can only deal with
> UTF-16 characters) see when it fetches the name of such a file? I'm
> not really certain but I suspect it isn't pretty...

Java has a bug and cannot operate on a UTF-8 filesystem. It needs
either a new byte-oriented API to the filesystem, or (more likely) a
lossless method to encode UTF-8 into UTF-16 arrays. A popular one is
to decode the UTF-8 but turn each invalid byte into a 0xDC80..0xDCFF
word, while treating the UTF-8 encodings of U+DC80..U+DCFF themselves
as invalid byte sequences (a sketch of this scheme follows below). A
bug in Java is no excuse. If I wrote a decoder that threw an error on
misspelled words, would that mean YAML must now be defined to disallow
misspelled words? Or would you correctly say that the problem is my
decoder?

> At any rate, YAML doesn't attempt to act as a "transparent pipeline"
> allowing you to put anything-at-all on one side and get
> exactly-the-same-thing on the other. It definitely _does_ have
> restrictions about "what you can put in" (e.g., no duplicate keys in
> mappings, so PHP must therefore use sequence-of-single-pair-mappings
> for its weird data).

Being able to read raw invalid UTF-8 seemed to follow from YAML's
basic design principle of being "editable by users". However, I
believe you are worried that programmers will use the ability as a
"data compression" method for arbitrary binary data, and I agree this
is a problem. The \XNN sequence expands random binary data by roughly
3 times (an escaped byte costs four output characters, while printable
ASCII passes through unchanged), so programmers will probably prefer
the 4/3 expansion of base64.
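For what it is worth, Python 3 ships essentially the 0xDCxx trick
described above as its "surrogateescape" error handler (PEP 383),
which makes for a compact round-trip demonstration; the file name
below is made up:

    # Decoding maps the lone invalid byte 0xE9 to the surrogate U+DCE9.
    raw = b"caf\xe9.txt"   # 0xE9 is Latin-1 e-acute: not valid UTF-8
    name = raw.decode("utf-8", errors="surrogateescape")
    assert name == "caf\udce9.txt"

    # Encoding maps U+DCE9 back to the byte 0xE9, so nothing is lost.
    assert name.encode("utf-8", errors="surrogateescape") == raw

Note the other half of the scheme: a strict decoder must treat real
U+DC80..U+DCFF sequences in the input as invalid, otherwise two
different byte arrays could decode to the same string.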
However, that won't stop a determined programmer from coding their own
writer that uses raw bytes as much as possible. There may be a
solution where libyaml could continue to reject invalid UTF-8 input:
we could assume the editor sanitizes the invalid UTF-8 into the
equivalent \XNN sequences (a sketch of such a sanitizer closes this
message). I'm sure Emacs could be told to do this in "yaml mode". I
initially wanted to use \nnn octal escapes, because there actually are
examples of existing programs that produce that escape. I did not do
it because of your use of \0 for NUL (it would break if followed by a
'0' through '7' byte). Using \xNN would also have the nice property of
matching existing programs' output, but it is obviously incompatible
with YAML's existing \xNN Unicode escape.

> IMO, if the LC_CTYPE locale is UTF-8, the kernel would be within its
> rights to _reject_ any attempts to create files containing invalid
> UTF-8 bytes, or at minimum convert them into an "equivalent" valid
> UTF-8 string.

The filesystem (not the kernel) is allowed to reject ANY byte array
for any reason it wants. For instance, the FAT driver rejects an awful
lot of byte sequences, such as any containing ':', or a name that,
with the case of its ASCII letters swapped, matches an existing file
(making one of them unreadable). All file systems I have seen reject
strings containing a '/' where the part before the '/' is not an
existing directory. I fail to see, however, how it is any kind of
excuse for libyaml to reject a certain subset of byte arrays just
because some filesystem MAY reject them.

> AFAIK Apple's OS-X does exactly that, since it uses UTF-16
> internally in HFS (even if it presents filenames as UTF-8 to the
> applications).

OS X can handle NFS mounts, and NFS can explicitly return arbitrary
byte arrays, including invalid UTF-8. Terminal.app decodes strings as
though they are UTF-8 and (surprise for me) displays invalid byte
sequences as the hex code in a block. HFS translates the Unicode into
decomposed normalized form and therefore rejects and remaps quite a
number of UTF-8 sequences. I fail to see how libyaml doing a tiny
subset of its manipulations helps me.

> Linux does not do this; I'm not certain about BSD and Solaris and
> AIX and QNX and all the other UNIXes out there.

Linux treats filenames as arrays of bytes, and explicitly supports
filesystems such as ext3 where all possible arrays of bytes not
containing 0x2F ('/') or 0x00 identify different files.

> Then there's Plan9 ("UNIX as it should have been") who _invented_
> UTF-8 and I'm willing to bet that they definitely enforce valid
> UTF-8 filenames.

Plan 9 accepted any array of bytes as a filename as long as it did not
contain a 0x00 or 0x2F byte; read the 9P protocol specification for
details. The 8½ window system did interpret UTF-8 to draw glyphs on
the screen. Assuming it used the chartorune() function to look up the
glyphs, it drew each invalid UTF-8 byte as a '?', and it also did not
detect overlong encodings as errors. Cut & paste handled the original
UTF-8 as raw bytes (8½ simply indicated the point in the string
closest to the glyph the user clicked on), so even this unsafe and
lossy practice was harmless in Plan 9. This is an excellent example
where interpretation was correctly deferred until display.
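Here is a minimal sketch of that sanitizer, in the same Python as
before. The \XNN escape is the proposal from this thread (not an
existing YAML escape), and the helper name sanitize_to_xnn is made up
for illustration; it reuses the surrogateescape handler to locate the
invalid bytes:

    def sanitize_to_xnn(raw: bytes) -> str:
        """Pass valid UTF-8 through; rewrite each invalid byte as \\XNN."""
        out = []
        # surrogateescape marks each invalid byte b as the lone
        # surrogate U+DC00+b, so exactly those need the \XNN rewrite.
        for ch in raw.decode("utf-8", errors="surrogateescape"):
            if 0xDC80 <= ord(ch) <= 0xDCFF:
                out.append("\\X%02X" % (ord(ch) - 0xDC00))
            else:
                out.append(ch)
        return "".join(out)

    assert sanitize_to_xnn(b"plain ascii") == "plain ascii"
    assert sanitize_to_xnn(b"caf\xe9") == "caf\\XE9"  # stray Latin-1 byte

A real sanitizer would also have to cooperate with YAML's quoting
(backslashes inside double-quoted scalars are themselves escaped);
this only shows the byte-level rewriting.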