From: William S. <sp...@rh...> - 2009-08-26 19:39:56
Oren Ben-Kiki wrote:

> I have no special interest in "helping Unicode". All I want is the
> security of knowing that a YAML file is portable across systems,

My files contain the valid UTF-8 encodings of '\', 'X', and two hex
digits. They are valid Unicode and are therefore "portable across
systems".

> What encoding is used for file names?

UTF-8. The fact that it is physically possible for the byte arrays
naming the file to contain an invalid UTF-8 sequence does not change
this, any more than the possibility of a misspelled word means that we
cannot say a text file is in English.

> When you type "ls" in a shell inside an xterm window, and it
> displays the file names, somewhere along the chain there _must_ be an
> assumption about the encoding used for the file names.

The assumption is made inside the code in xterm that decides how to
*DISPLAY* the string on the screen (try "glob" instead of "ls" if you
are confused by ls's columnizing of the filenames, which is itself a
*display* function).

> The POSIX standard solves this by mandating that filenames must be
> encoded according to the LC_CTYPE locale setting (I don't recall
> chapter and verse off the top of my head; it has been a while).

Wrong. The current convention is "you must set the locale to something
using UTF-8 so that programs that still pay attention to it will not
misinterpret the UTF-8 filenames." If you do not do this, then UTF-16
filesystems such as NTFS are unreadable and filenames will not match
between platforms.

> What happens if the bytes in the filename are not valid according to
> the LC_CTYPE locale setting?

Nothing happens until there is a need to interpret the filename as
Unicode code points, such as when an attempt is made to render it on
the screen. At that point the display routine should figure out a
useful method of showing the erroneous bytes to the user. Lossy
methods such as treating the bytes as CP1252 are harmless here (this
is in fact the solution used by web browsers).

> For example, what would a Java program (that can only deal with
> UTF-16 characters) see when it fetches the name of such a file? I'm
> not really certain but I suspect it isn't pretty...

Java has a bug and cannot operate on a UTF-8 filesystem. It needs
either a new byte-oriented API to the filesystem, or (more likely) a
lossless method to encode UTF-8 into UTF-16 arrays. A popular one is
to decode the UTF-8 but turn each invalid byte into a 0xDC80..0xDCFF
word, while treating the UTF-8 encodings of U+DC80..U+DCFF themselves
as invalid byte sequences (a sketch of this scheme follows below). A
bug in Java is no excuse. If I wrote a decoder that threw an error on
misspelled words, would that mean YAML must now be defined to disallow
misspelled words? Or would you correctly say that the problem is my
decoder?

> At any rate, YAML doesn't attempt to act as a "transparent pipeline"
> allowing you to put anything-at-all on one side and get
> exactly-the-same-thing on the other. It definitely _does_ have
> restrictions about "what you can put in" (e.g., no duplicate keys in
> mappings, so PHP must therefore use sequence-of-single-pair-mappings
> for its weird data).

Being able to read raw invalid UTF-8 seemed to follow from YAML's
basic design principle of being "editable by users". However, I
believe you are worried that programmers will use the ability as a
"data compression" method for arbitrary binary data, and I agree this
is a problem. The \XNN sequence expands random binary data by roughly
3 times (an escaped byte costs four output characters, while printable
ASCII passes through unchanged), so programmers will probably prefer
the 4/3 expansion of base64.
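For what it is worth, Python 3 ships essentially the 0xDCxx trick
described above as its "surrogateescape" error handler (PEP 383),
which makes for a compact round-trip demonstration; the file name
below is made up:

    # Decoding maps the lone invalid byte 0xE9 to the surrogate U+DCE9.
    raw = b"caf\xe9.txt"   # 0xE9 is Latin-1 e-acute: not valid UTF-8
    name = raw.decode("utf-8", errors="surrogateescape")
    assert name == "caf\udce9.txt"

    # Encoding maps U+DCE9 back to the byte 0xE9, so nothing is lost.
    assert name.encode("utf-8", errors="surrogateescape") == raw

Note the other half of the scheme: a strict decoder must treat real
U+DC80..U+DCFF sequences in the input as invalid, otherwise two
different byte arrays could decode to the same string.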
However, that won't stop a determined programmer from coding their own
writer that uses raw bytes as much as possible. There may be a
solution where libyaml could continue to reject invalid UTF-8 input:
we could assume the editor sanitizes the invalid UTF-8 into the
equivalent \XNN sequences (a sketch of such a sanitizer closes this
message). I'm sure Emacs could be told to do this in "yaml mode". I
initially wanted to use \nnn octal escapes, because there actually are
examples of existing programs that produce that escape. I did not do
it because of your use of \0 for NUL (it would break if followed by a
'0' through '7' byte). Using \xNN would also have the nice property of
matching existing programs' output, but it is obviously incompatible
with YAML's existing \xNN Unicode escape.

> IMO, if the LC_CTYPE locale is UTF-8, the kernel would be within its
> rights to _reject_ any attempts to create files containing invalid
> UTF-8 bytes, or at minimum convert them into an "equivalent" valid
> UTF-8 string.

The filesystem (not the kernel) is allowed to reject ANY byte array
for any reason it wants. For instance, the FAT driver rejects an awful
lot of byte sequences, such as any containing ':', or a name that,
with the case of its ASCII letters swapped, matches an existing file
(making one of them unreadable). All file systems I have seen reject
strings containing a '/' where the part before the '/' is not an
existing directory. I fail to see, however, how it is any kind of
excuse for libyaml to reject a certain subset of byte arrays just
because some filesystem MAY reject them.

> AFAIK Apple's OS-X does exactly that, since it uses UTF-16
> internally in HFS (even if it presents filenames as UTF-8 to the
> applications).

OS X can handle NFS mounts, and NFS can explicitly return arbitrary
byte arrays, including invalid UTF-8. Terminal.app decodes strings as
though they are UTF-8 and (surprise for me) displays invalid byte
sequences as the hex code in a block. HFS translates the Unicode into
decomposed normalized form and therefore rejects and remaps quite a
number of UTF-8 sequences. I fail to see how libyaml doing a tiny
subset of its manipulations helps me.

> Linux does not do this; I'm not certain about BSD and Solaris and
> AIX and QNX and all the other UNIXes out there.

Linux treats filenames as arrays of bytes, and explicitly supports
filesystems such as ext3 where all possible arrays of bytes not
containing 0x2F ('/') or 0x00 identify different files.

> Then there's Plan9 ("UNIX as it should have been") who _invented_
> UTF-8 and I'm willing to bet that they definitely enforce valid
> UTF-8 filenames.

Plan 9 accepted any array of bytes as a filename as long as it did not
contain a 0x00 or 0x2F byte; read the 9P protocol specification for
details. The 8½ window system did interpret UTF-8 to draw glyphs on
the screen. Assuming it used the chartorune() function to look up the
glyphs, it drew each invalid UTF-8 byte as a '?', and it also did not
detect overlong encodings as errors. Cut & paste handled the original
UTF-8 as raw bytes (8½ simply indicated the point in the string
closest to the glyph the user clicked on), so even this unsafe and
lossy practice was harmless in Plan 9. This is an excellent example
where interpretation was correctly deferred until display.
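Here is a minimal sketch of that sanitizer, in the same Python as
before. The \XNN escape is the proposal from this thread (not an
existing YAML escape), and the helper name sanitize_to_xnn is made up
for illustration; it reuses the surrogateescape handler to locate the
invalid bytes:

    def sanitize_to_xnn(raw: bytes) -> str:
        """Pass valid UTF-8 through; rewrite each invalid byte as \\XNN."""
        out = []
        # surrogateescape marks each invalid byte b as the lone
        # surrogate U+DC00+b, so exactly those need the \XNN rewrite.
        for ch in raw.decode("utf-8", errors="surrogateescape"):
            if 0xDC80 <= ord(ch) <= 0xDCFF:
                out.append("\\X%02X" % (ord(ch) - 0xDC00))
            else:
                out.append(ch)
        return "".join(out)

    assert sanitize_to_xnn(b"plain ascii") == "plain ascii"
    assert sanitize_to_xnn(b"caf\xe9") == "caf\\XE9"  # stray Latin-1 byte

A real sanitizer would also have to cooperate with YAML's quoting
(backslashes inside double-quoted scalars are themselves escaped);
this only shows the byte-level rewriting.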