From: William S. <sp...@rh...> - 2009-08-19 03:09:45
Attachments:
patch
I'm sure this has come up before, but anyway: in our software it looks like YAML would be a great way to store and retrieve data. However, we need to store arbitrary bytes, with the caveat that they are *LIKELY* to be UTF-8. Examples are Unix filenames and metadata stored in image files, such as comments. This requires that we be able to losslessly store invalid UTF-8.

The attached patch against yaml-0.1.2 implements these changes:

1. Invalid UTF-8 is stored by writing each erroneous byte using the new escape sequence "\XNN" in quoted strings. Thus the saved files are themselves valid UTF-8, and writing UTF-16 files still works.

2. The parser reads "\XNN" sequences in quoted strings. It can also read raw invalid UTF-8 bytes from the input file without changing them.

3. Mismatched surrogate pairs (i.e. invalid UTF-16) are accepted and read and written directly. This means that a UTF-16 file read/written by YAML can contain raw invalid UTF-16 sequences. The obvious encoding of invalid UTF-16 to UTF-8 is used to read/write UTF-8 files. Notice that this encoding is lossless and does not interfere with correct UTF-8 encoding of Unicode characters >= U+10000.

4. The parser also accepts mismatched surrogate pairs as "\uNNNN" escapes in quoted strings. It may be desirable to write invalid surrogate pairs this way, but I did not implement this.

5. I commented out the production of "\xNN" escapes, since in C and C++ that indicates a raw byte, not Unicode. It writes "\u00NN" escapes instead. This is not a requirement, however.

6. It also accepts arbitrary strings in tag names. It %-encodes all the bytes in any valid or invalid UTF-8 encoding. This matches how most servers handle invalid UTF-8 in URLs.

NOTES ON THE IMPLEMENTATION: Handling UTF-8 is MUCH easier than most people believe. It is far more useful to think of it as a byte stream and to use byte operations than to try to decode it. I tried to make this patch do that, while minimizing the size of the diff.
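The "\XNN" escaping described in point 1 can be sketched in C along these lines. This is a simplified illustration, not the patch itself: `utf8_seq_len` and `escape_invalid` are made-up names, and the validity check below skips some overlong/surrogate forms that a real emitter would also reject.

```c
#include <stdio.h>
#include <string.h>

/* Length of a valid UTF-8 sequence starting at p (n bytes remain),
 * or 0 if p points at an erroneous byte.  Simplified: only rejects
 * the 0xC0/0xC1 overlong lead bytes, not every overlong form. */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    size_t len, i;
    if (p[0] < 0x80) return 1;                   /* ASCII */
    if (p[0] == 0xC0 || p[0] == 0xC1) return 0;  /* overlong lead */
    if ((p[0] & 0xE0) == 0xC0) len = 2;
    else if ((p[0] & 0xF0) == 0xE0) len = 3;
    else if ((p[0] & 0xF8) == 0xF0) len = 4;
    else return 0;           /* stray continuation or invalid lead */
    if (len > n) return 0;   /* truncated sequence */
    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80) return 0;
    return len;
}

/* Copy s into out, passing valid sequences through verbatim and
 * writing each erroneous byte as \XNN; returns bytes written.
 * Note that an escaped byte always has the high bit set. */
static size_t escape_invalid(char *out, const unsigned char *s, size_t n)
{
    size_t i = 0, o = 0;
    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len) { memcpy(out + o, s + i, len); o += len; i += len; }
        else     { o += (size_t)sprintf(out + o, "\\X%02X", s[i++]); }
    }
    return o;
}
```

The key property is the one stressed in the notes above: valid sequences are copied as raw bytes without ever being decoded, and only the individual erroneous bytes are escaped.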
An "error" is ONE byte long, and that byte ALWAYS has the high bit set. A pointer to an error can NEVER match any ASCII character or any valid UTF-8 encoding. This means searching can be byte oriented and unconcerned about errors. If you are pointing at a UTF-8 character or error and you move one byte, you will be pointing either at an error or at the next character. This makes searching UTF-8 trivial with byte pattern matching; there is no need to decode UTF-8 or detect errors. For instance, you can find the next BREAK by running IS_BREAK on every byte.

Other than BREAK and BOM, the only characters handled specially by YAML are one byte long. This means the majority of the code can treat files as byte streams and not decode anything.

I initially replaced all the macros like READ() with READ1() and READN() replacements. However, I then renamed READ1() back to READ(), as this greatly reduced the diff size. IS_BREAK is somewhat annoying; I think it may be a good idea to redefine breaks as only being NL.

Bill Spitzak
Rhythm & Hues Software Department
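The byte-oriented scanning argument above can be made concrete with a short sketch. This is illustrative only: the real libyaml IS_BREAK also matches the multi-byte NEL/LS/PS breaks (exactly the annoyance mentioned in the message), so the stand-in here checks only LF and CR.

```c
#include <stddef.h>

/* Simplified stand-in for libyaml's IS_BREAK: LF and CR only.  Both
 * are ASCII, so this test can never fire inside a multi-byte UTF-8
 * sequence or on an error byte (all of those have the high bit set). */
#define IS_BREAK(c) ((c) == '\n' || (c) == '\r')

/* Offset of the next line break in s, or n if there is none.  Pure
 * byte scanning: no UTF-8 decoding, no error detection. */
static size_t find_break(const unsigned char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (IS_BREAK(s[i])) return i;
    return n;
}
```

A scan like this walks straight through valid multi-byte characters and erroneous bytes alike, which is why the parser side of the patch can accept raw invalid UTF-8 without any special handling.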
From: Oren Ben-K. <or...@be...> - 2009-08-21 14:35:49
On Thu, 2009-08-20 at 18:44 -0700, Adrian Klaver wrote:
> It is a data issue, pushing it up the stack only prolongs the agony.

Exactly. Not only prolongs it but also spreads it everywhere. That said, in cases like (say) UNIX file names containing arbitrary bytes, there's not much you can do "data entry"-wise.

BTW - Unicode does allow for _all_ 8-bit characters. You can argue about their semantics, but the fact is that a simple "\xNN" escape sequence inside double-quoted strings _will work_ for all 256 single-byte values. So, for the case of stuff like UNIX file names, I really don't see the problem.

As for tags - I don't see anyone defining tags that are not valid printable Unicode characters. It isn't as if we need to support all the imaginable UR*L*s out there. Tags are a very specific, controlled set of UR*I*s.

You can even use this to encode arbitrary binary data, as long as you accept that (1) you are using 4 YAML stream bytes ('\' 'x' N N) for each 8 bits of payload (except for, say, 1/4 of the bytes - call it 3 bytes on average for random binary data), (2) there will be an interim in-memory string representation using one "character's worth" of bytes for each 8-bit payload (2 bytes if using UTF-16 like Java does, 1-2 bytes if using UTF-8 - call it 1.5 bytes on average for random binary data), and (3) you'll need to execute string2binary on the interim in-memory string to convert it to a true binary byte array.

You'd be better off using !!binary and base64 for true binary data, of course - only about 1.33 YAML stream bytes (4/3) for each 8 bits of payload, and it will be loaded directly into an in-memory byte buffer with no interim representation. But still, using "\xNN" for binary data may be a valid trick to use in some circumstances.

Either way, I simply don't see the use case for supporting invalid Unicode characters, either as raw bytes or as escaped characters.

Have fun,

Oren Ben-Kiki
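The round-trip claim for "\xNN" is easy to sanity-check. The helpers below are illustrative only (`hex_escape` and `hex_unescape` are made-up names, not libyaml API), and escaping every byte exhibits the worst-case 4 stream bytes per payload byte described above.

```c
#include <stdio.h>
#include <string.h>

/* Escape every byte as \xNN: the worst case of 4 stream bytes
 * ('\' 'x' N N) for each 8 bits of payload. */
static size_t hex_escape(char *out, const unsigned char *s, size_t n)
{
    size_t i, o = 0;
    for (i = 0; i < n; i++)
        o += (size_t)sprintf(out + o, "\\x%02X", s[i]);
    return o;
}

/* Decode a string consisting solely of \xNN escapes back to raw
 * bytes (the "string2binary" step mentioned above). */
static size_t hex_unescape(unsigned char *out, const char *s, size_t n)
{
    size_t i, o = 0;
    for (i = 0; i + 3 < n; i += 4) {
        unsigned v;
        sscanf(s + i + 2, "%2X", &v);
        out[o++] = (unsigned char)v;
    }
    return o;
}
```

All 256 single-byte values survive the trip, which is the point being made: "\xNN" already covers arbitrary bytes in double-quoted scalars, at a cost that !!binary/base64 beats for bulk data.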