From: William S. <sp...@rh...> - 2009-08-19 03:09:45
Attachments:
patch
I'm sure this has come up before, but anyway: in our software it looks like YAML would be a great way to store and retrieve data. However, we need to store arbitrary bytes, with the caveat that they are *LIKELY* to be UTF-8. Examples are Unix filenames and metadata stored in image files, such as comments. This requires that we be able to losslessly store invalid UTF-8.

The attached patch against yaml-0.1.2 implements these changes:

1. Invalid UTF-8 is stored by writing each erroneous byte using the new escape sequence "\XNN" in quoted strings. Thus the saved files are themselves valid UTF-8, and writing UTF-16 files still works.

2. The parser reads "\XNN" sequences in quoted strings. It can also read raw invalid UTF-8 bytes from the input file without changing them.

3. Mismatched surrogate pairs (i.e. invalid UTF-16) are accepted and read and written directly. This means that a UTF-16 file read/written by YAML can contain raw invalid UTF-16 sequences. The obvious encoding of invalid UTF-16 to UTF-8 is used to read/write UTF-8 files. Notice that this encoding is lossless and does not interfere with correct UTF-8 encoding of Unicode characters >= U+10000.

4. The parser also accepts mismatched surrogate pairs as "\uNNNN" escapes in quoted strings. It may be desirable to write invalid surrogate pairs this way, but I did not implement this.

5. I commented out the production of "\xNN" escapes, since in C and C++ that indicates a raw byte, not Unicode. It writes "\u00NN" escapes instead. This is not a requirement, however.

6. It also accepts arbitrary strings in tag names. It %-encodes all the bytes in any valid or invalid UTF-8 encoding. This matches how most servers handle invalid UTF-8 in URLs.

NOTES ON THE IMPLEMENTATION: Handling UTF-8 is MUCH easier than most people believe. It is far more useful to think of it as a byte stream and to use byte operations than to try to decode it. I tried to make this patch do that, while minimizing the size of the diff.
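The "\XNN" escaping described in point 1 can be sketched in C along these lines. This is a simplified illustration, not the patch itself: `utf8_seq_len` and `escape_invalid` are made-up names, and the validity check below skips some overlong/surrogate forms that a real emitter would also reject.

```c
#include <stdio.h>
#include <string.h>

/* Length of a valid UTF-8 sequence starting at p (n bytes remain),
 * or 0 if p points at an erroneous byte.  Simplified: only rejects
 * the 0xC0/0xC1 overlong lead bytes, not every overlong form. */
static size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    size_t len, i;
    if (p[0] < 0x80) return 1;                   /* ASCII */
    if (p[0] == 0xC0 || p[0] == 0xC1) return 0;  /* overlong lead */
    if ((p[0] & 0xE0) == 0xC0) len = 2;
    else if ((p[0] & 0xF0) == 0xE0) len = 3;
    else if ((p[0] & 0xF8) == 0xF0) len = 4;
    else return 0;           /* stray continuation or invalid lead */
    if (len > n) return 0;   /* truncated sequence */
    for (i = 1; i < len; i++)
        if ((p[i] & 0xC0) != 0x80) return 0;
    return len;
}

/* Copy s into out, passing valid sequences through verbatim and
 * writing each erroneous byte as \XNN; returns bytes written.
 * Note that an escaped byte always has the high bit set. */
static size_t escape_invalid(char *out, const unsigned char *s, size_t n)
{
    size_t i = 0, o = 0;
    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len) { memcpy(out + o, s + i, len); o += len; i += len; }
        else     { o += (size_t)sprintf(out + o, "\\X%02X", s[i++]); }
    }
    return o;
}
```

The key property is the one stressed in the notes above: valid sequences are copied as raw bytes without ever being decoded, and only the individual erroneous bytes are escaped.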
An "error" is ONE byte long, and that byte ALWAYS has the high bit set. A pointer to an error can NEVER match any ASCII character or any valid UTF-8 encoding. This means searching can be byte oriented and unconcerned about errors. If you are pointing at a UTF-8 character or error and you move one byte, you will be pointing either at an error or at the next character. This makes searching UTF-8 trivial with byte pattern matching; there is no need to decode UTF-8 or detect errors. For instance, you can find the next BREAK by running IS_BREAK on every byte.

Other than BREAK and BOM, the only characters handled specially by YAML are one byte long. This means the majority of the code can treat files as byte streams and not decode anything.

I initially replaced all the macros like READ() with READ1() and READN() replacements. However, I then renamed READ1() back to READ(), as this greatly reduced the diff size. IS_BREAK is somewhat annoying; I think it may be a good idea to redefine breaks as only being NL.

Bill Spitzak
Rhythm & Hues Software Department
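The byte-oriented scanning argument above can be made concrete with a short sketch. This is illustrative only: the real libyaml IS_BREAK also matches the multi-byte NEL/LS/PS breaks (exactly the annoyance mentioned in the message), so the stand-in here checks only LF and CR.

```c
#include <stddef.h>

/* Simplified stand-in for libyaml's IS_BREAK: LF and CR only.  Both
 * are ASCII, so this test can never fire inside a multi-byte UTF-8
 * sequence or on an error byte (all of those have the high bit set). */
#define IS_BREAK(c) ((c) == '\n' || (c) == '\r')

/* Offset of the next line break in s, or n if there is none.  Pure
 * byte scanning: no UTF-8 decoding, no error detection. */
static size_t find_break(const unsigned char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (IS_BREAK(s[i])) return i;
    return n;
}
```

A scan like this walks straight through valid multi-byte characters and erroneous bytes alike, which is why the parser side of the patch can accept raw invalid UTF-8 without any special handling.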
From: Oren Ben-K. <or...@be...> - 2009-08-21 14:35:49
On Thu, 2009-08-20 at 18:44 -0700, Adrian Klaver wrote:
> It is a data issue, pushing it up the stack only prolongs the agony.

Exactly. Not only prolongs it but also spreads it everywhere. That said, in cases like (say) UNIX file names containing arbitrary bytes, there's not much you can do "data entry"-wise.

BTW - Unicode does allow for _all_ 8-bit characters. You can argue about their semantics, but the fact is that a simple "\xNN" escape sequence inside double-quoted strings _will work_ for all 256 single-byte values. So, for the case of stuff like UNIX file names, I really don't see the problem.

As for tags - I don't see anyone defining tags that are not valid printable Unicode characters. It isn't as if we need to support all the imaginable UR*L*s out there. Tags are a very specific, controlled set of UR*I*s.

You can even use this to encode arbitrary binary data, as long as you accept that (1) you are using 4 YAML stream bytes ('\' 'x' N N) for each 8 bits of payload (except for, say, 1/4 of the bytes - call it 3 bytes on average for random binary data), (2) there will be an interim in-memory string representation using one "character's worth" of bytes for each 8-bit payload (2 bytes if using UTF-16 like Java does, 1-2 bytes if using UTF-8 - call it 1.5 bytes on average for random binary data), and (3) you'll need to execute string2binary on the interim in-memory string to convert it to a true binary byte array.

You'd be better off using !!binary and base64 for true binary data, of course - only about 1.33 YAML stream bytes (4/3) for each 8 bits of payload, and it will be loaded directly into an in-memory byte buffer with no interim representation. But still, using "\xNN" for binary data may be a valid trick to use in some circumstances.

Either way, I simply don't see the use case for supporting invalid Unicode characters, either as raw bytes or as escaped characters.

Have fun,

Oren Ben-Kiki
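The round-trip claim for "\xNN" is easy to sanity-check. The helpers below are illustrative only (`hex_escape` and `hex_unescape` are made-up names, not libyaml API), and escaping every byte exhibits the worst-case 4 stream bytes per payload byte described above.

```c
#include <stdio.h>
#include <string.h>

/* Escape every byte as \xNN: the worst case of 4 stream bytes
 * ('\' 'x' N N) for each 8 bits of payload. */
static size_t hex_escape(char *out, const unsigned char *s, size_t n)
{
    size_t i, o = 0;
    for (i = 0; i < n; i++)
        o += (size_t)sprintf(out + o, "\\x%02X", s[i]);
    return o;
}

/* Decode a string consisting solely of \xNN escapes back to raw
 * bytes (the "string2binary" step mentioned above). */
static size_t hex_unescape(unsigned char *out, const char *s, size_t n)
{
    size_t i, o = 0;
    for (i = 0; i + 3 < n; i += 4) {
        unsigned v;
        sscanf(s + i + 2, "%2X", &v);
        out[o++] = (unsigned char)v;
    }
    return o;
}
```

All 256 single-byte values survive the trip, which is the point being made: "\xNN" already covers arbitrary bytes in double-quoted scalars, at a cost that !!binary/base64 beats for bulk data.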