On Fri, Sep 4, 2009 at 4:48 PM, Brad R <email@example.com> wrote:
If the default/required behavior of the library is to return a language-specific string when it can (option 3), then, in my opinion, it probably should not, by default, try to encode invalid byte sequences into that string. In other words, the default behavior would be to convert the bytes into a language-specific string only if the byte sequence is valid for that string type. A library should then expose options to change this behavior, with a tag handler being the most flexible choice.
I've thought better of the way I said that... When I talk about only converting bytes into a language-specific string here, the byte sequence is being treated as UTF-8 regardless of the language's string type, as Oren described. So, unless the native string type is itself a UTF-8 byte sequence (with allowance for invalid bytes), the implication of what I said is that the default behavior would be to convert only valid UTF-8 byte sequences to the native string type and leave the rest as binary arrays.
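
To make the default described above concrete, here is a minimal sketch in Python, where the native string type is not a UTF-8 byte sequence. The names `load_scalar` and `on_invalid` are hypothetical and only for illustration; they are not taken from any existing library, and `on_invalid` merely stands in for whatever override (such as a tag handler) a library might expose.

```python
from typing import Callable, Optional

# Hypothetical override hook: receives the raw bytes and returns whatever
# the caller wants in place of the default result.
InvalidHandler = Callable[[bytes], object]


def load_scalar(raw: bytes, on_invalid: Optional[InvalidHandler] = None) -> object:
    """Return a native string for valid UTF-8, otherwise the raw bytes."""
    try:
        # bytes.decode is strict by default, so any invalid sequence raises.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if on_invalid is not None:
            # The caller opted in to custom handling of invalid input.
            return on_invalid(raw)
        # Default: leave the value as a binary array.
        return raw


if __name__ == "__main__":
    print(type(load_scalar(b"caf\xc3\xa9")))  # valid UTF-8
    print(type(load_scalar(b"caf\xe9")))      # invalid UTF-8
    # Opting in to lossy conversion instead of the binary default.
    print(load_scalar(b"caf\xe9", lambda b: b.decode("utf-8", errors="replace")))
```

The point of the sketch is only that the library itself never guesses: invalid input comes back unchanged as bytes unless the caller explicitly supplies a different policy.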