From: Ben W. <tau...@gm...> - 2009-08-24 18:29:49
On Fri, Aug 21, 2009 at 3:51 PM, William Spitzak <sp...@rh...> wrote:
> Ben Woolley wrote:
>>
>> The main issue here appears to me to be over whether the handling of
>> invalid UTF-8 is a display issue.
>>
>> I argue that it is not a display issue for this simple reason: the
>> existence of methods which operate on UTF-8 which have nothing to do
>> with display.
>>
>> Let's look at the effects of dealing with it as a display issue.
>> Allowing invalid UTF-8, or allowing a special encoding on top of
>> UTF-8, requires that all methods which operate on the UTF-8, and
>> therefore need to parse it in some way, be aware of this extra
>> possible syntax to avoid mangling the string's semantics.
>
> I'm sorry but you have it EXACTLY BACKWARDS!! Detecting errors
> requires understanding the "semantics".

I don't know what you mean here. Please clarify.

> Treating the UTF-8 as an array of bytes does not, but it means
> encoding errors are passed through!

That is why you validate it first, so that you know the bytes are in
the right encoding.

> UTF-8 is TRIVIAL to process if you stop being numbed by ASCII history
> and realize that "characters" are not very important. Even with
> errors it is self-synchronizing; a pattern will not match except at a
> character boundary.

Precisely why it is easier to just validate input as soon as possible.

> I don't know what to do, really. For some reason UTF-8 turns
> otherwise intelligent programmers into complete morons. All I can
> suggest is you try to write some code that works with UTF-8 instead
> of blindly calling decode().

I don't blindly call decode(). I validate the input first, working
with UTF-8, which, as you have mentioned many times, isn't that hard.

> And check what you are doing about mismatched UTF-16 surrogates.
> Nothing? And your program still works? But you claimed right above
> that ignoring these will require "special encoding atop UTF-16 and
> therefore the need to parse it in some way". I hope you can realize
> you are talking total nonsense when you realize that the solutions
> for UTF-8 and UTF-16 are IDENTICAL!!!

By special encoding, I was not referring to UTF-16. I was referring to
your special syntax for storing invalid bytes. I was not clear. I was
trying to say that all of the UTF-8 methods which normally just needed
to be aware of UTF-8 syntax would also need to be aware of the invalid
syntax so that invalid UTF-8 would be caught.

You seem to want all UTF-8 parsers to be validating parsers. But at
some point you need to declare a point of responsibility for
transposing the data, and that point is certainly not inside a
serialization library that does its own transposing. Let it do its
job, and just its job. The part of the process which produces UTF-8
should be responsible for producing valid UTF-8, even if that is
merely an input routine. When the input routine validates the input
data, the issues surface close to the context in which they were
produced, which makes the errors much easier to deal with.
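To make the input-routine idea concrete, here is a minimal sketch in
Python. read_user_text() is a hypothetical helper I am inventing for
illustration, not part of any library mentioned here:

    def read_user_text(raw: bytes) -> str:
        """Validate UTF-8 at the point of entry, while the context
        (which user, which field, which file) is still known."""
        try:
            # Strict decode: raises on any invalid sequence.
            return raw.decode('utf-8')
        except UnicodeDecodeError as e:
            # Report the error here, near its source, instead of
            # letting invalid bytes surface from a parser deep in
            # the stack.
            raise ValueError('invalid UTF-8 at byte %d: %r'
                             % (e.start, raw[e.start:e.end])) from e

Everything downstream of a routine like this can assume valid UTF-8
and needs no error path of its own.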
>> You would be effectively moving the burden of data entry issues
>> deeper down the stack. That would not help UTF-8 awareness at all.
>> It would do the opposite.
>
> WRONG!!!! Throwing errors is exactly what is damaging UTF-8
> awareness. The reaction by a programmer, IN ONE HUNDRED PERCENT OF
> THE CASES I HAVE SEEN, is to change the encoding to ISO-8859-1 or to
> strip all bytes with the high bit set. This includes a poster right
> here who said that the solution for Unix filenames is to restrict
> them to ISO-8859-1. This has happened and is happening, over and over
> and over again, right now, destroying internationalization! Anybody
> doing this is part of the problem, and it is very annoying that you
> would make the claim you are helping.

I proposed a different technique: making the boundaries between
encodings very clear. I am not one of the 100% of cases you have seen.
We both want boundaries between encodings. However, you want that
boundary to be at the consumer (callee). I want it to be at the
producer (caller). I want it to be where the encoding is first
encountered.

>> The solution can be simple: You sanitize all user-submitted syntax
>> before you send it to a library which operates on that syntax. This
>> can be as easy as offering a preview functionality, and stripping
>> out invalid UTF-8 or ASCII sequences.
>
> But I can't use libyaml to do this "sanitize" without these changes!
> IT THROWS UNRECOVERABLE ERRORS and I can't "strip out invalid
> UTF-8"!!!

Then don't use libyaml to validate your input. Don't expect it to. The
caller should give it valid UTF-8.

>> I had to deal with an even worse version of this problem when
>> providing an interface for users to edit PHP Smarty templates. The
>> Smarty would get compiled to PHP, and Smarty syntax errors would
>> sometimes result in uncatchable fatal PHP errors. We simply provided
>> a preview functionality, and the problem went away.
>
> Oddly enough you got these PHP errors despite having made sure all
> your UTF-8 was valid? What a big help that was! I think this is an
> excellent counterpoint to your arguments. Removing some small set of
> possible errors from the data stream is useless and simply a waste of
> time!

No, it had nothing to do with UTF-8; I was not clear on that point. I
was referring to finding a way to help people produce Smarty code that
compiled to valid PHP by validating the Smarty compilation at data
entry. My point was that the same thing could be done with UTF-8: we
would translate invalid UTF-8 to valid UTF-8 right at the beginning of
the process (sketched below), so that we had total control over the
invalidities in the context to which they are best suited. We never
gave invalid UTF-8 to a function which needed valid UTF-8.

>> However, you do have a secondary point I would like to address. Most
>> applications don't do much UTF-8 manipulation, and just use
>> decode(). This is why it appears to be a display issue,
>
> Again totally backwards from my understanding. The reason people
> DON'T believe it is a display issue is that they are used to decode()
> throwing the error. decode() is NOT a display function.
>
> If it was a "display issue" there would be a "draw this utf8 string"
> api, not "decode".

Where, in all of this, is the "draw this kinda-UTF-8 string" API? That
would be draw_utf8ish_input(). Because that function would handle the
data from end to end, I would expect it to validate the input at the
beginning end. But decode() is not that. It is merely a low-level
function in a library.
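Here is roughly what I mean by translating at the beginning of the
process. This is only a sketch in Python; sanitize_utf8() is a name I
am making up for illustration, not an existing API:

    def sanitize_utf8(raw: bytes) -> str:
        # At data entry, replace every invalid sequence with U+FFFD so
        # that everything downstream only ever sees valid UTF-8. A
        # preview step can show the user the substitutions before the
        # data is accepted.
        return raw.decode('utf-8', errors='replace')

    text = sanitize_utf8(b'caf\xe9')  # stray ISO-8859-1 byte
    # text == 'caf\ufffd'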
>> Why not just avoid that problem altogether and write a little
>> wrapper library which handles data entry issues the way you want,
>> instead of trying to fundamentally change the nature of the library?
>> That would be a handy tool which a lot of projects may use. It could
>> even have its own library for solving the data entry problem on
>> other tools, and your cause could move forward.
>
> This is EXACTLY what I am trying to do. I'd like you to explain how I
> can use libyaml to do this if it craps out with an error and does not
> give me the invalid UTF-8? Am I supposed to use the block read/write
> api rather than the FILE api and "sanitize" my UTF-8 there, while
> somehow not screwing up the quoting? And strip out my encryption
> after I get the libyaml scalars? Do you know how incredibly
> complicated and slow that will be? Do you know that the error
> messages for libyaml won't give me the correct line and character
> numbers any more?

I am suggesting validating UTF-8 outside of libyaml.
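Concretely, something along these lines keeps the encoding policy out
of the parser entirely. This sketch is Python, using PyYAML as a
stand-in front end (assuming it was built with libyaml, so that
yaml.CSafeLoader wraps it); load_yaml_utf8ish() is a hypothetical name:

    import yaml  # PyYAML; yaml.CSafeLoader is its libyaml binding

    def load_yaml_utf8ish(raw: bytes):
        # Decide what to do about invalid UTF-8 *before* the YAML
        # parser runs, so the parser never has to throw an encoding
        # error or carry a special syntax for invalid bytes.
        text = raw.decode('utf-8', errors='replace')
        return yaml.load(text, Loader=yaml.CSafeLoader)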