From: Oren Ben-K. <or...@be...> - 2009-09-03 14:02:54
On Wed, 2009-09-02 at 16:39 -0700, William Spitzak wrote:
> And look what happens when a user edits the yaml file in a Unicode
> editor and pastes in some text. All characters in the range 0x80..0xFF
> will turn into raw bytes! If these are the majority or only non-ASCII
> then in fact the resulting name is in effect ISO-8859-1!

This seems to be your core concern, and the answer is that a Unicode
editor is responsible for converting the pasted text from whatever
encoding it is in into the Unicode encoding used in the file. What if I
paste Hebrew text copied from a text file using ISO-8859-8 into a
Unicode editor? What if I paste Japanese text copied from a text file
using SJIS into a Unicode editor? How is this YAML's problem in any way,
shape or form?

> Please understand what users will see: text in ISO-8859-1 is READABLE,
> while UTF-8 is UNREADABLE!

I have no idea what you are talking about. UTF-8 is perfectly readable
on any Unicode-aware system, including my VIM editor, MS Notepad, any
browser, etc. If you are running DOS 3.1 and using "code pages", then
yes, you are SOL. YAML is by design not interested in systems that are
not Unicode-aware and that treat UTF-8 as "unreadable". This is the 21st
century!

> I don't care how much you say the encoding is
> "UTF-8", it is ISO-8859-1.

?!?!?!?!?!?!?!?!?!?!?!?! Read my lips:

> > You keep saying that but it makes no sense and I am completely baffled
> > by it. For the record, and for the last one, I do not suggest YAML uses
> > any encoding other than Unicode (UTF-*), under any circumstance, at any
> > place, in any library, file, API, anywhere, *ever*.
>
> This has NOTHING to do with the encoding of the yaml file itself. I
> fully support libyaml writing only valid UTF-8 and it probably is not a
> huge deal if only valid UTF-8 is accepted on input.
>
> What I need is a lossless way to store invalid UTF-8 in a scalar,
> without making valid UTF-8 unreadable in the resulting file. The file
> format itself can be valid UTF-8 or UTF-16 or UTF-32. I'm sorry my
> initial posts confused this with the (perhaps unrelated) ability to read
> invalid UTF-8 files.

We covered that. To do this you need a lossless way to convert between
invalid UTF-8, invalid UTF-16 and invalid UTF-32, so that the same YAML
file can be loaded by any YAML library regardless of the in-memory
string encoding it uses ("cross platform portability").

No standard method for dealing with invalid UTF bytes exists at this
point. No standard or common tool uses anything close to such a method.
YAML is not the place to create one. If/when such a method exists, we'll
consider adopting it. Until then, *it ain't gonna happen*, period.

Have fun,

Oren Ben-Kiki
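
For readers following the thread, the byte-level issue under discussion can be sketched in a few lines of Python. This is only an illustration: the sample string and the variable names are assumptions made for the example, and nothing here comes from libyaml or any other YAML library. It shows that the same character in the 0x80..0xFF range is stored differently in UTF-8 and ISO-8859-1, that the ISO-8859-1 bytes are rejected by a strict UTF-8 decoder, and that decoding with replacement characters is lossy.

```python
# Illustrative sketch only; the sample text and names are assumptions.

text = "caf\u00e9"  # U+00E9 lies in the range 0x80..0xFF

# A Unicode-aware editor saving the file as UTF-8 writes U+00E9 as two bytes.
utf8_bytes = text.encode("utf-8")          # b'caf\xc3\xa9'

# The same text stored as ISO-8859-1 uses the single raw byte 0xE9.
latin1_bytes = text.encode("iso-8859-1")   # b'caf\xe9'

# The raw ISO-8859-1 bytes are not valid UTF-8, so strict decoding fails.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding with replacement succeeds but is lossy: the invalid byte becomes
# U+FFFD, and re-encoding does not give back the original bytes.
replaced = latin1_bytes.decode("utf-8", errors="replace")
round_trip = replaced.encode("utf-8")

print(utf8_bytes, latin1_bytes, strict_ok, round_trip != latin1_bytes)
```

The last comparison is the crux of the request quoted above: once an invalid byte has been replaced, the original byte sequence cannot be recovered from the decoded string.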