From: Slava P. <sl...@fa...> - 2007-02-01 05:33:08

You're exactly right. The best thing would be for people to gradually
transition to UTF-16 and UTF-8 and slowly phase out legacy encodings.

Slava

On 31-Jan-07, at 11:26 PM, Marcelo Vanzin wrote:

> I might be repeating myself here, but the problem with using encoding
> as a buffer-local property embedded in the buffer is the "chicken and
> egg" problem: what encoding do you use to read the encoding string?
>
> XML parsing is not a very good example. If you look at the parser code
> in the JDK, it's really ugly. I had to fix it at my last job and I
> still have nightmares about it. :-) Basically, what it does is read
> the first few bytes, run a big "if then else" that checks whether that
> character is the "<" character in several different encodings, then
> try to parse using that encoding, and if that works, use the encoding
> that the XML declaration defines.
>
> This "works" for XML because the first character in an XML file
> (except for whitespace) always has to be a "<". But even then it's
> easy to get things wrong; try to parse an XML file encoded in UTF-16LE
> using the 1.4.2 JDK parser and watch it blow up (1.5 works fine, BTW).
>
> Trying to apply that to a file that doesn't have to respect any
> structure is, to say the least, very, very difficult. Even if most of
> the time you can get away with just treating everything as ASCII,
> there are always exceptions (the multi-byte Unicode encodings being
> examples of where treating things as ASCII would fail).
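
For readers who haven't seen it, the sniffing technique Marcelo describes can be sketched roughly like this. This is a hedged illustration in Java, not the JDK's actual parser code; the class and method names are invented. It checks for a byte-order mark, then for the byte pattern of "<" in a few encodings, and picks a provisional charset:

```java
import java.nio.charset.StandardCharsets;

// Minimal sketch of XML encoding autodetection: guess a charset from the
// first bytes of the stream. A real parser would then re-decode once it
// reads the encoding attribute in the XML declaration itself.
public class XmlEncodingSniffer {

    static String sniffEncoding(byte[] head) {
        // UTF-8 BOM: EF BB BF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE"; // BOM
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE"; // BOM
            if (b0 == 0x00 && b1 == 0x3C) return "UTF-16BE"; // 00 3C = '<'
            if (b0 == 0x3C && b1 == 0x00) return "UTF-16LE"; // 3C 00 = '<'
        }
        // '<' as a single byte: some ASCII-compatible encoding; assume UTF-8.
        if (head.length >= 1 && (head[0] & 0xFF) == 0x3C) {
            return "UTF-8";
        }
        return "UTF-8"; // fall back and hope the XML declaration corrects us
    }

    public static void main(String[] args) {
        byte[] utf8  = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);
        byte[] u16le = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(sniffEncoding(utf8));   // prints "UTF-8"
        System.out.println(sniffEncoding(u16le));  // prints "UTF-16LE"
    }
}
```

Note this only works because "<" is guaranteed to come first in XML; for an arbitrary buffer with no mandated structure there is no such anchor byte to key off, which is exactly the chicken-and-egg problem above.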