From: Marcelo V. <va...@us...> - 2007-02-01 04:26:23
Matthieu Casanova wrote:
> In fact why not reading that to choose the encoding like it is done for
> the xml encoding detection ?

I might be repeating myself here, but the problem with using encoding as
a buffer-local property embedded in the buffer is the "chicken and egg"
problem. What encoding do you use to read the encoding string?

XML parsing is not a very good example. If you look at the parser code
in the JDK, it's really ugly. I had to fix it at my last job and I still
have nightmares about it. :-)

Basically what it does is read the first few bytes, then go through a
big "if then else" that checks whether those bytes are the "<" character
in several different encodings. It then tries to parse using that
encoding and, if that works, switches to the encoding that the XML
declaration defines.

This "works" for XML because the first character in an XML file (except
for whitespace) always has to be a "<". But even then it's easy to get
things wrong; try to parse an XML file encoded in UTF-16LE using the
1.4.2 JDK parser and watch it blow up (1.5 works fine, BTW).

Trying to apply that to a file that doesn't have to respect any
structure is, to say the least, very, very difficult. Even if most of
the time you can get away with just treating everything as ASCII, there
are always exceptions (the multi-byte Unicode encodings being examples
of where treating things as ASCII would fail).

-- 
Marcelo Vanzin
va...@us...
"Life is too short to drink cheap beer"
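
For illustration only, the "<"-sniffing scheme described above can be sketched roughly as follows. This is not the JDK parser's code; the function name, the prefix table and the regular expression are all made up for this example, and it ignores byte order marks, leading whitespace and EBCDIC.

```python
import re
from typing import Optional

# Hypothetical table: how a leading "<" looks in a few encoding families.
# Order matters: the UTF-32 patterns must be tried before the UTF-16 ones.
_LT_PREFIXES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<", "utf-16-be"),
    (b"<\x00", "utf-16-le"),
    (b"<", "ascii"),  # any ASCII-compatible encoding
]

# Matches an XML declaration and captures its encoding pseudo-attribute.
_DECL = re.compile(
    r"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def sniff_xml_encoding(head: bytes, default: str = "utf-8") -> Optional[str]:
    """Guess the encoding of an XML document from its first bytes.

    Returns None when the data does not start with "<" in any of the
    encodings listed above, i.e. when there is nothing to anchor on.
    """
    provisional = None
    for prefix, name in _LT_PREFIXES:
        if head.startswith(prefix):
            provisional = name
            break
    if provisional is None:
        return None

    # Decode with the provisional guess just far enough to read the
    # declaration; undecodable bytes are replaced rather than raising.
    text = head.decode(provisional, errors="replace")
    match = _DECL.match(text)
    if match:
        return match.group(1)
    # No declaration: keep the guess, or the default for the ASCII family.
    return default if provisional == "ascii" else provisional
```

The whole thing hinges on knowing that the first character is "<". For a file with no required structure the function has nothing to match against and returns None, which is the chicken-and-egg problem in miniature.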