From: Matthieu C. <cho...@gm...> - 2007-02-01 08:05:24
2007/2/1, Marcelo Vanzin <va...@us...>:
> Matthieu Casanova wrote:
> > In fact, why not read that to choose the encoding, like it is done
> > for the XML encoding detection?
>
> I might be repeating myself here, but the problem with using encoding as
> a buffer-local property embedded in the buffer is the "chicken and egg"
> problem: what encoding do you use to read the encoding string?
>
> XML parsing is not a very good example. If you look at the parser code
> in the JDK, it's really ugly. I've had to fix it at my last job and I
> still have nightmares about it. :-) Basically what it does is read the
> first few bytes, do a big "if then else", and check whether that
> character is the "<" character in several different encodings. Then it
> tries to parse using that encoding, and if that works, it uses the
> encoding that the XML declaration defines.
>
> This "works" for XML because the first character in an XML file (except
> for whitespace) always has to be a "<". But even then it's easy to get
> things wrong; try to parse an XML file encoded in UTF-16LE using the
> 1.4.2 JDK parser and watch it blow up (1.5 works fine, BTW).
>
> Trying to apply that to a file that doesn't have to respect any
> structure is, to say the least, very, very difficult. Even if most of
> the time you can get away with just treating everything as ASCII, there
> are always exceptions (the multi-byte Unicode encodings being examples
> of where treating things as ASCII would fail).

Yes, that's right, but look at my example. My jEdit uses UTF-8 by default,
but sometimes I open a file encoded in ISO-8859-1: some accented
characters are displayed as boxes (meaning the encoding was not the right
one), yet the :encoding=ISO-8859-1: property was still read correctly, so
it would have been possible to read it and switch to that encoding. As for
your example, you're right that it might not work well every time, but I
think it could help.
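For reference, the "check the first bytes" trick Marcelo describes could be sketched roughly like this (a hypothetical simplification, not the actual JDK parser code — class and method names are made up):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of XML-style encoding sniffing: look at how the
// leading bytes (BOM, or the '<' of the XML declaration) are encoded to
// pick an initial charset; the real parser then re-reads the XML
// declaration with that charset to get the final answer.
public class EncodingSniffer {

    /** Guess a charset from the first bytes of an XML stream. */
    public static String sniff(byte[] head) {
        if (head.length >= 2) {
            // UTF-16 byte order marks
            if ((head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) return "UTF-16BE";
            if ((head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) return "UTF-16LE";
            // '<' encoded in UTF-16 without a BOM: 00 3C or 3C 00
            if (head[0] == 0x00 && head[1] == 0x3C) return "UTF-16BE";
            if (head[0] == 0x3C && head[1] == 0x00) return "UTF-16LE";
        }
        // '<' as a single byte covers UTF-8, ASCII, ISO-8859-1, ...
        return "UTF-8";
    }

    public static void main(String[] args) {
        byte[] utf8 = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);
        byte[] utf16le = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(sniff(utf8));    // prints UTF-8
        System.out.println(sniff(utf16le)); // prints UTF-16LE
    }
}
```

This only works because XML guarantees that first "<", which is exactly why it doesn't generalize to arbitrary buffers.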
(And if it doesn't work with 1.4.2 we don't care, since jEdit now requires
Java 5. :)

There is also an important encoding problem in jEdit. Suppose jEdit uses
UTF-8 by default and I open a file that contains this: :encoding=someencoding:
The file will be loaded using UTF-8, because that is the default encoding,
but the status bar will show the encoding found in the file, and that
encoding will also be used to save the file. Nowhere can the user see
which encoding was actually used to load the file.

In fact, I think this encoding property would be read correctly in almost
every case. I tried to read a UTF-16 file and a UTF-8 file using the
default encoding Cp1252: the UTF-16 file was detected by the magic Unicode
characters, and the UTF-8 file was not detected, so some characters were
wrong, but the :encoding=UTF-8: property was still read fine. So are there
still examples where it fails?