From: Petr P. <Pri...@sk...> - 2007-01-31 09:54:27

Dalibor Petričević wrote:
> Petr Prikryl wrote:
> > [...]
> > I can explain the encoding to jEdit by adding
> > explicitly another similar line; so, I can use
> > [...]
> > # :encoding=windows-1250:
>
> Pardon my question, but are you sure that
> :encoding=windows-1250: causes jEdit to set the proper
> encoding when opening a file?

One can never be sure ;-). Double checking is always better. It seems
that you have found the bug. I created the file a.txt like this

----------------------------------------
:encoding=windows-1250:
Příšerně žluťoučký kůň úpěl ďábelské ódy
(i.e. "The quick brown fox..." for the Czech language, using
characters specific to the encoding.)
----------------------------------------

in another editor. Then I switched jEdit to use utf-8 as the default
encoding (this is important for showing the bug) and exited jEdit (not
running in background, invisible mode).

If I start jEdit and pass the file through a command line argument, it
loads the file and displays the content in the prescribed encoding
correctly.

However, when I open the file using Ctrl+O, the prescribed encoding is
ignored and the content is displayed assuming jEdit's default encoding
(here utf-8). When I then do File - Reload from the menu, the file is
reloaded and displayed correctly.

It seems that the File - Open implementation forgets to interpret the
explicit encoding prescription before displaying the content.

> This is something that I have wished for for years,
> but experience shows that it is not actually
> working. I THINK the process goes this way:
> jEdit first opens the file and then parses it.
> THEN jEdit can determine the encoding specified
> as you are doing it, and then it should reopen the file
> with the proper encoding and (re)show the characters
> correctly. It does not do that, as
> far as I can see. You actually have to specify
> the file encoding manually BEFORE opening the file.
> This is how the stuff works now.

As far as I can say, Dalibor's observation is true. I did not notice
the bug until now, because I use the FAR manager and usually open a
file in jEdit by pointing to the file and using the Ctrl+F4 shortcut.
It means that I pass the file to jEdit through the command line, and
it works correctly even if jEdit is already running. Also, I normally
have the default encoding of jEdit set to windows-1250, so the bug is
masked.

Could someone else confirm the bug? Would it be difficult to correct
it?

I am using jEdit 4.3pre9 with Java 1.5.0_10, with the options
-background -nogui -reuseview.

Thanks,
pepr
From: <dal...@is...> - 2007-01-31 10:13:41

Petr Prikryl wrote:
> Also, I normally have the default encoding
> of jEdit set to windows-1250, so the bug
> is masked. Could someone else confirm the bug?
> Would it be difficult to correct the bug?
>
> I am using jEdit 4.3pre9 with Java 1.5.0_10
> with options -background -nogui -reuseview

Sadly, I'm not sure it's a bug. It's just weird by design :-)

OT: here's something my wife makes me repeat, just to twist my tongue,
when she's bored (she is Czech on her father's side):

Tři sta třicet tři stříbrných stříkaček stříkalo přes tři sta
třiatřicet stříbrných střech
(roughly: "Three hundred and thirty-three silver fire engines sprayed
over three hundred and thirty-three silver roofs") ;-)

Cheers,
--
Dalibor Petričević
From: Slava P. <sl...@fa...> - 2007-01-31 15:37:44

On 31-Jan-07, at 4:53 AM, Petr Prikryl wrote:
> ----------------------------------------
> :encoding=windows-1250:

I wonder where people get the idea that this works. 'encoding' is not
a buffer-local property and it cannot be set in this way, and the
documentation does not mention it.

Don't do this.

Slava
From: Matthieu C. <cho...@gm...> - 2007-01-31 15:43:44

2007/1/31, Slava Pestov <sl...@fa...>:
>
> On 31-Jan-07, at 4:53 AM, Petr Prikryl wrote:
>
> > ----------------------------------------
> > :encoding=windows-1250:
>
> I wonder where people get the idea that this works. 'encoding' is not
> a buffer-local property and it cannot be set in this way, and the
> documentation does not mention it.
>
> Don't do this.
>
> Slava

Hi, it is not used to load the buffer, but if you put in an
:encoding=windows-1250:
line, jEdit will read it and change the encoding of the buffer (this
can be seen in the status bar).
In fact, why not read that to choose the encoding, like it is done for
the XML encoding detection?

Matthieu
From: Marcelo V. <va...@us...> - 2007-02-01 04:26:23

Matthieu Casanova wrote:
> In fact, why not read that to choose the encoding, like it is done
> for the XML encoding detection?

I might be repeating myself here, but the problem with using encoding
as a buffer-local property embedded in the buffer is the "chicken and
egg" problem: what encoding do you use to read the encoding string?

XML parsing is not a very good example. If you look at the parser code
in the JDK, it's really ugly. I had to fix it at my last job and I
still have nightmares about it. :-) Basically, what it does is read
the first few bytes, do a big "if then else", and check whether that
character is the "<" character in several different encodings. Then it
tries to parse using that encoding, and if that works, it uses the
encoding that the XML declaration defines.

This "works" for XML because the first character in an XML file
(except for whitespace) always has to be a "<". But even then it's
easy to get things wrong; try to parse an XML file encoded in UTF-16LE
using the 1.4.2 JDK parser and watch it blow up (1.5 works fine, BTW).

Trying to apply that to a file that doesn't have to respect any
structure is, to say the least, very, very difficult. Even if most of
the time you can get away with just treating everything as ASCII,
there are always exceptions (the multi-byte Unicode encodings being
examples of where treating things as ASCII would fail).

--
Marcelo Vanzin
va...@us...
"Life is too short to drink cheap beer"
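[Editor's note: the sniffing described above can be pictured with a small
sketch. This is an illustration of the idea only, not the JDK parser's
actual code; the class and method names are made up.]

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of XML-style encoding sniffing: look at the first
// bytes of the file and guess in which encoding family they would spell
// the "<" character (0x3C). Illustration only, not the JDK's real code.
public class XmlSniffSketch {
    static Charset guessFromFirstBytes(byte[] b) {
        if (b.length >= 2) {
            // In UTF-16LE "<" is the byte pair 3C 00; in UTF-16BE it is 00 3C.
            if (b[0] == 0x3C && b[1] == 0x00) return StandardCharsets.UTF_16LE;
            if (b[0] == 0x00 && b[1] == 0x3C) return StandardCharsets.UTF_16BE;
        }
        if (b.length >= 1 && b[0] == 0x3C) {
            // Could be UTF-8, ASCII, or any one-byte encoding; parse the
            // XML declaration with this provisional guess to find out.
            return StandardCharsets.UTF_8;
        }
        return null; // no "<" first: not a well-formed XML document anyway
    }

    public static void main(String[] args) {
        byte[] utf16le = {0x3C, 0x00, 0x3F, 0x00}; // "<?" in UTF-16LE
        System.out.println(guessFromFirstBytes(utf16le)); // prints UTF-16LE
    }
}
```

This also shows why UTF-16LE is an easy case to get wrong: there "<"
arrives as the byte pair 3C 00, so a detector that only checks for a
single 0x3C byte misses it.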
From: Slava P. <sl...@fa...> - 2007-02-01 05:33:08

You're exactly right. The best thing would be for people to gradually
transition to UTF-16 and UTF-8 and slowly phase out legacy encodings.

Slava

On 31-Jan-07, at 11:26 PM, Marcelo Vanzin wrote:
> I might be repeating myself here, but the problem with using encoding
> as a buffer-local property embedded in the buffer is the "chicken and
> egg" problem: what encoding do you use to read the encoding string?
>
> XML parsing is not a very good example. If you look at the parser code
> in the JDK, it's really ugly. I had to fix it at my last job and I
> still have nightmares about it. :-) Basically, what it does is read
> the first few bytes, do a big "if then else", and check whether that
> character is the "<" character in several different encodings. Then it
> tries to parse using that encoding, and if that works, it uses the
> encoding that the XML declaration defines.
>
> This "works" for XML because the first character in an XML file
> (except for whitespace) always has to be a "<". But even then it's
> easy to get things wrong; try to parse an XML file encoded in UTF-16LE
> using the 1.4.2 JDK parser and watch it blow up (1.5 works fine, BTW).
>
> Trying to apply that to a file that doesn't have to respect any
> structure is, to say the least, very, very difficult. Even if most of
> the time you can get away with just treating everything as ASCII,
> there are always exceptions (the multi-byte Unicode encodings being
> examples of where treating things as ASCII would fail).
From: <dal...@is...> - 2007-02-01 13:44:32

Slava Pestov wrote:
> You're exactly right. The best thing would be for people to gradually
> transition to UTF-16 and UTF-8 and slowly phase out legacy encodings.
>
> Slava
>
> On 31-Jan-07, at 11:26 PM, Marcelo Vanzin wrote:
>
>> I might be repeating myself here, but the problem with using encoding
>> as a buffer-local property embedded in the buffer is the "chicken and
>> egg" problem.

It does not really come down to chickens and eggs ...

:encoding=windows-1250:

This line does not have any "strange" characters, and it never should
have any. (A problem would emerge with encodings like Chinese or
something like that, and then feathers would fly ...)

Now, a file containing that line can be encoded in some one-byte or
multi-byte encoding. First, try to recognize the sequence ":encoding"
in any one-byte encoding; if you succeed, you have won: you have the
encoding. No chickens, no eggs, no flu. If you fail, try reading the
file as if it were multi-byte encoded (UTF-something) and proceed the
same way as described above. If the sequence is not found after all
the searches, it probably isn't in the file.

Yes, in the worst-case scenario you will parse the file several times,
and if the file is big, that might be a performance problem
(:encoding can be placed at the end of the file, so you would read the
whole file).

Now, if there were a config option "I want to use this", then the user
could enable the feature if he wanted to. No flame, no war.

I might be wrong about this, but I would love to have this feature.
Normally I would try doing it myself, but my Java knowledge and
experience is quite humble.

Tnx,
--
Dalibor Petricevic
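[Editor's note: Dalibor's multi-pass proposal could be sketched as
follows. The class name, the candidate list, and the exact property
syntax handled are assumptions for illustration, not jEdit code.]

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the multi-pass detection proposed above: decode the raw bytes
// with one trial charset after another and scan the result for a
// ":encoding=NAME:" declaration. Names here are illustrative only.
public class EncodingScanSketch {
    private static final Pattern PROP =
            Pattern.compile(":encoding=([A-Za-z0-9_\\-]+):");

    static String findDeclaredEncoding(byte[] fileBytes) {
        // One one-byte charset (ISO-8859-1 maps every byte to a character)
        // plus the common multi-byte Unicode encodings.
        String[] trial = {"ISO-8859-1", "UTF-8", "UTF-16LE", "UTF-16BE"};
        for (String name : trial) {
            String text = new String(fileBytes, Charset.forName(name));
            Matcher m = PROP.matcher(text);
            if (m.find()) return m.group(1); // first match wins
        }
        return null; // no declaration found under any trial decoding
    }

    public static void main(String[] args) {
        byte[] bytes = ":encoding=windows-1250:\nsome text\n"
                .getBytes(Charset.forName("windows-1250"));
        System.out.println(findDeclaredEncoding(bytes)); // prints windows-1250
    }
}
```

Because ISO-8859-1 maps every byte, the first one-byte pass already finds
an ASCII-spelled declaration in any ASCII-compatible encoding; the extra
passes only matter for encodings like UTF-16 that intersperse zero bytes.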
From: Matthieu C. <cho...@gm...> - 2007-02-01 08:05:24

2007/2/1, Marcelo Vanzin <va...@us...>:
>
> Matthieu Casanova wrote:
> > In fact, why not read that to choose the encoding, like it is done
> > for the XML encoding detection?
>
> I might be repeating myself here, but the problem with using encoding
> as a buffer-local property embedded in the buffer is the "chicken and
> egg" problem: what encoding do you use to read the encoding string?
>
> XML parsing is not a very good example. If you look at the parser code
> in the JDK, it's really ugly. I had to fix it at my last job and I
> still have nightmares about it. :-) Basically, what it does is read
> the first few bytes, do a big "if then else", and check whether that
> character is the "<" character in several different encodings. Then it
> tries to parse using that encoding, and if that works, it uses the
> encoding that the XML declaration defines.
>
> This "works" for XML because the first character in an XML file
> (except for whitespace) always has to be a "<". But even then it's
> easy to get things wrong; try to parse an XML file encoded in UTF-16LE
> using the 1.4.2 JDK parser and watch it blow up (1.5 works fine, BTW).
>
> Trying to apply that to a file that doesn't have to respect any
> structure is, to say the least, very, very difficult. Even if most of
> the time you can get away with just treating everything as ASCII,
> there are always exceptions (the multi-byte Unicode encodings being
> examples of where treating things as ASCII would fail).

Yes, that's right, but look at my example. My jEdit uses UTF-8 by
default, but sometimes I open a file encoded in ISO-8859-1, and some
accented characters are displayed as boxes (meaning the encoding was
not the right one). But the :encoding=ISO-8859-1: line was read
correctly, so it would have been possible to read it and switch to
that encoding. For your example, you're right, maybe it would not work
well every time, but I think it could help.
(And if it doesn't work with 1.4.2, we don't care, since jEdit now
requires Java 5 :)

And there is an important problem with encoding in jEdit: suppose
jEdit uses UTF-8 by default, and I open a file that contains this:

:encoding=someencoding:

The file will be loaded using UTF-8, because it is the default
encoding, but the status bar will show the encoding found in the file,
and that encoding will also be used to save the file. Nowhere can the
user see which encoding was actually used to load the file.

In fact, I think that in almost every case this encoding line would be
read correctly. I tried to read a UTF-16 file and a UTF-8 file using
the default encoding Cp1252; the UTF-16 file was detected by the magic
Unicode characters, and the UTF-8 file was not detected (some
characters were wrong), but the encoding=UTF-8 line was read fine.

So are there still examples where it fails?
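[Editor's note: the "magic Unicode characters" mentioned above are the
byte order mark (BOM). A minimal sketch of BOM-based detection follows;
it is illustrative only, not jEdit's actual detector.]

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch of BOM-based encoding detection: the byte order mark at the
// start of a file identifies UTF-16 and, when present, UTF-8.
public class BomSketch {
    static Charset fromBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return StandardCharsets.UTF_8;      // UTF-8 BOM: EF BB BF
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return StandardCharsets.UTF_16BE;   // UTF-16 BOM: FE FF
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return StandardCharsets.UTF_16LE;   // UTF-16 BOM: FF FE
        return null; // no BOM: a UTF-8 file without one is NOT caught here
    }

    public static void main(String[] args) {
        byte[] utf16 = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // BOM + "A"
        System.out.println(fromBom(utf16)); // prints UTF-16LE
    }
}
```

This matches Matthieu's observation: a UTF-16 file usually starts with a
BOM and is detected, while a UTF-8 file often carries no BOM, so byte
inspection alone cannot identify it and the declared encoding=UTF-8 line
is the only remaining hint.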
From: Slava P. <sl...@fa...> - 2007-01-31 16:17:06

On 31-Jan-07, at 10:43 AM, Matthieu Casanova wrote:
> Hi, it is not used to load the buffer, but if you put in an
> :encoding=windows-1250:
> line, jEdit will read it and change the encoding of the buffer (this
> can be seen in the status bar).
> In fact, why not read that to choose the encoding, like it is done
> for the XML encoding detection?

Because buffer-local properties are only processed after the file is
loaded.

Slava
From: Matthieu C. <cho...@gm...> - 2007-01-31 16:20:15

2007/1/31, Slava Pestov <sl...@fa...>:
>
> On 31-Jan-07, at 10:43 AM, Matthieu Casanova wrote:
>
> > Hi, it is not used to load the buffer, but if you put in an
> > :encoding=windows-1250:
> > line, jEdit will read it and change the encoding of the buffer (this
> > can be seen in the status bar).
> > In fact, why not read that to choose the encoding, like it is done
> > for the XML encoding detection?
>
> Because buffer-local properties are only processed after the file is
> loaded.
>
> Slava

That's right, but I think the encoding property could be processed
during buffer loading, couldn't it? Of course, unlike the other
properties, it would only work if the property is set in the first ten
lines, and not in the last ten lines.

Matthieu
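
[Editor's note: Matthieu's restriction could look roughly like this. The
names are hypothetical and this is only a sketch of the idea: jEdit's
documented rule for buffer-local properties is the first or last ten
lines, but here only the first ten are scanned, decoded leniently as
ISO-8859-1 so the bytes can be inspected before the real encoding is
known.]

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch (illustrative names, not jEdit source): honor :encoding=...:
// only when it appears in the first ten lines of the file.
public class FirstLinesSketch {
    private static final Pattern PROP =
            Pattern.compile(":encoding=([A-Za-z0-9_\\-]+):");

    static String encodingFromFirstTenLines(byte[] fileBytes) {
        // ISO-8859-1 maps every byte, so decoding cannot fail even
        // though the real encoding is still unknown at this point.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.ISO_8859_1))) {
            String line;
            for (int i = 0; i < 10 && (line = r.readLine()) != null; i++) {
                Matcher m = PROP.matcher(line);
                if (m.find()) return m.group(1);
            }
        } catch (IOException e) {
            // cannot happen for an in-memory stream
        }
        return null; // not declared early enough: fall back to the default
    }

    public static void main(String[] args) {
        byte[] bytes = "# :encoding=ISO-8859-1:\ntext\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(encodingFromFirstTenLines(bytes)); // prints ISO-8859-1
    }
}
```

The declared encoding found this way could then be used to re-decode the
buffer before it is first displayed, avoiding the reload Petr had to do
by hand.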