From: Petr P. <Pri...@sk...> - 2007-02-01 08:11:15
Matthieu Casanova asked:
> In fact why not reading that to choose the encoding
> like it is done for the xml encoding detection?

Marcelo Vanzin objected:
> I might be repeating myself here, but the problem with
> using encoding as a buffer-local property embedded in the
> buffer is the "chicken and egg" problem. What encoding do
> you use to read the encoding string?

Slava Pestov added:
> You're exactly right. The best thing would be for people to
> gradually transition to UTF16 and UTF8 and slowly phase out
> legacy encodings.

I think I understand Matthieu. He asks as a thinking user, and I dare say I feel the same.

Regarding the Unicode encodings and the "chicken and egg" problem: detecting the Unicode encoding format is ALSO detecting the encoding. If I know exactly which form of Unicode encoding is used, I can ignore any other attempt to prescribe the encoding explicitly through a special sequence or the like. As a user, I would not feel the need for an explicit prescription of the encoding in such a case. Even if the encoding were explicitly declared inside the file -- think of an older file that was modernized by converting it to UTF-16, for example -- I would know that the declaration can be read correctly (here as UTF-16) and checked against the encoding actually in use (possibly warning the user about a mismatch).

But not all files use a Unicode encoding. I dare say the majority of text files still do not, and even UTF-8 may go undetected if the file does not start with the initial mark bytes. In such a case, I still need to decide what encoding to use. The simplest way is to use jEdit's default encoding -- which may be wrong for that particular file.

The great value of jEdit is not that it will work perfectly in the future, but that it works nicely now. The future is, well, the future. Until then, we can improve the present, or the "near future".

To autodetect the encoding, I should first try to detect the encoding from the file.
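The detection order described here (Unicode signature bytes first, then an ASCII scan for a buffer-local property such as ":encoding=utf-8:", and only then the default) could be sketched roughly as below. This is just an illustration of the idea, not jEdit code; the class and method names are my own invention, and the pattern matched is an assumption about how the property might look.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only -- not the jEdit API. */
public class EncodingSniffer {

    /**
     * Inspects the first bytes of a file and returns a detected
     * charset name, or null when nothing could be detected (the
     * caller would then fall back to the default encoding).
     */
    public static String detect(byte[] head) {
        // Stage 1: Unicode signature ("initial mark bytes") detection.
        if (head.length >= 2) {
            if ((head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) return "UTF-16BE";
            if ((head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) return "UTF-16LE";
        }
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // Stage 2: no signature -- read the bytes as ASCII and look
        // for an embedded buffer-local property like ":encoding=...:"
        // (works even for UTF-8 without the leading bytes, because
        // the property itself is plain ASCII).
        String ascii = new String(head, StandardCharsets.US_ASCII);
        Matcher m = Pattern.compile(":encoding=([^:\\s]+):").matcher(ascii);
        if (m.find()) return m.group(1);
        // Stage 3: nothing found -- signal that the default applies.
        return null;
    }

    public static void main(String[] args) {
        byte[] bom = {(byte) 0xFE, (byte) 0xFF, 0, 'h'};
        System.out.println(detect(bom));                       // UTF-16BE
        System.out.println(detect(":encoding=iso-8859-2:".getBytes()));  // iso-8859-2
        System.out.println(detect("plain text".getBytes()));   // null
    }
}
```

A real implementation would of course only scan a bounded prefix of the buffer and would validate the returned name against the charsets the JVM actually supports.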
If UTF-8 without the leading bytes is used, I could still read the ":encoding=utf-8:" property as ASCII characters. Only when I am unable to autodetect the encoding should I use the default encoding. To be perfect, after deciding on the encoding and switching the buffer to it, jEdit could even check that there is no conflict with what is explicitly stated inside the file.

It is clear that the described functionality is not extremely simple, and it probably should not be part of the core. However, the core could be modified and a core-related plugin for encoding autodetection could be added (as in the LatestVersion Check or QuickNotepad case). Only when the plugin is not present or not activated should the default encoding be used. The plugin should have a generic part plus a mechanism that allows extension (similar to syntax highlighting modes, for example).

What is your opinion?

pepr