[ jEdit-devel ] [ jedit-Bugs-3040720 ] Incorrect handling of Unicode outside BMP ("surrogates")


Bugs item #3040720, was opened at 2010-08-06 09:48
Message generated for change (Comment added) made by jarekczek
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=100588&aid=3040720&group_id=588

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: editor core
Group: severe bug
Status: Open
Resolution: None
Priority: 6
Private: No
Submitted By: Makarius (makarius)
Assigned to: Kazutoshi Satoda (k_satoda)
Summary: Incorrect handling of Unicode outside BMP ("surrogates")

Initial Comment:
Handling of Unicode characters outside the "Basic Multilingual Plane" (U+0000 to U+FFFF) does not really work in TextArea and a few other
custom-made text boxes of jEdit, e.g. the "Console" plugin.

Instead of navigating text according to "code points", as explained in http://download.oracle.com/javase/6/docs/api/index.html?java/lang/Character.html, jEdit usually refers to plain-old UTF-16 chars for caret movement, deletion etc.

The following example in Console/Beanshell illustrates this:

  buffer.insert(0, "\uD835\uDC9C\n")

This produces a calligraphic letter A, e.g. use the STIX fonts http://www.stixfonts.org to see it on the screen.  Positioning the caret at the end of the first line, a single BACKSPACE will only delete the second "surrogate" character in the internal representation, leaving the first one standing alone.  Thus the content of the buffer becomes malformed in the sense of this funny Java text representation.

This has been reproduced in various official jEdit versions, such as 4.3.2.  By contrast, it seems that even the most basic text fields of Swing can handle these newer Unicode code points.
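
The effect can be reproduced outside jEdit with plain java.lang.String calls. The following is only a sketch: the class SurrogateDemo and its helper methods are made up for illustration and are not part of jEdit. The naive backspace removes one UTF-16 char before the caret, which is the behaviour described in this report.

```java
final class SurrogateDemo {
    // U+1D49C MATHEMATICAL SCRIPT CAPITAL A, stored as a surrogate pair, plus a newline.
    static final String TEXT = "\uD835\uDC9C\n";

    // Hypothetical naive backspace: removes exactly one char before the caret.
    static String naiveBackspace(String text, int caret) {
        return text.substring(0, caret - 1) + text.substring(caret);
    }

    // Returns true if the text contains a surrogate char without its partner.
    static boolean hasUnpairedSurrogate(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(TEXT.length());                        // 3 chars
        System.out.println(TEXT.codePointCount(0, TEXT.length())); // 2 code points
        // Caret at the end of the first line is offset 2.
        System.out.println(hasUnpairedSurrogate(naiveBackspace(TEXT, 2))); // true
    }
}
```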

----------------------------------------------------------------------

>Comment By: Jarek Czekalski (jarekczek)
Date: 2012-02-29 22:56

Message:
I see there is something else also in the patch. Wouldn't it be better to
separate it?

With this patch you introduce line-length complexity to single-character
operations. That is a strong degradation. Not that I am against this, because
I can't imagine at the moment a situation where this would be detectable.
Maybe a file with a 100K line length and holding down the right arrow key.
But one will have to keep in mind that methods with innocent names, like
getNextCharacterOffset, actually fetch the whole line. Maybe this could be
added to the javadoc of these methods?

I mean in the future someone may invoke getNextCharacterOffset many times
in a row, which would be inefficient.

If the methods first checked the simplest case before operating on the
whole line, namely whether the character is just a single char (not part of
a surrogate pair), we wouldn't have this issue at all. And programmers
(jEdit is a programmer's editor) usually deal with files that consist only
of such characters. The malicious examples of lines of 100K length would
also consist only of simple characters, because they are usually machine
generated, like log files or other outputs. Human-readable text cannot have
such a long paragraph.

So summarizing: nothing has to be changed at the moment. These are only
suggestions:
1. Comment on the possible inefficiency of character operations
2. Separate the patch
3. Consider checking whether we're at a single-char character before
starting the machinery
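
Item 3 above could look roughly like the following sketch. It is not the code from the attached patch: the class CharOffsets and both method names are hypothetical, and it only covers surrogate pairs. A solution based on BreakIterator would still need a slow path for anything the quick check cannot decide.

```java
final class CharOffsets {
    // Hypothetical helper: offset of the next code point boundary after 'offset'.
    static int nextOffset(CharSequence text, int offset) {
        if (offset >= text.length()) {
            return text.length();
        }
        // Fast path: a char that is not a high surrogate is a complete unit,
        // so only this one char needs to be inspected.
        if (!Character.isHighSurrogate(text.charAt(offset))) {
            return offset + 1;
        }
        // High surrogate: step over its low partner if one follows.
        boolean paired = offset + 1 < text.length()
                && Character.isLowSurrogate(text.charAt(offset + 1));
        return paired ? offset + 2 : offset + 1;
    }

    // Hypothetical helper: offset of the code point boundary before 'offset'.
    static int previousOffset(CharSequence text, int offset) {
        if (offset <= 0) {
            return 0;
        }
        // Fast path: a char that is not a low surrogate cannot end a pair.
        if (!Character.isLowSurrogate(text.charAt(offset - 1))) {
            return offset - 1;
        }
        boolean paired = offset >= 2
                && Character.isHighSurrogate(text.charAt(offset - 2));
        return paired ? offset - 2 : offset - 1;
    }

    public static void main(String[] args) {
        String line = "a\uD835\uDC9Cb";
        System.out.println(nextOffset(line, 1));     // 3, skips the whole pair
        System.out.println(previousOffset(line, 3)); // 1
    }
}
```

Both methods look at no more than two chars, so for text without surrogates the cost per call does not depend on the line length.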

----------------------------------------------------------------------

Comment By: Kazutoshi Satoda (k_satoda)
Date: 2012-02-29 16:44

Message:
A rough patch (with some mixture) is attached. Now on testing.

Note that this patch fixes basic edit operations, leaving the problems with
editing via rectangular selection.

----------------------------------------------------------------------

Comment By: Jarek Czekalski (jarekczek)
Date: 2012-01-15 00:33

Message:
Some time passed and now I have a different, more general approach to this
bug entry. I suggest lowering its priority and changing severity to normal.
Here are my reasons:
1. It affects a minority of users; I would estimate 3%. There are no
programmers among them, and jEdit is a programmer's editor, as the title
page says.
2. The data destruction is not coming in surprising moments, but only
during deletion operations, where one expects removing parts of text and
will easily see that something went wrong.
3. There is a workaround: manually deleting orphaned characters.
Bug tracker directions are on our wiki, linked also from main tracker
page.
Kazutoshi, please make final decision.

----------------------------------------------------------------------

Comment By: Max Funk (mf3)
Date: 2012-01-07 07:16

Message:
I wanted to complete / correct my issue list from the previous comment

- left, right arrows go between half characters (not: "delete")
- Selections with mouse go between half characters
- Rectangular selections with mouse go between half characters

----------------------------------------------------------------------

Comment By: Max Funk (mf3)
Date: 2012-01-06 18:07

Message:
I did extensive testing of the high characters
with the new charactermap plugin from svn trunk.

I found the following issues:
- backspace deletes half characters
- left, right arrows delete half characters
- Characters of auxiliary fonts are not displayed.

On the other hand:
- Display of the main font is fine
- Converting between different unicode encodings is fine
- Save and Reload is fine
- Printing is fine.

----------------------------------------------------------------------

Comment By: Jarek Czekalski (jarekczek)
Date: 2011-12-01 00:01

Message:
I attach a file revealing some unicode details. I post it here to store it
on sf server. Please discuss it only on jedit-devel mailing list.

----------------------------------------------------------------------

Comment By: Makarius (makarius)
Date: 2011-11-29 14:27

Message:
One more example from the API doc, see also
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html

Print the element at a specified position:

     public static void printAt(BreakIterator boundary, int pos, String source) {
         int end = boundary.following(pos);
         int start = boundary.previous();
         System.out.println(source.substring(start,end));
     }

This works even for unaligned "pos": following will slide to the end of the
current compositional character, and previous back to its actual start. 
Thus one can standardize an accidental caret offset, for example.
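
As a sketch of that last point, the same two calls can snap a caret offset back to the start of the character it landed in. The class CaretSnap and the method snapToCharacterStart are hypothetical names, and the use of Locale.ROOT is an assumption made here, not something this thread has decided.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class CaretSnap {
    // Hypothetical helper: moves an offset that may sit inside a character
    // back to the start of that character.
    static int snapToCharacterStart(String text, int offset) {
        if (offset <= 0) {
            return 0;
        }
        if (offset >= text.length()) {
            return text.length();
        }
        BreakIterator boundary = BreakIterator.getCharacterInstance(Locale.ROOT);
        boundary.setText(text);
        boundary.following(offset);  // end of the character containing 'offset'
        return boundary.previous();  // back to the start of that character
    }

    public static void main(String[] args) {
        String line = "a\uD835\uDC9Cb";
        // Offset 2 is between the two surrogates of the script A.
        System.out.println(snapToCharacterStart(line, 2)); // 1
    }
}
```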

----------------------------------------------------------------------

Comment By: Makarius (makarius)
Date: 2011-11-29 14:19

Message:
I have never heard of BreakIterator before, but it actually looks
interesting.  See also the tutorial in
http://docs.oracle.com/javase/tutorial/i18n/text/char.html

Note that this is a factory, so nothing to implement from the user's side: 
 java.text.BreakIterator.getCharacterInstance() does the main job.  One can
then use setText(), next(), previous(), or isBoundary(offset) to check if
an accidental offset did actually "land" correctly on a character boundary
(surrogate sequence or more complex composition).
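
A minimal sketch of such a check, assuming Locale.ROOT as the locale (see the open question below); the class BoundaryCheck and its method names are invented for this example:

```java
import java.text.BreakIterator;
import java.util.Locale;

final class BoundaryCheck {
    // Hypothetical helper: true if 'offset' does not split a character.
    static boolean isCharacterBoundary(String text, int offset) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        return it.isBoundary(offset);
    }

    // Hypothetical helper: counts characters by walking the boundaries with next().
    static int countCharacters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        int count = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a\uD835\uDC9C";
        System.out.println(isCharacterBoundary(text, 2)); // false, inside the pair
        System.out.println(countCharacters(text));        // 2
    }
}
```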

One remaining question is about the locale.  By default it is implicit, but
using a funny one could change the meaning of editor movements in ways that
need to be understood first.  There could be two extremes: either hardwire a
default locale that does the Unicode job right, but nothing more, or provide
jEdit properties for choosing a locale.  (Right now our main application is
mathematics via Unicode, but I can't wait to see proper Arabic text
processing :-))

----------------------------------------------------------------------

Comment By: Jarek Czekalski (jarekczek)
Date: 2011-11-29 09:56

Message:
We're preparing a design of a general solution at devel mail list. It will
take some time. So at the moment the only option is this ad-hoc patch,
which I believe is working correctly.

----------------------------------------------------------------------

Comment By: Kazutoshi Satoda (k_satoda)
Date: 2011-11-28 08:54

Message:
I agree that the malformation of data is a serious problem.

I know this problem as part of the more difficult "character" handling in
a text component. In fact, even a code point can be only a part of a
"character" in some languages. For such languages, alignment should be
done using BreakIterator.
http://docs.oracle.com/javase/6/docs/api/java/text/BreakIterator.html#getCharacterInstance%28%29

I admit that fixes to avoid malformation of surrogate pairs may be a
right step toward the goal. But if the design or interface is revised
for these problems, I hope it allows support for more general (and
complex) character breaks.

----------------------------------------------------------------------

Comment By: Jarek Czekalski (jarekczek)
Date: 2011-11-28 06:56

Message:
Let me summarize: mathematical applications require input in Unicode text
format and the symbols used are from above the BMP. Knowing this, I second
the need to properly handle such texts in jEdit. On the other hand, we could
claim that jEdit doesn't support characters above the BMP, but such
cowardice is not necessary.

If jEdit is to support full Unicode, the described malformation of data is a
serious problem. Since there is nothing in the middle between a normal and
a severe bug, I mark it as severe and ask Kazutoshi for an answer.

----------------------------------------------------------------------

Comment By: Makarius (makarius)
Date: 2011-11-28 02:44

Message:
> >Comment By: Jarek Czekalski (jarekczek)
> Date: 2011-11-27 22:44

> I wonder how important are chars above BMP. Could you describe your text
> and why do you need this uncommon feature? I'll quote from Unicode Standard
> 6.0.0 chapter 2 "General Structure", 2.8 "Unicode Allocation", Planes:
> http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf
> The Basic Multilingual Plane (BMP, or Plane 0) contains the
> common-use characters for all the modern scripts of the world as well as
> many historical and rare characters. By far the majority of all Unicode
> characters for almost all textual data can be found in the BMP.

This quote from the Unicode standard is a bit optimistic.  Our application
is mathematical text, and the BMP does not provide much for that.  The STIX
project managed to get the glyph tables extended already some years ago,
but it is all outside the BMP.

See also http://www4.in.tum.de/~wenzelm/papers/isabelle-doc.pdf especially
section 2.1 to get an idea of the situation of "Unicode for poor man's
mathematical rendering".  With some tweaks of the TextAreaPainter jEdit
does actually quite well, including sub- and superscripts.

----------------------------------------------------------------------

Comment By: Jarek Czekalski (jarekczek)
Date: 2011-11-27 22:44

Message:
Makarius, there is a patch for this bug at
https://sourceforge.net/tracker/index.php?func=detail&aid=3419148&group_id=588&atid=300588
If you are able to compile jEdit, you could have your own Unicode version
to test or even use.

I wonder how important are chars above BMP. Could you describe your text
and why do you need this uncommon feature? I'll quote from Unicode Standard
6.0.0 chapter 2 "General Structure", 2.8 "Unicode Allocation", Planes:
http://www.unicode.org/versions/Unicode6.0.0/ch02.pdf
The Basic Multilingual Plane (BMP, or Plane 0) contains the
common-use characters for all the modern scripts of the world as well as
many historical and rare characters. By far the majority of all Unicode
characters for almost all textual data can be found in the BMP.

----------------------------------------------------------------------

Comment By: Makarius (makarius)
Date: 2011-10-03 02:03

Message:
I would join an effort to improve the situation -- the problem occurs in my
everyday life when working with additional mathematical characters.  I am
not an expert of the integral parts of jEdit, but I have studied the
sources often, and can offer to try out intermediate versions produced by
the experts.

----------------------------------------------------------------------

Comment By: Jarek Czekalski (jarekczek)
Date: 2011-10-03 00:51

Message:
I confirm this bug. There is also a problem when trying to programmatically
access code points from TextArea. Code points are the real characters,
sometimes having a length of 2 in a String. Would you mind if I added
methods:

int getCodePointAt(int index)
String getCodePoints(int start, int len)

to TextArea class?

The whole problem is broader, as I guess all text methods of jEdit assume a
code point length of 1. Other methods that surely need to change are:
delete()
goToNextCharacter()
goToPrevCharacter()
backspace()
Do you think they can simply be changed to skip/delete the number of chars
returned by String.offsetByCodePoints?

All these details are explained quite well under String api documentation:
http://download.oracle.com/javase/6/docs/api/java/lang/String.html
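
To make the proposal concrete, here is a sketch on a plain String. The class CodePointEdits is hypothetical, and the methods only mimic the names proposed above; they are not TextArea code. It assumes the caret already sits on a code point boundary, since String.offsetByCodePoints counts an unpaired surrogate as a code point of its own.

```java
final class CodePointEdits {
    // Mirrors the proposed getCodePointAt(int index).
    static int getCodePointAt(String text, int index) {
        return text.codePointAt(index);
    }

    // Mirrors the proposed getCodePoints(int start, int len), len in code points.
    static String getCodePoints(String text, int start, int len) {
        int end = text.offsetByCodePoints(start, len);
        return text.substring(start, end);
    }

    // Removes the whole code point before the caret.
    static String backspace(String text, int caret) {
        if (caret <= 0) {
            return text;
        }
        int start = text.offsetByCodePoints(caret, -1);
        return text.substring(0, start) + text.substring(caret);
    }

    // Removes the whole code point after the caret.
    static String delete(String text, int caret) {
        if (caret >= text.length()) {
            return text;
        }
        int end = text.offsetByCodePoints(caret, 1);
        return text.substring(0, caret) + text.substring(end);
    }

    public static void main(String[] args) {
        String line = "a\uD835\uDC9Cb";
        System.out.println(Integer.toHexString(getCodePointAt(line, 1))); // 1d49c
        System.out.println(backspace(line, 3)); // ab
        System.out.println(delete(line, 1));    // ab
    }
}
```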

----------------------------------------------------------------------
