[ jEdit-devel ] [ jedit-Bugs-3040720 ] Incorrect handling of Unicode outside BMP ("surrogates")

Bugs item #3040720, was opened at 2010-08-06 18:48
Message generated for change (Comment added) made by jarekczek
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=100588&aid=3040720&group_id=588

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: editor core
Group: normal bug
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Makarius (makarius)
Assigned to: Matthieu Casanova (kpouer)
Summary: Incorrect handling of Unicode outside BMP ("surrogates")

Initial Comment:
Handling of Unicode characters outside the "Basic Multilingual Plane" (U+0000 to U+FFFF) does not really work in TextArea and a few other custom-made text boxes of jEdit, e.g. the "Console" plugin.

Instead of navigating text according to "code points", as explained in http://download.oracle.com/javase/6/docs/api/index.html?java/lang/Character.html, jEdit usually works on plain old UTF-16 chars for caret movement, deletion etc.

The following example in Console/Beanshell illustrates this:

  buffer.insert(0, "\uD835\uDC9C\n")

This produces a calligraphic letter A, e.g. use the STIX fonts http://www.stixfonts.org to see it on the screen.  Positioning the caret at the end of the first line, a single BACKSPACE will only delete the second "surrogate" character in the internal representation, leaving the first one standing alone.  Thus the content of the buffer becomes malformed in the sense of this funny Java text representation.
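
The effect can be reproduced outside jEdit. The following is a minimal sketch in plain Java, not jEdit code: the class SurrogateDemo and the helper naiveBackspace are hypothetical names, and a String stands in for the buffer. It assumes BACKSPACE removes exactly one UTF-16 char before the caret, which matches the behaviour described above.

```java
public class SurrogateDemo {
    // Hypothetical helper: a char-based BACKSPACE that removes exactly one
    // UTF-16 char before the caret (assumed to mirror the reported behaviour).
    static String naiveBackspace(String text, int caret) {
        return text.substring(0, caret - 1) + text.substring(caret);
    }

    public static void main(String[] args) {
        String text = "\uD835\uDC9C\n"; // U+1D49C followed by a newline
        int caret = 2;                    // end of the first line, before the newline

        System.out.println(text.length());                         // 3 UTF-16 chars
        System.out.println(text.codePointCount(0, text.length())); // 2 code points

        String after = naiveBackspace(text, caret);
        // The first char is now a high surrogate without its low partner.
        System.out.println(Character.isHighSurrogate(after.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(after.charAt(1)));  // false
    }
}
```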

This has been reproduced in various official jEdit versions, such as 4.3.2.  It seems that the most basic text fields of Swing can handle these newer Unicode code points.

----------------------------------------------------------------------

Comment By: Jarek Czekalski (jarekczek)
Date: 2011-10-03 09:51

Message:
I confirm this bug. There is also a problem trying to programmatically
access code points from TextArea. Code points are real characters,
sometimes having a length of 2 in a String. Would you mind if I added
methods:

int getCodePointAt(int index)
String getCodePoints(int start, int len)

to TextArea class?
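
A rough sketch of what such methods could look like, written as static helpers over a String rather than as real TextArea members. The class name CodePointAccess is hypothetical, and the reading of the parameters is an assumption: start is taken as a char offset and len as a number of code points. Only standard String methods are used.

```java
public class CodePointAccess {
    // Hypothetical stand-in for the proposed getCodePointAt(int index):
    // returns the full code point starting at the given char offset.
    static int getCodePointAt(String text, int index) {
        return text.codePointAt(index);
    }

    // Hypothetical stand-in for the proposed getCodePoints(int start, int len).
    // Assumption: start is a char offset, len counts code points.
    static String getCodePoints(String text, int start, int len) {
        int end = text.offsetByCodePoints(start, len);
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "\uD835\uDC9Cbc"; // U+1D49C, then 'b' and 'c'
        System.out.println(Integer.toHexString(getCodePointAt(text, 0))); // 1d49c
        System.out.println(getCodePoints(text, 0, 2).length());           // 3 chars
    }
}
```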

The whole problem is broader, as I guess all text methods of jEdit assume a
code point length of 1. Other methods to change for sure are:
delete()
goToNextCharacter()
goToPrevCharacter()
backspace()
Do you think they can simply be changed to skip/delete the number of chars
returned by String.offsetByCodePoints?
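
To make the idea concrete, here is a small sketch of caret movement and deletion based on String.offsetByCodePoints. It is an illustration under assumptions, not the jEdit implementation: the class CodePointCaret and its methods are hypothetical, and a String stands in for the buffer, which in jEdit is a different data structure.

```java
public class CodePointCaret {
    // Char offset of the position one code point after the caret.
    static int nextOffset(String text, int caret) {
        return caret >= text.length() ? caret : text.offsetByCodePoints(caret, 1);
    }

    // Char offset of the position one code point before the caret.
    static int prevOffset(String text, int caret) {
        return caret <= 0 ? caret : text.offsetByCodePoints(caret, -1);
    }

    // Removes the whole code point before the caret (1 or 2 chars).
    static String backspace(String text, int caret) {
        return text.substring(0, prevOffset(text, caret)) + text.substring(caret);
    }

    // Removes the whole code point after the caret (1 or 2 chars).
    static String delete(String text, int caret) {
        return text.substring(0, caret) + text.substring(nextOffset(text, caret));
    }

    public static void main(String[] args) {
        String text = "\uD835\uDC9C\n"; // U+1D49C followed by a newline
        System.out.println(nextOffset(text, 0));          // 2: skips both surrogates
        System.out.println(backspace(text, 2).length());  // 1: only the newline is left
        System.out.println(delete(text, 0).length());     // 1: only the newline is left
    }
}
```

With this approach the caret never rests between the two halves of a surrogate pair, as long as it starts on a code point boundary.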

All these details are explained quite well in the String API documentation:
http://download.oracle.com/javase/6/docs/api/java/lang/String.html

----------------------------------------------------------------------
