On Thu, 1 Dec 2011, SourceForge.net wrote:
>> Comment By: Jarek Czekalski (jarekczek)
> Date: 2011-12-01 00:01
> I attach a file revealing some unicode details. I post it here to store it
> on sf server. Please discuss it only on jedit-devel mailing list.
Thanks, I have now made some experiments with the example HTML file.
First of all here is a change to my own jEdit plugin, to use the BreakIterator
instead of inspecting surrogates manually:
At this point I determine a text range for my caret painter replacement, which
paints the whole piece of text under the caret in reverse video -- this fits
better than the default box caret for different text styles (sub/superscript)
and unusually wide glyphs in the mathematical font (long arrows etc.).
The BreakIterator with the CharacterIterator view on the buffer's text Segment
works reasonably well, although one could imagine some method of JEditBuffer to
produce the initialization for client code to reduce the required tinkering a
bit.
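
To make the idea concrete, here is a minimal sketch of how such a range could be
determined with BreakIterator over a Segment. This is not the actual plugin
code: the class name CaretRangeSketch and the method clusterAt are hypothetical,
and the caret is assumed to be given relative to the start of the segment.

```java
import java.text.BreakIterator;
import javax.swing.text.Segment;

class CaretRangeSketch {
    /**
     * Returns {start, end} of the character cluster under the caret,
     * both relative to the start of the segment.
     * (Hypothetical helper, not part of jEdit.)
     */
    static int[] clusterAt(Segment seg, int caret) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(seg); // Segment implements CharacterIterator

        // Boundaries are reported in the iterator's own index space,
        // which for a Segment begins at getBeginIndex(), not at 0.
        int begin = seg.getBeginIndex();
        int pos = begin + caret;
        if (pos >= seg.getEndIndex()) {
            return new int[] { caret, caret }; // caret at end: empty range
        }
        int start = it.isBoundary(pos) ? pos : it.preceding(pos);
        int end = it.following(pos);
        return new int[] { start - begin, end - begin };
    }

    public static void main(String[] args) {
        // x, then A followed by U+0308 (combining diaeresis), then y
        char[] chars = "xA\u0308y".toCharArray();
        Segment seg = new Segment(chars, 0, chars.length);
        int[] range = clusterAt(seg, 2); // caret placed on the combining mark
        System.out.println(range[0] + ".." + range[1]);
    }
}
```

The same call covers surrogate pairs and combining sequences, so no manual
surrogate inspection is needed in client code.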
Back to some further fine points. In the example file you write:
> 5. Combining Character Sequence, A with 2 dots above, _Ä_.
> It was inserted into jedit buffer with plain A followed by
> buffer.insert(textArea.getCaretPosition(), "\u0308"), which is a
> "combining diaeresis" code. This one displays correctly in my jedit on
> Windows XP. It's funny what happens when you erase characters before
> "combining diaeresis" (A and farther to the left).
I have tried this on Windows, Linux, Mac OS, with mixed results. Some
configurations don't display the composite characters, producing merely a
replacement box. What was your funny effect above? In my situation, deleting
the A before the "combining diaeresis" would apply that to the preceding
character if possible (say, one of a, e, i, o, u), but produce a box if not.
I would say this is OK -- users will just depend on the display capabilities of
their platform and installed fonts.
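
Independently of what the font can render, the re-attachment is visible at the
level of the text itself. The following sketch uses a plain StringBuilder as a
stand-in for the jEdit buffer (an assumption for illustration only; the class
CombiningMarkSketch and the method countClusters are hypothetical names). It
counts the character clusters that BreakIterator reports before and after the
base letter is deleted.

```java
import java.text.BreakIterator;

class CombiningMarkSketch {
    /** Counts the character clusters that BreakIterator sees in the text. */
    static int countClusters(CharSequence text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text.toString());
        int n = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // o, then A followed by U+0308 (combining diaeresis)
        StringBuilder buf = new StringBuilder("oA\u0308");
        System.out.println(buf.length() + " chars, "
            + countClusters(buf) + " clusters");

        buf.deleteCharAt(1); // erase the A; the mark now follows the o
        System.out.println(buf.length() + " chars, "
            + countClusters(buf) + " clusters");
    }
}
```

Whether the resulting pair is drawn as a composed glyph or as a box is then up
to the platform and its fonts.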
Concerning the question about navigation / deletion wrt. codepoints or combined
characters, I've looked at what JTextArea does -- e.g. in the jEdit Find
dialog. There seem to be two cases:
(1) Plain caret navigation without editing uses greater units,
via "visual positions" according to the GUI view. See also
(2) Actual editing is based on plain unicode surrogates.
See also DeletePrevCharAction in
So in your example, the "combining diaeresis" may be deleted via backspace, but
the "A" will remain. Due to the navigation though, it is not possible to
delete the "A" and leave the "combining diaeresis" standing alone.
My impression is that most of these gory details of the standard Swing
components are a bit over-engineered.
Uniform navigation + editing either via codepoints or via BreakIterator looks
like an adequate approximation to me.
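
As a rough sketch of the two uniform alternatives, here is a backspace that
steps either by codepoint or by BreakIterator cluster. It works on a plain
String rather than a jEdit buffer, and the class UniformDeleteSketch with its
methods is hypothetical, introduced only for illustration.

```java
import java.text.BreakIterator;

class UniformDeleteSketch {
    /** Offset one codepoint before the caret (a surrogate pair counts once). */
    static int prevByCodePoint(String text, int caret) {
        return caret == 0 ? 0 : Character.offsetByCodePoints(text, caret, -1);
    }

    /** Offset of the last BreakIterator character boundary before the caret. */
    static int prevByCluster(String text, int caret) {
        if (caret == 0) {
            return 0;
        }
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        return it.preceding(caret);
    }

    /** Deletes the unit before the caret, using the chosen granularity. */
    static String backspace(String text, int caret, boolean byCluster) {
        int start = byCluster
            ? prevByCluster(text, caret)
            : prevByCodePoint(text, caret);
        return text.substring(0, start) + text.substring(caret);
    }

    public static void main(String[] args) {
        String s = "A\u0308"; // A followed by U+0308 (combining diaeresis)
        // By codepoint only the combining mark goes away and the A remains;
        // by cluster the base letter and the mark are removed together.
        System.out.println(backspace(s, 2, false).length());
        System.out.println(backspace(s, 2, true).length());
    }
}
```

Using the same stepping function for caret movement and for deletion is what
would make the two operations uniform.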