rendering problem with UTF-8 encoded Devanagari

A graphical text difference analyzer

Brought to you by: arondel, joachim99

#52 rendering problem with UTF-8 encoded Devanagari

Milestone: v1.0_(example)

Status: closed

Owner: Joachim Eibl

Labels: None

Priority: 5

Updated: 2014-08-07

Created: 2006-05-24

Creator: Bob Eaton

Private: No

I'm attaching the text below as a pdf file in case
the webpage doesn't render things the way I'm
describing below...

I just installed and tried kdiff3 to use as the diff
utility for TortoiseSVN and it seems very promising
(read: definitely better than what is being offered
by default in TortoiseSVN’s diff), but it has a weird
feature when diff’ing two text files that have UTF-8
encoded Devanagari text:

With such runs of text, they typically have to be
rendered as a whole run, rather than character-by-
character, because otherwise, the dependent
diacritics are shown “offset” from the characters on
which they depend. For example, this is the
word ‘book’ in Hindi:

किताब

But this is what I see in kdiff3:

क‌ि‌ताब (this isn't what I mean; see the attached)

Notice that the 2nd and 4th characters show up with
their little dotted circles showing how they position
with respect to the character they depend on (and in
fact, out of correct order), since Uniscribe either a)
isn’t being used or b) isn’t being given the
characters to render together as a single run.
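
To make the point about the 2nd and 4th characters concrete, here is a minimal sketch of how the word splits into code points and how the dependent signs attach to the preceding consonant. The function names and the U+093E..U+094C range check are simplifying assumptions for illustration only; a real shaping engine such as Uniscribe uses full Unicode properties and also reorders glyphs.

```cpp
#include <string>
#include <vector>

// Simplified assumption: U+093E..U+094C covers the common Devanagari
// dependent vowel signs. This is not a complete classification.
bool isDependentVowelSign(char32_t c) {
    return c >= 0x093E && c <= 0x094C;
}

// Hypothetical helper: attach each dependent vowel sign to the base
// character before it, so that each group could be drawn as one unit.
std::vector<std::u32string> groupClusters(const std::u32string& text) {
    std::vector<std::u32string> clusters;
    for (char32_t c : text) {
        if (isDependentVowelSign(c) && !clusters.empty()) {
            clusters.back().push_back(c);  // sign joins the preceding base
        } else {
            clusters.push_back(std::u32string(1, c));  // start a new cluster
        }
    }
    return clusters;
}
```

For the word above (U+0915 U+093F U+0924 U+093E U+092C), this groups the five code points into three clusters. Drawing the five code points one at a time is what produces the dotted circles.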

This would be okay if the word was actually different
between the two panes, because in order for you to
mark it with a different color, you probably have to
render it in a character-by-character way (or at
least for the portion of the run that is different),
but it’s not as nice to look at when there is no
difference between the two...

Is there any way you can send strings to the renderer
as whole runs rather than character-by-character when
they are the same in both panes?

I am using the Windows version 9.9.0 and I have it
configured to interpret the data files as UTF-8
encoded (thank you for supporting this!).

By the way, normally I would have preferred to use
Arial Unicode MS as the font since that is a nicer
font to display Unicode-encoded Devanagari, but with
that font (which isn't fixed-width), the display was
even worse: It seemed that every character had a
space (or a virtual space offset) between them so
that the above was rendered as:

क ि त ा ब

Discussion

Bob Eaton - 2006-05-24

pdf image of the above text showing the issue correctly

KDiff3 rendering UTF8 Devanagari.pdf

Joachim Eibl - 2006-05-27

Logged In: YES
user_id=584435

Hi Bob,
I'm aware of your problem and intend to provide a solution in the future.
But because characters are displayed differently depending
on the previous or following characters, it won't be
possible to show character-by-character-differences.
But I will see what can be done.
Cheers,
Joachim

Bob Eaton - 2006-05-29

Logged In: YES
user_id=1327607

In looking into it, I think I know what's wrong: the
underlying Qt routines for 'drawText' must render the
strings of text (at least those which represent common
text between the two texts being compared) as runs of
text. It looks like they are rendering the characters of
the given string one-by-one using GetGlyphOutline. This
will not work for Indic languages (or at least not for
Devanagari).
If you are building KDiff3 as a "wide" application (which
it looks like you are -- i.e. using the "UNICODE" define),
then if Qt were to use the DrawText or ExtTextOut Win32
API instead, I think we'll get the behavior we're looking
for. Of course, that might make Devanagari (and other
Indic) scripts work, but break something else, but...

Nobody/Anonymous - 2006-05-29

Logged In: NO

No... I was wrong.

There may actually be a problem in Qt as well (it looks
like it only wants to do runs of text if it is latin1,
which this won't be), but the prior problem is that the
KDiff3 code itself is calling drawText one character at a
time.

So the thing to do would be (optionally, though I don't
know what the checkbox should be called) to redo
DiffTextWindowData::writeLine so that it accumulates
portions of the line that are the same between the two (or
three) panes and then calls drawText...

It might require some finagling in Qt as well (to get it
to treat it as a run of text rather than glyphs)... but
this would be the first thing that's necessary.

Sorry, you probably already knew all this...

Bob

Bob Eaton - 2006-05-30

source changes to diff.h and difftextwindow.cpp for Unicode Devanagari fix

kdiff.zip

Bob Eaton - 2006-05-30

Logged In: YES
user_id=1327607

I'm attaching the diff.h and difftextwindow.cpp I've
modified. I'm not comfortable (nor do I have any more time
to spend on this) becoming a "real" contributor to this
project, but the attached files do work for my need.

Basically, it accumulates a run of characters of the same
color before doing the drawText. Doing it this way causes
the runs to be rendered as a unit (which is what many
non-Roman ranges of Unicode need).
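
The accumulation idea can be sketched roughly as follows. This is not the code from the attached files; the Run struct, the accumulateRuns name, and the use of plain integer color ids are assumptions made for illustration, and the actual drawing call is left out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type for illustration; not an actual KDiff3 structure.
struct Run {
    std::size_t start;   // index of the first character in the run
    std::size_t length;  // number of characters in the run
    int color;           // color id shared by every character in the run
};

// Group consecutive characters that share a color into runs, so that
// each run can be passed to the text renderer in a single call instead
// of one call per character.
std::vector<Run> accumulateRuns(const std::vector<int>& colors) {
    std::vector<Run> runs;
    for (std::size_t i = 0; i < colors.size(); ++i) {
        if (!runs.empty() && runs.back().color == colors[i]) {
            ++runs.back().length;               // extend the current run
        } else {
            runs.push_back({i, 1, colors[i]});  // color changed: new run
        }
    }
    return runs;
}
```

A caller would then issue one draw call per Run, using start and length to slice the line. Where a run boundary falls inside a base-plus-diacritic group, the shaping would still break at that point, which is the limitation on character-by-character highlighting mentioned earlier in the thread.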

I haven't checked that it works for RTL. I'm pretty sure
the wrapping doesn't work (because it's also not based on
actual line lengths, but rather the simplification you did
before about the size of "W"). And the color rectangles
appear to be slightly off.

Nevertheless, I think if you don't already have a solution
for this, this might give you some hints.

Hope this helps,
Bob

Joachim Eibl - 2006-05-31

Logged In: YES
user_id=584435

Thank you for the patch. The basic idea is correct.
As you also see, there are quite a few things to fix,
before it can be really made public. (Word wrap,
RTL-languages, character highlighting, selections for copy
and paste.)
Nevertheless if this already helps you, very good!
Cheers,
Joachim

Joachim Eibl - 2014-08-07

status: open --> closed

Group: --> v1.0_(example)