[Wvware-devel] wv and documents from Hebrew word


Hello,

First of all, let me extend my appreciation for the good work
that you have been putting into the wv library!

I've been trying to figure out how to get wv to work with files
output by Hebrew Word, and I have found a few problems. In
summary, so far I have found the following:

1. The UTF-8 output of the Hebrew characters suffers from
   an endian problem. The log notes at the end of this mail show
   in more detail what is happening. Basically, all the Hebrew
   characters were endian-swapped before they were encoded.

2. I have a guess of the answer to your question:

   > Could Someone who sees this tag tell me what was is this type of
   > justification, asian languages only i thing

   As far as I can tell this means that the paragraph is a RTL paragraph
   as far as the Unicode BiDi algorithm is concerned. In HTML 4.0 this
   corresponds to <p DIR=RTL>.

   I would very much like the HTML output to output the DIR directive
   as it is critical to get the right display in the browsers. (Currently
   only IE supports bidi, but Mozilla is also progressing).

3. The inline image in the file t1.doc also contains Hebrew
   characters. But wvHtml only gives me some unintelligible error
   messages and no image is output.
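To illustrate what I am asking for in point 2, here is a rough Python sketch (not wv code). wv would presumably take the direction from the paragraph property in the .doc file; as a stand-in, this sketch guesses it from the first strongly directional character. The function name `paragraph_html` is made up for the example.

```python
import unicodedata
from html import escape

def paragraph_html(text):
    """Wrap text in <p>, adding dir="rtl" when the first strong character is RTL."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):      # Hebrew is class R, Arabic is class AL
            return '<p dir="rtl">%s</p>' % escape(text)
        if cls == "L":              # first strong character is left-to-right
            break
    return "<p>%s</p>" % escape(text)

print(paragraph_html("\u05e9\u05dc\u05d5\u05dd"))
print(paragraph_html("hello"))
```

This is only a heuristic; a paragraph that starts with a Latin word but is marked RTL in Word would need the real paragraph property.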

I'm attaching t1.doc here if you would like to investigate it.

Finally, regarding text mode: I will try to add support for
iso8859-8 conversion (which is basically the same as Windows code
page 1255) and connect the library to my FriBidi library (see
http://imagic.weizmann.ac.il/~dov/freesw/FriBidi) to do proper
character arrangement in text mode.
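As a quick check of the "basically the same" claim, the following Python sketch (using Python's built-in `iso8859_8` and `cp1255` codecs, not anything from wv) compares the two encodings on the word Shalom and on one vowel point. The helper `can_encode` is just an illustrative name.

```python
SHALOM = "\u05e9\u05dc\u05d5\u05dd"   # shin, lamed, vav, final mem

def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

iso = SHALOM.encode("iso8859_8")
win = SHALOM.encode("cp1255")
print(" ".join("%02x" % b for b in iso))
print("letters agree:", iso == win)

# Vowel points (here U+05B8, qamats) exist only in the Windows code page.
print(can_encode("\u05b8", "cp1255"), can_encode("\u05b8", "iso8859_8"))
```

So for plain Hebrew letters the two are interchangeable, while pointed text would need the cp1255 table.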

Eventually I would like to get AbiWord to work in BiDi, but that is
a totally different story. 8-)

My system info is as follows:

   System RedHat Linux 7.0, kernel 2.2.16 running on PIII HW
   wv version 0.6.3
   libwfm ver 0.1.21 (even though the file libwfm/version shows 0.1.16)

Regards, Dov
--

Here are my log notes from trying to figure out why the Hebrew HTML
is wrong.

> wvTrace: (./wvWare.c:801) charset is UTF-8, lid is 40d, type is 0, char is 5e9
> wvTrace: (./sttbf.c:186) title char is =DC

In the HTML output, the UTF-8 encoded Hebrew characters are not correct.
The Hebrew characters belong to the range of code points that have 11
significant bits. These are supposed to be output in UTF-8 as:

    110x-xxxx 10xx-xxxx

where x means a significant bit.

Thus the first string, Shalom, which in Hebrew is U5E9, U5DC, U5D5, U5DD,
should be output in UTF-8 as

     Unicode          Binary         UTF8 Bin              UTF8 Hex output
      U5E9      0000-0101 1110-1001  1101-0111 1010-1001      d7 a9
      U5dc      0000-0101 1101-1100  1101-0111 1001-1100      d7 9c
      U5d5      0000-0101 1101-0101  1101-0111 1001-0101      d7 95
      U5dd      0000-0101 1101-1101  1101-0111 1001-1101      d7 9d
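The table can be cross-checked with a short sketch of the two-byte form (Python here, purely for illustration; `utf8_two_byte` is a made-up helper name, not a wv function):

```python
def utf8_two_byte(cp):
    """Encode a code point in U+0080..U+07FF as 110x-xxxx 10xx-xxxx."""
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("code point does not fit the two-byte form")
    lead = 0xC0 | (cp >> 6)       # 110 followed by the top 5 bits
    trail = 0x80 | (cp & 0x3F)    # 10 followed by the low 6 bits
    return bytes([lead, trail])

SHALOM = [0x5E9, 0x5DC, 0x5D5, 0x5DD]

for cp in SHALOM:
    encoded = utf8_two_byte(cp)
    print("U%04X  %s" % (cp, " ".join("%02x" % b for b in encoded)))
```

Each line printed should match the "UTF8 Hex output" column above.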

Instead in the wv html output we get:

      ee a4 85 ed b0

or in binary

      ee a4 85   1110-1110 1010-0100 1000-0101

Running it through the utf8-dump script, which I'm including below,
returns the sequence:

    U0000e905
    U0000dc05
    U0000d505
    U0000dd05

Thus it seems that there is an endian problem before the characters
are UTF8 encoded and written to the HTML file...
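To test this hypothesis, here is a small sketch that models the symptom. It assumes the swap happens on 16-bit code units before encoding; that is my guess, not something I have confirmed in the wv source. The names `swap16` and `utf8_three_byte` are made up for the example. The encoding is done by hand because the swapped values include surrogates, which a strict encoder would refuse.

```python
def swap16(cp):
    """Swap the two bytes of a 16-bit code unit."""
    return ((cp & 0xFF) << 8) | (cp >> 8)

def utf8_three_byte(cp):
    """Encode U+0800..U+FFFF mechanically as 1110-xxxx 10xx-xxxx 10xx-xxxx."""
    if not 0x800 <= cp <= 0xFFFF:
        raise ValueError("code point does not fit the three-byte form")
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in [0x5E9, 0x5DC, 0x5D5, 0x5DD]:
    swapped = swap16(cp)
    encoded = utf8_three_byte(swapped)
    print("U%04X -> U%04X  %s" % (cp, swapped,
                                  " ".join("%02x" % b for b in encoded)))
```

The first swapped character comes out as ee a4 85 and the second starts with ed b0, which is what appears in the wv HTML output.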

--
Here is my utf8-dump script:

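#!/usr/local/bin/perl
use strict;
use warnings;

my $fn = shift || die "Need filename!\n";
# Convert the file to UCS-4 so that every character is exactly 4 bytes.
open(UCS4, "iconv -f utf8 -t ucs4 $fn |") || die "Cannot run iconv: $!\n";
binmode(UCS4);

# Read 4 bytes at a time, so that a 0x0a byte inside a character
# cannot split the input in the middle of a code point.
while (read(UCS4, my $buf, 4) == 4) {
  my $c = unpack("N", $buf);   # Network order
  printf "U%08x", $c;
  if ($c > 32 && $c < 128) {
    print ' ', pack("C", $c);
  }
  print "\n";
}
close(UCS4);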
--
                                                        ___   ___
                                                      /  o  \   o \
Dov Grobgeld                                         ( o  o  ) o   |
The Weizmann Institute of Science, Israel             \  o  /o  o /
"Where the tree of wisdom carries oranges"              | |   | |
                                                       _| |_ _| |_