From: <do...@im...> - 2001-01-17 06:43:32
Hello,

First of all, let me extend my appreciation for the good work that you
have been putting into the wv library! I've been trying to figure out how
to get wv to work with files produced by Hebrew Word, and so far I have
found the following problems:

1. The UTF-8 output of the Hebrew characters suffers from an endian
   problem. The detailed description below shows what is happening.
   Basically, all the Hebrew characters were endian-swapped before they
   were encoded.

2. I have a guess at the answer to your question:

   > Could Someone who sees this tag tell me what was is this type of
   > justification, asian languages only i thing

   As far as I can tell, this means that the paragraph is an RTL paragraph
   as far as the Unicode BiDi algorithm is concerned. In HTML 4.0 this
   corresponds to <p DIR=RTL>. I would very much like the HTML output to
   include the DIR directive, as it is critical for getting the right
   display in browsers. (Currently only IE supports bidi, but Mozilla is
   also progressing.)

3. The inline image in the file t1.doc also contains Hebrew characters,
   but wvHtml only gives me some unintelligible error messages and no
   image is output. I'm attaching t1.doc here if you would like to
   investigate it.

And finally, about text mode: I will try to add support for the iso8859-8
conversion (which is basically the same as Windows code page 1255) and
connect the library to my FriBidi library (see
http://imagic.weizmann.ac.il/~dov/freesw/FriBidi) to do proper character
arrangement in text mode.

Eventually I would like to get AbiWord to work in BiDi, but that is a
totally different story. 8-)

My system info is as follows:

   System        RedHat Linux 7.0, kernel 2.2.16, running on PIII HW
   wv version    0.6.3
   libwfm ver    0.1.21 (even though the file libwfm/version shows 0.1.16)

Regards,
Dov

--
Here are my log notes, from trying to figure out why the Hebrew HTML is
wrong.

> wvTrace: (./wvWare.c:801) charset is UTF-8, lid is 40d, type is 0, char is 5e9
> wvTrace: (./sttbf.c:186) title char is =DC

In the HTML output, the UTF8 Hebrew characters are not output correctly.
The Hebrew characters belong to the range that has 11 significant bits.
These are supposed to be output in UTF8 as:

   110x-xxxx 10xx-xxxx

where x means a significant bit. Thus for the first string, Shalom, which
in Hebrew is U5E9, U5DC, U5D5, U5DD, I would expect the characters to be
output in UTF8 as:

   Unicode  Binary               UTF8 Bin             UTF8 Hex output
   U5E9     0000-0101 1110-1001  1101-0111 1010-1001  d7 a9
   U5dc     0000-0101 1101-1100  1101-0111 1001-1100  d7 9c
   U5d5     0000-0101 1101-0101  1101-0111 1001-0101  d7 95
   U5dd     0000-0101 1101-1101  1101-0111 1001-1101  d7 9d

Instead, in the wv HTML output we get:

   ee a4 85 ed b0

or in binary:

   ee a4 85  =  1110-1110 1010-0100 1000-0101

Running it through the utf8-dump script, which I'm including below,
returns the sequence:

   U0000e905
   U0000dc05
   U0000d505
   U0000dd05

Thus it seems that there is an endian problem before the characters are
UTF8 encoded and written to the HTML file...

--
Here is my utf8-dump script:

   #!/usr/local/bin/perl

   my $fn = shift || die "Need filename!\n";
   open(UCS4, "iconv -f utf8 -t ucs4 $fn |");
   while(<UCS4>) {
       @a = unpack("N*", $_);   # Network order
       foreach (@a) {
           printf "U%08x", $_;
           if ($_ > 32 && $_ < 128) {
               print ' ', pack("C", $_);
           }
           print "\n";
       }
   }

-- 
   ___   ___
 /  o  \   o \        Dov Grobgeld
(  o  o  ) o  |       The Weizmann Institute of Science, Israel
 \  o  /o  o /        "Where the tree of wisdom carries oranges"
   | |   | |
  _| |_ _| |_
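
P.S. The endian diagnosis can be cross-checked with a short sketch. The Python below is only an illustration under stated assumptions: the names utf8_encode, swap16 and SHALOM are hypothetical and are not part of wv, and the encoder is written by hand (rather than using str.encode) because two of the byte-swapped values fall in the surrogate range, which a strict UTF-8 encoder would refuse. It encodes each code point of Shalom directly, then again after swapping the two bytes of the 16-bit value, so the two results can be compared with the bytes quoted from the log.

```python
def utf8_encode(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoding of one code point up to 16 bits.

    Deliberately does not reject surrogates, so that byte-swapped
    values such as 0xDC05 can still be shown.
    """
    if cp < 0x80:                       # 0xxx-xxxx
        return bytes([cp])
    if cp < 0x800:                      # 110x-xxxx 10xx-xxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110-xxxx 10xx-xxxx 10xx-xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("only 16-bit values are handled in this sketch")


def swap16(cp: int) -> int:
    """Swap the high and low bytes of a 16-bit value."""
    return ((cp & 0xFF) << 8) | (cp >> 8)


def hexbytes(data: bytes) -> str:
    return " ".join(f"{b:02x}" for b in data)


# Shalom: shin, lamed, vav, final mem (hypothetical constant name)
SHALOM = [0x05E9, 0x05DC, 0x05D5, 0x05DD]

for cp in SHALOM:
    swapped = swap16(cp)
    print(f"U{cp:04x} -> {hexbytes(utf8_encode(cp))}   "
          f"swapped U{swapped:04x} -> {hexbytes(utf8_encode(swapped))}")
```

If the first column matches the expected table and the second column matches the bytes found in the HTML file, that supports the guess that the 16-bit values are byte-swapped before encoding. It does not show where in wv the swap happens.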