I have played around some more tonight with converting doc files to
HTML, and I have encountered several problems. Here is a list of what I
found:
* Colors are not kept between subsequent lines of same color. See
* The bold attribute is not kept between groups of lines. See
this_is_bold.doc. Italic and underline attributes are not kept at
all, possibly because of the same problem as in the first item.
* Line spacing is not kept. This is also seen in this_is_bold.doc.
* Fonts. Neither font sizes nor faces are kept. See this_is_font.doc.
* Centering of tables is not kept. See this_is_table.doc.
* Colors are not set at all for Hebrew formats. Apparently Word
stores separate font attributes for Hebrew and English in each
paragraph. Thus it is possible to change the color of the Hebrew
text to red while the English text stays black. The translation
doesn't honor this at all. See shir.doc.
* The bidi directive DIR=RTL is not set in <table> or in <p> for RTL
paragraphs. See shir.doc.
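
To make the last item concrete, here is a rough sketch in Python of the
behaviour I would expect for paragraphs. It is not wv code, and the names
is_rtl and paragraph_html are made up for illustration. It assumes that
the direction of a paragraph can be taken from its first strongly
directional character, which is a simplification of what Word actually
stores.

```python
import unicodedata
from html import escape


def is_rtl(text):
    """Return True if the first strongly directional character is RTL.

    Simplifying assumption: paragraph direction follows the first strong
    character ("R" or "AL" means right-to-left, "L" means left-to-right).
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return True
        if bidi == "L":
            return False
    # No strong character found (digits, punctuation only): default to LTR.
    return False


def paragraph_html(text):
    """Wrap text in <p>, adding DIR="RTL" for right-to-left paragraphs."""
    attr = ' DIR="RTL"' if is_rtl(text) else ""
    return "<p{}>{}</p>".format(attr, escape(text))
```

With this, a Hebrew paragraph would come out as <p DIR="RTL">...</p>,
while an English one would stay a plain <p>...</p>. The same attribute
would be needed on <table> for RTL tables.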
I'm attaching the files mentioned above to this email. I believe that
all of the files except shir.doc should be readable in an English
version of Word.
How difficult would it be to solve the issues above? I would be happy
to try to fix them myself, but I need some guidance on where the
problems might be. Are the problems in the HTML conversion or in the
parsing of the doc files?
I'm quite worried that the problems I encountered are just the tip of
the iceberg, and that getting automatic translation of Hebrew doc files
into HTML to work would mean spending months of my time on the wv
library...
My details are as before:
System RedHat Linux 7.0, kernel 2.2.16 running on PIII HW
wv version 0.6.3
libwfm ver 0.1.21 (even though the file libwfm/version shows 0.1.16)
(Btw, I am still suffering from the strange endian problem, but I hacked
my way around that by creating an external filter to reswap the UTF-8...)
/ o \ o \
Dov Grobgeld ( o o ) o |
The Weizmann Institute of Science, Israel \ o /o o /
"Where the tree of wisdom carries oranges" | | | |
_| |_ _| |_