Re: [Wvware-devel] A program Word to HTML converter.

On Tuesday 07 May 2002 16:50, aigoabe wrote:
> Hi everybody!

Hi Eneko!

> My name is Eneko Goyenaga and I'm a Spanish TI student. I've been ordered

'Ordered' by your professor, or by somebody who is going to pay you money for that?

> to design a word-doc-file to html converter. As I can see wv already does
> it, but I would like to do it myself. The problem is that I am not what
> could be considered as an "expert" programmer ;-). I know C and the basics
> of C++ too, but haven't had much experience using them, that's why my
> questions may seem to be "quite" stupid -sorry for that-. Is it as
> difficult as it seems to me programming such a converter?

No.
Most likely it is even more difficult.

> would it be easier if I used the wv libraries in order to access the
> insides of the doc file?

Of course it would, but that would mean skipping the most difficult
task: writing the scanner part of the converter.

Once you have the scanner running ( == as soon as you have a solid basic
understanding of the structure of MS Winword files and your code is
able to read and analyze them) you are half done: the rest of the job is
lots of 'hacks' to find dirty ways to convert the ugly Winword
structures into something that looks similar in HTML.

> Another question I have is whether the wv libraries access the insides
> of the doc file dealing with it directly, let's say, in raw mode, or
> via ole.

The OLE handling is only one small (but important) step in the whole
scanning process:

MS Office files (especially Winword ones) are stored in so-called
'storages'.
Such a storage is kind of a mini-filesystem inside a file: there can be
sub-storages in a storage and there can be streams in a storage.
The sub-storages can be compared to directories and the streams can
be compared to files.
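
To make the directory/file comparison concrete, here is a minimal Python
sketch of a storage as an in-memory tree. It does not read a real file; the
class names, the walk() helper, and the sample layout are all made up for
illustration (the stream names are only loosely modelled on what a Word 97
file contains).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Tuple, Union


@dataclass
class Stream:
    """Plays the role of a file: just a run of bytes."""
    data: bytes = b""


@dataclass
class Storage:
    """Plays the role of a directory: named children, each a Storage or a Stream."""
    children: Dict[str, Union["Storage", Stream]] = field(default_factory=dict)


def walk(storage: Storage, prefix: str = "") -> Iterator[Tuple[str, Stream]]:
    """Yield (path, stream) for every stream below `storage`, depth first."""
    for name, child in sorted(storage.children.items()):
        path = prefix + "/" + name
        if isinstance(child, Storage):
            yield from walk(child, path)   # descend into the sub-storage
        else:
            yield path, child              # a stream is a leaf


# Hypothetical layout, for illustration only.
doc = Storage({
    "WordDocument": Stream(b"main text stream"),
    "1Table": Stream(b"table stream"),
    "ObjectPool": Storage({"_1234": Storage({"Ole": Stream(b"embedded")})}),
})

paths = [path for path, _ in walk(doc)]
```

A converter would only ever ask such a tree for a stream by path and then
parse the bytes it gets back.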

But this is the easy step; the complicated ones come when you try not to
get the PLCF and FCB thingies the wrong way around and when you suddenly
notice that this idiotic 'piece table' is one of the most important
structures in modern MS Winword files (because the Microsofties were toooo
lazy to implement something smart when they were ordered to support
Unicode in Winword documents)...
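
A rough Python sketch of the idea behind the piece table, as far as I
understand it: each piece covers a range of character positions (CPs), points
at a byte offset in the stream, and is stored either as 8-bit or as 16-bit
text. The Piece tuple, read_text() and the sample data below are assumptions
made up for illustration; this is not the on-disk layout and it parses no
real file.

```python
from typing import List, NamedTuple


class Piece(NamedTuple):
    cp_start: int      # first character position covered by this piece
    cp_end: int        # one past the last character position
    offset: int        # byte offset of the piece's text in the stream
    is_unicode: bool   # True: 2 bytes per char (UTF-16LE); False: 1 byte (cp1252)


def read_text(stream: bytes, pieces: List[Piece]) -> str:
    """Reassemble the document text by visiting the pieces in CP order."""
    out = []
    for p in sorted(pieces, key=lambda piece: piece.cp_start):
        n_chars = p.cp_end - p.cp_start
        if p.is_unicode:
            raw = stream[p.offset:p.offset + 2 * n_chars]
            out.append(raw.decode("utf-16-le"))
        else:
            raw = stream[p.offset:p.offset + n_chars]
            out.append(raw.decode("cp1252"))
    return "".join(out)


# Made-up stream: the text is stored out of order and in mixed encodings.
stream = "W\u00f6rld".encode("utf-16-le") + b"Hello, "
pieces = [
    Piece(cp_start=7, cp_end=12, offset=0, is_unicode=True),
    Piece(cp_start=0, cp_end=7, offset=10, is_unicode=False),
]
text = read_text(stream, pieces)
```

The point is that the text order in the document need not match the byte
order in the file, so the scanner cannot simply read the stream front to back.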

> I'm sure there's some kind of reference to it on the web, but I haven't
> found it, sorry for that once again.

The best reference is this:

a) get some old versions of the MSDN CDs (Microsoft Developer Network)
   and study the MS Office file format descriptions contained there.

b) get the OpenOffice sources and study their ww97 scanner which is part
   of the ww6/ww95/ww97/ww2000 import filter:

   http://sw.openoffice.org/source/browse/sw/sw/source/filter/ww8/

> Regarding to dealing with the word file, which way do you recommend me,
> via OLE or directly?

You either need a way to get the _streams_ out of the storage (this
is what programs like laola are used for) or you need your own storage
class allowing you to transparently access the sub-storages and the streams.
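
If you go for your own storage class, the very first step is recognising the
container. A small Python sketch, assuming the usual compound file header
layout (8-byte signature at offset 0, 16-bit little-endian sector shift at
offset 30); the function name and the synthetic header are made up for
illustration, and everything after the header (FAT, directory entries) is
left out.

```python
import struct

# Magic bytes at the start of an OLE compound file.
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def sector_size(header: bytes) -> int:
    """Return the sector size of a compound file, or raise ValueError."""
    if len(header) < 32 or header[:8] != OLE_SIGNATURE:
        raise ValueError("not an OLE compound file")
    # The sector size is stored as a power of two (the "sector shift").
    (shift,) = struct.unpack_from("<H", header, 30)
    return 1 << shift


# Synthetic 32-byte header: signature, zero padding, sector shift 9.
fake = OLE_SIGNATURE + bytes(22) + struct.pack("<H", 9)
size = sector_size(fake)
```

With a real file you would pass in the first bytes read from disk instead of
the synthetic header.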

> The problem about OLE is that, as I said, I'm not an
> experienced programmer, and I fear it may be a difficult task for me.
> The problem about doing it directly is... well, you only have to look
> at the spec. documents! I find them virtually impossible to understand!

Best wishes for this project: it could be your master's thesis or something
of similar complexity - do _not_ expect this to be done within one or
two months!

But of course (as long as you are writing free software) there _is_ the
option to just take lots of code from wv or OpenOffice and have your
scanner done in less time than when writing it from scratch.  This depends
on whether you are allowed to do that by the people who told you to
do this job...

-- 
Karl-Heinz Zimmer, Senior Software Engineer, Klarälvdalens Datakonsult AB
<mailto:kh...@kl...>            <mailto:kh...@kd...>