From: Karl-Heinz Z. <kh...@kd...> - 2002-05-07 17:25:39
On Tuesday 07 May 2002 16:50, aigoabe wrote:
> Hi everybody!

Hi Eneko!

> My name is Eneko Goyenaga and I'm a Spanish TI student. I've been ordered

'Ordered' by your professor, or by somebody who is going to pay you
money for it?

> to design a word-doc-file to html converter. As I can see wv already does
> it, but I would like to do it myself. The problem is that I am not what
> could be considered as an "expert" programmer ;-). I know C and the basics
> of C++ too, but haven't had much experience using them, that's why my
> questions may seem to be "quite" stupid -sorry for that-. Is it as
> difficult as it seems to me programming such a converter?

No. Most likely it is even more difficult.

> would it be easier if I used the wv libraries in order to access the
> insides of the doc file?.

Of course it would, but that would mean skipping the most difficult
task: writing the scanner part of the converter.

Once you have the scanner running (that is, as soon as you have a solid
basic understanding of the structure of MS Winword files and your code
is able to read and analyze them) you are half done: the rest of the job
is lots of 'hacks' to find dirty ways to convert the ugly Winword
structures into something that looks similar in HTML.

> Another question I have is whether the wv libraries access the insides
> of the doc file dealing with it directly, let's say, in raw mode, or
> via ole.

The OLE handling is only one little (but important) step in the whole
scanning process: MS Office files (especially Winword ones) are stored
in so-called 'storages'. Such a storage is a kind of mini-filesystem
inside a file: a storage can contain sub-storages and it can contain
streams. The sub-storages can be compared to directories and the
streams can be compared to files.
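
To make the directory/file comparison concrete, here is a minimal Python
sketch. It is only a toy in-memory model, not a parser for the real
on-disk format: a dict stands for a (sub-)storage and a bytes object for
a stream. The stream names mimic what a Winword 97 file typically
contains; the entry "_1234567890", the byte contents and the helper
names list_streams() and open_stream() are made up for illustration.

```python
# Toy model: dict = storage or sub-storage, bytes = stream.
# All contents below are placeholders, not real Winword data.
toy_storage = {
    "WordDocument": b"<main stream bytes>",
    "1Table": b"<table stream bytes>",
    "ObjectPool": {                      # a sub-storage ("directory")
        "_1234567890": {                 # hypothetical embedded object
            "Ole10Native": b"<embedded object bytes>",
        },
    },
}


def list_streams(storage, prefix=""):
    """Yield (path, size) for every stream, descending into sub-storages."""
    for name, entry in sorted(storage.items()):
        path = prefix + "/" + name
        if isinstance(entry, dict):
            # Sub-storage: walk it like a directory.
            yield from list_streams(entry, path)
        else:
            # Stream: behaves like a plain file.
            yield path, len(entry)


def open_stream(storage, path):
    """Return the bytes of the stream at a '/'-separated path."""
    node = storage
    for part in path.strip("/").split("/"):
        node = node[part]
    if isinstance(node, dict):
        raise IsADirectoryError(path + " is a storage, not a stream")
    return node


for stream_path, size in list_streams(toy_storage):
    print(stream_path, size)
```

A real storage class would offer the same two operations (walk the tree,
open a stream by path) but would have to locate the data inside the
compound file instead of in a dict.
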
But this is the easy step; the complicated ones come when you try not
to get the PLCF and FCB thingies the wrong way around, and when you
suddenly notice that this idiotic 'piece table' is one of the most
important structures in modern MS Winword files (because the
Microsofties were toooo lazy to implement something smart when they were
ordered to support Unicode in Winword documents)...

> I'm sure there's some kind of reference to it on the web, but I haven't
> found it, sorry for that once again.

The best references are these:

a) Get some old versions of the MSDN CDs (Microsoft Developers Network)
   and study the MS Office file format descriptions contained there.

b) Get the OpenOffice sources and study their ww97 scanner, which is
   part of the ww6/ww95/ww97/ww2000 import filter:
   http://sw.openoffice.org/source/browse/sw/sw/source/filter/ww8/

> Regarding to dealing with the word file, which way do you recommend me,
> via OLE or directly?

You either need a way to get the _streams_ out of the storage (this is
what programs like laola are used for) or you need your own storage
class that lets you transparently access the sub-storages and the
streams.

> The problem about OLE is that, as I said, I'm not an
> experienced programmer, and I fear it may be a difficult task for me.
> The problem about doing it directly is... well, you only have to look
> at the spec. documents! I find them virtually impossible to understand!

Best wishes for this project: it could be your master's thesis or
something of similar complexity - do _not_ expect this to be done within
one or two months!

But of course (as long as you are writing free software) there _is_ the
option to just take lots of code from wv or OpenOffice and have your
scanner done in less time than when writing it from scratch. This
depends on whether the people who told you to do this job allow you to
do that...
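
P.S. To give a rough idea of what the piece table mentioned above does,
here is a simplified Python sketch. It assumes the piece table has
already been parsed into a list of pieces; in a real file it is stored
in binary form and has to be decoded first, which is left out here. The
Piece class, its field names and read_text() are hypothetical, and the
1-byte text is assumed to be cp1252.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Piece:
    cp_start: int     # first character position covered by this piece
    cp_end: int       # one past the last character position
    offset: int       # byte offset of the piece's text in the stream
    compressed: bool  # True: 1 byte per character, False: 2 bytes (UTF-16LE)


def read_text(stream: bytes, pieces: List[Piece]) -> str:
    """Rebuild the document text by visiting pieces in character order."""
    parts = []
    for piece in sorted(pieces, key=lambda p: p.cp_start):
        n_chars = piece.cp_end - piece.cp_start
        if piece.compressed:
            raw = stream[piece.offset:piece.offset + n_chars]
            parts.append(raw.decode("cp1252"))      # assumption: cp1252
        else:
            raw = stream[piece.offset:piece.offset + 2 * n_chars]
            parts.append(raw.decode("utf-16-le"))
    return "".join(parts)


# The 8-bit text sits first in the stream although it comes second in
# the document: the order in the file need not match the text order.
stream = b"plain text" + "\u0394: ".encode("utf-16-le")
pieces = [
    Piece(cp_start=0, cp_end=3, offset=10, compressed=False),
    Piece(cp_start=3, cp_end=13, offset=0, compressed=True),
]
print(read_text(stream, pieces))
```

The point of the sketch is only that you cannot read the text stream
front to back: you have to go through the piece table to learn where
each run of characters lives and how it is encoded.
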
-- 
Karl-Heinz Zimmer, Senior Software Engineer, Klarälvdalens Datakonsult AB
<mailto:kh...@kl...> <mailto:kh...@kd...>