Kern --
--On Tuesday, November 30, 2004 9:54 AM +0100 Kern Sibbald
<kern@...> wrote:
> Hello Karl,
>
> On Mon, 2004-11-29 at 22:53, Karl Cunningham wrote:
>> --On Monday, November 29, 2004 6:02 PM +0100 Kern Sibbald
>> <kern@...> wrote:
>>
>> > Hello,
>> >
>> > After a little more reflection, I think the best solution for the
>> > manual is to translate it to LaTeX rather than to TexInfo as I had
>> > originally thought. I would appreciate it if the people who are
>> > interested in this project would look for html to LaTeX conversion
>> > programs, study them, and make some recommendations; this would help
>> > us make the decision on which one to go with. I'd recommend that you
>> > announce to the list which one you are studying so that there is a
>> > minimal duplication of effort.
>> >
>> > Things needed in a conversion program:
>> > 1. Handles a good range of html
>> > 2. Handles multiple files
>> > 3. Preserves pictures
>> > 4. Preserves links to other sites
>> > 5. Preserves links between sites
>> > 6. Adds cross references to chapters and sections
>> > 7. Does a good job on preserving tables
>> > 8. Handles special characters -- especially accented characters (e.g.
>> > á ...)
>> > 9. Does a good job on handling lists
>> >
>> >
>> > Obviously, the more things that the program handles, the better, but we
>> > can also implement some of the above if necessary. You might try it out
>> > on a few of the Bacula chapters, and note that the .html files will
>> > probably not translate well, while the .wml should though they have
>> > stuff at the beginning and end that is not really html.
>>
>> I haven't worked with Latex but I'm willing to learn. Someone with more
>> experience may be a better fit for this project, though.
>>
>> There is a short list of HTML to Latex converters at
>> http://www.tug.org/utilities/texconv/pctotex.html
>>
>> I looked at the first two:
>>
>> Tried html2tex.c, described at
>> http://home.wxs.nl/~faase009/html2tex.html. Compiled and ran it on
>> bacula/doc/html-manual/index.html. It complained about the <div></div>
>> tags. I think that sort of thing is configurable, at least to tell it
>> what to do when it comes across a tag it doesn't understand. I'll look
>> at it further and let you know what I find.
>>
>> The Perl script at http://html2latex.sourceforge.net/ requires some perl
>> modules I don't have (no big problem) but it seems to me to be a bit
>> harder to use. I'll try it as a second choice.
>>
>> Do you have a plan to test the accuracy of html->latex conversion? I see
>> that there are Latex syntax checkers and Latex to postscript converters.
>> I assume using both would be reasonable: check the syntax, then convert
>> the Latex output to postscript and see what it looks like on screen and
>> printed. This checking is a two-stage conversion so finding the source
>> of problems could be more difficult.
>>
>> Any thoughts?
>
> Since you guys are Perl programmers, I tried the Perl program on
> SourceForge first, but was unable to get it to build. I loaded about 10
> different Perl packages, but it always dies on the tags test when
> building.
>
> So, I downloaded and built the html2tex.c program. No problem. It works,
> and it seems to crunch bacula .wml files directly with no problem. It
> doesn't look like it preserves multiple file cross-references, but I
> need to look into that some more. At a first look, the latex (or TeX)
> output looks pretty good. I haven't tried running it through LaTeX
> though.
>
> There have been several other suggestions, at least one sent to me
> off-list, and I thank everyone, but at this point, I think I am going
> directly to LaTeX. It is not complicated, relatively easy to learn, and
> converts into *everything* imaginable.
>
> Now the task is to do a bit more research on possible conversion tools,
> and all help testing them is *very* welcome. I'd like to decide on a
> tool this week, then begin the conversion.
>
> My idea for the conversion is: *everything* will be done automatically
> to the original .wml manual. That is, we can modify the conversion
> tool(s) and the .wml, but from that point on for the moment, I would
> like no hand editing.
>
> This will permit us to create the new manual and test it with an improving
> degree of success. For example, I don't expect that in the beginning
> everything will work or be converted or look pretty. If we concentrate
> on keeping everything automatic (using sed scripts if necessary), then
> we can simply improve the conversion tools, and re-apply to get a better
> manual ... This will also allow us to apply the same process to the
> chapters that have been translated into French, ensuring that we end up
> with a French document too.
>
> For the first cut, I would like to keep the same chapter structure --
> i.e. 1 .tex file for each .wml file. At some point, I would like to
> split the manual into a number of different manuals:
>
> Tutorial
> Overview and quickstart
> Installation
> User's Guide (the main descriptive part)
> Reference Manual (definition of *all* directives)
> Developers Manual
>
> At the moment all the above exist, with the exception that the User's
> Guide and the Reference Manual are totally mixed together. The other
> sections could be relatively easily split out.

I worked with html2tex a bit this morning and did a conversion of all the
wml files into one LaTeX file, as a trial run.

A skeleton file must be written to tell html2tex which files to convert and
which conversion rules to apply. I've attached a simple one for this
project. This is only a rough starting point. Indent levels can be specified
for each input file, but I left them all at 1.

The conversion produced quite a bit of console output listing conversion
errors. Most of them look like things it didn't know what to do with, or
things it interpreted as html syntax errors. These will have to be examined
to see whether they are 1) real syntax errors, or 2) valid html syntax that
html2tex doesn't know about. html2tex can be told what to do with html tags
it doesn't recognize, such as the <div> tag. This will take care of some of
#2, and we can probably modify the html2tex source to fix the rest. That may
prove to be more trouble than it's worth for some errors, and it may be fine
just to live with those errors as long as they don't cause problems in the
LaTeX output.
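
As a rough way to see how much of #2 there is, the input files could be surveyed for tags before running the converter. The sketch below is only an illustration: the KNOWN_TAGS set is a guess at what a converter might handle, not html2tex's actual list, and the latin-1 encoding is likewise an assumption about the files.

```python
from collections import Counter
from html.parser import HTMLParser
import sys

# Hypothetical list of tags assumed to be handled already; this is NOT
# taken from html2tex and would need to be adjusted to the real tool.
KNOWN_TAGS = {
    "html", "head", "title", "body", "h1", "h2", "h3", "p", "a", "b", "i",
    "ul", "ol", "li", "table", "tr", "td", "th", "img", "pre",
}


class TagCounter(HTMLParser):
    """Counts every opening tag seen (tag names arrive lowercased)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def unknown_tags(text, known=KNOWN_TAGS):
    """Return {tag: count} for opening tags in text that are not in known."""
    parser = TagCounter()
    parser.feed(text)
    parser.close()
    return {tag: n for tag, n in parser.counts.items() if tag not in known}


if __name__ == "__main__":
    # Sum the counts over all files named on the command line.
    totals = Counter()
    for path in sys.argv[1:]:
        # Assumed encoding; change it if the files use something else.
        with open(path, encoding="latin-1") as handle:
            totals.update(unknown_tags(handle.read()))
    for tag, n in totals.most_common():
        print(f"{tag}\t{n}")
```

Run over the .wml files, this would print each unlisted tag with the number of times it occurs, which could help decide which tags are worth configuring and which can be ignored.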

I'm not sure what to do about #1. If the wml files are static at this point
we could modify them to fix the problems, do the conversion, and be done.
That is not a very general solution. If the wml files are subject to change,
whatever is wrong with them now will probably be repeated the next time they
are generated. Possibly, as you mentioned, this could be a job for sed.
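
To keep such fixes automatic and repeatable, the sed-style approach could be an ordered list of substitutions applied to each file before conversion. This is a minimal sketch, assuming the rules shown; both are made up for illustration, and the real ones would come from the errors the converter actually reports.

```python
import re

# Hypothetical fix-up rules, applied in order. Neither is known to be
# needed for the Bacula files; they only show the shape of the approach.
RULES = [
    # Drop <div ...> and </div> tags but keep the text between them.
    (re.compile(r"</?div[^>]*>", re.IGNORECASE), ""),
    # Rewrite XHTML-style <br /> as plain <br>.
    (re.compile(r"<br\s*/>", re.IGNORECASE), "<br>"),
]


def apply_rules(text, rules=RULES):
    """Apply each (pattern, replacement) pair to text, in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

Because the rules live in one list, improving the conversion means editing the list and re-running it on the original files, with no hand editing of the output.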

I can continue working on this if you like, but I don't want to be
duplicating your efforts. Please let me know.

Karl