Not sure where '^\376\067\0\043' came from. I've pulled down the newest doc2html.pl version (http://cvs.sourceforge.net/viewcvs.py/htdig/htdig/contrib/doc2html/doc2html.pl) and reconfigured. Adding the "binmode FILE;" line in sub read_magic has corrected the problem.
 
Most of the documents parsed are clean; however, there are a number that contain characters like those displayed below. Any ideas on how to clean this up further?
 | | | | | „ | d Ð
 8 8 8 8 L ” а а а а ÐŒ k ÐŒ а а а а а 8
 Ñž Ñž Ñž Ñž Ñž Ñž Ñž Ñž Ñž Ñž Ñž Ñž Ñž Ñ‚ Ñž Ñž Ñ‚ Ñ‚ Ñ‚ Ñž Ñž Ñž Ñ‚ Ñž Ñ‚ Ñ‚ Ñž – Ñ• Ñ‚ Ñ‚
 
--- NOTE for developers ---
In doc2html.pl (all versions?), around line 253, in the if($XLS2HTML) statement, the line $cmdl = "$cmd -fw $input" passes the options "-fw". Those options no longer appear to be valid in xls2csv (Catdoc Version 0.93.3).
 
 
Thanks very much for your suggestion!

 
 

From: David Adams [mailto:D.J.Adams@soton.ac.uk]
Sent: Thursday, July 22, 2004 4:02 AM
To: Wendt, Trevor; htdig-general@lists.sourceforge.net
Subject: Re: [htdig] Issues when parsing msword docs

Trevor,
 
The magic number for Word documents (Word6 & later) is \320\317\021\340, the same as Excel spreadsheets.  That is why doc2html.pl has to check both magic number and MIME-type.
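
As a rough illustration of why both checks are needed, here is a minimal sketch (not the actual doc2html.pl code). The helper name pick_converter and the returned labels are hypothetical, and $mime stands for whatever MIME type is handed to the script:

```perl
use strict;
use warnings;

# Signature shared by Word (Word6 and later) and Excel files, as quoted above.
my $OLE2_MAGIC = '^\320\317\021\340';

# Hypothetical helper: the magic number alone cannot tell Word from Excel,
# so the MIME type is used to break the tie.
sub pick_converter {
    my ($magic, $mime) = @_;
    return 'unknown' unless $magic =~ /$OLE2_MAGIC/;
    return 'catdoc'  if $mime =~ m{^application/msword}i;
    return 'xls2csv' if $mime =~ m{^application/vnd\.ms-excel}i;
    return 'unknown';
}
```

For example, pick_converter("\320\317\021\340...", 'application/msword') would select the Word path, while the same bytes with 'application/vnd.ms-excel' would select the Excel path.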
 
I don't recognise '^\376\067\0\043'; could it be the magic number for some form of Word for Macintosh document?
 
Doc2html.pl should read the start of each file as binary to get the magic number, but on some systems it reads it as text. Add a line to sub read_magic so that it becomes:
 
  open(FILE, "< $Input") || die "Can't open file $Input\n";
  binmode FILE;
  read FILE,$Magic,256;
  close FILE;
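
For anyone who wants to try the binary read outside doc2html.pl, a self-contained sketch of the same idea follows. The sub name read_magic_bytes and its arguments are made up for illustration. It uses a lexical filehandle and three-argument open, which is a stylistic choice and not what the script itself does:

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper, not the code shipped in doc2html.pl.
# Returns up to $length raw bytes from the start of $path.
sub read_magic_bytes {
    my ($path, $length) = @_;
    $length = 256 unless defined $length;

    open(my $fh, '<', $path) or die "Can't open file $path: $!\n";
    binmode $fh;    # read raw bytes; no text-mode or encoding layer is applied

    my $magic = '';
    my $got = read($fh, $magic, $length);
    close $fh;

    die "Can't read file $path: $!\n" unless defined $got;
    return $magic;
}
```

Without binmode, an I/O layer can alter the bytes before they reach $Magic. Which layer was responsible on the system discussed here is not established in this thread.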
 
Let us know how you get on.
 
David Adams
Corporate Information Services
Information Systems Services
University of Southampton
----- Original Message -----
From: Wendt, Trevor
To: htdig-general@lists.sourceforge.net
Sent: Wednesday, July 21, 2004 10:38 PM
Subject: [htdig] Issues when parsing msword docs

I'm working on upgrading to the new 3.2.0b6 version of htdig and am running into issues when parsing msword docs (doc, xls).
 
When running doc2html.pl from the command line directly on a Word doc, I see that the $MAGIC numbers aren't being determined correctly for the doc file. Instead of pulling: $magic = '^\376\067\0\043'; (word6,7/97) it pulls: ÐÏࡱá>þÿ        .0þÿÿÿ-ÿÿÿÿÿÿÿ(continuous)
 
Similarly with xls: when running doc2html.pl from the command line, the $MAGIC number isn't $magic = '^\320\317\021\340'; as expected, but: ÐÏࡱá>þÿ        ÿÿÿþÿÿÿ(continuous)
 
It looks like doc2html.pl is receiving these as binary files instead of text. Running "file -i" on the documents confirms the MIME type to be application/msword for both.
 
Doing an octal dump of the first line of the file does reveal the correct $magic numbers:
$ od -b /opt/htdig/contrib/doc2html/todays-news.doc | head -1
0000000 320 317 021 340 241 261 032 341 000 000 000 000 000 000 000 000
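
The same check can be made from inside Perl, which helps when comparing what the script sees against the od output. This is only a sketch; octal_dump is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: format each byte of a string as three octal digits,
# in the same style as the data columns printed by "od -b".
sub octal_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%03o', $_) } unpack('C*', $bytes);
}
```

Calling octal_dump on the first four bytes read from the file above should give "320 317 021 340" if those bytes arrive unaltered.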
 
Running the doc file directly through catdoc returns pristine text output as does running the xls file directly through xls2csv.
 
Server is RH Linux Enterprise with perl 5.8.0
 
Any ideas on how to counter and/or correct this problem?
 
Thanks for any suggestions.