I see what you mean - that certainly doesn't look quite right!
Before we go any further, can I ask whether you have tried indexing this file as 'plain' HTML? I know that it doesn't look quite right, but it appears to me that the content should be okay for htdig's own HTML parser if things are set up correctly. Since we know that the config wasn't correct (at first) for the MIME type etc., it would be worth checking that over: there is no point doing an MHT -> HTML translation if it is HTML to begin with!
I am afraid that I don't have a working example, but  describes how to target a file at the internal parser.
PS None of my messages have been coming back to me via the list - have you been getting one copy or two?
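
On the question of whether the file is HTML to begin with, here is a rough way to check it outside htdig. This is only a sketch in Python, not anything taken from htdig or from the Perl script discussed in this thread. The function name looks_like_mht is made up for illustration, and the test is a heuristic: it assumes that bare HTML opens with a doctype or an <html> tag, and that a real MHT archive opens with mail-style headers declaring a multipart body.

```python
import email
from email import policy


def looks_like_mht(raw: bytes) -> bool:
    """Guess whether raw bytes are a MIME (MHT) archive rather than bare HTML.

    Hypothetical helper for illustration only; it is a heuristic.
    """
    head = raw.lstrip()[:512].lower()
    # Assumption: bare HTML normally opens with a doctype or an <html> tag.
    if head.startswith((b"<!doctype", b"<html")):
        return False
    # An MHT archive opens with mail-style headers and a multipart body.
    msg = email.message_from_bytes(raw, policy=policy.default)
    return msg.get_content_maintype() == "multipart"
```

If this returns False for the problem file, that would suggest pointing the file at the internal HTML parser rather than converting it first.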

From: [] On Behalf Of Ainhoa L
Sent: Monday, February 11, 2008 3:19 PM
To: Brockington,MJ,Michael,JPGA4X R
Subject: Re: [htdig] Htdig and MHT files

Yeah, you are right, I think it doesn't like the output at all. Instead of the words in the document, it is taking these as words:
word: Read@0
word: 8192@33
word: from@67
word: document@100

So I suppose htdig just doesn't like the output of the parser. I'm attaching the output of the parser (run manually) and the output of dig, just in case you have any more ideas :)

Thanks a lot!


On Feb 11, 2008 12:16 PM, <> wrote:
My first instinct would now be to check the parser output. Try adding another -v to your config (and possibly restricting your indexing to just this one file) and check the log output; it may be that htdig does not like the output from your Perl script.  explains what the output means. I seem to recall you saying that you had already tested that it ran on its own, but possibly there is something not right there, or there is a typo in the config that neither of us can see.
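
For something to compare the script's output against, here is a rough Python sketch of pulling the HTML part out of an MHT file with the standard email module. It is not the Perl script from this thread (which is not shown here), and the name extract_html is hypothetical. It assumes the MHT file is an ordinary MIME archive with at least one text/html part.

```python
import email
from email import policy


def extract_html(raw: bytes) -> str:
    """Return the first text/html part of an MHT archive as a decoded string.

    Hypothetical helper for illustration; assumes a standard MIME archive.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding (quoted-printable
            # or base64) and decodes the bytes using the declared charset.
            return part.get_content()
    raise ValueError("no text/html part found")
```

Running this over the same file and comparing the result with what the converter prints should show whether the converter is emitting the document's HTML or something else, such as diagnostic messages.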

From: Ainhoa L []
Sent: Monday, February 11, 2008 9:33 AM

To: Brockington,MJ,Michael,JPGA4X R
Subject: Re: [htdig] Htdig and MHT files

Hi Mike,
Yes, you were right, I was missing that part and I didn't even notice!
I changed the config file and wrote this:

application/pdf->text/html /usr/local/apache/htdocs/htdig-3.1.6/contrib/ \

application/vnd.wap.xhtml+xml->text/html /opt/vin/

vnd.wap.xhtml+xml was the MIME type for my mht documents.
So I ran dig and everything seemed to go fine, having at the end:

(I am doing this in a test folder)
But when I go to the search page, it won't find words inside the mht files. It works for the pdf, txt and html ones, but can't find the words that are in the mht ones.
I suppose I am missing something here... do I need to set up any other settings for the search engine?
Thanks a lot for all your help,