The file I sent you was the output of the mht2html converter. Indexing MHT files as HTML didn't work and still doesn't work. I just realized this wasn't a parser but a converter, so I changed the config file like this:
application/vnd.wap.xhtml+xml /opt/vin/mht2html.pl
which didn't work either and I got the same dig output as before.
I also tried with the mht2html.pl program modified so it would output the list of words, as explained here: http://www.htdig.org/attrs.html#external_parsers
Again that didn't work and I was left with the same output from dig.
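
For reference, here is a rough sketch of the kind of output a word-list parser is expected to produce, written in Python rather than Perl. It assumes the record layout described on the external_parsers page: tab-separated fields, a `w` record carrying the word, a location between 0 and 1000, and a heading level, and a `t` record carrying the title. The function names are made up for illustration, and the sketch assumes htdig passes the path of the file to parse as the first argument.

```python
import re
import sys


def word_records(text, heading_level=0):
    """Yield one 'w' record per word: w<TAB>word<TAB>location<TAB>heading."""
    total = max(len(text), 1)
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        # Location is the word's relative position, scaled to 0-1000.
        location = match.start() * 1000 // total
        yield f"w\t{match.group()}\t{location}\t{heading_level}"


def title_record(title):
    """Return a 't' record carrying the document title."""
    return f"t\t{title}"


def main(argv):
    # Assumption: argv[1] is the file htdig hands to the parser.
    with open(argv[1], encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    print(title_record(argv[1]))
    for record in word_records(text):
        print(record)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Running a script like this by hand on one file and looking at the first few lines is a quick way to see whether the records are tab-separated and start with a single-letter type.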
I also tried using the -internal version. No luck either.
I'm probably doing something very stupid or very wrong here, otherwise I don't understand....
Maybe it's the Perl program? It seems it's giving me some output that is not good or readable by htdig, but to me it looks like regular HTML, which htdig should be able to index. Sadly I don't know much Perl, so I just modify little parts here and there and then test that the output is what I think it should be.



PS I got one of your messages twice. It wasn't the last one, but I got it after that one. Weird!

On Feb 11, 2008 5:13 PM, <michael.brockington@bt.com> wrote:
I see what you mean - that certainly doesn't look quite right!
Before we go any further, can I ask if you have tried indexing this file as 'plain' HTML? I know that it doesn't look quite right, but it would appear to me that the content should be okay for htdig's own html parser, if things are set up correctly. Since we know that the config wasn't correct (at first) for the mime-type etc it would be worth checking that over - no point doing an MHT -> HTML translation if it is HTML to begin with!
I am afraid that I don't have a working example, but http://www.htdig.org/attrs.html#external_parsers describes how to target a file at the internal parser.
PS None of my messages have been coming back to me via the list - have you been getting one copy or two?

From: htdig-general-bounces@lists.sourceforge.net [mailto:htdig-general-bounces@lists.sourceforge.net] On Behalf Of Ainhoa L
Sent: Monday, February 11, 2008 3:19 PM

To: Brockington,MJ,Michael,JPGA4X R
Cc: htdig-general@lists.sourceforge.net
Subject: Re: [htdig] Htdig and MHT files

Yeah, you are right, I think it doesn't like the output at all. Instead of the actual words, it is taking these as words:
word: Read@0
word: 8192@33
word: from@67
word: document@100

So I suppose htdig just doesn't really like the output of the parser. I'm attaching the output of the parser (executed manually) and the output of dig, just in case you have any more ideas :)

Thanks a lot!


On Feb 11, 2008 12:16 PM, <michael.brockington@bt.com> wrote:
My first instinct would now be to check the parser output - try adding another v to your config (and possibly restricting your indexing to just this one file) and check the log output - it may be that htdig does not like the output from your Perl script. www.htdig.org explains what the output means. I seem to recall you saying that you had already tested that it ran on its own, but possibly there is something not right there, or a typo in the config that neither of us can see.
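
One rough way to check the converter output by hand is to look for leftovers of the MHT wrapper in it. The sketch below is only a heuristic under stated assumptions: that an unconverted MHT file shows MIME headers, a multipart/related content type, or quoted-printable escapes such as `=3D"`, and that proper converter output contains an `<html` tag. The function name `mht_residue` is hypothetical and not part of htdig.

```python
def mht_residue(output):
    """Return the reasons why text still looks like raw MHT, if any."""
    reasons = []
    lowered = output.lower()
    # MIME headers normally sit at the very top of an MHT file.
    if "mime-version:" in lowered[:2000]:
        reasons.append("MIME-Version header")
    if "multipart/related" in lowered:
        reasons.append("multipart/related content type")
    # Quoted-printable encodes '=' as '=3D', so attributes look like href=3D"...".
    if '=3d"' in lowered:
        reasons.append("quoted-printable escapes")
    if "<html" not in lowered:
        reasons.append("no <html> tag")
    return reasons
```

Save the converter's output for one MHT file to a text file, read it in, and pass the contents to `mht_residue`; an empty list means none of these warning signs were found, which does not by itself prove that htdig will accept the file.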

From: Ainhoa L [mailto:ainhoitxu@gmail.com]
Sent: Monday, February 11, 2008 9:33 AM

To: Brockington,MJ,Michael,JPGA4X R
Cc: htdig-general@lists.sourceforge.net
Subject: Re: [htdig] Htdig and MHT files

Hi Mike,
Yes, you were right, I was missing that part and I didn't even notice!
I changed the config file and wrote this:

application/pdf->text/html /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \

application/vnd.wap.xhtml+xml->text/html /opt/vin/mht2html.pl

vnd.wap.xhtml+xml was the MIME type for my mht documents.
So I ran dig and everything seemed to go fine, having at the end:

(I am doing this in a test folder)
But when I go to the search page, it won't find words inside the mht files. It works for the pdf, txt and html ones, but can't find the words that are in the mht ones.
I suppose I am missing something here... do I need to set up any other settings for the search engine?
Thanks a lot for all your help,