Hi Mike,
Yes, you were right: I was missing that part and I didn't even notice!
I changed the config file and wrote this:

application/pdf->text/html /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \

application/vnd.wap.xhtml+xml->text/html /opt/vin/mht2html.pl

vnd.wap.xhtml+xml was the MIME type for my mht documents.
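
One way to confirm which MIME type the server really sends for a document is to issue a HEAD request and read the Content-Type header. The following is only a sketch in Python, not part of htdig; the function names are made up for illustration, and the URL in the comment is the test URL from this thread.

```python
import urllib.request


def media_type(header_value):
    """Strip parameters such as '; charset=...' and normalise the case."""
    return header_value.split(";", 1)[0].strip().lower()


def served_content_type(url, timeout=10):
    """Request only the headers of `url` and return its media type."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # An absent header yields an empty string rather than a guess.
        return media_type(response.headers.get("Content-Type", ""))


# Example call (needs network access to the test server):
#   served_content_type("http://172.26.0.169/testdig/beepmacro.mht")
```

The value returned should be the same type that appears on the left of the arrow in the external_parsers declaration; if the server sends something else for some files, those files would presumably not reach the converter.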
So I ran the dig and everything seemed to go fine, ending with:


0/http://172.26.0.169/testdig/
1/http://172.26.0.169/testdig/About_comments_eex3.mht
2/http://172.26.0.169/testdig/aster.pdf
3/http://172.26.0.169/testdig/beepmacro.mht
4/http://172.26.0.169/testdig/index.txt
5/http://172.26.0.169/testdig/test.html
 
(I am doing this in a test folder)
 
But when I go to the search page, it doesn't find words that are inside the mht files. It works for the pdf, txt and html ones.
 
I suppose I am missing something here... do I need to set up anything else for the search engine?
 
Thanks a lot for all your help,
 
Ainhoa
 
On Feb 8, 2008 12:58 PM, <michael.brockington@bt.com> wrote:
Ainhoa,
Can I ask you to check whether _new_ PDFs are getting indexed correctly?
 
I notice that the syntax used in the very first, commented, line of the external_parsers section looks different to the rest:

application/pdf->text/html /usr/local/bin/conv_doc.pl

Note the 'arrow' and MIME type after application/pdf. All of the external_parsers declarations in my config have this, which makes me suspect that none of your declarations are working just now, though if you have not rebuilt your databases from scratch this may not be obvious. You probably want to use at least -vv (two letter v's) to get verbose output from the dig process; this should tell you what is happening during the indexing. My other thought is to check whether the mht files are being served to you with that MIME type. This won't work correctly if not, and you may need more than one external_parsers declaration to cover all possibilities.
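
The first check above (does every declaration carry the arrow?) could be sketched as follows. This is an illustrative Python helper, not an htdig tool, and its name is invented here. It assumes the external_parsers value is a flat list of "type program" pairs with unquoted program paths and no arguments, and that every backslash is a line continuation; both are simplifications.

```python
def parsers_without_target(value):
    """Return (mime_type, program) pairs whose type lacks a '->target' part."""
    # Assumption: backslashes only ever mark line continuations here.
    tokens = value.replace("\\", " ").split()
    # Assumption: tokens alternate between a type and an unquoted program path.
    pairs = zip(tokens[0::2], tokens[1::2])
    return [(mime, program) for mime, program in pairs if "->" not in mime]


example = (
    "application/pdf->text/html /usr/local/bin/conv_doc.pl \\\n"
    "application/mht /opt/vin/mht2html2.pl"
)
for mime, program in parsers_without_target(example):
    print("no target type declared:", mime, program)
```

An entry flagged this way is not necessarily wrong, but a script that prints HTML rather than htdig's own parser output would need the ->text/html part.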

Regards,
Mike



From: Ainhoa L [mailto:ainhoitxu@gmail.com]
Sent: Wednesday, February 06, 2008 5:29 PM
To: Brockington,MJ,Michael,JPGA4X R
Cc: htdig-general@lists.sourceforge.net
Subject: Re: [htdig] Htdig and MHT files

Hi Mike,
 
You are talking about the version with the mht parser, right?
Below is an extract of the parts that mention mht; I have attached the whole file and the parser. (Originally the parser created a separate file for each file embedded in the mht. I modified it so that it only outputs the code of the htm file.) Maybe the parser I modified is sending some other garbage that the indexer can't read?
 
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css
 
valid_extensions: .html .htm .shtml .php .uhtml .phtml .txt .pdf .mht
 
external_parsers: application/postscript /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
application/pdf /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
application/mht /opt/vin/mht2html2.pl
 
Thanks a lot for your help!
Regards,
 
Ainhoa


 
On Feb 5, 2008 9:58 PM, <michael.brockington@bt.com> wrote:
Can you show us at least an extract of your config file - as you describe it this should work.

Regards,
Mike


-----Original Message-----
From: htdig-general-bounces@lists.sourceforge.net on behalf of Ainhoa L
Sent: Tue 2/5/2008 4:09 PM
To: htdig-general@lists.sourceforge.net
Subject: [htdig] Htdig and MHT files

Hi! Maybe this is a very stupid question but, is it possible to index mht
files with htdig?
I have tried adding mht to the valid_extensions list, etc. Obviously htdig
doesn't treat them as html and refuses to index them. I looked for a parser
and found an mht2html parser, and modified it so that it just outputs the
html. I added it to the parsers in the htdig config file. This didn't work,
although the parser returns valid html...
Is there any way to index mht files with htdig?
Thanks a lot for your help.
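
For reference, the kind of conversion discussed in this thread can be sketched in a few lines. An mht file is a MIME multipart message whose HTML part is usually stored quoted-printable or base64 encoded, so a converter has to decode that part before the text is usable. The sketch below is in Python using only the standard library; it is not the mht2html.pl script from the thread, and the function name is invented for illustration.

```python
import email
from email import policy


def mht_to_html(raw_bytes):
    """Return the first text/html part of an MHT document, decoded to str."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and the charset.
            return part.get_content()
    return ""  # no HTML part found
```

If a converter passes the encoded part through untouched, sequences such as =3D and soft line breaks remain in the text, which is one possible reason for words going missing from an index.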