application/pdf->text/html /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
application/vnd.wap.xhtml+xml->text/html /opt/vin/mht2html.pl

vnd.wap.xhtml+xml was the MIME type for my mht documents.
Ainhoa,

Can I ask you to check whether _new_ PDFs are getting indexed correctly?

I notice that the syntax used in the very first, commented, line of the external_parsers section looks different to the rest:
Note the 'arrow' and mime-type bit after application/pdf. All of the external_parsers declarations in my config have this same bit, which makes me suspect that none of your declarations will be working just now, though if you have not rebuilt your databases from scratch this may not be obvious. You probably want to be using at least -vv (two letter v's) to get verbose output from the dig process - this should tell you what is happening during the indexing. My other thought is to check whether the mht files are being served to you with that mime-type - this won't work correctly if not, and you may need more than one external_parsers declaration to cover all possibilities.
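
To make the 'arrow' form concrete, here is a rough sketch of how the declarations might look with the "->text/html" bit added. It reuses the script paths quoted elsewhere in this thread and assumes two things that are not confirmed here: that each script writes plain HTML to standard output, and that the web server really does send application/mht for the .mht files. If the server sends a different type, that type would need to go on the left of the arrow instead.

```
# Sketch only, not a tested configuration.
# Left of the arrow: the MIME type the server reports for the document.
# Right of the arrow: the type the script is assumed to produce (HTML here).
# application/mht is an assumption; check what the server actually sends.
external_parsers: application/pdf->text/html /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
                  application/mht->text/html /opt/vin/mht2html2.pl
```

After a change like this, rebuilding the databases from scratch with -vv should show in the dig output whether the script is being called for each document.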
From: Ainhoa L [mailto:firstname.lastname@example.org]
Sent: Wednesday, February 06, 2008 5:29 PM
To: Brockington,MJ,Michael,JPGA4X R
Subject: Re: [htdig] Htdig and MHT files

Hi Mike,

You are talking about the version with the mht parser, right?

I write here an extract of where I mention mht things and I attach the whole file and the parser (originally the parser would create files for the files appearing in the mht. I modified it so it will only output the code in the htm file). Maybe this parser I modified is sending some other garbage that can't be read by the indexer?

bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
 .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css
valid_extensions: .html .htm .shtml .php .uhtml .phtml .txt .pdf .mht
external_parsers: application/postscript /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
 application/pdf /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
 application/mht /opt/vin/mht2html2.pl

Thanks a lot for your help!

Regards,
Ainhoa
On Feb 5, 2008 9:58 PM, <email@example.com> wrote:
Can you show us at least an extract of your config file - as you describe it this should work.
From: firstname.lastname@example.org on behalf of Ainhoa L
Sent: Tue 2/5/2008 4:09 PM
Subject: [htdig] Htdig and MHT files
Hi! Maybe this is a very stupid question but, is it possible to index mht
files with htdig?
I have tried with the mht in the valid_extensions list, etc. Obviously htdig
doesn't take them as html and refuses to index them. I looked for a parser
and found a mht2html parser, modified it so it just sends through output the
html. I added it to the parsers in the htdig config file. This didn't work,
although the parser returns valid html...
I would like to know if there is any way to index mht files with htdig?
Thanks a lot for your help.