The file I sent you was the output of the mht2html converter. Indexing MHT
files as HTML didn't work and still doesn't. I just realized this
wasn't a parser but a converter, so I changed the config file like this:
which didn't work either, and I got the same dig output as before.
I also tried the mht2html.pl program, modified so that it would output the
list of words as explained here:
Again, that didn't work, and I was left with the same dig output.
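
For reference, this is roughly the kind of word list I was trying to produce, written as a small Python sketch rather than the modified Perl script itself. It assumes, from my reading of the external_parsers documentation, that each word goes on its own line as a tab-separated record: the letter w, the word, a relative location from 0 to 1000, and a heading level. The function name word_records and the fixed heading value 0 are my own choices for illustration, not anything htdig defines.

```python
import re
from typing import List


def word_records(text: str) -> List[str]:
    """Build one tab-separated 'w' record per word: w, word, location, heading."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    records = []
    for index, word in enumerate(words):
        # Assumption: location is a relative position between 0 and 1000.
        location = index * 1000 // len(words)
        # Assumption: heading level 0 means "not inside a heading".
        records.append(f"w\t{word.lower()}\t{location}\t0")
    return records


for line in word_records("Beep macro example"):
    print(line)
```

If the records that reach htdig don't look like this (for example, if HTML tags or debug text end up in the word field), that could explain the odd words in the dig output.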
I also tried using the -internal version. No luck either.
I'm probably doing something very stupid or very wrong here, otherwise I
Maybe it's the Perl program? It seems to be giving me output that htdig
can't read, but to me it looks like regular HTML, which
htdig should be able to index. Sadly I don't know much Perl,
so I just modify small parts here and there and then check that the output is
what I think it should be.
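
One way to cross-check the converter without relying on my Perl is to pull the HTML part out of an MHT file with a few lines of Python and compare it with what mht2html.pl prints. This is only a sketch: it assumes the .mht file is an ordinary MIME multipart message with a text/html part, and the function name extract_html_from_mht and the sample data are made up for the test.

```python
import email
from typing import Optional


def extract_html_from_mht(raw: bytes) -> Optional[str]:
    """Return the first text/html part of an MHT (MIME HTML) archive, decoded."""
    message = email.message_from_bytes(raw)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes quoted-printable or base64 transfer encoding.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None


# Tiny hand-made sample standing in for a real .mht file.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<html><body><p>caf=C3=A9 macro</p></body></html>\r\n"
    b"--XX--\r\n"
)

html = extract_html_from_mht(SAMPLE)
print(html)
```

If the HTML printed this way indexes correctly when saved as a plain .html file, the problem is more likely in the converter output or the config than in htdig's HTML parser.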
PS: I got one of your messages twice. It wasn't the last one, but it arrived
after that one. Weird!
On Feb 11, 2008 5:13 PM, <michael.brockington@...> wrote:
> I see what you mean - that certainly doesn't look quite right!
> Before we go any further, can I ask if you have tried indexing this file
> as 'plain' HTML? I know that it doesn't look quite right, but it would
> appear to me that the content should be okay for htdig's own html parser, if
> things are set up correctly. Since we know that the config wasn't correct
> (at first) for the mime-type etc it would be worth checking that over - no
> point doing an MHT -> HTML translation if it is HTML to begin with!
> I am afraid that I don't have a working example, but
> http://www.htdig.org/attrs.html#external_parsers describes how to target
> a file at the internal parser.
> PS None of my messages have been coming back to me via the list - have you
> been getting one copy or two?
> *From:* htdig-general-bounces@... [mailto:
> htdig-general-bounces@...] *On Behalf Of *Ainhoa L
> *Sent:* Monday, February 11, 2008 3:19 PM
> *To:* Brockington,MJ,Michael,JPGA4X R
> *Cc:* htdig-general@...
> *Subject:* Re: [htdig] Htdig and MHT files
> Yeah, you are right; I think it doesn't like the output at all. Instead of
> the actual words, it is taking these as words:
> word: Read@...
> word: 8192@...
> word: from@...
> word: document@...
> So I suppose htdig just doesn't really like the output of the parser. I'm
> attaching the output of the parser (executed manually) and the output of dig
> just in case you have any more ideas :)
> Thanks a lot!
> On Feb 11, 2008 12:16 PM, <michael.brockington@...> wrote:
> > Ainhoa,
> > My first instinct would now be to check the parser output - try adding
> > another v to your config, (and possibly restricting your indexing to just
> > this one file) and check the log output - it may be that htdig does not like
> > the output from your Perl script. http://www.htdig.org explains what the
> > output means. I seem to recall you saying that you had already tested that
> > it ran on its own, but possibly there is something not right there, or a
> > typo in the config that neither of us can see.
> > Regards,
> > Mike
> > ------------------------------
> > *From:* Ainhoa L [mailto:ainhoitxu@...]
> > *Sent:* Monday, February 11, 2008 9:33 AM
> > *To:* Brockington,MJ,Michael,JPGA4X R
> > *Cc:* htdig-general@...
> > *Subject:* Re: [htdig] Htdig and MHT files
> > Hi Mike,
> > Yes, you were right: I was missing that part and I didn't even notice!
> > I changed the config file and wrote this:
> > application/pdf->text/html /usr/local/apache/htdocs/htdig-3.1.6/contrib/parsepdf.pl \
> > application/vnd.wap.xhtml+xml->text/html /opt/vin/mht2html.pl
> > vnd.wap.xhtml+xml was the MIME type for my mht documents. So I ran dig
> > and everything seemed to go fine, ending with:
> > 0/http://172.26.0.169/testdig/
> > 1/http://172.26.0.169/testdig/About_comments_eex3.mht
> > 2/http://172.26.0.169/testdig/aster.pdf
> > 3/http://172.26.0.169/testdig/beepmacro.mht
> > 4/http://172.26.0.169/testdig/index.txt
> > 5/http://172.26.0.169/testdig/test.html
> > (I am doing this in a test folder)
> > But when I go to the search page, it won't find words inside the mht
> > files. It works for the pdf, txt and html ones, but can't find the words
> > that are in the mht ones.
> > I suppose I am missing something here... do I need to set up any other
> > settings for the search engine?
> > Thanks a lot for all your help,
> > Ainhoa