According to meista knut:
> I'm having some problems indexing some powerpoint and excel-files.
> For example I have the powerpoint file file1.ppt when running rundig
> -v it says not HTML.
> When I save the same file as file1.pps it shows me the correct size of
> the file but there are no -, + or *. And when I search for some
> special words, which are in the powerpoint file, via htsearch, there
> is no match.
> 19:19:2:http://my.dom.ain/.../file1.pps: size = 437248
> 20:20:2:http://my.dom.ain/.../file1.ppt: not HTML
> I have doc2html.pl installed and it runs with every word file (.doc)
> and as parser I have xlhtml and ppthtml.
> Even when I run xlhtml respectively ppthtml via console; it shows me a
> correct HTML-page.
> What is wrong with my configuration please help me.
Well, all I can say to that is what is your configuration? Without
seeing your whole external_parsers attribute setting, about all I can
do is guess. The most likely problem is that whatever your web server
is returning as Content-Type for *.ppt files doesn't match what you've
entered in your external_parsers definition.
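To illustrate the kind of mismatch I mean, here is a rough sketch of an
external_parsers setting. It assumes your server sends the standard MIME
types for Office files and that doc2html.pl is the wrapper that calls
xlhtml and ppthtml. The install path is hypothetical, so substitute your own:

```
# Sketch only: the path /usr/local/bin/doc2html.pl is an assumption.
# The type before "->" must match the Content-Type header the web
# server actually returns for that kind of file.
external_parsers: application/msword->text/html /usr/local/bin/doc2html.pl \
                  application/vnd.ms-powerpoint->text/html /usr/local/bin/doc2html.pl \
                  application/vnd.ms-excel->text/html /usr/local/bin/doc2html.pl
```

If your server labels *.ppt files with some other type
(application/octet-stream, say), then either the entry has to name that
type instead, or the server's MIME type mapping needs to be corrected.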
As for there being no -, + or * listed, that's normal, as these indicate
what htdig is doing with hypertext links it finds in a document. It
normally won't get any of these from any external converter or parser, but
typically only from HTML files. See http://www.htdig.org/FAQ.html#q5.26
To really get a handle on what htdig is doing, you'll need more
verbose output. To get headers listed in the output, which you'll
need to look at to figure this out, you should use at least -vvv
(see http://www.htdig.org/FAQ.html#q4.1) to get server responses.
This will generate a lot of output, but you can bring that under control
by temporarily setting start_url to just the URL(s) of one or a few
of the problem files.
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)