From: meista k. <b_...@we...> - 2003-10-23 10:03:15
|
Hi,

I'm having some problems indexing some PowerPoint and Excel files.

For example, I have the PowerPoint file file1.ppt; when running rundig -v it says "not HTML".

When I save the same file as file1.pps, it shows me the correct size of the file, but there are no -, + or *. And when I search via htsearch for some special words that are in the PowerPoint file, there is no match.

19:19:2:http://my.dom.ain/.../file1.pps: size = 437248
20:20:2:http://my.dom.ain/.../file1.ppt: not HTML

I have doc2html.pl installed and it works with every Word file (.doc), and as parsers I have xlhtml and ppthtml.

Even when I run xlhtml or ppthtml from the console, each shows me a correct HTML page.

What is wrong with my configuration? Please help me.
From: Gilles D. <gr...@sc...> - 2003-10-23 19:23:16
|
According to meista knut:
> I'm having some problems indexing some powerpoint and excel-files.
>
> For example I have the powerpoint file file1.ppt when running rundig
> -v it says not HTML.
>
> When I save the same file as file1.pps it shows me the correct size of
> the file but there are no -, + or *. And when I search for some
> special words, which are in the powerpoint file, via htsearch, there
> is no match.
>
> 19:19:2:http://my.dom.ain/.../file1.pps: size = 437248
> 20:20:2:http://my.dom.ain/.../file1.ppt: not HTML
>
> I have doc2html.pl installed and it runs with every word file (.doc)
> and as parser I have xlhtml and ppthtml.
>
> Even when I run xlhtml respectively ppthtml via console; it shows me a
> correct HTML-page.
>
> What is wrong with my configuration please help me.

Well, all I can say to that is: what is your configuration? Without seeing
your whole external_parsers attribute setting, about all I can do is guess.
The most likely problem is that whatever your web server is returning as
Content-Type for *.ppt files doesn't match what you've entered in your
external_parsers definition.

As for there being no -, + or * listed, that's normal, as these indicate
what htdig is doing with hypertext links it finds in a document. It
normally won't get any of these from any external converter or parser, but
typically only from HTML files. See http://www.htdig.org/FAQ.html#q5.26

To really get a handle on what htdig is doing, you'll need more verbose
output. To get headers listed in the output, which you'll need to look at
to figure this out, you should use at least -vvv (see
http://www.htdig.org/FAQ.html#q4.1) to get server responses. This will
generate a lot of output, but you can bring that under control by
temporarily setting start_url to just the URL(s) of one or a few .ppt files.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
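
The Content-Type check suggested in this reply can also be done outside htdig. The following is a minimal Python sketch, not part of ht://Dig: it asks the web server for the headers of one URL and compares the reported Content-Type with the types you have listed in external_parsers. The function names and the list of configured types are hypothetical; the MIME types shown are only a guess at what such a configuration might contain, and the comparison (case-insensitive, parameters such as charset ignored) only approximates how htdig matches types.

```python
import sys
import urllib.request


def media_type(content_type_header):
    """Return the bare, lower-cased media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_handled(content_type_header, configured_types):
    """True if the server's Content-Type matches one of the configured types."""
    wanted = {t.strip().lower() for t in configured_types}
    return media_type(content_type_header) in wanted


def fetch_content_type(url):
    """Send a HEAD request and return the Content-Type header ('' if absent)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Type", "")


# Assumption: these stand in for the content types named in your own
# external_parsers setting; replace them with what your config really lists.
CONFIGURED_TYPES = [
    "application/msword",
    "application/vnd.ms-excel",
    "application/vnd.ms-powerpoint",
]

if __name__ == "__main__" and len(sys.argv) > 1:
    # The URL of one .ppt file is passed on the command line.
    header = fetch_content_type(sys.argv[1])
    print("server says:", header or "(no Content-Type header)")
    print("matches config:", is_handled(header, CONFIGURED_TYPES))
```

If the script reports a type such as application/octet-stream for a .ppt URL, the mismatch is on the web server side (its MIME type mapping), or the external_parsers entry needs to name the type the server actually sends.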