You say that pdf2html.pl works from the command line, but does doc2html.pl work from the command line for PDF files?
 
"noindex" is not relevant in the case of PDF files, but the following might be:
 
    The PDF document contained no indexable text
    The PDF document was too large - see the max_doc_size: statement
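
If the second message is the one you are getting, a quick check is to compare the size of each PDF against the limit. The following is only a rough sketch: the function name oversized_pdfs is made up for illustration, and the limit you pass in is an assumption that should be replaced by the max_doc_size value from your own htdig.conf.

```python
from pathlib import Path


def oversized_pdfs(root, max_doc_size):
    """Return (path, size) pairs for PDF files under root that exceed max_doc_size bytes.

    max_doc_size is supplied by the caller; this sketch does not read htdig.conf.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Match the extension case-insensitively so .PDF files are included.
        if path.is_file() and path.suffix.lower() == ".pdf":
            size = path.stat().st_size
            if size > max_doc_size:
                hits.append((path, size))
    return hits
```

Calling oversized_pdfs("/path/to/docs", limit) with your own directory and limit lists the files that htdig would be likely to truncate before the converter sees them.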
 
Do also consult the FAQ at <http://www.htdig.org/FAQ.html>.
 
David Adams
University of Southampton
 
----- Original Message -----
From: Dominique Fourtune
To: htdig-general@lists.sourceforge.net
Sent: Thursday, December 18, 2003 5:19 PM
Subject: [htdig] "deleted no excerpts " with pdf files

Hello everybody, I need help.

I'm using htdig 3.1.6 to index HTML pages generated by Apache mod_autoindex.

I can't merge PDF files: I always get the error message "Deleted, no excerpt".

I'm using doc2html.pl; it works for .doc files, but not for PDF files.

When run from the command line, pdf2html.pl parses PDF files and creates HTML files.

I found this old post:

According to Paul COURBIS:
> When I run htmerge, I get a lot of messages:
> Deleted, no excerpt: xxx/http...
>
> What does it mean? Why does htmerge remove so many documents from the
> database? As far as I understand English, it seems to mean that
> there are no keywords for these pages, despite the fact that when I connect
> to them there's a lot of text...

The most common causes of this are:
 - a noindex directive somewhere in the document
 - the document was disallowed by robots.txt
 - the server_max_docs limit was reached before this document could be parsed

You'd need to correlate the htmerge -v output back to the htdig -v (or -vv)
output to see which of these conditions occurred.
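
The correlation described above could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested tool: it assumes each relevant htmerge line contains the text "Deleted, no excerpt:" followed somewhere by an http or https URL, and that the same URL appears literally in the htdig output. The function names are invented for illustration.

```python
import re

# Assumed line shape: "Deleted, no excerpt: <something>/<URL>"
DELETED = re.compile(r"Deleted, no excerpt:\s*(.*)")
URL = re.compile(r"https?://\S+")


def deleted_urls(htmerge_lines):
    """Collect the URLs that htmerge reported as deleted for lack of an excerpt."""
    urls = []
    for line in htmerge_lines:
        match = DELETED.search(line)
        if not match:
            continue
        url = URL.search(match.group(1))
        if url:
            urls.append(url.group(0))
    return urls


def correlate(htmerge_lines, htdig_lines):
    """Map each deleted URL to the htdig output lines that mention it.

    Matching is a plain substring test, so a URL that is a prefix of
    another URL will also pick up the longer URL's lines.
    """
    htdig_lines = list(htdig_lines)
    return {
        url: [line.rstrip("\n") for line in htdig_lines if url in line]
        for url in deleted_urls(htmerge_lines)
    }
```

To use it, save the output of htmerge -v and htdig -vv to two files, read each with readlines(), and pass the two lists to correlate().
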

I think the first cause is the likely one (I have no robots.txt), but I need help to go further: what is a noindex directive?
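
For context on that question: in an HTML page, a noindex directive most commonly takes the form of a robots meta tag in the page head, such as <meta name="robots" content="noindex">. It concerns HTML pages rather than PDF files. The sketch below uses only Python's standard html.parser module to report whether a page carries such a tag. The class and function names are made up for illustration, and it checks only the standard robots meta tag, not any markers specific to ht://Dig.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Record whether a robots meta tag with a noindex token was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        tokens = [t.strip().lower() for t in attributes.get("content", "").split(",")]
        if "noindex" in tokens:
            self.noindex = True


def has_noindex(html_text):
    """Return True if html_text contains a robots meta tag requesting noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    finder.close()
    return finder.noindex
```

Running has_noindex() on the text of one of the directory listing pages shows whether that page asks indexers to skip it.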

Thanks a lot
-- 
Dominique FOURTUNE - ADEME Département MDE
05 55 10 27 49 - fourtune@ademe.fr
Computers run very well without Microsoft, and for less money: switch to Linux!