From: <ch...@in...> - 2004-08-26 12:54:38
Hi,

On an HP-UX D270 with htdig 3.1.6, I have no problem indexing our intranet site.
Now I would like to index PDF files with xpdf.
I changed the parameters in htdig.conf and in pdf2html.pl.
When I index the site (or a small part of it), I get this message:

DB2 problem ....PANIC : Invalid argument
/opt/www/htdig/bin/rundig[36] : 23928 Memory fault (coredump)

and later some

DB2 problem... : missing or empty key value specified.

When I use -vvv, I see that some PDF files are being retrieved:

-----
pick: interligne.xxx.fr, # servers = 1
4:4:1:http://interligne.xxx.fr/5/5.1/OrientationsComInterne.pdf: Retrieval command for http://interligne.xxx.fr/5/5.1/OrientationsComInterne.pdf:
GET /5/5.1/OrientationsComInterne.pdf HTTP/1.0
User-Agent: htdig/3.1.6 (ch...@xx...)
Referer: http://interligne.xxx.fr/5/5.1/
Host: interligne.xxx.fr

Header line: HTTP/1.1 200 OK
Header line: Date: Thu, 26 Aug 2004 07:51:43 GMT
Header line: Server: Apache/1.3.19 (Unix) PHP/4.0.5
Header line: Last-Modified: Tue, 15 Jun 2004 06:31:01 GMT
Converted Tue, 15 Jun 2004 06:31:01 GMT to Tue, 15 Jun 2004 06:31:01
Header line: ETag: "4741-13bdf-40ce97a5"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 80863
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read 7135 from document
Read a total of 80863 bytes
size = 80863
pick: interligne.xxx.fr, # servers = 1

-----------

At some point the output changes to the following, and no document is retrieved any more:

Deleted, no excerpt: 0/http://interligne.xxx.fr/5/5.1
1/http://interligne.xxx.fr/5/5.1/
Deleted, no excerpt: 5/http://interligne.xxx.fr/5/5.1/Colloque.cfm
Deleted, no excerpt: 7/http://interligne.xxx.fr/5/5.1/DG916.PDF
........

Does anyone have an idea what the problem could be?
Thanks!
From: Neal R. <ne...@ri...> - 2004-08-31 18:31:15
For the PDF error, please make sure that text can be extracted from the PDF document in question. Download the file and run pdf2html.pl on it. If that works, make sure that max_doc_size is bigger than the PDF document (it looks like it is).

Not all PDFs have text in them. Some PDFs are images of text: you can read them on the screen, but there are no characters or words in the PDF to extract and index.

As for the second error, try putting these in your htdig.conf file:

    wordlist_compress: false
    wordlist_compress_zlib: false

We have a mystery bug in the BDB code.

Thanks

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
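
The two PDF checks suggested in the reply can be roughed out in Python. This is only a sketch under assumptions: it supposes a local copy of the PDF, that xpdf's pdftotext is on the PATH, and that the -vvv output keeps the "Content-Length:" and "Read a total of ... bytes" lines shown in the original message. The function names and the five-word threshold are made up for illustration and are not part of ht://Dig. It compares the bytes read against Content-Length as a stand-in for the max_doc_size check, and calls pdftotext directly rather than pdf2html.pl.

```python
import re
import subprocess


def fetched_completely(log_text: str) -> bool:
    """True if the byte count htdig read equals the Content-Length header.

    A smaller "Read a total of" value would suggest the document was cut
    off, for example by a max_doc_size that is too low.
    """
    declared = re.search(r"Content-Length:\s*(\d+)", log_text)
    read = re.search(r"Read a total of (\d+) bytes", log_text)
    if declared is None or read is None:
        return False  # one of the two counts is missing from the log
    return int(declared.group(1)) == int(read.group(1))


def has_words(text: str, min_words: int = 5) -> bool:
    """True if text holds at least min_words runs of two or more letters."""
    return len(re.findall(r"[^\W\d_]{2,}", text)) >= min_words


def pdf_has_text(pdf_path: str) -> bool:
    """Run pdftotext on a local PDF and check for extractable words.

    Passing "-" as the output file makes pdftotext write to stdout.
    An image-only (scanned) PDF typically yields little or no text.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        errors="replace",
        check=True,
    )
    return has_words(result.stdout)
```

For the log in the original message, fetched_completely() would see 80863 on both lines, which matches the remark that max_doc_size looks large enough. pdf_has_text("OrientationsComInterne.pdf") would then show whether that particular file contains any indexable text at all.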