From: Gilles D. <gr...@sc...> - 2001-12-11 19:51:48
According to Franck Collineau:
> I have launched rundig -v and i have the messages below.
> When i do a search with a key word that is in the pages, it doesn't find
> anything !!
>
> Is my indexation good ?

Well, it doesn't sound like it indexed correctly, but it's hard to say for
sure. The fact that htmerge didn't remove all PDF files from the database
suggests that htdig did get something from these files, but I guess the
question is what and how much.

To see the actual words that htdig grabs from each document and keeps in
the index, you'd need to run with -vvvv, which would generate a lot of
output, but then you could look through the output to see if it's finding
all the words it should.

Another thing you can do is look through your db.wordlist file to see what
words are in there. If a search for one of the words in there still fails
to find a match, it would suggest that either htmerge isn't correctly
building the db.words.db database from db.wordlist, or htsearch isn't
correctly searching this database. On the other hand, if htsearch results
are consistent with what you see in db.wordlist, but this file doesn't
contain all the words it should be getting from the PDFs, then the problem
is with htdig and the external parser or external converter script you're
using, or perhaps with the PDF files themselves.

In an earlier message, you had asked about doc2html.pl, so I assume that's
what you're using. Are you sure you've set the external_parsers attribute
correctly? Do you get the correct output when you run doc2html.pl manually
on some of these PDF files?

> New server: r-lx-collineau.rd.francetelecom.fr, 80
> 0:0:0:http://r-lx-collineau.rd.francetelecom.fr/web/essai: redirect
> 1:1:0:http://r-lx-collineau.rd.francetelecom.fr/web/essai/: ++++++++++++++++
> size = 898
> 2:2:1:http://r-lx-collineau.rd.francetelecom.fr/web/essai/page06.pdf: size =
> 84559
...
> 17:17:1:http://r-lx-collineau.rd.francetelecom.fr/web/essai/page33.pdf: size
> = 145221
> htmerge: Sorting...
> htmerge: Removing doc #0
> htmerge: Merging...
>
> Deleted, no excerpt: 0/http://r-lx-collineau.rd.francetelecom.fr/web/essai
> htmerge: 10

The only document htmerge deleted from the database is the directory name
above, which caused a redirect. This is to be expected. The fact that none
of the PDFs were deleted suggests that there is an excerpt that htdig got
from these files, so they do contain text, and the parser is finding some
of this text.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
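
For the db.wordlist check suggested above, a rough sketch in Python might save some scrolling. It is not part of ht://Dig. It assumes that db.wordlist is a plain-text file in which each line starts with the indexed word followed by whitespace and further fields, so please verify that against your own file first. The function name words_in_wordlist and the path in the usage note are made up for illustration.

```python
from typing import Dict, Iterable


def words_in_wordlist(wordlist_path: str, wanted: Iterable[str]) -> Dict[str, int]:
    """Count the wordlist lines whose first field matches each wanted word.

    Assumption: every non-empty line begins with the indexed word, followed
    by whitespace and other fields that this sketch ignores.
    """
    # Compare in lower case so that "Spinal" and "spinal" are treated alike.
    counts = {word.lower(): 0 for word in wanted}
    # latin-1 maps every byte to a character, so odd bytes cannot abort the scan.
    with open(wordlist_path, encoding="latin-1") as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            first = fields[0].lower()
            if first in counts:
                counts[first] += 1
    return counts
```

Calling something like words_in_wordlist("db.wordlist", ["keyword"]) with a word you know appears in one of the PDFs gives a count for that word. A count of zero would point at htdig or the external converter. A non-zero count combined with a failed search would point at htmerge or htsearch instead.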