From: Dan L. <da...@la...> - 2002-11-20 16:21:54
On 13 Nov 2002 at 21:48, Gilles Detillieux wrote:

> According to Dan Langille:
> > I have indexed a mailing list archive. My next goal is to nightly
> > update that index by indexing the entire month's archive and then
> > merging that into the main database. At present, there are about 4
> > years of data. I'm seeking comments on my approach.
>
> You may also want to have a look at how we do it for the htdig-general
> and htdig-dev archives:
>
> http://www.htdig.org/files/contrib/scripts/README.geoupdate-ungeoify
> http://www.htdig.org/files/contrib/scripts/geoupdate.sh

My results are interesting, and confusing. The problem is incomplete
results. I will explain as I go along.

My base working directory is /usr/local/htdig. The "production"
databases are:

[dan@undef:/usr/local/htdig] $ ls -l databases/
total 214713
drwxr-xr-x  2 dan  dan       512 Nov 20 09:43 adsl-update
-rw-r--r--  1 dan  dan  70942720 Nov 10 11:17 db.docdb
-rw-r--r--  1 dan  dan   1940480 Nov 10 11:17 db.docs.index
-rw-r--r--  1 dan  dan  80104230 Nov 10 11:17 db.wordlist
-rw-r--r--  1 dan  dan  66720768 Nov 10 11:17 db.words.db

I believe that the initial dig and merge are operating correctly. The
following files are created:

[dan@undef:/usr/local/htdig] $ ls -l databases/adsl-update/
total 817
-rw-r--r--  1 dan  dan  221184 Nov 20 11:00 db.docdb
-rw-r--r--  1 dan  dan    9216 Nov 20 11:00 db.docs.index
-rw-r--r--  1 dan  dan  234216 Nov 20 11:00 db.wordlist
-rw-r--r--  1 dan  dan  344064 Nov 20 11:00 db.words.db

It is the next merge which causes the problem. The command, issued
from /usr/local/htdig, is:

/usr/local/bin/htmerge -a -c htdig-unixathome.org-adsl.conf -m adsl-update.conf

This results in:

htmerge: Unable to open word list file '/usr/local/htdig/databases/db.wordlist.work'

That, I suspect, is the result of the -a option on htmerge. The
documentation says:

    Use alternate work files. Tells htdig to append .work to database
    files, causing a second copy of the database to be built. This
    allows the original files to be used by htsearch during the
    indexing run.

Why is it expecting a .work file on input? I note that the
geoupdate.sh script leaves behind only db.docdb.work and
db.wordlist.work.

To supply the file, I do this in the databases subdirectory:

cp db.wordlist db.wordlist.work

Running htmerge then produces incomplete results. Specifically,
db.docdb.work and db.docs.index.work are way too small when compared
to the original files:

[dan@undef:/usr/local/htdig] $ ls -l databases/
total 358226
drwxr-xr-x  3 dan  dan       512 Nov 20 11:02 adsl-update
-rw-r--r--  1 dan  dan  70942720 Nov 10 11:17 db.docdb
-rw-r--r--  1 dan  dan    209920 Nov 20 11:09 db.docdb.work
-rw-r--r--  1 dan  dan   1940480 Nov 10 11:17 db.docs.index
-rw-r--r--  1 dan  dan      9216 Nov 20 11:09 db.docs.index.work
-rw-r--r--  1 dan  dan  80104230 Nov 10 11:17 db.wordlist
-rw-r--r--  1 dan  dan  79987264 Nov 20 11:09 db.wordlist.work
-rw-r--r--  1 dan  dan  66720768 Nov 10 11:17 db.words.db
-rw-r--r--  1 dan  dan  66639872 Nov 20 11:09 db.words.db.work

I also tried this approach:

cd /usr/local/htdig/databases
cp db.docdb db.docdb.work
cp db.docs.index db.docs.index.work
cp db.wordlist db.wordlist.work
cp db.words.db db.words.db.work

Then I reran the merge and obtained a better result:

[dan@undef:/usr/local/htdig/databases] $ ls -l
total 429961
drwxr-xr-x  3 dan  dan       512 Nov 20 11:02 adsl-update
-rw-r--r--  1 dan  dan  70942720 Nov 10 11:17 db.docdb
-rw-r--r--  1 dan  dan  71108608 Nov 20 11:15 db.docdb.work
-rw-r--r--  1 dan  dan   1940480 Nov 10 11:17 db.docs.index
-rw-r--r--  1 dan  dan   1945600 Nov 20 11:15 db.docs.index.work
-rw-r--r--  1 dan  dan  80104230 Nov 10 11:17 db.wordlist
-rw-r--r--  1 dan  dan  80314085 Nov 20 11:15 db.wordlist.work
-rw-r--r--  1 dan  dan  66720768 Nov 10 11:17 db.words.db
-rw-r--r--  1 dan  dan  66883584 Nov 20 11:15 db.words.db.work

What am I not understanding?

Thank you.

-- 
Dan Langille : http://www.langille.org/
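
For reference, the second approach above (copy all four database files to .work, then merge) could be scripted roughly as follows. This is only a sketch: the function names seed_work_files and merge_update are made up for illustration, and the htmerge command line is simply the one quoted in this message. Whether seeding the .work files this way is the intended use of -a is exactly what is being asked, so treat it as a workaround under test rather than a confirmed fix.

```sh
#!/bin/sh
# Sketch only: seed_work_files and merge_update are illustrative names,
# not part of ht://Dig.

# Copy each production database file to its .work counterpart so that
# "htmerge -a" starts from a complete set of work files.
seed_work_files() {
    dbdir=$1
    for f in db.docdb db.docs.index db.wordlist db.words.db; do
        cp "$dbdir/$f" "$dbdir/$f.work" || return 1
    done
}

# Seed the work files, then run the merge from the base directory
# using the command line quoted in this message.
merge_update() {
    basedir=$1
    seed_work_files "$basedir/databases" || return 1
    ( cd "$basedir" &&
      /usr/local/bin/htmerge -a -c htdig-unixathome.org-adsl.conf -m adsl-update.conf )
}

# Usage (not executed here):
#   merge_update /usr/local/htdig
```

Copying everything first means the production files stay untouched while the merge runs, at the cost of temporarily doubling the disk space used (roughly 220 MB for the file sizes listed above).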