From: Dan C. <dan...@so...> - 2002-01-23 08:28:35
Hi all,

This is my first post and I've only been using Ht://Dig for 2 days, so please go easy on me. :)

First, a little about my system - I'm running Ht://Dig 3.1.5 on a Red Hat 7.2 Linux PC. The standard sort command that is installed is 2.0.14.

I was playing around with 3 databases which all worked fine individually. However, I wished to merge them into a single database so that I could search across all 3 at once. Looking through the archive I found a post from Gilles explaining how to do this with htmerge and the -m switch. I tried this and all seemed to work beautifully; I was able to search across all databases. However I noticed that some of the documents that SHOULD have shown up in a particular search did not. They showed up in their individual databases, and running the htmerge program with full verbosity showed they were being merged into the conglomerate database.

Looking through the db.wordlist file showed that the word I was searching for, 'support' was not correctly sorted, as there were also some instances of 'supported' which were mixed in amongst them. I couldn't work out how this could happen until I tried playing with the sort command for a while. As one of the posts from the archive says, sort is assumed to sort across the whole line, but this simply does not produce the intended effect, at least on my (very common) system.

Here's a part of the db.wordlist I get from htmerge as it stands:

supply i:29 l:796 w:204
support i:20 l:111 w:2585 c:4
support i:69 l:35 w:201890 c:4
support i:70 l:11 w:4650 c:6
supported i:57 l:710 w:290
supported i:38 l:797 w:203
support i:59 l:240 w:760
support i:18 l:797 w:203
support i:29 l:799 w:201
support i:73 l:869 w:131
sure i:20 l:656 w:344
surname i:30 l:115 w:1607 c:2
surname i:31 l:115 w:1607 c:2

Here's what you get when you use the straight /bin/sort on it:

supply i:29 l:796 w:204
supported i:38 l:797 w:203
supported i:57 l:710 w:290
support i:18 l:797 w:203
support i:20 l:111 w:2585 c:4
support i:29 l:799 w:201
support i:59 l:240 w:760
support i:69 l:35 w:201890 c:4
support i:70 l:11 w:4650 c:6
support i:73 l:869 w:131
sure i:20 l:656 w:344
surname i:30 l:115 w:1607 c:2
surname i:31 l:115 w:1607 c:2

And here's what's actually needed to get correct results from htsearch:

supply i:29 l:796 w:204
support i:18 l:797 w:203
support i:20 l:111 w:2585 c:4
support i:29 l:799 w:201
support i:59 l:240 w:760
support i:69 l:35 w:201890 c:4
support i:70 l:11 w:4650 c:6
support i:73 l:869 w:131
supported i:38 l:797 w:203
supported i:57 l:710 w:290
sure i:20 l:656 w:344
surname i:30 l:115 w:1607 c:2
surname i:31 l:115 w:1607 c:2

The reason 'supported' appears above 'support' is that the whole line is being used as the key, and the 'e' in 'supported' comes before the 'i' in the next field.

Below is a patch for htmerge/words.cc that appends a '--key=1,1' parameter to the sort command in htmerge. This seems to fix the problem. Gilles mentions elsewhere that the intended behaviour is to sort by the first field, then the second, and so on, so you may wish to include those parameters also. I have no idea whether this would work on any system apart from my own (or even whether the problem exists with other versions of sort). Obviously it would be better if this parameter were set in the Makefile or similar and configured as necessary by make, but I don't know enough about it.
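
As an aside, the two orderings shown in the listings can be reproduced with a small C++ sketch. It is an illustration only, not code from htmerge. The function names are hypothetical, and the first comparator rests on an assumption the post does not confirm: that the installed sort effectively ignores the blanks between fields when it compares whole lines. The second comparator models '--key=1,1' by comparing the first field and falling back to the whole line only when the words are equal.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// First blank-delimited field of a wordlist line, i.e. the word itself.
// If the line contains no blank, the whole line is returned.
std::string first_field(const std::string& line) {
    return line.substr(0, line.find_first_of(" \t"));
}

// Copy of the line with spaces and tabs removed. This is a simplified
// stand-in (an assumption) for a comparison that ignores field separators.
std::string without_blanks(const std::string& line) {
    std::string out;
    for (char c : line) {
        if (c != ' ' && c != '\t') out += c;
    }
    return out;
}

// Whole-line sort with blanks ignored: "supported i:38" lands before
// "support i:18" because 'e' compares lower than 'i'.
std::vector<std::string> sort_whole_line_ignoring_blanks(std::vector<std::string> lines) {
    std::sort(lines.begin(), lines.end(),
              [](const std::string& a, const std::string& b) {
                  return without_blanks(a) < without_blanks(b);
              });
    return lines;
}

// Sort keyed on the first field only; lines with the same word are
// ordered by a plain byte-wise comparison of the whole line.
std::vector<std::string> sort_by_first_field(std::vector<std::string> lines) {
    std::sort(lines.begin(), lines.end(),
              [](const std::string& a, const std::string& b) {
                  const std::string ka = first_field(a);
                  const std::string kb = first_field(b);
                  if (ka != kb) return ka < kb;
                  return a < b;
              });
    return lines;
}
```

Under those assumptions, feeding the sample lines to sort_whole_line_ignoring_blanks gives the 'supported'-before-'support' order, while sort_by_first_field gives the order htsearch needs. It is still worth trying the real sort command from the command line, since its behaviour may differ between versions and environments.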

59a60,72
>
>
> // START PATCH
> // patch added by Dan Cutting (da...@wo...) 23/01/2002 to make htmerge
> // use first field of wordlist to sort instead of entire line which could lead
> // to incorrectly sorted database. this in turn leads to missing results
> // from searches. NB: this has not been tested on any systems (apart from my own
> // Linux box) and is designed purely for sort version 2.0.14. Other versions
> // will probably also work, but no guarantees! Try it from the command line first.
> command << " --key=1,1";
> /// END PATCH
>
>

Regards,
Dan Cutting
dan...@so...