From: Gilles D. <gr...@sc...> - 2002-08-15 14:26:33
According to Paul Smith:
> I have a special situation where my corpus to be indexed contains strings
> like
>
> Please see number 72/111,222 for more ...
>
> I would like my users to be able to perform successful searches on terms
> like:
>
> 72/111,222
> 72/111222 or
> 72111222
>
> At first, the solution appears easy. Set
>
> allow_numbers: true
>
> in htdig.conf. Doing this, however, reveals a problem: htdig refuses to
> index the target string (72/111,222) as a single entity. That is, no matter
> what combination of conf directives I use (see next), htdig always indexes
> 72/111,222 into two terms: one is 72111 and another is 222. [I should note,
> I believe this is what is happening...I can successfully search on 72111 and
> I can successfully search on 222.] That is, htdig recognizes that I want to
> index the numbers in the corpus, but it insists that strings like 72/111,222
> are two separate numbers.
>
> I have tried these config directives:
>
> valid_punctuation: ,
> extra_word_characters: ,
>
> in all the permutations. Unfortunately, I can't get htdig to index
> 72/111,222 as a single entry: 72111222
>
> At the very worse, if my users can't perform all three types of searches
> (72/111,222 72/111222 72111222), I would accept if they would succeed on the
> last.
>
> I did try some limited locale: en_GB experiments to see if I could make the
> comma treated as a decimal, but still no positive result. htdig still
> insists on parsing 72/111,222 as two words.
>
> Your thoughts would be appreciated.

What you need is to set allow_numbers to true, and make sure that both
"/" and "," are in valid_punctuation, but neither is in
extra_word_characters. That way, 72/111,222 can be searched as
72/111,222 or 72111,222 or 72/111222 or 72111222.

Note, however, that because of htdig's handling of valid_punctuation,
the number 72/111,222 will not only go into the index as 72111222, but
it will also go in as 72111, 111222, 72 (if minimum_word_length is 2),
111 and 222.
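
To make that expansion concrete, here is a rough Python sketch. It is only a toy model that reproduces the list of terms given above; it is not htdig's actual code, and the function name index_terms and its parameters are made up for illustration. It assumes the token is split at each valid_punctuation character and that every contiguous run of the resulting pieces is indexed if it meets minimum_word_length.

```python
import re


def index_terms(token, valid_punctuation="/,", minimum_word_length=3):
    """Toy model (an assumption, not htdig source): list the index terms
    that a token containing valid_punctuation characters would produce."""
    # Split the token at every valid_punctuation character.
    pattern = "[" + re.escape(valid_punctuation) + "]"
    parts = [p for p in re.split(pattern, token) if p]

    terms = []
    # Join every contiguous run of parts: single parts, pairs, ..., all parts.
    for start in range(len(parts)):
        for end in range(start + 1, len(parts) + 1):
            term = "".join(parts[start:end])
            # Drop terms that are too short, and avoid duplicates.
            if len(term) >= minimum_word_length and term not in terms:
                terms.append(term)
    return terms


if __name__ == "__main__":
    # With minimum_word_length set to 2, the short piece "72" is kept too.
    print(index_terms("72/111,222", minimum_word_length=2))
```

With the minimum length left at 3 in this sketch, the piece 72 is dropped while the other five terms remain.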
So, searches for parts of one of these compound numbers will still yield
a match.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)