From: Jutta W. <jw...@wi...> - 2005-07-31 23:38:36
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 31.07.2005 at 17:33, Aleksey Cheusov wrote:

> In order to make UTF-8 search faster, all index_words
> are lowercased,
> are sorted in lexicographic order (byte-to-byte)
> and all characters except alphanumeric ones and spaces are removed
> (unless --allchars option is specified).

I have now added allchars in addition to utf8, but it does not help
much. "Software in the Public Interest, Inc." is not found, as I still
have to remove "," to find "Debian-Entwickler". Okay, it seems I no
longer have to remove "-".

> When dictfmt is used to create ASCII dictionary,
> different sorting order is used (compatible with dictd-1.5.5 and earlier
> which supported ASCII databases only),
> i.e. all characters
> are kept in index_words and sorting order corresponds to 'sort -df'.

That is alphabetical order, which is locale-dependent as far as I know.
I think the ordering I have already implemented works quite well:
sorting by numerical value, without allkeys.txt.

> The way UTF-8 dictionaries are built allows to make search much faster
> but has a number of serious disadvantages and bad side effects.
> Two simple examples:
> the word 'AT&T' is represented in .index of UTF-8 dictionary
> as 'ATT' and also returned by MATCH command as 'ATT',

If the index is sorted by numerical value, is there really an advantage
in removing some of the characters? It seems I have to find out which
characters those are, or just proceed by trial and error. I cannot read
the C code well enough to find out from it.

> the second example is german nouns which
> are represented in lowercase in .index file

There are also names and acronyms, and other languages that use
uppercase.

> JW> I have seen, that utf-8 dictionarties all have one empty line at the
> JW> beginning of the fdicht file
> 0 />head /var/ftp/pub/dictd/geology_en-ru.index
> 00databasealphabet YVC c
> 00databaseinfo Ba o5

Sorry for my typo: I meant the ".dict" file, not the index.
But as far as I can see now, after your mail, I can simply leave that
empty line out.

> There is a number of special headwords in .index

I'll try to keep a list of them.

> Such headwords are used as a flags.
> In particular 00-database-utf8 say that this database is UTF-8 one.

Ah, okay. I see now that I can use them without caring about any
position. They can all have A\tB as position, or maybe none at all?

> JW> 00-database-utf8 seems not to be accepted in the index (seems the
> JW> same reason as above) and is not used in the dict file as entry.
> What's point?

Ah, I did not express that well enough: I now have to write
00databaseutf8 instead of 00-database-utf8, as I did before. The reason
"above" seems to be that I have to remove the hyphens when building the
index, but I am not sure. Other languages may use special characters
such as hyphens in words, too, so the removal may be a real problem.
For English, I know that writing a word with or without a hyphen can
give it a different meaning. In German, hyphens are only used in words
composed of more than one word (names).

Greetings,
Jutta

- --
http://www.witch.westfalen.de
http://witch.muensterland.org
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (Darwin)

iEYEARECAAYFAkLtRNwACgkQOgZ5N97kHkd/WgCgouMVPDOo3sCfNtVFnV5Yn64I
VA0An2+4xACnlf1WdX5/kaKA3xOa2VhP
=uJxQ
-----END PGP SIGNATURE-----
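
For reference, the UTF-8 index rule quoted at the top of this mail
(lowercase, drop everything except alphanumerics and spaces unless
allchars is set, then sort byte-by-byte) can be sketched roughly as
below. This is only an illustration of the rule as described in the
mail, not dictfmt's actual code; the function names normalize_headword
and build_index_keys are hypothetical, and which characters dictfmt
really counts as "alphanumeric" is an assumption here (Python's
str.isalnum).

```python
def normalize_headword(word: str, allchars: bool = False) -> str:
    """Lowercase a headword; unless allchars is set, keep only
    alphanumeric characters and plain spaces (assumed rule)."""
    lowered = word.lower()
    if allchars:
        return lowered
    return "".join(ch for ch in lowered if ch.isalnum() or ch == " ")


def build_index_keys(headwords, allchars: bool = False):
    """Normalize all headwords and sort them by the numerical value
    of their UTF-8 bytes, independent of any locale."""
    keys = [normalize_headword(w, allchars) for w in headwords]
    return sorted(keys, key=lambda k: k.encode("utf-8"))


if __name__ == "__main__":
    words = ["Debian-Entwickler", "00-database-utf8", "AT&T"]
    for key in build_index_keys(words):
        print(key)
```

Under these assumptions "00-database-utf8" becomes "00databaseutf8" and
"Debian-Entwickler" becomes "debianentwickler", which matches the hyphen
removal described above. Because the sketch applies the lowercase rule,
it turns 'AT&T' into "att".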