Re: [dict-beta] Charset range utf8


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 31.07.2005 at 17:33, Aleksey Cheusov wrote:

> In order to make UTF-8 search faster, all index_words
> are lowercased,
> are sorted in lexicographic order (byte-to-byte)
> and all characters except alphanumeric ones and spaces are removed
> (unless --allchars option is specified).

I have now added --allchars in addition to utf8.
But it does not help much.

"Software in the Public Interest, Inc." is still not found unless I
remove the ",". For "Debian-Entwickler", on the other hand, I no longer
have to remove the "-", it seems.

> When dictfmt is used to create ASCII dictionary,
> different sorting order is used (compatible with dictd-1.5.5 and earlier,
> which supported ASCII databases only),
> i.e. all characters
> are kept in index_words and sorting order corresponds to 'sort -df'.

That is alphabetical order, which is locale-dependent as far as I
know. I think the ordering I have already implemented works quite well:
sorting by numerical value, without allkeys.txt.
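
To illustrate the difference, here is a minimal Python sketch. It is
only an illustration: dictionary_fold() is a hypothetical, locale-free
stand-in for 'sort -df' (ignore case, compare only letters, digits and
blanks), not what sort(1) really does in a given locale, and the word
list is made up.

```python
words = ["Zebra", "apple", "Äpfel", "Banana", "AT&T"]


def bytewise(items):
    # Sort by the numerical value of the UTF-8 bytes.
    # For UTF-8 this gives the same order as sorting by code point.
    return sorted(items, key=lambda w: w.encode("utf-8"))


def dictionary_fold(items):
    # Hypothetical approximation of 'sort -df' without any locale:
    # drop everything except letters, digits and blanks, then ignore case.
    def key(w):
        kept = "".join(ch for ch in w if ch.isalnum() or ch.isspace())
        return kept.lower()

    return sorted(items, key=key)


print(bytewise(words))
print(dictionary_fold(words))
```

The byte-wise order puts all uppercase ASCII headwords before the
lowercase ones ("AT&T", "Banana", "Zebra", "apple", "Äpfel"), while the
folded order interleaves them ("apple", "AT&T", "Banana", "Zebra",
"Äpfel"). A real locale would probably also move "Äpfel" next to
"apple", which this sketch does not try to model.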

> The way UTF-8 dictionaries are built makes search much faster
> but has a number of serious disadvantages and bad side effects.
> Two simple examples:
> the word 'AT&T' is represented in .index of UTF-8 dictionary
> as 'ATT' and also returned by MATCH command as 'ATT',

If the index is sorted by numerical value, is there really an
advantage in removing some of the characters?
It seems I have to find out which characters those are, or just go by
trial and error. I cannot read the C code well enough to find out from it.
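
As a rough model of the rule quoted at the top (lowercase everything,
keep only alphanumeric characters and spaces unless --allchars is
given), here is a small Python sketch. It is based only on that
description, not on the dictfmt source; index_key() and its allchars
parameter are hypothetical names, and the real tool evidently does
more, since I still have to remove commas even with allchars.

```python
def index_key(headword, allchars=False):
    # Model of the described normalisation, not the dictfmt implementation.
    key = headword.lower()
    if not allchars:
        # Keep only letters, digits and whitespace.
        key = "".join(ch for ch in key if ch.isalnum() or ch.isspace())
    return key


for word in ["AT&T", "Debian-Entwickler", "00-database-utf8"]:
    print(repr(word), "->", repr(index_key(word)),
          "| allchars:", repr(index_key(word, allchars=True)))
```

Under this model "AT&T" collapses to "att" and "Debian-Entwickler" to
"debianentwickler", so two headwords that differ only in punctuation or
case end up with the same key.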

> the second example is german nouns which
> are represented in lowercase in .index file

There are also names and acronyms and other languages using uppercase.

>  JW> I have seen, that utf-8 dictionaries all have one empty line at the
>  JW> beginning of the fdicht file

> 0 />head /var/ftp/pub/dictd/geology_en-ru.index
> 00databasealphabet      YVC     c
> 00databaseinfo  Ba      o5

Sorry for my mistyping: I meant the ".dict" file, not the index.
But as far as I can see now, after your mail, I can simply leave that
empty line out.

> There are a number of special headwords in .index

I'll try to keep a list of them.

> Such headwords are used as flags.
> In particular, 00-database-utf8 says that this database is a UTF-8 one.

Ah, okay. I see now that I can use them without caring about their
position. They can all have A\tB as position, or maybe none?
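
For reference, this is how I read such an index line, as a small Python
sketch. It assumes that the three fields are separated by tabs and that
offset and length are numbers written as base-64 digits in the standard
base64 alphabet (A = 0, B = 1, ...); that is my understanding of the
format, not something I have verified in the dictd source.
b64_to_int() and parse_index_line() are hypothetical helper names.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_to_int(field):
    # Interpret the field as a positional number in base 64.
    value = 0
    for ch in field:
        value = value * 64 + B64.index(ch)
    return value


def parse_index_line(line):
    # Assumed layout: headword <TAB> offset <TAB> length
    headword, offset, length = line.rstrip("\n").split("\t")
    return headword, b64_to_int(offset), b64_to_int(length)


print(parse_index_line("00databaseinfo\tBa\to5"))
print(parse_index_line("00databaseutf8\tA\tB"))
```

Under these assumptions "Ba" and "o5" mean offset 90 and length 2617,
and A\tB means offset 0 and length 1, i.e. the first byte of the .dict
file.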

>  JW> 00-database-utf8 seems not to be accepted in the index (seems the
>  JW> same reason as above) and is not used in the dict file as entry.
> What's the point?

Ah, I did not express that well enough:

I now have to write 00databaseutf8 instead of 00-database-utf8, as I
had done before.
The reason "above" seems to be that I have to remove the hyphens when
building the index.

I am not sure, but other languages may also use special characters such
as hyphens in words, so the removal may be a real problem.
In English, I know that writing a word with or without a hyphen can
give it a different meaning. In German, hyphens are used only for words
composed of more than one word (names).

Greetings,

Jutta

- -- 
http://www.witch.westfalen.de
http://witch.muensterland.org

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (Darwin)

iEYEARECAAYFAkLtRNwACgkQOgZ5N97kHkd/WgCgouMVPDOo3sCfNtVFnV5Yn64I
VA0An2+4xACnlf1WdX5/kaKA3xOa2VhP
=uJxQ
-----END PGP SIGNATURE-----