Thread: RE: [Algorithms] Dictionary compression

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

From what I remember of my Computing Science degree, you would probably =
be best using a Trie. The following link has a pretty good description
http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Trie/

Search time is _very_ fast, not sure about the degree of compression you =
get out of it.

cheers
K

-----Original Message-----
From: Wojciech Wylon [mailto:ww...@ga...]
Sent: 27 February 2003 10:38
To: gda...@li...
Subject: [Algorithms] Dictionary compression

For one of our project I need to check if word exist in the directory.
And it has to be done (very) fast. We have dictionary of 2.000.000 words
(it is about 26MB of data in text file).
So far I did use following steps to create the data structure that allow
very rapid checking of word existence:
1. I sorted dictionary
2. I split dictionary into groups - each having N (20) words,
3. Each group I compressed eg.
Before:

abaka
abakach
abakami
abakan

after:
abaka
5ch
5mi
5an

4. For fast finding group to which given word belongs I use two
solutions:
	a) I create kind of tree (in nodes there is word to compare and
links
	  to sons, in leaves there are indexes of group),
	b) I created hash table=20

Solution 4a)=20
All data takes about 8.5MB, on my computer(1.4MH) I can check about
200.000 random fetches per sec.

Solution 4b)
All data takes about 20.MB, on my computer(1.4MH) I can check about
300.000 random fetches per sec.

IMHO It is still too slow. There should be the way to optimize it. Maybe
I should have prepare some dictionaries (for words that appear with
different=20
Frequency)

Any ideas? Any links for publications?

Wojciech Wylon
Ganymede Technologies

-------------------------------------------------------
This SF.NET email is sponsored by:
SourceForge Enterprise Edition + IBM + LinuxWorld =3D Something 2 See!
http://www.vasoftware.com
_______________________________________________
GDAlgorithms-list mailing list
GDA...@li...
https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
Archives:
http://sourceforge.net/mailarchive/forum.php?forum_id=3D6188