Thread: RE: [Algorithms] Dictionary compression
From: Ken C. <ken...@vi...> - 2003-02-27 10:44:47
From what I remember of my Computing Science degree, you would probably be best off using a trie. The following link has a pretty good description:

http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Trie/

Search time is _very_ fast; I'm not sure about the degree of compression you get out of it.

cheers
K

-----Original Message-----
From: Wojciech Wylon [mailto:ww...@ga...]
Sent: 27 February 2003 10:38
To: gda...@li...
Subject: [Algorithms] Dictionary compression

For one of our projects I need to check whether a word exists in the dictionary, and it has to be done (very) fast. We have a dictionary of 2,000,000 words (about 26 MB of data as a text file).

So far I have used the following steps to build a data structure that allows very rapid checking of word existence:

1. I sorted the dictionary.
2. I split the dictionary into groups, each holding N (20) words.
3. I compressed each group, e.g.

   Before:
   abaka
   abakach
   abakami
   abakan

   After:
   abaka
   5ch
   5mi
   5an

4. To quickly find the group a given word belongs to, I use two solutions:
   a) a kind of tree (nodes hold a word to compare against and links to children; leaves hold group indexes),
   b) a hash table.

Solution 4a): all data takes about 8.5 MB; on my computer (1.4 GHz) I can do about 200,000 random fetches per second.

Solution 4b): all data takes about 20 MB; on my computer (1.4 GHz) I can do about 300,000 random fetches per second.

IMHO it is still too slow; there should be a way to optimize it. Maybe I should prepare separate dictionaries for words that appear with different frequency.

Any ideas? Any links to publications?

Wojciech Wylon
Ganymede Technologies

-------------------------------------------------------
This SF.NET email is sponsored by:
SourceForge Enterprise Edition + IBM + LinuxWorld = Something 2 See!
http://www.vasoftware.com
_______________________________________________
GDAlgorithms-list mailing list
GDA...@li...
https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
Archives: http://sourceforge.net/mailarchive/forum.php?forum_id=6188
From: Gareth L. <GL...@cl...> - 2003-02-27 11:17:47
Here comes Gareth "doesn't know maths and didn't go to University, so his ideas are more simple"* Lewin.

(I assume your problem is one of speed of searching, not of memory requirements?)

Have all the words sorted in one big list. Then have a table, depending on your tests, keyed on just the first letter, the first two letters, or the first three (no need for more). It's just a static array saying where the first word that starts with those letters is. Since it's a static array, you can jump directly to the correct index with simple maths.

On the compression side, do you care about case? If not, you only need 26 values (4 bit) to store each character, pretty fast 2 to 1 compression there.

Regards, Gareth Lewin

(*) When I say simple I don't mean that as a slight to you math heads and university veterans (I wish I had your knowledge); I mean it draws from a more limited bank of information, but often I solve a problem in a way that works just as well.
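[Editor's note: Gareth's static first-letters table might be sketched as follows. `TwoLetterIndex` is a hypothetical name, and lower-case a-z input is assumed; the table points each two-letter prefix at its range in the sorted list, and a binary search finishes the job within the bucket.]

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Static bucket index over a sorted word list, keyed on the first two
// letters (lower-case a-z assumed). bucketStart[k] is the index of the
// first word whose two-letter key is >= k, so a lookup only needs to
// binary-search inside one bucket.
struct TwoLetterIndex {
    std::vector<std::string> words;        // kept sorted
    int bucketStart[26 * 26 + 1];

    explicit TwoLetterIndex(std::vector<std::string> w) : words(std::move(w)) {
        std::sort(words.begin(), words.end());
        int next = 0;
        for (int key = 0; key <= 26 * 26; ++key) {
            while (next < (int)words.size() && keyOf(words[next]) < key)
                ++next;
            bucketStart[key] = next;
        }
    }

    // Map the first two letters to 0..675; short words pad with 0,
    // which keeps the key order consistent with lexicographic order.
    static int keyOf(const std::string& s) {
        int a = s.size() > 0 ? s[0] - 'a' : 0;
        int b = s.size() > 1 ? s[1] - 'a' : 0;
        return a * 26 + b;
    }

    bool contains(const std::string& s) const {
        int key = keyOf(s);
        return std::binary_search(words.begin() + bucketStart[key],
                                  words.begin() + bucketStart[key + 1], s);
    }
};
```

The array jump is the "simple maths" part; only the handful of words sharing the two-letter prefix are ever compared.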
From: Wojciech W. <ww...@ga...> - 2003-02-27 12:46:18
> (I assume your problem is one of speed of searching, not of memory
> requirements?)
>
> Have all the words sorted in one big list. Then have a table, depending
> on your tests, keyed on just the first letter, the first two letters, or
> the first three (no need for more). It's just a static array saying where
> the first word that starts with those letters is. Since it's a static
> array, you can jump directly to the correct index with simple maths.

[Wojciech Wylon] Hmm, e.g. for the word prefix "Ant" I have 2600 words!! I have not tested it (yet), but I suppose this solution could be a bit slower than my current ones... But you have given me a new idea for data structures; I will have to think about that...

> On the compression side, do you care about case? If not, you only need
> 26 values (4 bit) to store each character, pretty fast 2 to 1 compression
> there.

[Wojciech Wylon]
a) I have more than 26 letters in my alphabet ;)
b) compression is quite important:
   a) less memory used means a larger chance that everything will be in
      the cache when needed.
   b) on one computer we will have several applications using that code.
      I can share the dictionary among different applications, but we will
      probably have several different dictionaries on one computer at the
      same time. And there is quite a lot that still has to be done...

I remember that there was once a very clever trick used for dictionary compression/access; it was described in the book Programming Pearls. Does anyone remember that algorithm (or have that book)?

Best regards,
Wojciech Wylon
From: Willem H. de B. <Wi...@mu...> - 2003-02-27 11:41:15
"(I wish I had your knowledge)"

Go to the library and get yourself a couple of books that are used in CS or Maths degrees. There, problem solved :)
From: Richard F. <alg...@th...> - 2003-02-27 13:06:15
> > On the compression side, do you care about case ? If not you
> > only need 26 values (4 bit) to store each character, pretty fast
> > 2 to 1 compression there.

lol
From: Ville M. <wi...@hy...> - 2003-02-27 12:07:33
> here comes Gareth "doesn't know maths and didn't go to University, so his
> ideas are more simple"* Lewin.
>
> On the compression side, do you care about case ? If not you only need 26
> values (4 bit) to store each character, pretty fast 2 to 1 compression
> there.

26 values > 4 bits =)

cheers,
-wili

Ville Miettinen
Lead Programmer
Hybrid Graphics, Ltd.
http://www.hybrid.fi
From: Gareth L. <GL...@cl...> - 2003-02-27 12:19:57
> 26 values > 4 bits =)

Um, yes, 5 bits. See what I meant!

You can still compress almost 2 to 1 (well, 5 to 8).

Not sure what the original intent of the problem is. Is he worried about memory paging of the file (a memory-mapped file would work fine IMO)? Or is he worried about the time it takes to find a word?
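[Editor's note: the 5-bits-per-letter packing under discussion might look like the sketch below. The function names and the 0-valued terminator convention are assumptions, not from the thread; letters a-z map to 1-26 so that 0 can mark end-of-word.]

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Pack a lower-case a-z word into 5 bits per character (a=1 .. z=26,
// 0 terminates), giving close to the 5:8 compression mentioned above.
std::vector<uint8_t> pack5(const std::string& word) {
    std::vector<uint8_t> out;
    uint32_t acc = 0;
    int bits = 0;
    for (size_t i = 0; i <= word.size(); ++i) {   // <= to emit terminator
        uint32_t v = (i < word.size()) ? (uint32_t)(word[i] - 'a' + 1) : 0;
        acc |= v << bits;
        bits += 5;
        while (bits >= 8) {
            out.push_back((uint8_t)(acc & 0xFF));
            acc >>= 8;
            bits -= 8;
        }
    }
    if (bits > 0) out.push_back((uint8_t)(acc & 0xFF));
    return out;
}

std::string unpack5(const std::vector<uint8_t>& bytes) {
    std::string out;
    uint32_t acc = 0;
    int bits = 0;
    for (uint8_t b : bytes) {
        acc |= (uint32_t)b << bits;
        bits += 8;
        while (bits >= 5) {
            uint32_t v = acc & 0x1F;
            acc >>= 5;
            bits -= 5;
            if (v == 0) return out;               // hit the terminator
            out += (char)('a' + v - 1);
        }
    }
    return out;
}
```

An 8-symbol word (7 letters plus terminator) packs into exactly 5 bytes, versus 8 bytes unpacked.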
From: Wojciech W. <ww...@ga...> - 2003-02-27 12:49:26
> Not sure what the original intent of the problem is. Is he worried about
> memory paging of the file (a memory-mapped file would work fine IMO)?
> Or is he worried about the time it takes to find a word?

I am worried about time.

Wojciech Wylon
From: Richard F. <alg...@th...> - 2003-02-27 16:27:17
Then why not just crunch the word into a bit pattern? E.g. "cat" (with a 26-letter alphabet) is c (3) + a (1) * 27 + t (20) * 27^2, so cat = 14610 (0x3912 in hex), and you set up a 256-tree to access the values. In this case you look up the low byte, 0x12, in the root's children; if that child exists, you follow it and look up the next byte, 0x39; if that node exists and is flagged as a word end, then you know the value 0x3912 is in the dictionary.

    struct treeNode
    {
        bool haveRoot;              // true if a crunched word ends here
        treeNode *subNode[ 256 ];
    };

    // To check:
    treeNode *root;                 // built in advance from the dictionary

    bool Contains( const char *inString )
    {
        unsigned long long value = Crunch( inString );
        treeNode *tempNode = root;
        for(;;)
        {
            tempNode = tempNode->subNode[ value & 0xFF ];
            if( tempNode == NULL )
                return false;
            value >>= 8;
            if( !value )
                return tempNode->haveRoot;
        }
    }

which for words up to 13 letters long (most of the dictionary) is only 8 loops max...

> I am worried about time.
>
> Wojciech Wylon
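[Editor's note: `Crunch` is referenced above but never shown. Under the base-27 encoding Richard describes (a=1 .. z=26, each position weighted by a growing power of 27), one plausible implementation is the sketch below; the signature is an assumption.]

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Base-27 packing: 'a' = 1 ... 'z' = 26, position i weighted by 27^i.
// Up to 13 lower-case letters fit in 64 bits, since 27^13 < 2^62.
uint64_t Crunch(const std::string& word) {
    uint64_t value = 0;
    uint64_t weight = 1;
    for (char c : word) {
        value += (uint64_t)(c - 'a' + 1) * weight;
        weight *= 27;
    }
    return value;
}
```

This reproduces the worked example in the post: "cat" crunches to 3 + 1*27 + 20*729 = 14610.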
From: Brian M. <Bri...@wa...> - 2003-02-27 12:22:13
I don't have the book to hand (it's at home), but from memory the book "Managing Gigabytes" is worth a look for this.

http://www.cs.mu.oz.au/mg/

Memory is never the best reference... but I remember it being the best reference I've come across for the practicalities of large databases and compression.

HTH,
-Brian.
From: Gareth L. <GL...@cl...> - 2003-02-27 12:54:18
> Hmm, e.g. for word prefix "Ant" I have 2600 words!! I have not tested it
> (yet) but I suppose that this solution can be a bit slower than one of
> my current... But you give me new idea for data structures - I will have
> to think yet about that...

You could key on the last 3 letters of the word; that is surely less common. Also, how slow is stricmp 2600 times? How fast does this need to be?

> a) I have more than 26 letters in alphabet ;)

More than 32? Also, it was just a quick and dirty compression idea.

> b) compression is quite important:
>    a) less memory used - larger chance that everything will be in cache
>       when needed.

I would really like to know how often you are going to be doing this. How many queries a second, for example?

>    b) On one computer we will have several applications using that code.
>       I can share the dictionary among different applications. But we
>       will probably have several different dictionaries at the same time
>       on one computer. And there is quite a lot that still has to be
>       done...

That's why I would use memory-mapped files. Would work a treat :)

May I ask why not just use ispell or the Microsoft speller?
From: Wojciech W. <ww...@ga...> - 2003-02-27 13:35:39
> I would really like to know how often you are going to be doing this.
> How many queries a second, for example?

[Wojciech Wylon] I assume around 50,000-100,000 per second. OK, we could buy more and faster hardware, but that is not the best solution...

> That's why I would use memory-mapped files. Would work a treat :)
>
> May I ask why not just use ispell or the Microsoft speller?

[Wojciech Wylon]
a) It has to work on Linux.
b) Used as an external process, ispell would be too slow. OK, I could rework it (C -> C++, plus publish the source) and try to use it as a lib, but IMHO it would still be too slow!

WW
From: Gareth L. <GL...@cl...> - 2003-02-27 12:56:17
Hey! I just thought of an idea. Why not a tree, like a quad tree but split along all the different letters of the alphabet? (This has probably been thought of before.)

But it's pretty fast to test. Each node has NUM_LETTERS child nodes. You can jump through your tree way fast.

Locality of memory might be a problem here, not sure.

> I am worried about time.
>
> Wojciech Wylon
From: Justin Heyes-J. <jus...@ge...> - 2003-02-27 14:22:22
What about a tree of valid letters? You start off with a root node; the next level of the tree would be all the letters of the alphabet. The next level would be all legal next letters for each letter, and so on.

The algorithm for spell check would be to start at the first level of the tree and find your first letter; if it is found, go to the next level and check for the second letter. Repeat until you match your last letter or run out of tree.

Has anyone got time to do runtime analysis on these? hehehe. Lookup should be O(L) in the length of the word, I would think, independent of dictionary size.

A tree for this dictionary is below. Hope the formatting stays...

    a apple big bad bag cat

    root
    +- a
    |  +- p - p - l - e
    +- b
    |  +- a
    |  |  +- d
    |  |  +- g
    |  +- i - g
    +- c
       +- a - t

----- Original Message -----
From: "Gareth Lewin" <GL...@cl...>
Subject: RE: [Algorithms] Dictionary compression

> Hey! I just thought of an idea. Why not a tree, like a quad tree but
> split along all the different letters of the alphabet?
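[Editor's note: the letter tree described above is a trie; a minimal pointer-based sketch follows. Names are illustrative and lower-case a-z input is assumed; a node also needs a flag marking where a word ends, so that "a" can be in the dictionary while "ap" is not.]

```cpp
#include <cassert>
#include <string>

// One node per letter; isWord marks nodes where a dictionary word ends.
struct TrieNode {
    TrieNode* child[26] = {};
    bool isWord = false;
};

void insert(TrieNode* root, const std::string& w) {
    TrieNode* n = root;
    for (char c : w) {
        int i = c - 'a';
        if (!n->child[i]) n->child[i] = new TrieNode();
        n = n->child[i];
    }
    n->isWord = true;
}

// Walks one node per letter: O(L) in the word length,
// regardless of how many words the dictionary holds.
bool lookup(const TrieNode* root, const std::string& w) {
    const TrieNode* n = root;
    for (char c : w) {
        n = n->child[c - 'a'];
        if (!n) return false;
    }
    return n->isWord;
}
```

Memory locality is the usual weakness of this layout, as Gareth notes; a flat array of nodes instead of heap pointers is the common remedy.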
From: Willem K. <wk...@gm...> - 2003-02-27 14:34:51
What you and Gareth describe is exactly what Ken Cropper already posted: a trie.

http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Trie/

Sure would be the way I'd approach it.

Willem
From: Gareth L. <GL...@cl...> - 2003-02-27 14:10:19
|
Hey, I already admitted I was wrong. > -----Original Message----- > From: Richard Fabian [mailto:alg...@th...] > Sent: 27 February 2003 13:01 > To: gda...@li... > Subject: RE: [Algorithms] Dictionary compression > > > > > On the compression side, do you care about case ? If not you > > > only need 26 > > > values (4 bit) to store each character, pretty fast 2 to 1 > > compression > > > there. > > lol > |
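[Editor's note] The 5-bits-per-letter point from the earlier exchange is easy to sketch. Below is a minimal illustration, not anything posted in the thread: the helper name `pack5` and the 0-valued terminator convention are mine, and it assumes lowercase a-z input only.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Pack a lowercase a-z word at 5 bits per character (values 1..26; the
// value 0 is reserved as a terminator), roughly the "2 to 1 (well 5 to 8)"
// saving discussed above.
std::vector<uint8_t> pack5(const std::string& word) {
    std::vector<uint8_t> out;
    uint32_t acc = 0;  // bit accumulator (never holds more than 12 bits)
    int bits = 0;      // number of valid bits currently in acc
    auto push = [&](uint32_t v) {
        acc = (acc << 5) | v;
        bits += 5;
        while (bits >= 8) {            // flush whole bytes as they fill
            out.push_back(uint8_t(acc >> (bits - 8)));
            bits -= 8;
        }
    };
    for (char c : word) push(uint32_t(c - 'a') + 1);
    push(0);                           // terminator
    if (bits > 0) out.push_back(uint8_t(acc << (8 - bits)));  // pad last byte
    return out;
}
```

For example, "abakan" plus its terminator is 7 symbols, i.e. 35 bits, so it packs into 5 bytes instead of the 7 a plain C string would take.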
From: Richard F. <alg...@th...> - 2003-02-28 10:05:28
|
my mail is SOOOOOOOOOOOOOOOOOOO slow at the moment, so I replied as soon as that hit my mail box. :S I posted the following mail at just after three :( > -----Original Message----- > From: gda...@li... > [mailto:gda...@li...] On > Behalf Of Gareth Lewin > Sent: 27 February 2003 14:10 > To: gda...@li... > Subject: RE: [Algorithms] Dictionary compression > > > Hey, I already admitted I was wrong. > > > -----Original Message----- > > From: Richard Fabian [mailto:alg...@th...] > > Sent: 27 February 2003 13:01 > > To: gda...@li... > > Subject: RE: [Algorithms] Dictionary compression > > > > > > > > On the compression side, do you care about case ? If not you > > > > only need 26 > > > > values (4 bit) to store each character, pretty fast 2 to 1 > > > compression > > > > there. > > > > lol > |
From: Gareth L. <GL...@cl...> - 2003-02-27 14:13:59
|
> [Wojciech Wylon] > I assume that around 50.000-100.000 per second. Ok we can buy more and > faster hardware but it is not the best solution... WOW That's a lot of queries. I assume this is a server ? Are you sure you can handle the bandwidth for that ? Any chance you could tell us what you are doing ? :) You really want to keep it all in memory for that type of access, I would think. Don't worry too much about cache hits, as (I assume) no matter what you do, the queries are going to come in from all over the board anyway. The subdivision idea I posted earlier still seems the best one to me. > May I ask why not just use ispell or the Microsoft speller ? > [Wojciech Wylon] > a) It will work on Linux Ok > b) when used as external process ispell will be too slow. Ok. I can > rework (C->C++ + publish source) it and try to use it as a > lib but IMHO > it will be too slow! I don't know enough about ispell to tell you whether it would be fast enough, but I doubt it was ever tested at the query volume you're talking about. |
From: Wojciech W. <ww...@ga...> - 2003-02-27 14:35:30
|
> > [Wojciech Wylon] > > I assume that around 50.000-100.000 per second. Ok we can buy more and > > faster hardware but it is not the best solution... > > WOW > > That's a lot of queries. I assume this is a server ? You sure you could > handle the bandwidth of that ? Any chance you could tell us what you are > doing ? :) [Wojciech Wylon] Nothing special - some word games. We also want to use the dictionary for automatic chat moderation (it is easy to remove vulgar words, but people tend to "cheat a bit", e.g. write "vul!ga*v"). The dictionary will help to detect such behaviour... And these 50-100.000 are upper estimates of queries. WW p.s. I am just testing Bloom filters - I still need to test their false-positive behaviour, but they are the best solution so far (and they take a really small amount of memory).... |
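[Editor's note] For readers unfamiliar with the Bloom filters mentioned in the p.s., here is a minimal sketch of the structure. The class name, table size, and seeded FNV-1a-style hash are illustrative guesses, not what Ganymede actually used. A Bloom filter can return false positives but never false negatives, which is exactly the behaviour Wojciech says he still has to measure.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Minimal Bloom filter: k seeded hashes set k bits per inserted word.
class BloomFilter {
public:
    BloomFilter(size_t bits, int k)
        : bits_(bits), k_(k), table_((bits + 7) / 8, 0) {}

    void add(const std::string& word) {
        for (int i = 0; i < k_; ++i) {
            size_t b = hash(word, i) % bits_;
            table_[b / 8] |= uint8_t(1u << (b % 8));
        }
    }

    // True means "present or a false positive"; false is definitive.
    bool mayContain(const std::string& word) const {
        for (int i = 0; i < k_; ++i) {
            size_t b = hash(word, i) % bits_;
            if (!(table_[b / 8] & (1u << (b % 8)))) return false;
        }
        return true;
    }

private:
    // FNV-1a variant, seeded per hash index; illustrative only.
    static size_t hash(const std::string& word, int seed) {
        uint64_t h = 1469598103934665603ull ^
                     (uint64_t(seed) * 0x9e3779b97f4a7c15ull);
        for (unsigned char c : word) { h ^= c; h *= 1099511628211ull; }
        return size_t(h);
    }

    size_t bits_;
    int k_;
    std::vector<uint8_t> table_;
};
```

The memory appeal is clear: a few bits per word instead of storing the word itself, at the cost of a tunable false-positive rate.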
From: Gareth L. <GL...@cl...> - 2003-02-27 14:29:51
|
Which is basically my subdivision idea, I think. Obviously you could have a no-valid child leaf for a specific character. And yes, I still think this is the way to go. (You forgot 'big' in your tree.) > -----Original Message----- > From: Justin Heyes-Jones [mailto:jus...@ge...] > Sent: 27 February 2003 14:21 > To: gda...@li... > Subject: Re: [Algorithms] Dictionary compression > > > What about a tree of valid letters. You start off with a root > node, then the > next level of the tree would be all the letters of the > alphabet. The next > level would be all legal next letters for each letter, and so on. > > The algorithm for spell check would be to start at the first > level of the > tree and find your first letter; if it is found then go to the > next level and > check for the second letter. Repeat until you match your last > letter or run > out of tree. > > Has anyone got time to do runtime analysis on these? hehehe. > This should be > O(Nlog(N)) I would think. > > A tree for this dictionary is below. Hope the tab formatting > stays .... > > a > apple > big > bad > bag > cat > > root > a > p > p > l > e > b > a > d > g > c > a > t > > ----- Original Message ----- > From: "Gareth Lewin" <GL...@cl...> > To: <gda...@li...> > Sent: Thursday, February 27, 2003 12:56 PM > Subject: RE: [Algorithms] Dictionary compression > > > Hey ! I just thought of an idea. Why not a tree, like a quad tree but split > > along all the different letters of the alphabet ? > > > > (This has probably been thought of before.) > > > > But it's pretty fast to test. Each node has NUM_LETTERS child nodes. You > > can jump through your tree way fast. > > > > Locality of memory might be a problem here, not sure. > > > > > -----Original Message----- > > > From: Wojciech Wylon [mailto:ww...@ga...] > > > Sent: 27 February 2003 12:51 > > > To: gda...@li... > > > Subject: RE: [Algorithms] Dictionary compression > > > > > > I am worried about time. > > > > > > Wojciech Wylon |
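[Editor's note] Justin's "tree of valid letters" is the trie suggested at the top of the thread. A minimal sketch (the node layout and function names are illustrative) might look like:

```cpp
#include <string>

// Trie over 'a'..'z': follow one child per character; the word exists
// iff the walk ends on a node marked terminal. Lookup cost is
// proportional to the word length, independent of dictionary size.
struct TrieNode {
    TrieNode* child[26] = {};
    bool terminal = false;
    ~TrieNode() { for (TrieNode* c : child) delete c; }
};

void insert(TrieNode* root, const std::string& word) {
    TrieNode* n = root;
    for (char c : word) {
        int i = c - 'a';
        if (!n->child[i]) n->child[i] = new TrieNode;
        n = n->child[i];
    }
    n->terminal = true;
}

bool contains(const TrieNode* root, const std::string& word) {
    const TrieNode* n = root;
    for (char c : word) {
        n = n->child[c - 'a'];      // nullptr = "no-valid child" branch
        if (!n) return false;
    }
    return n->terminal;             // distinguishes words from prefixes
}
```

The `terminal` flag is what separates real words from mere prefixes: with Justin's dictionary loaded, "apple" is found but "app" is not. Gareth's locality worry is real, though: each character follows a pointer to a fresh allocation, so a production version would likely flatten the nodes into one array.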
From: Simon O'C. <si...@cr...> - 2003-02-27 15:33:03
|
Hmm, so one of the main purposes of such a large dictionary is to spot deliberate misspellings of "bad" words ? Couldn't you just use something like soundex to store the dictionary and translate queries (you'd probably have to modify it a bit to include things like "l337 5p34k")? Result: a much smaller dictionary, and you'd be able to catch things which may not be in the dictionary, like "phcuk u". At its simplest the soundex code could also act as a nice hash and sort key. Simon O'Connor Creative Asylum Limited www.creative-asylum.com > -----Original Message----- > From: Wojciech Wylon [mailto:ww...@ga...] > Sent: 27 February 2003 14:37 > To: gda...@li... > Subject: RE: [Algorithms] Dictionary compression > > > > [Wojciech Wylon] > > > I assume that around 50.000-100.000 per second. Ok we can buy more > and > > > faster hardware but it is not the best solution... > > > > WOW > > > > That's a lot of queries. I assume this is a server ? You sure you > could > > handle the bandwidth of that ? Any chance you could tell us what you > are > > doing ? :) > [Wojciech Wylon] > Nothing special - some word games. We also wanted to use > dictionary for > chat automatic moderation (it is easy to remove vulgar words > but people > use to "cheat a bit" e.g. write: > > "vul!ga*v". > > Dictionary will help to detect such behaviour... > > And these 50-100.000 are upper estimates of queries. > > WW > > p.s. I just test Bloom filters - I need yet to test its false positive > behavior but they are the best solution so far (and they take > really low > amount of memory).... > |
From: Wojciech W. <ww...@ga...> - 2003-02-27 16:00:49
|
a) it is not for English ;) b) We have two words: "rachuj", which is fine, and "chuj", which is very vulgar - and both sound similar. Wojciech Wylon > -----Original Message----- > From: gda...@li... [mailto:gdalgorithms- > lis...@li...] On Behalf Of Simon O'Connor > Sent: Thursday, February 27, 2003 4:25 PM > To: 'gda...@li...' > Subject: RE: [Algorithms] Dictionary compression > > > Hmm, so one of the main purposes of such a large dictionary is to spot > deliberate misspellings of "bad" words ?. > > Couldn't you just use something like soundex to store the dictionary and > translate queries (probably have to modify it a bit to include things like > "l337 5p34k") - result: a much smaller dictionary and able to catch things > which may not be in the dictionary like "phcuk u") > > At its simplest the soundex code could also act as a nice hash and sort > key > > Simon O'Connor > Creative Asylum Limited > www.creative-asylum.com > > > -----Original Message----- > > From: Wojciech Wylon [mailto:ww...@ga...] > > Sent: 27 February 2003 14:37 > > To: gda...@li... > > Subject: RE: [Algorithms] Dictionary compression > > > > [Wojciech Wylon] > > Nothing special - some word games. We also wanted to use > > dictionary for > > chat automatic moderation (it is easy to remove vulgar words > > but people > > use to "cheat a bit" e.g. write: > > > > "vul!ga*v". > > > > Dictionary will help to detect such behaviour... > > > > And these 50-100.000 are upper estimates of queries. > > > > WW > > > > p.s. I just test Bloom filters - I need yet to test its false positive > > behavior but they are the best solution so far (and they take > > really low > > amount of memory).... |
From: Gareth L. <GL...@cl...> - 2003-02-27 15:53:33
|
> At its simplest the soundex code could also act as a nice > hash and sort key *bang head*. My BBS did that when I programmed in Pascal and was 14 years old. Of course ! So simple. I did it for filename searches. Each "app" had a soundex key. Was useful cause it was in Israel, and some Israelis obviously aren't very good at English spelling (English not being their native tongue). |
From: Gareth L. <GL...@cl...> - 2003-02-27 16:05:44
|
Bless you. > -----Original Message----- > From: Wojciech Wylon [mailto:ww...@ga...] > Sent: 27 February 2003 16:03 > To: gda...@li... > Subject: RE: [Algorithms] Dictionary compression > > > a) it is not for English ;) > b) We have two words "rachuj" - it is ok! And "chuj" it is > very vulgar - > both sound in similar way. > > > Wojciech Wylon |
From: Simon O'C. <si...@cr...> - 2003-02-27 17:14:28
|
It should still be useful for optimising the query though: a table of fixed-length soundex codes sorted into soundex order (acting as hash values), along with pointers into the main dictionary. Adapting the method for Polish vowels and consonants will reduce the false positives too (I think - unfortunately I don't speak Polish ;o). BTW: using the "plain" English form of soundex, the codes for "chuj" and "rachuj" are different: C200 and R220. Then again "cox" is C200, which would no doubt please Tony and his kinfolk round the world if your game refused to accept their surname ;o) Simon O'Connor Creative Asylum Limited www.creative-asylum.com > -----Original Message----- > From: Wojciech Wylon [mailto:ww...@ga...] > Sent: 27 February 2003 16:03 > To: gda...@li... > Subject: RE: [Algorithms] Dictionary compression > > > a) it is not for English ;) > b) We have two words "rachuj" - it is ok! And "chuj" it is > very vulgar - > both sound in similar way. > > > Wojciech Wylon > > > > > -----Original Message----- > > From: gda...@li... > [mailto:gdalgorithms- > > lis...@li...] On Behalf Of Simon O'Connor > > Sent: Thursday, February 27, 2003 4:25 PM > > To: 'gda...@li...' > > Subject: RE: [Algorithms] Dictionary compression > > > > > > Hmm, so one of the main purposes of such a large dictionary > is to spot > > deliberate misspellings of "bad" words ?. > > > > Couldn't you just use something like soundex to store the dictionary > and > > translate queries (probably have to modify it a bit to > include things > like > > "l337 5p34k") - result: a much smaller dictionary and able to catch > things > > which may not be in the dictionary like "phcuk u") > > > > At its simplest the soundex code could also act as a nice hash and > sort > > key > > > > Simon O'Connor > > Creative Asylum Limited > > www.creative-asylum.com > > > > > -----Original Message----- > > > From: Wojciech Wylon [mailto:ww...@ga...] > > > Sent: 27 February 2003 14:37 > > > To: gda...@li...
> > > Subject: RE: [Algorithms] Dictionary compression > > > > > > > > > > > [Wojciech Wylon] > > > > > I assume that around 50.000-100.000 per second. Ok we can buy > more > > > and > > > > > faster hardware but it is not the best solution... > > > > > > > > WOW > > > > > > > > That's a lot of queries. I assume this is a server ? > You sure you > > > could > > > > handle the bandwidth of that ? Any chance you could tell us what > you > > > are > > > > doing ? :) > > > [Wojciech Wylon] > > > Nothing special - some word games. We also wanted to use > > > dictionary for > > > chat automatic moderation (it is easy to remove vulgar words > > > but people > > > use to "cheat a bit" e.g. write: > > > > > > "vul!ga*v". > > > > > > Dictionary will help to detect such behaviour... > > > > > > And these 50-100.000 are upper estimates of queries. > > > > > > WW > > > > > > p.s. I just test Bloom filters - I need yet to test its false > positive > > > behavior but they are the best solution so far (and they take > > > really low > > > amount of memory).... > > > > > > > |
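[Editor's note] Simon's C200 / R220 / C200 codes can be reproduced with the plain American soundex rules. This sketch is the standard English variant he refers to, not adapted for Polish; the function names are mine.

```cpp
#include <cctype>
#include <string>

// Digit for one letter under American soundex; '0' marks letters that
// are dropped (vowels, h, w, y).
char soundexDigit(char c) {
    switch (std::tolower(static_cast<unsigned char>(c))) {
        case 'b': case 'f': case 'p': case 'v': return '1';
        case 'c': case 'g': case 'j': case 'k':
        case 'q': case 's': case 'x': case 'z': return '2';
        case 'd': case 't': return '3';
        case 'l': return '4';
        case 'm': case 'n': return '5';
        case 'r': return '6';
        default:  return '0';
    }
}

// Keep the first letter, encode the rest as digits, collapse repeats
// (h/w do not break a repeat, vowels do), pad/truncate to four chars.
std::string soundex(const std::string& word) {
    if (word.empty()) return "";
    std::string out(1, char(std::toupper(static_cast<unsigned char>(word[0]))));
    char prev = soundexDigit(word[0]);   // first letter's code can collapse
    for (size_t i = 1; i < word.size() && out.size() < 4; ++i) {
        char c = char(std::tolower(static_cast<unsigned char>(word[i])));
        char d = soundexDigit(c);
        if (d != '0' && d != prev) out += d;
        if (c == 'h' || c == 'w') continue;  // keep prev across h/w
        prev = d;
    }
    out.append(4 - out.size(), '0');
    return out;
}
```

Tracing it by hand: `soundex("chuj")` gives "C200", `soundex("rachuj")` gives "R220" (the vowel between the two '2'-letters keeps them from collapsing), and `soundex("cox")` gives "C200" - matching all three codes quoted above.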