From: Schaeferhaus <sch...@ya...> - 2008-01-16 20:48:54
|
>when I have array-of-array, etc, and must=0A>think carefully, or fix the c= ode. =0A=0AHier an example of walking in an array-of-array as i would use:= =0Avoid *judyx=3DNULL,*judyy=3DNULL;=0AWord_t *px,*py;=0Aunsigned x,y;=0A..= .=0Afor(x=3D0,px =3D JudyLFirst(judyx, &x, NULL); px; px =3D JudyLNext(judy= x, &x, NULL)) {=0A for(y=3D0,py =3D JudyLFirst(*px, &y, NULL); py; py =3D = JudyLNext(*px, &y, NULL)) {=0A printf("x=3D%u, y=3D%u\n", x, y);=0A .= ..=0A }=0A}=0A=0AThis interface call is much simpler as the macros. You ca= n agree that=0Ausing this interface is more natural.=0A=0A>"Judy-star" (as = in Judy*()), along the lines of, JudyL()'s values are=0A>long-words, etc.= =0A=0AWhere can i find this. Actualy i'm using array-of-array of unsigned t= o emulate 64-bits ints.=0A=0A>> - The JudyNext and JudyPrev can be improved= in speed, by using a=0A>> current cursor-structure as parameter.=0A>What d= oes that mean? =0A=0AI don't know how Judy works internally, but i think y= ou do a search first (like JudyGet) of parameter passed to JudyNext/JudyPre= v and then return the next/prev element. With the cursor interface you=0Aca= n store the current pointers (tree path) in the structure and use this info= at the next call (without a new search at each call).=0A=0A>But would you = have it return a valuep, and the caller still copies their data into=0A the= =0A>value area after the Judy insertion call returns, or would you have the= =0A>caller pass a pointer to the record-to-save itself, and have Judy do=0A= the=0A>copying? Etc.=0A=0AI think the best way is that the caller copies= their data into the returned valuep area.=0AAs actually done with the ret= urned Word_t pointer.=0A=0A----- Urspr=FCngliche Mail ----=0AVon: Alan Silv= erstein <aj...@fr...>=0AAn: jud...@li...; schaeferhaus= @yahoo.de=0ACC: dou...@ya...; yj...@cl...=0AGesendet: M= ittwoch, den 16. Januar 2008, 19:40:11 Uhr=0ABetreff: Re: AW: New Judy vers= ion=0A=0A> I don't know what is the eternity problem with judy-interface...= =0A=0AThanks, it's nice to hear that for a change. :-)=0A=0A> I think it i= s not a good idea to use the macro interface as=0A> recommended in the docs= (it's not faster)...=0A=0AActually buried in the macros there is a LITTLE = performance boost for=0Asmall arrays; see Judy.h. I'd forgotten it was eve= n there. But in=0Ageneral you are right.=0A=0AI'm now making major use of = libJudy in a contract job, with their=0Ablessings. I chose to use the macr= os to keep it short, simple, and=0Apalatable as possible. But, as I've wri= tten before, even I get lost in=0Athe abstractions at times, when I have ar= ray-of-array, etc, and must=0Athink carefully, or fix the code. Plus I end= up having to do strange=0Athings like this to make it happy:=0A=0A JSLI= (valuep, arrayp, index1);=0A PPvoid_t subpp =3D valuep;=0A JSLI(value= p, *subpp, index2);=0A ...=0A=0AThe macro actually does (&*subpp). But = you gotta get the typecast=0Aright, also for indexes in some cases. Using = functions would be more=0Averbose, but more explicit, and also the for/next= loops to walk arrays=0Awould be tidier.=0A=0A> The only drawbacks i found = in judy:=0A> - you cannot store more than sizeof(*pval) as value in the = array.=0A> You must use the pval indirectly (storing the pointer).=0A=0AYes= , although this falls directly out of libJudy being honed for=0Aperformance= for what it does best, as a low-level engine. Somewhere in=0Athe docs is = discussion of a potential API we talked about a lot, called=0A"Judy-star" (= as in Judy*()), along the lines of, JudyL()'s values are=0Along-words, etc.= =0A=0AOn one hand, you can argue that there's no net performance gain by=0A= doing=0Athis. For example, to free() an array, you still must walk all th= e=0Amembers first to free() their (non-Judy) substructures. My new code=0A= using libJudy has this, for example, JudySL -> JudySL -> struct, and I=0Amu= st free() each struct myself.=0A=0AOn the other hand, if libJudy itself had= a built-in concept of=0A "variable=0Asize value", it MIGHT be faster, as w= ell as simpler for the coder.=0A Then=0Athe libJudy insert, delete, and fr= ee routines would manage the memory=0Afor you, and locality of operations w= ould be maximized. But would you=0Ahave it return a valuep, and the caller= still copies their data into=0A the=0Avalue area after the Judy insertion = call returns, or would you have the=0Acaller pass a pointer to the record-t= o-save itself, and have Judy do=0A the=0Acopying? Etc.=0A=0AAnyway, the pr= oject (at HP) was canceled before we got into this kind=0A of=0Aenhancement= .=0A=0A> - The JudyNext and JudyPrev can be improved in speed, by using a= =0A> current cursor-structure as parameter.=0A=0AWhat does that mean? They= are already very fast because the next=0Aiteration starts with the previou= s index, and the libJudy nodes are=0Aalmost always in the CPU cache, ready = to go.=0A=0AThanks,=0AAlan Silverstein=0A=0A=0A=0A=0A=0A Jetzt Mails s= chnell in einem Vorschaufenster =FCberfliegen. Dies und viel mehr bietet da= s neue Yahoo! Mail - www.yahoo.de/mail |
From: Schaeferhaus <sch...@ya...> - 2008-01-16 22:03:12
|
Bonsoir,=0A=0Ai don't want to make a personal discussion with you. This is = wast time for the readers.=0AOk, using judy is very easy, you declare the j= udy array as:=0A=0Avoid *judy =3D NULL;=0Athen you always pass the address = of the array when calling the changing functions like=0AJudyIns, JudyDel,= JudyFreeArray and pass the judy-array directly when calling the reading fu= nctions:=0AJudyGet, JudyFirst, JudyNext, JudyPrev. See the examples in my p= revious posts.=0ABe aware that the JudyIns, JudyDel and JudyFreeArray can c= hange the content of the judy-array pointer.=0AYou can always pass NULL as = the last parameter. (Instead of JError), you have only to check the return = value and ignore the JError structure.=0A=0A----- Urspr=FCngliche Mail ----= =0AVon: "yj...@cl..." <yj...@cl...>=0AAn: schaeferhau= s...@ya...; jud...@li...=0ACC: dou...@ya...=0A= Gesendet: Mittwoch, den 16. Januar 2008, 22:24:26 Uhr=0ABetreff: Re: AW: Ne= w Judy version=0A=0AHi all,=0A=0ASoory for the delay. =0AI'll try to be mor= e precise when I'm talking about interfaces.=0A=0A>I don't know what is the= eternity problem with judy-interface.=0A=0AEternity you said? No problem a= t all. Just a matter of good design as=0A we're =0Anow in 2008 ;-)=0A=0A> I= 'm using=0A>Judy-arrays since years without problem (100.000.000 urls). =0A= =0AExcellent. Nice to hear this kind of usage. =0A=0ALet me share with you = something. I'm working on a web archiving field. =0A100M URLs in considered= as a sample for us (too small for what we're =0Adelivering). We deployed a= distributed Web server on the top of 129 =0Amachines for fast archives acc= ess during the last 3 years. All indexes =0Aloaded on RAM.=0A=0AAnd the ser= ver use what ? =0AJudy of course ;-)=0A=0AWe're able to access "billions" o= f documents in a fraction of time (<=0A 1sec).=0A=0AJudy performance isn't = the problem at all. We all love it.=0AThis is what I'm trying to say. Don't= misunderstand me please.=0A=0A> I think it is not a good idea to use the = macro interface as=0A recommended =0Ain the docs (it's not faster).=0A=0ACo= mpletly agree with you on this point.=0A=0A> I use the Judy-functions direc= tly=0A>without problems. Pass "&judy-array" to insert and delete functions= =0A and =0Ajudy-array to Get-functions.=0A>Ex.=0A>void *judy =3D NULL;=0A>= Word_t *pval;=0A>char s[257];=0A>pval =3D JudySLIns(&judy, s, NULL); asser= t(pval);=0A>*pval =3D *pval + 1;=0A>....=0A>pval =3D JudySLGet(judy, s, NUL= L);=0A>if(pval) { printf("%d", *pval ); }=0A>....=0A>pval =3D JudySLDel(&= judy, s, NULL);=0A>JudySLFreeArray(&judy, NULL);=0A>=0A>The interface is ve= ry easy to use and no changes are necessary. You =0Asimply declare =0A>void= *judy-array =3D NULL and use the functions directly.basta!=0A=0AI can't im= agine that to be simple. This is difficult and error prone=0A (pointer, =0A= reference here and there, NULL). =0AThink about new comers to Judy. =0AThin= k about you the first time you used Judy.=0A=0A=0A>The only drawbacks i fou= nd in judy:=0A>- you cannot store more than sizeof(*pval) as value in th= e array.=0A>You must use the pval indirectly (storing the pointer).=0A>- Th= e JudyNext and JudyPrev can be improved in speed, by using a =0Acurrent cur= sor-structure as parameter.=0A=0A=0AI'd like to add another item to the lis= t. INTERFACES.=0ALet start thinking about a clean sets of interfaces and af= ter that,=0A start =0Acoding.=0A=0AN.B: apologize for my english, It's my t= hird language.=0A=0Acheers=0AYoun=E8s=0A=0A=0A=0A=0A=0A=0A Jetzt Mails= schnell in einem Vorschaufenster =FCberfliegen. Dies und viel mehr bietet = das neue Yahoo! Mail - www.yahoo.de/mail |
From: Alan S. <aj...@fr...> - 2008-01-16 21:29:36
|
> void *judyx=NULL,*judyy=NULL; > Word_t *px,*py; > unsigned x,y; > ... > for(x=0,px = JudyLFirst(judyx, &x, NULL); px; px = JudyLNext(judyx, &x, NULL)) { > for(y=0,py = JudyLFirst(*px, &y, NULL); py; py = JudyLNext(*px, &y, NULL)) { > printf("x=%u, y=%u\n", x, y); > ... > } > } > > This interface call is much simpler as the macros. You can agree that > using this interface is more natural. Right, that was my point. The macros don't lend themselves well to for() loops, you end up writing while() loops... >> "Judy-star" (as in Judy*()), along the lines of, JudyL()'s values are >> long-words, etc. > Where can i find this. The code does not exist, sorry. There MIGHT (or might not) be some discussion of it in the SourceForge documents. It never got past the "what if" discussion stage. I know, you could write your own wrapper functions to simulate this, but it wouldn't be as nice as if libJudy could do it for you. The wrappers would have to cover any form of call you needed -- insert, retrieve, free... Note well the difference between longer or variable-length KEYS (indexes), the previously discussed topic -- also discussed length-associated versus length-terminated strings or other indexes -- and longer or variable-length VALUES. Ultimately it would be nice to have an API supporting both: "Map this M-bit key to an N-bit value." > Actualy i'm using array-of-array of unsigned to emulate 64-bits ints. Sure, that sounds reasonable... If you mean the keys/indexes are the 64-bit objects... If you are mapping from 64-bit to 64-bit, can you just use the 64-bit versions of the Judy libraries? > I don't know how Judy works internally, but i think you do a search > first (like JudyGet) of parameter passed to JudyNext/JudyPrev and then > return the next/prev element. With the cursor interface you can store > the current pointers (tree path) in the structure and use this info at > the next call (without a new search at each call). Ah, OK. Well, of course libJudy internal data is volatile, as the manual entries warn you. But in fact they only change upon insert, delete, or free. You're right that each Next call starts with locating the index in the tree... But it's important to note that for a cache-smart algorithm like libJudy, the overhead time to do this after a previous lookup is virtually zero. That's one reason we didn't worry about giving you back some kind of quickref/"cursor" for the next call. > I think the best way is that the caller copies their data into the > returned valuep area. As actually done with the returned Word_t > pointer. OK. Thanks, Alan |
From: Aleksey C. <vl...@gm...> - 2009-01-31 18:12:39
|
1 year after >> I don't know how Judy works internally, but i think you do a search >> first (like JudyGet) of parameter passed to JudyNext/JudyPrev and then >> return the next/prev element. With the cursor interface you can store >> the current pointers (tree path) in the structure and use this info at >> the next call (without a new search at each call). ... > But it's important to note that for a > cache-smart algorithm like libJudy, the overhead time to do this after a > previous lookup is virtually zero. That's one reason we didn't worry > about giving you back some kind of quickref/"cursor" for the next call. It's true that Judy uses cache-smart algorithm. This is why Judy's iteration is rather efficient. But CPU usage may also be significant constituent. Two years ago, I wrote C++ stl-like templates that implement hash tables (maps and sets) using Judy as a backend instead of linear arrays. I have benchmarks comparing time of insert/delet/lookup operations and... iteration too. Benchmarks are here http://judyhash.sourceforge.net/benchmarks/bench_size65599.html http://judyhash.sourceforge.net/benchmarks/ See the graphic about an iteration. std::map and std::set (AVL tree) and google's SPARCE abnd DENSE hashes show a better performance. I think "cursor" idea for First/Last/Next/Prev functions will make Judy even more efficient especially for the case when there are lots of sequencial keys (n, n+1 etc.) keys in the array. I personally don't see why it cannot be done or why it is conseptually wrong. The fact that Judy is cache-efficient doesn't make an idea wrong. -- Best regards, Aleksey Cheusov. |
From: Doug B. <dou...@ya...> - 2009-02-01 08:06:17
|
Aleksey: First Three things: 1) You don't need to wait a year to ping me next time. I rarely loose email, but apparently lost this one. I really like to respond to feedback when I can. 2) I am very embarrassed that I did not know about "judyhash". My first impression is I am very pleased with you work. I think I have been asked recently the Judy STL question and answered "I dont know". I now have a good answer. 3) Some early feedback: Your performance graphs seem to only go to 1.6 Million and are not LOG(10) on the X axis. My current testing of JudyL goes to 1.2 Billion and Judy1 is way above that. Your measurements would be in the first pixel or two on a Linear X-Y plot. > I personally don't see why it cannot be done or why it is conseptually wrong. The fact that Judy is cache-efficient doesn't make an idea wrong. The idea (smarter/faster JudyNext()) has been pondered way back in 2001-2002. I did some breadboards then with not very acceptable results. The instruction times were a much bigger portion of the overall performance then. Today, I see 10+ instructions being executed per nanosecond (4.2GHz Core 2 duo - E8600), while memory speeds are still near a 100 nanosecond. This trend is likely to continue -- suggesting even less improvement of smarter/faster JudyNext(). Alan Silverstein (co-developer of Judy) even wrote Judy array dump/restore routines that were included in the Judy package, but never documented. I felt they were not fast enough compared to using a "JudyNext in a loop" to justify releasing it. Plus the user API was very "iffy". Actually, I keep thinking about the idea, but it always fell to low on my priority list -- especially now. I am working on an improved version of Judy up to my eye-balls. Finally, the breadboards are showing results. I believe the new Judy will be 2 X faster when it gets released. Preliminary results of Judy1 is showing Judy1Test() times in the 20nS range when in-cache and 130nS when out-of-cache (population in the Billions). This is contrasted to the speed of a pure bitmap being accessed with the same data set of 5nS in-cache and 70nS out-of-cache (One (1) cache-line fill time with a 512MB Bitmap). I am excited about this because it shows random "Get" times of large populations of less that 2 cache-line fills. (Note: "in-cache" means the 6MByte cache on this machine, there are smaller/faster caches). So "wrong" is not the problem, low priorty is. I will certainally will take another look at this when the new Judy is more mature. I would enjoy any feedback you could give on the Judy API. My C++ knowledge could fit in one hand. An example: a JudyNext could be written to return a small array (<100) of Indexes/Values? How to invalidate it when someone does an Insert/Delete/Free? What is a good API for C++? Thanks for your interest, Doug PS. I got curious on what is the current speed of JudyNext. It measures about 25-40nS on a random data set. I am using a 4.2GHz - 16GB core 2 duo - Intel E8600. Price in the ~$1000 range. That clocks out to about 120MB-240MB of Indexes per second. That would keep up with current storage devices. However, I will keep tabs on how JudyNext is doing during future development. ----- Original Message ---- From: Aleksey Cheusov <vl...@gm...> To: Jud...@li... Sent: Sunday, February 1, 2009 1:10:16 AM Subject: Re: AW: AW: New Judy version 1 year after >> I don't know how Judy works internally, but i think you do a search >> first (like JudyGet) of parameter passed to JudyNext/JudyPrev and then >> return the next/prev element. With the cursor interface you can store >> the current pointers (tree path) in the structure and use this info at >> the next call (without a new search at each call). ... > But it's important to note that for a > cache-smart algorithm like libJudy, the overhead time to do this after a > previous lookup is virtually zero. That's one reason we didn't worry > about giving you back some kind of quickref/"cursor" for the next call. It's true that Judy uses cache-smart algorithm. This is why Judy's iteration is rather efficient. But CPU usage may also be significant constituent. Two years ago, I wrote C++ stl-like templates that implement hash tables (maps and sets) using Judy as a backend instead of linear arrays. I have benchmarks comparing time of insert/delet/lookup operations and... iteration too. Benchmarks are here http://judyhash.sourceforge.net/benchmarks/bench_size65599.html http://judyhash.sourceforge.net/benchmarks/ See the graphic about an iteration. std::map and std::set (AVL tree) and google's SPARCE abnd DENSE hashes show a better performance. I think "cursor" idea for First/Last/Next/Prev functions will make Judy even more efficient especially for the case when there are lots of sequencial keys (n, n+1 etc.) keys in the array. I personally don't see why it cannot be done or why it is conseptually wrong. The fact that Judy is cache-efficient doesn't make an idea wrong. -- Best regards, Aleksey Cheusov. ------------------------------------------------------------------------------ This SF.net email is sponsored by: SourcForge Community SourceForge wants to tell your story. http://p.sf.net/sfu/sf-spreadtheword _______________________________________________ Judy-devel mailing list Jud...@li... https://lists.sourceforge.net/lists/listinfo/judy-devel |
From: john s. <sk...@us...> - 2009-02-01 12:33:40
|
On 01/02/2009, at 6:06 PM, Doug Baskins wrote: > Aleksey: > >> I personally don't see why it cannot be done or why it is >> conseptually > wrong. The fact that Judy is cache-efficient doesn't make an idea > wrong. > > The idea (smarter/faster JudyNext()) has been pondered way back in > 2001-2002. I did some breadboards then with not very acceptable > results. I think this is to be expected. Roughly speaking: lets say we have a "cursor" 10 levels deep, implying 10 cache "blocks". With the existing Judy, searching again from scratch will be extremely fast *provided* these 10 blocks are still in the cache. Buffering information in "user memory" will actually slow Judy down because of the extra code and memory (and hence cache) required to do so. IMHO this is the design principle of Judy: let the *machine cache* remember useful state transparently to the user software. > > I would enjoy any feedback you could give on the Judy API. My C++ > knowledge could fit in one hand. An example: a JudyNext could be > written to return a small array (<100) of Indexes/Values? How to > invalidate it when someone does an Insert/Delete/Free? IN C++ "Invalidate" is a concept, NOT a physical thing, this is not different from C. In C if you free the memory a pointer points at the pointer is invalidated by the operation: this does not change the pointer, it makes using it unsafe. Judy is superior, because you CANNOT invalidate a key: it is always valid to try to fetch the associated (or next) item, an error is returned if there is none. Using an *invalid* key such as a pointer to a free'd block never returns an error (it usually either segfaults of returns wrong data). Because of "invalidation" thing, cursors DO NOT work with multi-threaded access (one thread can invalidate another thread's cursor and there is no way to tell). Whereas the current API simply requires serialisation of operations: the hardware automatically synchronises caches, even across multiple CPUs. Let me put this another way: the current API does not provide the user with any *state* other than the Judy array itself. The operations are stateless (eg "find next thing after a given one -- the given one doesn't have to exist"). Cursors are state, which is very hard to synchronise between threads. This means it is enough to mutex the store containing the Judy array pointer between threads to ensure any change to the "top pointer" is shared. > What is > a good API for C++? Much the same as the current C one. It correctly hides most of the private state already, using functional abstraction. The biggest confusion in the current interface IMHO is that "void*" and "void**" are variously used, which is confusing. In theory the base data types are K -- the key D -- the data A -- the array and some operations, such as "insert" take a K and return a D* (the place to put the data) whilst fetching returns a plain D. A template would fix this. However in practice, Judy keys are always machine words and data is either a single bit or a machine word, so for example D and K would usually either be a long (assuming that is the length of a void*), or a pointer to something, and these constraints can't be expressed for templates easily (at best, specialisations, but it isn't clear it is worthwhile). -- john skaller sk...@us... |
From: john s. <sk...@us...> - 2009-02-01 13:08:32
|
On 01/02/2009, at 6:06 PM, Doug Baskins wrote: > What is > a good API for C++? BTW: the one thing I'd ask for in a new Judy is a version which accepts length controlled strings (instead of or as well as NULL terminated strings). This should be quite easy, you pass the keys as two values: a pointer and a length count. When "recursing" along the string, the internal API just subtracts one from the length count. When the length is zero, you're at the end of the string. The *logic* would be identical to the current JudyS, only the data type of a key would change sligthtly. This API allows arbitrary binary keys (not just strings). The only trick is that the user has to ensure the keys are clean, eg padding bytes are always zero, and the keys are canonical. It would be interesting to compare "Judy LCS with length=4" for 32 bit data with JudyL on a 64 bit extension of that data: 4 byte keys instead of 8, but the overhead of one extra parameter (the length) to be passed around a loop instead of using loop unrolled 8 times (using the program counter to keep track of the tree depth). -- john skaller sk...@us... |
From: Doug B. <dou...@ya...> - 2009-02-01 17:16:31
|
John & Aleksey: Ok, I looked at the code and came up with this idea. I need some suggestions on what the API will look like. I have an initial suggestion: Word_t Index[20]; // Array to hold 1 passed and 19 returned Indexes Index[0] = 19; // Max number of returned Indexes - any numb 1..inf, // but Judy will limit it to about 0..64 PValue = JudyLNextA(PJLArray, Index, &JError); A maximum of 19 returned Indexes. Index[0] is initially 19 and must be restored on every call since it will be changed by JudyLNextA to the actual number returned. Index[1] is the requested calling parameter and must be updated with: Index[1] = Index[Index[0]]; Index[0] = 19; With every call. PValue will simply be a pointer to the array of Values associated with Index[1]..Index[n]. I suspect this would only take about a day or two to implement and would be a lot faster than the current Judy*Next(). It would also be an extremely fast way to empty (iterate?) a Judy Array. The number 19 could be any number, but Judy will decide what the actual number returned -- usually dictated by the number of Values stored consecutively. I think that it is usually a maximum of 64. The additional time of the routine will be slightly greater than the original JudyLNext() basically the time it takes to fill "Index[]". JudyLPrev(), if done will be a little weirder, because the Values are stored in ascending order, the Indexes must be returned the same way. I guess the next previous has to be returned in the: Index[Index[0]]; and only Index[0] needs updating to the "19" in every call. Not so weird after all. As far as being valid data, the same rules apply as before -- the next modification of the array (Insert, Delete, Free) makes the data very suspect. This method also be used to retrieve/store a number of Values without multiple calls to JudyLGet. I do not see any reason why all 4 api (first/last/prev/next) should not be done. As always it seems, I don't know why I did not think of this before. Doug Baskins Doug Baskins <dou...@ya...> ----- Original Message ---- From: john skaller <sk...@us...> To: Doug Baskins <dou...@ya...> Cc: Jud...@li...; Aleksey Cheusov <vl...@gm...> Sent: Sunday, February 1, 2009 7:33:19 PM Subject: Re: AW: AW: New Judy version On 01/02/2009, at 6:06 PM, Doug Baskins wrote: > Aleksey: > >> I personally don't see why it cannot be done or why it is >> conseptually > wrong. The fact that Judy is cache-efficient doesn't make an idea > wrong. > > The idea (smarter/faster JudyNext()) has been pondered way back in > 2001-2002. I did some breadboards then with not very acceptable > results. I think this is to be expected. Roughly speaking: lets say we have a "cursor" 10 levels deep, implying 10 cache "blocks". With the existing Judy, searching again from scratch will be extremely fast *provided* these 10 blocks are still in the cache. Buffering information in "user memory" will actually slow Judy down because of the extra code and memory (and hence cache) required to do so. IMHO this is the design principle of Judy: let the *machine cache* remember useful state transparently to the user software. > > I would enjoy any feedback you could give on the Judy API. My C++ > knowledge could fit in one hand. An example: a JudyNext could be > written to return a small array (<100) of Indexes/Values? How to > invalidate it when someone does an Insert/Delete/Free? IN C++ "Invalidate" is a concept, NOT a physical thing, this is not different from C. In C if you free the memory a pointer points at the pointer is invalidated by the operation: this does not change the pointer, it makes using it unsafe. Judy is superior, because you CANNOT invalidate a key: it is always valid to try to fetch the associated (or next) item, an error is returned if there is none. Using an *invalid* key such as a pointer to a free'd block never returns an error (it usually either segfaults of returns wrong data). Because of "invalidation" thing, cursors DO NOT work with multi-threaded access (one thread can invalidate another thread's cursor and there is no way to tell). Whereas the current API simply requires serialisation of operations: the hardware automatically synchronises caches, even across multiple CPUs. Let me put this another way: the current API does not provide the user with any *state* other than the Judy array itself. The operations are stateless (eg "find next thing after a given one -- the given one doesn't have to exist"). Cursors are state, which is very hard to synchronise between threads. This means it is enough to mutex the store containing the Judy array pointer between threads to ensure any change to the "top pointer" is shared. > What is > a good API for C++? Much the same as the current C one. It correctly hides most of the private state already, using functional abstraction. The biggest confusion in the current interface IMHO is that "void*" and "void**" are variously used, which is confusing. In theory the base data types are K -- the key D -- the data A -- the array and some operations, such as "insert" take a K and return a D* (the place to put the data) whilst fetching returns a plain D. A template would fix this. However in practice, Judy keys are always machine words and data is either a single bit or a machine word, so for example D and K would usually either be a long (assuming that is the length of a void*), or a pointer to something, and these constraints can't be expressed for templates easily (at best, specialisations, but it isn't clear it is worthwhile). -- john skaller sk...@us... ------------------------------------------------------------------------------ This SF.net email is sponsored by: SourcForge Community SourceForge wants to tell your story. http://p.sf.net/sfu/sf-spreadtheword _______________________________________________ Judy-devel mailing list Jud...@li... https://lists.sourceforge.net/lists/listinfo/judy-devel |
From: Doug B. <dou...@ya...> - 2009-02-02 03:20:27
|
John: I assume that you want it sorted by some means since JudyHS() already does it. I also assume that you want a lexicographical sort. If you pass anything other than a uint8_t string, the results will be different in big and little Endian machines. Things like JudySLNext wont be of much use with non-uint8_t strings. Binary (non-uint8_t) sorting becomes tricky and machine dependent in any case. I don't want to spend the rest of my life explaining that to people. There are other problems too, but I can't remember them right now. I have already decided to use 4 bytes decodes in 64 bit versions of JudySL on the new version. JudySL is surprisingly fast because of the low entropy of sorted text strings and get worse the longer number of decode bytes. Sorted numbers also have very low entropy. I might use that more effectively with the new Judy -- if I have time? Well, I have the rest of my life. My ultimate goal is to finally put hash methods to rest. Doug Doug Baskins <dou...@ya...> ----- Original Message ---- From: john skaller <sk...@us...> To: Doug Baskins <dou...@ya...> Cc: Jud...@li...; Aleksey Cheusov <vl...@gm...> Sent: Sunday, February 1, 2009 8:08:11 PM Subject: Re: AW: AW: New Judy version On 01/02/2009, at 6:06 PM, Doug Baskins wrote: > What is > a good API for C++? BTW: the one thing I'd ask for in a new Judy is a version which accepts length controlled strings (instead of or as well as NULL terminated strings). This should be quite easy, you pass the keys as two values: a pointer and a length count. When "recursing" along the string, the internal API just subtracts one from the length count. When the length is zero, you're at the end of the string. The *logic* would be identical to the current JudyS, only the data type of a key would change sligthtly. This API allows arbitrary binary keys (not just strings). The only trick is that the user has to ensure the keys are clean, eg padding bytes are always zero, and the keys are canonical. It would be interesting to compare "Judy LCS with length=4" for 32 bit data with JudyL on a 64 bit extension of that data: 4 byte keys instead of 8, but the overhead of one extra parameter (the length) to be passed around a loop instead of using loop unrolled 8 times (using the program counter to keep track of the tree depth). -- john skaller sk...@us... ------------------------------------------------------------------------------ This SF.net email is sponsored by: SourcForge Community SourceForge wants to tell your story. http://p.sf.net/sfu/sf-spreadtheword _______________________________________________ Judy-devel mailing list Jud...@li... https://lists.sourceforge.net/lists/listinfo/judy-devel |
From: john s. <sk...@us...> - 2009-02-02 12:11:21
|
On 02/02/2009, at 1:20 PM, Doug Baskins wrote: > John: > > I assume that you want it sorted by some means since JudyHS() > already does it. Not sure I remember what is what: I'm looking for maps: uint8_t * -> WORD uint8_t * -> bool where X* is the Kleene closure (i.e. a string) of X. Judy already provides this for NULL terminated strings, I want an 8 bit clean version that uses a count instead. -- john skaller sk...@us... |
From: Alan S. <aj...@fr...> - 2009-02-02 19:11:31
|
> Not sure I remember what is what: I'm looking for maps... I wrote a paper back in 2000-2001 discussing issues and ideas for how to do variable-length keys in libJudy. It might be worth a look if you can find it. Of course it focused on how to do them natively, whereas JudySL is an array of array of JudyL. It's not hard now to create variable-length libJudy keys using JudyL arrays if you don't mind the first JudyL array being a sort-by-length, that is, everything in each subtree is a binary string of the same length (in bytes or bits, your choice). If you also want lexicographical or at least MSB to LSB sorting, that's a much harder problem. The paper discussed this at length. Cheers, Alan Silverstein |
From: zooko <zo...@zo...> - 2009-02-15 15:14:32
|
Dear Doug Baskins: Are you planning to open-source your new version of Judy? If so, a good practice is to go ahead and publish what you've got, even if it doesn't compile, is ugly, doesn't really show what you mean, and so on. This practice of "release early, release often" has a lot of good effects, one of which is to assure people that the ultimate result, when it arrives, will be open-source. This sort of thing could be important for some long-term plans. For example, I have a scheme in mind for a new data structure, inspired by the success of JudyTrees but probably quite different in its design. If I'm sure that your next project will be open source, then I think of you as a partner -- two different teams exploring adjacent areas of the design space. If I think that you (or someone) might assert exclusive rights over your next project, then I think of that project as a potentially dangerous competitor to mine, and I want to withhold information from you. My own project is in such a nascent stage that it is hardly worth mentioning and I wouldn't want to waste your time with it until I've done some actual experiments so I can report real numbers. But I thought I would mention it so as to encourage you to "work in public" by publishing your incomplete work, in order to elicit such cooperation from people other than me who may already have more to offer. Regards, Zooko Wilcox-O'Hearn |
From: Doug B. <dou...@ya...> - 2009-02-15 18:37:19
|
Zooko: Yes, the new version will be open sourced. I am retired and living my dream of doing research on high speed data structures that support the things that Judy does. I want my work to be public and free forever. My goal is to make it so fast and efficient that even hash methods cannot measure up to the speed of Judy. I do not think it will be very long before I will have new prototype versions of Judy available for inspection and testing. My current design is somewhat different then the published Judy. It takes far more advantage of the CPU Cache. I am also trying to write it so that it is easier to read and understand than the current Judy. It will be interesting to see if people measure it as faster than hashing methods. My current measurements are very encouraging. I hope I get a lot of feedback and help with the API. Some things, such as Next/Prev will be many times faster than the current Judy. A comment on your project. One thing I have learned over and over again is: there are always better ways to do things even when you think you got the best way possible designed. I remember thinking about how to take better advantage of the cache I was not using on the current Judy back in 2001. I now feel very naive, because it was a fairly simple/obvious thing to do. Most of the things in the new Judy design will appear as "that's a fairly obvious way of doing it". But it took me years to think of it -- stupid me, I was asking the wrong questions. Thanks for you interest, Doug Doug Baskins <dou...@ya...> ----- Original Message ---- From: zooko <zo...@zo...> To: Doug Baskins <dou...@ya...> Cc: Jud...@li... Sent: Sunday, February 15, 2009 10:12:03 PM Subject: Re: AW: AW: New Judy version Dear Doug Baskins: Are you planning to open-source your new version of Judy? If so, a good practice is to go ahead and publish what you've got, even if it doesn't compile, is ugly, doesn't really show what you mean, and so on. This practice of "release early, release often" has a lot of good effects, one of which is to assure people that the ultimate result, when it arrives, will be open-source. This sort of thing could be important for some long-term plans. For example, I have a scheme in mind for a new data structure, inspired by the success of JudyTrees but probably quite different in its design. If I'm sure that your next project will be open source, then I think of you as a partner -- two different teams exploring adjacent areas of the design space. If I think that you (or someone) might assert exclusive rights over your next project, then I think of that project as a potentially dangerous competitor to mine, and I want to withhold information from you. My own project is in such a nascent stage that it is hardly worth mentioning and I wouldn't want to waste your time with it until I've done some actual experiments so I can report real numbers. But I thought I would mention it so as to encourage you to "work in public" by publishing your incomplete work, in order to elicit such cooperation from people other than me who may already have more to offer. Regards, Zooko Wilcox-O'Hearn ------------------------------------------------------------------------------ Open Source Business Conference (OSBC), March 24-25, 2009, San Francisco, CA -OSBC tackles the biggest issue in open source: Open Sourcing the Enterprise -Strategies to boost innovation and cut costs with open source participation -Receive a $600 discount off the registration fee with the source code: SFAD http://p.sf.net/sfu/XcvMzF8H _______________________________________________ Judy-devel mailing list Jud...@li... https://lists.sourceforge.net/lists/listinfo/judy-devel |