From: skaller <sk...@us...> - 2007-05-25 16:42:25
|
Hi, well after many posts to Alan S., I finally found my problem .. RTFM. JudyFirst is actually find key greater or equal, which requires an initial value ;( Why do we have the almost useless JudySL functions (no one uses null terminated string anymore), and then JudyHS, which uses a length but can't iterate (in any order) over the available keys? I'd sure like JudySL to use a length instead of a null terminator, and JudyHS to be able to iterate over all the keys. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net |
From: <zo...@zo...> - 2007-05-25 16:46:35
|
John Skaller wrote: > > Why do we have the almost useless JudySL functions (no one uses > null terminated string anymore), and then JudyHS, which uses a > length but can't iterate (in any order) over the available keys? > > I'd sure like JudySL to use a length instead of a null terminator, > and JudyHS to be able to iterate over all the keys. Me, too! Regards, Zooko |
From: Alan S. <aj...@fr...> - 2007-05-25 19:00:14
|
John et al, > ...well after many posts to Alan S., I finally found my problem .. > RTFM. JudyFirst is actually find key greater or equal, which requires > an initial value ;( We've been exchanging email offline. I'm finally getting geared up to trying being more helpful to John. My own takeaway on the J1F() issue is that the API, code, and docs are still good and correct, but as always, it's possible to have a "UI disconnect" and an incorrect mental model. It happens to everyone, and it's hard to know how to prevent it. > Why do we have the almost useless JudySL functions (no one uses null > terminated string anymore)... This is a highly subjective statement, and you know that, right? :-) There's a writeup somewhere about using JudyL as the base for long (but fixed) or variable-length key to value mappers. Here's a paraphrase of it from memory... You can think of JudyL as enabling technology for a ridiculously wide, shallow tree where each node has a fanout of 2^32 or 2^64 (your native word size). An interesting problem is how to know the length of a length-associated rather than length-terminated key/index WHILE keeping them sorted in lexicographic order (if you care). Judy being a by-expanse (radix) tree, it naturally radix-sorts indexes. (True sort(1) capability, of course, is not just lexicographic, so Judy is not an appropriate technology "just for sorting" in the general case.) As discussed in the paper, C-style null-terminated strings are sorted, and that's what JudySL provides. Why have it? It works for C programs, it's handy, and it was a good demo too. Now, Perl-type strings allow any char in the value, including a null char, and thus are what I call length-associated. The simple way to store these kinds of variable-length keys is to start with a top-level JudyL array by length. Each value points to a subarray of array... of keys all of that same length. The paper talks about ways of simultaneously lexico-sorting length-associated variable-length keys, and knowing when to terminate by length. I never found a really "pretty" solution. You just have to have some kind of indication "out at the end" that you are terminating. All of this grew out of me reading academic papers late in the Judy project and realizing that, at least to academics, the long/variable length key problem was the only one worth addressing. I think they glossed over practical reality, but what the heck... One more note. I always have to remind MYSELF that, yes, it's fine and dandy and efficient to "waste a whole Judy array" when you need it, such as for the initial sort-by-length. That's the beauty of the Judy library as an engine for solving higher-level problems. It's efficient in space and time for 0,1,2,... N (peta-level) indexes. > ...and then JudyHS, which uses a length but can't iterate (in any > order) over the available keys? I know little about JudyHS, Doug added it after we LGPL'd the stuff. Doug? > I'd sure like JudySL to use a length instead of a null terminator, and > JudyHS to be able to iterate over all the keys. So, go write what you need. :-) The JudySL source code should be a fine example. I do recommend you find and read my short paper on variable length keys before going too far. Cheers, Alan Silverstein |
From: skaller <sk...@us...> - 2007-05-26 02:35:37
|
On Fri, 2007-05-25 at 13:00 -0600, Alan Silverstein wrote: > John et al, > > Why do we have the almost useless JudySL functions (no one uses null > > terminated string anymore)... > > This is a highly subjective statement, and you know that, right? :-) Nope. Today we have Unicode, and people use binary data a lot. Null terminated strings are *out* in quality code, in most places. C++ doesn't use them, and most code people say the code is '8 bit clean' to mean that you can put 0 char in a string and the code will still work. So if the claim is subjective it's backed by considerable experience in with many different systems. > There's a writeup somewhere about using JudyL as the base for long (but > fixed) or variable-length key to value mappers. Of course, JudyL can be used to do this, either by making a tree of JudyL arrays, or better, a hybrid data structure using some Judy fanout followed by some other data structure such as a hashtable or just a plain list. However, I played with my own trie implementation, and had no trouble with variable length strings, where the length was an index -- you just have to pass the length around. In fact 'JudyL' should really be a special case of this with length = sizeof(void*). > Now, Perl-type strings allow any char in the value, including a null > char, and thus are what I call length-associated. The simple way to > store these kinds of variable-length keys is to start with a top-level > JudyL array by length. Each value points to a subarray of array... of > keys all of that same length. There are certainly ways to use the existing technology, no doubt about that. The question is why the most general interface is not provided. JudyHS stores the right kind of data but there are no iterators, which limits its utility. JudySL provides the iterators but limits the data. > One more note. I always have to remind MYSELF that, yes, it's fine and > dandy and efficient to "waste a whole Judy array" when you need it, such > as for the initial sort-by-length. That's the beauty of the Judy > library as an engine for solving higher-level problems. It's efficient > in space and time for 0,1,2,... N (peta-level) indexes. The only reason not to provide a SINGLE interface with variable bit length keys is optimisation. JudyL + length count then subsumes Judy1, JudySL and JudyHS. People wishing to use 0 terminated strings just have to call strlen() on the argument, and learn to stop using crappy data structures. Secure software never uses 0 terminated strings .. it's not safe to do so. More generally, one can envisage a structure vector: { fan1, fan2, ... fan_rest } which are the number of bits in each fanout, however this might be hard to implement efficiently. in addition the size of the stored data could be variable: it's a pain using a pointer to two words of data, since it costs 3 words and create a memory management problem. Of course the POINT of Judy is to heavily optimise special cases chosen for their utility, so I'm not advocating the most general possible interface. .... Just wishing JudySL used a length count, not null termination, since 99% of all uses will be storing binary data which mandates inclusion of null bytes. Felix strings, for example, are C++ strings, and are specified as 8 bit clean so I can't put them in a JudySL array, and I can't use JudyHS either since there are no iterators. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net |
From: Alan S. <aj...@fr...> - 2007-05-27 00:11:28
|
John et al, > Today we have Unicode, and people use binary data a lot. Null > terminated strings are *out* in quality code, in most places. C++ > doesn't use them, and most code people say the code is '8 bit clean' > to mean that you can put 0 char in a string and the code will still > work. Fine. I think it all depends on context. Ironically, today I'm being paid to retrofit libJudy to some old C software that is very much valued in daily production use, and JudySL has a definite place in that context. (First time since 2002 or so that I've actually used Judy for something.) > Of course, JudyL can be used to do this, either by making a tree of > JudyL arrays, or better, a hybrid data structure using some Judy > fanout followed by some other data structure such as a hashtable or > just a plain list. Right. > In fact 'JudyL' should really be a special case of this with length = > sizeof(void*). As I described, nothing stops you from building JudyVL (let's call it, where V = variable length key) using JudyL as the engine for your tree nodes. > There are certainly ways to use the existing technology, no doubt > about that. The question is why the most general interface is not > provided. Because... - the original motivation for libJudy was to map words to words, and that alone is a common and interesting problem - we wrestled with some very difficult lower-level concepts for a long time, that had to do with bits, bytes, and words, to get to the Judy IV that was LGPL'd - variable-length keys weren't high enough on our list at the time - we ran out of time (although I did explore the topic, as related previously) It might just come down to expectations. If you accept Judy for what it is, not expect it to be more than it is, it still has great value, both as an implementation and (once you really understand the data structures and algorithms), as a philosophical education. ("How did they do that?") The latter I tried to explain as best I could in the Shop Manual, in lieu of time/energy to create formal academic-style documents. > The only reason not to provide a SINGLE interface with variable bit > length keys is optimisation. JudyL + length count then subsumes > Judy1, JudySL and JudyHS. Don't be so quick to judge that. As I learned from Doug, you don't know anything about performance until you look at what's actually generated and run. You might find that, unless you screw it up somehow by being careless, clever, or not knowing enough, if you build a JudyVL on top of JudyL, the resulting code and run-time path looks very much like what you might create if you somehow engineered it all at a lower level "inside the Judy tree". That's why I like to say that you can think of JudyL as being very much like a hash table of hash table... where the hash algorithm is, "index on next byte" (fast and simple although the nodes are often "compressed"), and the synonyms are handled by the next level hash table. Turns out the code acts very much like that. > People wishing to use 0 terminated strings just have to call strlen() > on the argument, and learn to stop using crappy data structures. Your quickness to demean anything you don't like, grows tiresome. > Just wishing JudySL used a length count, not null termination, since > 99% of all uses will be storing binary data which mandates inclusion > of null bytes. Then it wouldn't be JudySL. Check out the JudyNL code that I think is already in the library but undocumented. IF I recall right, it's in there, although Doug wouldn't let me fully support and document it. Alan |
From: skaller <sk...@us...> - 2007-05-27 01:16:38
|
On Sat, 2007-05-26 at 18:11 -0600, Alan Silverstein wrote: > > People wishing to use 0 terminated strings just have to call strlen() > > on the argument, and learn to stop using crappy data structures. > > Your quickness to demean anything you don't like, grows tiresome. As an ex member of the ISO C and C++ standardisations committees I would say I'm entitled to demean C and C++ .. since I'm one of the people that helped create these monsters .. they're my monsters to criticize :) -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net |