From: Alan S. <aj...@fr...> - 2014-02-24 05:25:06
|
Turns out I kept around, but forgot, two different libJudy source trees for myself; one the early official SourceForge one, and another a slightly earlier initial snapshot. (Didn't have to go to a CDROM at all.) In the latter I found these files of interest (per the HP internal source tree layout): doc/ext/JudyNL_3x.htm doc/int/JudyNL_issues src/JudyNL/README src/JudyNL/JudyNL.h src/JudyNL/JudyNL.c test/manual/JudyNLCheck.c So I built a little tarchive, just 71 Kb, no point in even compressing it. If anyone wants a copy, email to me only and I'll send it back to you as an attachment. But first note the text appended below, which is the majority of the issues file (from Dec 2001) mentioned above. Bear in mind that JudyNL is "just" an array-of-array... of JudyL, where the top level keys are key lengths, and each successive subarray carries four successive bytes of the keys. Cheers, Alan Silverstein ------------ ...Doug was not happy about several things, if I recall right: * JudyNL sorts indexes by length first rather than lexicographically (left to right). I have a paper written that discusses how we might support variable-size keys with lexicographic sorting (available on request), but have not implemented it, and it's painful to look at the data structs required. Dunno how efficiently it would operate either. * JudyNL insists that index lengths be whole words, though for HPUX libc compatibility the length is expressed in bytes (which must be a multiple of the word size). Now, you could modify the JudyNL code to express length however you like (bits, bytes, words, nibbles...), the only catch (additional work not now performed) is needing to null-pad indexes that are not whole words to ensure single unique insertion in the meta-trie, since the underlying trie nodes = JudyL arrays operate on words. JudySL already includes this type of padding, but JudyNL has no need for it. * JudyNL is possibly overkill for a given application because it allows a variety of index sizes in a single array, at a cost in complexity and overhead, when perhaps any given user or application might want only one or a set of single-size index arrays (fixed-depth trees), like Judy2L. * Unlike JudySL, JudyNL has no provision internally for "shortcut leaves" at this time, meaning a unique suffix still descends M (< N) additional levels of meta-trie, equivalent to single-child branch nodes in a simple trie. Not too big a deal if N is relatively small, whereas JudySL strings are more likely (?) to be quite long and have long unique suffixes. * Due to not shipping it, JudyNL code is now out of date with respect to error handling and also has no automated regression tests. * And ultimately, no user had requested this feature, there was no "pull", unlike JudySL. We created JudyNL on my initiative, after I recognized the difference between length-terminated and length-associated variable-size indexes, I decided to implement both (and internally they are quite different). |
From: john s. <sk...@us...> - 2014-02-24 05:59:02
|
On 24/02/2014, at 4:24 PM, Alan Silverstein wrote: > Bear in mind that JudyNL is "just" an array-of-array... of JudyL, where > the top level keys are key lengths, and each successive subarray carries > four successive bytes of the keys. It had better be 8 bytes on a 32 bit machine! > > Cheers, > Alan Silverstein > > ------------ > > ...Doug was not happy about several things, if I recall right: > > * JudyNL sorts indexes by length first rather than lexicographically > (left to right). Yep, that sucks. > * JudyNL is possibly overkill for a given application because it allows > a variety of index sizes in a single array, at a cost in complexity > and overhead, when perhaps any given user or application might want > only one or a set of single-size index arrays (fixed-depth trees), > like Judy2L. Yes, but the user can more or less easily implement that. At least, given JudyL I can implement it easily :) > * And ultimately, no user had requested this feature, there was no > "pull", unlike JudySL. We created JudyNL on my initiative, after I > recognized the difference between length-terminated and > length-associated variable-size indexes, I decided to implement both > (and internally they are quite different). Doug's right on the ordering. The whole point is to get lexicographically ordered C++ strings which might include nul bytes. -- john skaller sk...@us... http://felix-lang.org |
From: Alan S. <aj...@fr...> - 2014-02-24 06:34:23
|
>> Bear in mind that JudyNL is "just" an array-of-array... of JudyL, where >> the top level keys are key lengths, and each successive subarray carries >> four successive bytes of the keys. > It had better be 8 bytes on a 32 bit machine! Right, sorry... I was just being casual. And, did you mean 8 bytes on a 64-bit machine? :-) >> JudyNL sorts indexes by length first rather than lexicographically >> (left to right). > Yep, that sucks. In general the problem of how to EFFICIENTLY sort variable-length keys lexicographically, where the length (end) of each key is not marked within the key itself, is a very hard one. Somehow you must associate a length with each key, while not having it affect the sort order except in the sense that a pure substring comes before any of its superstrings. If you know how libJudy and other radix trees work, they don't replicate common leading portions (bits) of similar keys, which saves a lot of storage space, and often some time too. (Compare with hashing, where entire keys are stored in every node for synonym disambiguation.) The paper I wrote on this subject long ago talked about tricky ways to modify the JudyL nodes to know when any given key ended within the node, without looking for a bit/byte pattern within the key itself. Ya know, I just thought of a new (maybe) idea. What if, similar to COBS or base-64 but without actually modifying/encoding the key, some pattern within the key (appended to the end of every key) conditionally marked the end of the key? Choose some byte or multi-byte value likely to be rare in the real world -- although in general, a priori, there's no such value, you might as well just use a single null byte. Anyway upon seeing this pattern you need some way (native within the tree, can't build it on top of vanilla JudyL for example) to check if it's truly a terminator or just an accident. Hmm... But since EVERY key has SOME termination (length), it's like you end up needing a whole other "length dictionary" tree to somehow reference when checking for a possible termination. Already inefficient, and I can't begin to guess what the lookup itself might consist of. As long as I'm rambling, some of this is coming back to me now. Recall that a key NOT terminating in a given JudyL subarray level has associated a value that's a pointer to a sub-subarray for all keys sharing the same prefix through this point. However for each 4-byte or 8-byte word (portion of keys) that "continues through" this JudyL level (superstrings), there can be myriad other keys (substrings) that terminate at this level (length) with, say, 1-3 or 1-7 null padding bytes appended. Each of those is unique at this level, but how do you distinguish, say in a 4-byte word, the value 0x11121300 (key ends after 0x13 byte) from the same byte pattern that's the leading part (all four bytes) of a longer key? You only get one value = subarray pointer from JudyL. Now remember how since libJudy objects are always at least 4 bytes in size (or was it actually 8, even on a 32-bit word machine), the 2 or more least significant bits of any pointer can hold secret information about where the pointer points. (These bits are masked off before dereferencing the pointer.) I think my notion was that, first if there was any unused bit pattern (not sure there was), you could tell "key(s) end here" from "superstrings only" in the value word. In the first case, the pointer would go to a new object, not a JudyL array, that would describe the key(s) terminating here, plus point to a subsidiary JudyL array with the superstrings. You know, this could be constructed using unmodified JudyL arrays, but adding some new data structure too that holds "leaf" information for leaves when they terminate. Also similar to JudySL, it would help a lot to shortcut long tail ends of unique keys/substrings (remainders). Got all that? :-) Anyway in principle this is not an important problem, because different users want to sort different kinds of keys in many different ways, including handling multibyte characters and embedded escape codes. In practice, though, we all know how useful plain old lexicographical sorting often is. >> ...perhaps any given user or application might want only one or a set >> of single-size index arrays (fixed-depth trees), like Judy2L. > Yes, but the user can more or less easily implement that. At least, > given JudyL I can implement it easily :) Right, that's what Judy2L did, it was just a "two word deep tree." I should have those sources too if anyone cares. Cheers, Alan |
From: john s. <sk...@us...> - 2014-02-24 07:29:04
|
On 24/02/2014, at 5:34 PM, Alan Silverstein wrote: First: lets assume multiples of word size only. This is WLOG due to the Ocaml trick: store the length of the string modulo word size at the end of the string. This preserves lexicographical ordering. For example a single byte 100 is encoded as (100,0,0,1) Four bytes: (100,101,102,103),(0,0,0,0) [32 bit machine] Or something like this .. :) > bytes appended. Each of those is unique at this level, but how do you > distinguish, say in a 4-byte word, the value 0x11121300 (key ends after > 0x13 byte) from the same byte pattern that's the leading part (all four > bytes) of a longer key? You only get one value = subarray pointer from > JudyL. There are two issues, one is how to track the length when recursing down the tree. That's easy in principle, just maintain a depth counter. It's probably hard in practice. The other problem is even easier. If there is a string ending in this node, then set the low order but of the JudyL value. Mask it back to zero when looking for the next array. Use NULL if there are no keys with that proper prefix (with the low bit set). Hmm .. this could be done now using JudySL. Instead of storing an integer value, store a pointer to a structure containing a flag and an integer value. If the flag is set, the null byte really was the end of the string, use the integer value. If the flag is not set, its not the end of the string, cast the integer value to a pointer which is the next JudySL array. Null butes would be very expensive :) -- john skaller sk...@us... http://felix-lang.org |
From: Alan S. <aj...@fr...> - 2014-02-24 21:28:52
|
John et al, thanks for playing -- > First: lets assume multiples of word size only. Not really nice, although presently built into JudyNL of course. > This is WLOG due to the Ocaml trick... WLOG? Ocaml? > ...store the length of the string modulo word size at the end of the > string. OK, that means on a 4-byte machine, a value of 0-3 at the end of the string? How do you distinguish this from any random bits within the key? > This preserves lexicographical ordering. Right. > > For example a single byte 100 is encoded as > > (100,0,0,1) OK, somehow the 1 at the end means "only the first byte matters, and my key/string ends there." What about a superstring containing (100,0,0,1) at this point and then continuing? Maybe that's why you said assume word-chunks only? > Four bytes: > > (100,101,102,103),(0,0,0,0) > > [32 bit machine] Lost me there, is the last 0 the marker for end of string? > Or something like this .. :) Or something. :-) > There are two issues, one is how to track the length when recursing > down the tree. That's easy in principle, just maintain a depth > counter. It's probably hard in practice. No, as you go down the tree for any reason, you know how deep you are, if you care, but that's not really the issue. The crux problem is how to EFFICIENTLY distinguish any terminating substrings (padded say with nulls) from continuing superstrings that happen to contain the same bytes. > The other problem is even easier. If there is a string ending in this > node, then set the low order but of the JudyL value. Mask it back to > zero when looking for the next array. Use NULL if there are no keys > with that proper prefix (with the low bit set). Careful with that, look at the header files, there are already bit patterns stored within Judy node pointers that libJudy itself uses. This includes a root pointer to a whole JudyL array. > Hmm .. this could be done now using JudySL. Instead of storing an > integer value, store a pointer to a structure containing a flag and an > integer value. Sure, but now you doubled the size of the tree, more or less, or at least the depth, with intermediate nodes between every JudyL array. (Well maybe not 2X in full reality since each JudyL array has 1-4? internal nodes => cache line misses.) > If the flag is set, the null byte really was the end of the string, > use the integer value. If the flag is not set, its not the end of the > string, cast the integer value to a pointer which is the next JudySL > array. Null butes would be very expensive :) Yeah, plus if you solve this at all, you really want to support arbitrary string lengths, just like JudySL. Cheers, Alan |
From: john s. <sk...@us...> - 2014-02-24 23:59:46
|
On 25/02/2014, at 8:28 AM, Alan Silverstein wrote: > John et al, thanks for playing -- > >> First: lets assume multiples of word size only. > > Not really nice, although presently built into JudyNL of course. > >> This is WLOG due to the Ocaml trick... > > WLOG? Ocaml? WLOG is mathspeak for Without Loss of Generality. Sorry :) Ocaml is a programming language. Ocaml strings are multiple words with null padding at the end plus an indicator how many bytes of padding got added. > >> ...store the length of the string modulo word size at the end of the >> string. > > OK, that means on a 4-byte machine, a value of 0-3 at the end of the > string? How do you distinguish this from any random bits within the > key? There's a length control, but it counts machine words of the string not bytes. > >> Or something like this .. :) > > Or something. :-) The data structure is well document in Ocaml. So we know it works :) > > Careful with that, look at the header files, there are already bit > patterns stored within Judy node pointers that libJudy itself uses. > This includes a root pointer to a whole JudyL array. Oops :) -- john skaller sk...@us... http://felix-lang.org |
From: Alan S. <aj...@fr...> - 2014-02-25 00:34:29
|
John et al, > WLOG is mathspeak for Without Loss of Generality. Sorry :) OK, WTF. :-) > Ocaml is a programming language. Ocaml strings are multiple words > with null padding at the end plus an indicator how many bytes of > padding got added. OK, I guess if you know the language, then the format is a good metaphor. So how do you know when you're in the last word, to say it's time to stop instead of continuing? Any pattern you can invent with nulls and a count, I can assert to exist somewhere in the middle of a longer string, so long as all bytes 0..255 are valid in strings. (No distinguished values like a terminator.) So far this just sounds to me like a variation on C length-terminated (by nulls) with a fancier termination code. Anyway if I had more time and energy, I'd be inclined to revisit this topic and maybe even write some CODE. Well, maybe someday when I'm caught up with O(N^2) other retirement tasks! OK, at least you motivated me to glance at Judy.h. I found this: // The JLAP_INVALID pattern never appears in a JudyL array pointer. Hence // applications that build trees of JudyL arrays can use this pattern to // distinguish a JudyL array value that is a pointer to another JudyL array // from one that is a pointer to some other, application-defined object. OK I remember now, there's ONE pointer LSB pattern an application can use to embed a non-Judy pointer (JudyL value) in a meta-trie (array of array of JudyL). This should be sufficient! Basically if all common substrings through some point in the meta-trie continue at least one level deeper, then you just make the present JudyL value be a pointer to another JudyL array. But if one or more substrings terminate "at this level" then you store a pointer (to your own user data) with JLAP_INVALID to mark it as such. Of course your meta-library must encode/decode this above libJudy while descending the meta-trie for any operation. "Your own user data" is some kind of node/struct that must include at least one of: A value area for 1..4 or 1..8 terminating substrings (depending on word size), and possibly also one of: A value area holding a pointer to a subsidiary JudyL array for superstrings. For example, suppose I store ALL of these trailing substrings: ... 0x01 ... 0x0100 ... 0x010000 ... 0x01000000 ... 0x01000000 0x05... ... 0x0102 ... 0x010200 ... 0x01020000 ... 0x01020000 0x05... ... 0x010203 ... 0x01020300 ... 0x01020300 0x05... ... 0x01020304 ... 0x01020304 0x05... Each group above occupies one key in a given JudyL array (meta-trie node), but as you can see, in the worst case (the first group), four different terminating strings plus one or more superstrings is represented. All appear under a four-byte key of 0x01000000 (since nulls are appended to shorter substrings). The value associated with that key is a JLAP_INVALID pointer to my own data structure where I list which combinations are valid, like this (informally): slot 1: yes + value # meaning 0x01 exists. slot 2: yes + value # meaning 0x0100 exists (and so on). slot 3: yes + value slot 4: yes + value slot 5: yes + value (which is a pointer to a JudyL array) Having never coded this, I don't know if it would pan out, or how it would perform. The size would have to be at least 5 bits + 5 words, or really 6 words, on a 32-bit machine. Just to be clearer, suppose I only stored these three strings/keys in an entire improved JudyNL array: length = 1: 0x01 length = 5: 0x01000000 0x05 length = 8: 0x01000000 0x05060708 The 0x01000000 key in the top JudyL array would point (with JLAP_INVALID) to a JudyNL special node that says: slot 1: yes + value # meaning 0x01 exists, and here's its value area. slot 2: no slot 3: no slot 4: no slot 5: yes + value (which is a pointer to a JudyL array containing 0x05 and 0x05060708) Got it? Cheers, Alan |
From: john s. <sk...@us...> - 2014-02-25 04:37:25
|
On 25/02/2014, at 11:34 AM, Alan Silverstein wrote: > > So how do you know when you're in the last word, to say it's time to > stop instead of continuing? Length count, in words instead of bytes. -- john skaller sk...@us... http://felix-lang.org |