From: Alan Silverstein <email@example.com>
Sent: 2:11, Wednesday, 21 November 2012
Subject: Re: Prefix-only search?
> yes, lexicographic sorting is needed.
OK, it sounds like in your application lexicographic ordering is
valuable for some reason.
> This is necessary for multi-field keys in databases (the null
> character can occur within the key).
OK, but don't confuse length-associated keys (so the variable-length
key can contain any byte value) with lexicographical sorting. The
latter is only important if for some reason you want to use First/Next
(or the converse) to visit the keys in lexicographic order, rather than
some other order.
> Sorting by length like in JudyHS cannot be used.
> The big advantage of the Judy arrays is the lexicographic ordering.
For some people it's a complete don't-care; for others libJudy isn't
good enough because it's not "real" sorting (multibyte, various keys,
etc.; see sort(1)); but for some applications a simple sorted order
(other than by length) does have some value.
> Otherwise you can simply use hashing, which is faster.
Not always! Hashing is hard to beat given a simple/fast hash algorithm
with good spectral properties (short synonym chains) for the particular
sparse data being saved. The problem with hashing is that it's
"fragile" in real-world applications where the programmer doesn't
understand it very well, doesn't tune it up well to match the data, or
(most often) doesn't keep it maintained well as the dataset grows.
I've personally sped up old (buried and forgotten) hashing code by 10x
just by retuning a few parameters.
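That fragility is easy to quantify. A minimal sketch (the names and
the djb2-style hash are my own illustration, not libJudy or anyone's
real code): with a fixed bucket count, the average synonym-chain length
grows linearly with the dataset, and "retuning" can be as simple as
picking a bucket count that restores a load factor near 1.

```c
#include <stddef.h>

/* Illustrative only: a simple/fast string hash (djb2-style)
 * with reasonable spectral properties for typical string data. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Expected synonym-chain length: n keys spread over b buckets.
 * A table tuned once for 1,000 keys and later holding 1,000,000
 * does ~1000 probes per lookup until someone retunes b. */
static double avg_chain(size_t n, size_t b)
{
    return (double)n / (double)b;
}
```

For example, avg_chain(1000000, 1024) is about 977 probes per lookup,
while avg_chain(1000000, 1048576) is under 1; that gap is the kind of
10x-plus win retuning buys.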
>> Storing variable-length, length-associated keys sorted
>> lexicographically is actually a fairly hard problem.
> I don't know the internals of the Judy arrays, but I think you can
> store the common prefixes as is actually done, and store the
> suffixes as a variable-length array (length as a varbyte + the byte
> array).
Well, I wish I could point you to the old paper where I put a lot of
thought into this. Consider that any possible substring/superstring,
of any length, is possible. While "decoding" the next 4/8 bytes of the
key string (including any binary byte values in the general case), you
must be able to efficiently distinguish (in the 4-byte case) all of
these cases:
- string has only 1 more byte (256 possible values)
- string has 2 more bytes (256^2 possible values)
- string has 3 more bytes (256^3 possible values)
- string has 4 or more bytes => each of which leads to 1 or more
  substring values
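To make those four cases concrete, here's a sketch (the names are my
own, not JudySL internals) of classifying a key by how many bytes
remain once some number of bytes has been decoded, along with how many
distinct tails each short case must be able to distinguish:

```c
#include <stddef.h>

/* After `off` bytes of a `len`-byte key have been decoded,
 * classify the remainder: 1, 2, or 3 trailing bytes (the key
 * ends within the next 4-byte word), or 4 meaning four or more
 * bytes remain and superstrings continue below this level.
 * Caller ensures off < len. */
static int remaining_class(size_t len, size_t off)
{
    size_t rem = len - off;
    return rem >= 4 ? 4 : (int)rem;
}

/* Distinct tails of exactly k bytes (k = 1..3): 256^k values. */
static unsigned long tail_values(int k)
{
    unsigned long v = 1;
    while (k-- > 0)
        v *= 256;
    return v;
}
```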
You can't do this with the present JudySL code... You must dig into
the internals to add some new kind of tree node that tells you how
many and which strings are present that terminate at this level, along
with all superstrings that go deeper. Say you store several keys
sharing the prefix "abcd".
After decoding "abcd" through the first JudyL array, you can't just
point to another JudyL array for the rest of the string(s). Something
has to sort by remaining length (1,2,3,4+) first and lead you off to the
children trees for each length. And you gotta do this while
maintaining lexicographical order too: you can't have "abcdf" (shorter
key) appear before "abcdef" (longer key) while walking the list.
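Whatever node layout solves this, the order it has to reproduce is
plain byte-wise lexicographic order with a prefix sorting before its
superstrings. A quick sketch of that comparison (my own helper, not
libJudy code):

```c
#include <string.h>

/* Lexicographic comparison of variable-length binary keys:
 * byte-wise first, and when one key is a prefix of the other,
 * the shorter (prefix) key sorts first. */
static int key_cmp(const unsigned char *a, size_t alen,
                   const unsigned char *b, size_t blen)
{
    size_t min = alen < blen ? alen : blen;
    int c = memcmp(a, b, min);
    if (c != 0)
        return c;
    /* equal up to the shorter length: the prefix sorts first */
    return (alen > blen) - (alen < blen);
}
```

Under this comparison "abcd" < "abcde" < "abcdef" < "abcdf", which is
exactly the order a sort-by-remaining-length-first layout has to be
careful to preserve.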
Simply listing N (large) substrings in lexicographical order is very
inefficient (in both space and CPU time) if there are more of them than
fit in 1-2 cache lines! And common substrings are problematic too.
I did dream up the most efficient (space and timewise) data structure I
could think of to solve this problem, but I've forgotten it, and it's
buried in the "variable size keys" paper I wrote over 10 years ago.