Thread: Greetings! Judy Strings!

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hello everyone,

First, if Doug Baskins is still around on this list, I'd like to
congradulate him on a really wonderful idea.  Even fairly simple testing
shows dramatic speed gains over binary trees or skiplists, and
drastically reduced memory usage.

This brings me to an an interesting idea that dawned on me, and I'll let
the list know how this works out once I get it tested.  Basically, if
anyone has messed with the Boehm's Garbage Collector they've likely come
across the CORD string package, which is a C version of the ROPE package
http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf

In a CORD, the string data area is immutable.  Strings are built from
the concatenation of substrings.  A substring is just a node with a
pointer to the start of the data (usually a character array like a C
string) and the length of the substring.  The CORD then builds a binary
tree where the right and left nodes represent concatenation, without any
data copies ever taking place.  There are some other frills, but thats
the basics.

Now, there were three things I didn't like about CORDs.  First, it makes
use of stdio and I'm pretty much against that.  Second, it favors
concatenation at the expense of random access times (rightfully so, but
it seems like the problem is tree traversal).  Third, it uses binary
trees as its concatenation structure, and only balances them at certain
times, which is the cause of the previous issue.  I'm guessing that
during typical usage, the binary tree degrades to be horribly out of
balance, and likely not much better than using a linked list.
Incidently, before reading about CORDs I did a similar scheme using a
linked list instead of a binary tree.

I then realized that I was already using Judy Arrays in my program, and
I had already thrown binary trees out the window after comparing against
Judy, and for not only size/speed issues, but the lack of any tuning or
rebalancing requirements in Judy.  Therefore, I might as well use Judy
array's for string management using the basic ideas of CORDs, but with a
Judy Array instead of a binary tree!

I envision this working with the same basic substring node idea, only
store each substring in a Judy Array, at the index where said substring
would occur.  You'd lose out big if you ever changed the size of a
substring and had to move all your data to new indexes, but according to
the rules of the CORD, *ALL* strings are immutable, so you'd make a new
string structure and a new Judy array instead.  Now .. you concatenate
by asking Judy for the last non-empty index, add the length from the
node at the end, and insert at the new index.  Iteration from start to
beginning can be done either by starting at element 0 and adding the
length to each offset to get the next index .. or just output the length
given at that node and ask Judy where the "next" node is!  Not sure
which is faster - likely adding the index?

To find a character at any given index (a *worst* case operation for a
CORD), you only have to ask Judy if there is an element at that index.
if so - your done.  If not, ask judy what the previous non-empty element
is, and that will get you to the right substring and the difference of
the indexes will give you the index into that substring.

With slightly more work (or maybe even less work), you can remove the
need for storing a substring length at every node.  You can infer the
length of any node by fetching the index of the next non-empty node and
subtracting.  While this is another Judy operation that may not be
necessary in many cases, its a single subtraction of indexes to give a
length, instead of iterating through the entire string looking for a
NULL, plus it would give you the next substring which you'll fetch
anyway when iterating over the string or outputting it!  Plus, the Judy
array would only have to store pointers to characters and not pointers
to nodes with a length and another character pointer.  A string would
then be a Judy Array and a total length since you'd need the total
length to compute the length of the last substring.

At this point, I'm going to do it and test it.  I haven't decided if I
want to store nodes with a substring length in them or just store the
pointers to the character array data and infer the substring lengths
from the indexes.  Suggestions welcome.

The only problem I see is that CORDs are said to be thread safe, and I
have read nothing about the thread safety of Judy and what functions
should have a MUTEX stuck around them to prevent concurrent access.  I'm
guessing every GET, INSERT, or DELETE must be locked??  Hopefully not
ALL operations?

-- Evan

Thread: Greetings! Judy Strings!

judy-devel