From: Evan L. <Evan@CoolRunningConcepts.com> - 2005-11-09 02:59:42
|
Hello everyone, First, if Doug Baskins is still around on this list, I'd like to congradulate him on a really wonderful idea. Even fairly simple testing shows dramatic speed gains over binary trees or skiplists, and drastically reduced memory usage. This brings me to an an interesting idea that dawned on me, and I'll let the list know how this works out once I get it tested. Basically, if anyone has messed with the Boehm's Garbage Collector they've likely come across the CORD string package, which is a C version of the ROPE package http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf In a CORD, the string data area is immutable. Strings are built from the concatenation of substrings. A substring is just a node with a pointer to the start of the data (usually a character array like a C string) and the length of the substring. The CORD then builds a binary tree where the right and left nodes represent concatenation, without any data copies ever taking place. There are some other frills, but thats the basics. Now, there were three things I didn't like about CORDs. First, it makes use of stdio and I'm pretty much against that. Second, it favors concatenation at the expense of random access times (rightfully so, but it seems like the problem is tree traversal). Third, it uses binary trees as its concatenation structure, and only balances them at certain times, which is the cause of the previous issue. I'm guessing that during typical usage, the binary tree degrades to be horribly out of balance, and likely not much better than using a linked list. Incidently, before reading about CORDs I did a similar scheme using a linked list instead of a binary tree. I then realized that I was already using Judy Arrays in my program, and I had already thrown binary trees out the window after comparing against Judy, and for not only size/speed issues, but the lack of any tuning or rebalancing requirements in Judy. Therefore, I might as well use Judy array's for string management using the basic ideas of CORDs, but with a Judy Array instead of a binary tree! I envision this working with the same basic substring node idea, only store each substring in a Judy Array, at the index where said substring would occur. You'd lose out big if you ever changed the size of a substring and had to move all your data to new indexes, but according to the rules of the CORD, *ALL* strings are immutable, so you'd make a new string structure and a new Judy array instead. Now .. you concatenate by asking Judy for the last non-empty index, add the length from the node at the end, and insert at the new index. Iteration from start to beginning can be done either by starting at element 0 and adding the length to each offset to get the next index .. or just output the length given at that node and ask Judy where the "next" node is! Not sure which is faster - likely adding the index? To find a character at any given index (a *worst* case operation for a CORD), you only have to ask Judy if there is an element at that index. if so - your done. If not, ask judy what the previous non-empty element is, and that will get you to the right substring and the difference of the indexes will give you the index into that substring. With slightly more work (or maybe even less work), you can remove the need for storing a substring length at every node. You can infer the length of any node by fetching the index of the next non-empty node and subtracting. While this is another Judy operation that may not be necessary in many cases, its a single subtraction of indexes to give a length, instead of iterating through the entire string looking for a NULL, plus it would give you the next substring which you'll fetch anyway when iterating over the string or outputting it! Plus, the Judy array would only have to store pointers to characters and not pointers to nodes with a length and another character pointer. A string would then be a Judy Array and a total length since you'd need the total length to compute the length of the last substring. At this point, I'm going to do it and test it. I haven't decided if I want to store nodes with a substring length in them or just store the pointers to the character array data and infer the substring lengths from the indexes. Suggestions welcome. The only problem I see is that CORDs are said to be thread safe, and I have read nothing about the thread safety of Judy and what functions should have a MUTEX stuck around them to prevent concurrent access. I'm guessing every GET, INSERT, or DELETE must be locked?? Hopefully not ALL operations? -- Evan |
From: yjudy <yj...@cl...> - 2005-11-09 07:05:14
|
Hi Evan! Very interesting idea and the paper is awesome. >At this point, I'm going to do it and test it. I haven't decided if I >want to store nodes with a substring length in them or just store the >pointers to the character array data and infer the substring lengths >from the indexes. Suggestions welcome. > =20 > In my opinion, you may try both implementations. Storing strings in nodes make the interface simpler when accessing the data after, but using point= ers instead, give more flexibility at the expanse of one or two operations=20 for length computing. >The only problem I see is that CORDs are said to be thread safe, and I >have read nothing about the thread safety of Judy and what functions >should have a MUTEX stuck around them to prevent concurrent access. I'm >guessing every GET, INSERT, or DELETE must be locked?? Hopefully not >ALL operations? > =20 > About threads safety, Doug is the best to answer this question, even I=20 think Juddy is suitable for this kind of job. Hope you start the implementation and we'll see the results. Good Luck Youn=E8s |
From: <ze...@ze...> - 2005-11-09 07:49:29
|
Hello, > >At this point, I'm going to do it and test it. I haven't decided if I > >want to store nodes with a substring length in them or just store the > >pointers to the character array data and infer the substring lengths > >from the indexes. Suggestions welcome. > > =20 > > In case you haven't seen it before: http://www.and.org/vstr/ http://www.and.org/vstr/comparison "Vstr is a string library, it's designed so you can work optimally with readv()/writev() for input/output. This means that, for instance, you can readv() data to the end of the string and writev() data from the beginning of the string without having to allocate or move memory. It also means that the library is completely happy with data that has multiple zero bytes in it. This design constraint means that unlike most string libraries Vstr doesn't have an internal representation of the string where everything can be accessed from a single (char *) pointer in C, the internal representation is of multiple "blocks" or nodes each carrying some of the data for the string. This model of representing the data also means that as a string gets bigger the Vstr memory usage only goes up linearly and has no inherent copying (due to other string libraries increasing space for the string via. realloc() the memory usage can be triple the required size and require a complete copy of the string). It also means that adding, substituting or moving data anywhere in the string can be optimized a lot, to require O(1) copying instead of O(n). Speaking of O(1), it's worth remembering that if you have a Vstr string with caching it is O(1) to get all the data to the writev() system call (the cat example below shows an example of this, the write call is always constant time." |
From: Troy H. <tro...@hp...> - 2005-11-09 14:04:12
|
On 11/09/05 08:06, yjudy wrote: > About threads safety, Doug is the best to answer this question, even > I think Juddy is suitable for this kind of job. Judy is currently *not* thread safe. This is the next area of development for Judy. Troy |