From: Andy G. <unu...@ai...> - 2010-02-01 06:24:24
|
(Is this the right forum for this type of discussion?) I believe it is possible to merge the list and dict Tcl_Obj types such that a dict is simply a list whose intrep contains some extra indexing information; dict extends list without replacing it, like a subclass. This change can be made without affecting script-level, API-level, or even binary-level compatibility. The only observable difference should be a performance improvement in some circumstances from reduced shimmering. The idea came about while working on processing HTTP headers in Wibble. Headers can repeat, so I create a key/value list using [lappend] instead of [dict set]. Most of the time header repetition can be neglected, so the resulting list is conveniently accessed using [dict get], which returns the last (usually only) matching entry. However, this incurs a significant shimmering penalty for every lookup. If [dict get] were to access a persistent index over the list, rather than insisting on storing the values themselves in the index, this shimmering would not take place. (The Wibble version I describe hasn't been posted to the Wiki yet.) I have decided against using the Tcl hash table implementation for this project, primary because there is no good way to pass "client data" to the custom hash procs. Such client data is necessary to tie the hash table back to the intrep. I considered and rejected global variables, thread-local storage, and putting magic keys in the table. I suppose I could have a global hash table mapping from hash tables back to intreps, but that is too incestuous for my liking. Freed from using a generic hash table, I have integrated an optimized hash table directly into the data structure. I have mostly been using the BLT hash table implementation as reference. In my "sandbox copy" of Tcl I have added a *dict field to the definition of List in tclInt.h. (As far as I know, changing data structure definitions tclInt.h doesn't automatically break binary compatibility with extensions.) typedef struct List { int refCount; int maxElemCount; /* Total number of element array slots. */ int elemCount; /* Current number of list elements. */ int canonicalFlag; /* Set if the string representation was * derived from the list representation. May * be ignored if there is no string rep at * all.*/ Tcl_Obj *elements; /* First list element; the struct is grown to * accomodate all elements. */ Dict *dict; /* Dict index over the list, or NULL if this * list doesn't have a dict representation. */ } List; I have modified the list functions to recognize when to invalidate the dict index. That's the only responsibility added to the list functions. I define Dict thusly: #define ENTRIES_PER_BUCKET 3 #define STATIC_BUCKETS 4 #define STATIC_ENTRIES (STATIC_BUCKETS * ENTRIES_PER_BUCKET) struct Dict { int *buckets; /* Pointer to bucket array. Each element is the * bucket head entry's index into the entries * array, or -1 if the bucket is empty. */ DictEntry *entries; /* Pointer to entry array. */ int *allocation; /* Partitioned list tracking the allocation of * entries, containing indexes into entries. The * first numEntries values are the indexes of * allocated entries, in the order they were * created. The remaining values are the indexes * of unused entries, in arbitrary order. */ int maxBuckets; /* Size of the buckets array. */ int numEntries; /* Number of entries present in table. Equal to * elemCount/2 plus the number of duplicate keys * in the elements array. */ int maxEntries; /* Size of the entries and allocation arrays. The * table grows when numEntries exceeds this. */ unsigned long mask; /* Mask value used in hashing function. */ unsigned int downShift; /* Shift count used in hashing function. Designed * to use high-order bits of randomized keys. */ int staticBuckets[STATIC_BUCKETS]; /* Bucket array used for small dicts to avoid * malloc. When the dict is small, buckets points * to this array. */ DictEntry staticEntries[STATIC_ENTRIES]; /* Entry array used for small dicts to avoid * malloc. When the dict is small, entries points * to this array. */ int staticAllocation[STATIC_ENTRIES]; /* Allocation array used for small dicts to avoid * malloc. When the dict is small, allocation * points to this array. */ Tcl_Obj *chain; /* Linked list used for invalidating the string * reps of updated nested dictionaries. */ }; And DictEntry is: typedef struct DictEntry { unsigned long hash; /* This entry's hash value, for quick comparison * when finding the right entry in the bucket. */ int next; /* Index into the entries array for the next entry * in this hash bucket, or -1 for end of chain. */ int key; /* The entry's index into elements, divided by * two. Multiply by two to get the key index, then * add one to get the value index. If there are no * duplicates, key equals the entry's index in the * allocation array. */ } DictEntry; A "pure" dict has no duplicate keys, so numEntries*2==elemCount. The dict commands produce only pure dicts. The only way to make an impure dict is to build it using list commands then create a dict index by using a read-only dict command. Impure dicts can be read as-is, but modifying or copying it using [dict] commands necessitate rebuilding the list to have no duplicates. This can be done with a simple in-place algorithm, because i<=entries[allocation[i]].key for 0<=i<numEntries. In the following code fragment, elements and elemCount are fields in the List structure, and entries, allocation, and numEntries are fields in the dict structure. if (numEntries * 2 != elemCount) { int i, j; for (i = 0; i < numEntries; ++i) { j = entries[allocation[i]].key; if (i != j) { assert(j > i); memcpy(elements + i * 2, elements + j * 2, sizeof(*elements) * 2); entries[allocation[i]].key = i; } } elemCount = numEntries * 2; } Here's a script example that shows how dict behaves in the face of duplicates. This is current Tcl behavior, which I take pains to preserve. % set data {a 0 b 1 a 2 a 3 c 4} a 0 b 1 a 2 a 3 c 4 % dict get $data a 3 % dict get $data a 3 b 1 c 4 % set data a 0 b 1 a 2 a 3 c 4 % dict set data a -3 a -3 b 1 c 4 % set data a -3 b 1 c 4 [dict get] returns a new pure dict formed from the input list, which the above algorithm permutes as follows: (the / in the allocation list shows the partitioning, determined by numEntries) elements = {a 0 b 1 a 2 a 3 c 4} elemCount = 10 numEntries = 3 allocation = {0 1 2 / 3 4 5 6 7 8 9 10 11} entries.key = {3 1 4 ? ? ? ? ? ? ? ? ?} 1. Initial : a 0 b 1 a 2 a 3 c 4 2. Copy "a" : a 3 b 1 a 2 a 3 c 4 3. Leave "b" alone: a 3 b 1 a 2 a 3 c 4 4. Copy "c" : a 3 b 1 c 4 a 3 c 4 5. Truncate : a 3 b 1 c 4 elements = {a 3 b 1 c 4} elemCount = 6 numEntries = 3 allocation = {0 1 2 / 3 4 5 6 7 8 9 10 11} entries.key = {0 1 2 ? ? ? ? ? ? ? ? ?} Now let's delete an element. % dict unset data b a 3 c 4 This would correspond to: elements = {a 3 c 4} elemCount = 4 numEntries = 2 allocation = {0 2 / 1 3 4 5 6 7 8 9 10 11} entries.key = {0 ? 1 ? ? ? ? ? ? ? ? ?} "b" is looked up in the hash table. The hash function selects a bucket, which is the head of a singly-linked list through entries. One of the linked entries has a matching hash, and a full key comparison is done to confirm the match. The full text of the key comes from the Tcl_Obj pointed to by the entry->key*2'th slot in the elements array. The key'th slot of the allocation array gets obliterated by shifting down all higher-indexed slots, up to numEntries-1 (in this case, allocation[1] = allocation[2]), and similar happens to elements, except that the indexes are doubled (overwrite "b" and "1" with "c" and "4"). The entry is freed by putting its index into numEntries-1 slot (allocation[2] = 1). (It's more like a rotate than a shift.) The keys for the shifted entries get decremented (--entries[2].key), and numEntries and elemCount are decremented as well. Of course, reference counts are decremented for the objects removed from the list. I could discuss a few more algorithms, but I don't want to bore you all to death. I just picked these two to demonstrate the operation of these data structures. Now, here's my big coding quandary. Implementing the foregoing requires that the dict code be able to access some list internals and vice versa. I don't want to merge tclListObj.c and tclDictObj.c (even though list and dict would no longer be distinct types), but I also don't want to de-static list and dict internals and move them into tclInt.h. How should I handle this? Thank you for your comments. This has been an exciting project for me, and I'd like to make it a reality someday. I have further enhancements planned for dict which are made possible by unifying list and dict, which I'll be happy to write emails and even TIPs about if you're interested. -- Andy Goth | http://andy.junkdrome.org./ unununium@{aircanopy.net,openverse.com} |
From: Alexandre F. <ale...@gm...> - 2010-02-01 08:58:14
|
Hello Andy, On 2/1/10, Andy Goth <unu...@ai...> wrote: > (Is this the right forum for this type of discussion?) If it's not, we're doomed :} > I believe it is possible to merge the list and dict Tcl_Obj types such that a dict is simply a list whose intrep contains some extra indexing information; dict extends list without replacing it, like a subclass. This change can be made without affecting script-level, API-level, or even binary-level compatibility. The only observable difference should be a performance improvement in some circumstances from reduced shimmering. I sympathetize with the idea of inheritance among obj types, but given Donal's reaction to the suggestion I have more or less given it up: http://groups.google.com/group/comp.lang.tcl/tree/browse_frm/thread/a5214ae93e0294ac/0c62a476064031b1?rnum=11&q=polymorphism&_done=%2Fgroup%2Fcomp.lang.tcl%2Fbrowse_frm%2Fthread%2Fa5214ae93e0294ac%2Fe929b181c5aac543%3Flnk%3Dgst%26q%3Dpolymorphism%26safe%3Dimages%26#doc_2fe1f8c2cf7e339a I must say that the lack of a "compelling need" plays an important role though... > The idea came about while working on processing HTTP headers in Wibble. Headers can repeat, so I create a key/value list using [lappend] instead of [dict set]. Most of the time header repetition can be neglected, so the resulting list is conveniently accessed using [dict get], which returns the last (usually only) matching entry. However, this incurs a significant shimmering penalty for every lookup. If [dict get] were to access a persistent index over the list, rather than insisting on storing the values themselves in the index, this shimmering would not take place. (The Wibble version I describe hasn't been posted to the Wiki yet.) Uh, maybe I'm missing something, but doing this you'll wreck repeated headers, right ? The standard idiom for {HTTP,SMTP,SIP} headers is rather (with array, translate to dict if you want): lappend values($header) value so that for non-repeated headers you need an extra [lindex 0], but in exchange you have absolutely no shimmering. Given this compromise, is there a "compelling need" for a new bump in code complexity ? -Alex |
From: Neil M. <Nei...@no...> - 2010-02-02 11:54:39
|
Hi Andy, On 1 Feb 2010, at 06:08, Andy Goth wrote: > (Is this the right forum for this type of discussion?) > > I believe it is possible to merge the list and dict Tcl_Obj types such that a dict is simply a list whose intrep contains some extra indexing information; dict extends list without replacing it, like a subclass. This change can be made without affecting script-level, API-level, or even binary-level compatibility. The only observable difference should be a performance improvement in some circumstances from reduced shimmering. Sounds good. If I understand correctly you add a hashtable to the list rep which maps from string keys to list indices, rather than directly to the associated value? Apologies, I don't have time to read through all the details you posted. This seems like a good approach to me. > > The idea came about while working on processing HTTP headers in Wibble. Headers can repeat, so I create a key/value list using [lappend] instead of [dict set]. Most of the time header repetition can be neglected, so the resulting list is conveniently accessed using [dict get], which returns the last (usually only) matching entry. However, this incurs a significant shimmering penalty for every lookup. If [dict get] were to access a persistent index over the list, rather than insisting on storing the values themselves in the index, this shimmering would not take place. (The Wibble version I describe hasn't been posted to the Wiki yet.) Presumably your direct concern could be solved by using [dict lappend] instead of [dict set]? I.e., instead of trying to create a map with duplicate keys, you instead store a list of values for each key. [...] -- NeilThis message has been checked for viruses but the contents of an attachment may still contain software viruses which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. |
From: Andy G. <unu...@ai...> - 2010-02-02 15:43:46
|
"Neil Madden" <Nei...@no...> wrote: > Sounds good. If I understand correctly you add a hashtable to the list rep > which maps from string keys to list indices, rather than directly to the > associated value? This seems like a good approach to me. Exactly. > Presumably your direct concern could be solved by using [dict lappend] > instead of [dict set]? Yes. Alexandre and I discussed this and I agree. But as I said, I'm still interested in working on [dict]. This should improve the performance of normal operation as well as list and string conversion performance. > I.e., instead of trying to create a map with duplicate keys, you instead > store a list of values for each key. Since I'm unwilling to change the externally visible behavior of [dict] or the Tcl_*Dict* functions, technically the "dict" wouldn't contain duplicate keys, only the "key/value list" that the "dict" is indexing. I give an example near the end of my first email. -- Andy Goth | http://andy.junkdrome.org./ unununium@{aircanopy.net,openverse.com} |
From: Donal K. F. <don...@ma...> - 2010-02-02 16:35:46
Attachments:
donal_k_fellows.vcf
|
On 02/02/2010 15:43, Andy Goth wrote: > Since I'm unwilling to change the externally visible behavior of [dict] or > the Tcl_*Dict* functions, technically the "dict" wouldn't contain duplicate > keys, only the "key/value list" that the "dict" is indexing. I give an > example near the end of my first email. Have you evaluated the memory cost of these changes to code that only needs basic lists or basic dicts and isn't transforming one to the other? Donal. |
From: Kevin K. <kev...@gm...> - 2010-02-03 00:59:00
|
Andy Goth wrote: > (Is this the right forum for this type of discussion?) > > I believe it is possible to merge the list and dict Tcl_Obj types > such that a dict is simply a list whose intrep contains some extra > indexing information; dict extends list without replacing it, like > a subclass. This change can be made without affecting script-level, > API-level, or even binary-level compatibility. The only observable > difference should be a performance improvement in some circumstances > from reduced shimmering. Uhm. Is it time to mention Feather again? Some of Feather's ideas were - after sober thought - very bad ones. But I rather liked its idea that Tcl_ObjTypes could have inheritable interfaces. It would be a Very Good Thing if [dict for], the [array startsearch] stuff, and [foreach] could be unified. It would be a Very Good Thing if [dict keys] and [array keys] could return some sort of iterator, and not have to construct a list (particularly since their commonest use is as the list argument of [foreach]. And (in Tcl 9 territory, because we'd have either to grow Tcl_ObjType by at least one word or else result to an ugly kludge like the one Paul Duffin came up with), I think unified containers would make sense. -- 73 de ke9tv/2, Kevin PS: (The one I want is a sorted list or map, allowing for duplicate keys, and supporting O(logN) time for insertion (by key and perhaps position within equal keys, search by key, search by numeric position, and split (destructive [lrange]). It's amazing how useful such a structure is if done right. |