Thread: [TCLCORE] Unifying list and dict types

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

(Is this the right forum for this type of discussion?)

I believe it is possible to merge the list and dict Tcl_Obj types such that a dict is simply a list whose intrep contains some extra indexing information; dict extends list without replacing it, like a subclass.  This change can be made without affecting script-level, API-level, or even binary-level compatibility.  The only observable difference should be a performance improvement in some circumstances from reduced shimmering.

The idea came about while working on processing HTTP headers in Wibble.  Headers can repeat, so I create a key/value list using [lappend] instead of [dict set].  Most of the time header repetition can be neglected, so the resulting list is conveniently accessed using [dict get], which returns the last (usually only) matching entry.  However, this incurs a significant shimmering penalty for every lookup.  If [dict get] were to access a persistent index over the list, rather than insisting on storing the values themselves in the index, this shimmering would not take place.  (The Wibble version I describe hasn't been posted to the Wiki yet.)

I have decided against using the Tcl hash table implementation for this project, primary because there is no good way to pass "client data" to the custom hash procs.  Such client data is necessary to tie the hash table back to the intrep.  I considered and rejected global variables, thread-local storage, and putting magic keys in the table.  I suppose I could have a global hash table mapping from hash tables back to intreps, but that is too incestuous for my liking.  Freed from using a generic hash table, I have integrated an optimized hash table directly into the data structure.  I have mostly been using the BLT hash table implementation as reference.

In my "sandbox copy" of Tcl I have added a *dict field to the definition of List in tclInt.h.  (As far as I know, changing data structure definitions tclInt.h doesn't automatically break binary compatibility with extensions.)

typedef struct List {
    int refCount;
    int maxElemCount;           /* Total number of element array slots. */
    int elemCount;              /* Current number of list elements. */
    int canonicalFlag;          /* Set if the string representation was
                                 * derived from the list representation. May
                                 * be ignored if there is no string rep at
                                 * all.*/
    Tcl_Obj *elements;          /* First list element; the struct is grown to
                                 * accomodate all elements. */
    Dict *dict;                 /* Dict index over the list, or NULL if this
                                 * list doesn't have a dict representation. */
} List;

I have modified the list functions to recognize when to invalidate the dict index.  That's the only responsibility added to the list functions.

I define Dict thusly:

#define ENTRIES_PER_BUCKET 3
#define STATIC_BUCKETS 4
#define STATIC_ENTRIES (STATIC_BUCKETS * ENTRIES_PER_BUCKET)

struct Dict {
    int *buckets;           /* Pointer to bucket array. Each element is the
                             * bucket head entry's index into the entries
                             * array, or -1 if the bucket is empty. */
    DictEntry *entries;     /* Pointer to entry array. */
    int *allocation;        /* Partitioned list tracking the allocation of
                             * entries, containing indexes into entries. The
                             * first numEntries values are the indexes of
                             * allocated entries, in the order they were
                             * created. The remaining values are the indexes
                             * of unused entries, in arbitrary order. */
    int maxBuckets;         /* Size of the buckets array. */
    int numEntries;         /* Number of entries present in table. Equal to
                             * elemCount/2 plus the number of duplicate keys
                             * in the elements array. */
    int maxEntries;         /* Size of the entries and allocation arrays. The
                             * table grows when numEntries exceeds this. */
    unsigned long mask;     /* Mask value used in hashing function. */
    unsigned int downShift; /* Shift count used in hashing function. Designed
                             * to use high-order bits of randomized keys. */
    int staticBuckets[STATIC_BUCKETS];
                            /* Bucket array used for small dicts to avoid
                             * malloc. When the dict is small, buckets points
                             * to this array. */
    DictEntry staticEntries[STATIC_ENTRIES];
                            /* Entry array used for small dicts to avoid
                             * malloc. When the dict is small, entries points
                             * to this array. */
    int staticAllocation[STATIC_ENTRIES];
                            /* Allocation array used for small dicts to avoid
                             * malloc. When the dict is small, allocation
                             * points to this array. */
    Tcl_Obj *chain;         /* Linked list used for invalidating the string
                             * reps of updated nested dictionaries. */
};

And DictEntry is:

typedef struct DictEntry {
    unsigned long hash;     /* This entry's hash value, for quick comparison
                             * when finding the right entry in the bucket. */
    int next;               /* Index into the entries array for the next entry
                             * in this hash bucket, or -1 for end of chain. */
    int key;                /* The entry's index into elements, divided by
                             * two. Multiply by two to get the key index, then
                             * add one to get the value index. If there are no
                             * duplicates, key equals the entry's index in the
                             * allocation array. */
} DictEntry;

A "pure" dict has no duplicate keys, so numEntries*2==elemCount.  The dict commands produce only pure dicts.  The only way to make an impure dict is to build it using list commands then create a dict index by using a read-only dict command.  Impure dicts can be read as-is, but modifying or copying it using [dict] commands necessitate rebuilding the list to have no duplicates.  This can be done with a simple in-place algorithm, because i<=entries[allocation[i]].key for 0<=i<numEntries.  In the following code fragment, elements and elemCount are fields in the List structure, and entries, allocation, and numEntries are fields in the dict structure.

if (numEntries * 2 != elemCount) {
    int i, j;
    for (i = 0; i < numEntries; ++i) {
        j = entries[allocation[i]].key;
        if (i != j) {
            assert(j > i);
            memcpy(elements + i * 2, elements + j * 2, sizeof(*elements) * 2);
            entries[allocation[i]].key = i;
        }
    }
    elemCount = numEntries * 2;
}

Here's a script example that shows how dict behaves in the face of duplicates.  This is current Tcl behavior, which I take pains to preserve.

% set data {a 0 b 1 a 2 a 3 c 4}
a 0 b 1 a 2 a 3 c 4
% dict get $data a
3
% dict get $data
a 3 b 1 c 4
% set data
a 0 b 1 a 2 a 3 c 4
% dict set data a -3
a -3 b 1 c 4
% set data
a -3 b 1 c 4

[dict get] returns a new pure dict formed from the input list, which the above algorithm permutes as follows: (the / in the allocation list shows the partitioning, determined by numEntries)

elements = {a 0 b 1 a 2 a 3 c 4}
elemCount = 10
numEntries = 3
allocation = {0 1 2 / 3 4 5 6 7 8 9 10 11}
entries.key = {3 1 4 ? ? ? ? ? ? ? ? ?}

1. Initial        : a 0 b 1 a 2 a 3 c 4
2. Copy "a"       : a 3 b 1 a 2 a 3 c 4
3. Leave "b" alone: a 3 b 1 a 2 a 3 c 4
4. Copy "c"       : a 3 b 1 c 4 a 3 c 4
5. Truncate       : a 3 b 1 c 4

elements = {a 3 b 1 c 4}
elemCount = 6
numEntries = 3
allocation = {0 1 2 / 3 4 5 6 7 8 9 10 11}
entries.key = {0 1 2 ? ? ? ? ? ? ? ? ?}

Now let's delete an element.

% dict unset data b
a 3 c 4

This would correspond to:

elements = {a 3 c 4}
elemCount = 4
numEntries = 2
allocation = {0 2 / 1 3 4 5 6 7 8 9 10 11}
entries.key = {0 ? 1 ? ? ? ? ? ? ? ? ?}

"b" is looked up in the hash table.  The hash function selects a bucket, which is the head of a singly-linked list through entries.  One of the linked entries has a matching hash, and a full key comparison is done to confirm the match.  The full text of the key comes from the Tcl_Obj pointed to by the entry->key*2'th slot in the elements array.  The key'th slot of the allocation array gets obliterated by shifting down all higher-indexed slots, up to numEntries-1 (in this case, allocation[1] = allocation[2]), and similar happens to elements, except that the indexes are doubled (overwrite "b" and "1" with "c" and "4").  The entry is freed by putting its index into numEntries-1 slot (allocation[2] = 1).  (It's more like a rotate than a shift.)  The keys for the shifted entries get decremented (--entries[2].key), and numEntries and elemCount are decremented as well.  Of course, reference counts are decremented for the objects removed from the list.

I could discuss a few more algorithms, but I don't want to bore you all to death.  I just picked these two to demonstrate the operation of these data structures.

Now, here's my big coding quandary.  Implementing the foregoing requires that the dict code be able to access some list internals and vice versa.  I don't want to merge tclListObj.c and tclDictObj.c (even though list and dict would no longer be distinct types), but I also don't want to de-static list and dict internals and move them into tclInt.h.  How should I handle this?

Thank you for your comments.  This has been an exciting project for me, and I'd like to make it a reality someday.  I have further enhancements planned for dict which are made possible by unifying list and dict, which I'll be happy to write emails and even TIPs about if you're interested.

-- 
Andy Goth | http://andy.junkdrome.org./
unununium@{aircanopy.net,openverse.com}

Thread: [TCLCORE] Unifying list and dict types

The Tool Command Language implementation

tcl-core