From: <fre...@fr...> - 2009-01-12 15:39:29
|
Hi all, First, happy new year! Following the discussions we've had a while back about string representations, Unicode, Tcl9, Cloverfield and the like, I've been working during the past weeks on a rope package. You can find it here on the Tcl9 project on Sourceforge: http://sourceforge.net/project/showfiles.php?group_id=216988 The implementation is a materialization of several ideas I've developed over the years, with some borrowed from the seminal paper on Cedar Ropes by Hans Boehm et. al., available here: http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf Ropes are string structures where data is not stored in flat NUL-terminated arrays but in self-balancing binary trees, allowing for fast insertion/removal of arbitrarily long strings. The package is built around a dedicated memory allocator based on fixed-size cells, coupled with a generational, exact, mark-and-sweep garbage collector. Data structures: ================ Ropes are made of chunks of Unicode string data that can use either fixed-width formats (native C strings, UCS 1, 2 or 4 bytes) or UTF-8. Such string data chunks can take a variable number of cells; if the provided data is larger than the maximum size (here 63 cells, i.e. 1008 bytes minus the 4 or 8 byte header), it is split into several chunks and assembled transparently. Ropes can be made of chunks having different representations, thus allowing maximum compacity when mixing e.g. typically 7-bit Tcl code and foreign language data. Moreover, native NUL-terminated C strings are recognized as valid ropes from C code. Basic string chunks are assembled by concatenation to form larger ropes. Small strings are transparently merged into flat leaves, whereas larger ones use a self-balancing binary tree made of concat nodes. Substrings can also be extracted, and form either flat leaves for the smallest ones, or use a substr node for larger ones. The combination of both techniques allows for easy handling of arbitrarily large strings with minimal duplication and maximal sharing of raw string data. Apart from strings, all nodes are designed to fit into one single cell, thus providing maximum allocation performances and minimal memory impact. Indexing is O(1) for flat fixed-width ropes, O(n) for UTF-8 (as with Tcl), and O(logn) in general for complex ropes. The fact that basic string chunk size is limited also means that very large UTF-8 strings have better performances than flat UTF-8 strings (such as those used by Tcl) thanks to the intermediary indexing levels. Memory management: ================== The dedicated memory allocator is based on fixed-size cells (16 bytes on 32-bit architectures) within page-aligned memory. Each page contains a bitmask that indicates the cell status. For 1024-byte pages, this gives 64 16-byte cells, of which one is reserved. This results in a very small overhead (2 bit per cell) and a very good memory locality, both improving the performances dramatically (more on that below) on modern architectures compared to a traditional allocator (which typically uses linked list structures). This choice has been made because, with the exception of pure string data, rope nodes are typically small and easily fit within a 16-byte cell. Moreover, this allocator is coupled with a generational, exact (as opposed to conservative), mark-and-sweep garbage collector that again provides a huge speedup compared to manual free() calls. The only needed action is root declaration for all ropes that are externally referenced, using a simple reference counting scheme. The GC process is fully controllable in the sense that sections of code can be protected by pausing and resuming the automatic collection. Generational GC means that older ropes (having survived several GC cycles) are promoted to older pools that are collected less frequently; this limits the CPU impact of collections. Other features: =============== The package provides iterator structures and procedures, so porting existing code should be easy. I'm thinking especially about the regexp engine, which needs flat Unicode strings. Iterators abstract the whole stuff into a random-access model. Direct traversal of the string is O(1). A custom rope is available for extensions. Typical use would be for example memory-mapping of potentially very large datasets, even on memory-constrained systems (e.g. mobile platforms). Another use case would be programmatically generated data. Or it could be used to wrap malloc'd strings into ropes. Connections with Tcl: ===================== Tcl currently uses UTF-8 flat strings as its string representation. There have been some discussions about the format that future versions should use. One such proposal was augmented strings, where flat strings would be supplemented by additional information (indexing of UTF-8 strings for instance). Another concern was memory consumption needed for the support of UCS4 (32-bit chars). I think ropes provides a good solution to all this problems. Ropes are immutable strings, which fit the Tcl model perfectly. One can mix several formats transparently, making byte arrays unnecessary and removing a prominent cause of shimmering. And as complex types such as lists build their string representations from those of their elements, maximum reuse of existing strings is ensured and limits the memory impact even further. Last, the impact on client code is minimal thanks to the automatic garbage collector and the backward compatibility with C strings. For these reasons I think that ropes would be a good choice as a native string representation for future versions of Tcl. Performances: ============= Of course there are lies, damn lies and benchmarks, but the first performance test I've run on my system are very satisfactory. On a Core 2 Duo P8400 2.26GHz WinXP SP3 with 2GB RAM I get the following results (from test.c): (figures in ms) --------------------------------------------------------------------- testAlloc: ropes vs. malloc raw allocation performances Ropes: 40000000 12-byte ropes = 480000000 data bytes ... 1859 create + 610 GC = 2469 malloc: 40000000 12-byte C strings = 480000000 data bytes ... 6718 malloc + memcpy + 7922 free = 14640 Ropes: 20000000 28-byte ropes = 560000000 data bytes ... 1453 create + 625 GC = 2078 malloc: 20000000 28-byte C strings = 560000000 data bytes ... 3813 malloc + memcpy + 3828 free = 7641 Ropes: 15000000 44-byte ropes = 660000000 data bytes ... 1375 create + 703 GC = 2078 malloc: 15000000 44-byte C strings = 660000000 data bytes ... 3016 malloc + memcpy + 2953 free = 5969* Ropes: 1000000 1000-byte ropes = 1000000000 data bytes ... 1062 create + 1000 GC = 2062 malloc: 1000000 1000-byte C strings = 1000000000 data bytes ... 1047 malloc + memcpy + 391 free = 1438 --------------------------------------------------------------------- This shows that the allocator+GC performs usually faster than malloc+free, mostly because of the GC performances compared to free(). The malloc version outperforms the ropes only in the case of a large number of large strings. So this benchmark shows the real benefits of automatic memory management even on the simplest cases; in the general real-world case where a lot of small structure are allocated and freed during the lifetime of the application, the malloc version would perform closer to the worst case above, and maybe worse because of memory fragmentation. The following test is closer to a real world application: it runs several (10,000) cycles during which 80 large strings are allocated and preserved. --------------------------------------------------------------------- testGeneration: With all ropes preserved: 10000 x 80 988-byte ropes + roots = 790400000 data bytes : 16953 With no more than 10000 ropes preserved: 10000 x 80 988-byte ropes + roots = 790400000 data bytes : 4406 --------------------------------------------------------------------- This shows the generational properties of the GC: older ropes will be promoted to older pools and thus traversed less often by the collector. A real world application during its lifetime would typically store a fairly constant number of stable objects (global or static data, business models) and allocate a larger number of short-lived objects (input, output and temporary values). Generational GC ensures that the latter get collected more often than the former, and let stable objects percolate to deeper layers. Conclusion: =========== The package needs some polish (test suite, docs, etc.) but I think the code is fairly usable in its current state. At present it only works on 32-bit architectures, but a port to 64-bit should be straightforward. Moreover, as the custom allocator needs page-aligned memory, I've only implemented a Win32 version based on VirtualAlloc(). Posix systems would need posix_memalign() or something more suitable to get the same result (it uses a 1024-byte boundary on 32-bit systems). I plan to design a similar package but for objects, especially lists since they are very similar to strings, and integrate both closely. Comments welcome! |