From: Jim L. <jl...@si...> - 2010-12-20 18:36:10
|
I see that Judy has implemented bottleneck functions in src/JudyCommon/ in the files JudyMalloc.c and JudyMallocIF.c. It appears that the primary intent of this implementation is for debugging and statistics. Has any thought been put into extending Judy's control over memory allocation so that each Judy array (or set) could be attached to a memory pool? Thanks, Jim Lloyd Silver Tail Systems |
From: Alan S. <aj...@fr...> - 2010-12-20 20:35:16
|
Jim et al, > I see that Judy has implemented bottleneck functions in > src/JudyCommon/ in the files JudyMalloc.c and JudyMallocIF.c. It > appears that primary intent of of this implementation is for debugging > and statistics. Has any thought been put into extending Judy's > control over memory allocation so that each Judy array (or set) could > be attached to a memory pool? Actually we put a LOT of development time into libJudy having its own memory allocator (directly calling sbrk(), etc). We studied the issue deeply. As you might know, malloc() and free() are about the only external requirements used by libJudy -- maybe one more, I forget what it was. Doug decided that there was no gain in having our own memory allocator, compared to a quality implementation of malloc(). It's been eight years and I'm having trouble recalling the public domain version he found and liked best. Doug Lea Malloc maybe? As for memory pools -- not sure what you mean. I haven't kept up. Are there modern malloc() libraries that offer some kind of namespace/pool feature? Does this buy you any performance? Cheers, Alan Silverstein |
From: Jim L. <jl...@si...> - 2010-12-20 21:17:14
|
Hi Alan, For various reasons it is awkward for all Judy allocations to come out of the global heap managed by malloc/free. We want to manage separate pools (heaps) and be able to specify for each Judy array which pool should be used for all memory allocs/frees done by the array. I agree that malloc/free has excellent performance and that it was a good choice to have Judy use malloc as opposed to some other memory allocator using a single heap. It is only because we want multiple pools that we want something other than malloc. We're prepared to sacrifice a bit of performance due to the increase in complexity. Even if the underlying pool allocator is as good as malloc/free, there will be a small performance cost because a reference to the allocator will have to be stored in the root record for a Judy array and then passed down as a function argument in the various operations that traverse the internal tree/trie data structures. We believe the benefits will outweigh the cost. Jim On Mon, Dec 20, 2010 at 12:35 PM, Alan Silverstein <aj...@fr...> wrote: > Jim et al, > > > I see that Judy has implemented bottleneck functions in > > src/JudyCommon/ in the files JudyMalloc.c and JudyMallocIF.c. It > > appears that primary intent of of this implementation is for debugging > > and statistics. Has any thought been put into extending Judy's > > control over memory allocation so that each Judy array (or set) could > > be attached to a memory pool? > > Actually we put a LOT of development time into libJudy having its own > memory allocator (directly calling sbrk(), etc). We studied the issue > deeply. As you might know, malloc() and free() are about the only > external requirements used by libJudy -- maybe one more, I forget what > it was. > > Doug decided that there was no gain in having our own memory allocator, > compared to a quality implementation of malloc(). It's been eight years > and I'm having trouble recalling the public domain version he found and > liked best. 
Doug Lea Malloc maybe? > > As for memory pools -- not sure what you mean. I haven't kept up. Are > there modern malloc() libraries that offer some kind of namespace/pool > feature? Does this buy you any performance? > > Cheers, > Alan Silverstein > |
From: Alan S. <aj...@fr...> - 2010-12-20 22:43:00
|
Jim et al, > For various reasons it is awkward for all Judy allocations to come out > of the global heap managed by malloc/free. We want to manage separate > pools (heaps) and be able to specify for each Judy array which pool > should be used for all memory allocs/frees done by the array. Gotcha. I'm wondering how you get multiple heaps. Been a while, I'm rusty. I know sbrk() is the *NIX way to get to quadrant-1 heap (right?), but newer alternative memory methods let you access shared memory or other address spaces? Or do modern malloc()s let you just specify a heap and somehow manage this for you? The fact that you are worried about this level of performance is both inspiring and depressing. I guess libJudy is doing good things for you but you still need it to be faster. > I agree that malloc/free has excellent performance and that it was a > good choice to have Judy use malloc as opposed to some other memory > allocator using a single heap. What other allocator? :-) Pretty much they are all called "malloc", but implementations vary greatly. Oh, I remembered, it's Doug Lea malloc() that Doug liked. Sounds to me like you and others have a much better grasp of the technicalities today than I do -- sorry. I made heavy and happy use of libJudy myself as a consumer 2007-2009, but haven't looked inside the code since 2002 when I helped develop it on HP. Also haven't heard from Doug Baskins, the main inventor, for several years. Dunno what he's up to. Cheers, Alan Silverstein |
From: john s. <sk...@us...> - 2010-12-20 22:02:58
|
On 21/12/2010, at 8:17 AM, Jim Lloyd wrote: > I agree that malloc/free has excellent performance and that it was a good choice to have Judy use malloc as opposed to some other memory allocator using a single heap. It is only because we want multiple pools that we want something other than malloc. Is your app multi-threaded? If not, you can use a global variable to store the allocator data: very low cost compared to passing the allocator around. [You set the variable before each operation] The same method would work for multi-threaded apps with a lock to protect the allocator: pretty ugly and much higher performance cost. > > We're prepared to sacrifice a bit of performance due to the increase in complexity. Even if the underlying pool allocator is as good as malloc/free, there will be a small performance cost because a reference to the allocator will have to be stored in the root record for a Judy array and then passed down as a function argument in the various operations that traverse the internal tree/trie data structures. We believe the benefits will outweigh the cost. For you. The challenge would be to modify the source so this overhead is optional. Say, by having two completely distinct versions of the source.. (Arggg not another MACRO please .. :) The current build technology is very bad so let's not make that even worse with more generators: we had to throw that stuff out because it is too hard to manage. My primary application for Judy is in a garbage collector, which maps pointers to information about the allocated objects. Judy is also available "in language" as a data structure to hold integers (can't use pointers yet because it requires the gc to know all about Judy, since Judy hides addresses). In this context, a small extra cost adding a new address to the Judy data structures would be acceptable: allocation is already expensive, so an extra 5-10% overhead would be OK. The cost on a collection is another story though, i.e. 
the cost of *removing* values from a Judy array matters here because all the removals are done "at once" during the reaping phase of the collector and so are observable by the programmer (think games and framerates). Do you need to de-allocate as well? Often pools are used to avoid deallocation costs: you just de-allocate the whole pool. -- john skaller sk...@us... |
From: Jim L. <jl...@si...> - 2010-12-20 22:40:58
|
On Mon, Dec 20, 2010 at 2:02 PM, john skaller <sk...@us... > wrote: > > On 21/12/2010, at 8:17 AM, Jim Lloyd wrote: > > > I agree that malloc/free has excellent performance and that it was a good > choice to have Judy use malloc as opposed to some other memory allocator > using a single heap. It is only because we want multiple pools that we want > something other than malloc. > > Is your app multi-threaded? If not, you can use a global variable to store > the > allocator data: very low cost compared to passing the allocator around. > [You set the variable before each operation] > Yes, our app is multi-threaded . Actually, we have several different applications that share common code and Judy is integral to all of these applications. Some of these apps are batch and some are long-lived daemons. All of them scale well up to 16 cores. > > The same method would work for multi-threaded apps with a lock to protect > the allocator: pretty ugly and much higher performance cost. > No, that's not feasible. We can have literally millions of Judy arrays in one process. One global lock shared by all Judy arrays would kill the performance. > > > > We're prepared to sacrifice a bit of performance due to the increase in > complexity. Even if the underlying pool allocator is as good as malloc/free, > there will be a small performance cost because a reference to the allocator > will have to be stored in the root record for a Judy array and then passed > down as a function argument in the various operations that traverse the > internal tree/trie data structures. We believe the benefits will outweigh > the cost. > > For you. The challenge would be to modify the source so this overhead is > optional. > Say, by having two completely distinct versions of the source.. (Arggg not > another > MACRO please .. :) The current build technology is very bad so lets not > make > that even worse with more generators: we had to throw that stuff out > because > it is too hard to manage. 
> I agree this is a challenge. > > My primary application for Judy is in a garbage collector, which maps > pointers > to information about the allocated objects. Judy is also available "in > language" as > a data structure to hold integers (can't use pointers yet because it > requires the gc > know all about Judy, since Judy hides addresses). > > In this context, a small extra cost adding a new address to the Judy data > structures > would be acceptable: allocation is already expensive, so an extra 5-10% > overhead > would be OK. > That's my assumption. I'd expect the added overhead to be under 1% when amortized over the total cost (i.e. including the cost of the allocations). There should be no extra cost to the Judy*Get operations since these don't ever allocate. A Judy*Insert operation will be marginally more expensive, even in the case where the key already exists and no allocation is necessary, due to an additional argument on each stack frame. This is the worst case for cost, because the other case of a Judy*Insert where the key doesn't exist will have cost dominated by the cost of the memory allocation. > > The cost on a collection is another story though, i.e. the cost of > *removing* values > from a Judy array matters here because all the removals are done "at once" > during > the reaping phase of the collector and so are observable by the programmer > (think games and framerates). > > Do you need to de-allocate as well? Often pools are used to avoid > deallocation > costs: you just de-allocate the whole pool. > Yes, de-allocation is one of the big motivating factors for us, primarily in our long-lived daemon applications. > > -- > john skaller > sk...@us... > > > > > |
From: John M. <jo...@re...> - 2010-12-21 01:01:33
|
On Mon, Dec 20, 2010 at 2:02 PM, john skaller <sk...@us...> wrote: > My primary application for Judy is in a garbage collector, which maps pointers > to information about the allocated objects. Judy is also available "in language" as > a data structure to hold integers (can't use pointers yet because it requires the gc > know all about Judy, since Judy hides addresses). Ah, I was curious if anyone else was using Judy this way. I have been using Judy arrays to store all the GC meta info for my jhc compiler and it works really well. Although I no longer depend on it in the default case, I turn it on for debugging and tracking allocations and I found it very useful when prototyping new algorithms without having to modify the heap layout of my run-time. I just store the meta information in judy arrays and profiling and tracing will help me decide whether implementing the change for real is worth it. John |
From: john s. <sk...@us...> - 2010-12-21 02:49:06
|
On 21/12/2010, at 9:33 AM, Jim Lloyd wrote: > > On 21/12/2010, at 8:17 AM, Jim Lloyd wrote: > > Yes, our app is multi-threaded . Actually, we have several different applications that share common code and Judy is integral to all of these applications. Some of these apps are batch and some are long-lived daemons. All of them scale well up to 16 cores. > I am so glad to hear this; it gives me confidence in my choice of Judy for my own app. > The same method would work for multi-threaded apps with a lock to protect > the allocator: pretty ugly and much higher performance cost. > > No, that's not feasible. We can have literally millions of Judy arrays in one process. One global lock shared by all Judy arrays would kill the performance. > Yea, it would. Actually you only need one lock per pool.. but it's still bad. > In this context, a small extra cost adding a new address to the Judy data structures > would be acceptable: allocation is already expensive, so an extra 5-10% overhead > would be OK. > > That's my assumption. I'd expect the added overhead to be under 1% when amortized over the total cost (i.e. including the cost of the allocations). There should be no extra cost to the Judy*Get operations since these don't ever allocate. A Judy*Insert operation will be marginally more expensive, even in the case where the key already exists and no allocation is necessary, due to an additional argument on each stack frame. This is the worst case for cost, because the other case of a Judy*Insert where the key doesn't exist will have cost dominated by the cost of the memory allocation. > Unfortunately it's hard to predict, since compilers like gcc can do a fantastic job on C code, and a tiny change can halve the performance. One extra machine register to pass the pointer down could cost 1% or 50%. > Do you need to de-allocate as well? Often pools are used to avoid deallocation > costs: you just de-allocate the whole pool. 
> > Yes, de-allocation is one of the big motivating factors for us, primarily in our long-lived daemon applications. If you're doing individual de-allocations instead of whole-pool deallocations, why are you using pools? [I mean why do you want to use pools :] -- john skaller sk...@us... |
From: john s. <sk...@us...> - 2010-12-21 02:49:07
|
On 21/12/2010, at 12:01 PM, John Meacham wrote: > On Mon, Dec 20, 2010 at 2:02 PM, john skaller > <sk...@us...> wrote: >> My primary application for Judy is in a garbage collector, which maps pointers >> to information about the allocated objects. Judy is also available "in language" as >> a data structure to hold integers (can't use pointers yet because it requires the gc >> know all about Judy, since Judy hides addresses). > > Ah, I was curious if anyone else was using Judy this way. I have been > using Judy arrays to store all the GC meta info for my jhc compiler > and it works really well. Although I no longer depend on it in the > default case, Oh? So how do you access the information? I suppose Haskell has different requirements. In Felix I have to cope with C pointers as well and also pointers INTO objects, and want to mix them all up as best as possible. Judy is just great since I can search for a pointer less than or equal to p, then use the result q to get a length n, and ask if q <= p < q+n and if so I have a pointer to q I allocated, otherwise it's a pointer I didn't allocate, or, maybe even an integer or something else from some C library. -- john skaller sk...@us... |
From: John M. <jo...@re...> - 2010-12-21 03:35:39
|
On Mon, Dec 20, 2010 at 6:28 PM, john skaller <sk...@us...> wrote: >> Ah, I was curious if anyone else was using Judy this way. I have been >> using Judy arrays to store all the GC meta info for my jhc compiler >> and it works really well. Although I no longer depend on it in the >> default case, > > Oh? So how do you access the information? > > I suppose Haskell has different requirements. In Felix I have to cope with > C pointers as well and also pointers INTO objects, and want to mix them all up > as best as possible. > > Judy is just great since I can search for a pointer less than or equal to p, > then use the result q to get a length n, and ask if q <= p < q+n and if > so I have a pointer to q I allocated, otherwise it's a pointer I didn't allocate, > or, maybe even an integer or something else from some C library. I was basically using it to implement a standard tri-color GC except instead of attaching the bits to the heap objects, I used Judy sets to represent the color sets. It had a number of advantages: the sweep phase was as simple as discarding the array, and the cache benefits of not having to bounce all over the heap were substantial. I could add something to the grey set and pre-fetch it, then go on, and by the time I went to follow its pointers it would be prefetched in; having to go to the heap location to set the grey bit would introduce a cache line stall. Judy helped me quickly experiment with different traversal/pre-fetching strategies. For the common case of allocations that have no internal pointers (known statically due to typing) I need never follow the pointer at all, just mark it black in my judy set and skip the grey list. Using judy to add debugging information was straightforward too, just have a judy map of memory location to whatever debugging info I wanted to collect at the allocation point (line number of code for instance, or the last lazy evaluation update to said location). 
I didn't want to make programs compiled with jhc depend on the judy library and could actually implement something better than straight judy due to having type information to take advantage of, but I was able to implement a fully formed and tested GC system due to my experimentation with judy which was quite useful. John |
From: Jim L. <jl...@si...> - 2010-12-21 03:17:24
|
On Mon, Dec 20, 2010 at 6:08 PM, john skaller <sk...@us... > wrote: > > On 21/12/2010, at 9:33 AM, Jim Lloyd wrote: > > Yes, de-allocation is one of the big motivating factors for us, primarily > in our long-lived daemon applications. > > If you're doing individual de-allocations instead of whole-pool > deallocations, why are you using pools? > [I mean why do you want to use pools :] > > We're not yet using pools and currently we don't do individual de-allocations. One of our uses of Judy is a symbol table, i.e. a bidirectional mapping between strings and tokens. This table only grows. This is fine in our batch applications but is an issue in our long-lived daemons where it is an apparent slow leak. We can delete the entire table and start fresh but we need a fast way to do this. Even if we could somehow make individual de-allocations fast, I'm concerned about heap fragmentation. A separate pool for the symbol table would address both problems. |
From: john s. <sk...@us...> - 2010-12-21 04:47:09
|
On 21/12/2010, at 2:17 PM, Jim Lloyd wrote: > On Mon, Dec 20, 2010 at 6:08 PM, john skaller <sk...@us...> wrote: > > On 21/12/2010, at 9:33 AM, Jim Lloyd wrote: > > Yes, de-allocation is one of the big motivating factors for us, primarily in our long-lived daemon applications. > > If you're doing individual de-allocations instead of whole-pool deallocations, why are you using pools? > [I mean why do you want to use pools :] > > We're not yet using pools and currently we don't do individual de-allocations. One of our uses of Judy is a symbol table, i.e. a bidirectional mapping between strings and tokens. This table only grows. This is fine in our batch applications but is an issue in our long-lived daemons where it is an apparent slow leak. We can delete the entire table and start fresh but we need a fast way to do this. Even if we could somehow make individual de-allocations fast, I'm concerned about heap fragmentation. A separate pool for the symbol table would address both problems. Right. So that's even a bit harder: I'd accept your overhead on allocations but desire a deallocator without the overhead, since my system does do individual de-allocations, but it does them "in bulk" where RT performance is noticeable. Actually .. my system has provision for using a completely distinct collector for each object, and thus supporting pools and pool de-allocation .. but that doesn't apply to the Judy arrays providing the infrastructure: tricky making the GC infrastructure itself gc-collectable :) Hmm, there is only one other transparent solution I can think of: TLS. That doesn't require locking by the user; no idea how the compiler/os handle it, though. Traditionally TLS is expensive. You'd go back to the store-the-allocator-in-a-TLS-variable-per-operation model. If you're using any C style I/O with errno crud, you're using TLS already. 
Actually, malloc() has to use synchronisation already, so you're already paying a performance hit for it. It used to use mutex locks, but today I don't know; it seems a good candidate for "lock-free" operation to me. -- john skaller sk...@us... |
From: Alex M. <ale...@gm...> - 2010-12-21 04:58:35
|
Just dropping by to provide some numbers... Hmm, there is only one other transparent solution I can think of: TLS. That > doesn't require > locking by the user, no idea how the compiler/os handle it though. > Traditionally TLS is > expensive. You'd go back to the store the allocator in a TLS variable each > operation > model. If you're using any C style I/O with errno crud, you're using TLS > already. > If you look at this thread <http://lists.boost.org/Archives/boost/2009/03/149601.php> from Boost's mailing list, there's some nice benchmarking showing the performance of TLS (both pthread's and boost's implementation). Actually, malloc() has to use synchronisation already so you're already > paying a performance hit > for synchronisation, it used to use mutex locks but today I don't know, it > seems a good candidate > for "lock free" operation to me. The graphs generated to show tcmalloc's performance vs glibc's malloc (found here: <http://goog-perftools.sourceforge.net/doc/tcmalloc.html>) give a nice idea of how well malloc handles multiple cores. |
From: john s. <sk...@us...> - 2010-12-21 06:43:51
|
On 21/12/2010, at 3:58 PM, Alex Miller wrote: > > If you look at this thread from Boost's mailing list, there's some nice benchmarking showing the performance of TLS (both pthread's and boost's implementation). Hardly surprising. Last I looked TLS is supported by blowing a machine register (FS, I think?), and there's no way boost can compete with an ABI-integrated mechanism. > > The graphs generated to show tcmalloc's performance vs glibc's malloc (found here) give a nice idea of how well malloc handles multiple cores. Oooo .. thanks for that! Will have to investigate! (Criticism of paper: no date on it! mallocs et al. change fast, nice to have dates). -- john skaller sk...@us... |