From: Zoran V. <zv...@ar...> - 2007-01-23 14:54:03

On 23.01.2007, at 15:42, Zoran Vasiljevic wrote:

> This way the system allocator is handling those, and it can toss
> free'd blocks to other threads, reducing peak memory usage. Since
> allocations of that size (and larger) are pretty rare, we need not
> worry about locking contention, as is the case with smaller
> allocations.

More info... Actually, this lowers the amount of memory released back
to the system (since system allocators mostly never do that), but it
greatly reduces the peak memory usage. In our app, when we start a job,
the VM goes to about 300MB with the initial VTmalloc, to 235MB with the
modified VTmalloc (8K allocs), and to about 225MB with the system
alloc. At the same time, VTmalloc is up to 20 times faster than the
system alloc (that's on Mac). That's a pretty good bargain for us:
investing about 10MB more for far better performance is absolutely
acceptable. I haven't tested Linux or Solaris yet, but I guess I will
get the same result (apart from the difference between the OS alloc and
us). What I would like to do is lower that usage even more at the cost
of more locking contention, as we can tolerate that. Just to see how
low and fast we can go.
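
The cutoff strategy discussed in these two messages is simple to
picture: requests above a threshold bypass the per-thread pools
entirely. A minimal sketch, assuming an 8K cutoff and a toy pool path;
the names and structure are illustrative, not VTmalloc's actual
internals:

    #include <stdio.h>
    #include <stdlib.h>

    #define LARGE_CUTOFF 8192   /* assumed threshold; the thread discusses 8K/16K */

    /* Toy stand-in for the per-thread bucket pools; real VTmalloc differs. */
    static void *PoolAlloc(size_t size) { return malloc(size); }

    void *
    ThreadAlloc(size_t size)
    {
        if (size > LARGE_CUTOFF) {
            /*
             * Rare large requests go straight to the system allocator,
             * which can hand freed blocks back to any thread; this caps
             * the peak footprint, and lock contention is negligible
             * because such requests are infrequent.
             */
            return malloc(size);
        }
        /* Frequent small requests stay on the fast per-thread path. */
        return PoolAlloc(size);
    }

    int
    main(void)
    {
        void *a = ThreadAlloc(100);      /* pooled */
        void *b = ThreadAlloc(100000);   /* handed to system malloc */
        free(b);
        free(a);
        return 0;
    }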
From: Zoran V. <zv...@ar...> - 2007-01-23 14:43:04

On 23.01.2007, at 15:33, Stephen Deasey wrote:

> Why?

Even more: from 16K down to 8K. Because most of the allocations fall in
the range up to about 8K. That was empirically obtained on a large
OpenACS site. And, as we are pretty greedy (a thread never releases
allocated memory until it exits), we can bloat significantly. This way
the system allocator is handling those, and it can toss free'd blocks
to other threads, reducing peak memory usage. Since allocations of that
size (and larger) are pretty rare, we need not worry about locking
contention, as is the case with smaller allocations. I can confirm that
with this strategy we are still pretty fast (speed practically didn't
change) and use far less memory than before.
From: Stephen D. <sd...@gm...> - 2007-01-23 14:33:18

On 1/22/07, Zoran Vasiljevic <vas...@us...> wrote:

> Update of /cvsroot/naviserver/vtmalloc
> In directory sc8-pr-cvs7.sourceforge.net:/tmp/cvs-serv32463
>
> Modified Files:
>         ChangeLog tclThreadAlloc.c
> Log Message:
> Reduced allocation size handled by this allocator to 16384 bytes.
> Larger allocations are done with the system allocator.

Why?
From: Stephen D. <sd...@gm...> - 2007-01-20 16:05:46

On 1/20/07, Zoran Vasiljevic <zv...@ar...> wrote:

> On 19.01.2007, at 17:38, Stephen Deasey wrote:
>
>> For Tcl, add a new switch to ns_cache_create, -expireflush (or
>> whatever). Keep the -expire and -expireflush times distinct. Tcl
>> caches live forever IIRC, so that's all there is to it; no need to
>> handle unregistering the scheduled proc.
>
> Why not just
>
>     ns_cache_flush ?-expired? cache
>
> The programmer may invoke that anytime they wish. Actually, only the
> Tcl wrapper needs to be done: it will step over each entry in the
> cache but not return any values. The cache code will take care of
> expiring old items as it walks the cache.

Even better. By adding the flag to the ns_cache_create command I was
trying to bias the Tcl side towards ease and the C side towards
flexibility. But the above is easy enough, and it looks cleaner.
From: Zoran V. <zv...@ar...> - 2007-01-20 15:52:09

On 19.01.2007, at 17:38, Stephen Deasey wrote:

> I don't think it's 100% accurate to say memory just bloats and
> bloats! All caches have an upper bound. A 10MB cache will not grow
> beyond 10MB.

Yes. It is not 100% accurate.

> Looks like you have a different situation though. Your cache is
> sized large enough to hold all data, but the usual working set is
> much smaller. With the fancy new memory allocator that returns
> memory to the OS, you'd like to prune the cache of expired entries
> even though the cache is still below its configured max size.

Actually, the situation is: I have a large database (not SQL), and I
cache some visited rows in the ns_cache. During the session the user
may cache a number of rows but never visit them again, hence they will
stay in the cache forever. It may be that most of the cache is just
old data, kept there for no reason.

> I removed this functionality from the old code!
>
> I'm fine with it being put back, but there were a couple of problems
> with the old code worth mentioning.
>
> In the old code, caches were either size limited *or* time limited,
> but not both. With a time limited cache therefore you had to have a
> sweeper thread to clear out expired entries or it would indeed bloat
> and bloat... Unfortunately, if a load spike occurred in between
> sweeps then the cache would grow until the server aborts.
>
> Another problem (if I'm remembering this right) was that the sweep
> interval was the same as the expiry time for entries, which was
> specified at cache creation time. So if you had a large cache with a
> short expiry time, frequently a thread would be created, lock the
> cache, scan through all the entries, maybe evict a few, then unlock
> the cache. And then do it again a moment later.

OK. No automatic sweeping. I'm also fine with running it by hand at
chosen points in time.

> So for a new sweeper, maybe the way to handle it is to create
> Ns_CacheFlushExpired() which can be passed to Ns_ScheduleProc(). C
> code can handle the period manually.
>
> For Tcl, add a new switch to ns_cache_create, -expireflush (or
> whatever). Keep the -expire and -expireflush times distinct. Tcl
> caches live forever IIRC, so that's all there is to it; no need to
> handle unregistering the scheduled proc.

Why not just

    ns_cache_flush ?-expired? cache

The programmer may invoke that anytime they wish. Actually, only the
Tcl wrapper needs to be done: it will step over each entry in the
cache but not return any values. The cache code will take care of
expiring old items as it walks the cache.

I cannot comment on having a single LRU list for all the caches. It
sounds like more work than I can afford. A simple "-expired" flag
would be sufficient. But the idea seems OK, as it would free me from
"guessing" how large each cache would/should be.
From: Stephen D. <sd...@gm...> - 2007-01-19 20:15:50

On 1/19/07, Jeff Rogers <dv...@di...> wrote:

> A second situation is where there is an expired entry and multiple
> threads access it at the same time. It is probably handled as above,
> as if there was no entry. But it might be desirable in certain
> situations to return a stale entry immediately and start a separate
> thread to run the script.

That's an interesting idea -- a stale entry might be acceptable
sometimes. The situation isn't too bad, though. If the server is
loaded and computing the cache entry takes 30 secs, only the first
thread, the computing thread, will take the full 30 sec hit. The other
threads will wait only -timeout seconds, which may be e.g. 2 secs.
They'll fail, but at least they won't be tied up waiting. It might be
nice for subsequent threads to have the option of getting a stale
value after the timeout period.
From: Stephen D. <sd...@gm...> - 2007-01-19 19:59:30

On 1/19/07, Jeff Rogers <dv...@di...> wrote:

> Stephen Deasey wrote:
>
>> Hit rate is not necessarily the best measure of effectiveness.
>> Perhaps it's more expensive to fill one cache than it is to fill
>> another? A single, shared LRU seemed like the simplest solution.
>> Ideas appreciated.
>
> What if 'ns_cache eval' stored with the cached value an estimate of
> how much work (i.e., time) it took to create the cache entry, and
> used that as a hint for expiring or flushing values? An operation
> that is used 1/10 as often but takes 50x longer to compute may be
> worth keeping around longer.

That might make sense. Save the x50 hit 1/10 of the time. I guess
you'd have to factor in the size. If the x50 hit took a lot of cache
space, it might not be worth it compared to all those more frequently
used, but less expensive, entries you could have stored. How do you
figure out where the line is? Also, how would you sort the weighted
LRU list?

> And as long as I'm musing on caching, 2 situations dealing with
> expensive cache calculations come to mind. One is where there is no
> current value for an entry and several threads request it
> simultaneously (or it is requested multiple times before the first
> request has time to complete evaluation of the script). Do the
> second and subsequent accesses start their own evaluation of the
> script, or do they block until the first completes and then all
> return the single value? Without digging into the code or running
> tests, my guess is that this works correctly, since the key should
> be locked with a mutex and reads would block the thread until the
> mutex is released.

We do the right thing. There's a lock around the whole cache. If an
entry value needs to be computed, the entry is marked, the lock is
released while the value is computed, and the lock is re-taken once
the value is ready. Another thread which takes the lock will discover
the marked entry and sleep waiting for the original thread to complete
the cache fill.

> A second situation is where there is an expired entry and multiple
> threads access it at the same time. It is probably handled as above,
> as if there was no entry. But it might be desirable in certain
> situations to return a stale entry immediately and start a separate
> thread to run the script.

Should work as above.

> These solutions may be too clever for their own good and apply only
> to a very few marginal cases.

Seems like the cache balancing problem is the kind of thing someone
way smarter than me has already solved. I had a google around but
could only find papers on optimising processor caches...
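
The mark/unlock/relock dance Stephen describes is the standard guard
against a cache stampede (every thread recomputing the same missing
entry). A minimal sketch of that pattern with POSIX threads; the
one-entry "cache", the names, and the fill function are illustrative
stand-ins, not NaviServer's actual code:

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    typedef struct Entry {
        void *value;     /* NULL until the fill completes */
        int   filling;   /* set while one thread computes the value */
    } Entry;

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    static Entry *
    LookupOrCreate(const char *key)
    {
        static Entry only;   /* toy one-entry cache, just for the sketch */
        (void)key;
        return &only;
    }

    static void *
    ComputeValue(const char *key)
    {
        (void)key;
        sleep(1);            /* stand-in for the expensive script/query */
        return strdup("value");
    }

    void *
    CacheGet(const char *key)
    {
        Entry *e;
        void *result;

        pthread_mutex_lock(&lock);
        e = LookupOrCreate(key);
        while (e->value == NULL && e->filling) {
            /* Another thread is computing this entry: sleep, don't recompute. */
            pthread_cond_wait(&cond, &lock);
        }
        if (e->value == NULL) {
            void *v;
            e->filling = 1;
            pthread_mutex_unlock(&lock);   /* drop the lock for the slow part */
            v = ComputeValue(key);
            pthread_mutex_lock(&lock);
            e->value = v;
            e->filling = 0;
            pthread_cond_broadcast(&cond); /* wake every waiter */
        }
        result = e->value;
        pthread_mutex_unlock(&lock);
        return result;
    }

A per-entry condition (rather than one global one) avoids waking
unrelated waiters, at the cost of more state per entry.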
From: Jeff R. <dv...@di...> - 2007-01-19 19:03:13

Stephen Deasey wrote:

> Hit rate is not necessarily the best measure of effectiveness.
> Perhaps it's more expensive to fill one cache than it is to fill
> another? A single, shared LRU seemed like the simplest solution.
> Ideas appreciated.

What if 'ns_cache eval' stored with the cached value an estimate of
how much work (i.e., time) it took to create the cache entry, and used
that as a hint for expiring or flushing values? An operation that is
used 1/10 as often but takes 50x longer to compute may be worth
keeping around longer.

And as long as I'm musing on caching, 2 situations dealing with
expensive cache calculations come to mind. One is where there is no
current value for an entry and several threads request it
simultaneously (or it is requested multiple times before the first
request has time to complete evaluation of the script). Do the second
and subsequent accesses start their own evaluation of the script, or
do they block until the first completes and then all return the single
value? Without digging into the code or running tests, my guess is
that this works correctly, since the key should be locked with a mutex
and reads would block the thread until the mutex is released.

A second situation is where there is an expired entry and multiple
threads access it at the same time. It is probably handled as above,
as if there was no entry. But it might be desirable in certain
situations to return a stale entry immediately and start a separate
thread to run the script.

These solutions may be too clever for their own good and apply only to
a very few marginal cases.

-J
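
One way to picture Jeff's cost hint: store the measured fill time with
each entry and discount the entry's idle time by how expensive it is
to recompute per byte it occupies. A sketch of such a score; the
weighting (log-damped cost per byte) is an arbitrary illustrative
choice, not a tested policy:

    #include <math.h>
    #include <time.h>

    typedef struct CacheEntry {
        time_t last_used;      /* LRU component */
        double fill_seconds;   /* measured cost to recompute the value */
        size_t nbytes;         /* space the entry occupies */
    } CacheEntry;

    /*
     * Lower score == better eviction candidate.  Idle entries age out,
     * but entries that were expensive to compute (per byte of cache
     * they use) earn extra time before being pushed out.
     */
    static double
    EvictionScore(const CacheEntry *e, time_t now)
    {
        double idle = difftime(now, e->last_used);
        double cost_per_byte = e->fill_seconds / (double)(e->nbytes + 1);

        return log1p(cost_per_byte) - idle;
    }

A sweep would then evict the lowest-scoring entries first, which also
hints at Stephen's sorting question: a cost-weighted LRU needs a
priority structure rather than a plain list.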
From: Stephen D. <sd...@gm...> - 2007-01-19 16:38:17

On 1/19/07, Zoran Vasiljevic <zv...@ar...> wrote:

> Hi!
>
> I have observed a ns_cache behaviour that I'd like to change.
>
> In ns_cache, it is possible to create timestamped elements that
> would expire after some time. But they will never expire on their
> own. If you fill the cache with thousands of elements and never
> visit any of those any more, they will stay in the cache forever,
> bloating the memory.
>
> There has to be some instance that will either automatically purge
> expired elements (a yet-another-detached thread), or one should be
> able to do that programmatically, like
>
>     ns_cache_flush ?-expired? cache
>
> otherwise the system memory just bloats and bloats. At the moment,
> the only workaround to this behaviour is the programmer calling
>
>     ns_cache_keys
>
> as it will return all still-valid keys, silently deleting expired
> ones. This is not optimal, as again large memory might be needed to
> store all the "alive" keys, although the caller is not particularly
> interested in having them in the first place.
>
> What should we do?

I don't think it's 100% accurate to say memory just bloats and bloats!
All caches have an upper bound. A 10MB cache will not grow beyond
10MB.

There are two ways an expired entry will be physically removed from
the cache: either it reaches the end of the LRU list and, expired or
not, is pruned; or it is accessed after it has expired, and so is
pruned rather than returned.

The idea is that a cache contains a subset of the total data set. The
natural action of fresh data entering the cache pushes out the expired
entries without overhead, just as it pushes out infrequently used
entries.

Looks like you have a different situation though. Your cache is sized
large enough to hold all data, but the usual working set is much
smaller. With the fancy new memory allocator that returns memory to
the OS, you'd like to prune the cache of expired entries even though
the cache is still below its configured max size.

I removed this functionality from the old code!

I'm fine with it being put back, but there were a couple of problems
with the old code worth mentioning.

In the old code, caches were either size limited *or* time limited,
but not both. With a time limited cache therefore you had to have a
sweeper thread to clear out expired entries or it would indeed bloat
and bloat... Unfortunately, if a load spike occurred in between sweeps
then the cache would grow until the server aborts.

Another problem (if I'm remembering this right) was that the sweep
interval was the same as the expiry time for entries, which was
specified at cache creation time. So if you had a large cache with a
short expiry time, frequently a thread would be created, lock the
cache, scan through all the entries, maybe evict a few, then unlock
the cache. And then do it again a moment later.

So for a new sweeper, maybe the way to handle it is to create
Ns_CacheFlushExpired() which can be passed to Ns_ScheduleProc(). C
code can handle the period manually.

For Tcl, add a new switch to ns_cache_create, -expireflush (or
whatever). Keep the -expire and -expireflush times distinct. Tcl
caches live forever IIRC, so that's all there is to it; no need to
handle unregistering the scheduled proc.

I've been thinking about how to auto-tune cache sizes. It occurs to me
that maybe it solves this problem too. The idea is to use a common LRU
list for all caches. Specify a total maximum cache size which all
caches will share. Don't specify individual cache sizes. You set the
total max cache size to whatever your server can spare, and as the
server runs and load changes, the caches compete for a slice of
memory. In effect, creating a new cache is creating a new namespace in
the single global cache.

The more caches exist, the harder it is to tune the sizes for maximum
performance. When you consider dynamically changing load, it's
basically impossible. The above is just an implementation -- the
underlying problem is: given space X and Y caches of varying load, how
do you assign the space to the caches such that, overall, the
performance is highest? Hit rate is not necessarily the best measure
of effectiveness. Perhaps it's more expensive to fill one cache than
it is to fill another? A single, shared LRU seemed like the simplest
solution. Ideas appreciated.

Anyway, this applies to your current problem because, by using a
common LRU, the contents of other caches can push out the expired
entries from a cache which would otherwise leave them active. My
assumption is the same: caches are usually smaller than the total data
size. The common case will compensate for your unusual cache problem.

If you only have the one cache, or all your caches fit fully into
memory, I guess this doesn't apply. You need the sweeper. Otherwise,
the auto-tuning caches would mean no new tuning knob for expiry
flushing, and of course the automatic size tuning...
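
Ns_CacheFlushExpired() is only proposed at this point in the thread. A
rough sketch of how such a sweeper could be wired up, modeled on the
Ns_ScheduleProc()/Ns_ConfigIntRange() usage Vlad shows below; the
flush helper, its body, and the config key are assumptions:

    #include <limits.h>
    #include "ns.h"   /* Ns_ScheduleProc, Ns_ConfigIntRange */

    /*
     * Placeholder for the proposed expiry flush: it would lock the
     * cache passed in arg, walk its entries, and drop the stale ones.
     * The walk itself is elided because the cache API is what's being
     * designed in this thread.
     */
    static void
    CacheFlushExpiredProc(void *arg, int id)
    {
        (void)id;
        /* ... lock cache in arg, walk entries, prune expired ones ... */
        (void)arg;
    }

    /* At cache-creation time, register the periodic sweep: */
    void
    RegisterExpirySweep(void *cache, const char *path)
    {
        int interval = Ns_ConfigIntRange(path, "flushinterval",
                                         600, 1, INT_MAX);

        Ns_ScheduleProc(CacheFlushExpiredProc, cache, 0, interval);
    }

Keeping the sweep interval independent of the entries' expiry time
addresses Stephen's complaint about the old code sweeping a large
cache far too often.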
From: Vlad S. <vl...@cr...> - 2007-01-19 15:13:46

You can make it similar to the DB driver, where each pool has its own
scheduled proc for periodic checks:

    Ns_ScheduleProc(CheckPool, poolPtr, 0,
                    Ns_ConfigIntRange(path, "checkinterval", 600, 0, INT_MAX));

Zoran Vasiljevic wrote:

> Hi!
>
> I have observed a ns_cache behaviour that I'd like to change.
>
> In ns_cache, it is possible to create timestamped elements that
> would expire after some time. But they will never expire on their
> own. If you fill the cache with thousands of elements and never
> visit any of those any more, they will stay in the cache forever,
> bloating the memory.
>
> There has to be some instance that will either automatically purge
> expired elements (a yet-another-detached thread), or one should be
> able to do that programmatically, like
>
>     ns_cache_flush ?-expired? cache
>
> otherwise the system memory just bloats and bloats. At the moment,
> the only workaround to this behaviour is the programmer calling
>
>     ns_cache_keys
>
> as it will return all still-valid keys, silently deleting expired
> ones. This is not optimal, as again large memory might be needed to
> store all the "alive" keys, although the caller is not particularly
> interested in having them in the first place.
>
> What should we do?
>
> Cheers
> Zoran

--
Vlad Seryakov
571 262-8608 office
vl...@cr...
http://www.crystalballinc.com/vlad/
From: Zoran V. <zv...@ar...> - 2007-01-19 11:00:45

Hi!

I have observed a ns_cache behaviour that I'd like to change.

In ns_cache, it is possible to create timestamped elements that would
expire after some time. But they will never expire on their own. If
you fill the cache with thousands of elements and never visit any of
those any more, they will stay in the cache forever, bloating the
memory.

There has to be some instance that will either automatically purge
expired elements (a yet-another-detached thread), or one should be
able to do that programmatically, like

    ns_cache_flush ?-expired? cache

otherwise the system memory just bloats and bloats. At the moment, the
only workaround to this behaviour is the programmer calling

    ns_cache_keys

as it will return all still-valid keys, silently deleting expired
ones. This is not optimal, as again large memory might be needed to
store all the "alive" keys, although the caller is not particularly
interested in having them in the first place.

What should we do?

Cheers
Zoran
From: Gustaf N. <ne...@wu...> - 2007-01-17 07:50:45

Hi Jeff,

we are aware that the function is essentially an integer log2. The
chosen C-based variant is actually faster and more general than what
you have included (it needs at most 2 shift operations for the
relevant range), but the assembler-based variant is hard to beat and
yields another 3% for the performance of the benchmark on top of the
fastest C version. Thanks for that!

-gustaf

Jeff Rogers wrote:

> I don't think anyone has pointed this out yet, but this is a
> logarithm in base 2 (log2), and there are a fair number of
> implementations of this available; for maximum performance there are
> assembly implementations using 'bsr' on x86 architectures, such as
> this one from google's tcmalloc:
From: Jeff R. <dv...@di...> - 2007-01-16 23:03:31

Gustaf Neumann wrote:

> This is most probably the best variant so far, and not complicated;
> such an optimizer can do "the right thing" easily. sorry for the
> many versions..
>
> -gustaf
>
>     { unsigned register int s = (size-1) >> 3;
>       while (s>1) { s >>= 1; bucket++; }
>     }
>
>     if (bucket > NBUCKETS) {
>         bucket = NBUCKETS;
>     }

I don't think anyone has pointed this out yet, but this is a logarithm
in base 2 (log2), and there are a fair number of implementations of
this available; for maximum performance there are assembly
implementations using 'bsr' on x86 architectures, such as this one
from google's tcmalloc:

    // Return floor(log2(n)) for n > 0.
    #if (defined __i386__ || defined __x86_64__) && defined __GNUC__
    static inline int LgFloor(size_t n) {
        // "ro" for the input spec means the input can come from either a
        // register ("r") or offsetable memory ("o").
        size_t result;
        __asm__("bsr %1, %0"
                : "=r" (result)   // Output spec
                : "ro" (n)        // Input spec
                : "cc"            // Clobbers condition-codes
        );
        return result;
    }
    #else
    // Note: the following only works for "n"s that fit in 32-bits, but
    // that is fine since we only use it for small sizes.
    static inline int LgFloor(size_t n) {
        int log = 0;
        for (int i = 4; i >= 0; --i) {
            int shift = (1 << i);
            size_t x = n >> shift;
            if (x != 0) {
                n = x;
                log += shift;
            }
        }
        ASSERT(n == 1);
        return log;
    }
    #endif

(Disclaimer - this comment is based on my explorations of zippy, not
vt, so the logic may be entirely different.)

If this log2(requested_size) is used to index directly into the bucket
table, that necessarily restricts you to power-of-2 bucket sizes,
meaning you allocate on average nearly 50% more than requested (i.e.,
nearly 33% of allocated memory is overhead/wasted). Adding more,
closer-spaced buckets adds to the base footprint but possibly reduces
the max usage by dropping the wasted space. I believe tcmalloc uses
buckets spaced so that the average waste is only 12.5%.

-J
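
Jeff's ratios can be sanity-checked numerically. The quick standalone
program below assumes request sizes are uniformly distributed (an
assumption; real workloads are skewed toward small sizes) and rounds
each request up to a power-of-2 block. Under that model it prints
roughly 33% average over-allocation and 25% of allocated bytes wasted,
a bit under the rough upper-end figures quoted above:

    #include <stdio.h>

    /* Round size up to the next power of two (32-bit sizes suffice here). */
    static unsigned
    roundup_pow2(unsigned n)
    {
        unsigned b = 1;
        while (b < n) {
            b <<= 1;
        }
        return b;
    }

    int
    main(void)
    {
        double total_req = 0.0, total_alloc = 0.0;
        unsigned size;

        /* Uniform sizes 9..4096: every power-of-2 bucket is hit evenly. */
        for (size = 9; size <= 4096; size++) {
            total_req   += size;
            total_alloc += roundup_pow2(size);
        }
        printf("avg over-allocation: %.1f%%\n",
               100.0 * (total_alloc - total_req) / total_req);
        printf("wasted fraction of allocated: %.1f%%\n",
               100.0 * (total_alloc - total_req) / total_alloc);
        return 0;
    }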
From: Zoran V. <zv...@ar...> - 2007-01-16 16:17:17

On 16.01.2007, at 15:52, Zoran Vasiljevic wrote:

> You see, even we (i.e. Mike) noticed one glitch in the test program
> that made Zippy look ridiculous on the Mac, although it wasn't.

Hmhmhmh... I must have done something very wrong :-(

When I now repeat the tests on Mac/Zippy, even with the size limited
to 16000 bytes, it still performs miserably. For just one thread, it
gives "decent" values (although still 2.5 times slower than VT). For
two threads, it goes down to about 1/5th, and so on...

I have asked Gustaf to try to reproduce that on his Mac, as I am
slowly starting to see white mice (no, I never drink _any_
alcohol)... If Gustaf confirms my findings, then we are still back
where we were with Zippy.

And yes, I have disabled that block splitting Mike was talking about
in his email, so it is not that. And it is not the size of the
allocation (> 16384), as I fixed that as well...

Background: I wanted to update the README file with new performance
values and found out that Zippy isn't changed, although I thought it
was fixed with that size change... Hmmm...

Zoran
From: Vlad S. <vl...@cr...> - 2007-01-16 15:05:49

Yes, it is the combined version, but the Tcl version is slightly
different and Zoran took it over to maintain. In my tarball I include
both; we do experiments in different directions and then combine the
best results. Also, the intention was to try to include it in Tcl
itself.

Stephen Deasey wrote:

> On 1/16/07, Stephen Deasey <sd...@gm...> wrote:
>
>> On 1/16/07, Zoran Vasiljevic <zv...@ar...> wrote:
>>
>>> On 16.01.2007, at 12:18, Stephen Deasey wrote:
>>>
>>>> vtmalloc <-- add this
>>>
>>> It's there. Everybody can now contribute, if needed.
>>
>> Rocking.
>>
>> I suggest putting the 0.0.3 tarball up on sourceforge, announcing
>> on Freshmeat, and cross-posting on the aolserver list. You really
>> want random people with their random workloads on random OS to beat
>> on this. I don't know if the pool of people here is large enough
>> for that...
>>
>> I'm sure there's a lot of other people who would be interested in
>> this, if they knew about it. Should probably cross-post here, for
>> example:
>>
>> http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory?
>
> Vlad's already on the ball...
>
> http://freshmeat.net/projects/vtmalloc/

--
Vlad Seryakov
571 262-8608 office
vl...@cr...
http://www.crystalballinc.com/vlad/
From: Zoran V. <zv...@ar...> - 2007-01-16 14:53:06

On 16.01.2007, at 15:41, Stephen Deasey wrote:

> I suggest putting the 0.0.3 tarball up on sourceforge, announcing on
> Freshmeat, and cross-posting on the aolserver list. You really want
> random people with their random workloads on random OS to beat on
> this. I don't know if the pool of people here is large enough for
> that...
>
> I'm sure there's a lot of other people who would be interested in
> this, if they knew about it. Should probably cross-post here, for
> example:
>
> http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory?

The plan was to beat this beast first in the "family", then go to the
next village (aol-list), and then visit the next town (tcl-core list),
in that sequence.

You see, even we (i.e. Mike) noticed one glitch in the test program
that made Zippy look ridiculous on the Mac, although it wasn't. So we
now have enough experience to go visit our neighbours and see what
they'll say. On positive feedback, the next stop is the Tcl core list.
There I expect the fiercest opposition to any change (which is
understandable, given the size of the group of involved people and the
kind of the change).

Cheers
Zoran
From: Stephen D. <sd...@gm...> - 2007-01-16 14:48:19

On 1/16/07, Stephen Deasey <sd...@gm...> wrote:

> On 1/16/07, Zoran Vasiljevic <zv...@ar...> wrote:
>
>> On 16.01.2007, at 12:18, Stephen Deasey wrote:
>>
>>> vtmalloc <-- add this
>>
>> It's there. Everybody can now contribute, if needed.
>
> Rocking.
>
> I suggest putting the 0.0.3 tarball up on sourceforge, announcing on
> Freshmeat, and cross-posting on the aolserver list. You really want
> random people with their random workloads on random OS to beat on
> this. I don't know if the pool of people here is large enough for
> that...
>
> I'm sure there's a lot of other people who would be interested in
> this, if they knew about it. Should probably cross-post here, for
> example:
>
> http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory?

Vlad's already on the ball...

http://freshmeat.net/projects/vtmalloc/
From: Stephen D. <sd...@gm...> - 2007-01-16 14:42:05

On 1/16/07, Zoran Vasiljevic <zv...@ar...> wrote:

> On 16.01.2007, at 12:18, Stephen Deasey wrote:
>
>> vtmalloc <-- add this
>
> It's there. Everybody can now contribute, if needed.

Rocking.

I suggest putting the 0.0.3 tarball up on sourceforge, announcing on
Freshmeat, and cross-posting on the aolserver list. You really want
random people with their random workloads on random OS to beat on
this. I don't know if the pool of people here is large enough for
that...

I'm sure there's a lot of other people who would be interested in
this, if they knew about it. Should probably cross-post here, for
example:

http://wiki.tcl.tk/9683 - Why Do Programs Take Up So Much Memory?
From: Zoran V. <zv...@ar...> - 2007-01-16 14:04:33

On 16.01.2007, at 12:18, Stephen Deasey wrote:

> vtmalloc <-- add this

It's there. Everybody can now contribute, if needed.
From: Stephen D. <sd...@gm...> - 2007-01-16 11:18:49

On 1/16/07, Zoran Vasiljevic <zv...@ar...> wrote:

> On 16.01.2007, at 10:37, Stephen Deasey wrote:
>
>> Can you import this into CVS? Top level.
>
> You mean the tclThreadAlloc.c file on top-level of the naviserver
> project?

The whole thing: README, licence, tests etc. By top level, I just mean
not in the modules directory, because it isn't one. So, CVS:

    naviserver
    modules
    website
    vtmalloc    <-- add this

Unless you're planning to push this upstream in the next week or so.
Or you really want to host this on your own website. It's a shame to
have good work hidden in random places.
From: Gustaf N. <ne...@wu...> - 2007-01-16 11:03:35

Zoran Vasiljevic wrote:

> Guess what: it is _slower_ now than the
>
>     s = (size-1) >> 3;
>     while (s>1) { s >>= 1; bucket++; }
>
> I tend to like that one as it is really neat. It will also better
> illustrate what is being done.

This is the last one for today. It is the unrolled variant, with fewer
tests, and still human readable. It should be faster than the unrolled
while variants...

-gustaf

    {
        unsigned register int s = (size-1) >> 4;

        while (s >= 0x1000) { s >>= 12; bucket += 12; }

        if      (s >= 0x0800) { s >>= 11; bucket += 11; }
        else if (s >= 0x0400) { s >>= 10; bucket += 10; }
        else if (s >= 0x0200) { s >>=  9; bucket +=  9; }
        else if (s >= 0x0100) { s >>=  8; bucket +=  8; }
        else if (s >= 0x0080) { s >>=  7; bucket +=  7; }
        else if (s >= 0x0040) { s >>=  6; bucket +=  6; }
        else if (s >= 0x0020) { s >>=  5; bucket +=  5; }
        else if (s >= 0x0010) { s >>=  4; bucket +=  4; }
        else if (s >= 0x0008) { s >>=  3; bucket +=  3; }
        else if (s >= 0x0004) { s >>=  2; bucket +=  2; }
        else if (s >= 0x0002) { s >>=  1; bucket +=  1; }

        if (s >= 1) {
            bucket++;
        }
        if (bucket > NBUCKETS) {
            bucket = NBUCKETS;
        }
    }
From: Zoran V. <zv...@ar...> - 2007-01-16 10:40:58

On 16.01.2007, at 10:37, Stephen Deasey wrote:

> Can you import this into CVS? Top level.

You mean the tclThreadAlloc.c file on top-level of the naviserver
project?
From: Zoran V. <zv...@ar...> - 2007-01-16 10:39:37

On 16.01.2007, at 11:24, Gustaf Neumann wrote:

> if all cases are used, all but the first loops are executed mostly
> once and could be changed into ifs... i will send you a separate
> mail on such a variant, but i am currently running out of battery.

Guess what: it is _slower_ now than the

    s = (size-1) >> 3;
    while (s>1) { s >>= 1; bucket++; }

I tend to like that one as it is really neat. It will also better
illustrate what is being done.

Watch: _slower_ means about 1-2%, so I do not believe we need to
improve on that any more. The above version is, I believe, the most
"opportune", as it is readable (thus understandable) and very fast.
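
For reference, the loop Zoran favors is a clamped integer log2 of the
request size: standalone, it maps sizes to power-of-two buckets as
shown below. The 16-byte base block, zero-based bucket numbers, and
the NBUCKETS value are inferred from the thread and chosen for
illustration:

    #include <stdio.h>

    #define NBUCKETS 13   /* illustrative; the real count depends on the cutoff */

    /* The favored variant: essentially floor(log2((size-1) >> 3)), clamped. */
    static int
    SizeToBucket(unsigned size)
    {
        int bucket = 0;
        unsigned s = (size - 1) >> 3;

        while (s > 1) {
            s >>= 1;
            bucket++;
        }
        if (bucket > NBUCKETS) {
            bucket = NBUCKETS;
        }
        return bucket;
    }

    int
    main(void)
    {
        unsigned sizes[] = { 1, 16, 17, 32, 33, 1024, 8192, 16384 };
        size_t i;

        for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
            int b = SizeToBucket(sizes[i]);
            printf("size %5u -> bucket %2d (block %u bytes)\n",
                   sizes[i], b, 16u << b);
        }
        return 0;
    }

For example, sizes 1..16 land in bucket 0 (16-byte blocks), 17..32 in
bucket 1 (32-byte blocks), and so on, each bucket doubling the block
size.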
From: Gustaf N. <ne...@wu...> - 2007-01-16 10:24:47

Zoran Vasiljevic wrote:

> On 16.01.2007, at 10:46, Gustaf Neumann wrote:
>
>> This is most probably the best variant so far, and not complicated;
>> such an optimizer can do "the right thing" easily. sorry for the
>> many versions..
>> -gustaf
>>
>>     { unsigned register int s = (size-1) >> 3;
>>       while (s>1) { s >>= 1; bucket++; }
>>     }
>>
>>     if (bucket > NBUCKETS) {
>>         bucket = NBUCKETS;
>>     }
>
> You'd be surprised that this one

i am. that's the story of the unrolled loops. Btw, the version you
have listed as the fastest has wrong boundary tests (but still gives
the same result); below is a corrected version, which needs at most 2
shift operations for sizes up to one million. The nice thing about
this code (due to the staggered whiles) is that any of the while loops
(except the last) can be removed and the code still works correctly
(but needs more shift operations). That's the reason why yesterday's
version actually works. If all cases are used, all but the first loop
execute mostly once and could be changed into ifs... I will send you a
separate mail on such a variant, but I am currently running out of
battery.

    while (s >= 0x1000) { s >>= 12; bucket += 12; }
    while (s >= 0x0800) { s >>= 11; bucket += 11; }
    while (s >= 0x0400) { s >>= 10; bucket += 10; }
    while (s >= 0x0200) { s >>=  9; bucket +=  9; }
    while (s >= 0x0100) { s >>=  8; bucket +=  8; }
    while (s >= 0x0080) { s >>=  7; bucket +=  7; }
    while (s >= 0x0040) { s >>=  6; bucket +=  6; }
    while (s >= 0x0020) { s >>=  5; bucket +=  5; }
    while (s >= 0x0010) { s >>=  4; bucket +=  4; }
    while (s >= 0x0008) { s >>=  3; bucket +=  3; }
    while (s >= 0x0004) { s >>=  2; bucket +=  2; }
    while (s >= 1)      { s >>=  1; bucket++;     }

    if (bucket > NBUCKETS) {
        bucket = NBUCKETS;
    }

> Test Tcl allocator with 4 threads, 16000 records ...
> This allocator achieves 10098495 ops/sec under 4 threads
> Press return to exit (observe the current memory footprint!)
>
> whereas this one:
>
>     s = (size-1) >> 3;
>     while (s>1) { s >>= 1; bucket++; }
>
> gives:
>
> Test Tcl allocator with 4 threads, 16000 records ...
> This allocator achieves 9720847 ops/sec under 4 threads
> Press return to exit (observe the current memory footprint!)
>
> That is ((10098495-9720847)/10098495)*100 = 3% less.
>
> That is all measured on Linux. I haven't done it on the Mac and on
> the Sun yet. I now have all versions inside and will play a little
> on each platform to see which one operates best overall. The latest
> one is more appealing because of the simplicity of the code, so we
> can close an eye on that 3%, I guess.
>
> Cheers
> Zoran
From: Zoran V. <zv...@ar...> - 2007-01-16 10:18:41

On 16.01.2007, at 10:46, Gustaf Neumann wrote:

> s = (size-1) >> 3;
> while (s>1) { s >>= 1; bucket++; }

On Linux and Solaris (both x86 machines) the "long" version:

    s = (size-1) >> 4;
    while (s > 0xFF) { s = s >> 5; bucket += 5; }
    while (s > 0x0F) { s = s >> 4; bucket += 4; }
    ...

is faster than the "short" one above. On Mac OSX it is the same (no
difference). Look at Sun Solaris 10 (x86 box):

(the "short" version)

    Test Tcl allocator with 4 threads, 16000 records ...
    This allocator achieves 13753084 ops/sec under 4 threads
    Press return to exit (observe the current memory footprint!)

(the "long" version)

    -bash-3.00$ ./memtest
    Test Tcl allocator with 4 threads, 16000 records ...
    This allocator achieves 14341236 ops/sec under 4 threads
    Press return to exit (observe the current memory footprint!)

That is ((14341236-13753084)/14341236)*100 = 4%.

On Linux we had about a 3% improvement, on Sun about 4%, and on Mac
OSX none. Note: all were x86 (Intel, AMD) machines, just different OS
and GHz counts.

When we go back to the "slow" (original) version:

    Test Tcl allocator with 4 threads, 16000 records ...
    This allocator achieves 13474091 ops/sec under 4 threads
    Press return to exit (observe the current memory footprint!)

we get a ((14341236-13474091)/14341236)*100 = 6% improvement.

Cheers
Zoran
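
The memtest program itself isn't shown in the thread. As a rough idea
of what such an ops/sec harness does, here is a minimal pthreads
sketch; the thread count, size range, and iteration counts are
assumptions, and it measures whichever allocator it is linked against
(e.g. via LD_PRELOAD):

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NTHREADS 4
    #define NRECORDS 16000
    #define ROUNDS   100

    static void *
    Worker(void *arg)
    {
        void *ptrs[NRECORDS];
        unsigned seed = (unsigned)(intptr_t)arg;
        long ops = 0;
        int r, i;

        for (r = 0; r < ROUNDS; r++) {
            for (i = 0; i < NRECORDS; i++) {
                /* Mostly small requests, as the thread says is typical. */
                ptrs[i] = malloc(16 + (size_t)(rand_r(&seed) % 8000));
                ops++;
            }
            for (i = 0; i < NRECORDS; i++) {
                free(ptrs[i]);
                ops++;
            }
        }
        return (void *)(intptr_t)ops;
    }

    int
    main(void)
    {
        pthread_t tids[NTHREADS];
        struct timespec t0, t1;
        long total = 0;
        double secs;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < NTHREADS; i++) {
            pthread_create(&tids[i], NULL, Worker, (void *)(intptr_t)(i + 1));
        }
        for (i = 0; i < NTHREADS; i++) {
            void *res;
            pthread_join(tids[i], &res);
            total += (long)(intptr_t)res;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("This allocator achieves %.0f ops/sec under %d threads\n",
               total / secs, NTHREADS);
        return 0;
    }

Built with cc -pthread, this prints an ops/sec line comparable in
spirit to the outputs quoted above, though not to their absolute
numbers.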