From: Vlad S. <vl...@cr...> - 2006-02-04 06:13:05
|
Here is the test: http://www.crystalballinc.com/vlad/tmp/memtest.c

It gives very strange results. It works on Linux only because it uses mmap exclusively; it looks like brk is converted to mmap internally, according to the Linux 2.6.13 kernel source, and the kernel allows a practically unlimited number of mmap-ed regions (as I understand it, up to vm.max_map_count = 65536).

According to this test, when I use random sizes from 0-128k, the Tcl allocator gives worse results than Linux malloc. On small sizes ckalloc is faster, but once over 64k, Linux malloc is faster. And my small malloc implementation, which is based on the first version of Lea's malloc, uses mmap only, and supports per-thread memory only, beats all the other mallocs, especially on bigger sizes. It does not crash, even after 5 million loops, but I am not sure why it is so simple and yet so effective.

--
Vlad Seryakov
571 262-8608 office
vl...@cr...
http://www.crystalballinc.com/vlad/
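The memtest.c source is only linked above, not reproduced. As a rough sketch of the kind of benchmark being described (16 threads, 500000 iterations each, random sizes up to 128k, timing the whole run), something like the following would do; all names and constants here are illustrative guesses, not the contents of the linked file:

    /* Rough sketch of a memtest.c-style benchmark: N threads each run a loop of
     * random-size allocate/touch/free calls and the total wall time is printed.
     * Illustrative only; this is not the linked file.
     * Build with: cc -O2 -pthread memtest-sketch.c */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define NTHREADS 16
    #define NLOOPS   500000
    #define MAXSIZE  (128 * 1024)

    static size_t next_size(unsigned int *seed)
    {
        *seed = *seed * 1103515245u + 12345u;    /* tiny LCG: no shared rand() state */
        return (size_t)(*seed % MAXSIZE) + 1;
    }

    static void *worker(void *arg)
    {
        unsigned int seed = (unsigned int)(uintptr_t)arg + 1;

        for (int i = 0; i < NLOOPS; i++) {
            char *p = malloc(next_size(&seed));  /* swap in ckalloc()/_malloc() here */
            if (p != NULL) {
                p[0] = 0;                        /* touch the block */
                free(p);                         /* and the matching free routine here */
            }
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NTHREADS];
        struct timeval t0, t1;
        long sec, usec;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < NTHREADS; i++) {
            pthread_create(&tids[i], NULL, worker, (void *)(uintptr_t)i);
        }
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tids[i], NULL);
        }
        gettimeofday(&t1, NULL);

        sec  = t1.tv_sec - t0.tv_sec;
        usec = t1.tv_usec - t0.tv_usec;
        if (usec < 0) {
            sec--;
            usec += 1000000;
        }
        printf("done: %ld seconds, %ld usec\n", sec, usec);
        return 0;
    }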
From: Stephen D. <sd...@gm...> - 2006-02-04 07:51:11
|
On 2/3/06, Vlad Seryakov <vl...@cr...> wrote:
> Here is the test: http://www.crystalballinc.com/vlad/tmp/memtest.c
> [...]

Have you taken fragmentation into account?

There are some memory-related links in this blog post I read recently:

http://primates.ximian.com/~federico/news-2005-12.html#14

Federico makes a good point: this has all been done before...
From: Andrew P. <at...@pi...> - 2006-02-04 09:36:57
|
On Sat, Feb 04, 2006 at 12:51:07AM -0700, Stephen Deasey wrote:
> There are some memory-related links in this blog post I read recently:
>
> http://primates.ximian.com/~federico/news-2005-12.html#14
>
> Federico makes a good point: this has all been done before...

Hm, he links to these two:

http://citeseer.ist.psu.edu/bonwick94slab.html
http://citeseer.ist.psu.edu/bonwick01magazines.html

The 1994 "Slab Allocator" was used in SunOS 5.4 (Solaris 2.4). The 2001 "Vmem" and "libumem" version uses a "per-processor caching scheme ... that provides linear scaling to any number of CPUs." (Nice paper!)

That seems reasonable. I figure that the CPU is what does the work, so per-thread memory caching as done in the Tcl/AOLserver "zippy" allocator is only necessary because threads are not tied to any one CPU. If they were, it would be better to have those N threads all use the same memory cache for their allocations, as obviously only one can allocate at any given time.

Interestingly, in libumem, the non-kernel version of their work, they (initially?) used per-thread rather than per-CPU caches, because the Solaris thread library didn't have the right APIs to do things per CPU. Apparently that worked fine, and it was still faster than Hoard (by a constant amount; they both scaled linearly). However, their tests seem to use only 1 thread per CPU, which isn't terribly realistic. Allocator CPU-affinity is probably much more useful when you have 100 or 1000 threads per CPU, and it might be interesting to see scalability graphs under those conditions.

Some of the benchmarks are impressive: on large multi-CPU boxes they cite a 2x throughput improvement on a SPEC web-serving benchmark just by adding their new stuff to Solaris. And it even improves single-CPU performance somewhat too.

The paper clearly says that the userspace version beats Hoard, ptmalloc (GNU libc), and mtmalloc (Solaris), which it calls "the strongest competition". As of 2001, glibc's allocator was clearly crap for anything other than single-threaded code - thus the traditional need for hacks like the zippy allocator.

One interesting bit: "We discussed the magazine layer in the context of the slab allocator, but in fact the algorithms are completely general. A magazine layer can be added to ANY memory allocator to make it scale."

The work described in that 2001 paper has all been in Solaris since version 8. Do Linux, FreeBSD, and/or GNU libc have this yet? Ah, Mena-Quintero's blog above says that "gslice" is exactly that, and it seems to be in GLib 2.10:

http://developer.gnome.org/doc/API/2.0/glib/glib-Memory-Slices.html

Shouldn't something like that be in ** GNU libc **, not just in Gnome?

Bonwick's brief concluding thoughts about how OS core services are often the most neglected are also interesting.

--
Andrew Piskorski <at...@pi...>
http://www.piskorski.com/
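The quoted remark that "a magazine layer can be added to ANY memory allocator" is easy to picture with a small sketch: a per-thread cache of freed objects sitting in front of whatever allocator is underneath. The structure and names below are illustrative only, not taken from Bonwick's libumem or from Tcl's zippy allocator (a real magazine layer also has a shared depot, multiple size classes, and CPU binding):

    /* Minimal sketch of a per-thread "magazine" cache in front of malloc().
     * One fixed size class, no depot; __thread is the GCC/Clang TLS keyword. */
    #include <stdlib.h>

    #define MAG_SIZE 64          /* objects cached per thread */
    #define OBJ_SIZE 256         /* single size class, for simplicity */

    typedef struct Magazine {
        int   rounds;            /* number of cached objects */
        void *objs[MAG_SIZE];
    } Magazine;

    static __thread Magazine mag;   /* per-thread, so no locking is needed */

    void *mag_alloc(void)
    {
        if (mag.rounds > 0) {
            return mag.objs[--mag.rounds];   /* fast path: no lock, no syscall */
        }
        return malloc(OBJ_SIZE);             /* slow path: underlying allocator */
    }

    void mag_free(void *obj)
    {
        if (mag.rounds < MAG_SIZE) {
            mag.objs[mag.rounds++] = obj;    /* keep it cached for this thread */
        } else {
            free(obj);                       /* magazine full: hand it back */
        }
    }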
From: Vlad S. <vl...@cr...> - 2006-02-04 16:08:48
|
That could be true on Solaris, but in Linux 2.6 mmap/munmap is very fast, and looking into the kernel source you can see that they convert sbrk into mmap internally; the difference is that mmap is multithread-aware while sbrk is not.

Now, using mmap to allocate a block of memory and then re-using it is what I am doing, but I do not use munmap, though that would still be possible.

With random allocations from 1-128K, Tcl alloc constantly gives the worst results, which means it is good on small allocations only?

Tcl: 8.4.12, threads 16, loops 500000
starting 16 malloc threads...waiting....done: 3 seconds, 955518 usec
starting 16 ckalloc threads...waiting....done: 4 seconds, 272964 usec
starting 16 _malloc threads...waiting....done: 1 seconds, 890566 usec

I am not trying to re-invent the wheel; it is just that I accidentally replaced sbrk with mmap and removed the mutexes around it, and it became much faster than what we have now, at least on Linux.

Stephen Deasey wrote:
> Have you taken fragmentation into account?

--
Vlad Seryakov
571 262-8608 office
vl...@cr...
http://www.crystalballinc.com/vlad/
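What Vlad describes (carving allocations out of large mmap-ed regions held per thread, so no mutex is needed) can be sketched roughly as follows. This is not his actual _malloc, which is based on Lea's allocator and reuses blocks; it is only an illustration of the mmap-instead-of-sbrk, one-arena-per-thread idea, with invented names:

    /* Sketch of a lock-free, per-thread bump allocator backed by mmap().
     * Illustration only: unlike a real Lea-style allocator there is no free
     * list, no reuse of freed blocks, and no munmap(). */
    #define _GNU_SOURCE              /* MAP_ANONYMOUS on older glibc */
    #include <stddef.h>
    #include <sys/mman.h>

    #define ARENA_SIZE (1024 * 1024) /* 1 MB regions requested from the kernel */

    typedef struct Arena {
        char  *base;
        size_t used;
    } Arena;

    static __thread Arena arena;     /* one arena per thread: no mutex needed */

    static void *map_region(size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;
    }

    void *arena_alloc(size_t size)
    {
        size = (size + 15) & ~(size_t)15;        /* 16-byte alignment */

        if (size >= ARENA_SIZE) {
            return map_region(size);             /* oversized: give it its own mapping */
        }
        if (arena.base == NULL || arena.used + size > ARENA_SIZE) {
            /* Current region exhausted (or first call): map a fresh one.
             * The kernel handles concurrent mmap() calls from many threads,
             * unlike a single shared sbrk() break pointer. */
            arena.base = map_region(ARENA_SIZE);
            if (arena.base == NULL) {
                return NULL;
            }
            arena.used = 0;
        }
        void *result = arena.base + arena.used;
        arena.used += size;
        return result;
    }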
From: Zoran V. <zv...@ar...> - 2006-02-04 16:20:54
|
Am 04.02.2006 um 17:08 schrieb Vlad Seryakov:
> That could be true on Solaris, but in Linux 2.6 mmap/munmap is very fast,
> and looking into the kernel source you can see that they convert sbrk into
> mmap internally; the difference is that mmap is multithread-aware while
> sbrk is not.

Solaris (1 CPU)
Tcl: 8.4.12, threads 16, loops 500000
starting 16 malloc threads...waiting....done: 3 seconds, 938700 usec
starting 16 ckalloc threads...waiting....done: 6 seconds, 62454 usec
starting 16 _malloc threads...waiting....done: 9 seconds, 755277 usec

Linux (1 CPU, 1.8 GHz)
Tcl: 8.4.12, threads 16, loops 500000
starting 16 malloc threads...waiting....done: 2 seconds, 298735 usec
starting 16 ckalloc threads...waiting....done: 3 seconds, 331197 usec
starting 16 _malloc threads...waiting....done: 1 seconds, 323865 usec

Mac OSX (1 CPU, 1.5 GHz)
zoran:~ zoran$ ./m2
Tcl: 8.4.12, threads 16, loops 500000
starting 16 malloc threads...waiting....done: 57 seconds, 300088 usec
starting 16 ckalloc threads...waiting....done: 195 seconds, 526369 usec
starting 16 _malloc threads...waiting....done: 13 seconds, 869307 usec

Mac OSX (2 CPU, 867 MHz)
panther:~ zoran$ ./m2
Tcl: 8.4.12, threads 16, loops 500000
starting 16 malloc threads...waiting....done: 189 seconds, 228665 usec
starting 16 ckalloc threads...waiting....done: 730 seconds, 700258 usec (!!!!!)
starting 16 _malloc threads...waiting....done: 19 seconds, 958533 usec

> Now, using mmap to allocate a block of memory and then re-using it is
> what I am doing, but I do not use munmap, though that would still be
> possible. With random allocations from 1-128K, Tcl alloc constantly gives
> the worst results, which means it is good on small allocations only?

Apparently it is everything above 16284 bytes that goes to malloc directly.

> I am not trying to re-invent the wheel; it is just that I accidentally
> replaced sbrk with mmap and removed the mutexes around it, and it became
> much faster than what we have now, at least on Linux.

The only place where it is not faster is single-CPU Solaris. I have no idea why. I can test it on a 2-CPU Solaris next week.

Anyway, from all these tests it appears that the Tcl allocator is slower than anything else, at least for the test pattern used in your test.

Cheers
Zoran
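For context on the 16284-byte remark: Tcl's threaded "zippy" allocator serves small requests from per-thread buckets and hands anything above its largest bucket straight to the system malloc, roughly as sketched below. The constant and names are a simplified illustration, not the actual Tcl source, but they show why a benchmark with random sizes up to 128k mostly exercises plain malloc plus bucket bookkeeping overhead:

    /* Simplified sketch of the size cutoff in a zippy-style bucket allocator.
     * Constants and names are illustrative, not copied from Tcl's source. */
    #include <stdlib.h>

    #define MAX_BUCKET_SIZE 16284   /* largest per-thread bucket, per Zoran's note */

    void *zippy_style_alloc(size_t size)
    {
        if (size > MAX_BUCKET_SIZE) {
            /* Big requests bypass the per-thread cache entirely, so a benchmark
             * with random sizes up to 128K spends most of its time here. */
            return malloc(size);
        }
        /* ... otherwise: round up to a power-of-two bucket and pop a block
         * from this thread's free list (not shown) ... */
        return malloc(size);        /* placeholder for the bucket fast path */
    }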
From: Gustaf N. <ne...@wu...> - 2006-02-04 21:26:13
|
Zoran Vasiljevic schrieb:
> Anyway, from all these tests it appears that the Tcl allocator
> is slower than anything else, at least for the test pattern
> used in your test.

For our POWER5 machine (the monster) with the Linux 2.6.9-22 kernel, ckalloc is not so bad, but _malloc is still significantly better.

Tcl: 8.4.11, threads 16, loops 500000
starting 16 malloc threads...waiting....done: 8 seconds, 397357 usec
starting 16 ckalloc threads...waiting....done: 8 seconds, 4346 usec
starting 16 _malloc threads...waiting....done: 6 seconds, 404069 usec

Strangely enough, with 64 threads the differences seem to disappear.

Tcl: 8.4.11, threads 64, loops 500000
starting 64 malloc threads...waiting....done: 26 seconds, 718077 usec
starting 64 ckalloc threads...waiting....done: 26 seconds, 783919 usec
starting 64 _malloc threads...waiting....done: 26 seconds, 237283 usec

These results are repeatable.

-gustaf
From: Gustaf N. <ne...@wu...> - 2006-02-18 12:17:08
|
Hi Folks,

Concerning the previous discussions, just in case you have not seen this: Gnome 2.14 will come with a GSlice-based malloc with significant improvements for multithreaded apps relative to prior versions:

http://www.gnome.org/~davyd/gnome-2-14/

-gustaf
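For anyone who wants to try it, the GSlice API in GLib is small. A minimal usage example, assuming GLib >= 2.10 and the usual pkg-config glib-2.0 compile and link flags:

    /* Minimal GSlice example; g_slice_new/g_slice_free are GLib 2.10+ API. */
    #include <glib.h>

    typedef struct {
        gint    id;
        gdouble value;
    } Sample;

    int main(void)
    {
        /* Allocate one Sample from the slice allocator's per-size cache. */
        Sample *s = g_slice_new(Sample);
        s->id = 1;
        s->value = 3.14;

        /* Must be returned with the matching slice call, not free(). */
        g_slice_free(Sample, s);

        /* Raw-size variant for buffers of arbitrary length. */
        gchar *buf = g_slice_alloc(1024);
        g_slice_free1(1024, buf);

        return 0;
    }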