From: Christoph B. <bar...@or...> - 2007-07-10 08:47:25

Hi,

I have a program that normally takes 3.5GB of memory and runs within 20 minutes. I have started the same binary with valgrind to debug a problem, and I had to abort the process after more than 10 days of runtime, while the process used 28GB of main memory.

With a degradation factor of 100, I would expect that valgrind takes about 10 hours for a whole run. With 10 days we have a factor of 720. Is there a way to speed the whole process up?

I also wonder about the memory consumption factor of 8x. This is also more than normal. The machine has 64GB of main memory, so swapping is not an explanation for the slowness.

Greetings
Christoph

From: Julian S. <js...@ac...> - 2007-07-10 09:04:19

> With a degradation factor of 100, I would expect that valgrind takes
> about 10 hours for a whole run. With 10 days we have a factor of 720.
> Is there a way to speed the whole process up?

I would say this is a bug of some kind. Normally the slowdown factor is about 80 in the very worst case. However, without a way to reproduce the problem, I don't see much hope of solving it.

What version is this with? You should try the svn trunk, as that is the most performance-robust version yet. It contains some changes that make the performance of large-memory programs more robust.

> I also wonder about the memory consumption factor of 8x. This is also
> more than normal.

Indeed.

I suppose one thing you could do is profile the running process, to see what it is doing. Either using oprofile, or very crudely by repeatedly attaching gdb to it (gdb $prefix/lib/valgrind/amd64-linux/memcheck pid) and looking at the backtrace. Crude, but it sometimes gives a clue.

J

From: Christoph B. <bar...@or...> - 2007-07-10 09:22:23

> I would say this is a bug of some kind. Normally the slowdown factor is
> about 80 in the very worst case. However, without a way to reproduce the
> problem, I don't see much hope of solving it.
>
> What version is this with? You should try the svn trunk, as that is the
> most performance-robust version yet. It contains some changes that make
> the performance of large-memory programs more robust.

I am going to check the latest svn version. Is there a branch, or should I use the trunk?

Christoph

From: Nicholas N. <nj...@cs...> - 2007-07-10 10:28:09

On Tue, 10 Jul 2007, Christoph Bartoschek wrote:

> I am going to check the latest svn version. Is there a branch, or
> should I use the trunk?

Use the trunk.

Nick

From: Christoph B. <bar...@or...> - 2007-07-10 13:47:39

Here is a sample oprofile log of the top 5 runtime and data cache miss entries during one operation that takes two hours in valgrind but only 9 seconds in reality.

CPU_CLK_UNHALTED: Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000

samples   %        image name                                       app name  symbol name
10327059  73.4198  memcheck                                         memcheck  vgPlain_arena_free
 1336777   9.5038  anon (tgid:16937 range:0x403270000-0x4054f1000)  memcheck  (no symbols)
  593485   4.2194  memcheck                                         memcheck  vgMemCheck_helperc_STOREV8
  142851   1.0156  memcheck                                         memcheck  avl_lookup
  139942   0.9949  memcheck                                         memcheck  avl_insert

DATA_CACHE_MISSES: Counted DATA_CACHE_MISSES events (Data cache misses) with a unit mask of 0x00 (No unit mask) count 10000

samples  %        image name                                       app name                     symbol name
 124218  84.4302  memcheck                                         memcheck                     vgPlain_arena_free
   4890   3.3237  anon (tgid:19179 range:0x403270000-0x4054f1000)  memcheck                     (no symbols)
   2835   1.9269  memcheck                                         memcheck                     vgModuleLocal_search_one_cfitab
   2710   1.8420  memcheck                                         memcheck                     avl_insert
   2408   1.6367  vmlinux-2.6.13-15.12-smp.gz                      vmlinux-2.6.13-15.12-smp.gz  (no symbols)
   2046   1.3907  memcheck                                         memcheck                     avl_lookup

Christoph

From: Christoph B. <bar...@or...> - 2007-07-10 14:36:36
Attachments:
m_mallocfree.c.bz2

After about an hour there seems to be only one function that runs:

75373374  93.0560  memcheck  memcheck  vgPlain_arena_free

More than 93%, and increasing, in vgPlain_arena_free. However, it seems as if this function is not the real bottleneck. I have attached the output of opannotate for the file m_mallocfree.c. It shows that findSb() is the slowest part of the program. I guess it is inlined into vgPlain_arena_free.

Greetings
Christoph

From: Christoph B. <bar...@or...> - 2007-07-10 14:44:54

> It shows that findSb() is the slowest part of the program. I guess it
> is inlined into vgPlain_arena_free.

I do not know what the purpose of the superblocks is or how they are organized, but here one has to search for the correct superblock for a piece of memory. The use of a list here seems to trigger quadratic behaviour. As lists are always evil, there might be a better data structure for this application. How about using an avl tree, or a Judy tree to find an object that holds a memory region, as described in http://judy.sourceforge.net/examples/content_addressable_memory.pdf

Greetings
Christoph

From: Julian S. <js...@ac...> - 2007-07-10 14:53:36

On Tuesday 10 July 2007 15:36, Christoph Bartoschek wrote:

> After about an hour there seems to be only one function that runs:
>
> 75373374  93.0560  memcheck  memcheck  vgPlain_arena_free
>
> More than 93% and increasing in vgPlain_arena_free.

Ah, that's interesting. We've had performance problems in m_mallocfree.c before now, so I am not (entirely) surprised to see it again.

Can you describe a bit the malloc/free behaviour of your application?

- how many blocks live (approximately)
- approximately what size
- are you allocating fast, or slow?
- approximately what lifetimes? more or less LIFO, or FIFO, or random?

The ultimate aim is to construct a simple test case which shows the same performance problem.

J

From: Christoph B. <bar...@or...> - 2007-07-10 15:29:47

On Tuesday, 10 July 2007, Julian Seward wrote:

> Ah, that's interesting. We've had performance problems in m_mallocfree.c
> before now, so I am not (entirely) surprised to see it again.
>
> Can you describe a bit the malloc/free behaviour of your application?
>
> - how many blocks live (approximately)
> - approximately what size
> - are you allocating fast, or slow?
> - approximately what lifetimes? more or less LIFO, or FIFO, or random?
>
> The ultimate aim is to construct a simple test case which shows the
> same performance problem.

It's hard to answer because the code is more than ten years old and the original developers are not here anymore. The method transforms a given vlsi wiring into internal structures. The loop looks like this:

for each net
    build a list of wires
    perform some unification methods, copying the lists and deleting intermediate lists
    insert wire segments into an avl tree
    delete all wire lists
end for

The block sizes are small, about 20-100 bytes; the list items need space for three pointers. The intermediate lists are created and deleted in a FIFO manner. The resulting avl tree grows during the whole procedure until about 3GB are used.

Greetings
Christoph

From: Nicholas N. <nj...@cs...> - 2007-07-10 22:09:47

On Tue, 10 Jul 2007, Christoph Bartoschek wrote:

>> Can you describe a bit the malloc/free behaviour of your application?
>>
>> - how many blocks live (approximately)
>> - approximately what size
>> - are you allocating fast, or slow?
>> - approximately what lifetimes? more or less LIFO, or FIFO, or random?
>
> It's hard to answer because the code is more than ten years old and
> the original developers are not here anymore.

What would be most useful is a trace of the malloc/free calls. Can you add calls to VG_(printf)() in coregrind/m_mallocfree.c, to the functions VG_(arena_malloc) and VG_(arena_free)? (Those for VG_(arena_realloc), VG_(arena_strdup) and VG_(malloc_aligned) might also be useful.) These should print out all the relevant info, e.g. the size for malloc() and the pointer for free(). If you send us a trace of this, hopefully that will let us identify the problem.

Thanks.

Nick

From: Christoph B. <bar...@or...> - 2007-07-11 10:29:39
Attachments:
mallocfree.diff.bz2

Hi,

I have fixed the problem for me right now. Maybe you want to fix it in a similar way: instead of using a list of superblocks, I have a sorted array of superblocks per arena. New superblocks are inserted by insertion sort, but most of the time I see that new superblocks have higher addresses than the previous ones.

findSB() then runs as a binary search and is no longer the bottleneck.

The implementation has the following problems:

- Superblocks still have a next pointer. Just removing it breaks an assertion.
- When the array is resized, the previous memory is lost. Maybe one could allocate enough memory for a superblock for the array and, when it is discarded, reuse it as a superblock.
- If lots of new superblocks are created and the addresses do not grow, then insertion sort is not the best way. Maybe an AVL tree of the superblocks would be nice.

The first results show that some methods have decreased their runtime from 6 hours to 7 minutes. I am still working on a testcase and am now profiling the resulting binary.

Greetings
Christoph

From: Julian S. <js...@ac...> - 2007-07-13 08:28:24

> Instead of using a list of superblocks, I have a sorted array of
> superblocks per arena. New superblocks are inserted by insertion sort,
> but most of the time I see that new superblocks have higher addresses
> than the previous ones.
>
> findSB() then runs as a binary search and is no longer the bottleneck.
>
> The implementation has the following problems:
>
> - Superblocks still have a next pointer. Just removing it breaks an
>   assertion.
> - When the array is resized, the previous memory is lost.
> - If lots of new superblocks are created and the addresses do not grow,
>   then insertion sort is not the best way.
>
> The first results show that some methods have decreased their runtime
> from 6 hours to 7 minutes.

I looked at this and it seems like a reasonable solution. The Superblocks are so much bigger than a Superblock* (at least 64k for a Superblock vs sizeof(Superblock*)) that the loss of memory at resize does not seem significant. Although perhaps fill the old area with 0xAAAAAAAA or some other bogus value, so V just segfaults if the old area is mistakenly used.

J

From: Christoph B. <bar...@or...> - 2007-07-11 12:33:12

The trace looks like this now:

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000

warning: /boot/vmlinux-2.6.13-15.12-smp.gz is not in a usable binary format.

samples  %        image name                                      app name  symbol name
4070379  30.9112  anon (tgid:6163 range:0x40327e000-0x405501000)  memcheck  (no symbols)
1062497   8.0688  memcheck                                        memcheck  vgPlain_HT_remove
 735320   5.5842  memcheck                                        memcheck  vgModuleLocal_search_one_cfitab
 685708   5.2074  memcheck                                        memcheck  vgMemCheck_helperc_STOREV64le
 580023   4.4048  memcheck                                        memcheck  vgMemCheck_helperc_LOADV64le
 546871   4.1530  memcheck                                        memcheck  avl_lookup
 520773   3.9549  memcheck                                        memcheck  set_address_range_perms

For some parts of the program, the vgModuleLocal_search_one_cfitab and avl_lookup methods are at the top. Could there also be room for improvement there?

Christoph

From: Christoph B. <bar...@or...> - 2007-07-11 15:18:39

Hi,
here is the testcase:

#include <vector>
#include <cstdlib>
#include <iostream>

static int const RUNS = 1000;
static int const MORE = 3000;
static int const ROUND = 10000;
static int const BASE = 10000000;

std::vector<void *> allocated;

static int const sizes[] = {12, 28, 48};

void allocate(int num) {
    for (int i = 0; i < num; ++i) {
        void * ptr = malloc(sizes[random() % 3]);
        allocated.push_back(ptr);
    }
}

void release(int num) {
    for (int i = 0; i < num; ++i) {
        std::size_t idx = random() % allocated.size();
        free(allocated[idx]);
        allocated[idx] = allocated.back();
        allocated.resize(allocated.size() - 1);
    }
}

int main() {
    allocate(BASE);
    for (int i = 0; i < RUNS; ++i) {
        allocate(ROUND);
        release(ROUND);
        allocate(ROUND + MORE);
        release(ROUND);
        allocate(ROUND);
        release(ROUND);
        std::cerr << "Iteration: " << i << std::endl;
    }
    for (std::size_t i = 0; i < allocated.size(); ++i)
        free(allocated[i]);
}

This program roughly models the application I am testing. Running it both with the trunk version of valgrind and with the version I sent earlier today, I see a huge improvement in runtime with the latter. However, even the patched version still seems to trigger quadratic runtime in valgrind. Here is a profile after 500 iterations:
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000

warning: /boot/vmlinux-2.6.13-15.12-smp.gz is not in a usable binary format.

samples  %        image name                                       app name  symbol name
6210651  54.6327  memcheck                                         memcheck  vgPlain_HT_remove
 614447   5.4051  anon (tgid:22205 range:0x402b98000-0x404f2f000)  memcheck  (no symbols)
 588357   5.1755  memcheck                                         memcheck  vgPlain_arena_free
 479507   4.2180  memcheck                                         memcheck  vgPlain_ssort
 358811   3.1563  memcheck                                         memcheck  set_address_range_perms
More than 50% of the time is spent in vgPlain_HT_remove. I guess there is a
problem with the hash function.
Greetings
Christoph
From: Julian S. <js...@ac...> - 2007-07-11 23:47:05

> For some parts of the program, the vgModuleLocal_search_one_cfitab and
> avl_lookup methods are at the top. Could there also be room for
> improvement there?

Yes ... maybe. search_one_cfitab already got some tuning (see its caller), but maybe there is more that can be done. For avl_lookup, I have an as-yet uncommitted, highly tuned version which removes most of the branch mispredictions involved in searching the trees. It requires some structural changes to the tree's definition, so it is not entirely straightforward. I hope to commit it in the next month or so.

J