From: Josef W. <Jos...@gm...> - 2003-05-08 08:34:46
On Wednesday 07 May 2003 23:02, Nicholas Nethercote wrote:
> On Wed, 7 May 2003, Josef Weidendorfer wrote:
> > > Is a round-robin mapping of threads to processors accurate, i.e.
> > > representative of what would really happen?
> >
> > I think the regular case is e.g. to run a multithreaded application with
> > 4 threads on a 4-processor machine, and round-robin mapping is accurate
> > in an "n threads/n processors" scenario.
>
> How common is that scenario?

I thought it was common for number-crunching apps. But I just remembered that this sometimes isn't true: with parallelizing compilers (e.g. Intel with OpenMP) you often have an additional thread doing almost nothing. And multithreaded web servers and GUI apps don't work this way. So some kind of scheduling seems to be needed :-(

For the load of a processor I could use the instruction fetch counter of the last 10 time slices on this processor. On a thread switch (tid1 -> tid2), I check whether tid2 needs to be migrated to a processor with less load. I think this needs some experimentation to avoid unnecessary migration. As a first guess, I would say: if the idle time of a processor (in the last 10 time slices) is at least twice the load of tid2 (in the last 10 time slices), migrate tid2 to this processor.

> > The typical use case here is to check if there's e.g. cache thrashing
> > (independent data regularly accessed by the two processors is located in
> > the same cache line, leading to a lot of cache invalidations/misses) or
> > a general performance slowdown because shared data is accessed often.
>
> So with the cache thrashing, Cachegrind/Calltree with this feature wouldn't
> necessarily be reporting a figure that's representative of any real-life
> configuration, but would give a general indication of how well different
> threads interact, yes? That sounds like it could be useful.

Yes, I think so. Cache thrashing (= false sharing) can always be avoided by padding.
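A minimal sketch of such padding in C, assuming a 64-byte cache line (the real size is hardware-specific) and two per-thread counters. The struct and function names are hypothetical and only serve as illustration:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache line size; 64 bytes is common but hardware-specific. */
#define CACHE_LINE 64

/* Unpadded: the two counters are adjacent, so they usually end up in
   the same cache line even though each thread touches only its own. */
struct counters_shared {
    long thread0_hits;
    long thread1_hits;
};

/* Padded: each counter is aligned to the start of its own cache line. */
struct counters_padded {
    alignas(CACHE_LINE) long thread0_hits;
    alignas(CACHE_LINE) long thread1_hits;
};

/* Returns 1 if two byte offsets fall into the same cache line,
   measured relative to a cache-line-aligned base address. */
static int same_cache_line(size_t a, size_t b)
{
    return a / CACHE_LINE == b / CACHE_LINE;
}
```

The padded variant costs memory: here it grows the struct to two full cache lines, which is why padding is normally applied only where false sharing has actually been observed.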
But you have to know that it's happening at all, so this extension would be useful for that.

As an aside, it would be interesting to see whether the cache miss numbers depend on the time slice length. Because of the sequential thread execution in Valgrind, you will never be able to see the real effect of false sharing, where threads run simultaneously on 2 processors and thus possibly increase the number of invalidations/misses by a large amount.

Josef
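
A rough sketch of the migration rule described above, in C. Everything here is an assumption for illustration: the names do not exist in Valgrind, and "idle time" is taken to mean the unused part of a fixed per-slice fetch budget (`slice_len`), which the rule itself leaves undefined.

```c
#include <assert.h>
#include <stddef.h>

#define HISTORY 10  /* number of time slices remembered, as in the text */

/* Hypothetical bookkeeping: instruction fetches per recent time slice,
   kept once per processor and once per thread. */
struct load_history {
    unsigned long fetches[HISTORY];
};

/* Load = sum of instruction fetches over the remembered slices. */
static unsigned long total_load(const struct load_history *h)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < HISTORY; i++)
        sum += h->fetches[i];
    return sum;
}

/* Idle time of a processor, assuming each slice could have held
   slice_len fetches (an assumption, not part of the original rule). */
static unsigned long idle_time(const struct load_history *cpu,
                               unsigned long slice_len)
{
    unsigned long capacity = slice_len * HISTORY;
    unsigned long used = total_load(cpu);
    return used >= capacity ? 0 : capacity - used;
}

/* On a switch to `thread`, return the processor it should run on:
   the most idle processor that has less load than the current one and
   whose idle time is at least twice the thread's load, or `current`
   if no processor qualifies. */
static size_t choose_cpu(const struct load_history *cpus, size_t ncpus,
                         size_t current, const struct load_history *thread,
                         unsigned long slice_len)
{
    unsigned long need = 2 * total_load(thread);
    unsigned long current_load = total_load(&cpus[current]);
    unsigned long best_idle = 0;
    size_t best = current;

    for (size_t i = 0; i < ncpus; i++) {
        if (i == current || total_load(&cpus[i]) >= current_load)
            continue;
        unsigned long idle = idle_time(&cpus[i], slice_len);
        if (idle >= need && idle > best_idle) {
            best = i;
            best_idle = idle;
        }
    }
    return best;
}
```

The factor of 2 and the 10-slice window are the starting values from the text and would need the experimentation mentioned there to avoid unnecessary migrations.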