|
From: zap f. <zap...@gm...> - 2011-09-19 13:50:24
|
I tried running the following example program across cachegrind to see metrics
that indicate cache line false-sharing. Here's the example:
#include <pearl/version.h>
#include <boost/thread.hpp>
#include <boost/static_assert.hpp>
#include <iostream>
#define MAX_TESTS 100000
#define MAX_THREADS 100
#define CACHELINE_SIZE_INTS \
(64/sizeof(int))
BOOST_STATIC_ASSERT( (sizeof(int)*CACHELINE_SIZE_INTS)==64 );
struct Work {
Work( int& d) : data( d ) {}
void operator()() {
for(int index=0; index<MAX_TESTS; ++index) {
++data;
}
}
int& data;
};
int main(int argc,const char ** argv) {
boost::thread_group threads;
if( argc>2 )
{
std::cout << "FALSE SHARING version" << std::endl;
int false_sharing[MAX_THREADS] = { 0 };
for (int i = 0; i < MAX_THREADS; ++i) {
int& var=false_sharing[i];
threads.create_thread(Work(var));
}
threads.join_all();
}
else
{
std::cout << "NO FALSE SHARING version" << std::endl;
int no_false_sharing[MAX_THREADS * CACHELINE_SIZE_INTS] = { 0 };
for (int i = 0; i < MAX_THREADS; ++i) {
int& var=no_false_sharing[i * CACHELINE_SIZE_INTS];
threads.create_thread(Work(var));
}
threads.join_all();
}
I compile/link this with:
g++ -fPIC -Wall -Wfloat-equal -Wundef -pipe -Wno-deprecated -ggdb3 -
O0 -DPROFILE -I/usr/local/include -fuse-cxa-atexit -DNDEBUG
-DBUILDINFO="\"09/19/2011 at 06:37:43 AM by zap@iulius\""
-I.. -I../include -Wno-deprecated -fno-strict-aliasing -c -o
false_sharing.o ../src/false_sharing.cpp
g++ -o false_sharing.1.0.28 -Wl,--demangle,-Map,false_sharing.1.0.28.map
false_sharing.o -lrt -lpthread -ldl -lz -lm -lboost_filesystem
-lboost_program_options -lboost_regex -lboost_signals
-lboost_serialization -lboost_thread -lboost_unit_test_framework
-lboost_system
Then run twice with:
valgrind --tool=cachegrind --branch-sim=yes --cache-sim=yes
../profile/false_sharing
and
valgrind --tool=cachegrind --branch-sim=yes --cache-sim=yes
../profile/false_sharing x
The results are what bother me: there is no difference in the stats gathered
between runs. Both show output like:
NO FALSE SHARING version
==13718==
==13718== I refs: 112,888,692
==13718== I1 misses: 3,135
==13718== L2i misses: 2,396
==13718== I1 miss rate: 0.00%
==13718== L2i miss rate: 0.00%
==13718==
==13718== D refs: 65,138,690 (53,817,934 rd + 11,320,756 wr)
==13718== D1 misses: 51,954 ( 47,947 rd + 4,007 wr)
==13718== L2d misses: 20,929 ( 17,626 rd + 3,303 wr)
==13718== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==13718== L2d miss rate: 0.0% ( 0.0% + 0.0% )
==13718==
==13718== L2 refs: 55,089 ( 51,082 rd + 4,007 wr)
==13718== L2 misses: 23,325 ( 20,022 rd + 3,303 wr)
==13718== L2 miss rate: 0.0% ( 0.0% + 0.0% )
==13718==
==13718== Branches: 11,910,054 (11,886,870 cond + 23,184 ind)
==13718== Mispredicts: 55,848 ( 53,014 cond + 2,834 ind)
==13718== Mispred rate: 0.4% ( 0.4% + 12.2% )
And the other:
FALSE SHARING version
==13823==
==13823== I refs: 112,878,279
==13823== I1 misses: 3,117
==13823== L2i misses: 2,395
==13823== I1 miss rate: 0.00%
==13823== L2i miss rate: 0.00%
==13823==
==13823== D refs: 65,131,908 (53,814,295 rd + 11,317,613 wr)
==13823== D1 misses: 51,587 ( 47,744 rd + 3,843 wr)
==13823== L2d misses: 20,896 ( 17,624 rd + 3,272 wr)
==13823== D1 miss rate: 0.0% ( 0.0% + 0.0% )
==13823== L2d miss rate: 0.0% ( 0.0% + 0.0% )
==13823==
==13823== L2 refs: 54,704 ( 50,861 rd + 3,843 wr)
==13823== L2 misses: 23,291 ( 20,019 rd + 3,272 wr)
==13823== L2 miss rate: 0.0% ( 0.0% + 0.0% )
==13823==
==13823== Branches: 11,907,051 (11,883,867 cond + 23,184 ind)
==13823== Mispredicts: 55,946 ( 53,121 cond + 2,825 ind)
==13823== Mispred rate: 0.4% ( 0.4% + 12.1% )
How can the results be virtually identical? I would have thought the "false
sharing" version would be significantly higher than the other, but they are
nearly identical.
Is there something I am doing wrong with the tool? I want to be able to
measure the false-sharing so that I can find and improve it in my target
program.
Many thanks
-z
|
|
From: Josef W. <Jos...@gm...> - 2011-09-19 16:03:13
|
On Monday 19 September 2011, zap foster wrote: > How can the results be virtually identical? I would have thought the "false > sharing" version would be significantly higher than the other, but they are > nearly identical. > > Is there something I am doing wrong with the tool? No. You did not take into account the model cachegrind simulates: It is one 2-level cache hierarchy, for all threads. Ie. all data accesses of all threads go through the same L1 data cache. Because of that, false sharing never can happen, and results are expected. Similarily, if you run your code constrained to the 2 hyper-threads of 1 core of an Intel processor, there will be no false sharing. > I want to be able to > measure the false-sharing so that I can find and improve it in my target > program. To detect that, cachegrind needs an extended model with private L1 caches per core and according coherence protocol. Further, some kind of scheduling/mapping between threads and cores of that model has to be in place (I would vote for fixed 1:1 mapping). It really would be cool to have that. However, with the current way Valgrind is implemented, you still would not see much false sharing as VG is serialising threads. Only in scheduling points, a false sharing event can happen. But we could make the time slice quantum changeable via command-line option. Josef |