|
From: Xiang Li <lx...@qt...> - 2017-06-26 17:16:24
|
Dear Valgrind users,

I am using Cachegrind to study the cache miss behavior of our program. I need to collect the last-level cache misses in bytes. According to the online manual http://valgrind.org/docs/manual/cg-manual.html (related parts are copied at the end for reference), Cachegrind outputs the number of misses.

To my understanding, the amount in bytes can be calculated as follows:

    Misses in bytes = (the number of misses) * (line size of the cache)

where the line size can be configured with an option such as --LL=<size>,<associativity>,<line size>.

Is my understanding correct? If not, could you tell me how to calculate the misses in bytes?

Copied from the manual:

Cache accesses for instruction fetches are summarised first, giving the number of fetches made (this is the number of instructions executed, which can be useful to know in its own right), the number of I1 misses, and the number of LL instruction (LLi) misses.

Cache accesses for data follow. The information is similar to that of the instruction fetches, except that the values are also shown split between reads and writes (note each row's rd and wr values add up to the row's total).

Thanks and regards,
Xiang
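The proposed formula can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the question: the helper names, the example `--LL` value, and the miss count are all made up for the example and are not taken from Cachegrind itself.

```python
def parse_ll_option(option: str) -> tuple[int, int, int]:
    """Split '--LL=<size>,<associativity>,<line size>' into three integers."""
    prefix = "--LL="
    if not option.startswith(prefix):
        raise ValueError(f"not an --LL option: {option!r}")
    fields = option[len(prefix):].split(",")
    if len(fields) != 3:
        raise ValueError(f"expected three comma-separated fields: {option!r}")
    size, associativity, line_size = (int(field) for field in fields)
    return size, associativity, line_size


def miss_traffic_bytes(miss_count: int, line_size: int) -> int:
    """Bytes moved because of misses, assuming each miss fetches one whole line."""
    return miss_count * line_size


# Hypothetical configuration and miss count, for illustration only.
_, _, line_size = parse_ll_option("--LL=8388608,16,64")
print(miss_traffic_bytes(1000, line_size))
```

The line size is read from the same string that would be passed on the command line, so the multiplication uses the value the simulation was configured with.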
|
From: John R. <jr...@bi...> - 2017-06-26 17:48:49
|
> I am using Cachegrind to study the cache miss behavior of our program. I need to collect the last level cache misses in bytes. According to the online manual http://valgrind.org/docs/manual/cg-manual.html (related parts are copied at the end for reference), Cachegrind outputs the number of misses.
> To my understanding, the amount in bytes can be calculated as follows
>
> Misses in bytes = (the number of misses) * (line size of the cache)
>
> where the line size can be configured with option such as --LL=<size>,<associativity>,<line size>. Is my understanding correct? If not, could you tell me how to calculate the misses in bytes?

That product is the total bytes of traffic that are caused by misses. But it ignores the width of the bus, which determines the duration of the transfers.

Most desktop computers have a 64-bit data bus (72 bits if ECC) to DDR3 or DDR4 SDRAM. Some embedded devices have a 32-bit bus (or even narrower). Desktop video graphic display cards usually have 32, 64, 128, 192, 256, or 384 bits [and no cache :-)]

The bus width to L1 and L2 caches can be wider. It's 128 or 256 bits on PowerPC chips, for instance. (Yes: the icache fetches 4 or 8 32-bit instructions at a time, and all can be decoded and executed in parallel except for dataflow constraints. Aligning branch destinations to 32-byte boundaries might make a big difference in execution speed.)
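To make the bus-width point concrete, here is a rough Python sketch of how many bus transfers one cache-line fill needs at a given bus width. It is a deliberate simplification: it ignores burst modes, DDR signalling, ECC bits, and latency, and the function names and numbers are illustrative assumptions only.

```python
def bus_transfers_per_line(line_size_bytes: int, bus_width_bits: int) -> int:
    """Number of bus-wide transfers needed to move one cache line."""
    line_bits = line_size_bytes * 8
    # Ceiling division: a partial final transfer still occupies the bus once.
    return -(-line_bits // bus_width_bits)


def total_bus_transfers(miss_count: int, line_size_bytes: int, bus_width_bits: int) -> int:
    """Transfers caused by misses, assuming one full line fill per miss."""
    return miss_count * bus_transfers_per_line(line_size_bytes, bus_width_bits)


# Same byte traffic (64-byte lines), different bus widths.
for width in (32, 64, 256):
    print(width, bus_transfers_per_line(64, width))
```

The byte total is identical in all three cases; only the number of transfers, and hence the time spent on the bus, changes with the width.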
|
From: Xiang Li <lx...@qt...> - 2017-06-26 18:22:07
|
Hi John,

Thanks for your prompt reply. I got your point.

BR,
Xiang

-----Original Message-----
From: John Reiser [mailto:jr...@bi...]
Sent: Monday, June 26, 2017 10:49 AM
To: val...@li...
Subject: Re: [Valgrind-users] How to calculate the amount (in bytes) of cache misses

_______________________________________________
Valgrind-users mailing list
Val...@li...
https://lists.sourceforge.net/lists/listinfo/valgrind-users
|
From: Xiang Li <lx...@qt...> - 2017-06-29 16:58:36
|
Hi John,

One thing to confirm. For LLd read misses (data only) in bytes, it is DLmr * lineSize, correct? (Assume, as a simple case, that the bus width is the same as that of LLd.)

Thanks,
Xiang
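The DLmr * lineSize calculation could be scripted along these lines. This sketch assumes the Cachegrind output file contains an `events:` line naming the counters and a `summary:` line with the matching totals; the sample text and all of its numbers are invented for illustration, and the 64-byte line size is an assumption that should match whatever `--LL` was actually used.

```python
def read_summary(cachegrind_out_text: str) -> dict[str, int]:
    """Pair each name on the 'events:' line with its total on the 'summary:' line."""
    events: list[str] = []
    totals: list[int] = []
    for line in cachegrind_out_text.splitlines():
        if line.startswith("events:"):
            events = line.split(":", 1)[1].split()
        elif line.startswith("summary:"):
            totals = [int(value) for value in line.split(":", 1)[1].split()]
    return dict(zip(events, totals))


# Invented sample data; real files contain many more lines.
SAMPLE = """events: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
summary: 1000000 500 400 300000 2000 1500 100000 800 600
"""

LINE_SIZE = 64  # assumed LL line size in bytes
summary = read_summary(SAMPLE)
print(summary["DLmr"] * LINE_SIZE)
```

Write misses (DLmw) and instruction misses (ILmr) are available from the same dictionary if a total over all LL misses is wanted instead.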