From: Christopher K. <chr...@gm...> - 2015-07-29 17:58:49
To get the code running with the Intel Fortran 15.0.1 compiler on an Intel Ivy Bridge processor, it has to be compiled with the -msse4.1 option: compiling with -xAVX causes the code to exit with an illegal-instruction error. You might want to put this into the Cachegrind user manual.

I also have a question. Cachegrind gives the following line-by-line information:

 ILmr  DLmr     DLmw
  528     0  209,680  Qtens_biharmonic(:,:,k,q,ie) = elem(ie)%state%Qdp(:,:,k,q,n0_qdp)

Can I interpret this result as saying that there were NO last-level data cache read misses (DLmr), i.e. the data was already in the last-level data cache, and that all of the memory traffic was the result of the write-back (DLmw) of the data to memory? Typically a write-back operation is NOT expensive, as it can be put into the write-back buffer and performed at a later stage. Yet the code is spending a significant amount of time on this statement. Am I missing something?

Thanks
Chris

--
Christopher Kerr
Visiting Scientist
Graduate School of Oceanography
University of Rhode Island
215 South Ferry Road
Narragansett, RI 02882-1197, USA
chr...@gm...
401-874-6174 (work)
Skype-Handle: chris.kerr.usa
From: Josef W. <Jos...@gm...> - 2015-07-30 07:47:59
On 29.07.2015 at 19:58, Christopher Kerr wrote:
> To get the code running with the Intel Fortran 15.0.1 compiler on an
> Intel Ivy Bridge processor, it has to be compiled with the -msse4.1
> option: compiling with -xAVX causes the code to exit with an
> illegal-instruction error. You might want to put this into the
> Cachegrind user manual.

This may be fixed in SVN. Can you check?

> I also have a question. Cachegrind gives the following line-by-line
> information:
>
>  ILmr  DLmr     DLmw
>   528     0  209,680  Qtens_biharmonic(:,:,k,q,ie) = elem(ie)%state%Qdp(:,:,k,q,n0_qdp)
>
> Can I interpret this result as saying that there were NO last-level
> data cache read misses (DLmr), i.e. the data was already in the
> last-level data cache, and that all of the memory traffic was the
> result of the write-back (DLmw) of the data to memory?

DLmw means that a write in your program accessed a memory block which was not found in the last-level cache. Cachegrind does not count the implicit cache line read needed to be able to modify the line; otherwise DLmw > DLmr, as above, would never be possible.

Further, Cachegrind does not maintain dirty flags, so it cannot tell whether evicted lines have to be written back before other data can be loaded. This is not reflected in the reported events anyway: a miss is a miss, independent of whether an evicted line has to be written back or not. Consequently, any performance difference between write-back and write-through caches is not reflected either, as it does not matter for the event counts given by Cachegrind.

In Callgrind you can ask for dirty flags to be maintained for last-level cache lines with "--simulate-wb=yes", resulting in further events (misses with clean/dirty eviction).

> Typically a write-back operation is NOT expensive, as it can be put
> into the write-back buffer and performed at a later stage.

This is only true if the write-back is hidden by other computation. Limited write bandwidth can slow down code as much as limited read bandwidth.
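As a sketch, a Callgrind run with write-back simulation enabled might look like the following. The binary name "./your_program" is a placeholder, not anything from the original code; the flags are the Callgrind options discussed above:

```shell
# Run Callgrind with cache simulation and dirty-flag tracking of
# last-level cache lines ("./your_program" is a placeholder binary).
valgrind --tool=callgrind --cache-sim=yes --simulate-wb=yes ./your_program

# Then annotate the collected counts per source line:
callgrind_annotate callgrind.out.<pid>
```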
> Yet the code is spending a significant amount of time on this
> statement.

It can be both access latency (from the cache line reads before modification, which are not shown in Cachegrind results, as written above) and limited read/write bandwidth. You could check whether the hardware prefetcher covers the misses (again Callgrind only: "--simulate-hwpref=yes"; the misses go away if stream prefetching covers them). And of course, try to reduce the miss counts by improving data reuse.

Josef