## math-atlas-devel

 [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: José Luis García Pallero - 2013-09-24 16:32:58
```
Hi all,

I have access to an AMD Piledriver 8320 eight-core processor. Before performing some tests with BLAS, I would like to compute the theoretical floating-point (double) peak. The problem is that I'm a bit confused about the concept of core (or thread) in AMD chips. Surfing the web one can find information about it, but it is not clear. Some sites say that not all the cores are real cores, i.e., that an eight-core processor is actually a quad-core. The chip is composed of 4 modules, each one having two cores, but these are cores for integer operations only; there is only one FPU per module, so the real number of FPU-capable cores is 4. That is how I have understood the problem. Am I right?

So, in the case of ATLAS compilation, should I configure the library for 8 threads or only for 4? I've seen that Piledriver is supported since ATLAS 3.11.3. Version 3.11.8 reaches about 78/83% of peak, as stated in the ChangeLog. How was this value computed? And in order to compute the theoretical peak value, how many FLOPS/cycle can a Piledriver chip perform? 4 or 8?

Cheers
--
*****************************************
José Luis García Pallero
jgpallero@...
(o<
/ / \
V_/_
Use Debian GNU/Linux and enjoy!
*****************************************
```
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: James Cloos - 2013-09-26 19:08:21
```
>>>>> "JLGP" == José Luis García Pallero writes:

JLGP> The chip is composed of 4 modules, each one having two cores, but
JLGP> these are cores for integer operations only; there is only one FPU
JLGP> per module, so the real number of FPU-capable cores is 4. That is
JLGP> how I have understood the problem. Am I right?

Each module has two 128-bit floating point units. If all of the code uses only the xmm vector registers, each core will use its own float unit. OTOH, whenever the ymm vector registers are used, the code will be split between the two float units. At least according to what AMD has written.

For comparison, and based on empirical evidence (including some posted here), it seems that Intel puts two 128-bit float units per core, but powers the second down until it looks like enough 256-bit vector ops are running to make it worth powering up and using the second unit.
```
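A minimal sketch of the per-module arithmetic James describes, assuming FMA-capable 128-bit pipes and that FMA counts as two flops per 64-bit lane (an illustration, not part of the thread):

```python
def dp_flops_per_cycle(vector_bits, fma=True):
    """Double-precision flops per cycle for one vector issue."""
    lanes = vector_bits // 64            # 64-bit doubles per vector register
    return lanes * (2 if fma else 1)     # FMA = multiply + add = 2 flops

# xmm code: each core drives its own 128-bit pipe independently.
print(dp_flops_per_cycle(128))   # 4 flops/cycle per core
# ymm code: one 256-bit op is split across both pipes of the module,
# so the module as a whole still tops out at 8 flops/cycle.
print(dp_flops_per_cycle(256))   # 8 flops/cycle per module
```

Under this accounting, two xmm cores per module together reach the same 8 flops/cycle as one ymm stream, which is why the module, not the "core", bounds FP throughput.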
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: R. Clint Whaley - 2013-09-27 15:23:15
```
Unfortunately, I have not investigated this fully. It is definitely true that on what they call an "8-core" there are only 4 FPUs, and so the best scaling you will get is 4 for FPU-based stuff like ATLAS.

On Bulldozer, I investigated, and the performance went down if you tried to use all their cores, as seen here:
http://math-atlas.sourceforge.net/atlas_install/node21.html
This despite the fact that that machine got better performance using SSE than AVX.

On the Piledriver, I found that once you did some strange instruction selection, AVX was much faster than SSE, which I would say makes it much more likely that using all the modules will be a loss. However, I have not found the time to actually study the parallel performance to make sure, and to find out what funky core ID scheme is needed to maximize performance.

The easiest test is to take an assembly routine that artificially gets AVX peak by doing useless operations directly on registers (no cache access at all), and spawn it to pairs of threads; when performance is cut in half, you have discovered the tids that share an FPU.

Once I did this, I used the configure interface to build ATLAS in two ways: once using only the unique FPUs, and once with 8 cores, and what I found was that the 4-core approach was decidedly better on Dozer. I have not yet done this test on Driver (I've been concentrating on serial GEMM due to the rewrite, knowing that I have to research & possibly rewrite threading thereafter).

Cheers,
Clint
--
**********************************************************************
** R. Clint Whaley, PhD * Assoc Prof, LSU * http://www.csc.lsu.edu/~whaley **
**********************************************************************
```
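The pairing test Clint describes can be sketched at a high level. This is a hedged illustration, not his partner_pt.c: it pins two processes to given CPU ids via the Linux-only os.sched_setaffinity and times a floating-point loop. Pure Python runs far below machine peak (his real test uses a register-only assembly kernel), but the halving effect on a shared FPU shows up the same way.

```python
import os
import time
from multiprocessing import Process, Queue

def fp_peak_worker(cpu, q, n=2_000_000):
    """Pin this process to `cpu` and time a register-resident FP loop."""
    try:
        os.sched_setaffinity(0, {cpu})       # 0 = the calling process
    except (AttributeError, OSError):
        pass                                 # non-Linux or restricted env
    x = 1.0
    t0 = time.perf_counter()
    for _ in range(n):
        x = x * 1.0000001 + 1e-12            # one multiply + one add
    dt = time.perf_counter() - t0
    q.put((cpu, 2 * n / dt))                 # rough flops/second

def paired_rate(cpu_a, cpu_b):
    """Run the loop on two CPUs at once.  If the two ids share an FPU
    (Piledriver module partners), each rate drops toward half of a solo
    run; independent ids keep roughly their solo rate."""
    q = Queue()
    procs = [Process(target=fp_peak_worker, args=(c, q)) for c in (cpu_a, cpu_b)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return dict(q.get() for _ in procs)

if __name__ == "__main__":
    # Hypothetical use on an FX-8320: compare suspected partners (0,1)
    # against suspected independents (0,2).
    print(paired_rate(0, 1))
    print(paired_rate(0, 2))
```

Sweeping all id pairs and recording which ones halve each other's rate recovers the FPU-sharing map Clint refers to.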
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: José Luis García Pallero - 2013-09-27 15:58:33
```
2013/9/27 R. Clint Whaley :
> The easiest test is to take an assembly routine that artificially gets
> AVX peak by doing useless operations directly on registers (no cache
> access at all), and spawn it to pairs of threads; when performance is
> cut in half, you have discovered the tids that share an FPU.

So I suppose you have used this technique to obtain the peak performance, and can therefore conclude that the 3.11.8 version gets 78/83% of peak, as the ChangeLog says. But assuming only 4 FPUs (in the case of 8 cores), could the Rpeak not be computed theoretically? As James Cloos said in this thread:

"Each module has two 128-bit floating point units. If all of the code uses only the xmm vector registers, each core will use its own float unit. OTOH, whenever the ymm vector registers are used, the code will be split between the two float units. At least according to what AMD has written."

So whether the xmm or the ymm registers are used, the Rpeak for each FPU could be computed as (in my case):

3.5 GHz * 4 = 14 GFLOPS in double precision

Cheers
--
José Luis García Pallero
```
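José's figure can be checked with simple peak arithmetic. An editorial sketch: the 3.5 GHz clock and the FMA accounting (two 128-bit FMA pipes per module, two doubles per pipe, FMA = 2 flops) are assumptions taken from the thread, not measured values.

```python
def rpeak_gflops(ghz, fpus, dp_flops_per_cycle):
    """Theoretical double-precision peak: clock * FP units * flops/cycle."""
    return ghz * fpus * dp_flops_per_cycle

# José's per-FPU figure: 4 DP flops/cycle at 3.5 GHz
print(rpeak_gflops(3.5, 1, 4))    # 14.0 GFLOPS per FPU
# Whole chip, 4 modules, counting FMA on both 128-bit pipes
# (2 pipes * 2 doubles * 2 flops = 8 flops/cycle/module):
print(rpeak_gflops(3.5, 4, 8))    # 112.0 GFLOPS
```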
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: R. Clint Whaley - 2013-10-01 15:03:04
```
Guys,

OK, I found the code I used to determine which cores share an FPU. It consists of a driver routine (partner_pt.c) that creates threads on two partner IDs specified on the command line, and an assembly file written to run near peak (peak*.S). When the partners share an FPU, you should see the peak speed cut in half.

I haven't written a peak assembly routine for Piledriver yet; an easy fix would probably be to switch peak10_fm4_amd64.S to using ymm regs rather than xmm, and switch a few SSE insts to their AVX equivalents. It's possible that SSE might show which threads are partners, but I would not count on it.

What you would want to do is find out which IDs share an FPU, do an install using only the unique tids via the -t configure stuff, and compare your parallel performance to one where you use all 8.

Cheers,
Clint
```
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: R. Clint Whaley - 2013-10-01 15:03:03 Attachments: part.tar.bz2
```
Ooops, here's the attachment I meant to put in the last message.

Cheers,
Clint
```
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: José Luis García Pallero - 2013-10-02 09:38:31
```
2013/10/1 R. Clint Whaley :
> Ooops, here's the attachment I meant to put in the last message.

Hello, and thank you for your code.

Surfing the net I've found an easier way to detect which threads are independent. It consists of inspecting the thread_siblings_list files located in /sys/devices/system/cpu/cpuXXX/topology:
http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaat%2Fliaattunpinthreads.htm
I suppose this method only works for the *NIX family of operating systems. I've tested your code (using the fma4 source on the Piledriver I have access to) and it shows that ids 0, 2, 4 and 6 are independent, which is the same result as obtained by inspecting the thread_siblings_list files.

Best regards
```
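The thread_siblings_list entries José points to can be parsed mechanically. A sketch assuming the standard Linux cpu-list syntax ('0-1', '0,1', or a single id); on a real machine the strings would be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list rather than hard-coded.

```python
def parse_siblings(entry):
    """Parse one thread_siblings_list entry, e.g. '0-1', '0,1' or '3',
    into the set of cpu ids it names (Linux cpu-list syntax)."""
    cpus = set()
    for part in entry.strip().split(','):
        if '-' in part:
            lo, hi = map(int, part.split('-'))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return cpus

def unique_cores(entries):
    """Keep one representative cpu id per sibling group, in order."""
    seen, reps = set(), []
    for group in map(parse_siblings, entries):
        if not (group & seen):
            reps.append(min(group))
        seen |= group
    return reps

# On José's FX-8320 each module's two threads are listed as siblings,
# so cpu0..cpu7 report '0-1', '0-1', '2-3', '2-3', ...:
print(unique_cores(['0-1', '0-1', '2-3', '2-3',
                    '4-5', '4-5', '6-7', '6-7']))   # [0, 2, 4, 6]
```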
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: José Luis García Pallero - 2013-10-02 09:49:04
```
Hi again,

Related to this topic of selecting the usable thread ids, I'm trying to compile ATLAS this way. I've tried to use in configure (for the 3.11.13 version) the flag --force-tids="0 2 4 6", but configure breaks due to an error. Apparently, another alternative is to combine the flags -t <# threads> and -tl <#>. I'm a bit confused about the syntax of -tl because of the <#> marks. Does it mean that first I must repeat the number of threads (already passed through -t) and then the id list -> -tl 4 0 2 4 6? Or is only the list mandatory -> -tl 0 2 4 6? Apparently, it works as -tl '0 2 4 6'.

Best regards
```
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: José Luis García Pallero - 2013-10-02 11:12:32
```
Hello:

I've made two different installations on the Piledriver machine I have access to, and here are my results. There is a problem with scalability, but I don't know the reason.

In the first compilation I used this configure command:

../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL -Ss pmake '\$(MAKE) -j 4'

I limited the number of threads to four because there are only 4 FPUs in spite of the announced eight cores. I was the only user on the machine, so -DWALL should report good and real times. All the process was OK, and make check, make ptcheck and make time showed good results. Then I tested the library in order to see the performance. With the single-threaded library, the reported performance for DGEMM was around 25 GFLOPS. For 4 threads one cannot expect a performance of 25*4 = 100 GFLOPS, but I think we can expect about 25*[3.3/3.5] ~ 80/85 GFLOPS. However, for 4 threads only about 47 GFLOPS was reached, which means a speedup of around 2 for 4 cores. I think this is very strange. Thinking about the problem, the reason could be that with 4 threads the library was using only two FPUs, so the speedup is around 2. I don't know if that is possible, but it was all I could think of.

So I tried to compile the library again, indicating in configure the thread ids to use in order to select the 4 real FPUs. I detected the 4 independent thread ids (using the program Clint uploaded, which gives the same results as the thread_siblings_list method from http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaat%2Fliaattunpinthreads.htm). My first configure command was:

../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL -Ss pmake '\$(MAKE) -j 4' --force-tids="0 2 4 6"

using the flag --force-tids as atlas_install.pdf says. But the configure did not terminate, so I think --force-tids is broken in 3.11.13. So I tried to use the -tl flag as:

../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL -Ss pmake '\$(MAKE) -j 4' -tl '0 2 4 6'

Now the configure works. As I said in my previous mail, I don't know exactly whether this is the correct syntax for -tl, but the compilation ended apparently without errors.

Then I ran the checks. make check worked fine, but I have a problem with make ptcheck. I don't remember the original error message from make ptcheck, but if I try to execute it again, I obtain:

make[5]: Entering directory `/home/jgpallero/ATLAS/catlas/src/lapack'
ar r ATL_itlaenv.o ATL_sgelq2.o ATL_sgeql2.o ATL_sgeqr2.o ATL_sgerq2.o ATL_sgetf2.o ATL_sgetrfR.o ATL_sgetri.o ATL_sgetriC.o ATL_sgetriR.o ATL_sgetrs.o ATL_slamch.o ATL_slapy2.o ATL_slarf.o ATL_slarfb.o ATL_slarfg.o ATL_slarft.o ATL_slascl.o ATL_slaswp.o ATL_slauum.o ATL_slauumCL.o ATL_slauumCU.o ATL_slauumRL.o ATL_slauumRU.o ATL_sormlq.o ATL_sormql.o ATL_sormqr.o ATL_sormrq.o ATL_spotrf.o ATL_spotrfL.o ATL_spotrfU.o ATL_spotrs.o ATL_stgelqf.o ATL_stgelqr.o ATL_stgels.o ATL_stgeqlf.o ATL_stgeqlr.o ATL_stgeqrf.o ATL_stgeqrr.o ATL_stgerqf.o ATL_stgerqr.o ATL_stgetrf.o ATL_stgetrfC.o ATL_strtri.o ATL_strtriCL.o ATL_strtriCU.o ATL_strtriRL.o ATL_strtriRU.o ATL_strtrs.o
ar: ATL_itlaenv.o: File format not recognized
make[5]: *** [stlib.grd] Error 1
make[5]: Leaving directory `/home/jgpallero/ATLAS/catlas/src/lapack'
make[4]: *** [stlib] Error 2
make[4]: Leaving directory `/home/jgpallero/ATLAS/catlas/src/lapack'
make[3]: *** [stlapack] Error 2
make[3]: Leaving directory `/home/jgpallero/ATLAS/catlas/bin'
make[2]: *** [ptsanity_test] Error 2
make[2]: Leaving directory `/home/jgpallero/ATLAS/catlas/bin'
make[1]: *** [ptsanity_test] Error 2
make[1]: Leaving directory `/home/jgpallero/ATLAS/catlas'
make: *** [pttest] Error 2

ATL_itlaenv.o: File format not recognized???

make time also works fine. Finally, I tried to compile a program using the threaded ATLAS and obtained this error:

../lib/libatlas.a(ATL_FreeGlobalAtomicCount.o): In function `ATL_FreeGlobalAtomicCount':
ATL_FreeGlobalAtomicCount.c:(.text+0x29): undefined reference to `ATL_FreeAtomicCount'

I suppose I have used the -tl flag wrongly, but configure says nothing about it. Has anyone any idea?
Best regards

>
> Best regards
>
>>
>> Best regards
>>
>>>
>>> Cheers,
>>> Clint
>>>
>>> On 09/27/2013 10:22 AM, R. Clint Whaley wrote:
>>>>
>>>> Unfortunately, I have not investigated this fully. It is definitely
>>>> true that on what they call a "8-core" there are only 4 FPUs, and so the
>>>> best scaling you will get is 4 for FPU-based stuff like ATLAS.
>>>>
>>>> On Bulldozer, I investigated, and the performance went down if you tried
>>>> to use all their cores, as seen here:
>>>> http://math-atlas.sourceforge.net/atlas_install/node21.html
>>>> This despite the fact that that machine got better performance using SSE
>>>> than AVX.
>>>>
>>>> On the piledriver, I found that once you did some strange instruction
>>>> selection, AVX was much faster than SSE, which I would say makes it much
>>>> more likely that using all the modules will be a loss. However, I have
>>>> not found the time to actually study the parallel performance to make
>>>> sure, and to find out what funky core ID scheme is needed to maximize
>>>> performance.
>>>>
>>>> The easiest test is to take an assembly routine that artificially gets
>>>> AVX peak by doing useless operations directly on registers (no cache
>>>> access at all), and spawn it to pairs of threads; when performance is
>>>> cut in half, you have discovered the tids that share an FPU.
>>>>
>>>> Once I did this, I used the configure interface to build ATLAS in two
>>>> ways: once with using only the unique FPUs, and once with 8 cores, and
>>>> what I found was that the 4-core approach was decidedly better on Dozer.
>>>> I have not yet done this test on Driver (I've been concentrating on
>>>> serial GEMM due to the rewrite, knowing that I have to research &
>>>> possibly rewrite threading thereafter).
>>>>
>>>> Cheers,
>>>> Clint

--
*****************************************
José Luis García Pallero
jgpallero@...
 (o<
 / / \
 V_/_
Use Debian GNU/Linux and enjoy!
*****************************************
```
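The "theoretical peak" question from the original message can be sketched numerically. This is a rough sketch, not from the thread: the figures of 8 double-precision flops/cycle per module (two 128-bit FMAC pipes, 2 doubles each, 2 flops per FMA) and a 3.5 GHz base clock for the FX-8320 are assumptions about the hardware, so treat the result as an estimate.

```python
# Hedged sketch: theoretical double-precision peak for an AMD FX-8320
# (Piledriver). ASSUMPTIONS (not confirmed in the thread): 4 modules,
# each with one shared FPU doing 8 DP flops/cycle via FMA, 3.5 GHz base.

def theoretical_peak_gflops(fpus, flops_per_cycle, ghz):
    """Peak in GFLOPS = number of FPUs * flops/cycle * clock in GHz."""
    return fpus * flops_per_cycle * ghz

peak = theoretical_peak_gflops(fpus=4, flops_per_cycle=8, ghz=3.5)
print(peak)         # 112.0 GFLOPS for the whole chip under these assumptions
print(75.6 / peak)  # fraction of peak for the 75.6 GFLOPS DGEMM result, 0.675
```

Under these assumptions the 75.6 GFLOPS measured later in the thread would be about 67% of chip peak; the 78/83% figures in the ChangeLog likely use different clock or per-cycle assumptions.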
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: R. Clint Whaley - 2013-10-02 14:37:27 ```You have the syntax wrong: http://math-atlas.sourceforge.net/atlas_install/node21.html On 10/02/2013 06:12 AM, José Luis García Pallero wrote: > 2013/10/2 José Luis García Pallero : >> 2013/10/2 José Luis García Pallero : >>> 2013/10/1 R. Clint Whaley : >>>> Ooops, here's the attachment I meant to put in last message. >>> >>> Hello, and thank you for your code, >>> >>> Surfing the net I've found an easier way in order to detect which >>> threads are independent. The way to do it consists on inspecting the >>> files thread_siblings_list, located in >>> /sys/devices/system/cpu/cpuXXX/topology: >>> http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaat%2Fliaattunpinthreads.htm >>> I suppose this method only works for *NIX family of operating systems. >>> I've tested your code (using the fma4 source on the piledriver I have >>> access) and it shows that id 0, 2, 4 and 6 are independed, which is >>> the same result as obtained inspecting the thread_siblings_list list. >> >> Hi again, >> >> related with this topic of selecting the usable thread ids, I'm trying >> to compile ATLAS in this way. I've tried to use in the configure (for >> 3.11.13 version) the flag --force-tids="0 2 4 6" but the configure >> breaks due to an error. Apparently, another alternative is to combine >> the flags -t <# threads> and -tl <#> . I'm a bit confused about >> the syntax of -tl due to the <#> marks. Means that first I must >> repeat the number of threads (it was passed yet through -t) and then >> the id list -> -tl 4 0 2 4 6? Or is mandatory only the list -> -tl 0 2 >> 4 6? Apparently, it works as -tl '0 2 4 6' > > Hello: > > I've made two different installation in the piledriver machine I have > access and here are my results. There is a problem on scalability, but > I don't know the reason. 
>
> My first configure order was:
>
> ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL
> -Ss pmake '\$(MAKE) -j 4' --force-tids="0 2 4 6"
>
> using the flag --force-tids as the atlas_install.pdf says. But the
> configure did not terminate, so I think the --force-tids is broken in
> 3.11.13.

--
**********************************************************************
** R. Clint Whaley, PhD * Assoc Prof, LSU * http://www.csc.lsu.edu/~whaley **
**********************************************************************
```
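The thread_siblings_list method mentioned in the quoted discussion can be automated. A minimal sketch, assuming the usual Linux sysfs layout (`/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`, with contents like `0,1` or `0-1` depending on kernel version); the helper names below are hypothetical, not from ATLAS:

```python
# Hedged sketch: pick one logical CPU per shared compute unit from
# thread_siblings_list files (the method from the IBM page linked in
# the thread). Both the "0,1" and "0-1" formats are handled.
import glob

def parse_siblings(text):
    """Expand a siblings list like '0,1' or '0-1' into a sorted tuple."""
    ids = []
    for part in text.strip().split(','):
        if '-' in part:
            lo, hi = map(int, part.split('-'))
            ids.extend(range(lo, hi + 1))
        else:
            ids.append(int(part))
    return tuple(sorted(ids))

def independent_tids(sibling_lists):
    """Keep the lowest id from each distinct sibling group."""
    groups = {parse_siblings(s) for s in sibling_lists}
    return sorted(g[0] for g in groups)

def read_sibling_lists(base='/sys/devices/system/cpu'):
    """Read every cpuN's sibling list from sysfs (Linux only)."""
    paths = glob.glob(base + '/cpu[0-9]*/topology/thread_siblings_list')
    return [open(p).read() for p in paths]

# On the FX-8320 discussed here the files pair up 0/1, 2/3, 4/5, 6/7,
# so this prints [0, 2, 4, 6] -- the list to pass to --force-tids:
print(independent_tids(['0,1', '0,1', '2,3', '2,3',
                        '4,5', '4,5', '6,7', '6,7']))
```

On a live machine one would call `independent_tids(read_sibling_lists())` instead of passing the literal strings.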
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: José Luis García Pallero - 2013-10-02 16:16:55 ```2013/10/2 R. Clint Whaley :
> You have the syntax wrong:
> http://math-atlas.sourceforge.net/atlas_install/node21.html

Sorry, you are right: the first number is the number of threads, so I
must use --force-tids="4 0 2 4 6". But isn't this redundant? When
someone uses --force-tids it is assumed that they also use the -t
flag, and the argument of -t is also the number of threads...

I have now compiled two versions of ATLAS. The first one was
configured as:

../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL
-Ss pmake '\$(MAKE) -j 4' --force-tids="4 0 2 4 6"

and the other one as:

../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL
-Ss pmake '\$(MAKE) -j 4'

Here are the results (GFLOPS/s) for DGEMM with M=N=K=5000 on the
Piledriver AMD FX(tm)-8320 Eight-Core Processor I have access to:

1 thread:  23.1 (both compilations)
4 threads: 75.6 (with --force-tids)
4 threads: 43.3 (without --force-tids)

Using htop I have seen that with --force-tids the program uses the IDs
I indicated, but without --force-tids it always uses threads 0 to 3,
so I suppose it uses only two real cores, with two threads sharing
each FPU.

Has anyone had a similar problem with Piledriver? Or with Bulldozer?
The machine I use runs Ubuntu 12.04 and GCC 4.7.3.

Best regards

PS: I could upload the libraries resulting from both compilations, if
anyone wants to run checks.

--
*****************************************
José Luis García Pallero
jgpallero@...
 (o<
 / / \
 V_/_
Use Debian GNU/Linux and enjoy!
*****************************************
```
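For reference, GFLOPS figures like the 23.1/75.6/43.3 above are conventionally derived from the DGEMM operation count: a square N x N x N matrix multiply performs about 2*N^3 floating-point operations. A minimal sketch (the function name is illustrative, not from the ATLAS testers):

```python
# Hedged sketch of how a DGEMM GFLOPS figure is conventionally computed.
# A square N x N x N double-precision matrix multiply performs about
# 2*N^3 flops (one multiply and one add per inner-product term), so
# the rate is 2*N^3 / elapsed_seconds / 1e9.

def dgemm_gflops(n, seconds):
    """GFLOPS for a square DGEMM of dimension n timed at `seconds`."""
    return 2.0 * n**3 / seconds / 1e9

# The 75.6 GFLOPS reported for N=5000 corresponds to ~3.3 s of wall time:
print(round(2.0 * 5000**3 / 75.6e9, 2))  # 3.31
```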
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: R. Clint Whaley - 2013-10-02 16:46:09 ```Notice the page I pointed you at has no mention of using -t, which is why --force-tids takes the number argument. The -t alone is merely used to restrict the total number of threads, not to assign them to particular tids. The other thing you want to build is a normal install that uses all 8 threads, and compare it to the numbers of the --force-tids. I would compare not only GEMM (xdl3blastst_pt/xdmmtst_pt), but also solvers (xdslvtst_pt). For the solvers, you may want to try a range of problem sizes, as they can have very different parallel usage at various scales. Cheers, Clint On 10/02/2013 11:16 AM, José Luis García Pallero wrote: > 2013/10/2 R. Clint Whaley : >> You have the syntax wrong: >> http://math-atlas.sourceforge.net/atlas_install/node21.html > > Sorry, you are right, the first number is the number of threads. So I > must use --force-tids="4 0 2 4 6". But, is this a redundance? When > someone uses --force-tids is supposed that also uses the -t flag, and > the argument of -t is also the number of threads... > > I've compiled yet two versions of ATLAS. The first one with the configure as: > > ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL > -Ss pmake '\$(MAKE) -j 4' --force-tids="4 0 2 4 6" > > and the other one with: > > ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL > -Ss pmake '\$(MAKE) -j 4' > > Here are the results (GFLOPS/s) for DGEMM M=N=K=5000 on the Piledriver > AMD FX(tm)-8320 Eight-Core Processor I have access to: > > 1 thread: 23.1 (both compilations) > 4 thread: 75.6 (with --force-tids): > 4 thread: 43.3 (no --force-tids) > > Using htop I've seen that with --force-tids the program uses the ids > I've indicated, but the program without -force-tids uses always the > threads 0 to 3, so I suppose that it uses only two real cores and > threads share the FPU. 
> > Has anyone had a similar problem with Piledriver? And with Bulldozer? > The machine I use runs Ubuntu 12.04 and GCC 4.7.3 > > Best regards > > PS: I could upload the libraries resulting from both compilations, if > anyone wants to make checks > >> >> On 10/02/2013 06:12 AM, José Luis García Pallero wrote: >>> 2013/10/2 José Luis García Pallero : >>>> 2013/10/2 José Luis García Pallero : >>>>> 2013/10/1 R. Clint Whaley : >>>>>> Ooops, here's the attachment I meant to put in last message. >>>>> >>>>> Hello, and thank you for your code, >>>>> >>>>> Surfing the net I've found an easier way to detect which >>>>> threads are independent. It consists of inspecting the >>>>> files thread_siblings_list, located in >>>>> /sys/devices/system/cpu/cpuXXX/topology: >>>>> http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaat%2Fliaattunpinthreads.htm >>>>> I suppose this method only works for the *NIX family of operating systems. >>>>> I've tested your code (using the fma4 source on the piledriver I have >>>>> access to) and it shows that ids 0, 2, 4 and 6 are independent, which is >>>>> the same result as obtained by inspecting the thread_siblings_list files. >>>> >>>> Hi again, >>>> >>>> Related to this topic of selecting the usable thread ids, I'm trying >>>> to compile ATLAS this way. In the configure (for >>>> version 3.11.13) I've tried the flag --force-tids="0 2 4 6" but the configure >>>> breaks due to an error. Apparently, another alternative is to combine >>>> the flags -t <# threads> and -tl <#> . I'm a bit confused about >>>> the syntax of -tl due to the <#> marks. Does it mean that first I must >>>> repeat the number of threads (already passed through -t) and then >>>> the id list -> -tl 4 0 2 4 6? Or is only the list mandatory -> -tl 0 2 >>>> 4 6? Apparently, it works as -tl '0 2 4 6' >>> >>> Hello: >>> >>> I've made two different installations on the piledriver machine I have >>> access to, and here are my results. There is a problem with scalability, but >>> I don't know the reason. >>> >>> In the first compilation I used this configure command: >>> >>> ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL >>> -Ss pmake '\$(MAKE) -j 4' >>> >>> I limited the number of threads to four because there are only 4 FPUs in >>> spite of the announced eight cores. I was the only user on the >>> machine, so -DWALL should report good, real times. The whole >>> process was OK, and make check, make ptcheck and make time showed good >>> results. Then I tested the library to see the performance. With >>> the single-threaded library the reported performance for DGEMM was >>> around 25 GFLOPS/s. For 4 threads a performance of 25*4=100 GFLOPS/s >>> cannot be expected, but I think we can expect about 25*[3.3/3.5] ~ >>> 80/85 GFLOPS/s. But with 4 threads only about 47 GFLOPS/s was reached, >>> which means a speedup of around 2 on 4 cores. I think this is very >>> strange. >>> >>> Thinking about the problem, the reason could be that the >>> library was using only two FPUs for its 4 threads, so the speedup is >>> around 2. I don't know if this is possible, but it was all I could >>> think of. So I tried to compile the library again, indicating in the >>> configure the thread ids to use in order to select the 4 real FPUs. I >>> detected the 4 independent thread ids (using the program Clint >>> uploaded, which shows the same results as using >>> http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaat%2Fliaattunpinthreads.htm). >>> My first configure command was: >>> >>> ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL >>> -Ss pmake '\$(MAKE) -j 4' --force-tids="0 2 4 6" >>> >>> using the flag --force-tids as atlas_install.pdf says. But the >>> configure did not terminate, so I think --force-tids is broken in >>> 3.11.13. 
So I tried to use -tl flag as: >>> >>> ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL >>> -Ss pmake '\$(MAKE) -j 4' -tl '0 2 4 6' >>> >>> now the configure works. As I said in my previous mail, I don't know >>> exactly if this is the correct syntax for -tl. But the compilation >>> ended apparently without errors. >>> >>> Then, I did the checks. Make check worked fine, but I have a problem >>> in make ptcheck. I don't remember the error message from make ptcheck, >>> but if I try to execute it again, I obtain: >>> >>> make[5]: Entering directory `/home/jgpallero/ATLAS/catlas/src/lapack' >>> ar r ATL_itlaenv.o ATL_sgelq2.o ATL_sgeql2.o ATL_sgeqr2.o >>> ATL_sgerq2.o ATL_sgetf2.o ATL_sgetrfR.o ATL_sgetri.o ATL_sgetriC.o >>> ATL_sgetriR.o ATL_sgetrs.o ATL_slamch.o ATL_slapy2.o ATL_slarf.o >>> ATL_slarfb.o ATL_slarfg.o ATL_slarft.o ATL_slascl.o ATL_slaswp.o >>> ATL_slauum.o ATL_slauumCL.o ATL_slauumCU.o ATL_slauumRL.o >>> ATL_slauumRU.o ATL_sormlq.o ATL_sormql.o ATL_sormqr.o ATL_sormrq.o >>> ATL_spotrf.o ATL_spotrfL.o ATL_spotrfU.o ATL_spotrs.o ATL_stgelqf.o >>> ATL_stgelqr.o ATL_stgels.o ATL_stgeqlf.o ATL_stgeqlr.o ATL_stgeqrf.o >>> ATL_stgeqrr.o ATL_stgerqf.o ATL_stgerqr.o ATL_stgetrf.o ATL_stgetrfC.o >>> ATL_strtri.o ATL_strtriCL.o ATL_strtriCU.o ATL_strtriRL.o >>> ATL_strtriRU.o ATL_strtrs.o >>> ar: ATL_itlaenv.o: File format not recognized >>> make[5]: *** [stlib.grd] Error 1 >>> make[5]: Leaving directory `/home/jgpallero/ATLAS/catlas/src/lapack' >>> make[4]: *** [stlib] Error 2 >>> make[4]: Leaving directory `/home/jgpallero/ATLAS/catlas/src/lapack' >>> make[3]: *** [stlapack] Error 2 >>> make[3]: Leaving directory `/home/jgpallero/ATLAS/catlas/bin' >>> make[2]: *** [ptsanity_test] Error 2 >>> make[2]: Leaving directory `/home/jgpallero/ATLAS/catlas/bin' >>> make[1]: *** [ptsanity_test] Error 2 >>> make[1]: Leaving directory `/home/jgpallero/ATLAS/catlas' >>> make: *** [pttest] Error 2 >>> >>> ATL_itlaenv.o: File format not recognized ??? 
>>> >>> Make time also works fine. Finally I tried to compile a program using >>> the threaded ATLAS. I obtained this error: >>> >>> ../lib/libatlas.a(ATL_FreeGlobalAtomicCount.o): In function >>> `ATL_FreeGlobalAtomicCount': >>> ATL_FreeGlobalAtomicCount.c:(.text+0x29): undefined reference to >>> `ATL_FreeAtomicCount' >>> >>> I suppose I have used the -tl flag wrongly. But the configure says >>> nothing about it. Does anyone have any idea? >>> >>> Best regards >>> >>>> >>>> Best regards >>>> >>>>> >>>>> Best regards >>>>> >>>>>> >>>>>> >>>>>> Cheers, >>>>>> Clint >>>>>> >>>>>> On 09/27/2013 10:22 AM, R. Clint Whaley wrote: >>>>>>> >>>>>>> Unfortunately, I have not investigated this fully. It is definitely >>>>>>> true that on what they call an "8-core" there are only 4 FPUs, and so the >>>>>>> best scaling you will get is 4 for FPU-based stuff like ATLAS. >>>>>>> >>>>>>> On Bulldozer, I investigated, and the performance went down if you tried >>>>>>> to use all their cores, as seen here: >>>>>>> http://math-atlas.sourceforge.net/atlas_install/node21.html >>>>>>> This despite the fact that that machine got better performance using SSE >>>>>>> than AVX. >>>>>>> >>>>>>> On the piledriver, I found that once you did some strange instruction >>>>>>> selection, AVX was much faster than SSE, which I would say makes it much >>>>>>> more likely that using all the modules will be a loss. However, I have >>>>>>> not found the time to actually study the parallel performance to make >>>>>>> sure, and to find out what funky core ID scheme is needed to maximize >>>>>>> performance. >>>>>>> >>>>>>> The easiest test is to take an assembly routine that artificially gets >>>>>>> AVX peak by doing useless operations directly on registers (no cache >>>>>>> access at all), and spawn it to pairs of threads; when performance is >>>>>>> cut in half, you have discovered the tids that share an FPU. 
>>>>>>> >>>>>>> Once I did this, I used the configure interface to build ATLAS in two >>>>>>> ways: once with using only the unique FPUs, and once with 8 cores, and >>>>>>> what I found was that the 4-core approach was decidedly better on Dozer. >>>>>>> I have not yet done this test on Driver (I've been concentrating on >>>>>>> serial GEMM due to the rewrite, knowing that I have to research & >>>>>>> possibly rewrite threading thereafter). >>>>>>> >>>>>>> Cheers, >>>>>>> Clint >>>>>>> >>>>>>> On 09/24/2013 11:32 AM, José Luis García Pallero wrote: >>>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> I have access to an AMD Piledriver 8320 eight-core processor. Before >>>>>>>> to perform some tests with BLAS I would like to compute the >>>>>>>> theoretical floating point (double) peak. The problem is that I'm a >>>>>>>> bit confused about the concept of core (or thread) in AMD chips. >>>>>>>> Surfing the web one can find information about, but is not clear. In >>>>>>>> some sites talks about not all the cores are real cores, i.e., an >>>>>>>> eight-core processor is actually a quad-core. The chip is composed by >>>>>>>> 4 modules, each one having two cores, but these are cores for integer >>>>>>>> operations only, but has only one FPU per module, so the real number >>>>>>>> of FPU capable cores are 4. I have understood the prolem in this way. >>>>>>>> Am I right? >>>>>>>> So, in the caso of ATLAS compilation, should I configure the library >>>>>>>> for 8 threads or only for 4? I've seen that Piledriver is supported >>>>>>>> since ATLAS 3.11.3. Version 3.11.8 reaches a peak of about 78/83% as >>>>>>>> stated in ChangeLog. How was computed this value? >>>>>>>> And in order to compute the theoretical peak value, how many >>>>>>>> FLOPS/cycle can perform a Piledriver chip? 4 or 8? >>>>>>>> >>>>>>>> Cheers >>>>>>>> >>>>>>> >>>>>> >>>>>> -- >>>>>> ********************************************************************** >>>>>> ** R. 
Clint Whaley, PhD * Assoc Prof, LSU * http://www.csc.lsu.edu/~whaley ** >>>>>> ********************************************************************** >>>>>> >>>>>> ------------------------------------------------------------------------------ >>>>>> October Webinars: Code for Performance >>>>>> Free Intel webinars can help you accelerate application performance. >>>>>> Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most >>>>>> from >>>>>> the latest Intel processors and coprocessors. See abstracts and register > >>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=60134791&iu=/4140/ostg.clktrk >>>>>> _______________________________________________ >>>>>> Math-atlas-devel mailing list >>>>>> Math-atlas-devel@... >>>>>> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> ***************************************** >>>>> José Luis García Pallero >>>>> jgpallero@... >>>>> (o< >>>>> / / \ >>>>> V_/_ >>>>> Use Debian GNU/Linux and enjoy! >>>>> ***************************************** >>>> >>>> >>>> >>>> -- >>>> ***************************************** >>>> José Luis García Pallero >>>> jgpallero@... >>>> (o< >>>> / / \ >>>> V_/_ >>>> Use Debian GNU/Linux and enjoy! >>>> ***************************************** >>> >>> >>> >> >> -- >> ********************************************************************** >> ** R. Clint Whaley, PhD * Assoc Prof, LSU * http://www.csc.lsu.edu/~whaley ** >> ********************************************************************** >> >> >> ------------------------------------------------------------------------------ >> October Webinars: Code for Performance >> Free Intel webinars can help you accelerate application performance. >> Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from >> the latest Intel processors and coprocessors. 
See abstracts and register > >> http://pubads.g.doubleclick.net/gampad/clk?id=60134791&iu=/4140/ostg.clktrk >> _______________________________________________ >> Math-atlas-devel mailing list >> Math-atlas-devel@... >> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel > > > -- ********************************************************************** ** R. Clint Whaley, PhD * Assoc Prof, LSU * http://www.csc.lsu.edu/~whaley ** ********************************************************************** ```
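[Editor's note: the thread_siblings_list method discussed in this message can be automated by parsing those files to pick one tid per physical module. A minimal sketch follows; the sample sibling strings are illustrative — on a real Linux box each string would be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.]

```python
def parse_list(s):
    """Expand a kernel cpulist string like '0-1' or '0,4' into a sorted tuple."""
    out = []
    for part in s.split(','):
        if '-' in part:
            lo, hi = part.split('-')
            out.extend(range(int(lo), int(hi) + 1))
        else:
            out.append(int(part))
    return tuple(sorted(out))

def independent_tids(sibling_strings):
    """Keep the lowest tid of each distinct sibling group (one per module)."""
    groups = {parse_list(s) for s in sibling_strings.values()}
    return sorted(g[0] for g in groups)

# Sample topology mimicking a 4-module Piledriver: cpu0/cpu1 share a
# module, cpu2/cpu3 share a module, and so on.
sample = {n: f"{n - n % 2}-{n - n % 2 + 1}" for n in range(8)}
print(independent_tids(sample))  # [0, 2, 4, 6]
```

The resulting list is exactly what gets passed to configure as -tl '0 2 4 6' in the messages above.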
 Re: [atlas-devel] About AMD Piledriver real number of cores and theoretical peak From: José Luis García Pallero - 2013-10-02 17:02:22 ```2013/10/2 R. Clint Whaley : > Notice the page I pointed you at has no mention of using -t, which is > why --force-tids takes the number argument. The -t alone is merely used > to restrict the total number of threads, not to assign them to > particular tids. > > The other thing you want to build is a normal install that uses all 8 > threads, and compare it to the numbers of the --force-tids. I would > compare not only GEMM (xdl3blastst_pt/xdmmtst_pt), but also solvers > (xdslvtst_pt). For the solvers, you may want to try a range of problem > sizes, as they can have very different parallel usage at various scales. Thank you for the advice, Clint. Another question is about the possibility of merging various compilations with different numbers of threads. Suppose I want to create some versions of ATLAS for 2, 3 and 4 threads (plus the common single-threaded one). Then, if I want to save all the results in a single tree atlas/ (include/ and lib/), I could rename the resulting libcblas.a to libcblas-2.a, etc. But can libatlas.a be shared between all the compilations, or should it be renamed too? And what about the content of include/? Does it contain the same files across the different compilations? Thanks > > Cheers, > Clint > > On 10/02/2013 11:16 AM, José Luis García Pallero wrote: >> 2013/10/2 R. Clint Whaley : >>> You have the syntax wrong: >>> http://math-atlas.sourceforge.net/atlas_install/node21.html >> >> Sorry, you are right, the first number is the number of threads. So I >> must use --force-tids="4 0 2 4 6". But isn't this redundant? When >> someone uses --force-tids it is assumed that they also use the -t flag, and >> the argument of -t is also the number of threads... >> >> I've already compiled two versions of ATLAS. 
The first one with the configure as: >> >> ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL >> -Ss pmake '\$(MAKE) -j 4' --force-tids="4 0 2 4 6" >> >> and the other one with: >> >> ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL >> -Ss pmake '\$(MAKE) -j 4' >> >> Here are the results (GFLOPS/s) for DGEMM M=N=K=5000 on the Piledriver >> AMD FX(tm)-8320 Eight-Core Processor I have access to: >> >> 1 thread: 23.1 (both compilations) >> 4 thread: 75.6 (with --force-tids): >> 4 thread: 43.3 (no --force-tids) >> >> Using htop I've seen that with --force-tids the program uses the ids >> I've indicated, but the program without -force-tids uses always the >> threads 0 to 3, so I suppose that it uses only two real cores and >> threads share the FPU. >> >> Has had anyone a similar problem with Piledriver? And with Bulldozer? >> The machine I use runs Ubuntu 12.04 and GCC 4.7.3 >> >> Best regards >> >> PS: I could upload the libraries resulting of both compilations, if >> anyone want to make checks >> >>> >>> On 10/02/2013 06:12 AM, José Luis García Pallero wrote: >>>> 2013/10/2 José Luis García Pallero : >>>>> 2013/10/2 José Luis García Pallero : >>>>>> 2013/10/1 R. Clint Whaley : >>>>>>> Ooops, here's the attachment I meant to put in last message. >>>>>> >>>>>> Hello, and thank you for your code, >>>>>> >>>>>> Surfing the net I've found an easier way in order to detect which >>>>>> threads are independent. The way to do it consists on inspecting the >>>>>> files thread_siblings_list, located in >>>>>> /sys/devices/system/cpu/cpuXXX/topology: >>>>>> http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaat%2Fliaattunpinthreads.htm >>>>>> I suppose this method only works for *NIX family of operating systems. 
>>>>>> I've tested your code (using the fma4 source on the piledriver I have >>>>>> access) and it shows that id 0, 2, 4 and 6 are independed, which is >>>>>> the same result as obtained inspecting the thread_siblings_list list. >>>>> >>>>> Hi again, >>>>> >>>>> related with this topic of selecting the usable thread ids, I'm trying >>>>> to compile ATLAS in this way. I've tried to use in the configure (for >>>>> 3.11.13 version) the flag --force-tids="0 2 4 6" but the configure >>>>> breaks due to an error. Apparently, another alternative is to combine >>>>> the flags -t <# threads> and -tl <#> . I'm a bit confused about >>>>> the syntax of -tl due to the <#> marks. Means that first I must >>>>> repeat the number of threads (it was passed yet through -t) and then >>>>> the id list -> -tl 4 0 2 4 6? Or is mandatory only the list -> -tl 0 2 >>>>> 4 6? Apparently, it works as -tl '0 2 4 6' >>>> >>>> Hello: >>>> >>>> I've made two different installation in the piledriver machine I have >>>> access and here are my results. There is a problem on scalability, but >>>> I don't know the reason. >>>> >>>> In the firs compilation I've used this configure order: >>>> >>>> ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL >>>> -Ss pmake '\$(MAKE) -j 4' >>>> >>>> I limited the number of threads to four due to there is only 4 FPUs in >>>> spite of the announced eight cores. I was the only user in the >>>> machine, so the -DWALL should be report good and real times. All the >>>> process was OK, and make check, make ptchek and make time showed good >>>> results. Then I tested the library in order to see performance. With >>>> the single-threaded library the showed performance for DGEMM was >>>> around 25 GPLOPS/s. For 4 threads it can not be spected a performance >>>> of 25*4=100 GPLOPS/s, but I think we can spect about 25*[3.3/3.5] ~ >>>> 80/85 GFLOPS/s. But for 4 threads only about 47 GFLOPS/s was reached, >>>> which means a speedup around 2 for 4 cores. 
I think this is very >>>> strange. >>>> >>>> Thinking about the problem, the reason could be that for 4 threads the >>>> library were using only two FPUs but 4 threads, so the speeduo is >>>> around 2. I don't know if it is possible, but it was all I could >>>> think. So I tried to compile again the library indicating in the >>>> configure the threads id to use in order to select the 4 real FPUs. I >>>> detected the 4 independent threads id (using the program Clint >>>> uploaded, that shows the same results as using >>>> http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/index.jsp?topic=%2Fliaat%2Fliaattunpinthreads.htm). >>>> My first configure order was: >>>> >>>> ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL >>>> -Ss pmake '\$(MAKE) -j 4' --force-tids="0 2 4 6" >>>> >>>> using the flag --force-tids as the atlas_install.pdf says. But the >>>> configure did not terminate, so I think the --force-tids is broken in >>>> 3.11.13. So I tried to use -tl flag as: >>>> >>>> ../configure --prefix=/home/jgpallero/software/atlas -t 4 -D c -DWALL >>>> -Ss pmake '\$(MAKE) -j 4' -tl '0 2 4 6' >>>> >>>> now the configure works. As I said in my previous mail, I don't know >>>> exactly if this is the correct syntax for -tl. But the compilation >>>> ended apparently without errors. >>>> >>>> Then, I did the checks. Make check worked fine, but I have a problem >>>> in make ptcheck. 
I don't remember the error message from make ptcheck, >>>> but if I try to execute it again, I obtain: >>>> >>>> make[5]: Entering directory `/home/jgpallero/ATLAS/catlas/src/lapack' >>>> ar r ATL_itlaenv.o ATL_sgelq2.o ATL_sgeql2.o ATL_sgeqr2.o >>>> ATL_sgerq2.o ATL_sgetf2.o ATL_sgetrfR.o ATL_sgetri.o ATL_sgetriC.o >>>> ATL_sgetriR.o ATL_sgetrs.o ATL_slamch.o ATL_slapy2.o ATL_slarf.o >>>> ATL_slarfb.o ATL_slarfg.o ATL_slarft.o ATL_slascl.o ATL_slaswp.o >>>> ATL_slauum.o ATL_slauumCL.o ATL_slauumCU.o ATL_slauumRL.o >>>> ATL_slauumRU.o ATL_sormlq.o ATL_sormql.o ATL_sormqr.o ATL_sormrq.o >>>> ATL_spotrf.o ATL_spotrfL.o ATL_spotrfU.o ATL_spotrs.o ATL_stgelqf.o >>>> ATL_stgelqr.o ATL_stgels.o ATL_stgeqlf.o ATL_stgeqlr.o ATL_stgeqrf.o >>>> ATL_stgeqrr.o ATL_stgerqf.o ATL_stgerqr.o ATL_stgetrf.o ATL_stgetrfC.o >>>> ATL_strtri.o ATL_strtriCL.o ATL_strtriCU.o ATL_strtriRL.o >>>> ATL_strtriRU.o ATL_strtrs.o >>>> ar: ATL_itlaenv.o: File format not recognized >>>> make[5]: *** [stlib.grd] Error 1 >>>> make[5]: Leaving directory `/home/jgpallero/ATLAS/catlas/src/lapack' >>>> make[4]: *** [stlib] Error 2 >>>> make[4]: Leaving directory `/home/jgpallero/ATLAS/catlas/src/lapack' >>>> make[3]: *** [stlapack] Error 2 >>>> make[3]: Leaving directory `/home/jgpallero/ATLAS/catlas/bin' >>>> make[2]: *** [ptsanity_test] Error 2 >>>> make[2]: Leaving directory `/home/jgpallero/ATLAS/catlas/bin' >>>> make[1]: *** [ptsanity_test] Error 2 >>>> make[1]: Leaving directory `/home/jgpallero/ATLAS/catlas' >>>> make: *** [pttest] Error 2 >>>> >>>> ATL_itlaenv.o: File format not recognized ??? >>>> >>>> Make time also works fine. Finally I tried to compile a program using >>>> the threaded ATLAS. I've obtained this error: >>>> >>>> ../lib/libatlas.a(ATL_FreeGlobalAtomicCount.o): In function >>>> `ATL_FreeGlobalAtomicCount': >>>> ATL_FreeGlobalAtomicCount.c:(.text+0x29): undefined reference to >>>> `ATL_FreeAtomicCount' >>>> >>>> I suppose I have wrong used the -tl flag. 
But the configure says >>>> nothing about. Has anyone any idea about? >>>> >>>> Best regards >>>> >>>>> >>>>> Best regards >>>>> >>>>>> >>>>>> Best regards >>>>>> >>>>>>> >>>>>>> >>>>>>> Cheers, >>>>>>> Clint >>>>>>> >>>>>>> On 09/27/2013 10:22 AM, R. Clint Whaley wrote: >>>>>>>> >>>>>>>> Unfortunately, I have not investigated this fully. It is definitely >>>>>>>> true that on what they call a "8-core" there are only 4 FPUs, and so the >>>>>>>> best scaling you will get is 4 for FPU-based stuff like ATLAS. >>>>>>>> >>>>>>>> On Bulldozer, I investigated, and the performance went down if you tried >>>>>>>> to use all their cores, as seen here: >>>>>>>> http://math-atlas.sourceforge.net/atlas_install/node21.html >>>>>>>> This despite the fact that that machine got better performance using SSE >>>>>>>> than AVX. >>>>>>>> >>>>>>>> On the piledriver, I found that once you did some strange instruction >>>>>>>> selection, AVX was much faster than SSE, which I would say makes it much >>>>>>>> more likely that using all the modules will be a loss. However, I have >>>>>>>> not found the time to actually study the parallel performance to make >>>>>>>> sure, and to find out what funky core ID scheme is needed to maximize >>>>>>>> performance. >>>>>>>> >>>>>>>> The easiest test is to take an assembly routine that artificially gets >>>>>>>> AVX peak by doing useless operations directly on registers (no cache >>>>>>>> access at all), and spawn it to pairs of threads; when performance is >>>>>>>> cut in half, you have discovered the tids that share an FPU. >>>>>>>> >>>>>>>> Once I did this, I used the configure interface to build ATLAS in two >>>>>>>> ways: once with using only the unique FPUs, and once with 8 cores, and >>>>>>>> what I found was that the 4-core approach was decidedly better on Dozer. 
>>>>>>>> I have not yet done this test on Driver (I've been concentrating on >>>>>>>> serial GEMM due to the rewrite, knowing that I have to research & >>>>>>>> possibly rewrite threading thereafter). >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Clint >>>>>>>> >>>>>>>> On 09/24/2013 11:32 AM, José Luis García Pallero wrote: >>>>>>>>> >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> I have access to an AMD Piledriver 8320 eight-core processor. Before >>>>>>>>> to perform some tests with BLAS I would like to compute the >>>>>>>>> theoretical floating point (double) peak. The problem is that I'm a >>>>>>>>> bit confused about the concept of core (or thread) in AMD chips. >>>>>>>>> Surfing the web one can find information about, but is not clear. In >>>>>>>>> some sites talks about not all the cores are real cores, i.e., an >>>>>>>>> eight-core processor is actually a quad-core. The chip is composed by >>>>>>>>> 4 modules, each one having two cores, but these are cores for integer >>>>>>>>> operations only, but has only one FPU per module, so the real number >>>>>>>>> of FPU capable cores are 4. I have understood the prolem in this way. >>>>>>>>> Am I right? >>>>>>>>> So, in the caso of ATLAS compilation, should I configure the library >>>>>>>>> for 8 threads or only for 4? I've seen that Piledriver is supported >>>>>>>>> since ATLAS 3.11.3. Version 3.11.8 reaches a peak of about 78/83% as >>>>>>>>> stated in ChangeLog. How was computed this value? >>>>>>>>> And in order to compute the theoretical peak value, how many >>>>>>>>> FLOPS/cycle can perform a Piledriver chip? 4 or 8? >>>>>>>>> >>>>>>>>> Cheers >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> ********************************************************************** >>>>>>> ** R. 
Clint Whaley, PhD * Assoc Prof, LSU * http://www.csc.lsu.edu/~whaley ** >>>>>>> ********************************************************************** >>>>>>> >>>>>>> ------------------------------------------------------------------------------ >>>>>>> October Webinars: Code for Performance >>>>>>> Free Intel webinars can help you accelerate application performance. >>>>>>> Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most >>>>>>> from >>>>>>> the latest Intel processors and coprocessors. See abstracts and register > >>>>>>> http://pubads.g.doubleclick.net/gampad/clk?id=60134791&iu=/4140/ostg.clktrk >>>>>>> _______________________________________________ >>>>>>> Math-atlas-devel mailing list >>>>>>> Math-atlas-devel@... >>>>>>> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> ***************************************** >>>>>> José Luis García Pallero >>>>>> jgpallero@... >>>>>> (o< >>>>>> / / \ >>>>>> V_/_ >>>>>> Use Debian GNU/Linux and enjoy! >>>>>> ***************************************** >>>>> >>>>> >>>>> >>>>> -- >>>>> ***************************************** >>>>> José Luis García Pallero >>>>> jgpallero@... >>>>> (o< >>>>> / / \ >>>>> V_/_ >>>>> Use Debian GNU/Linux and enjoy! >>>>> ***************************************** >>>> >>>> >>>> >>> >>> -- >>> ********************************************************************** >>> ** R. Clint Whaley, PhD * Assoc Prof, LSU * http://www.csc.lsu.edu/~whaley ** >>> ********************************************************************** >>> >>> >>> ------------------------------------------------------------------------------ >>> October Webinars: Code for Performance >>> Free Intel webinars can help you accelerate application performance. >>> Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from >>> the latest Intel processors and coprocessors. 
See abstracts and register > >>> http://pubads.g.doubleclick.net/gampad/clk?id=60134791&iu=/4140/ostg.clktrk >>> _______________________________________________ >>> Math-atlas-devel mailing list >>> Math-atlas-devel@... >>> https://lists.sourceforge.net/lists/listinfo/math-atlas-devel >> >> >> > > -- > ********************************************************************** > ** R. Clint Whaley, PhD * Assoc Prof, LSU * http://www.csc.lsu.edu/~whaley ** > ********************************************************************** > > > ------------------------------------------------------------------------------ > October Webinars: Code for Performance > Free Intel webinars can help you accelerate application performance. > Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from > the latest Intel processors and coprocessors. See abstracts and register > > http://pubads.g.doubleclick.net/gampad/clk?id=60134791&iu=/4140/ostg.clktrk > _______________________________________________ > Math-atlas-devel mailing list > Math-atlas-devel@... > https://lists.sourceforge.net/lists/listinfo/math-atlas-devel -- ***************************************** José Luis García Pallero jgpallero@... (o< / / \ V_/_ Use Debian GNU/Linux and enjoy! ***************************************** ```
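[Editor's note: on the FLOPS/cycle question from the original post — each Piledriver module has one shared FPU with two 128-bit FMA pipes, commonly counted as 8 double-precision flops/cycle per module. Taking that figure and the FX-8320's 3.5 GHz base clock as assumptions (not official numbers), the theoretical peak works out as:]

```python
# Theoretical DP peak for an FX-8320, assuming 8 flops/cycle per module:
# 2 FMA pipes x 2 doubles per 128-bit register x 2 flops (mul+add).
modules = 4            # "eight cores", but one shared FPU per module
flops_per_cycle = 8.0  # assumed DP flops/cycle per module
ghz = 3.5              # assumed FX-8320 base clock

chip_peak = modules * flops_per_cycle * ghz  # 112.0 GFLOPS for the chip
core_peak = flops_per_cycle * ghz            # 28.0 GFLOPS per module

# Sanity check against the 23.1 GFLOPS single-thread DGEMM reported above:
efficiency = 23.1 / core_peak                # ~0.82 of per-module peak
```

Under these assumptions the single-thread result is ~82% of peak, consistent with the 78/83% range quoted from the ChangeLog, while the 75.6 GFLOPS four-thread --force-tids result is ~67% of the 112 GFLOPS chip peak.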