From: José Luis García Pallero <jgpallero@gm...>  2013-01-06 22:38:07

Hello:

First of all, I apologize if this question is off-topic for this list, but I don't know which list would be appropriate.

I'm writing some benchmark code based on the DGEMM and DSYRK BLAS routines, and I need to know the number of floating-point operations performed. Does ATLAS's DGEMM and/or DSYRK use any non-basic algorithm, such as Strassen's, for example? Or does it use the reference O(n^3) one? Surfing the web I've found that the exact numbers of floating-point operations for the reference algorithms are:

DGEMM: 2*M*N*K
DSYRK: M*(M+1)*N

Are these formulas correct? And does anybody know about other implementations such as MKL, ACML, or Goto(Open)BLAS?

Thanks in advance

*****************************************
José Luis García Pallero
jgpallero@...
Use Debian GNU/Linux and enjoy!
*****************************************
From: Clint Whaley <whaley@cs...>  2013-01-10 03:48:46

Jose,

>I'm writing some benchmark code based on DGEMM and DSYRK BLAS routines
>and I need to know the number of floating point operations performed.
>[...]
>DGEMM: 2*M*N*K
>DSYRK: M*(M+1)*N
>
>Are these formulas correct?

Yes, though DSYRK actually takes N, K as dims, so the correct count is K*N*(N+1).

Of course, complex arithmetic costs more. Note that ATLAS's timers (in ATLAS/bin) provide the standard flop count for pretty much all of the BLAS and some of LAPACK if you want to look it up.

>And does anybody know about other implementations such as MKL, ACML or Goto(Open)BLAS?

They should all be using the standard algorithm. Strassen (and all the other fast matmuls) is illegal in standard libraries because it is not as numerically stable. There are some high-performance libraries that optionally provide fast/unstable multiplies (in particular 3M for complex), but they aren't supposed to do so by default.

Basically, optimized libraries like ATLAS, MKL, etc., should all use the standard, stable algorithm and should be counted as above; they may actually do *extra* flops beyond this due to scaling, but those are low-order terms and are never counted for MFLOPS, where the standard count is always used.

Cheers,
Clint

**************************************************************************
** R. Clint Whaley, PhD ** Assoc Prof, UTSA ** http://www.cs.utsa.edu/~whaley **
**************************************************************************
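The standard counts discussed above are easy to drop into a small benchmark harness. Below is a minimal sketch in Python, assuming NumPy is linked against an optimized BLAS so that `A @ B` dispatches to DGEMM; the helper names (`gemm_flops`, `syrk_flops`) are illustrative, not part of any library:

```python
import time
import numpy as np

def gemm_flops(m, n, k):
    """Standard flop count for DGEMM: C <- alpha*A*B + beta*C."""
    return 2.0 * m * n * k

def syrk_flops(n, k):
    """Standard flop count for DSYRK, with N and K the BLAS dimensions."""
    return k * n * (n + 1.0)

m = n = k = 500
A = np.random.rand(m, k)
B = np.random.rand(k, n)

t0 = time.perf_counter()
C = A @ B                      # dispatched to the underlying BLAS dgemm
elapsed = time.perf_counter() - t0

print("DGEMM rate: %.2f GFLOPS" % (gemm_flops(m, n, k) / elapsed / 1e9))
```

Note that, as described above, the count depends only on the problem dimensions, never on whether the hardware used FMA instructions or did extra scaling flops internally.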
From: Frantisek Kluknavsky <fkluknav@re...>  2013-01-10 08:32:56

These formulas assume separate multiply and add operations? Does ATLAS use fused multiply-accumulate when available?

On 01/10/2013 04:48 AM, Clint Whaley wrote:
> Yes, though DSYRK actually takes N, K as dims, so the correct count is
> K*N*(N+1)
> [...]
> Basically, optimized libraries like ATLAS, MKL, etc, should all use the
> standard, stable algorithm, and should be counted as above; they may actually
> do *extra* flops than this due to scaling, but they are low order terms, and
> are never computed for MFLOP, where the standard count is always used.
From: Clint Whaley <whaley@cs...>  2013-01-10 16:08:30

>These formulas assume separate multiply and add operations?
>Does ATLAS use fused multiply-accumulate when available?

Of course ATLAS uses FMAC when available, since not using it usually cuts your peak in half. However, MFLOPS are always counted by the number of flops you do, not the number of instructions your machine executes, so whether you have FMAC or not, the flop count is always the same.

Cheers,
Clint
From: José Luis García Pallero <jgpallero@gm...>  2013-01-10 09:08:20

2013/1/10 Clint Whaley <whaley@...>:
> Yes, though DSYRK actually takes N, K as dims, so the correct count is
> K*N*(N+1)
>
> Of course, the complex arithmetic costs more. Note that ATLAS's timers
> (in ATLAS/bin) provide the standard flop count for pretty much all BLAS
> and some of lapack if you want to look it up.

I've found in LAPACK Working Note 41 a list of operation counts for BLAS levels 2 and 3 and some LAPACK routines:
http://www.netlib.org/lapack/#_strong_lawns_strong_lapack_working_notes
Frantisek, in that document the operation counts list adds and multiplications separately.

> They should all be using the standard algorithm. Strassen (and all the other
> fast matmuls) are illegal in standard libraries because they are not as
> numerically stable.
> [...]

Thanks for this explanation :)

And about the operation count, it corresponds to the formula: FLOPS = cores x clock speed x flops/cycle. But the problem is, how can I know the flops/cycle of a chip? For example, I have an Intel Core i5-2500 at 3.3 GHz (4 cores), and some tests with MKL DGEMM show about 95-96 GFLOPS. Taking into account that MKL can reach about 90% of theoretical performance, 8 flops/cycle (for double) can be deduced. But I can't find this information in the specifications. How does ATLAS (for example) get information about the theoretical peak of a chip (if ATLAS does)?

Thanks
From: Clint Whaley <whaley@cs...>  2013-01-14 17:59:18

>And about the operation count, it corresponds to the formula:
>FLOPS = cores x clock speed x flops/cycle. But the problem is, how can I
>know the flops/cycle of a chip?

You have to know something about the hardware, and derive it from there. For instance, on Sandy Bridge you have 1 VFADD and 1 VFMUL unit, each 32 bytes wide. For double precision, that means 4 doubles done at once, so the double peak is:

    4 * 2 = 8 flops/cycle

with the single-precision peak being twice that. You can also usually find it empirically by doing useless flops, but getting that right is not trivial for most folks.

Cheers,
Clint
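To make the arithmetic concrete, here is a tiny sketch using the figures quoted in this thread; `peak_gflops` is a hypothetical helper for illustration, not an ATLAS routine, and the measured 95.5 GFLOPS is just the midpoint of the 95-96 GFLOPS MKL result mentioned earlier:

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak: cores x clock x flops/cycle."""
    return cores * clock_ghz * flops_per_cycle

# Intel Core i5-2500 (Sandy Bridge): 4 cores at 3.3 GHz.
# One 256-bit VFADD unit + one 256-bit VFMUL unit = (4 adds + 4 muls)
# per cycle in double precision = 8 flops/cycle.
peak = peak_gflops(4, 3.3, 8)    # 105.6 GFLOPS theoretical peak
measured = 95.5                  # MKL DGEMM result quoted in the thread

print("peak = %.1f GFLOPS, efficiency = %.0f%%" % (peak, 100 * measured / peak))
```

This reproduces the deduction made above: 95-96 GFLOPS measured against a 105.6 GFLOPS peak is roughly 90% efficiency, which is consistent with 8 flops/cycle in double precision.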
From: José Luis García Pallero <jgpallero@gm...>  2013-01-14 22:03:03

2013/1/14 Clint Whaley <whaley@...>:
> You have to know something about the hardware, and derive it from there.
> For instance, on SandyBridge, you have 1 VFADD and 1 VFMUL unit, each is
> 32 bytes long. For double, that means 4 doubles done at once, so double
> peak is:
> 4 * 2 = 8 flops/cycle
> With single peak being twice that. You can also usually find it empirically
> by doing useless flops, but getting that right is not trivial for most folks.

Thank you for your explanation, Clint. How could I find the peak empirically? Can ATLAS do it? I also have an iBook with a PPC G4 at 1.33 GHz. Surfing the net I've found that the PPC G4 is capable of 8 flops/cycle in double precision, so the theoretical peak would be 10.64 GFLOPS. But my tests show:
- On Mac OS X with Apple's vecLib I obtain about 1 GFLOPS with DGEMM
- On Debian GNU/Linux I also obtain about 1 GFLOPS with DGEMM, using OpenBLAS (the old GotoBLAS) and ATLAS (3.8.4)

I think I've not computed the peak correctly.

Cheers
From: Clint Whaley <whaley@cs...>  2013-01-14 22:30:01

>Thank you for your explanation, Clint. How could I find the peak
>empirically?

If you have to ask, you will not succeed :) ATLAS does not do it automatically because you usually need to do it in assembly, which requires knowing the ISA and what to use (SSE/AVX, etc.) in advance.

>I think that I've not computed correctly the peak

That is not correct. The G5 can do 4 flops/cycle (2 FMAC units); I can't remember the G4 for sure, but it could do either 2 flops/cycle or 4 in double (I can't remember if it had 1 or 2 FMAC units). Some older machines are listed at:
http://math-atlas.sourceforge.net/timing/

The above GFLOPS might possibly be right for single precision using AltiVec, but the G4's memory system kept any real kernel from getting close to AltiVec peak.

Cheers,
Clint
From: José Luis García Pallero <jgpallero@gm...>  2013-01-14 22:53:16

2013/1/14 Clint Whaley <whaley@...>:
> That is not correct. The G5 can do 4 flops/cycle (2 FMAC units); I can't
> remember the G4 for sure, but it could do either 2 flops/cycle or 4 in
> double (can't remember if it had 1 or 2 FMAC units).
> [...]

I've performed more tests on Mac OS X linking against Apple's vecLib library, and the results are (PPC G4 1.33 GHz):

M=N=K    DGEMM GFLOPS
1000     1.057
2000     1.155
3000     1.254
4000     1.252

So probably the G4 can perform 1 double-precision operation per cycle (or two single-precision).
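As a cross-check on that deduction, the measured GFLOPS figures can be converted back into flops/cycle by dividing by the clock rate; a sketch, using the table values reported above:

```python
clock_ghz = 1.33  # PPC G4 at 1.33 GHz

# (M=N=K, measured DGEMM GFLOPS) from the vecLib runs reported above
results = [(1000, 1.057), (2000, 1.155), (3000, 1.254), (4000, 1.252)]

for size, gflops in results:
    # GFLOPS / GHz = flops per clock cycle actually sustained
    print("N=%d: %.2f flops/cycle" % (size, gflops / clock_ghz))

best = max(gflops for _, gflops in results)
print("best sustained: %.2f flops/cycle" % (best / clock_ghz))
```

The best case works out to roughly 0.94 flops/cycle, which is what motivates the "about 1 double operation per cycle" guess.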
From: Clint Whaley <whaley@cs...>  2013-01-14 23:13:27

>I've performed more tests in MAC OS X linking Apple VecLib library and
>the results are (PPC G4 1.33 GHz):
>[...]
>So probably G4 can performs 1 double operation per cycle (or two single)

VecLib is Apple-hacked ATLAS. There were different processors sold as "G4", I think. If yours has AltiVec, then the peak is at least 8 flops/cycle, and maybe 16 (GEMM does not get close to this, though). If double gets less than 2 flops/cycle, then it must not have a fully-pipelined FMAC unit.

Anyway, if knowing the peak is important to you, you'll need to do some research. My only G4 died last semester.

Cheers,
Clint
From: José Luis García Pallero <jgpallero@gm...>  2013-01-15 06:43:29

2013/1/15 Clint Whaley <whaley@...>:
> Veclib is a Apple-hacked ATLAS.

Quite interesting :) I think this fact is not documented in Apple's docs.

> There were different procs sold as "G4", I think. If yours has altivec,
> then the peak is at least 8 flops/cycle, and maybe 16 (GEMM does not get
> close to this, though).

I have exactly:

Model: PowerBook6,7
CPU type: PowerPC G4 (1.5)
CPU units: 1
CPU speed: 1.33 GHz
L2 cache (per CPU): 512 KB

> Anyway, if knowing the peak is important to you, you'll need to do some
> research. My only G4 died last semester.

Actually it is not so important; it's only for showing students (in a very, very basic introductory computation course) the ratio R/Rpeak on architectures other than Intel.

Cheers
From: Brooks Moses <brooks@co...>  2013-03-05 20:30:52

Hi, all,

Following up on a bit of a tangent off of the below-quoted conversation: does anyone have any halfway-current numbers for single-precision SGEMM performance on Power architectures? We've been working on compiling for a newer Power platform and have been a ways off of theoretical peak, and it would be nice to know whether our results are expected or not.

Clint, when you mention that the G4's memory system "kept any real kernel from getting close to AltiVec peak", do you have a recollection of what sort of numbers that meant? Is this 30% off of peak, or 50%, or more?

The latest results I found on the math-atlas-results list are these, from version 3.3.7 about a decade ago:
http://sourceforge.net/mailarchive/message.php?msg_id=190029

Thanks much,
- Brooks

Clint Whaley wrote, at 1/14/2013 2:29 PM:
> The above GFLOP might possibly be right for single precision using Altivec,
> but the G4's memory system kept any real kernel from getting close to
> altivec peak.

--
Brooks Moses
CodeSourcery / Mentor Graphics
brooks@...
From: <whaley@cs...>  2013-03-05 21:11:23

Guys,

FYI: I've been ill, and before that on travel, so that's why I've not been responding. I will try to catch up on support and developer messages during spring break (next week).

>Clint, when you mention that the G4's memory system "kept any real
>kernel from getting close to AltiVec peak", do you have a recollection
>of what sort of numbers that meant? Is this 30% off of peak, or 50%, or
>more?

I absolutely cannot remember. My G4 died at a really bad time. I think the new data format that I've added could actually increase G4 performance, since it is a bandwidth optimization, but the machine died a couple of months before. I did a lot of timings on the old G4, but I can't find any of them now either, sorry.

I do plan on writing a new AltiVec kernel eventually for the new framework that I can time on the G5, and maybe someone like you can tell me how it works on the G4. However, this will be a while down the road.

Regards,
Clint
From: Brooks Moses <brooks@co...>  2013-03-05 21:44:31

whaley@... wrote, at 3/5/2013 12:41 PM:
> FYI: I've been ill, and before that on travel, so that's why I've not
> been responding. Will try to catch up on support and developer messages
> during spring break (next week).

Sorry to hear you've been ill, and I hope you're feeling better now, or at least soon!

> I absolutely cannot remember. My G4 died at really bad time. [...] I did
> a lot of timings on the old G4, but I can't find any of them now either,
> sorry.

Thanks; I appreciate you letting me know, in any case.

- Brooks
From: David Fang <fang@cs...>  2013-03-05 21:37:10

Hi ATLASians,

I'm very interested in helping test on G4 (powerpc-darwin8), but I have been very short on time in recent months. I do have a working machine; the last build-tests I ran last year were on 3.9.7x and took 4.5 days solid (tuning). If I find the time, which release/snapshot would be the highest priority for testing?

Fang

> I do plan on writing a new AltiVec kernel eventually for the new framework
> that I can time with the G5, and maybe someone like you can tell me how
> it works on the G4. However, this will be a while down the road.

--
David Fang
http://www.csl.cornell.edu/~fang/
From: Brooks Moses <brooks@co...>  2013-03-05 21:42:27

Hello!

Personally, I would be most interested in the most recent stable version (3.10.1, I believe), since that's what I expect to be using.

Thanks,
- Brooks

David Fang wrote, at 3/5/2013 1:36 PM:
> I'm very interested in helping test on G4 (powerpcdarwin8), but have been
> very short on time in recent months. I do have a working machine, and the
> last buildtests I ran last year were on 3.9.7x took 4.5 days solid
> (tuning). If I find the time, which release/snapshot would be highest
> priority for testing?
From: Jeff Hammond <jhammond@al...> - 2013-03-06 01:27:25

There are Mac G4 systems that aren't on display in museums? I have to question the logic of investing in developing new software for antiquated hardware. I'm certainly a big fan of PowerPC, but why not get ATLAS working well on POWER7 and Blue Gene/Q before jumping in the DeLorean. Best, Jeff On Tue, Mar 5, 2013 at 2:41 PM, <whaley@...> wrote: > Guys, > > FYI: I've been ill, and before that on travel, so that's why I've not > been responding. Will try to catch up on support and developer messages > during spring break (next week). > >>Following up with a bit of tangent off of the belowquoted conversation, >>does anyone have any halfwaycurrent numbers for singleprecision SGEMM >>performance on Power architectures? We've been working on compiling for >>a newer Power platform and have been a ways off of theoretical peak, and >>it would be nice to know whether our results are expected or not. >> >>Clint, when you mention that the G4's memory system "kept any real >>kernel from getting close to AltiVec peak", do you have a recollection >>of what sort of numbers that meant? Is this 30% off of peak, or 50%, or >>more? > > I absolutely cannot remember. My G4 died at really bad time. I think > the new data format that I've added could actually increase G4 perf, > since it is a BW optimization, but the machine died a couple of months > before. I did a lot of timings on the old G4, but I can't find any > of them now either, sorry. > > I do plan on writing a new AltiVec kernel eventually for the new framework > that I can time with the G5, and maybe someone like you can tell me how > it works on the G4. However, this will be a while down the road. > > Regards, > Clint > > Clint Whaley wrote, at 1/14/2013 2:29 PM: >>>  In MAC OS X with apple's veclib I obtain about 1 GFLOP with DGEMM >>>  In Debian GNU/Linux I obtain too about 1 GFLOP DGEMM with OpenBLAS >>> (old GotoBLAS) and ATLAS (3.8.4) >>> >>> I think that I've not computed correctly the peak >> >> That is not correct. 
The G5 can do 4 flops/cycle (2 FMAC units); I can't >> remember the G4 for sure, but it could do either 2 flops/cycle or 4 in >> double (can't remember if it had 1 or 2 FMAC units). Some older machines >> are listed at: >> http://math-atlas.sourceforge.net/timing/ >> >> The above GFLOP might possibly be right for single precision using Altivec, >> but the G4's memory system kept any real kernel from getting close to >> altivec peak. > > > -- > Brooks Moses > CodeSourcery / Mentor Graphics > brooks@... > 15103546729 > > ************************************************************************** > ** R. Clint Whaley, PhD ** Assoc Prof, UTSA ** http://www.cs.utsa.edu/~whaley ** > ************************************************************************** -- Jeff Hammond Argonne Leadership Computing Facility University of Chicago Computation Institute jhammond@... / (630) 2525381 http://www.linkedin.com/in/jeffhammond https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond 
From: Brooks Moses <brooks@co...> - 2013-03-06 03:35:42

Jeff Hammond wrote, at 3/5/2013 5:26 PM: > There are Mac G4 systems that aren't on display in museums? I have to > question the logic of investing in developing new software for > antiquated hardware. I'm certainly a big fan of PowerPC, but why not > get ATLAS working well on POWER7 and Blue Gene/Q before jumping in the > DeLorean. IBM isn't the only company in the Power market. :) Freescale is still selling MPC7448s and MPC8641s, which are direct descendants of the G4. They're also about to introduce a range of processors (T4240 and B4860, notably) with the brand-new e6500 core, which looks closer to a G4 in tuning characteristics than it is to anything IBM's doing. The e6500s are primarily intended for networking systems, as far as I can tell, but there's certainly interest in using them for high-performance embedded purposes that would want things like ATLAS. And of course the MPC8641D has been a mainstay of mil/aero high-performance embedded applications for years. - Brooks 
From: José Luis García Pallero <jgpallero@gm...> - 2013-03-06 10:44:03

2013/3/6 Brooks Moses <brooks@...>: > Jeff Hammond wrote, at 3/5/2013 5:26 PM: >> There are Mac G4 systems that aren't on display in museums? I have to >> question the logic of investing in developing new software for >> antiquated hardware. I'm certainly a big fan of PowerPC, but why not >> get ATLAS working well on POWER7 and Blue Gene/Q before jumping in the >> DeLorean. > > IBM isn't the only company in the Power market. :) > > Freescale is still selling MPC7448s and MPC8641s, which are direct > descendants of the G4. They're also about to introduce a range of > processors (T4240 and B4860, notably) with the brand-new e6500 core, > which looks closer to a G4 in tuning characteristics than it is to > anything IBM's doing. > > The e6500s are primarily intended for networking systems, as far as I > can tell, but there's certainly interest in using them for high-performance > embedded purposes that would want things like ATLAS. And of > course the MPC8641D has been a mainstay of mil/aero high-performance > embedded applications for years. Hello: I have an Apple iBook PPC G4 running Debian GNU/Linux and ATLAS 3.8.4 installed. Probably tonight I could try some benchmarks with cblas_sgemm() and cblas_ssyrk(). I will post the results here. The laptop has 1.5 GB of memory. Which range of dimensions do you prefer for the tests? 100 to 1000, or 1000 to 10000? Cheers > > - Brooks ***************************************** José Luis García Pallero jgpallero@... 
(o< / / \ V_/_ Use Debian GNU/Linux and enjoy! ***************************************** 
From: Jeff Hammond <jhammond@al...> - 2013-03-06 16:05:54

Ah, that makes sense. I didn't realize that the Mac-associated names G4 and G5 were just proxies for the PPC 74xx and 970 series processors, along with similar parts that you mention below. Thanks, Jeff On Tue, Mar 5, 2013 at 9:35 PM, Brooks Moses <brooks@...> wrote: > Jeff Hammond wrote, at 3/5/2013 5:26 PM: >> There are Mac G4 systems that aren't on display in museums? I have to >> question the logic of investing in developing new software for >> antiquated hardware. I'm certainly a big fan of PowerPC, but why not >> get ATLAS working well on POWER7 and Blue Gene/Q before jumping in the >> DeLorean. > > IBM isn't the only company in the Power market. :) > > Freescale is still selling MPC7448s and MPC8641s, which are direct > descendants of the G4. They're also about to introduce a range of > processors (T4240 and B4860, notably) with the brand-new e6500 core, > which looks closer to a G4 in tuning characteristics than it is to > anything IBM's doing. > > The e6500s are primarily intended for networking systems, as far as I > can tell, but there's certainly interest in using them for high-performance > embedded purposes that would want things like ATLAS. And of > course the MPC8641D has been a mainstay of mil/aero high-performance > embedded applications for years. > > - Brooks -- Jeff Hammond Argonne Leadership Computing Facility University of Chicago Computation Institute jhammond@... 
/ (630) 2525381 http://www.linkedin.com/in/jeffhammond https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond 
From: José Luis García Pallero <jgpallero@gm...> - 2013-03-06 23:07:23

2013/3/6 Brooks Moses <brooks_moses@...>: > [writing off-list so as not to clutter everyone's inboxes] > > Hello José, > > José Luis García Pallero wrote, at 3/6/2013 2:43 AM: >> I have an Apple iBook PPC G4 running Debian GNU/Linux and ATLAS 3.8.4 >> installed. Probably tonight I could try some benchmarks with >> cblas_sgemm() and cblas_ssyrk(). I will post the results here. The >> laptop has 1.5 GB of memory. Which range of dimensions you prefer to >> do the tests? 100 to 1000, 1000 to 10000? > > The smaller range would be ideal -- and thank you very much for offering!

Hello:

Apple iBook G4, CPU type: PowerPC G4 (1.5), Debian GNU/Linux, ATLAS 3.8.4 from the Debian repositories (NOT compiled by myself, but as there is not much variety of PPC G4 hardware, it should be well optimized; see the results compared with veclib for double in the original post of this thread).

The tests were performed for M=N=K=100, 200, ..., 1000 using cblas_sgemm and cblas_ssyrk. The results (each value computed over 10 runs per set of dimensions) are:

GEMM: M=N=K= 100 -> 0.533 GFLOPS
GEMM: M=N=K= 200 -> 0.643 GFLOPS
GEMM: M=N=K= 300 -> 0.783 GFLOPS
GEMM: M=N=K= 400 -> 0.767 GFLOPS
GEMM: M=N=K= 500 -> 0.759 GFLOPS
GEMM: M=N=K= 600 -> 0.808 GFLOPS
GEMM: M=N=K= 700 -> 0.803 GFLOPS
GEMM: M=N=K= 800 -> 0.822 GFLOPS
GEMM: M=N=K= 900 -> 0.821 GFLOPS
GEMM: M=N=K= 1000 -> 0.819 GFLOPS

SYRK: M=N=K= 100 -> 0.371 GFLOPS
SYRK: M=N=K= 200 -> 0.502 GFLOPS
SYRK: M=N=K= 300 -> 0.649 GFLOPS
SYRK: M=N=K= 400 -> 0.653 GFLOPS
SYRK: M=N=K= 500 -> 0.658 GFLOPS
SYRK: M=N=K= 600 -> 0.721 GFLOPS
SYRK: M=N=K= 700 -> 0.724 GFLOPS
SYRK: M=N=K= 800 -> 0.755 GFLOPS
SYRK: M=N=K= 900 -> 0.756 GFLOPS
SYRK: M=N=K= 1000 -> 0.762 GFLOPS

For dimensions of 100 and 200 the performance varies between 0.5 and 0.75 GFLOPS with GEMM, but with SYRK it is more stable.

I'm very surprised: the performance for single precision is worse than for double. ??? For double I obtain about 1 GFLOPS with this machine (with ATLAS in Linux and with Apple's veclib in MAC OS X Tiger). I've also tested my intel pentium m 1.33 GHz laptop: for double the performance is about 1 GFLOPS, but for single it is about 2.5 GFLOPS, which is expected.

I don't know why in this case the single precision functions are slower. Probably Clint could give some hints.

For now I don't have enough time to compile the atlas 3.10.1 version, so I can offer only these results. We could argue that the problem is the debian compilation, but for double precision the results are similar to Apple's veclib in OS X.

Cheers -- ***************************************** José Luis García Pallero jgpallero@... (o< / / \ V_/_ Use Debian GNU/Linux and enjoy! ***************************************** 
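For reference, the GFLOPS figures in these benchmarks follow from the standard flop counts given at the top of this thread (2*M*N*K for real GEMM, K*N*(N+1) for real SYRK). A minimal sketch of the arithmetic (plain Python; the helper names are my own, not part of any BLAS API):

```python
def gemm_flops(m, n, k):
    # Standard count for real GEMM (C = alpha*A*B + beta*C): one multiply
    # and one add per inner-product term, so 2*M*N*K in total.
    return 2 * m * n * k

def syrk_flops(n, k):
    # Standard count for real SYRK (C = alpha*A*A' + beta*C), which only
    # updates one triangle of the N x N result: K*N*(N+1).
    return k * n * (n + 1)

def gflops(flop_count, seconds):
    # Convert an operation count and a wall-clock time into GFLOPS.
    return flop_count / seconds / 1e9

# Example: a 1000x1000x1000 SGEMM call that takes 2.44 s runs at ~0.82
# GFLOPS, in the same range as the iBook numbers reported above.
print(round(gflops(gemm_flops(1000, 1000, 1000), 2.44), 3))  # prints 0.82
```

As Clint noted earlier in the thread, optimized libraries may perform a few extra flops for scaling, but those are low-order terms and are never included in the MFLOP count; the standard count above is always used.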
From: Brooks Moses <brooks@co...> - 2013-03-06 23:13:21

José Luis García Pallero wrote, at 3/6/2013 3:07 PM: > Apple iBook G4, CPU type: PowerPC G4 (1.5), Debian GNU/Linux, ATLAS > 3.8.4 from the Debian repositories (NOT compiled by myself, but as > there are not much variety of PPC G4, it should be well optimized see > results compared with veclib for double in the original post of this > thread). Thanks muchly! That's definitely helpful. > I'm very surprised, the performance for single precision is worst as > for double. ??? > For double I obtain performances about 1 GFLOPS/s with this machine > (with ATLAS in Linux and with Apple's veclib in MAC OS X Tiger). I've > tested in my intel pentium m 1.33 GHz laptop and for double the > performance is about 1 GFLOPS/s but for single is about 2.5 GFLOPS/s > which is spected. I wonder if the Debian package was actually compiled with AltiVec enabled. I know for a while that it was quietly turned off in the Fedora packages because they ran into a compile error with it. (IIRC, it was a trivial one that would have been easy to work around, but such was the state of Fedora's interest in Power at the time that nobody bothered.)  Brooks 
From: Ian Ollmann <iano@ap...> - 2013-03-06 23:27:05

AltiVec on these G4s did single precision but not double precision. Peak throughput was 8 single-precision flops per cycle per core. For double precision it was formally 1. However, the FPU was only capable of keeping 4 out of 5 stages busy due to under-provisioned reservation stations, so the peak theoretical performance for single precision in something like sgemm was about 5x what you could get for double. For a unicore machine running at 1.33 GHz, we would expect something south of 10.6 GFlops for single precision. Based on the results you present, it would not be surprising if you eventually discover that you are running a simple unoptimized scalar loop for most of the calculation. A formulation of sgemm that kept double-precision accumulators would make single precision slower than double, because of the extra work to convert all the single-precision values to double. If you have a tool capable of sampling the code at instruction granularity while it runs, sending us the assembly code for the hot loop should allow quick discovery of the problem. Ian On Mar 6, 2013, at 3:07 PM, José Luis García Pallero <jgpallero@...> wrote: > 2013/3/6 Brooks Moses <brooks_moses@...>: >> [writing offlist so as not to clutter everyone's inboxes] >> >> Hello José, >> >> José Luis García Pallero wrote, at 3/6/2013 2:43 AM: >>> I have an Apple iBook PPC G4 running Debian GNU/Linux and ATLAS 3.8.4 >>> installed. Probably tonight I could try some benchmarks with >>> cblas_sgemm() and cblas_ssyrk(). I will post the results here. The >>> laptop has 1.5 GB of memory. Which range of dimensions you prefer to >>> do the tests? 100 to 1000, 1000 to 10000? >> >> The smaller range would be ideal -- and thank you very much for offering! 
> > Hello: > > Apple iBook G4, CPU type: PowerPC G4 (1.5), Debian GNU/Linux, ATLAS > 3.8.4 from the Debian repositories (NOT compiled by myself, but as > there are not much variety of PPC G4, it should be well optimized see > results compared with veclib for double in the original post of this > thread). > > The tests was performed for M=N=K=100, 200, ..., 1000 using > cblas_sgemm and cblas_ssyrk. The results are (each value was computed > after 10 runs for each dimensions values): > > GEMM: M=N=K= 100 > 0.533 GFLOPS/s > GEMM: M=N=K= 200 > 0.643 GFLOPS/s > GEMM: M=N=K= 300 > 0.783 GFLOPS/s > GEMM: M=N=K= 400 > 0.767 GFLOPS/s > GEMM: M=N=K= 500 > 0.759 GFLOPS/s > GEMM: M=N=K= 600 > 0.808 GFLOPS/s > GEMM: M=N=K= 700 > 0.803 GFLOPS/s > GEMM: M=N=K= 800 > 0.822 GFLOPS/s > GEMM: M=N=K= 900 > 0.821 GFLOPS/s > GEMM: M=N=K= 1000 > 0.819 GFLOPS/s > > SYRK: M=N=K= 100 > 0.371 GFLOPS/s > SYRK: M=N=K= 200 > 0.502 GFLOPS/s > SYRK: M=N=K= 300 > 0.649 GFLOPS/s > SYRK: M=N=K= 400 > 0.653 GFLOPS/s > SYRK: M=N=K= 500 > 0.658 GFLOPS/s > SYRK: M=N=K= 600 > 0.721 GFLOPS/s > SYRK: M=N=K= 700 > 0.724 GFLOPS/s > SYRK: M=N=K= 800 > 0.755 GFLOPS/s > SYRK: M=N=K= 900 > 0.756 GFLOPS/s > SYRK: M=N=K= 1000 > 0.762 GFLOPS/s > > For dimensions of 100 and 200 the performance varies between 0.5 and > 0.75 with GEMM, but with SYRK is more stable. > > I'm very surprised, the performance for single precision is worst as > for double. ??? > For double I obtain performances about 1 GFLOPS/s with this machine > (with ATLAS in Linux and with Apple's veclib in MAC OS X Tiger). I've > tested in my intel pentium m 1.33 GHz laptop and for double the > performance is about 1 GFLOPS/s but for single is about 2.5 GFLOPS/s > which is spected. > > I don't know why in this case the single precission functions are > slower. Probably Clint could give some hints. > > By now I have not enough time in order to compile the atlas 3.10.1 > version, so I can offer only these results. 
> > We can argue that the problem is the debian compilation, but for > double presision the results are similar as with apple's veclib in OS > X > > Cheers > > -- > ***************************************** > José Luis García Pallero > jgpallero@... > (o< > / / \ > V_/_ > Use Debian GNU/Linux and enjoy! > ***************************************** 
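Ian's peak-rate arithmetic can be written out explicitly. A small sketch (plain Python; the 8 flops/cycle AltiVec figure and the 1.33 GHz clock are the numbers quoted in the thread, and 0.819 GFLOPS is the best SGEMM rate from the benchmark results above):

```python
def peak_gflops(flops_per_cycle, clock_ghz):
    # Theoretical peak rate: flops per cycle times cycles per second.
    return flops_per_cycle * clock_ghz

# AltiVec single precision on the G4: 8 flops/cycle per core,
# giving 10.64 GFLOPS at 1.33 GHz -- "south of 10.6" in practice.
altivec_peak = peak_gflops(8, 1.33)

# Best measured SGEMM rate from the Debian ATLAS 3.8.4 build above.
measured = 0.819
efficiency = measured / altivec_peak  # fraction of AltiVec peak achieved

print(round(altivec_peak, 2), round(efficiency, 3))  # prints 10.64 0.077
```

An efficiency below 10% of vector peak is consistent with Ian's suggestion that the hot loop is a simple unoptimized scalar loop rather than AltiVec code.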
From: José Luis García Pallero <jgpallero@gm...> - 2013-03-06 23:35:02

2013/3/7 Ian Ollmann <iano@...>: > > AltiVec on these G4s did single precision but not double precision. Peak > throughput was 8 single precision flops per cycle per core. For double > precision it was formally 1. However, the FPU was only capable of keeping 4 > out of 5 stages busy due to under provisioned reservation stations, so the > peak theoretical performance for single precision in something like sgemm > was about 5x what you could get for double. For a unicore machine running at > 1.33 GHz, we would expect something south of 10.6 GFlops for single > precision. Based on the results you present, it would not be surprising if > you eventually discover that you are running a simple unoptimized scalar > loop for most of the calculation. > > A formulation of the sgemm that had the double precision accumulators would > have single slower than double because of the extra work to convert all the > single precision values to double. If you have a tool capable of sampling > the code while running at instruction granularity, sending us the assembly > code for the hot loop should allow quick discovery as to the problem. I could probably try next week to compile atlas 3.10.1 on my G4 laptop. Which flags could I pass to the configure script in order to speed up the compilation for the G4? Cheers > > Ian > > > On Mar 6, 2013, at 3:07 PM, José Luis García Pallero <jgpallero@...> > wrote: > > 2013/3/6 Brooks Moses <brooks_moses@...>: > > [writing offlist so as not to clutter everyone's inboxes] > > Hello José, > > José Luis García Pallero wrote, at 3/6/2013 2:43 AM: > > I have an Apple iBook PPC G4 running Debian GNU/Linux and ATLAS 3.8.4 > installed. Probably tonight I could try some benchmarks with > cblas_sgemm() and cblas_ssyrk(). I will post the results here. The > laptop has 1.5 GB of memory. Which range of dimensions you prefer to > do the tests? 100 to 1000, 1000 to 10000? > > > The smaller range would be ideal -- and thank you very much for offering! 
> > We can argue that the problem is the debian compilation, but for > double presision the results are similar as with apple's veclib in OS > X > > Cheers > > -- > ***************************************** > José Luis García Pallero > jgpallero@... > (o< > / / \ > V_/_ > Use Debian GNU/Linux and enjoy! > ***************************************** -- ***************************************** José Luis García Pallero jgpallero@... (o< / / \ V_/_ Use Debian GNU/Linux and enjoy! ***************************************** 
From: Brooks Moses <brooks@co...> - 2013-03-06 23:44:35

Brooks Moses wrote, at 3/6/2013 3:13 PM: > I wonder if the Debian package was actually compiled with AltiVec > enabled. I know for a while that it was quietly turned off in the > Fedora packages because they ran into a compile error with it. (IIRC, > it was a trivial one that would have been easy to work around, but such > was the state of Fedora's interest in Power at the time that nobody > bothered.) For the record, it looks like this supposition is indeed the case -- see this post in the Debian mailing-list archives: http://debian.2.n7.nabble.com/ATLASonPowerPCtp1997485p1997495.html To quote, from the Debian maintainer responsible for their ATLAS packages: > Looking again at Atlas buildsystem (in CONFIG/src/atlcomp.txt), it is > clear that POWER3 is the less specific. In particular it gets no > CPU-specific GCC flags, contrary to G4 (which gets -maltivec > -mabi=altivec -mcpu=7400 -mtune=7400). So I am going to enforce POWER3. I assume we can take as read the usual rant about how OS packagers compile ATLAS for the lowest common denominator and thus end up completely hobbling the performance. :) - Brooks 
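Following on from that supposition, one quick spot-check for whether an installed ATLAS binary actually contains AltiVec code is to disassemble it and count vector opcodes such as vmaddfp, the AltiVec single-precision fused multiply-add. This is a sketch, not a verified recipe: it assumes GNU binutils on the PowerPC machine, and the library path will differ by distribution.

```shell
# Count AltiVec FMA instructions in the static library. A scalar-only
# build (e.g. one configured with the generic POWER3 flags) should
# report 0, while a build compiled with -maltivec should report many.
objdump -d /usr/lib/libatlas.a | grep -c 'vmaddfp'
```

The same check with fmadd (the scalar FPU multiply-add) should find hits in either build, which makes the pair a cheap way to tell the two configurations apart.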