Re: [Math-atlas-devel] Re: config/xconfig problems
From: Ed A. S. <ea...@be...> - 2004-12-03 11:09:46
In message <200...@na...> (on 2 December 2004 22:27:15 -0500),
rw...@cs... (R Clint Whaley) wrote:

>>>platform once they made the astoundingly stupid announcement they were
>>>killing off their own successful platform in favor of the unproven IA64.
>>
>>Understand! I think they may have backed off on that decision a bit...
>
>Looks like it. I thought they weren't even supposed to be selling MIPS
>systems by now.

Right. They essentially admitted otherwise when they came out with the
R14000 and R16000.

>I see the Mhz isn't too awesome.

Depends on how you look at it, and what code one is doing (how well it
takes advantage of things being superscalar and things like that; for
instance, while technically (as far as I can tell - I'm a biologist by
training, not a computer scientist and still less a hardware person!) the
latency went _up_ by one between the R10000 and R12000, the latter is
better at being able to schedule things out of order, not cache blocking,
etcetera, plus being at a higher MHz in general).

We originally went with the SGIs because we do a lot of molecular graphical
work (including moving molecules around under user control, in real-time,
with 3-D (polarizing goggles) graphics) and SGIs were the best and still
are among the best for that (most crystallography labs will have at least
one SGI around - most often a workstation or a workstation that's acting as
a front-end to a computational server (Origin variety)).

(My advisor - who started working with computers in the 60's, incidentally
- is a physical biochemist; I am working on combining two fields, namely
phylogenetics and protein structural modeling, that haven't been talking
much, and which both involve multiple NP-hard problems...) When the SGIs
wear out, though, we'll probably move to PCs running Linux.

>Does your shop use these for legacy, or do you have some of the newer guys
>as well?

Depends on what you mean by "newer".
With regard to legacy, we have some R5000s (Indys), an R10000 Indigo2, and
an R4400 Indigo that are currently sitting on a bench not plugged in; I may
see if I can get the R10000 running for computational purposes and the
R5000s might be candidates for running Linux. The R4400 is a terminal at
this point...

>What chips have you got?

Running or soon-to-be running machines (* means preferred for heavy
computation):

*Origin 2000, 8xR10000, 250MHz; a bit low on memory, though (4GB) - may see
 if I can upgrade the memory for it to 8GB; currently in the process of
 getting up and running, delayed by lack of time (I'm a graduate student,
 so you know what I mean...) and worries about electricity (we are needing
 a 220-volt UPS for this one...)
*3 Octanes, 2xR12000, 300MHz; variable memory (512MB-1024MB, with the first
 being rather low); currently getting up and fully running (e.g.,
 upgrading compilers to 7.3.1.3m)
*1 Octane, 2xR12000, 270MHz; memory is fine (2GB); current machine I'm
 testing out ATLAS on
*Origin 200, 2xR10000, 180MHz; NFS server, which does place some limits on
 how heavily it can be loaded by anything else
 1 Octane, 1xR10000, 195MHz; low memory (128MB!) - working on upgrading it;
 this is the one on my desk
 6 O2s, 1xR5000, all but one 200MHz (the one is 300MHz); low memory (128MB)
 but that's enough for what they run

Of these, I will be having exclusive access to the Origin 2000 (and the
Octane on my desk, of course) and have as much access to the others as I
need as long as it doesn't cause too many problems (there are times that,
due to students having projects due, all the upstairs computers need to be
available).

>I've never installed on an R16K.

Same here. Quite a lot of our machines are either ones we've had for a
while (like the Origin 200) or ones that are used/refurbished/remanufactured
(e.g., the Origin 2000 is a rack unit that was converted to desk-side and
didn't have a shell (plastic outside on the case) when we got it...).
Academics and funding...
>I think the last machine
>I had access to was an R12K (though it is possible it was an R10K) . . .

5 years or so ago, yeah, that would have been the most common types for
high-end SGI systems. R14000 came out in 1999 or so.

>Back when I was working on this machine, ATLAS+their compiler would get you
>around ~5% of their optimized BLAS. If they hadn't announced their intention
>to kill the system, I'd've probably scoped out where that last 5% was by
>now (the gap is probably greater by now anyway) . . .

Well, my current results are showing ATLAS+their compiler with a 3-4x SpdUp
on GEMM for most problems (ATLAS seems to have some difficulties with
high-K problems; I am currently checking out various CacheEdges on this)
relative to /usr/lib32/mips4/libblas.a (that should be the one in use with
USE_F77_BLAS defined and BLASlib set to that, right? just making sure...);
however:

A. That's the non-multiprocessor version of their BLAS, on a two-proc
   machine, whereas ATLAS is compiled for pthreads. The reason for this is
   that the _mp version of their BLAS is using sproc instead of pthreads,
   and thus can't be linked with anything using pthreads (and I would not,
   therefore, want to use it anyway, since it would force anything using it
   to either not use multiprocessing _or_ to be nonportable by using sproc,
   which is SGI-specific).

B. That is also the older version of their BLAS; the newer one is in the
   SCSL (/usr/lib32/mips4/libscs.so or /usr/lib32/mips4/libscs_mp.so), but
   that's a CBLAS as well as an F77 BLAS, plus LAPACK (and other stuff like
   FFT) and I haven't quite figured out how to get it to not override
   ATLAS' CBLAS - should be able to when I get a chance to work on it
   (grading to do this weekend...).

C. They may well have put more emphasis into optimizing the -64 version of
   BLAS, while I prefer to work with -n32 programs to avoid some worries
   about types not being what the program expects, enable portability to
   non-64-bit systems, etcetera.
This is on one system only thus far (until the Origin 2000 is up and
running). However, I may well be able to pump it up - for the Origin 2000
especially, given NUMA and that the R10000 has some problems with L2 cache
blocking - by putting in prefetching instructions. (I'm afraid these will
need to be not through atlas_prefetch.h, though, except for using the
functions defined by it as info for where to stick in the necessary
pragmas...)

>>Got it. It looks like a lot of the roundoff et al things were put in place
>>with increased software pipelining, and I suspect you'd have disabled
>>software pipelining for MCC if it were doing very much at the time.
>
>Yeah, ATLAS'll do the pipelining for you.

Thought so. Sigh... may wind up doing two different MCC/MMFLAGS settings
and test each of them (plus perhaps CC/CCFLAGS, as long as I'm at it), _if_
I can figure out a way to do it automatically (if it would involve too much
string/pointer/etcetera manipulation in C, this will probably necessitate
my doing it in Perl so it doesn't take me forever...).

>I thought back in the day that the ISA actually had a combined multiply/add

It does.

>(though I had the impression this just reduced the latency by avoiding a
>write-to-register step, and that you could get the same throughput using
>separate multiply adds). Is this not the case?

I suspect you are correct that one can get the same thing using separate
ones if they're properly scheduled, especially since there are actually two
FP pipelines (one for multiply and one for add); this may vary between
processors, however - R12000 has some improvements relative to the FP unit
including some scheduling differences.
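To make the "separate multiply and add, properly scheduled" idea concrete, here is a minimal C sketch of a dot product written both ways. The function names dot_simple and dot_skewed are made up for this illustration and are not ATLAS code. The second version starts the multiply for one element before doing the add for the previous element, so an independent multiply and add are available in each iteration. Whether that actually matches the combined multiply-add in throughput on a given R10000 or R12000 is something that would have to be measured.

```c
#include <stddef.h>

/* Straightforward form: multiply, then immediately add the product.
 * On an ISA with a combined multiply-add, a compiler may turn each
 * iteration into one fused instruction. */
double dot_simple(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Separate multiply and add, skewed by one iteration (hand software
 * pipelining): the product for element i is computed before the add
 * for element i-1, so the two operations in the loop body do not
 * depend on each other. */
double dot_skewed(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    double prod;
    size_t i;

    if (n == 0)
        return 0.0;
    prod = x[0] * y[0];              /* prologue: first multiply */
    for (i = 1; i < n; i++) {
        double next = x[i] * y[i];   /* multiply for this element */
        acc += prod;                 /* add for the previous element */
        prod = next;
    }
    return acc + prod;               /* epilogue: last add */
}
```

Both versions add the products in the same order, so they agree exactly whenever no fused instruction is generated; with fusing, the first version can differ in the last bit because the product is not rounded before the add.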
ATLAS is not currently showing improvements due to using combined
multiply-add; not sure if that's the test, though, since that's the same
one that is showing problems detecting the proper latency & the number of
registers (FP registers are 48 for the R12000, while the main CPU has 32
registers in the R12000, BTW; I _believe_ it's 32 in both cases for the
R10000, but my memory is not reliable).

>>Understand! Some testing would seem to be indicated...
>
>Yeah, it's possible even for the same chips, that the compiler has changed
>enough . . .

I strongly suspect so - and this isn't even the latest version of the
compiler (we aren't licensed for the latest version except on one machine
at a time (licensing daemon), which would not be very workable; using the
auto-parallelization option for netlib's LAPACK, however, looks quite
possible, since even though we could only run it on one machine at a time,
it can be made to spit out code with the OpenMP mods to F77 (or C) already
in place).

There are now enough general optimization, loop nest optimization,
IPA/inlining, and various runtime debugging options, plus various
environment variables and machine compiler defaults files, that _each_ of
these has its own manpage - and that's not even counting the
multiprocessing and auto-parallelizing stuff, which are in online books...
Searching through all the options is no longer a realistic possibility.

>>It would seem to make more sense, yes, but unfortunately doesn't seem to
>>work in this case due to the line length limits. There's also that if CC or
>>MCC ever got used without the flags, it would be problematic since some of
>>the flags are necessary to tell the compiler things that, if not set
>>properly, would result in a nonlinkable file (-o32/n32/64, -mips3/mips4,
>>etcetera). (There are machine-level defaults, but those are not always
>>what one would want...)
>
>Yeah, I have thrown some flags in non-flag areas, but it's usually something
>funky like 64/32 bits . . .
Another reason is that I wanted to put as much of the flags as I could fit
that were common between CC and CLINKER into CC so that using CLINKER as
$(CC) would minimize the number of duplicated flags that had to be in both
the CCFLAGS and CLINKFLAGS.

>>I'm not sure this is used anymore . . .
>>
>>It appears to play a role (via atlas_[pre]mvN.h) in ATL_GetPartMVN, which is
>>in turn used by a number of files:
>
>Where is ATL_mvpagesize used? GetPart is used a bunch in the L2, but I
>didn't think it used ATL_mvpagesize?

Well, it may not make any difference, depending on the inputs, but
atlas_dmvN.h gets output as:

   #define ATL_GetPartMVN(A_, lda_, mb_, nb_) \
   { \
      *(nb_) = (ATL_L1mvelts - (ATL_mvNMU<<1)) / ((ATL_mvNMU<<1)+1); \
      if (ATL_mvpagesize > (lda_)) \
      { \
         *(mb_) = (ATL_mvpagesize / (lda_)) * ATL_mvntlb; \
         if ( *(mb_) < *(nb_) ) *(nb_) = *(mb_); \
      } \
      else if (ATL_mvntlb < *(nb_)) *(nb_) = ATL_mvntlb; \
      if (*(nb_) > ATL_mvNNU) *(nb_) = (*(nb_) / ATL_mvNNU) * ATL_mvNNU; \
      else *(nb_) = ATL_mvNNU; \
      *(mb_) = (ATL_L1mvelts - *(nb_) * (ATL_mvNMU+1)) / (*(nb_)+2); \
      if (*(mb_) > ATL_mvNMU) *(mb_) = (*(mb_) / ATL_mvNMU) * ATL_mvNMU; \
      else *(mb_) = ATL_mvNMU; \
   }

Yours,

-Allen

P.S. ATL_Cachelen - should this be 64 for a 64-bit machine? Admittedly, the
L1 data cache on R1[02]000 uses 32-bit cache lines I _believe_, although
the L2 cache (combined data/instruction) uses 64-bit cache lines IIRC...

-- 
Allen Smith                          http://cesario.rutgers.edu/easmith/
February 1, 2003                     Space Shuttle Columbia
Ad Astra Per Aspera                  To The Stars Through Asperity
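As a rough check of how ATL_mvpagesize enters the ATL_GetPartMVN macro quoted above, the following C sketch wraps the macro exactly as posted and evaluates it for two leading dimensions. All five tuning values are made-up numbers chosen only for illustration (a real atlas_dmvN.h gets its own from the install), and the wrapper function part_mvn is hypothetical rather than part of ATLAS.

```c
/* Hypothetical tuning values, for illustration only; these are NOT
 * taken from any real ATLAS install. */
#define ATL_L1mvelts   3072
#define ATL_mvNMU      32
#define ATL_mvNNU      4
#define ATL_mvpagesize 2048
#define ATL_mvntlb     8

/* The macro as quoted in the message above. */
#define ATL_GetPartMVN(A_, lda_, mb_, nb_) \
{ \
   *(nb_) = (ATL_L1mvelts - (ATL_mvNMU<<1)) / ((ATL_mvNMU<<1)+1); \
   if (ATL_mvpagesize > (lda_)) \
   { \
      *(mb_) = (ATL_mvpagesize / (lda_)) * ATL_mvntlb; \
      if ( *(mb_) < *(nb_) ) *(nb_) = *(mb_); \
   } \
   else if (ATL_mvntlb < *(nb_)) *(nb_) = ATL_mvntlb; \
   if (*(nb_) > ATL_mvNNU) *(nb_) = (*(nb_) / ATL_mvNNU) * ATL_mvNNU; \
   else *(nb_) = ATL_mvNNU; \
   *(mb_) = (ATL_L1mvelts - *(nb_) * (ATL_mvNMU+1)) / (*(nb_)+2); \
   if (*(mb_) > ATL_mvNMU) *(mb_) = (*(mb_) / ATL_mvNMU) * ATL_mvNMU; \
   else *(mb_) = ATL_mvNMU; \
}

/* Hypothetical wrapper so the partition sizes can be inspected.
 * The macro never uses its first argument, so a null pointer is
 * passed for it. */
void part_mvn(int lda, int *mb, int *nb)
{
    ATL_GetPartMVN((const double *)0, lda, mb, nb);
}
```

With these made-up values, part_mvn(1000, &mb, &nb) takes the ATL_mvpagesize branch and gives mb = 128, nb = 16, while part_mvn(5000, &mb, &nb) takes the else branch and gives mb = 256, nb = 8. If ATL_mvntlb is raised to 56, both calls give mb = 32, nb = 44, which matches the remark that the page size may not make any difference depending on the inputs.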