Re: [Math-atlas-devel] Re: config/xconfig problems

In message <200...@na...> (on 2 December 2004 22:27:15 -0500), rw...@cs... (R Clint Whaley) wrote:
>>>platform once they made the astoundingly stupid announcement they were
>>>killing off their own successful platform in favor of the unproven IA64.
>>
>>Understand! I think they may have backed off on that decision a bit...
>
>Looks like it.  I thought they weren't even supposed to be selling MIPS
>systems by now.

Right. They essentially admitted otherwise when they came out with the
R14000 and R16000.

>I see the Mhz isn't too awesome.

Depends on how you look at it, and on what code one is running (how well it
takes advantage of the chip being superscalar, and things like that). For
instance, while technically (as far as I can tell - I'm a biologist by
training, not a computer scientist and still less a hardware person!) the
latency went _up_ by one between the R10000 and R12000, the latter is better
at scheduling things out of order, at not blocking on cache accesses,
etcetera, and generally runs at a higher MHz.

We originally went with the SGIs because we do a lot of molecular graphical
work (including moving molecules around under user control, in real-time,
with 3-D (polarizing goggles) graphics) and SGIs were the best and still are
among the best for that (most crystallography labs will have at least one
SGI around - most often a workstation or a workstation that's acting as a
front-end to a computational server (Origin variety)). (My advisor - who
started working with computers in the 60's, incidentally - is a physical
biochemist; I am working on combining two fields, namely phylogenetics and
protein structural modeling, that haven't been talking much, and which both
involve multiple NP-hard problems...) When the SGIs wear
out, though, we'll probably move to PCs running Linux.

>Does your shop use these for legacy, or do you have some of the newer guys
>as well?

Depends on what you mean by "newer". With regard to legacy, we have some
R5000s (Indys), an R10000 Indigo2, and an R4400 Indigo that are currently
sitting on a bench not plugged in; I may see if I can get the R10000 running
for computational purposes and the R5000s might be candidates for running
Linux. The R4400 is a terminal at this point...

>What chips have you got?

Running or soon-to-be-running machines (* means preferred for heavy
computation):
	*Origin 2000, 8xR10000, 250MHz; a bit low on memory, though (4GB) -
		may see if I can upgrade the memory for it to 8GB; currently
		in the process of getting up and running, delayed by lack of
		time (I'm a graduate student, so you know what I mean...)
		and worries about electricity (we are needing a 220-volt UPS
		for this one...)
	*3 Octanes, 2xR12000, 300MHz; variable memory (512MB-1024MB, with the
		    first being rather low); currently getting up and
		    fully running (e.g., upgrading compilers to 7.3.1.3m)
	*1 Octane, 2xR12000, 270MHz; memory is fine (2GB); current machine
		   I'm testing out ATLAS on
	*Origin 200, 2xR10000, 180MHz; NFS server, which does place some
		limits on how heavily it can be loaded by anything else
	1 Octane, 1xR10000, 195MHz; low memory (128MB!) - working on
		  upgrading it; this is the one on my desk
	6 O2s, 1xR5000, all but one 200MHz (the one is 300MHz); low memory
	       (128MB) but that's enough for what they run
Of these, I will have exclusive access to the Origin2000 (and the Octane on
my desk, of course) and as much access to the others as I need, as long as
it doesn't cause too many problems (there are times when, because students
have projects due, all the upstairs computers need to be available).

>I've never installed on an R16K.

Same here. Quite a lot of our machines are either ones we've had for a while
(like the Origin200) or ones that are
used/refurbished/remanufactured (e.g., the Origin2000 is a rack unit that
was converted to desk-side and didn't have a shell (plastic outside on the
case) when we got it...). Academics and funding...

>I think the last machine
>I had access to was an R12K (though it is possible it was an R10K) . . .

5 years or so ago, yeah, those would have been the most common types for
high-end SGI systems. R14000 came out in 1999 or so.

>Back when I was working on this machine, ATLAS+their compiler would get you
>around ~5% of their optimized BLAS.  If they hadn't announced their intention
>to kill the system, I'd've probably scoped out where that last 5% was by
>now (the gap is probably greater by now anyway) . . .

Well, my current results are showing ATLAS+their compiler with a 3-4x SpdUp
on GEMM for most problems (ATLAS seems to have some difficulties with high-K
problems; I am currently checking out various CacheEdges on this) relative
to /usr/lib32/mips4/libblas.a (that should be the one in use with
USE_F77_BLAS defined and BLASlib set to that, right? just making sure...);
however:
	 A. That's the non-multiprocessor version of their BLAS, on a
	    two-proc machine, whereas ATLAS is compiled for pthreads. The
	    reason for this is that the _mp version of their BLAS is using
	    sproc instead of pthreads, and thus can't be linked with
	    anything using pthreads (and I would not, therefore, want to use
	    it anyway, since it would force anything using it to either not
	    use multiprocessing _or_ to be nonportable by using sproc, which
	    is SGI-specific).
	 B. That is also the older version of their BLAS; the newer one is
	    in the SCSL (/usr/lib32/mips4/libscs.so or
	    /usr/lib32/mips4/libscs_mp.so), but that's a CBLAS as well as an
	    F77 BLAS, plus LAPACK (and other stuff like FFT) and I haven't
	    quite figured out how to get it to not override ATLAS' CBLAS -
	    should be able to when I get a chance to work on it (grading to
	    do this weekend...).
	 C. They may well have put more emphasis into optimizing the -64
	    version of BLAS, while I prefer to work with -n32 programs to
	    avoid some worries about types not being what the program
	    expects, enable portability to non-64-bit systems, etcetera.
This is on one system only thus far (until the Origin2000 is up and
running). However, I may well be able to pump it up by putting in prefetch
instructions - for the Origin2000 especially, given NUMA and that the R10000
has some problems with L2 cache blocking. (I'm afraid these can't go through
atlas_prefetch.h, though, except for using the functions defined by it as
info for where to stick in the necessary pragmas...)

>>Got it. It looks like a lot of the roundoff et al things were put in place
>>with increased software pipelining, and I suspect you'd have disabled
>>software pipelining for MCC if it were doing very much at the time.
>
>Yeah, ATLAS'll do the pipelining for you.

Thought so. Sigh... may wind up doing two different MCC/MMFLAGS settings and
test each of them (plus perhaps CC/CCFLAGS, as long as I'm at it), _if_ I
can figure out a way to do it automatically (if it would involve too much
string/pointer/etcetera manipulation in C, this will probably necessitate my
doing it in Perl so it doesn't take me forever...).

>I thought back in the day that the ISA actually had a combined multiply/add

It does.

>(though I had the impression this just reduced the latency by avoiding a
>write-to-register step, and that you could get the same throughput using
>seperate multiply adds).  Is this not the case?

I suspect you are correct that one can get the same throughput using
separate ones if they're properly scheduled, especially since there are
actually two FP pipelines (one for multiply and one for add); this may vary
between processors, however - the R12000 has some improvements to the FP
unit, including some scheduling differences. ATLAS is not currently showing
improvements from using combined multiply-add; I'm not sure the test can be
trusted, though, since it's the same one that is having problems detecting
the proper latency & the number of registers (FP registers are 48 for the
R12000, while the main CPU has 32 registers in the R12000, BTW; I _believe_
it's 32 in both cases for the R10000, but my memory is not reliable).
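
To be clear about which two code shapes I mean, here is a minimal sketch
using a dot product. The function names are mine (hypothetical, not
anything from ATLAS), and of course the compiler is free to turn either form
into the other, so this only shows the source-level difference:

```c
#include <stddef.h>

/* Combined form: each iteration is a single multiply-add, acc += x*y. */
double dot_muladd(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Separate form: the multiply for element i is issued in the same
 * iteration as the add for element i-1, so the two operations in the
 * loop body do not depend on each other and could, in principle, go
 * down separate multiply and add pipelines. */
double dot_separate(const double *x, const double *y, size_t n)
{
    double acc = 0.0, m;
    size_t i;

    if (n == 0)
        return 0.0;
    m = x[0] * y[0];            /* prime the pipeline */
    for (i = 1; i < n; i++) {
        acc += m;               /* add for element i-1 */
        m = x[i] * y[i];        /* multiply for element i */
    }
    acc += m;                   /* drain the last product */
    return acc;
}
```

Both do the same adds in the same order; the only difference is how far
apart the multiply and its dependent add are.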

>>Understand! Some testing would seem to be indicated...
>
>Yeah, its possible even for the same chips, that the compiler has changed
>enough . . .

I strongly suspect so - and this isn't even the latest version of the
compiler (we aren't licensed for the latest version except on one machine at
a time (licensing daemon), which would not be very workable; using the
auto-parallelization option for netlib's LAPACK, however, looks quite
possible, since even though we could only run it on one machine at a time,
it can be made to spit out code with the OpenMP mods to F77 (or C) already
in place). There are now enough general optimization, loop nest
optimization, IPA/inlining, and various runtime debugging options, plus
various environment variables and machine compiler defaults files, that
_each_ of these has its own manpage - and that's not even counting the
multiprocessing and auto-parallelizing stuff, which are in online
books... Searching through all the options is no longer a realistic
possibility.

>>It would seem to make more sense, yes, but unfortunately doesn't seem to
>>work in this case due to the line length limits. There's also that if CC or
>>MCC ever got used without the flags, it would be problematic since some of
>>the flags are necessary to tell the compiler things that, if not set
>>properly, would result in a nonlinkable file (-o32/n32/64, -mips3/mips4,
>>etcetera). (There are machine-level defaults, but those are not always
>>what one would want...)
>
>Yeah, I have thrown some flags in non-flag areas, but it's usually something
>funky like 64/32 bits . . .

Another reason is that I wanted to put as many of the flags common to CC and
CLINKER as I could fit into CC, so that using CLINKER as $(CC) would
minimize the number of flags that had to be duplicated in both CCFLAGS and
CLINKFLAGS.

>>>I'm not sure this is used anymore . . .
>>
>>It appears to play a role (via atlas_[pre]mvN.h) in ATL_GetPartMVN, which is
>>in turn used by a number of files:
>
>Where is ATL_mvpagesize used? GetPart is used a bunch in the L2, but I
>didn't think it used ATL_mvpagesize?

Well, it may not make any difference, depending on the inputs, but
atlas_dmvN.h gets output as:

#define ATL_GetPartMVN(A_, lda_, mb_, nb_) \
{ \
   *(nb_) = (ATL_L1mvelts - (ATL_mvNMU<<1)) / ((ATL_mvNMU<<1)+1); \
   if (ATL_mvpagesize > (lda_)) \
   { \
      *(mb_) = (ATL_mvpagesize / (lda_)) * ATL_mvntlb; \
      if ( *(mb_) < *(nb_) ) *(nb_) = *(mb_); \
   } \
   else if (ATL_mvntlb < *(nb_)) *(nb_) = ATL_mvntlb; \
   if (*(nb_) > ATL_mvNNU) *(nb_) = (*(nb_) / ATL_mvNNU) * ATL_mvNNU; \
   else *(nb_) = ATL_mvNNU; \
   *(mb_) = (ATL_L1mvelts - *(nb_) * (ATL_mvNMU+1)) / (*(nb_)+2); \
   if (*(mb_) > ATL_mvNMU) *(mb_) = (*(mb_) / ATL_mvNMU) * ATL_mvNMU; \
   else *(mb_) = ATL_mvNMU; \
}
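
To make the role of ATL_mvpagesize concrete, here is a rough sketch of the
same logic as a plain C function. The struct, the function name, and every
number I plug in below are hypothetical (the real macro uses compile-time
constants, and it ignores A_, so I've left that argument out), and the
comments on what each branch is for are only my reading of it:

```c
#include <assert.h>

/* Hypothetical stand-ins for the ATL_* constants in atlas_dmvN.h. */
typedef struct {
    int l1elts;    /* ATL_L1mvelts   */
    int nmu;       /* ATL_mvNMU      */
    int nnu;       /* ATL_mvNNU      */
    int pagesize;  /* ATL_mvpagesize */
    int ntlb;      /* ATL_mvntlb     */
} mv_params;

/* Same arithmetic as the ATL_GetPartMVN macro, step for step. */
void get_part_mvn(const mv_params *p, int lda, int *mb, int *nb)
{
    assert(lda > 0);            /* lda is used as a divisor below */

    /* Starting guess for nb from the L1 size alone. */
    *nb = (p->l1elts - (p->nmu << 1)) / ((p->nmu << 1) + 1);

    if (p->pagesize > lda) {
        /* More than one column per page (my reading): cap nb at the
         * number of columns that ntlb pages can hold. */
        *mb = (p->pagesize / lda) * p->ntlb;
        if (*mb < *nb)
            *nb = *mb;
    } else if (p->ntlb < *nb) {
        /* A column spans a page or more: cap nb at ntlb. */
        *nb = p->ntlb;
    }

    /* Round nb down to a multiple of nnu, with nnu as the minimum. */
    if (*nb > p->nnu)
        *nb = (*nb / p->nnu) * p->nnu;
    else
        *nb = p->nnu;

    /* Derive mb from what is left, rounded the same way against nmu. */
    *mb = (p->l1elts - *nb * (p->nmu + 1)) / (*nb + 2);
    if (*mb > p->nmu)
        *mb = (*mb / p->nmu) * p->nmu;
    else
        *mb = p->nmu;
}
```

With made-up values l1elts=4096, nmu=32, nnu=1, ntlb=56 and lda=5000, a
pagesize of 2048 gives nb=56 (capped by ntlb), while a pagesize of 65536
leaves nb=62; mb comes out as 32 either way. So the page size can change the
partitioning, but only for some inputs.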

	Yours,

	-Allen

P.S. ATL_Cachelen - should this be 64 for a 64-bit machine? Admittedly, the
L1 data cache on R1[02]000 uses 32-byte cache lines I _believe_, although the
L2 cache (combined data/instruction) uses 64-byte cache lines IIRC...

-- 
Allen Smith                       http://cesario.rutgers.edu/easmith/
February 1, 2003                               Space Shuttle Columbia
Ad Astra Per Aspera                     To The Stars Through Asperity