Great work! It's very cool that comparatively old code like this is being updated for modern machines.
I'm expecting delivery of a 48-core AMD machine next week, so if you send me the circuit, I could do some comparisons.
On that note: Is there any chance that the matrix code will also be parallelized in the future? I'm not using transistors and therefore don't benefit from this particular parallelization.
-----Holger Vogt <holger.vogt@...> wrote: -----
To: "Ngspice developers mailing list." <ngspice-devel@...>
From: Holger Vogt <holger.vogt@...>
Date: 09.06.2010 07:52PM
Subject: [Ngspice-devel] ngspice parallel processing on multi-core CPUs using OpenMP
Today's computers typically come with CPUs that have more than one core. It
will thus be useful to enhance ngspice to make use of such processors.
Some time ago I analysed where ngspice spends its time.
For circuits consisting mostly of transistors with the BSIM3 model, 2/3 of
the time is spent in the BSIM3load() function. Thus this function should
be parallelized, if possible. The parallel processing then has to be done
within a dedicated device model.
A recent publication (R.K. Perng, T.-H. Weng, and K.-C. Li: "On
Performance Enhancement of Circuit Simulation Using Multithreaded
Techniques", IEEE International Conference on Computational Science and
Engineering, 2009, pp. 158-165) has described a way to exactly do that.
They recommend using OpenMP, which is available on many platforms and is
easy to use, especially if you want to parallelize processing of a
for-loop. I have chosen the BSIM3 version 3.3.0 model, located in the
BSIM3 directory, as an example. The BSIM3load() function in b3ld.c
contains two nested for-loops using linked lists (models and instances,
i.e. individual transistors). Unfortunately OpenMP requires a loop with
an integer index. So in file b3set.c an array is defined, filled with
pointers to all instances of BSIM3 and stored in model->BSIM3InstanceArray.
BSIM3load() is now a wrapper function containing the for-loop, which
calls the function BSIM3LoadOMP() once per instance. Inside
BSIM3LoadOMP() the model equations are calculated.
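The list-to-array step described above can be sketched roughly as follows. This is a minimal illustration and not the actual ngspice code: the Model and Instance structs, collect_instances(), load_one() and load_all() are hypothetical stand-ins for the BSIM3 structures, the setup code in b3set.c and the wrapper in b3ld.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the BSIM3 model and instance structs. */
typedef struct Instance {
    struct Instance *next;   /* next instance of the same model */
    double result;           /* placeholder for the per-instance output */
} Instance;

typedef struct Model {
    struct Model *next;      /* next model in the list */
    Instance *instances;     /* head of this model's instance list */
} Model;

/* Walk both linked lists and collect every instance pointer into one
 * flat array, so that a loop with an integer index can visit them.
 * Writes the count to *n; returns NULL if there are no instances or
 * the allocation fails. The caller frees the array. */
Instance **collect_instances(Model *models, int *n)
{
    int count = 0;
    for (Model *m = models; m != NULL; m = m->next)
        for (Instance *i = m->instances; i != NULL; i = i->next)
            count++;

    *n = 0;
    if (count == 0)
        return NULL;
    Instance **array = malloc((size_t)count * sizeof *array);
    if (array == NULL)
        return NULL;

    int idx = 0;
    for (Model *m = models; m != NULL; m = m->next)
        for (Instance *i = m->instances; i != NULL; i = i->next)
            array[idx++] = i;
    assert(idx == count);
    *n = count;
    return array;
}

/* Placeholder for the per-instance model evaluation. */
static void load_one(Instance *here)
{
    here->result = 1.0;
}

/* Wrapper: an integer-indexed loop that OpenMP can split across threads.
 * Without OpenMP enabled the pragma is ignored and the loop runs serially. */
void load_all(Instance **array, int n, int nthreads)
{
    int idx;
#pragma omp parallel for num_threads(nthreads)
    for (idx = 0; idx < n; idx++)
        load_one(array[idx]);
}
```

The array only has to be built once during setup, since the set of instances does not change while the simulation runs.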
Typically you now need to synchronize the activities, in that storing
the results into the matrix has to be guarded. The trick offered by the
authors now is that the storage is moved out of the BSIM3LoadOMP()
function. Inside BSIM3LoadOMP() the updated data are stored in extra
locations locally per instance, defined in bsim3def.h. Only after the
complete for-loop is exercised, the update to the matrix is done in an
extra function BSIM3LoadRhsMat() in the main thread after the
parallelized loop. No extra synchronisation is required.
Then the thread programming needed is only a single line!!
    #pragma omp parallel for num_threads(nthreads) private(here)
introducing the for-loop.
This of course is made possible only thanks to the OpenMP guys and the
clever trick of avoiding synchronisation introduced by the above cited authors.
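The two-phase idea (evaluate in parallel into per-instance storage, then write to the shared matrix serially) might look like the sketch below. It is an illustration under assumptions, not the BSIM3 code: the Device struct, the made-up "model equation", and the names device_eval(), device_stamp() and load() are all hypothetical, and the matrix is reduced to a diagonal and a right-hand-side vector.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device: inputs plus staging fields for its results. */
typedef struct {
    double v;        /* input, stand-in for a terminal voltage */
    int node;        /* matrix row this device contributes to */
    double g_local;  /* staged conductance, written only for this device */
    double i_local;  /* staged current, written only for this device */
} Device;

/* Phase 1 body: touches only fields of *here, so concurrent calls on
 * different devices cannot interfere with each other. */
static void device_eval(Device *here)
{
    here->g_local = 2.0 * here->v;       /* made-up model equation */
    here->i_local = here->v * here->v;   /* made-up model equation */
}

/* Phase 2 body: adds the staged values to the shared matrix and RHS.
 * Several devices may share a node, so this runs in one thread only. */
static void device_stamp(const Device *here, double *diag, double *rhs)
{
    diag[here->node] += here->g_local;
    rhs[here->node] -= here->i_local;
}

void load(Device *devs, int n, double *diag, double *rhs, int nthreads)
{
    int idx;

    assert(devs != NULL && diag != NULL && rhs != NULL);

    /* Phase 1: parallel. The implicit barrier at the end of the loop
     * guarantees that all staged values are complete afterwards. */
#pragma omp parallel for num_threads(nthreads)
    for (idx = 0; idx < n; idx++)
        device_eval(&devs[idx]);

    /* Phase 2: serial update of the shared data, no locking needed. */
    for (idx = 0; idx < n; idx++)
        device_stamp(&devs[idx], diag, rhs);
}
```

The price of this design is some extra memory per instance and a serial second pass, in exchange for having no locks or atomic updates in the expensive part.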
Some results on an inverter chain with 627 CMOS inverters, running for
200ns, compiled with Visual Studio Professional for Windows 7 (full
optimization) or gcc 4.4, SUSE Linux 11.2, -O2, on an i7 860 machine
with four real cores (and four virtual ones using hyperthreading):
Threads         CPU time [s]     CPU time [s]
                (Windows, VS)    (Linux, gcc)
1 (standard)    167              165
1 (OpenMP)      174              167
2               110              110
3               95               94-120
4               83               107
6               94               90
8               93               91
So we see an ngspice speed-up of nearly a factor of two!
Even on an older notebook with a dual-core processor, I got more than
a 1.5x improvement using two threads.
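As a rough plausibility check, these numbers can be compared with Amdahl's law. The sketch below assumes that the 2/3 share of run time quoted earlier for BSIM3load() also holds for this inverter chain and that threading overhead is negligible; both are assumptions, and the function names are only illustrative.

```c
#include <assert.h>

/* Amdahl's law: predicted speed-up when a fraction p of the run time is
 * spread evenly over n threads and the remaining 1 - p stays serial. */
double amdahl_speedup(double p, int n)
{
    assert(n >= 1 && p >= 0.0 && p <= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

/* Measured speed-up from two run times in seconds. */
double measured_speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}
```

With p = 2/3 the prediction is 1.5 for two threads and 2.0 for four, against measured ratios of 167/110 (about 1.52) and 167/83 (about 2.01) in the first column. Under the same assumption the speed-up can never exceed 3, however many cores are used, as long as only the device load is parallelized.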
To not bother you with attached files, I have placed the code (complete
files b3ld.c, b3set.c, bsim3def.h and configure.in) into the ngspice
patch tracker for download.
Under Linux you may run
    ./configure ... --enable-openmp
Under Windows you have to define an additional preprocessor flag USE_OMP,
and then enable OpenMP. Visual Studio Express might not be sufficient
due to lack of OpenMP support.
The number of threads (1 to 8 useful on my machine) has to be set
manually by placing
into spinit or .spiceinit.
If you run a circuit, please remember to select BSIM3 version 3.3.0
(by placing this version number into your parameter files).
During my first tests I was disappointed to obtain CPU times much
larger than without OpenMP, until I realized that the time-measuring
function getrusage() counts ticks from every core, adds them up, and thus
reports a CPU time value enlarged by a factor of 8 if 8 threads have
been chosen. So I have made ngspice use ftime() for time measuring if
OpenMP is selected.
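The difference between the two kinds of time measurement can be sketched as follows. This is a POSIX-only illustration, not the ngspice timing code: it uses clock_gettime() with CLOCK_MONOTONIC as a stand-in for ftime(), and the helper names cpu_seconds() and wall_seconds() are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* User CPU time of the whole process in seconds. getrusage() sums the
 * time consumed by all threads, so with N busy threads this value grows
 * up to N times faster than the elapsed time. */
double cpu_seconds(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec * 1e-6;
}

/* Elapsed (wall-clock) time in seconds from a monotonic clock. This is
 * the quantity that actually shrinks when the work is parallelized. */
double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

Taking both readings before and after a simulation and comparing the two differences shows how many cores were kept busy on average.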
If you run ./configure without --enable-openmp (or without USE_OMP
preprocessor flag under Windows), you will get the standard ngspice.
Cygwin and Mingw are not yet tested.
Please try the code and report any problems.