From: Calin A. <cal...@gm...> - 2013-04-18 20:49:06
Hi,

I'll get the article in the next few days. However, right now I don't have a
working OpenMP build; for a fair comparison I should run both on the same
machine, and I'll only have time for that after I release my app (maybe one
month from now?). In any case, the two options (OMP and PT) co-exist nicely
in my code.

What I did:

1. I don't save Rhs separately and copy it all at the end; I just protect
   the critical part with a mutex.
2. The worker threads compete for data (fully dynamic work allocation):
   when a thread finishes an instance, it grabs the next available one.
   The main thread just waits for them.
3. The worker threads are created once and live forever, waiting for work.
4. The changes are smaller than the ones for OpenMP (fewer files touched),
   but there is an extra file with utilities.
5. It would look much nicer with pthread_barrier_wait(), but apparently
   that is optional and does not exist on all systems.

My numbers, on 4 cores (slow machine) running a BSIM4 4-bit adder, no
bypass option:

  threads               | cpu seconds | elapsed seconds
  0 (no MP compiled in) |     318     |      370
  1                     |     317     |      376
  2                     |     356     |      248
  3                     |     360     |      191
  4                     |     370     |      166
  6                     |     383     |      166
  8                     |     396     |      166

Conclusions:

1. Almost no penalty for having it compiled in, thanks to the separate loop
   for the 1-thread case. Without that trick the numbers are 334/400 for one
   thread. The small penalty that remains comes from the mutex calls at the
   end of b3ld.c (I didn't "if()" those out, only #ifdef them).
2. For threads > cores the cpu time increases with no gain in elapsed time.
   Quite predictable. It would be interesting to test on a machine with more
   cpus; the circuit probably has to be quite big (many BSIM instances) to
   see improvement above 4 threads even with more cores.
3. More testing is needed (all this is in beta state); maybe some regression
   tests to make sure I'm not doing something very wrong.

Below is the bulk of the code. BTW, "good = BSIM3LoadOMP(here, ckt);" is NOK
in the OMP code: it overwrites the error flag, so only the error from the
last instance is returned.
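Regarding point 5: a barrier is easy to build portably from a mutex and a
condition variable, which do exist everywhere. A minimal sketch (the
my_barrier_* names are made up for illustration, not part of ngspice or
POSIX):

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int      count;   /* threads still expected in the current cycle */
    int      total;   /* threads per cycle */
    unsigned cycle;   /* generation counter, lets the barrier be reused */
} my_barrier_t;

int my_barrier_init(my_barrier_t *b, int nthreads)
{
    b->count = b->total = nthreads;
    b->cycle = 0;
    pthread_mutex_init(&b->mutex, NULL);
    pthread_cond_init(&b->cond, NULL);
    return 0;
}

/* Returns 1 to exactly one caller (the last to arrive), 0 to the others,
 * mimicking PTHREAD_BARRIER_SERIAL_THREAD. */
int my_barrier_wait(my_barrier_t *b)
{
    pthread_mutex_lock(&b->mutex);
    unsigned cycle = b->cycle;
    if (--b->count == 0) {            /* last thread: release everybody */
        b->cycle++;
        b->count = b->total;
        pthread_cond_broadcast(&b->cond);
        pthread_mutex_unlock(&b->mutex);
        return 1;
    }
    while (cycle == b->cycle)         /* loop guards against spurious wakeups */
        pthread_cond_wait(&b->cond, &b->mutex);
    pthread_mutex_unlock(&b->mutex);
    return 0;
}
```

The generation counter is what makes reuse safe: a thread that arrives for
the next cycle cannot be confused with one still leaving the previous cycle.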
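About the BTW: one way to avoid the overwritten error flag is to combine the
per-instance codes instead of assigning them; with OpenMP a max reduction
does this without a critical section. A sketch only (load_all/load_one/
fake_load are hypothetical names, not the ngspice code, and it assumes error
codes are positive ints):

```c
/* Keep one nonzero error code from the parallel loop instead of letting
 * the last iteration overwrite it.  With OpenMP disabled the pragma is
 * ignored and the loop runs serially with the same result. */
int load_all(int (*load_one)(int idx), int n)
{
    int error = 0;
    int idx;

#pragma omp parallel for reduction(max:error)
    for (idx = 0; idx < n; idx++) {
        int err = load_one(idx);
        if (err > error)
            error = err;   /* per-thread max, combined by the reduction */
    }
    return error;
}

/* Hypothetical load function for demonstration: index 2 fails with code 7. */
int fake_load(int idx)
{
    return (idx == 2) ? 7 : 0;
}
```

If error codes are not ordered, the same effect can be had by writing the
flag under a short critical section instead of a reduction.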
Best regards,
Calin

***** Beginning of b3ld.c *****

#ifdef USE_OMP
int BSIM3LoadOMP(BSIM3instance *here, CKTcircuit *ckt);
void BSIM3LoadRhsMat(GENmodel *inModel, CKTcircuit *ckt);
#endif

#ifdef USE_PTHREAD
#include "../PThreads.h"
void *BSIM3getInstPT();
int BSIM3loadPT(BSIM3instance *here, CKTcircuit *ckt);
#endif

int BSIM3load(GENmodel *inModel, CKTcircuit *ckt)
{
#if defined(USE_OMP) || defined(USE_PTHREAD)
#ifdef USE_OMP
    int idx;
    BSIM3model *model = (BSIM3model*)inModel;
    int good = 0;
    BSIM3instance *here;
    BSIM3instance **InstArray;
    InstArray = model->BSIM3InstanceArray;

#pragma omp parallel for private(here)
    for (idx = 0; idx < model->BSIM3InstCount; idx++) {
        here = InstArray[idx];
        good = BSIM3LoadOMP(here, ckt);
    }

    BSIM3LoadRhsMat(inModel, ckt);
    return good;
}

int BSIM3LoadOMP(BSIM3instance *here, CKTcircuit *ckt)
{
    BSIM3model *model;
#endif

#ifdef USE_PTHREAD
    // Initialize PThere; PTmodel is initialized to inModel in PTrun
    PThere = ((BSIM3model *) inModel)->BSIM3instances;
    return PTrun(inModel, ckt, (void *(*)()) &BSIM3getInstPT, &BSIM3loadPT);
}

// Returns current instance (or first non-null) and advances pointers.
void *BSIM3getInstPT()
{
    void *here;

    if (PTmodel == NULL)
        return NULL;                // We're at the end of the list
    do {
        here = PThere;
        if (PThere != NULL)
            PThere = ((BSIM3instance *) PThere)->BSIM3nextInstance;
        while (PThere == NULL) {    // WHILE not IF, to catch also models with no instances
            PTmodel = ((BSIM3model *) PTmodel)->BSIM3nextModel;
            if (PTmodel == NULL)
                return here;        // This is NULL or next will be NULL
            PThere = ((BSIM3model *) PTmodel)->BSIM3instances;
        }
    } while (here == NULL);         // Also to catch models with no instances
    return here;
}

// Original load function
int BSIM3loadPT(BSIM3instance *here, CKTcircuit *ckt)
{
    BSIM3model *model;
#endif
#else
    BSIM3model *model = (BSIM3model*)inModel;
    BSIM3instance *here;
#endif

****** PThreads.c *********

/* PThreads
 *
 * Functions for multi-threading using pthread library
 */

#include "ngspice/config.h"

#ifdef USE_PTHREAD

#include "ngspice/iferrmsg.h"
#include <pthread.h>

extern int nthreads;

#define MAX_PTHREADS 8
//#define PT_DEBUG 1

void *PTworker(void *p);

void *(*PTgetInst)();
int (*PTload)(void *here, void *ckt);

pthread_t PTid[MAX_PTHREADS];
int PTindex[MAX_PTHREADS];
int PTnumber = 0;

pthread_mutex_t PTmutexNext = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t PTmutexData = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t PTmutexStart = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t PTcondStart = PTHREAD_COND_INITIALIZER;
int PTstart[MAX_PTHREADS];
pthread_mutex_t PTmutexDone[MAX_PTHREADS];
pthread_cond_t PTcondDone[MAX_PTHREADS];
int PTdone[MAX_PTHREADS];

#ifdef PT_DEBUG
int PTdebugInst;
char PTdebugThr[1000];
#endif

void *PTckt;
void *PTmodel = NULL;
void *PThere = NULL;
int PTerror;

// Main thread
int PTrun(void *model, void *ckt, void *(*getInst)(), int (*load)())
{
    int i;

    PTerror = 0;
    PTckt = ckt;
    PTmodel = model;
    PTgetInst = getInst;
    PTload = load;

    if (nthreads == 1) {            // No multi-threading
        void *here;
        while ((here = (*PTgetInst)()) != NULL) {
            int err = (*PTload)(here, PTckt);       // Actual work
            if (err)
                PTerror = err;
        }
        return PTerror;
    }

#ifdef PT_DEBUG
    PTdebugInst = 0;
#endif

    pthread_mutex_lock(&PTmutexStart);              // Initialize the list
    if (PTnumber == 0) {                            // No threads, create them
        PTnumber = nthreads;
        if (PTnumber < 1) PTnumber = 1;
        if (PTnumber > MAX_PTHREADS) PTnumber = MAX_PTHREADS;
        for (i=0; i<PTnumber; i++) {
            pthread_mutex_init(&PTmutexDone[i], NULL);
            pthread_cond_init(&PTcondDone[i], NULL);
            PTdone[i] = 0;
            PTindex[i] = i;
            if (pthread_create(&PTid[i], NULL, PTworker, &PTindex[i]))
                PTerror = E_PANIC;
        }
        if (PTerror)
            return PTerror;
    }
    for (i=0; i<PTnumber; i++) {                    // Start flags
        PTstart[i] = 1;
    }
    pthread_cond_broadcast(&PTcondStart);           // List ready to start
    pthread_mutex_unlock(&PTmutexStart);

    for (i=0; i<PTnumber; i++) {
        pthread_mutex_lock(&PTmutexDone[i]);
        while (!PTdone[i]) {
            pthread_cond_wait(&PTcondDone[i], &PTmutexDone[i]);  // Wait for the threads to finish
        }
        PTdone[i] = 0;
        pthread_mutex_unlock(&PTmutexDone[i]);
    }

#ifdef PT_DEBUG
    PTdebugThr[PTdebugInst] = 0;
    LOGD(PTdebugThr);
#endif

    return PTerror;
}

// Worker thread
void *PTworker(void *ixp)
{
    int index = *(int *)ixp;
    void *here;

    while (1) {
        pthread_mutex_lock(&PTmutexStart);
        while (!PTstart[index]) {
            pthread_cond_wait(&PTcondStart, &PTmutexStart);  // Wait for green light
        }
        PTstart[index] = 0;
        pthread_mutex_unlock(&PTmutexStart);

        while (1) {
            pthread_mutex_lock(&PTmutexNext);       // Get another instance
            here = (*PTgetInst)();
            pthread_mutex_unlock(&PTmutexNext);
            if (here == NULL)
                break;
#ifdef PT_DEBUG
            PTdebugThr[PTdebugInst++] = '0' + index;
#endif
            int err = (*PTload)(here, PTckt);       // Actual work
            if (err)
                PTerror = err;
        }

        pthread_mutex_lock(&PTmutexDone[index]);    // Flag done
        PTdone[index] = 1;
        pthread_cond_signal(&PTcondDone[index]);
        pthread_mutex_unlock(&PTmutexDone[index]);
    }
}

#endif

-----Original Message-----
From: Dietmar Warning [mailto:die...@ar...]
Sent: Thursday, 18 April, 2013 21:15
To: Ngspice developers mailing list.
Subject: Re: [Ngspice-devel] Multithreading with pthread

Hi,

only for information: there is a paper "On performance enhancement of
circuit simulation using multithreaded techniques" from Perng/Weng/Li.
Calin, can you agree with these results?

How large is the change in model code?

BR
Dietmar

Am 18.04.2013 19:59, schrieb Francesco Lannutti:
> I think we are very interested in this, but prior to move the existing
> implementation from OpenMP to Pthreads, you should measure the improvement
> between OpenMP and Pthread implementations.
> Since OpenMP is a PRAGMA style parallelization, the Pthread one should be
> better, but I don't know how better it is :) .
>
> Thank you,
> Fra
>
> Il giorno 18/apr/2013, alle ore 16:29, Calin Andrian <cal...@gm...> ha scritto:
>
>> Hi,
>>
>> I am working on a design suite that will use ngspice as the simulation engine.
>> Since the target platform has no OpenMP, I solved multi-threading with
>> pthread. Is there interest to move this into the public code?
>>
>> The same models are benefiting (BSIM3, BSIM4, BSIMSOI). I ran experiments
>> on others too, but there is no gain...
>> Results: 4-core 4-thread time is 2.2 times faster than 1 thread.
>>
>> Best regards,
>> Calin Andrian
>>
>> ------------------------------------------------------------------------------
>> Precog is a next-generation analytics platform capable of
>> advanced analytics on semi-structured data. The platform includes
>> APIs for building apps and a phenomenal toolset for data science.
>> Developers can use our toolset for easy data analysis &
>> visualization. Get a free account!
>> http://www2.precog.com/precogplatform/slashdotnewsletter
>> _______________________________________________
>> Ngspice-devel mailing list
>> Ngs...@li...
>> https://lists.sourceforge.net/lists/listinfo/ngspice-devel