From: Sumit G. <su...@nv...> - 2008-01-18 21:23:30
|
Hi, Steve suggested I write to this group. I am wondering if anyone has
considered using GPU-based computing to accelerate the Icarus Verilog
simulator. NVIDIA's GPUs have evolved over the years to become fully
programmable, massively parallel architectures. There are 128 processor
cores with floating point units in an NVIDIA GPU today, delivering anywhere
between 120 and 350 GFLOPS depending on your application. We have a C-based
programming model called CUDA that has a wide developer community around it.
Getting started just requires an NVIDIA GeForce 8-series GPU (88xx), and the
CUDA software tools are freely available at: http://www.nvidia.com/cuda

I think the opportunity to parallelize will be in running many evaluations
on different threads and then coalescing the results. NVIDIA's GPUs are very
good at number crunching and can handle thousands of parallel threads.
However, a pure event-driven simulator port may not be successful due to the
control-intensive nature of the algorithm. But I am in no way an expert.

Regards,
Sumit (NVIDIA employee)
|
From: Cary R. <cy...@ya...> - 2008-01-18 23:49:45
|
--- Sumit Gupta <su...@nv...> wrote:
> I am wondering if anyone has considered using GPU-based computing to
> accelerate the Icarus Verilog simulator.

This certainly is interesting, but I think we would need to make the runtime
multi-threaded before starting something this ambitious. With the current
crop of Duo and Quad machines available, adding multi-threading to Icarus
has been on my mind. The problem is finding the time to work on it.

> NVIDIA's GPUs have evolved over the years to become fully programmable,
> massively parallel architectures. There are 128 processor cores with
> floating point units in an NVIDIA GPU today, delivering anywhere between
> 120 and 350 GFLOPS depending on your application.

Much of Icarus is bit based, so FLOPS are not as important as for an analog
simulator. Though with those kinds of numbers I'm sure they would be more
than fast enough ;-).

Thanks for the information,
Cary
|
From: Stephen W. <st...@ic...> - 2008-01-19 00:50:09
|
Cary R. wrote:
> --- Sumit Gupta <su...@nv...> wrote:
>> I am wondering if anyone has considered using GPU-based computing to
>> accelerate the Icarus Verilog simulator.
>
> This certainly is interesting, but I think we would need to make the
> runtime multi-threaded before starting something this ambitious. With the
> current crop of Duo and Quad machines available, adding multi-threading
> to Icarus has been on my mind. The problem is finding the time to work
> on it.

That's always been a problem because of the way that the Verilog timing
model works, and especially how behavioral and netlist code interact in a
Verilog design. Multi-threading Verilog simulations is highly tricky.

>> NVIDIA's GPUs have evolved over the years to become fully programmable,
>> massively parallel architectures. There are 128 processor cores with
>> floating point units in an NVIDIA GPU today, delivering anywhere between
>> 120 and 350 GFLOPS depending on your application.
>
> Much of Icarus is bit based, so FLOPS are not as important as for an
> analog simulator. Though with those kinds of numbers I'm sure they would
> be more than fast enough ;-).

Could be especially interesting when (if) analogue/mixed-signal support
gets added to Icarus Verilog. I'm weak on how exactly all that would work,
but it seems likely that resolving a bunch of drivers may be something that
can usefully be handed off to a co-processor.

--
Steve Williams             "The woods are lovely, dark and deep.
steve at icarus.com         But I have promises to keep,
http://www.icarus.com       and lines to code before I sleep,
http://www.picturel.com     And lines to code before I sleep."
|
From: Anthony J B. <net...@nc...> - 2008-01-19 22:00:26
|
On Fri, 18 Jan 2008, Stephen Williams wrote:
> That's always been a problem because of the way that the Verilog timing
> model works, and especially how behavioral and netlist code interact in a
> Verilog design. Multi-threading Verilog simulations is highly tricky.

Right, especially given that if it's done wrong, the results of your sim
run could be non-deterministic from run to run. Not necessarily incorrect,
but different or irreproducible results are equally bad and will result in
unprecedented levels of fanmail. =)

I've been putting it off for years, but sometime I do need to look at the
generated vvp to see what kinds of event cones (for lack of a better word)
can be extracted. (I don't mean basic blocks.)

A lot of those bitwise logic ops can be done as a branchless SWAR op,
assuming you split your value into bitwise-parallel 01 and XZ value planes.
(SWAR = SIMD within a register.) There's more going on calculation-wise,
but the odds of a branch-mispredict flush are zero, as no branches are
required. For example, run the program below. I only print one bit, but by
extension you can go as wide as you want. This is exactly what VCS does if
you look at the generated C code while gcc is in the middle of chewing
on it.
-Tony

#include <stdio.h>
#include <stdlib.h>

/*
 * 00 '0'
 * 01 '1'
 * 10 'Z'
 * 11 'X'
 */
#define MVL(a,b) "01ZX"[((a)&1)*2+((b)&1)]

#define ITER2 for(t3=0;t3<2;t3++) for(t4=0;t4<2;t4++) {
#define TAIL2 t0&=1; t1&=1; \
    printf("%d%d -> %d%d \t ~%c -> %c\n", \
           t3,t4, t0,t1, MVL(t3,t4), MVL(t0,t1)); \
    } printf("\n");

#define ITER4 for(t3=0;t3<2;t3++) for(t4=0;t4<2;t4++) \
              for(t5=0;t5<2;t5++) for(t6=0;t6<2;t6++) {
#define TAIL4 t0&=1; t1&=1; \
    printf("%d%d & %d%d -> %d%d \t %c & %c -> %c\n", \
           t3,t4, t5,t6, t0,t1, MVL(t3,t4), MVL(t5,t6), MVL(t0,t1)); \
    } printf("\n");

int main(int argc, char **argv)
{
    int c, d, t0, t1, t3, t4, t5, t6;

    printf("NOT\n===\n");
    ITER2 t0=t3; t1=(t3|~t4); TAIL2

    printf("AND\n===\n");
    ITER4 d=(t3|t4)&(t5|t6); t0=d&(t3|t5); t1=d; TAIL4

    printf("OR\n==\n");
    ITER4 c=(t3^t5)^((t3|t4)&(t5|(t6&t3))); t0=c; t1=((t4|t6)|c); TAIL4

    printf("XOR\n===\n");
    ITER4 c=t3|t5; t0=c; t1=(c|(t4^t6)); TAIL4

    printf("NAND\n====\n");
    ITER4 d=(t3|t4)&(t5|t6); c=d&(t3|t5); d=c|(~d); t0=c; t1=d; TAIL4

    printf("NOR\n===\n");
    ITER4 c=(t3^t5)^((t3|t4)&(t5|(t6&t3))); d=c|(~((t4|t6)|c)); t0=c; t1=d; TAIL4

    printf("XNOR\n====\n");
    ITER4 c=t3|t5; d=c|(~(c|(t4^t6))); t0=c; t1=d; TAIL4

    exit(0);
}
|