Re: [Iverilog-devel] A Static Horrible Idea

On Thu, Jul 1, 2010 at 12:11 PM, Kevin Cameron <iv...@gr...> wrote:
>
> Since someone mentioned VCS: I worked on a parallel processing version
> of VCS (VCS-MT back in the '90s). Prior to that I worked on a parallel
> VHDL simulator. My advice is -
>
> 1. Don't do a lock-step/SMP implementation.
>
>    SMP is on its way out, and lock-step algorithms stop you taking
> advantage of parallelism in the designs.

I'm doing a threadpool-based implementation.  In addition, ordering
dependencies imply that two events with a dependency on one another
also access similar memory areas, meaning that, if we have some sort
of hook where we can send particular tasks to particular processors on
a NUMA arch, we can send dependent tasks to processors that have fast
access to that particular memory.  For example, if a task depends on
some other task, we put it on a chain hanging off the "first" task,
the one that needs to execute "in order".  That chain can be transformed (after
completing the task) to a stealable deque, so that the same processor
(mostly) handles a set of tasks that are dependent on one another, and
therefore will mostly access the same memory.  If the other processors
become idle they then steal tasks from an overloaded processor.  I
still have to work out fast multithreaded work-stealing deques though;
memfences make my head spin.
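
A minimal sketch of the chain-then-steal idea above, in C++.  All names
here (Task, StealableDeque, run_one) are hypothetical and not part of
vvp, and the deque is guarded by a plain mutex rather than the
fence-based lock-free deque mentioned above, so it only shows the
scheduling shape:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical task: some work plus the chain of tasks that must run after it.
struct Task {
    std::function<void()> work;
    std::vector<Task> dependents;  // the "chain" hung on the first task
};

// Mutex-guarded stand-in for a work-stealing deque (assumption: a real
// implementation would be lock-free). The owner works at the back,
// thieves take from the front.
class StealableDeque {
public:
    void push(Task t) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push_back(std::move(t));
    }
    bool pop(Task& out) {    // owner end: newest task, likely cache-warm
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        out = std::move(q_.back());
        q_.pop_back();
        return true;
    }
    bool steal(Task& out) {  // thief end: oldest task
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::deque<Task> q_;
};

// Run one task on behalf of worker `self`: prefer local work, else steal.
// Once the task completes, its chain becomes stealable local work, so
// dependent tasks tend to stay on the processor that ran their parent.
bool run_one(std::vector<StealableDeque>& workers, std::size_t self) {
    Task t;
    bool found = workers[self].pop(t);
    for (std::size_t i = 0; !found && i < workers.size(); ++i)
        if (i != self) found = workers[i].steal(t);
    if (!found) return false;
    t.work();
    for (Task& d : t.dependents) workers[self].push(std::move(d));
    return true;
}
```

With two workers, pushing a root task with two dependents onto worker 0
and calling run_one(workers, 0) leaves both dependents on worker 0's
deque; an idle worker 1 calling run_one(workers, 1) then steals the
older one from the front while worker 0 keeps the newer one.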

For that matter, commodity SMPs have SMP semantics but NUMA timings
anyway.  Cache miss, anyone?

>    Even if it does work, you would probably be better off doing a
> proper compiled-code back-end.

Certainly.  But multithreaded is sexy, and gives IVerilog a chance to
be put in the spotlight, with more eyeballs to go through the
compiled-code back-end.  JITted interpreters are not as sexy as they
once were (even tracing JIT is getting mainstream) ;)  IMO anyway.

The main problem is that compiler research has been in the limelight
for a long time, with everyone squeezing every little bit of juice out
of .... a single processor.  JIT isn't sexy, because it's just
compiler research where you're optimizing the compiler while you're
optimizing its output.  People get taught compilers in college.  They
didn't (until recently) get taught the stuff Dijkstra's been spamming
his EWDs about: cooperating processors and the bins they go through to
communicate with each other.  AFAICT anyway.  After all I took up
electronics in college ;) So what little I know about what they teach
in college compsci is from my cousin who took that.  In a university
in a third-world country.  For all I know some university somewhere
teaches semaphores to freshmen college students.

I suggest using GNU Lightning, although I hear LLVM is getting some
traction.  LLVM feels kinda heavy to me though, which is why I prefer
the header-only Lightning.  But Lightning development is kinda slow
and spurty...

>
> 2. Divide up the design into big chunks (statically or dynamically), and
> process them in separate executables.
>
>    That scales to networks and can avoid stalls that you get with
> lock-step.

This requires creating a new ivl target.  I'll think about this later.
Most parallelizing commercial simulators (whose ads I've seen,
anyway) do this but require partitioning "by hand"; one part of my
proposal is how to (in effect) do partitioning automatically.

For that matter SIMBUS seems to be the iverilog answer to by-hand
partitioning, although I haven't actually looked deeply into that.

But anyway I'll think about mvvp first before hacking into ivl targets.

>
> 3. Parallel processing things like the event-scheduling (on SMP) doesn't
> work.
>
>    Tightly coupled stuff tends to run into the problem that it dirties
> the cache lines.

That's why there's a proposal for event-local scheduling, with the
schedules of each event being merged at a later parallelizable stage.
Basically we're transforming the simulator to what is effectively a
glorified MapReduce, with the output of the map being the schedules by
each executed event, and using a merge operation on the schedule
structures for the reduction function.
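
The map/merge shape described above can be sketched roughly as follows
in C++.  The Schedule and Event types and the function names are
assumptions made for illustration, not vvp data structures, and the map
step is written as a plain loop even though each event only touches its
own local schedule and could run on its own thread:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical schedule: simulation time -> names of events to run then.
using Schedule = std::map<unsigned long, std::vector<std::string>>;

// Hypothetical event: executing it yields a private (event-local) schedule
// instead of writing into a shared global one.
using Event = std::function<Schedule(unsigned long now)>;

// Reduce step: fold `src` into `dst`, concatenating entries per time slot.
// Concatenation is associative, so pairs of schedules could also be merged
// in a parallel tree rather than left to right as done here.
void merge_into(Schedule& dst, const Schedule& src) {
    for (const auto& slot : src) {
        std::vector<std::string>& out = dst[slot.first];
        out.insert(out.end(), slot.second.begin(), slot.second.end());
    }
}

// One simulation step: map every event to its local schedule, then reduce.
Schedule run_time_step(const std::vector<Event>& events, unsigned long now) {
    std::vector<Schedule> locals;
    locals.reserve(events.size());
    for (const Event& e : events)
        locals.push_back(e(now));       // map: no shared state is written
    Schedule merged;
    for (const Schedule& s : locals)
        merge_into(merged, s);          // reduce: combine the local schedules
    return merged;
}
```

Because no event writes to a shared schedule during the map step, the
cache-line contention is confined to the merge.  Within one time slot
this sketch keeps entries in event order, which is an assumption; a real
simulator would have to apply Verilog's own ordering rules there.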

>
> For a static version of (2) it's mostly a front-end job rather than
> backend/runtime effort.

I agree, see above.

Sincerely,
AmkG