felix / Feature Requests / #20 pthread issues

John Skaller - 2006-08-20

Logged In: YES
user_id=5394

"11. Would you like to integrate anything from the SR
programming language?
http://www.cs.arizona.edu/sr/doc.html"

I'm looking.. this may provde very useful. Thanks!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Skaller - 2006-08-20

Logged In: YES
user_id=5394

FYI: the thing I'm thinking heavily about at the moment is
parallel garbage collection.

At the moment Felix uses naive mark/sweep, and cannot work
with any roots on 'the' machine stack. This is already a
problem in itself -- the GC has to be called when the stack
is empty, because C provides no way to examine the machine
stack.

In a threading context, the problem is vastly worse: ALL the
pre-emptive threads must be, and remain, stack free during
collection. In fact, they have to be stopped.

This is implemented. A request to collect waits until all
threads are stopped, and keeps them stopped until it is
finished. That may be OK on a uniprocessor but on a
multi-processor it will lock up all the CPUs except the one
doing collection.

By contrast, a *purely* functional system, in which memory
can only be written once (initialised), admits a copying
generational incremental collection scheme for which each
thread allocates on a local heap, without any exclusion at
all, and occasionally needs to compact and or age objects
into a shared heap. The aging requires synchronisation.

However the shared heap can be reaped (garbage collected)
entirely WITHOUT synchronisation, except at the start of the
sweep, where synchronisation is required to fix the roots.
Once the sweep starts, the other threads can go off and
continue allocating. They can age things into the heap
during collection.

In other words, the collection can be done in parallel.

There are several issues, the big one is: is it possible to
run multiple reaper threads? This is needed for collectors
to scale to large number of CPU: the design above only
allows one reaper thread, because mark/sweep has to tag all
reachable blocks, and until that is done you don't know
which blocks are unreachable. Data partitioning may help,
certainly it is already available (eg using separate
processes and TCP/IP for communications :) but we're looking
for something better (eg: some type based scheme).

The second issue is whether the situation can be improved if
writes ARE allowed, for example using a write-barrier.

It isn't clear if Felix could ever support this: the C++
object model isn't even remotely close to 'purely
functional'. Nevertheless neither is Ocaml, which uses a
write barrier (however it doesn't support multi-processing).

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Skaller - 2006-08-20

Logged In: YES
user_id=5394

Just looking at MPD (the new version of SR) we see
in the quadrature example:

co
larea = quad(left, mid, fleft, fmid, larea)
//
rarea = quad(mid, right, fmid, fright, rarea)
oc;

Note this is basically Occam syntax, it isn't new.
This executes the two statements in parallel,
then waits until they're finished before proceeding
(fork/join).

Felix can already do this, albiet with ugly and error prone
syntax, something like:

join := mk_pchannel[unit];
spawn_pthread {
larea = quad(left, mid, fleft, fmid, larea);
write (join,());
};
spawn_pthread {
rarea = quad(mid, right, fmid, fright, rarea);
write (join, ());
};
ignore(read join);
ignore(read join);

The 'join' channel is needed because all Felix threads are
detached: they cannot be joined with a primitive.
As you can see joining is not needed, since it can easily
be implemented with channels, which are more general.

As written, the Felix implementation is much less secure
than the MPD one, and it also only works for shared
memory threads (not processes).

Some syntactic sugar, such as that used in MPD,
would be very useful.

Note in this case I chose to use shared memory for the
results (larea, rarea) but I could have actually
read/written the data on the channel instead.

Note also that in Felix, if you make these changes:

pchannel -> schannel
pthread -> sthread

the code will still work, but will execute entirely within a
single pthread.

This is THE major contribution of Felix to programmers.
Control inversion does not require pre-emptive threading.

The interesting thing here is that parallelism doesn't
either: pre-emption is a hack to simulate parellelism
on a uni-processor. The 'right' constructions are free
of explicit synchronisation: in this case MPD is
using 'co .. // .. oc'.

Also note in this case shared memory for the outputs is
used but is not necessary and is a bad idea: it would
be better to read the result into a local variable
with a parallel assignment, than to use parallel
procedures which each do assignments (because
if the functions are pure, you can't trash any
global store, and the reads are serialised, which
ensures they're atomic).

Felix already has parallel assignment:

a,b = c,d;

is parallel. So actually Felix can already do this
BETTER than MPD:

larea, rarea =
quad(left, mid, fleft, fmid, larea),
quad(mid, right, fmid, fright, rarea)
;

is actually better and works now.. however the
current implementation doesn't actually do the calculations
in separate threads, because it is too hard to tell
if that would be efficient or not.

I'm actually working on a way to make this happen at
the moment: the pre-cursor is to have procedures which
can be used like functions: procedures can be threaded,
functions can't because the C stack gets in the way,
but is required for returning results. Procedures
can use a pointer to a variable into which to store
the result, and with that usage, they can be threaded.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Markus Elfring - 2006-08-20

Logged In: YES
user_id=572001

7. How do you think about to avoid high-level locks by
nonblocking implementations for common data structures?
http://www.cs.chalmers.se/~noble/manual/introduction.html

It would be nice and efficient to work with atomic
instructions for lock-free purposes.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html

11. I like the syntactic sugar. It reduces the amount of
code that must be written by the developer. (A parallelising
compiler can generate the required synchronisation
operations. - http://citeseer.ist.psu.edu/275191.html\)
Does Felix benefit from pre-emptive multitasking on multiple
processors already?

Are you going to introduce GC "background" threads?
http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29#Stop-the-world_vs._incremental_vs._concurrent

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Skaller - 2006-08-20

Logged In: YES
user_id=5394

"7. How do you think about to avoid high-level locks by
nonblocking implementations for common data structures?
http://www.cs.chalmers.se/~noble/manual/introduction.html

It would be nice and efficient to work with atomic
instructions for lock-free purposes.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2047.html"

As mentioned previously, the difficulty here is making
such resources portable. Some resources can't be implemented
*at all* on some processors. Even if they can be, it may
require privileged (supervisor) access.

AMD64, for example, can supply several user operations.
However more advanced stuff can be done with memory typing
which requires kernel level access, possibly a module,
though I'm not even sure that is enough. For example
AMD64 supports uncacheable memory.

We are likely to use such resource before providing them
to end users. For example and STM implementation may
provide atomic { .. retry .. } stuff, which may use,
for example, CAS .. without exposing the implementation.

A big problem with this stuff is testing. We need a lab
with many multiple CPU boxes of different kinds for that.
I have an AMD64x2...I have no idea how to make a root
kit for Windows.. etc.

And finally, also as mentioned, given some way to configure
access and build various resources, there's a Felix language
issue of how to enable them, what to do if you need them
and they're not there, etc. Apart from conditional
compilations, which as mentioned sucks, I have implemented
prioritorised requirements:

... requires A | B;

which first tries to use A, but if there is no A available,
will use B. Unfortunately, the mechanism isn't transitive,
and in general it has been shown to be an NP-complete
(that is, hard) problem. I've had a look at some code
which can solve it, but it isn't clear doing so would
solve the real problem: if it is that hard, programmers
won't understand the solution anyhow.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Skaller - 2006-08-20

Logged In: YES
user_id=5394

"11. I like the syntactic sugar."

What is a compiler but an interpreter of sugar? :)

"(A parallelising
compiler can generate the required synchronisation
operations. - http://citeseer.ist.psu.edu/275191.html\)"

The problem isn't that one cannot sugar up such stuff,
the problem is knowing WHEN to. Sometimes it is
more efficient to do things serially than in parallel.

"Does Felix benefit from pre-emptive multitasking on
multiple processors already?"

Nothing stops the OS from scheduling the pre-emptive
threads on multiple CPUs.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Skaller - 2006-08-20

Logged In: YES
user_id=5394

"Are you going to introduce GC "background" threads?"

I'm not sure it can be made to work with C/C++ object
model. Even making objects mobile (movable to a
different address) isn't supported by C++, this is
required for a copying collector.

A parallel collector is even more demanding:
I believe it can for a purely functional computational
model, but as you know C++ uses mutators. Whether
or not a write barrier can be used to fix this I
do not know, but I suspect not: I think you'd
have to keep a list of EVERY value a variable
had during a collection. Perhaps this can be optimised,
but in general it doesn't seem possible.

Even Haskell cannot do this. I recently suggested on
the GHC mailing list that tail-rec optimisations
are actually a bad idea, because they defeat parallel
collection. Haskell DOES allow multiple threads to
allocate in parallel I believe .. but still needs
world stop for the major heap collection.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Markus Elfring - 2006-08-23

Logged In: YES
user_id=572001

7. How do you concurrently run multiple job queues?
Would you like to be able to adjust their priorities?

11. I hope that this GC proposal will produce useful results.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n1943.pdf

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

pthread issues

Group

Searches

Help

#20 pthread issues

Discussion