... This is a public continuation of a private thread;
see the items at the bottom of this mail.
Contributions are welcome.
The threads that we used with Millipede are kernel threads.
We work with NT, and handling threads turned out to be quite easy,
except when we wanted to migrate them together with their stacks on the DSM.
We do have an implementation over Linux. It is not really in
full working condition; I will tell you about it later.
It seems that your options 2-4 cannot provide the performance
needed for shared-memory HPC applications. As for option 1,
there are two important questions:
1. How to solve false sharing.
2. Which memory model is supported.
There are perhaps three important general approaches (each having
a handful of variants) to solving false sharing while providing
a decent memory model. The first two are page-based, namely,
they rely on the page-granularity protection provided
by the hardware and the operating system. The third is based on
instrumenting the application's memory accesses. (Rough sketches
of all three approaches appear after the list.)
1. Release-consistency approaches. With this approach,
modifications are sent over the network only upon arriving at
a memory barrier called a release, which is usually performed at
synchronization points (e.g., an n-way barrier, Java's unlock).
This reduces the amount of communication and avoids false sharing.
However, it adds space overhead (saving twins) and computation
overhead (diffs, merges).
Release-consistency-like memory models
(which is, BTW, where the Java memory model seems to be heading) rely
on the observation that unless the programmer makes sure to
synchronize every read-after-write sequence, he is bound to make
lots of errors, regardless of the strictness of the memory model.
2. MultiView = the Millipede approach. We use multiple mappings from
virtual memory to "physical" memory objects. This lets us tailor
the size of the sharing units to the sizes of the objects used
by the application, thus avoiding false sharing.
Although the use of MultiView is independent of the memory model supported,
we have found that in the absence of false sharing, the simpler
protocol for sequential consistency is advantageous.
3. Instrumentation. This method avoids operating-system page
protection, which is considered expensive, while supporting fine
granularity (pages of 256 bytes are commonly considered best) to
avoid false sharing. Instrumentation, too, is independent of the
supported memory model. The interesting thing about instrumentation
is that it is relatively easy to transparently transform compiled
code into a cluster-aware version.
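
To make the twin/diff machinery of approach 1 concrete, here is a
minimal C sketch. It is not Millipede code; the names (page_state,
make_twin, encode_diff), the 4KB page size, and the byte-wise diff
format are all assumptions for illustration.

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    struct page_state {
        char *data;   /* current contents of the shared page            */
        char *twin;   /* copy taken at the first write after an acquire */
    };

    /* On the first write fault after an acquire: save a twin so that,
     * at release time, we can tell which bytes this node modified. */
    static void make_twin(struct page_state *p)
    {
        p->twin = malloc(PAGE_SIZE);
        memcpy(p->twin, p->data, PAGE_SIZE);
    }

    /* At release: compare the page against its twin and emit
     * (offset, new byte) pairs for modified bytes only; a real
     * protocol would batch runs of changed bytes. */
    static size_t encode_diff(const struct page_state *p, char *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            if (p->data[i] != p->twin[i]) {
                memcpy(out + n, &i, sizeof i);  /* offset in the page */
                out[n + sizeof i] = p->data[i]; /* new value          */
                n += sizeof i + 1;
            }
        }
        return n;  /* number of diff bytes to ship at the release */
    }

Since two nodes that write disjoint bytes of the same page produce
non-overlapping diffs, the diffs can simply be merged at the receiver;
this is how the approach tolerates false sharing without shipping
whole pages back and forth.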
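
For approach 2, a minimal Linux sketch of the MultiView idea
(Millipede itself ran on NT; the region name "/dsm_region" and the
sizes are illustrative, and error checks are omitted): the same
underlying memory object is mapped at several virtual addresses, and
each application object is accessed only through its own view, so
protection can be set per object even when objects share a page.

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* One "physical" memory object backing a single 4KB page. */
        int fd = shm_open("/dsm_region", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);

        /* Two views (mappings) of the same underlying page. */
        char *view_a = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        char *view_b = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);

        /* Object A is accessed only through view_a, object B only
         * through view_b.  Protecting view_a makes accesses to A
         * fault (so the DSM can intervene), while B, on the very
         * same physical page, stays freely accessible. */
        mprotect(view_a, 4096, PROT_NONE);
        view_b[128] = 1;   /* no fault: goes through the open view */

        shm_unlink("/dsm_region");
        return 0;
    }

This is what lets the sharing unit shrink to the object size without
any false-sharing faults, while still using ordinary page protection.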
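
And for approach 3, a hedged sketch of the access check that an
instrumenting rewriter might insert before each store, assuming
256-byte coherence blocks; block_state and dsm_acquire_exclusive are
invented placeholders for the runtime's state table and communication
layer, and a real system would index the table relative to the
shared region's base address.

    #include <stdint.h>

    #define BLOCK_SHIFT 8   /* 256-byte coherence blocks */

    enum state { INVALID, SHARED, EXCLUSIVE };

    extern enum state block_state[];                  /* one entry per block */
    extern void dsm_acquire_exclusive(uintptr_t blk); /* talks to peers      */

    /* What the rewriter emits in place of a plain "*p = v" store. */
    static inline void checked_store(int *p, int v)
    {
        uintptr_t blk = (uintptr_t)p >> BLOCK_SHIFT;
        if (block_state[blk] != EXCLUSIVE)  /* slow path: get ownership */
            dsm_acquire_exclusive(blk);
        *p = v;                             /* fast path: plain store   */
    }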
Each of these methods has its own advantages. However, we know
that MultiView can benefit a lot from kernel support.
From: Bruce Walker [mailto:bruce@...]
Sent: Wednesday, October 24, 2001 10:36 PM
To: Schuster, Assaf
Cc: Walker, Bruce J; Ladin, Rivka
Subject: Re: Clustering for Linux - adding distributed shared memory?
I'm interested in looking at it and we have already had
some discussions here about various options and feasibility.
I think there are perhaps 4 types of shared memory and
maybe two kinds of threads.
a. DSM done outside the kernel (this would be like what you have
done in the past). I'm not familiar with what is out there, but
I know there was quite a bit of interest in DSMs a few years
back. It should be easy to adapt a given implementation to the
SSI cluster.
b. System V kernel-supported shared memory. We made this
completely transparent and completely coherent across nodes
in the UnixWare NonStop Clusters. To attain this we leveraged
our Cluster Filesystem code. Our intent for Linux is
to first port CFS and then make it work for the shared
memory. This will take some months. The end result
should be page-level granularity. Instrumentation is needed
to be able to make decisions about moving things around.
c. Mapped files - very much like System V shared memory
except for the way they are named. We had full clusterwide
support for them in NSC using our CFS. In current SSI
with GFS (don't you just love the acronyms), they get at
best file-level sharing, which wouldn't be acceptable.
d. Process private data. The Linux kernel thread model is
quite a bit different than other Unix models. They are
called clones. Clones can optionally share open files
and optionally share address space. The SSI project
does not currently support having clones on different
nodes in the cluster. To do so would require sharing
the process private memory across nodes. That memory is
not described using inodes, so layering our CFS over it
to provide distribution and coherence is not as
obvious as it is for SHM and mapped files.
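
A minimal sketch of the clone behavior described in "d", using the
glibc clone() wrapper; the flags shown are real, while the stack size
and the counter are just illustration:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    static int shared_counter = 0;

    static int child(void *arg)
    {
        shared_counter++;   /* visible to the parent only under CLONE_VM */
        return 0;
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);

        /* A thread-like clone: shares the address space (CLONE_VM)
         * and the open-file table (CLONE_FILES) with the parent.
         * Drop CLONE_VM and the child gets a copy-on-write address
         * space instead, as with fork(). */
        pid_t pid = clone(child, stack + 64 * 1024,
                          CLONE_VM | CLONE_FILES | SIGCHLD, NULL);
        waitpid(pid, NULL, 0);

        printf("%d\n", shared_counter);  /* prints 1: memory was shared */
        return 0;
    }

It is exactly this optional CLONE_VM sharing that would have to span
nodes to support distributed clones, which is what makes "d" hard.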
Two kinds of threads:
There are kernel threads (described above as clones) and
then there are library threads, which is what it seems
you used in Millipede. Distributing the kernel-level
clones throughout the cluster is quite a large
project, principally because of the memory sharing
mentioned in "d" above. Moving strictly user-level/library-
level threads around wouldn't involve the kernel much, if
at all. It would be similar to what you did before.
There may be a compromise situation that could allow
multiple kernel threads per node but migration of
user-level threads. If it is interesting, I'll describe that
in more detail.
What do you think of having all or part of the discussion
(the general thread/DSM part) on the open mail list?
> Hi Bruce,
> You might recall my mails to the SSI mailing list about two weeks ago,
> concerning my paper on thread scheduling and concerning failure-
> detection scalability.
> I am working at the Compaq - Tandem Research labs in Israel on an
> SSI-JVM project for Java.
> The idea (utopian, I admit) is to transparently scale monolithic
> multithreaded Java servers to work on clusters, while also providing
> cluster high availability and application fault tolerance.
> I am thinking of leveraging the SSI Linux cluster to provide some of
> the required transparency.
> Looking at your presentation, I realized that the only missing
> ingredients in the SSI cluster are a high-performance distributed
> shared memory and an intra-process (per-thread) load-sharing
> mechanism. This will enable multithreaded applications to scale up
> in a transparent fashion on the SSI cluster. Some of your slides
> mention that this direction is missing from your goals list.
> The best way to implement a distributed shared memory is not clear.
> One method prefers instrumentation and compiler-generated code.
> Still, other methods would benefit a lot from cluster-aware kernel
> support.
> Are you interested in looking into this direction?
> Regards, Assaf