Thread: [SSI-users] White paper comparing OpenSSI and OpenMosix
|
From: Bruce W. <br...@ka...> - 2004-10-29 01:34:04
|
For those of you who might have more recent information on OpenMosix than I do,
feel free to correct me privately or publicly.
A Comparison of OpenSSI and OpenMosix
Bruce J. Walker
Oct. 27, 2004
While OpenMosix and OpenSSI have commonality (process-level load balancing
via process migration), their goals and strategy are quite different.
I am no expert in OpenMosix and what I explain below is my current
understanding of what OpenMosix can and cannot do and how it works.
Both technologies claim Single System Image (SSI) but the SSI in each case
is quite different. OpenSSI strives to aggregate all the resources of
all the nodes to result in one big SMP-like environment (one big single
system image (SSI)). OpenMosix does not. Instead, OpenMosix strives to
attain "home-node" SSI, where, while processes can move from their home node
to other nodes, these processes see only their home node. One could argue
this isn't SSI at all, but simply cpu borrowing.
To accomplish the limited goal of cpu borrowing, OpenMosix leaves the
kernel portion of the process back at the home node and for the most part
re-directs all system calls done by the migrated process back to the home
node. Over time the OpenMosix group determined their strategy had
performance and availability limitations and has tried to let some of the
system calls be executed on the new node (e.g., DFSA). However, given that
most calls still go back to the home node, loss of the home node means
all processes started there must die. The OpenSSI strategy
has always been that all system calls are executed on the node where the
process is running. This means the whole process moves in OpenSSI and not
just the "user" part of the process. There are several SSI ramifications
to the two approaches. In OpenSSI, a process has a single clusterwide unique
process id which can be seen and accessed from any process on any node.
In OpenMosix, migrated processes get a new pid on their host and visibility
of processes is limited to those started on the home node (processes
which migrate to a node other than their home node are visible on that
new node under a different name).
The situation is similar with Inter-process communication (IPC) objects
(pipes, fifos, semaphores, message queues, shared memory, unix-domain sockets,
etc.). In OpenSSI, all objects are clusterwide unique (SSI) and visible
and accessible from all nodes. Consequently an object is created on the
node where the process is currently running, not on the home node. OpenMosix
creates all objects on the home node, and processes started on the same home
node can share them, except shared memory objects, which OpenMosix did not
support across nodes (this may have been enhanced recently). OpenSSI, by
contrast, allows completely coherent read/write shared memory across nodes.
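The IPC visibility rule described above can be boiled down to a few lines. The following is a conceptual Python sketch, not kernel code; the function names and node numbers are invented for illustration:

```python
# Conceptual sketch of the two IPC visibility models (illustrative only).

def visible_openssi(creator_node, viewer_node):
    # Clusterwide-unique IPC objects: an object is visible from every
    # node, regardless of where it was created or who is looking.
    return True

def visible_openmosix(creator_home, viewer_home):
    # Objects live on the creator's home node; only processes sharing
    # that home node can see them.
    return creator_home == viewer_home

# A process on node 5 can see an object created by a process on node 2:
assert visible_openssi(2, 5)
# Under the home-node model, only same-home-node processes can:
assert visible_openmosix(1, 1)
assert not visible_openmosix(1, 3)
```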
OpenMosix does not have a strong sense of cluster membership. OpenMosix
has no APIs for membership and no infrastructure for high availability.
OpenSSI ensures that all nodes always agree on the current membership and
through the APIs, cluster-aware applications can see a consistent history of
membership transition events on all nodes. There are APIs in OpenSSI
for membership information, membership history and membership
event notifications. There are also several high availability
facilities integrated and included as part of the base OpenSSI. First,
the cluster filesystem capability (CFS) is highly available;
filesystems will transparently failover from one node to another, with no
errors seen by processes on any node actively working in those filesystems
(more on the filesystem capabilities below). Second, OpenSSI comes with
HA-LVS, which provides a highly available IP address for the cluster as
well as providing load-balancing of incoming tcp/ip connections (like http,
ssh, etc.). Providing a highly available IP address with persistent
connections across failures is an important part of high availability in
any SSI cluster environment. Next, rc-type services can trivially be
restarted on another node after failure and OpenSSI includes a simple yet
flexible process monitoring and restart subsystem. OpenSSI can also be
used to provide an HA-NFS file service.
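The membership guarantee described above — all nodes agreeing on current membership and seeing a consistent history of transition events — can be sketched conceptually. This is a Python simulation of the idea, not the OpenSSI API; the class, event names, and node numbers are invented:

```python
# Conceptual sketch: every node applies the same totally ordered stream
# of membership transition events, so all nodes agree on the current
# membership and see an identical history.

class MembershipView:
    def __init__(self):
        self.members = set()
        self.history = []              # ordered transition events

    def apply(self, event):
        kind, node = event             # ("up", n) or ("down", n)
        if kind == "up":
            self.members.add(node)
        else:
            self.members.discard(node)
        self.history.append(event)

# The same event stream delivered to two nodes yields identical views.
events = [("up", 1), ("up", 2), ("up", 3), ("down", 2)]
a, b = MembershipView(), MembershipView()
for e in events:
    a.apply(e)
    b.apply(e)

assert a.members == b.members == {1, 3}
assert a.history == b.history
```

The point of the sketch is that cluster-aware subsystems registered for these events can coordinate node-up and node-down activities because no node ever observes a divergent history.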
The filesystem capabilities for OpenSSI and OpenMosix are quite different.
OpenMosix, through their MFS, provides some access to remote files. It uses a
superroot naming scheme (you can name any file on any node by using the
naming convention //<nodename>/pathname). Such a strategy is clearly not
transparent (node-specific names; a file has a different name locally than
from another node), and it does not do coherent caching (and thus no shared
read/write mapped file capability). The OpenMosix MFS also has no failover
capability. MFS is perhaps on the bottom rung of cluster filesystems.
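The naming-transparency difference can be shown in a few lines. This is an illustrative Python sketch of the two conventions only; the node names and paths are made up:

```python
# Conceptual sketch of the two file naming models (illustrative only).

def mfs_name(nodename, path):
    # OpenMosix MFS superroot convention: //<nodename>/<path>.
    # The name you must use depends on which node holds the file.
    return "//%s%s" % (nodename, path)

def ssi_name(path):
    # OpenSSI: one clusterwide namespace -- the same name works
    # identically on every node.
    return path

# Naming node1's copy of /etc/passwd from another node under MFS:
assert mfs_name("node1", "/etc/passwd") == "//node1/etc/passwd"
# Under OpenSSI the name is location-independent:
assert ssi_name("/etc/passwd") == "/etc/passwd"
```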
OpenSSI is designed to support different cluster filesystem technologies.
It comes, however, with HA-CFS, a transparent client-server
stacked cluster filesystem (it stacks transparently on ext3, xfs, reiserfs, JFS,
etc.) that is fully coherent and yet caches aggressively, supports
shared read/write mapped files, and can transparently failover on node
failure. OpenSSI has also worked with GFS and OpenGFS, including using them as
a shared root. OpenSSI also works with Lustre and has used Lustre to support
a shared root. OpenSSI has also integrated OCFS (Oracle cluster filesystem).
OpenSSI enforces a clusterwide file namespace without the limitation of a
superroot naming scheme. OpenSSI has always worked with a shared root
(whether CFS, GFS or Lustre). In addition, any mount of any physical
filesystem (ext3, xfs, etc.) or NFS filesystem done on any node is
automatically and transparently visible by the same name on all nodes.
A key design goal for OpenSSI was to provide a platform in which other
open source cluster technologies could be integrated, thus building an
environment suitable for all clustering needs. Earlier, it was mentioned
that HA-LVS has been integrated, as well as GFS, OpenGFS, Lustre and OCFS.
In addition, OpenDLM and DRBD have been integrated. The kernel-based
membership capability of OpenSSI provides a set of APIs so these subsystems,
like those already in OpenSSI, can register for node membership events
and can co-ordinate node up and node down activities. OpenSSI also has a
kernel-to-kernel communication system that can be used by various
subsystems and has RDMA capabilities ready to leverage interconnects like
Infiniband.
Load balancing is, at some level, a point of commonality between OpenSSI
and OpenMosix and in fact the OpenMosix load calculation algorithm was adapted
into OpenSSI. However, OpenSSI has connection load balancing as well as
process load balancing. OpenSSI also supports migrating processes with shared
memory segments (this historically did not work in OpenMosix; it may now). OpenSSI also
supports migrating process groups as an atomic action and supports migrating
threads (which OpenMosix may have added recently). OpenSSI also has exec-time
load balancing as well as process migration. Exec-time balancing is much less
expensive because there is no process data to migrate. OpenSSI leverages the HA
"imalive" messages to share load information between nodes on a frequent
basis so exec-time load balancing decisions can be made. OpenMosix has
a capability to do process load balancing based on memory pressure;
OpenSSI has not enabled that feature to date.
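The exec-time decision described above can be sketched as a simple node-selection function. This is a conceptual Python illustration, not OpenSSI's actual algorithm; the function name, load values, and threshold are all invented:

```python
# Conceptual sketch: exec-time load balancing picks a target node from
# load values shared via the periodic "imalive" keepalive messages.
# No process state needs to migrate -- the new process simply starts
# on the chosen node at exec time.

def pick_exec_node(loads, local_node, threshold=0.25):
    """loads: {node_id: load_average}. Prefer the least-loaded node,
    but stay local unless the best remote node is clearly better."""
    best = min(loads, key=loads.get)
    if best != local_node and loads[local_node] - loads[best] > threshold:
        return best
    return local_node

# Node 2 is much less loaded, so the exec goes there:
assert pick_exec_node({1: 2.7, 2: 0.4, 3: 1.9}, local_node=1) == 2
# A marginal difference is not worth leaving the local node:
assert pick_exec_node({1: 0.5, 2: 0.4}, local_node=1) == 1
```

The hysteresis threshold reflects the general design concern that moving work has a cost, even at exec time, so a small load difference should not trigger remote placement.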
As is evident above, the goals of OpenSSI are much broader than just
the cpu sharing goal of OpenMosix. The chief goal of OpenSSI is to be a
complete cluster solution, which means addressing availability, scalability
(sharing of all resources), manageability and usability, as well as being the
platform that other open source cluster technology can be integrated and/or
layered on. Manageability is a key cluster problem and, by having such a high
degree of SSI, OpenSSI largely reduces the management problem from that of
a cluster to that of a single machine. The shared root is key to that, along
with visibility and access to all resources of all nodes from all nodes.
To summarize, I must re-iterate that I am no OpenMosix expert. Nonetheless,
I have tried to capture the significant differences between the two offerings.
A summary of the differences includes:
- OpenSSI has a single management and administrative domain and OpenMosix
does not;
- OpenSSI has a single root filesystem enforced across the cluster (single
copy of binaries, admin files (like password), etc.) and
OpenMosix does not;
- OpenSSI has a single pid per process and a clusterwide process management
space, which OpenMosix does not;
- OpenSSI has a transparent, clusterwide namespace for all IPC objects and
OpenMosix does not;
- OpenSSI has clusterwide device access and a single pty namespace;
OpenMosix may have clusterwide device access;
- OpenSSI has a consistent "single site" file naming across all nodes and
OpenMosix has the superroot naming paradigm;
- OpenSSI has transparent and fully coherent file access across all nodes
while OpenMosix has a limited function ship file access model;
- OpenSSI has integrated with most cluster filesystem technologies, so
there is flexibility and choice in what to run on OpenSSI;
- OpenSSI has the kernel interfaces to allow integrating other open source
technologies and several technologies have been integrated;
- OpenSSI has a highly available cluster filesystem with transparent
failover; OpenMosix does not;
- OpenSSI provides a single name and address for the cluster, and that
name/address is highly available, with persistent connections;
- OpenSSI and OpenMosix both do process migration but OpenSSI then executes
system calls on the new node and OpenMosix function ships most calls
back to the home node;
- OpenSSI has exec-time process load balancing while OpenMosix does not;
- OpenMosix has memory pressure based process load balancing and OpenSSI has
not enabled that;
- OpenSSI has a variety of high availability features which OpenMosix does
not, including process monitoring and restart, automatic service
failover, automatic filesystem failover, cluster IP address
and connection management failover, and the ability to lose a
home node without killing all the processes that started on it;
- OpenSSI has strong membership guarantees and APIs for membership while
OpenMosix does not;
- OpenSSI has APIs for rexec() and rfork() as well as migrate, while OpenMosix
has only process migration.
Both OpenMosix and OpenSSI have roots back to the early 1980s. The OpenSSI
technology started at UCLA with a system called Locus. The OpenMosix code was
adapted to Linux several years before the OpenSSI code was adapted and when
the OpenSSI Linux project was started, the question was asked "Mosix is
already there; why do OpenSSI?" Hopefully this document has explained why
we believe OpenSSI is the technology base that will propel Linux to dominance
in the clustering arena.
|
From: Kilian C. <kil...@st...> - 2004-10-29 08:52:29
|
On Friday 29 October 2004 03:33, Bruce Walker wrote:
> A Comparison of OpenSSI and OpenMosix

Great summarization job, thank you. Your paper brings back a question I had in mind for a while:

> Second, OpenSSI comes with HA-LVS, which provides a highly available IP
> address for the cluster as well as providing load-balancing of incoming
> tcp/ip connections (like http, ssh, etc.).

You speak about incoming http connection load-balancing. HA-LVS works great for me; I use it to load balance apache connections on my cluster. But I observed that apache processes won't be migrated, thus I'm forced to run apache on each node. Is there a way to run only one instance of apache, whose child processes would be load-leveled across the cluster?

Best regards,
--
Kilian CAVALOTTI                Systems & Network Engineer
Laboratoire STIX                École Polytechnique
F-91128 Palaiseau               Tel : +33 1 69 33 41 13
|
From: Javier C. <jc...@un...> - 2004-10-29 09:10:26
|
Kilian CAVALOTTI wrote:
> Is there a way to run only one instance of apache,
> whose child processes would be load-leveled across the cluster?

Another question: Is it possible to migrate a process that is listening to a TCP port on an IP address attached to an interface in a node to another node?
--
There are 10 kinds of people: those who know binary and those who don't.
Javier Celaya, Linux User #367634        jc...@un...
ASCII Ribbon Campaign against HTML email
http://pulsar.unizar.es/Members/javi
|
From: Brian J. W. <Bri...@hp...> - 2004-10-29 20:17:15
|
Javier Celaya wrote:
> Another question: Is it possible to migrate a process that is listening
> to a TCP port on an IP address attached to an interface in a node to
> another node?

Yes. Sockets currently don't migrate, so a socket continues to use the interface to which it was bind()'d. The process can migrate, because it transparently continues to use the socket on the old node, as if it were local.

If you want to set up a scalable, highly available server, it is better for the socket to listen on the CVIP (see HA-LVS) rather than any NIC's physical IP address.

Brian
|
From: Brian J. W. <Bri...@hp...> - 2004-10-29 20:12:04
|
Kilian CAVALOTTI wrote:
> You speak about incoming http connection load-balancing. HA-LVS works great
> for me; I use it to load balance apache connections on my cluster. But I
> observed that apache processes won't be migrated, thus I'm forced to run
> apache on each node. Is there a way to run only one instance of apache,
> whose child processes would be load-leveled across the cluster?

I might be wrong, but I think the Apache processes share memory, so they cannot be distributed among several nodes. That's why HA-LVS TCP connection load balancing is recommended for Apache, rather than process load balancing.

If you're running OpenSSI 1.1, the Apache processes can be moved together as a thread group from one node to another. I'm not sure if this is useful for Apache, since the socket still remains on the old node.

Brian
|
From: Kilian C. <kil...@st...> - 2004-10-29 20:50:05
|
Brian J. Watson wrote:
> I might be wrong, but I think the Apache processes share memory, so they
> cannot be distributed among several nodes. That's why HA-LVS TCP
> connection load balancing is recommended for Apache, rather than process
> load balancing.
>
> If you're running OpenSSI 1.1, the Apache processes can be moved
> together as a thread group from one node to another. Not sure if this is
> useful for Apache, since the socket still remains on the old node.

Well, indeed, since I specified the CVIP address as BindAddress in httpd.conf, I actually can migrate apache threads, even individually. I'm not sure why, but whatever. This way, I can run only one instance of apache on my cluster, instead of one instance per node.

My next question is: how can I specify the node I want to run this instance on? In the rc.nodeinfo (maybe debian specific?) file, I can put either 'initnode' or 'all' for a given service, but I'm not sure if I can put a node number there. Any clue?

Thanks for your answers,
Best regards,
--
Kilian CAVALOTTI | GPGKeyId: 0xD657340C
BOFH excuse #370: Virus due to computers having unsafe sex.
|
From: Brian J. W. <Bri...@hp...> - 2004-10-29 21:29:31
|
Kilian CAVALOTTI wrote:
> Well, indeed, since I specified the CVIP address as BindAddress in
> httpd.conf, I actually can migrate apache threads, even individually.
> I'm not sure why, but whatever.

I guess I'm wrong. Apache "threads" don't share memory with each other.

> This way, I can run only one instance of apache on my cluster, instead
> of one instance per node. My next question is: how can I specify the
> node I want to run this instance on? In the rc.nodeinfo (maybe debian
> specific?) file, I can put either 'initnode' or 'all' for a given
> service, but I'm not sure if I can put a node number there. Any clue?

David?

Regards,
Brian
|
From: David B. Z. <dav...@hp...> - 2004-10-29 21:47:11
|
Kilian CAVALOTTI wrote:
>
> This way, I can run only one instance of apache on my cluster, instead
> of one instance per node. My next question is: how can I specify the
> node I want to run this instance on? In the rc.nodeinfo (maybe debian
> specific?) file, I can put either 'initnode' or 'all' for a given
> service, but I'm not sure if I can put a node number there. Any clue?
You can put "node=#" or "node=#,#" for multiple nodes into
/etc/rc.d/rc.nodeinfo. In Redhat 9 /sbin/chkconfig manipulates this file.
# /sbin/chkconfig
This may be freely redistributed under the terms of the GNU Public License.
usage: chkconfig [--ssi] --list [name]
chkconfig --add <name>
chkconfig --del <name>
chkconfig [--level <levels>] <name> <on|off|reset>
chkconfig --ssi <name> reset
chkconfig --node <class> <name>
chkconfig --failover <on/off> <name>
/sbin/chkconfig --node node=# svcname
--
David B. Zafman | Hewlett-Packard Company
mailto:dav...@hp... | http://www.hp.com
"Computer Science" is no more about computers than astronomy is about telescopes - E. W. Dijkstra
|
|
From: Kilian C. <kil...@st...> - 2004-10-30 07:00:23
|
David B. Zafman wrote:
> You can put "node=#" or "node=#,#" for multiple nodes into
> /etc/rc.d/rc.nodeinfo. In Redhat 9 /sbin/chkconfig manipulates this file.

Wonderful! I didn't know that, and it's the solution to many of my problems!

Thanks a lot,
--
Kilian CAVALOTTI | GPGKeyId: 0xD657340C
BOFH excuse #311: transient bus protocol violation
|
From: Roger T. <pe...@ho...> - 2004-10-30 09:56:18
|
Bruce Walker wrote:
> A Comparison of OpenSSI and OpenMosix
>
> While OpenMosix and OpenSSI have commonality (process-level load
> balancing

I think we might also want to compare the performance of openMosix versus openSSI: for example, the performance of MFS/DFSA vs. CFS, process migration, the network load of the openMosix vs. openSSI interconnect, a comparison of system loads for the same HPC tasks, and security.