On Wed, 2002-09-25 at 19:42, Hans Ekbrand wrote:
> On Wed, Sep 25, 2002 at 02:16:40PM -0400, David Johnston wrote:
> > On Wed, 2002-09-25 at 05:19, Tom Lisjac wrote:
> > > I'd like to set these labs up in other schools but the single point
> > > of failure and lack of scalability makes me nervous
> > Tom,
> > I would run dhcpd on one machine, rsync the dhcpd.leases file from
> > the server to the second machine every hour or so, and use the HA
> > heartbeat to start the second machine's dhcpd whenever the primary
> > failed.
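For the record, here's a minimal sketch of that setup. The paths, the
"backup-server"/"primary-server" hostnames, and the resource name are all
placeholders; adjust them for your distribution and heartbeat install.

```shell
# Root crontab on the primary: push the lease file to the backup hourly.
0 * * * * rsync -a /var/lib/dhcp/dhcpd.leases backup-server:/var/lib/dhcp/

# /etc/ha.d/haresources (identical on both nodes): heartbeat starts
# dhcpd on the surviving node when the primary fails.
primary-server dhcpd
```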
> The point of alternating the dhcp servers is not that they themselves
> put load on their respective box, but to make the servers share the
> login sessions, since the server that offers the
> IP/kernel/NFS-root will also be the server that is queried for a
> login prompt by the workstation.
AH. I completely missed that possibility. Is there another way to
share the login sessions among the servers?
> > To divide the load, you can set up specialized servers. One runs your
> > window managers, another runs all the browsers, another runs your office
> > suite. In this setup, it's possible for one app to be unavailable while
> > everything else continues to work.
> This is not what the OP wanted. "If one server goes down, I'd like the
> lab to simply slow down... not stop."
I realize that. I was trying to point out that linux-ha won't get us
what the OP wanted, but that it is possible to mitigate the risks. I'm
sorry I wasn't clear; re-reading the message, I think I accidentally
edited out part of what I was getting at, which is that we aren't really
ready for what the OP wants.
> With the exception that the users currently logged in to the server
> that goes down will lose their sessions, a reboot of the workstations
> should give a login prompt to the server that is left. (Depending on
> the backup/sync routines used, some data in ~/ can be lost, but only
> (some of) the changes made in the lost session. Users who want better
> crash recovery than that are simply not being realistic, but they might
> be frequent in a "local elementary school" ;-)
> > As an alternative, you can use the linux-ha heartbeat software to set up
> > a fallback server. If the primary server goes down, the workstations
> > will all fail but they will be able to sign into the fallback server
> > almost immediately. For this to work, you have to use something like
> > NAS so that losing a server doesn't mean losing access to the
> > data.
> What is NAS?
NAS is "Network Attached Storage". You set up a file server (NFS or
SMB) that is only a file server, and that only communicates with your
LTSP servers. This is based on the principle (or is it just a hope?)
that a single-purpose machine is less likely to fail than a
multi-purpose machine. This machine will hold users' files. The LTSP
servers mount /home from the NAS. This way, if an LTSP server goes
down, the files are still available and you don't need rsync, et cetera.
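For example, each LTSP server could mount the NAS box's export with an
/etc/fstab line like the following. The "nas" hostname, export path, and
mount options are just an illustration:

```shell
# /etc/fstab on each LTSP server; "nas" is a placeholder hostname
nas:/export/home   /home   nfs   rw,hard,intr   0 0
```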
> What do you think of my suggestion with a check in
> Xstartup to see if ~/ is in sync with the other server, combined with
> a logout script that syncs and leaves a file in ~/ saying that ~/ is
> in sync?
The problem I see with your Xstartup idea is that the only time your
sync script is necessary is when a server has gone down, which is also
the only time when the sync script cannot access the data it needs.
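For concreteness, here's roughly how I understand the proposal. Everything
in this sketch is invented for illustration: in real use SRC would be the
user's home directory and DEST an rsync destination on the other LTSP
server (here it's demonstrated with local directories and plain cp).

```shell
#!/bin/sh
# Sketch of the proposed logout-sync idea; all paths are made up.
SRC=/tmp/demo-home
DEST=/tmp/demo-backup
mkdir -p "$SRC" "$DEST"
echo "homework" > "$SRC/essay.txt"

# The logout script: copy ~/ to the other server, then drop a stamp
# file recording that (and when) the copy completed.
cp -a "$SRC/." "$DEST/"
date > "$DEST/.in-sync"

# Xstartup on the surviving server would then check for the stamp:
if [ -f "$DEST/.in-sync" ]; then
    echo "home directory is in sync"
fi
```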
If you must keep two machines in sync, you have to do it in real time
(or close to it). One way is to cross-connect the SCSI chains of the
two servers; each server has two disks, its own and its brother's
failover. When one server detects that its brother is down, it mounts
its brother's failover disk and does its brother's job until the failed
server comes back up.
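The takeover step might look something like this; the device name, mount
point, and init script are purely illustrative assumptions:

```shell
# Run by heartbeat on the survivor when its brother dies.
# /dev/sdb1 is assumed to be the brother's failover disk on the
# shared SCSI chain; the mount point is invented.
mount /dev/sdb1 /mnt/brother
/etc/init.d/nfs restart   # re-export the brother's home directories too
```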
However, neither NAS nor interconnected SCSI addresses the issue of
incomplete file updates (ie, a server or client crash in the middle of a
file update). Current thinking seems to be that this cannot be
addressed at the hardware or O/S level; it must be addressed at the
application level.
For example, when Galeon starts up after a Galeon crash, it recovers my
previous session as bookmarks. This is part of Galeon.
As a better example, any decent database server can take a series of
transactions and only commit them if the complete series is successful.
However, for this to work, the DB frontend (ie, the application) has to
tell the server that a given series of transactions are interdependent.
We have to depend on the applications (Open Office, Galeon, etc.) to do
this kind of recovery themselves.
> > If the data are rapidly changing and critical, you can use AFS or
> > shared-scsi disks.
> For "a local elementary school" that might be an overkill ;-)
I think you're right. I would like to ask the group to discuss the
possibility that what OP wants is overkill for a local elementary
school, as well. I went down the same road the OP is going down
(eliminating all single points of failure) for a business client, and
ran into several potential solutions and a lot of dead ends. I have
since tried a different tack; instead of trying to eliminate the
possibility of failure, I'm trying to minimize the effects of a
failure. In other words, I can't promise that a workstation won't
crash, but if it does you should be back at work in under a minute.
I think it would be great if this discussion gets us around some of
these dead ends I found.
Is anyone else interested in sharing failure data from production LTSP
installations?
-What is the mean time between failures on LTSP servers?
-Has anyone had a server crash? (I haven't)
-Once an LTSP installation is stable and in production, what
user-noticeable problems are we seeing as a group?
-What questions have I missed?
Here's one problem I've seen:
Under great load, Red Hat will kill off memory hogs. Unfortunately,
this means that sometimes you can lose your window manager without
losing your session. You can see your programs, but you can't interact
with them.