Re: [Clockwork-developers] Implementation of a decentralized schedule

-- On Mon, 14 Jan 2002, Shawn McMahon wrote:

|> > 3) Implement two methods of communication:  a protocol for the client to
|> > directly request information from the servers (this would be done using
|> > unicast TCP connections), and then the usage of IP multicast for the sending
|> > of event-related information between the servers and the clients.  This
|> > would keep us from having to use broadcasts (to start up the client, method
|> > 1 or 2 could be used), and make it easy for all the servers to know about
|> > event that occur.
|> 
|> Option three is the least heinous, but still suffers from "I missed a
|> packet, and now I need to request the entire state of the database from
|> you".
|> 
|> If somebody says "show me the state of the job flow on all servers", we
|> don't want that to require connecting to dozens of machines to transfer
|> the entire job database.  Central management is going to be critical
|> to allowing one person to see the enterprise-wide state of a complex
|> application.  Assuming we want to support things as complex as, say,
|> Chronos.

What you're talking about is a bit of a trade-off:  We trade off a
single point of failure which can stop all batch processing for a more
complex design which will involve more communication among the servers,
but not suffer from relying so heavily on a single server.  I'm not yet
prepared to say which might be better for our purposes.

In my e-mail, I'm suggesting that we (a) develop a list of advantages
and disadvantages of each architecture, and possibly (b) prototype one
or both architectures in an effort to see which one will work better for
us.

We do, however, need to be careful in one respect:  the ability to
centrally manage a large schedule doesn't rely on a centralized
architecture based on a single (set of) server(s).  Let's look at
CHRONOS as an example.  Autosys uses the centralized architecture, yet
the number of jobs has outgrown what can comfortably fit in one instance
of this centralized server.  To solve this, we have broken the large
schedule up into multiple instances, loosely based on application
functionality.  My point is that, using either architecture, it's not
feasible to run a GUI that can quickly and easily navigate and manage
thousands of jobs as one logical unit.  We obviously can't do that using
an established commercial product and a reasonably complex application.
What's more, assuming you had the ability to view all of the CHRONOS jobs
in one management GUI, what would that buy you?  Whether our
architecture is centralized or not, we're probably going to have to add
some logical grouping functionality to the scheduler...for example, we
could create a logical group containing just the invoicing cycle for a
billing application, and leave other jobs, such as system maintenance
jobs, in another group.  In addition, I'm thinking that we need to have
a way to have a single logical job that runs on multiple servers.  These
types of enhancements, I think, will contribute to the ability to
effectively manage complex applications more than whether we choose to
implement the application in a centralized or decentralized manner.
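
To make the grouping idea a little more concrete, here is a rough
Python sketch.  Everything in it (the Job and JobGroup names, the
fields, the host names) is made up for illustration and is not an
existing Clockwork design; it only shows a logical group of jobs, where
one logical job may run on several servers.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """One logical job; it may be configured to run on several servers."""
    name: str
    hosts: list


@dataclass
class JobGroup:
    """A named logical group of jobs, e.g. one invoicing cycle."""
    name: str
    jobs: dict = field(default_factory=dict)

    def add(self, job):
        # Index by job name so a management view can look jobs up directly.
        self.jobs[job.name] = job

    def hosts(self):
        # Every server that runs at least one job in this group.
        return sorted({h for job in self.jobs.values() for h in job.hosts})


invoicing = JobGroup("invoicing-cycle")
invoicing.add(Job("rate-calls", ["serverA", "serverB"]))
invoicing.add(Job("print-invoices", ["serverB"]))

maintenance = JobGroup("system-maintenance")
maintenance.add(Job("rotate-logs", ["serverA", "serverB", "serverC"]))
```

A management GUI could then show one group at a time, regardless of
whether the state behind it lives on one server or on many.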

To more specifically address your point, though, the idea of using
multicast datagrams is that we communicate the real-time state
information of the schedule using them.  My thinking was that every
single server doesn't necessarily need to know the complete state of the
enterprise schedule, but only the state of the jobs it depends on.
Let's say that server B misses the datagram telling it that job X on
server A has completed successfully, and that job Y on server B is
configured to run pending the successful completion of job X.  We would
need to build a mechanism for server B to get the status of job X from
server A.  The idea is that, when the scheduler
needs specific information about the status of a job, it opens a TCP
connection to the server running that job and queries for the
information it needs.  Our application can implement the logic of
timeouts and exactly when the TCP connections are used, but the basic
idea is that, where appropriate, we use multicast to pass along
information that every server *might* be interested in, and use
point-to-point, reliable TCP for specific conversations.
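
As a minimal sketch of that fallback, assuming Python: a server keeps
whatever status it has heard over multicast and only asks the owning
server directly when it has nothing, or nothing recent, for a job.
The class name, the "STATUS <job>" line protocol, and the max_age
timeout are all assumptions for illustration, not an agreed design.

```python
import socket
import time


def query_status_over_tcp(host, port, job_name, timeout=5.0):
    """Ask the server that runs job_name for its status.

    The one-line "STATUS <job>" request/reply format is hypothetical.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(f"STATUS {job_name}\n".encode())
        with conn.makefile("r") as reader:
            return reader.readline().strip()


class JobStatusCache:
    """Holds status heard via multicast; falls back to a direct query."""

    def __init__(self, query, max_age=60.0, clock=time.monotonic):
        self._query = query      # callable(job_name) -> status string
        self._max_age = max_age  # seconds before an entry counts as stale
        self._clock = clock
        self._seen = {}          # job name -> (status, time received)

    def on_multicast(self, job_name, status):
        # Called for every status datagram we happen to receive.
        self._seen[job_name] = (status, self._clock())

    def status_of(self, job_name):
        entry = self._seen.get(job_name)
        if entry is not None and self._clock() - entry[1] <= self._max_age:
            return entry[0]
        # Datagram missed or too old: ask the owning server point-to-point.
        status = self._query(job_name)
        self._seen[job_name] = (status, self._clock())
        return status
```

On server B, the query callable would wrap query_status_over_tcp with
server A's address, so a missed datagram for job X costs one small TCP
exchange rather than a transfer of the whole job database.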

We might decide that the increased effort required to make this
scheduler decentralized isn't worth it...communications will be more
complex than simply opening a connection to the server when you have
something to say.  However, I think that before we commit to a
particular approach, we should fully evaluate the two approaches, and
make our decision based on which we feel will make the scheduler a
better application.  As you probably noted in my earlier e-mail, I'm
compiling a list of advantages / disadvantages of each approach, which
I'll post on the SourceForge project site.  I'll be listing your
concerns from this e-mail in the "disadvantages" section for a
decentralized approach, as they are certainly valid and pose a problem
which will have to be solved should we choose such an approach.

When you have some time, let's (all) continue to bring up the pros and
cons of both approaches.  The best way for us to determine a course of
action is to fully understand the problems we might encounter when we
implement them.  The more we understand the question, the better the
answer we'll give.  :-)

--
Charles Brian Hill
br...@do...