Re: [Clockwork-developers] Implementation of a decentralized schedule
Status: Planning
From: Charles B. H. <br...@do...> - 2002-01-14 15:13:51
On Mon, 14 Jan 2002, Shawn McMahon wrote:
|> > 3) Implement two methods of communication: a protocol for the client to
|> > directly request information from the servers (this would be done using
|> > unicast TCP connections), and then the usage of IP multicast for the sending
|> > of event-related information between the servers and the clients. This
|> > would keep us from having to use broadcasts (to start up the client, method
|> > 1 or 2 could be used), and make it easy for all the servers to know about
|> > event that occur.
|>
|> Option three is the least heinous, but still suffers from "I missed a
|> packet, and now I need to request the entire state of the database from
|> you".
|>
|> If somebody says "show me the state of the job flow on all servers", we
|> don't want that to require connecting to dozens of machines to transfer
|> the entire job database. Central management is going to be critical
|> to allowing one person to see the enterprise-wide state of a complex
|> application. Assuming we want to support things as complex as, say,
|> Chronos.

What you're talking about is a bit of a trade-off: We trade off a single point of failure which can stop all batch processing for a more complex design which will involve more communication among the servers, but not suffer from relying so heavily on a single server. I'm not yet prepared to say which might be better for our purposes. In my e-mail, I'm suggesting that we (a) develop a list of advantages and disadvantages of each architecture, and possibly (b) prototype one or both architectures in an effort to see which one will work better for us.

We do, however, need to be careful in one respect: the ability to centrally manage a large schedule doesn't rely on a centralized architecture based on a single (set of) server(s). Let's look at CHRONOS as an example. Autosys uses the centralized architecture, yet the number of jobs has outgrown what can comfortably fit in one instance of this centralized server.
To solve this, we have broken the large schedule up into multiple instances, loosely based on application functionality. My point is that, using either architecture, it's not feasible to run a GUI that can quickly and easily navigate and manage thousands of jobs as one logical unit. We obviously can't do that using an established commercial product and a reasonably complex application. What's more, assuming you had the ability to view all of the CHRONOS jobs in one management GUI, what would that buy you?

Whether our architecture is centralized or not, we're probably going to have to add some logical grouping functionality to the scheduler. For example, we could create a logical group containing just the invoicing cycle for a billing application, and leave other jobs, such as system maintenance jobs, in another group. In addition, I'm thinking that we need a way to have a single logical job that runs on multiple servers. These types of enhancements, I think, will contribute more to the ability to effectively manage complex applications than whether we choose to implement the application in a centralized or decentralized manner.

To more specifically address your point, though, the idea of using multicast datagrams is that we communicate the real-time state information of the schedule using them. My thinking was that every single server doesn't necessarily need to know the complete state of the enterprise schedule, but only the state of the jobs it depends on.

Let's say that server B misses the datagram telling it that job X on server A has completed successfully, and that job Y on server B is configured to run pending the successful completion of job X. We would need to build a mechanism for server B to get the status of job X from server A. The idea is that, when the scheduler needs specific information about the status of a job, it opens a TCP connection to the server running that job and queries for the information it needs.
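
As a rough illustration of the fallback described above, here is a minimal Python sketch. Nothing in this thread fixes a language, wire format, or addresses, so the JSON message layout, the multicast group 239.1.2.3, the port number, and every function and class name below are assumptions made up for the example. It only shows the shape of the idea: record what arrives by multicast, and ask the owning server over TCP when a needed status is unknown. It does not handle stale entries or retries.

```python
import json
import socket

MCAST_GROUP = "239.1.2.3"  # hypothetical multicast group for status events
MCAST_PORT = 5007          # hypothetical UDP port


def open_multicast_listener(group=MCAST_GROUP, port=MCAST_PORT):
    """Open a UDP socket that has joined the status-event multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # struct ip_mreq: group address followed by the local interface (any).
    mreq = socket.inet_aton(group) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def query_over_tcp(host, port, job, timeout=5.0):
    """Ask the server that runs `job` for its status over a TCP connection.

    Assumes a made-up line protocol: one JSON request line, one JSON reply.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(json.dumps({"query": job}).encode() + b"\n")
        reply = conn.makefile("rb").readline()
    return json.loads(reply)["status"]


class JobStatusCache:
    """Tracks only the job statuses this server has heard about or asked for."""

    def __init__(self, query=query_over_tcp):
        self._status = {}    # (server, job) -> last known status
        self._query = query  # point-to-point fallback, injectable for testing

    def on_datagram(self, payload):
        """Record one multicast status event (a JSON object in bytes or str)."""
        event = json.loads(payload)
        self._status[(event["server"], event["job"])] = event["status"]

    def status_of(self, server, port, job):
        """Return the known status, querying the owning server if we have none."""
        key = (server, job)
        if key not in self._status:
            # The datagram was missed (or never sent): fall back to TCP.
            self._status[key] = self._query(server, port, job)
        return self._status[key]
```

In the scenario above, server B would feed every received datagram to `on_datagram`, and call `status_of("A", port, "X")` before starting job Y. If the completion datagram for X was lost, that call is what triggers the single TCP query to server A, rather than a transfer of the whole job database.
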
Our application can implement the logic of timeouts and exactly when the TCP connections are used, but the basic idea is that, where appropriate, we use multicast to pass along information that every server *might* be interested in, and use point-to-point, reliable TCP for specific conversations.

We might decide that the increased effort required to make this scheduler decentralized isn't worth it: communications will be more complex than simply opening a connection to the server when you have something to say. However, I think that before we commit to a particular approach, we should fully evaluate the two approaches, and make our decision based on which we feel will make the scheduler a better application.

As you probably noted in my earlier e-mail, I'm compiling a list of advantages / disadvantages of each approach, which I'll post on the SourceForge project site. I'll be listing your concerns from this e-mail in the "disadvantages" section for a decentralized approach, as they are certainly valid and pose a problem which will have to be solved should we choose such an approach.

When you have some time, let's (all) continue to bring up the pros and cons of both approaches. The best way for us to determine a course of action is to fully understand the problems we might encounter when we implement them. The more we understand the question, the better the answer we'll give. :-)

--
Charles Brian Hill
br...@do...