clockwork-developers Mailing List for Clockwork (Page 2)
Status: Planning
Brought to you by:
jlouder
|
From: Joel L. <jo...@lo...> - 2002-03-12 01:47:09
|
Shawn D. or Brian, Do you guys have an opinion on the centralized/decentralized question? I share Shawn M's feeling that we ought to choose a path and get on with the project. -- Joel |
|
From: Shawn M. <smc...@ei...> - 2002-03-10 14:27:11
|
This one time, at band camp, Joel Loudermilk wrote: > > Does anyone else have an opinion on this now that we've laid some pros and > cons of each approach on the table? We've been at this fork in the road > for a while, and I'm anxious to make a decision one way or the other and > move on with the project. No, I think you hit all the pertinent nails on the head. I think we should move forward with establishing a concrete set of features for the first target, so we can get into some real division of labor. Basically, we need to settle what it has to do so that we can use it in non-production to test things, while still being actually useful for small applications. |
|
From: Joel L. <jo...@lo...> - 2002-03-09 22:55:47
|
I don't know about everyone else, but I'm leaning toward a centralized schedule myself. The decentralized plan has some great benefits, but it seems like it would be a lot more complicated. While this would allow us to learn some new stuff like multicast IP while building it, we're trying to build a product where reliability is essential, and something that's less complicated is likely to be more reliable. My other reason for choosing the centralized design is that we all understand very well how AutoSys, a centralized scheduler, works. We know the good things about it, and we could improve on the bad things. If we choose decentralized, we have to start from scratch. Does anyone else have an opinion on this now that we've laid some pros and cons of each approach on the table? We've been at this fork in the road for a while, and I'm anxious to make a decision one way or the other and move on with the project. If you have any thoughts not covered in Brian's compilation, now is the time to speak up. -- Joel |
|
From: Charles B. H. <br...@do...> - 2002-03-06 00:07:33
|
-- On Sun, 3 Mar 2002, Joel Loudermilk wrote: |> |> +- On Monday (1/14/2002 10:40) Charles Brian Hill <br...@do...> Wrote- |> | When you have some time, let's (all) continue to bring up the pros and |> | cons of both approaches. The best way for us to determine a course of |> | action is to fully understand the problems we might encounter when we |> | implement them. The more we understand the question, the better the |> | answer we'll give. |> |> It's been several weeks since we talked about this, but I hope it's not |> too late for me to put in my two cents. Here's what I see as the advantages |> of both the centralized and decentralized schedule: |> |> If you see something I missed, please reply. I hope Brian's offer to |> consolidate and post the list is still good. OK. The offer is still good, and I've taken the few ideas I jotted down when last we discussed this, and added Joel's to that list. I've also posted the list at SourceForge: https://sourceforge.net/docman/display_doc.php?docid=9907&group_id=40038 Actually, Joel's ideas were the first that anyone had sent to me, so if people would like to think about it some more and send me some more information, I'll add it to the document. Enjoy! -- Charles Brian Hill br...@do... |
|
From: Joel L. <jo...@lo...> - 2002-03-03 23:12:20
|
+- On Monday (1/14/2002 10:40) Charles Brian Hill <br...@do...> Wrote-
| When you have some time, let's (all) continue to bring up the pros and
| cons of both approaches. The best way for us to determine a course of
| action is to fully understand the problems we might encounter when we
| implement them. The more we understand the question, the better the
| answer we'll give.

It's been several weeks since we talked about this, but I hope it's not too late for me to put in my two cents. Here's what I see as the advantages of both the centralized and decentralized schedule:

CENTRALIZED:
------------
good:
* Management software is simpler because the entire schedule is in one place.
* We already understand a half-decent model for this type of system.
bad:
* At least one system has to run the scheduler. Probably more than one if you don't want a single point of failure. In a large schedule, these would need to be dedicated systems, adding to the overhead of the scheduler.

DECENTRALIZED:
--------------
good:
* Low/no overhead: you can run a schedule without dedicating any systems to be "the scheduler."
bad:
* Updates to the schedule would have to propagate to individual nodes.
* The management GUI would have to poll (or subscribe to) lots of systems to get an idea of the current state of the schedule, which would probably result in a delay between starting the GUI and looking at current data.
* Multicast IP: I don't know anything about it, and some users might not be wild about being forced to use it.
* We don't have a well-understood model for this type of schedule, so we won't be able to learn from someone else's mistakes.

If you see something I missed, please reply. I hope Brian's offer to consolidate and post the list is still good.

--
Joel |
|
From: Charles B. H. <br...@do...> - 2002-01-14 15:13:51
|
-- On Mon, 14 Jan 2002, Shawn McMahon wrote: |> > 3) Implement two methods of communication: a protocol for the client to |> > directly request information from the servers (this would be done using |> > unicast TCP connections), and then the usage of IP multicast for the sending |> > of event-related information between the servers and the clients. This |> > would keep us from having to use broadcasts (to start up the client, method |> > 1 or 2 could be used), and make it easy for all the servers to know about |> > event that occur. |> |> Option three is the least heinous, but still suffers from "I missed a |> packet, and now I need to request the entire state of the database from |> you". |> |> If somebody says "show me the state of the job flow on all servers", we |> don't want that to require connecting to dozens of machines to transfer |> the entire job database. Central management is going to be critical |> to allowing one person to see the enterprise-wide state of a complex |> application. Assuming we want to support things as complex as, say, |> Chronos. What you're talking about is a bit of a trade-off: We trade off a single point of failure which can stop all batch processing for a more complex design which will involve more communication among the servers, but not suffer from relying so heavily on a single server. I'm not yet prepared to say which might be better for our purposes. In my e-mail, I'm suggesting that we (a) develop a list of advantages and disadvantages of each architecture, and possibly (b) prototype one or both architectures in an effort to see which one will work better for us. We do, however, need to be careful in one respect: the ability to centrally manage a large schedule doesn't rely on a centralized architecture based on a single (set of) server(s). Let's look at CHRONOS as an example. Autosys uses the centralized architecture, yet the number of jobs has outgrown what can comfortably fit in one instance of this centralized server. 
To solve this, we have broken the large schedule up into multiple instances, loosely based on application functionality. My point is that, using either architecture, it's not feasible to run a GUI that can quickly and easily navigate and manage thousands of jobs as one logical unit. We obviously can't do that using an established commercial product and a reasonably complex application. What's more, assuming you had the ability to view all of the CHRONOS jobs in one management GUI, what would that buy you? Whether our architecture is centralized or not, we're probably going to have to add some logical grouping functionality to the scheduler...for example, we could create a logical group containing just the invoicing cycle for a billing application, and leave other jobs, such as system maintenance jobs, in another group. In addition, I'm thinking that we need to have a way to have a single logical job that runs on multiple servers. These types of enhancements, I think, will contribute to the ability to effectively manage complex applications more than whether we choose to implement the application in a centralized or decentralized manner. To more specifically address your point, though, the idea of using multicast datagrams is that we communicate the real-time state information of the schedule using them. My thinking was that every single server doesn't necessarily need to know the complete state of the enterprise schedule, but only the state of the jobs it depends on. Let's say that server B misses the datagram telling it that job X on server A has completed successfully, and that job Y on server B is configured to run pending the successful completion of job X. We would need to build a mechanism for server B to get the information of the status of job X from server A. The idea is that, when the scheduler needs specific information about the status of a job, it opens a TCP connection to the server running that job and queries for the information it needs. 
Our application can implement the logic of timeouts and exactly when the TCP connections are used, but the basic idea is that, where appropriate, we use multicast to pass along information that every server *might* be interested in, and use point-to-point, reliable TCP for specific conversations. We might decide that the increased effort required to make this scheduler decentralized isn't worth it...communications will be more complex than simply opening a connection to the server when you have something to say. However, I think that before we commit to a particular approach, we should fully evaluate the two approaches, and make our decision based on which we feel will make the scheduler a better application. As you probably noted in my earlier e-mail, I'm compiling a list of advantages / disadvantages of each approach, which I'll post on the SourceForge project site. I'll be listing your concerns from this e-mail in the "disadvantages" section for a decentralized approach, as they are certainly valid and pose a problem which will have to be solved should we choose such an approach. When you have some time, let's (all) continue to bring up the pros and cons of both approaches. The best way for us to determine a course of action is to fully understand the problems we might encounter when we implement them. The more we understand the question, the better the answer we'll give. :-) -- Charles Brian Hill br...@do... |
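The fallback described in the message above (use what arrived over multicast, and ask the owning server directly only when a notification was missed) might be sketched roughly as follows. This is an illustration only, not code from the project: the `DependencyTracker` class, the `query_owner` callback, and the status strings are all hypothetical names, and the callback stands in for the unicast TCP query so the sketch runs without a network.

```python
from typing import Callable, Dict


class DependencyTracker:
    """Tracks the last known status of remote jobs this server depends on."""

    def __init__(self, query_owner: Callable[[str], str]) -> None:
        # query_owner is a stand-in for a unicast TCP query to the server
        # that owns the job (an assumption made for this sketch).
        self._query_owner = query_owner
        self._status: Dict[str, str] = {}
        self.direct_queries = 0

    def on_multicast(self, job: str, status: str) -> None:
        """Record a status notification heard on the multicast group."""
        self._status[job] = status

    def status_of(self, job: str) -> str:
        """Return the cached status, querying the owner if none was heard."""
        if job not in self._status:
            self.direct_queries += 1
            self._status[job] = self._query_owner(job)
        return self._status[job]


# Hypothetical state held by server A, the owner of job X.
server_a_jobs = {"X": "SUCCESS"}
tracker = DependencyTracker(query_owner=lambda job: server_a_jobs[job])

# Server B never heard the datagram for X, so the lookup goes point-to-point.
can_start_y = tracker.status_of("X") == "SUCCESS"
```

A real agent would also need the timeout logic the message mentions, to decide how long to wait for a datagram before falling back to the direct query.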
|
From: Shawn M. <smc...@ei...> - 2002-01-14 13:14:19
|
This one time, at band camp, C. Brian Hill wrote: > > 1) Allow the client to have a predetermined list of servers (possibly in a > configuration file), and have it make TCP connections to each server as > needed, sending all communications through these servers. This isn't > > 2) Have the client broadcast on its local networks (UDP), and autodiscover > servers. This might work, especially, if we make each of the servers know > > 3) Implement two methods of communication: a protocol for the client to > directly request information from the servers (this would be done using > unicast TCP connections), and then the usage of IP multicast for the sending > of event-related information between the servers and the clients. This > would keep us from having to use broadcasts (to start up the client, method > 1 or 2 could be used), and make it easy for all the servers to know about > events that occur. > > 4) Use broadcasts for communications which apply to more than one server, > and use unicast TCP for point-to-point communication. Option three is the least heinous, but still suffers from "I missed a packet, and now I need to request the entire state of the database from you". If somebody says "show me the state of the job flow on all servers", we don't want that to require connecting to dozens of machines to transfer the entire job database. Central management is going to be critical to allowing one person to see the enterprise-wide state of a complex application. Assuming we want to support things as complex as, say, Chronos. |
|
From: C. B. H. <br...@do...> - 2002-01-14 00:13:53
|
----- Original Message ----- From: "Joel Loudermilk" <jo...@lo...> To: <clo...@li...> Sent: Sunday, January 13, 2002 6:53 PM Subject: Re: [Clockwork-developers] Implementation of a decentralized schedule > > I like Brian's ideas of how to run a decentralized schedule using multicast > and unicast where appropriate. Some things to consider would be: > > When a job finishes, does the system always post a notification about the > job, or does it try to figure out if there are dependencies at other systems > and only send to those systems? The way I'm thinking, it would always post a notification, in case there are clients running. In other words, if you (as a client or as an agent on a node) were listening to the multicast traffic, you'd receive all the "real-time" information about the scheduler as it happens. > If job A runs on system X and job B runs on system Y when job A finishes, > what happens when job A finishes, but system Y is down? If system X simply > sent a multicast notification when job A was finished, we're out of luck. > If we decide that the systems need to be smart enough to know who'll be > starting jobs after theirs finish, then system X could resend the message > until acknowledged by system Y. I was thinking that, when job A finishes, a multicast notification would be sent out. However, if system Y is down, then when it comes back up, for any jobs that are in an activated state (to borrow an Autosys term), the agent would directly query the servers hosting the jobs that are the source of the dependency. > We might want to think about breaking up systems in to management units, so > that if a user has 500 systems, but only wants to monitor a schedule that > affects 50 of them, we don't force him to poll all the systems to get > the status of that schedule. Also, I was thinking that jobs could be assigned into logical groups. 
How much easier would it be to manage CHRONOS' schedule if we could load up a GUI that would, for example, only show the invoicing cycle, or some other group of jobs? > I'll have to do some reading about multicast, as I have no experience with > it. I really don't have any experience writing multicast software, but from the work I've done with Tibco I have a general understanding of how it works. The Tibco software can use either multicast or broadcast, but, in an enterprise situation, multicast gives it a lot of power (eliminates the need for the logical routing daemons we use in the CHRONOS implementation). It seems like the issue of centralization / decentralization is a kind of fork in the road of our design process. I propose we try to come up with a list of possible advantages and disadvantages of each approach, in order to help us make our decision. In addition, if people were interested, we could prototype one or both systems using, say, Java, to give ourselves a feel for how the implementation might proceed, and to possibly help us uncover problems we hadn't yet thought of. I'd be glad to compile the list of pros / cons for the group, so just send them to the mailing list, and I'll put them together. -Brian |
|
From: Joel L. <jo...@lo...> - 2002-01-13 23:53:53
|
I like Brian's ideas of how to run a decentralized schedule using multicast and unicast where appropriate. Some things to consider would be: When a job finishes, does the system always post a notification about the job, or does it try to figure out if there are dependencies at other systems and only send to those systems? If job A runs on system X and job B runs on system Y when job A finishes, what happens when job A finishes, but system Y is down? If system X simply sent a multicast notification when job A was finished, we're out of luck. If we decide that the systems need to be smart enough to know who'll be starting jobs after theirs finish, then system X could resend the message until acknowledged by system Y. Using a centralized monitoring system like Brian suggested, updates to the schedule could be distributed from there as well, since it would already have knowledge of all the nodes in the schedule. And we could implement something akin to AutoSys' global variables also, by distributing updates to those variables the same way we distribute updates to the schedule. (I think that's a particularly neat feature of AutoSys.) We might want to think about breaking up systems into management units, so that if a user has 500 systems, but only wants to monitor a schedule that affects 50 of them, we don't force him to poll all the systems to get the status of that schedule. I'll have to do some reading about multicast, as I have no experience with it. -- Joel |
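The "resend until acknowledged" idea in the message above could look something like the sketch below. It is only an illustration under stated assumptions: `send_until_acked` and `FlakySystemY` are hypothetical names, the `send` callback stands in for whatever transport would carry the notification, and a real sender would wait (probably with a growing delay) between attempts rather than retrying immediately.

```python
from typing import Callable, List


def send_until_acked(send: Callable[[str], bool], message: str,
                     max_attempts: int = 5) -> int:
    """Resend message until send() reports an acknowledgement.

    Returns the number of attempts used; raises TimeoutError if no
    acknowledgement arrives within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        if send(message):
            return attempt
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")


class FlakySystemY:
    """Stand-in for system Y, which is down for the first few deliveries."""

    def __init__(self, down_for: int) -> None:
        self.down_for = down_for
        self.received: List[str] = []

    def deliver(self, message: str) -> bool:
        if self.down_for > 0:
            self.down_for -= 1
            return False  # no acknowledgement: Y is still down
        self.received.append(message)
        return True  # Y acknowledges the notification


system_y = FlakySystemY(down_for=2)
attempts = send_until_acked(system_y.deliver, "job A finished: SUCCESS")
```

Note that this only works if system X knows that system Y depends on job A, which is exactly the "smart enough to know who'll be starting jobs after theirs finish" requirement raised in the message.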
|
From: C. B. H. <br...@do...> - 2002-01-13 16:05:46
|
----- Original Message ----- From: "Joel Loudermilk" <jo...@lo...> To: <clo...@li...> Sent: Saturday, January 12, 2002 4:11 PM Subject: [Clockwork-developers] Implementation of a decentralized schedule > Don't get me wrong, I really like the idea of not needing a central server > process and database (multiplied by two for redundancy), but I can't quite > figure out how to make this work without them. How about this: a "client" (used for monitoring) that connects to each server in an environment, based on either a predetermined configuration, or some sort of autodiscovery method. Yes, the client would need to talk to each server, but I think some way to centralize monitoring is going to be a requirement, whether the data is centralized or not. The way I see it, we have the following options if we want to decentralize things: 1) Allow the client to have a predetermined list of servers (possibly in a configuration file), and have it make TCP connections to each server as needed, sending all communications through these servers. This isn't difficult to implement, so I'd suggest that even if we go with a more complex design, we might want to implement something like this, even if only really for testing purposes. 2) Have the client broadcast on its local networks (UDP), and autodiscover servers. This might work, especially, if we make each of the servers know about each of the other servers in its environment. If you think about it, the servers will need to know which other servers are part of their environment just to execute the schedule and to know if servers are down. The client could broadcast on its local network, and then grab a server list from one server who responds, then opening TCP connections to communicate with the individual servers. 
3) Implement two methods of communication: a protocol for the client to directly request information from the servers (this would be done using unicast TCP connections), and then the usage of IP multicast for the sending of event-related information between the servers and the clients. This would keep us from having to use broadcasts (to start up the client, method 1 or 2 could be used), and make it easy for all the servers to know about events that occur. 4) Use broadcasts for communications which apply to more than one server, and use unicast TCP for point-to-point communication. Of these, if we want to decentralize the application, I'm most in favor of the third option: using multicast to send event-related messages, and TCP unicast for point-to-point communications. Here's how it would work in some scenarios: Scenario 1: Job failure. The server on which the job failed sends a multicast message to the other servers in its environment (and any clients which may be running) to inform them of the failure. From there, servers could respond to the failure as necessary...running other jobs, displaying an alert (on the client), automatically notifying administrators, whatever. Scenario 2: Job dependency between two servers. On completion of the first job, its scheduling daemon sends a multicast message to the group informing them of the completion of the first job and its status. The server on which the second job runs receives this information and begins the dependent job if the first job was successful. Scenario 3: Force Start of a Job. While monitoring the distributed application, an administrator wants to start a job on demand. The client opens a unicast TCP connection to the server in question and issues the command to start the job. That server then sends a multicast message so that the other servers can act on the starting of that job if necessary. I hope this is enough for everyone to see how the idea might work. 
Using plain old TCP will work too, but will require a lot of connections between a lot of servers. If we can design two protocols so that we use multicast and unicast together, each where it makes the most sense to do so, I think we can accomplish the decentralized design. I don't think relying solely on point-to-point communications will be very scalable in a decentralized design. The reason is that, for an environment of n servers, we might need up to nC2, i.e. n(n-1)/2, connections. Using multicast, each server listens to a single multicast group address, and a lot of the traffic would go over that connection. When point-to-point connections are required (should be relatively infrequently), they can be opened and then closed. Multicast might be a little more difficult to implement, but it will reward us in terms of scalability and resource usage. So, what does everyone think? -Brian |
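To put numbers on the scalability argument above, the worst-case count of point-to-point connections can be worked out directly. The function name below is made up for this illustration, and the server counts are arbitrary examples (500 echoes the figure used elsewhere in the thread).

```python
def full_mesh_connections(n: int) -> int:
    """Distinct server pairs among n servers: nC2 = n(n-1)/2."""
    return n * (n - 1) // 2


# Worst-case point-to-point connections for a few environment sizes.
counts = {n: full_mesh_connections(n) for n in (5, 50, 500)}
# counts == {5: 10, 50: 1225, 500: 124750}
```

With multicast, by contrast, each of the n servers joins one group, so the number of group memberships grows linearly with n rather than quadratically.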
|
From: Shawn M. <smc...@ei...> - 2002-01-13 02:46:04
|
This one time, at band camp, Joel Loudermilk wrote: > > scheduler, since we would want to distribute the database as well. In that > design, does anybody have an idea how a GUI monitoring tool would be able > to see the current state of the schedule? Wouldn't it have to poll lots > and lots of systems? Well, there's nothing that says the monitoring couldn't be centralized, with a server process watching broadcast traffic. However, I really don't like the idea of using broadcasts, since the odds of missing a packet and thus having wrong information are high. |
|
From: Joel L. <jo...@lo...> - 2002-01-12 21:11:29
|
Does anyone have any idea of how a decentralized schedule might be implemented? Shawn D. mentioned that the monitoring could be done with a central server that all the clients periodically report their job status to. But I was thinking about how the job flows would work. Assuming that each system already has its portion of the schedule loaded, it's easy to handle a job that starts at noon -- the system just runs it at noon. But what about a job that runs based on the success of a job on another system? If there's no central scheduler, then the systems would have to report their statuses to whoever had dependencies on them. This could make updating a live schedule hairy -- you'd have to make sure all the systems got the updates pretty quickly. And even on the monitoring side, the data needs to be very current or an operator might take an incorrect action. For example, if he thinks one job is still running, he might delay or kill another job. But if that information is out of date ... Don't get me wrong, I really like the idea of not needing a central server process and database (multiplied by two for redundancy), but I can't quite figure out how to make this work without them. -- Joel |
|
From: Shawn D. <sd...@cf...> - 2002-01-12 18:23:35
|
I agree about NTP. No sense re-inventing the wheel. The folks who devised NTP were undoubtedly pretty bright so we won't make any substantial improvement, and the environment for our scheduler will likely already have NTP configured. If we decentralize the schedule (which I do favor) then we have a couple of choices in terms of providing a central view. $Universe does use a central master server that keeps track of jobs on all systems, but if I remember correctly it is passive; it waits to be notified by the clients so little or no polling occurs. In this model you won't have an up-to-the-second view of the state of the schedule, though you could make it pretty close by adjusting how often updates are sent to the master. Another way would be to actively poll all the servers from the master, but this will probably cause substantial processor and network overhead. The polling wouldn't have to be continual, only occurring when someone is viewing the status through the GUI, but many shops will probably want the GUI open all of the time, so you'd be taking the hit all the time. If we go to a centralized schedule then we have easy access to the current state. However, AutoSys illustrates the trade-offs with this sort of configuration, and I don't know if we want to accept the heavy compute requirements and single-point failure possibilities that this entails. If we do go with a centralized design then we need to include elements in the design to ensure that the scheduler will cleanly fail over or, better yet, support some sort of hot-standby. Shawn ----- Original Message ----- From: "Joel Loudermilk" <jo...@lo...> To: <clo...@li...> Sent: Saturday, January 12, 2002 10:53 AM Subject: Re: [Clockwork-developers] Architecture of job scheduler > > +- On Friday (1/11/2002 14:4) "Charles Brian Hill" <br...@do...> Wrote- > | A valid question may be whether our tool should be responsible for time > | synchronization. 
The users could always use NTP to keep the clocks on their > | systems synchronized (to a certain degree). > > Anything we would build to synchronize time would probably not be as good > as NTP, because we've never tried that before and because it's just one > component of our software. > > When I think of the intended users of this software, I see people whose > scheduling needs have outgrown cron, either because they have too many > systems or too complicated a schedule, or both. In a medium to large network > like that, I think it's safe to assume that the administrators have already > taken care of synchronizing the time. > > | If we choose to use > | a full SQL database, we have two possible routes: either choose a database > | that everyone will have to use, or decide to support multiple database > | platforms. > > I'm very much in favor of supporting multiple platforms. Even AutoSys can > support at least Sybase and Oracle (and I don't know how many more). > > | If I wanted to use a full-fledged database, I would want to take advantage > | of some of the more advanced features of the database engine (let's say we > | had one that supports transactions, replication, triggers, etc). To use > | those features we'd need to go with a single database platform. Trying to > | use, for example, JDBC, and let the user choose the database platform would > | mean that we wouldn't be able to use features that aren't commonly > | supported. > > JDBC does support transactions, and aren't replication and triggers things > that happen "behind the scenes" that we wouldn't be controlling through > the database's API? > > If so, what about supporting any database for which the user can find a > JDBC driver and that supports transactions and replication (and whatever > other features we need to use). 
To set up the database, the user might > have to do some database-specific stuff, since defining replication probably > varies across database platforms, but that's just one-time setup, and we > could probably even include scripts for the most popular databases. > > But I suppose all this is irrelevant if we decide to decentralize the > scheduler, since we would want to distribute the database as well. In that > design, does anybody have an idea how a GUI monitoring tool would be able > to see the current state of the schedule? Wouldn't it have to poll lots > and lots of systems? > > -- > Joel > > _______________________________________________ > Clockwork-developers mailing list > Clo...@li... > https://lists.sourceforge.net/lists/listinfo/clockwork-developers |
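The passive-master model described at the top of this message (clients push updates, the master never polls) trades freshness for low overhead. A rough sketch of that trade-off follows; the `PassiveMaster` class, the job name, and the 30-second interval are all invented for illustration and are not taken from $Universe or any other product. Timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Report:
    status: str
    received_at: float  # seconds, on whatever clock the master uses


class PassiveMaster:
    """Keeps the last status each client pushed; never polls anyone."""

    def __init__(self, report_interval: float) -> None:
        # How often clients are expected to send updates (an assumption).
        self.report_interval = report_interval
        self._reports: Dict[str, Report] = {}

    def notify(self, job: str, status: str, now: float) -> None:
        """Called when a client pushes a status update."""
        self._reports[job] = Report(status, now)

    def view(self, job: str, now: float) -> Tuple[str, float, bool]:
        """Return (status, age in seconds, whether the data looks stale)."""
        report = self._reports[job]
        age = now - report.received_at
        return report.status, age, age > self.report_interval


master = PassiveMaster(report_interval=30.0)
master.notify("nightly_backup", "RUNNING", now=100.0)
fresh = master.view("nightly_backup", now=110.0)
stale = master.view("nightly_backup", now=145.0)
```

Shortening `report_interval` narrows the window in which an operator might act on out-of-date information, at the cost of more update traffic to the master.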
|
From: Shawn D. <sd...@cf...> - 2002-01-12 18:13:01
|
I think we can easily accommodate the DB functionality we need by using very generic SQL commands that would be compatible with most SQL DBs. We should have a primary DB that we will use for most of our development and testing. My personal preference is Postgres because I'm more familiar with it and because of its reputation for better data integrity. We'll also need to have access to other major DBs so we could do regression testing to ensure compatibility, but I think we can manage that fairly easily, at least for Oracle, Sybase, and MySQL. Shawn ----- Original Message ----- From: "Joel Loudermilk" <jo...@lo...> To: <clo...@li...> Sent: Saturday, January 12, 2002 10:59 AM Subject: Re: [Clockwork-developers] Architecture of job scheduler > > +- On Friday (1/11/2002 14:18) Shawn McMahon <smc...@ei...> Wrote- > | I think we have to do a free one if we do one at all, because we're a > | free project. We're cutting our own throats if the scheduler is free > | but you have to pay Oracle $100k if you want to use it. I agree that we > | can't support every database, at least not in the beginning. > > I agree. Regardless of how many database platforms we support, at least > one of them should be free. > > One reason I think we should support multiple platforms is that I can see > a user who decides not to use our product, because he's got dozens of > Oracle servers and could easily accommodate another couple of databases > on them, but we would require him to run MySQL, or Sybase, or something else. > > That's something that always irritated me about Bugzilla. It's a great > tool, but you can only use MySQL. They even wrote the thing in Perl and > use DBI, so you'd think you could use any platform, but they then went > and used features specific to MySQL (like enum data types and some other > stuff). > > I think people are about as picky about their favorite database platform > as they are about their favorite UNIX. 
So why make a program that runs on > anyone's favorite UNIX, but only one database? > > -- > Joel |
|
From: Joel L. <jo...@lo...> - 2002-01-12 15:59:25
|
+- On Friday (1/11/2002 14:18) Shawn McMahon <smc...@ei...> Wrote- | I think we have to do a free one if we do one at all, because we're a | free project. We're cutting our own throats if the scheduler is free | but you have to pay Oracle $100k if you want to use it. I agree that we | can't support every database, at least not in the beginning. I agree. Regardless of how many database platforms we support, at least one of them should be free. One reason I think we should support multiple platforms is that I can see a user who decides not to use our product, because he's got dozens of Oracle servers and could easily accommodate another couple of databases on them, but we would require him to run MySQL, or Sybase, or something else. That's something that always irritated me about Bugzilla. It's a great tool, but you can only use MySQL. They even wrote the thing in Perl and use DBI, so you'd think you could use any platform, but they then went and used features specific to MySQL (like enum data types and some other stuff). I think people are about as picky about their favorite database platform as they are about their favorite UNIX. So why make a program that runs on anyone's favorite UNIX, but only one database? -- Joel |
|
From: Joel L. <jo...@lo...> - 2002-01-12 15:53:31
|
+- On Friday (1/11/2002 14:4) "Charles Brian Hill" <br...@do...> Wrote- | A valid question may be whether our tool should be responsible for time | synchronization. The users could always use NTP to keep the clocks on their | systems synchronized (to a certain degree). Anything we would build to synchronize time would probably not be as good as NTP, because we've never tried that before and because it's just one component of our software. When I think of the intended users of this software, I see people whose scheduling needs have outgrown cron, either because they have too many systems or too complicated a schedule, or both. In a medium to large network like that, I think it's safe to assume that the administrators have already taken care of synchronizing the time. | If we choose to use | a full SQL database, we have two possible routes: either choose a database | that everyone will have to do, or decide to support multiple database | platforms. I'm very much in favor of supporting multiple platforms. Even AutoSys can support at least Sybase and Oracle (and I don't know how much more). | If I wanted to use a full-fledged database, I would want to take advantage | of some of the more advanced features of the database engine (let's say we | had one that supports transactions, replication, triggers, etc). To use | those features we'd need to go with a single database platform. Trying to | use, for example, JDBC, and let the user choose the database platform would | mean that we wouldn't be able to use features that aren't commonly | supported. JDBC does support transactions, and aren't replication and triggers things that happen "behind the scenes" that we wouldn't be controlling through the database's API? If so, what about supporting any database for which the user can find a JDBC driver and that supports transactions and replication (and whatever other features we need to use)? 
To set up the database, the user might have to do some database-specific stuff, since defining replication probably varies across database platforms, but that's just one-time setup, and we could probably even include scripts for the most popular databases. But I suppose all this is irrelevant if we decide to decentralize the scheduler, since we would want to distribute the database as well. In that design, does anybody have an idea how a GUI monitoring tool would be able to see the current state of the schedule? Wouldn't it have to poll lots and lots of systems? -- Joel |
|
From: Shawn M. <smc...@ei...> - 2002-01-11 19:18:51
|
This one time, at band camp, Charles Brian Hill wrote: > > If we go centralized, I think we might as well pick a full-fledged SQL > database platform (hopefully a free one) and standardize on that. A > centralized system means we're dependent on a single server, and if that's > the case, that server is liable to be very, very busy. If we wanted to have If we do that, we want to take advantage of ALL the features of a real database, so we shouldn't pick a MySQL type of program that's fast but doesn't care about data integrity; we'd look at more of a PostgreSQL, where it doesn't wring the last bit of possible speed out, but it assumes your data is precious. I think we have to do a free one if we do one at all, because we're a free project. We're cutting our own throats if the scheduler is free but you have to pay Oracle $100k if you want to use it. I agree that we can't support every database, at least not in the beginning. |
|
From: Charles B. H. <br...@do...> - 2002-01-11 19:05:22
|
----- Original Message ----- From: "Shawn McMahon" <smc...@ei...> To: <clo...@li...> Sent: Friday, January 11, 2002 10:03 AM Subject: Re: [Clockwork-developers] Architecture of job scheduler > Another concern would be clocks; it's easier to keep one machine synced > than dozens. A valid question may be whether our tool should be responsible for time synchronization. The users could always use NTP to keep the clocks on their systems synchronized (to a certain degree). NTP can generally keep system clocks within a second of each other, so the question is whether it would be valuable to keep system clocks more closely synchronized than is possible with the standard tools like NTP. As we're talking mostly about batch processing, it seems relatively unlikely to me that I would run into a situation where I need to start jobs on multiple systems with that much accuracy. Even supposing that were the case, the developer would likely want to be using real-time programming techniques, which would make the use of a scheduler like we're discussing out of the question. In a decentralized configuration like Joel suggested, I'm thinking it would probably be enough to have the servers start jobs according to their own system clocks, and let the system administrators worry about keeping the system clocks as closely synchronized as they need. Many routers participate in NTP time synchronization, and I'd guess that most large server network installations are configured for NTP as well. (I even run a server at my house to keep all of my PCs' clocks in sync.) > > event processor (to borrow a term from AutoSys). And a SQL-based database > > would make things easier to work with from a development perspective, but > > not until now did I realize that it might make the system less attractive > > for a user, since they would have to manage another database. > > No reason we can't make the database a part of the server program, and > not use a full-blown SQL, is there? 
That's certainly an option, but if we could achieve the same, or even better, performance (one of Autosys' disadvantages) without having a centralized server to depend on and without having a full-fledged SQL database that humans have to manage, I'd be all for it. If we choose to use a full SQL database, we have two possible routes: either choose a database that everyone will have to do, or decide to support multiple database platforms. If I wanted to use a full-fledged database, I would want to take advantage of some of the more advanced features of the database engine (let's say we had one that supports transactions, replication, triggers, etc). To use those features we'd need to go with a single database platform. Trying to use, for example, JDBC, and let the user choose the database platform would mean that we wouldn't be able to use features that aren't commonly supported. If we go centralized, I think we might as well pick a full-fledged SQL database platform (hopefully a free one) and standardize on that. A centralized system means we're dependent on a single server, and if that's the case, that server is liable to be very, very busy. If we wanted to have redundant servers be an option, we'd really need some database replication to do it right, so there's really no alternative to a real database. However, if we decide to decentralize the application, perhaps we should investigate using something smaller, like maybe Berkeley DB, that may not be SQL, but might have enough functionality to manage the jobs for a single machine, and do a good job at it. Alternatively, we could choose to store the data in XML, and load it into the data structures we're using in whatever language(s) we choose. Or, if we could find an SQL engine that doesn't require human management and is small enough to include with builds of our application, that would be cool too. Just my $0.02 -Brian |
|
From: Charles B. H. <br...@do...> - 2002-01-11 15:35:05
|
----- Original Message ----- From: "Joel Loudermilk" <jo...@lo...> To: <clo...@li...> Sent: Thursday, January 10, 2002 9:46 PM Subject: [Clockwork-developers] Architecture of job scheduler > In the limited research I did looking at features of commercial job > schedulers, I found an interesting idea. I had always taken for granted > that any distributed job scheduler would need to have a central server > process to manage the schedule and distribute the work of the schedule to > other systems. But Orsyp's scheduler (called "Dollar Universe") has no > central server -- although the schedule is still managed centrally. > > Does anyone have an opinion about this kind of design? Like I said, until > I saw Dollar Universe, I had just assumed there would have to be a central > event processor (to borrow a term from AutoSys). And a SQL-based database > would make things easier to work with from a development perspective, but > not until now did I realize that it might make the system less attractive > for a user, since they would have to manage another database. In my experience with Tibco software (namely the system monitoring tool they call Hawk), I've encountered a similar configuration. Hawk is a system monitoring tool, designed to provide and act on statistical information about systems. It can be configured centrally, in other words, it's possible to configure an entire environment using one GUI interface, but it has no central server. I'm not sure exactly how this is implemented, except for the following bits of information: 1) Each server contains its own database of configuration data, and presumably at least some configuration data for other servers. 2) The system itself is monitored by running an application which listens for broadcast or multicast traffic from the agents running on each node. 
3) Information about, say, a server going down is either discovered by any client running at the time (Hey, this server went away!), or is reported by agents on other nodes noticing that the server went away. My opinions on this type of design are as follows: First off, we'd have to understand that in order to take away the need for a user to manage a central database, we'd need to find a way to create (and maintain automatically) individual databases on each node. Perhaps we could create an XML schema that would allow us to store the database information on each server. We'd want to keep in mind that using this type of design will likely preclude us from using more advanced database features (such as triggers, internalized locking mechanisms, etc.). If we want these features, and we want to have the information decentralized, we might look into whether there are any "mini-database" systems available. I believe I recall reading some time ago about a version of MySQL that had been made for implementations such as this -- where you wouldn't necessarily want to have an actual database server running, but you might want to be able to use a slimmed-down SQL database. Given that we can come up with an acceptable database framework to use, I think decentralizing the database is a pretty good idea. The client-server mechanism that Autosys uses seems somewhat flawed to me. For starters, what's the point of storing all the jobs that are going to run on server A on server B? What happens when server A goes down? Server B has to deal with this fact. If the configurations were decentralized, and server A went down, it would simply have to worry about recovering its own jobs. Autosys suffers in this regard. We are all familiar with Autosys chase alarms occurring when a server is down. These are (in my opinion) a result of the client-server design -- the server relies on a (pretty much) stateless client to provide state information, which doesn't persist between reboots. 
If there were no server, the "clients" (probably actually "agents") would _have_ to maintain all their own relevant state information, so when the system crashes, recovery would be no more difficult than noting which jobs were running at the time of the crash, marking them as "failed" (or "terminated"), and optionally restarting them. We'd also want to think about how we'd implement a monitoring application. How would it know which servers to connect to? Maybe a configuration file? How network-intensive will it be to monitor a large network of servers with such a tool? This might be one disadvantage of such an architecture. Would it be possible to generate an overall visual picture of job flows using this architecture? Before I shut up, I wanted to bring up another idea I had: Could we include the capability to have a single logical "job" run on multiple machines? In other words, considering all the jobs in use for CHRONOS, when we run jobs in a "distributed" fashion on multiple application servers, could we allow users somehow to consider those as a unit? This might make visualization of the schedule simpler. OK, I'm done. -Brian |
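The recovery step described above could be sketched roughly as follows. This is only an illustration under assumptions: the agent keeps its job states in a local JSON file, and the file layout, the status names, and the `save_state` and `recover` helpers are hypothetical rather than anything agreed on the list.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Persist the agent's job states, replacing the old file atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # readers see either the old file or the new one


def recover(path, restart=()):
    """Run at agent startup after a crash.

    Any job still recorded as "running" cannot really be running any more,
    so it is marked "failed". Jobs named in `restart` are also returned so
    the caller can start them again.
    """
    with open(path) as handle:
        state = json.load(handle)
    to_restart = []
    for job, status in state.items():
        if status == "running":
            state[job] = "failed"
            if job in restart:
                to_restart.append(job)
    save_state(path, state)
    return state, to_restart
```

Because each agent only reads and rewrites its own file, nothing here depends on a central server being reachable during recovery.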
|
From: Shawn M. <smc...@ei...> - 2002-01-11 15:03:18
|
This one time, at band camp, Joel Loudermilk wrote: > > I don't know exactly how this works, since I've never used their scheduler, > but it's interesting. What's also interesting is that on their web site, they > mention this only to say that if a system is isolated from the network, > it will still run its jobs. I don't know about you, but if one of my systems > was isolated from the network, I think I would prefer that it not attempt > to run any jobs, since they most likely wouldn't work. Another concern would be clocks; it's easier to keep one machine synced than dozens. Another would be stopping a job from running; what if the machine on the other end is too busy to listen to you? However, I don't know that these are insoluble problems. For instance, the "client" end could check with a central time server(s) before running a job, and be configurable as to whether or not it cared if it couldn't see it. If you made that configurable per-job, you could actually remove cron, instead of just supplementing it. > event processor (to borrow a term from AutoSys). And a SQL-based database > would make things easier to work with from a development perspective, but > not until now did I realize that it might make the system less attractive > for a user, since they would have to manage another database. No reason we can't make the database a part of the server program, and not use a full-blown SQL, is there? |
|
From: Joel L. <jo...@lo...> - 2002-01-11 02:46:39
|
In the limited research I did looking at features of commercial job schedulers, I found an interesting idea. I had always taken for granted that any distributed job scheduler would need to have a central server process to manage the schedule and distribute the work of the schedule to other systems. But Orsyp's scheduler (called "Dollar Universe") has no central server -- although the schedule is still managed centrally. I don't know exactly how this works, since I've never used their scheduler, but it's interesting. What's also interesting is that on their web site, they mention this only to say that if a system is isolated from the network, it will still run its jobs. I don't know about you, but if one of my systems was isolated from the network, I think I would prefer that it not attempt to run any jobs, since they most likely wouldn't work. At work, TCS really likes Orsyp's scheduler, particularly because there's no central server and no databases to manage. There's a lot less overhead for the scheduler than with AutoSys, or at least it seems. Does anyone have an opinion about this kind of design? Like I said, until I saw Dollar Universe, I had just assumed there would have to be a central event processor (to borrow a term from AutoSys). And a SQL-based database would make things easier to work with from a development perspective, but not until now did I realize that it might make the system less attractive for a user, since they would have to manage another database. If you want to see the list of features in Orsyp's Dollar Universe, see: http://www.orsyp.com/us/dollar_universe.asp -- Joel |
|
From: Joel L. <jo...@lo...> - 2001-12-23 00:34:31
|
I've attempted to distill everyone's comments about features into a first draft of a requirements document. It's available on the Sourceforge project page in the DocManager. Here's a URL to go directly to it: https://sourceforge.net/docman/display_doc.php?docid=8470&group_id=40038 There are only nine items on the list, but some of them are pretty hefty. I figured that once we can agree on all the requirements, we could divide the software into a few (or more) major releases (or milestones, or whatever you like to call them) and decide which features will be implemented in which release. The list is by no means final, so if I've missed or misstated anything, please let me know. -- Joel |
|
From: Charles B. H. <br...@do...> - 2001-12-21 13:08:58
|
Joel, Sorry you're receiving this twice. My SMTP server was missing the "postmaster" alias and SourceForge refused my e-mail. That should be fixed now, and I wanted to make sure this message hits the list archives. ----- Original Message ----- From: "Joel Loudermilk" <jo...@lo...> To: <clo...@li...> Sent: Thursday, December 20, 2001 9:53 PM Subject: Re: [Clockwork-developers] Kicking around some requirements > | 1) Dynamic evaluation of schedule: There should be a capability within the > | scheduler to evaluate certain values dynamically. In other words, > | definitions of jobs could include, say, some variable, whose value would not > | be evaluated until runtime. > Can you give an example of this? I'm having a hard time understanding this > feature. As a part of the definition of a job, it should be possible to use variables. For example, in specifying what user account should be used to run a job, it should be possible to, instead of directly specifying an actual account, specify a variable which would contain the name of the user. This variable should not be evaluated until runtime, which would allow for easier and more efficient changes to the schedule by means of manipulating the variables, rather than the definitions of the jobs themselves. One idea I had with regard to this is even allowing the variables to be evaluated on each client if desired, rather than on the central server. Does this make more sense? A more real world example is something that we have been desiring to do with Autosys. Each release, as the application user changes, the entire schedule must be reloaded with the new user coded in the job definition. If we could use a variable that is evaluated dynamically (as opposed to at the time the schedule is loaded), we could just change the value of this variable, rather than needing to reload the schedule. Let me know if this still doesn't make sense. We probably want to not name it the way I did, because it is a bit confusing. -Brian |
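A small sketch may make the idea easier to picture. It assumes a job definition is a set of named fields whose values can contain `$variables`, and that the scheduler looks the variables up only at the moment it starts the job. The field names, the variable names, and the `resolve` helper are hypothetical; the choice of fields that stay literal follows the earlier requirements post (job name, start time).

```python
from string import Template

# Hypothetical job definition; values may refer to $variables.
JOB = {
    "name": "nightly_load",
    "run_as": "$app_user",
    "command": "$app_home/bin/load.sh",
}

# Fields that are never expanded.
LITERAL_FIELDS = {"name", "start_time"}


def resolve(job, variables):
    """Expand variables in a job definition at the moment the job is started.

    Raises KeyError if the definition refers to a variable that is not set.
    """
    resolved = {}
    for field, value in job.items():
        if field in LITERAL_FIELDS:
            resolved[field] = value
        else:
            resolved[field] = Template(value).substitute(variables)
    return resolved
```

With this arrangement, changing the application user for a new release means changing one entry in the variable table; the stored job definitions are never reloaded.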
|
From: Joel L. <jo...@lo...> - 2001-12-21 02:53:44
|
+- On Tuesday (12/11/2001 12:50) "Charles Brian Hill" <br...@do...> Wrote- | 1) Dynamic evaluation of schedule: There should be a capability within the | scheduler to evaluate certain values dynamically. In other words, | definitions of jobs could include, say, some variable, whose value would not | be evaluated until runtime. Can you give an example of this? I'm having a hard time understanding this feature. Another feature I think would be great is the ability to assign machines to user-defined categories, and to have jobs run on systems that match certain categories, in addition to scheduling jobs on individual systems. For instance, if you assign all your Solaris systems to the "Solaris" category, you can set up your Solaris backup job *once* to run on all the "Solaris"-type systems, and when you add a new system, simply assigning it the relevant categories means that your standard jobs will be executed. I'm going to try to summarize everyone's wish list of features and put it somewhere in the documentation section of the project web site sometime soon. -- Joel |
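As a rough illustration of the category idea, the lookup could be as simple as the sketch below. The machine names, the category names, and the `machines_for` helper are made up for the example, and where the category assignments would actually be stored is left open.

```python
# Hypothetical assignment of machines to user-defined categories.
MACHINE_CATEGORIES = {
    "sun01": {"Solaris", "database"},
    "sun02": {"Solaris"},
    "lnx01": {"Linux"},
}


def machines_for(required, machine_categories):
    """Return the machines that belong to every category in `required`."""
    return sorted(
        machine
        for machine, categories in machine_categories.items()
        if required <= categories  # subset test on sets
    )
```

A job scheduled against `{"Solaris"}` would be expanded to `sun01` and `sun02` each time it is evaluated, so a newly added machine picks up the standard jobs as soon as it is given the category.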
|
From: Charles B. H. <br...@do...> - 2001-12-11 17:51:47
|
----- Original Message ----- From: "Joel Loudermilk" <jo...@lo...> To: <clo...@li...> Sent: Monday, December 10, 2001 9:35 PM Subject: [Clockwork-developers] Kicking around some requirements > In doing a little research on the web by looking at commercial job schedulers, > it seems that there are very different approaches to the architecture of a > scheduler. But before I started thinking about that too much, I wanted to > build a list of the things I wanted to see in a scheduler. > > Here's what I came up with: <Snip> > If you've got any thoughts along this line, please post them. I'm sure > we've all had enough experience with AutoSys to at least have an opinion > on what's important and what's not. I have a couple of requirements I thought I'd throw out to see what everyone thinks: 1) Dynamic evaluation of schedule: There should be a capability within the scheduler to evaluate certain values dynamically. In other words, definitions of jobs could include, say, some variable, whose value would not be evaluated until runtime. Obviously, certain fields would need to be exempt from dynamic evaluation: job name, start time, etc., but things such as user name, executable locations, etc., should be able to be evaluated by the scheduler at runtime. This is one feature that Autosys doesn't seem to have. It does support dynamic evaluation of variables on the target system (i.e., you can include an environment variable in the executable name, to be expanded by the shell when the job is run), but not as such in the scheduler. I think this feature would be quite valuable to schedule administrators. 2) An efficient way of entering schedule changes: It seems to me that there should not only be a way to edit jobs directly, through, say, a GUI, but there should also be a command-line method for entering changes. 
Further, this command-line method should support some syntax for making *changes* to an existing job, rather than just supporting the replacing of one job definition with a new one. Perhaps a declarative language somewhat like SQL could be used to interactively make changes to and display information about existing jobs. 3) An efficient way of designing schedules: Autosys lacks (big-time) a way to design a schedule, and then translate that design into something the scheduler can take as input. A GUI application, perhaps, could allow the user the capability to graphically design a schedule, and could then export the schedule in some sensible format. Much of the tedious part of working in SchedEx must be the dual maintenance of spreadsheets and the actual schedule. This might be a difficult application to write, and it should likely be placed in the somewhat-distant future as compared to other pieces of the scheduler, but I think it's important. 4) Truly dynamic data storage: This probably means that we need to implement the schedule in some sort of database, but we can't use anything that (for example) requires a HUP to be sent to daemons when changes are made. Direct file-based I/O might be okay too, provided we're interested in developing and maintaining that code as well. If we choose a database, I propose that we make a (possibly feeble) attempt to support more than one database platform. Depending on what database features we need to use (e.g. triggers), we might lock out some database platforms, but even so, our database interactions ought to be standards (ANSI SQL) compliant if possible. -- C. Brian Hill br...@do... |
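For item 2, a command-line change language might look something like the sketch below. The statement syntax (`UPDATE JOB <name> SET <field> = <value>`), the in-memory schedule, and the `apply_change` helper are all invented for illustration; the point is only that a change statement touches one field and leaves the rest of the job definition alone.

```python
import re

# Hypothetical syntax: UPDATE JOB <name> SET <field> = <value>
CHANGE_RE = re.compile(
    r"^\s*UPDATE\s+JOB\s+(\w+)\s+SET\s+(\w+)\s*=\s*(.+?)\s*$",
    re.IGNORECASE,
)


def apply_change(schedule, statement):
    """Apply a single-field change to an existing job definition.

    `schedule` maps job names to dictionaries of fields. Only the named
    field is modified; every other field keeps its current value.
    """
    match = CHANGE_RE.match(statement)
    if match is None:
        raise ValueError(f"cannot parse change statement: {statement!r}")
    name, field, value = match.groups()
    if name not in schedule:
        raise KeyError(f"no such job: {name}")
    schedule[name][field] = value
    return schedule[name]
```

A fuller version would presumably add a matching statement for displaying jobs, as the post suggests, and would validate field names against the job schema.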