clockwork-developers Mailing List for Clockwork (Page 2)

Shawn D. or Brian,

Do you guys have an opinion on the centralized/decentralized question?

I share Shawn M's feeling that we ought to choose a path and get on with
the project.

-- 
Joel

This one time, at band camp, Joel Loudermilk wrote:
>
> Does anyone else have an opinion on this now that we've laid some pros and
> cons of each approach on the table? We've been at this fork in the road
> for a while, and I'm anxious to make a decision one way or the other and
> move on with the project.

No, I think you hit all the pertinent nails on the head.

I think we should move forward with establishing a concrete set of
features for the first target, so we can get into some real division
of labor.

Basically, we need to decide what it must do so that we can use it in
non-production to test things, while still being useful for small
applications.

I don't know about everyone else, but I'm leaning toward a centralized
schedule myself. The decentralized plan has some great benefits, but it
seems like it would be a lot more complicated. While this would allow us
to learn some new stuff like multicast IP while building it, we're trying
to build a product where reliability is essential, and something that's
less complicated is likely to be more reliable.

My other reason for choosing the centralized design is that we all
understand very well how AutoSys, a centralized scheduler, works. We know
the good things about it, and we could improve on the bad things. If we
choose decentralized, we have to start from scratch. 

Does anyone else have an opinion on this now that we've laid some pros and
cons of each approach on the table? We've been at this fork in the road
for a while, and I'm anxious to make a decision one way or the other and
move on with the project.

If you have any thoughts not covered in Brian's compilation, now is the
time to speak up.

-- 
Joel

-- On Sun, 3 Mar 2002, Joel Loudermilk wrote:

|> 
|> +- On Monday (1/14/2002 10:40) Charles Brian Hill <br...@do...> Wrote-
|> | When you have some time, let's (all) continue to bring up the pros and
|> | cons of both approaches.  The best way for us to determine a course of
|> | action is to fully understand the problems we might encounter when we
|> | implement them.  The more we understand the question, the better the
|> | answer we'll give.
|> 
|> It's been several weeks since we talked about this, but I hope it's not
|> too late for me to put in my two cents. Here's what I see as the advantages
|> of both the centralized and decentralized schedule:
|> 
|> If you see something I missed, please reply. I hope Brian's offer to
|> consolidate and post the list is still good.

OK.  The offer is still good, and I've taken the few ideas I jotted down
when last we discussed this, and added Joel's to that list.  I've also
posted the list at SourceForge:

https://sourceforge.net/docman/display_doc.php?docid=9907&group_id=40038

Actually, Joel's ideas were the first that anyone had sent to me, so if
people would like to think about it some more and send me some more
information, I'll add it to the document.

Enjoy!

--
Charles Brian Hill
br...@do...

+- On Monday (1/14/2002 10:40) Charles Brian Hill <br...@do...> Wrote-
| When you have some time, let's (all) continue to bring up the pros and
| cons of both approaches.  The best way for us to determine a course of
| action is to fully understand the problems we might encounter when we
| implement them.  The more we understand the question, the better the
| answer we'll give.

It's been several weeks since we talked about this, but I hope it's not
too late for me to put in my two cents. Here's what I see as the advantages
of both the centralized and decentralized schedule:

CENTRALIZED:
------------

good:
* Management software is simpler because the entire schedule is in one place.
* We already understand a half-decent model for this type of system.

bad:
* At least one system has to run the scheduler. Probably more than one if
  you don't want a single point of failure. In a large schedule, these would
  need to be dedicated systems, adding to the overhead of the scheduler.

DECENTRALIZED:
--------------

good:
* Low/no overhead: you can run a schedule without dedicating any systems
  to be "the scheduler."

bad:
* Updates to the schedule would have to propagate to individual nodes.
* The management GUI would have to poll (or subscribe to) lots of systems
  to get an idea of the current state of the schedule, which would probably
  result in a delay between starting the GUI and looking at current data.
* Multicast IP: I don't know anything about it, and some users might not
  be wild about being forced to use it.
* We don't have a well-understood model for this type of schedule, so we
  won't be able to learn from someone else's mistakes.

If you see something I missed, please reply. I hope Brian's offer to
consolidate and post the list is still good.

-- 
Joel

-- On Mon, 14 Jan 2002, Shawn McMahon wrote:

|> > 3) Implement two methods of communication:  a protocol for the client to
|> > directly request information from the servers (this would be done using
|> > unicast TCP connections), and then the usage of IP multicast for the sending
|> > of event-related information between the servers and the clients.  This
|> > would keep us from having to use broadcasts (to start up the client, method
|> > 1 or 2 could be used), and make it easy for all the servers to know about
|> > events that occur.
|> 
|> Option three is the least heinous, but still suffers from "I missed a
|> packet, and now I need to request the entire state of the database from
|> you".
|> 
|> If somebody says "show me the state of the job flow on all servers", we
|> don't want that to require connecting to dozens of machines to transfer
|> the entire job database.  Central management is going to be critical
|> to allowing one person to see the enterprise-wide state of a complex
|> application.  Assuming we want to support things as complex as, say,
|> Chronos.

What you're talking about is a bit of a trade-off:  We trade off a
single point of failure which can stop all batch processing for a more
complex design which will involve more communication among the servers,
but not suffer from relying so heavily on a single server.  I'm not yet
prepared to say which might be better for our purposes.

In my e-mail, I'm suggesting that we (a) develop a list of advantages
and disadvantages of each architecture, and possibly (b) prototype one
or both architectures in an effort to see which one will work better for
us.

We do, however, need to be careful in one respect:  the ability to
centrally manage a large schedule doesn't rely on a centralized
architecture based on a single (set of) server(s).  Let's look at
CHRONOS as an example.  AutoSys uses the centralized architecture, yet
the number of jobs has outgrown what can comfortably fit in one instance
of this centralized server.  To solve this, we have broken the large
schedule up into multiple instances, loosely based on application
functionality.  My point is that, using either architecture, it's not
feasible to run a GUI that can quickly and easily navigate and manage
thousands of jobs as one logical unit.  We obviously can't do that using
an established commercial product and a reasonably complex application.
What's more, assuming you had the ability to view all of the CHRONOS jobs
in one management GUI, what would that buy you?  Whether our
architecture is centralized or not, we're probably going to have to add
some logical grouping functionality to the scheduler...for example, we
could create a logical group containing just the invoicing cycle for a
billing application, and leave other jobs, such as system maintenance
jobs, in another group.  In addition, I'm thinking that we need to have
a way to have a single logical job that runs on multiple servers.  These
types of enhancements, I think, will contribute to the ability to
effectively manage complex applications more than whether we choose to
implement the application in a centralized or decentralized manner.
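
To make the logical-grouping idea concrete, here is a minimal sketch in
Python. The job, group, and server names are invented for illustration, and
the tuple layout is an assumption, not a decided Clockwork data model. It
also shows one logical job ("rotate_logs") that runs on more than one server.

```python
from collections import defaultdict

# Hypothetical job records: (job name, logical group, server).
# None of these names come from the thread; they only illustrate grouping.
JOBS = [
    ("extract_usage", "invoicing", "srv-a"),
    ("rate_calls", "invoicing", "srv-b"),
    ("print_invoices", "invoicing", "srv-b"),
    ("rotate_logs", "maintenance", "srv-a"),
    ("rotate_logs", "maintenance", "srv-b"),  # one logical job, two servers
]


def jobs_in_group(jobs, group):
    """Return {job name: sorted list of servers} for one logical group."""
    view = defaultdict(set)
    for name, job_group, server in jobs:
        if job_group == group:
            view[name].add(server)
    return {name: sorted(servers) for name, servers in view.items()}


# A GUI could ask for one group at a time instead of the whole schedule.
invoicing = jobs_in_group(JOBS, "invoicing")
maintenance = jobs_in_group(JOBS, "maintenance")
```

Either architecture could serve a view like this; the grouping lives in the
job definitions, not in where the scheduler runs.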

To more specifically address your point, though, the idea of using
multicast datagrams is that we communicate the real-time state
information of the schedule using them.  My thinking was that every
single server doesn't necessarily need to know the complete state of the
enterprise schedule, but only the state of the jobs it depends on.
Let's say that server B misses the datagram telling it that job X on
server A has completed successfully, and that job Y on server B is
configured to run pending the successful completion of job X.  We would
need to build a mechanism for server B to get the information of the
status of job X from server A.  The idea is that, when the scheduler
needs specific information about the status of a job, it opens a TCP
connection to the server running that job and queries for the
information it needs.  Our application can implement the logic of
timeouts and exactly when the TCP connections are used, but the basic
idea is that, where appropriate, we use multicast to pass along
information that every server *might* be interested in, and use
point-to-point, reliable, TCP for specific conversations.
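
The fallback described above might look roughly like the following Python
sketch. No real sockets are used: on_multicast() stands in for receiving a
datagram and query_status is a placeholder for the unicast TCP query. The
class name, the status strings, and the timeout policy are all assumptions
made for illustration.

```python
class DependencyWatcher:
    """Tracks the upstream jobs one server cares about (hypothetical design)."""

    TERMINAL = ("SUCCESS", "FAILURE")

    def __init__(self, query_status, timeout):
        self.query_status = query_status  # stand-in for a unicast TCP query
        self.timeout = timeout            # seconds to wait before asking directly
        self.status = {}                  # (server, job) -> last known status
        self.waiting_since = {}           # (server, job) -> time we began waiting

    def watch(self, server, job, now):
        self.waiting_since[(server, job)] = now

    def on_multicast(self, server, job, status):
        """Handle a status datagram; ignore jobs this server does not depend on."""
        key = (server, job)
        if key in self.waiting_since:
            self.status[key] = status
            del self.waiting_since[key]

    def check_overdue(self, now):
        """Query the owning server about anything we have waited too long for."""
        for key, since in list(self.waiting_since.items()):
            if now - since < self.timeout:
                continue
            status = self.query_status(*key)
            self.status[key] = status
            if status in self.TERMINAL:
                del self.waiting_since[key]
            else:
                self.waiting_since[key] = now  # still running: wait again


queries = []


def fake_query(server, job):
    """Pretend server A answers that the job succeeded."""
    queries.append((server, job))
    return "SUCCESS"


watcher = DependencyWatcher(fake_query, timeout=30)
watcher.watch("A", "X", now=0)
watcher.on_multicast("A", "Z", "SUCCESS")  # unrelated job: ignored
watcher.check_overdue(now=10)              # not overdue yet, no query made
watcher.check_overdue(now=45)              # datagram for X never arrived
```

Only the one missing status is fetched over the point-to-point path, not the
whole job database.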

We might decide that the increased effort required to make this
scheduler decentralized isn't worth it...communications will be more
complex than simply opening a connection to the server when you have
something to say.  However, I think that before we commit to a
particular approach, we should fully evaluate the two approaches, and
make our decision based on which we feel will make the scheduler a
better application.  As you probably noted in my earlier e-mail, I'm
compiling a list of advantages / disadvantages of each approach, which
I'll post on the SourceForge project site.  I'll be listing your
concerns from this e-mail in the "disadvantages" section for a
decentralized approach, as they are certainly valid and pose a problem
which will have to be solved should we choose such an approach.

When you have some time, let's (all) continue to bring up the pros and
cons of both approaches.  The best way for us to determine a course of
action is to fully understand the problems we might encounter when we
implement them.  The more we understand the question, the better the
answer we'll give.  :-)

--
Charles Brian Hill
br...@do...

This one time, at band camp, C. Brian Hill wrote:
>
> 1) Allow the client to have a predetermined list of servers (possibly in a
> configuration file), and have it make TCP connections to each server as
> needed, sending all communications through these servers.  This isn't
>
> 2) Have the client broadcast on its local networks (UDP), and autodiscover
> servers.  This might work, especially, if we make each of the servers know
>
> 3) Implement two methods of communication:  a protocol for the client to
> directly request information from the servers (this would be done using
> unicast TCP connections), and then the usage of IP multicast for the sending
> of event-related information between the servers and the clients.  This
> would keep us from having to use broadcasts (to start up the client, method
> 1 or 2 could be used), and make it easy for all the servers to know about
> events that occur.
>
> 4) Use broadcasts for communications which apply to more than one server,
> and use unicast TCP for point-to-point communication.

Option three is the least heinous, but still suffers from "I missed a
packet, and now I need to request the entire state of the database from
you".

If somebody says "show me the state of the job flow on all servers", we
don't want that to require connecting to dozens of machines to transfer
the entire job database.  Central management is going to be critical
to allowing one person to see the enterprise-wide state of a complex
application.  Assuming we want to support things as complex as, say,
Chronos.

----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Sunday, January 13, 2002 6:53 PM
Subject: Re: [Clockwork-developers] Implementation of a decentralized
schedule

>
> I like Brian's ideas of how to run a decentralized schedule using multicast
> and unicast where appropriate. Some things to consider would be:
>
> When a job finishes, does the system always post a notification about the
> job, or does it try to figure out if there are dependencies at other systems
> and only send to those systems?

The way I'm thinking, it would always post a notification, in case there are
clients running.  In other words, if you (as a client or as an agent on a
node) were listening to the multicast traffic, you'd receive all the
"real-time" information about the scheduler as it happens.

> If job A runs on system X and job B is meant to run on system Y once job A
> finishes, what happens if job A finishes while system Y is down? If system X
> simply sent a multicast notification when job A was finished, we're out of luck.
> If we decide that the systems need to be smart enough to know who'll be
> starting jobs after theirs finish, then system X could resend the message
> until acknowledged by system Y.

I was thinking that, when job A finishes, a multicast notification would be
sent out.  However, if system Y is down, then when it comes back up, for any
jobs that are in an activated state (to borrow an AutoSys term), the agent
would directly query the servers hosting the jobs that are the source of the
dependency.

> We might want to think about breaking up systems into management units, so
> that if a user has 500 systems, but only wants to monitor a schedule that
> affects 50 of them, we don't force him to poll all the systems to get
> the status of that schedule.

Also, I was thinking that jobs could be assigned into logical groups.  How
much easier would it be to manage CHRONOS' schedule if we could load up a
GUI that would, for example, only show the invoicing cycle, or some other
group of jobs?

> I'll have to do some reading about multicast, as I have no experience with
> it.

I really don't have any experience writing multicast software, but from the
work I've done with Tibco I have a general understanding of how it works.
The Tibco software can use either multicast or broadcast, but, in an
enterprise situation, multicast gives it a lot of power (eliminates the need
for the logical routing daemons we use in the CHRONOS implementation).

It seems like the issue of centralization / decentralization is a kind of
fork in the road of our design process.  I propose we try to come up with a
list of possible advantages and disadvantages of each approach, in order to
help us make our decision.  In addition, if people were interested, we could
prototype one or both systems using, say, Java, to give ourselves a feel for
how the implementation might proceed, and to possibly help us uncover
problems we hadn't yet thought of.  I'd be glad to compile the list of pros
/ cons for the group, so just send them to the mailing list, and I'll put
them together.

-Brian

I like Brian's ideas of how to run a decentralized schedule using multicast
and unicast where appropriate. Some things to consider would be:

When a job finishes, does the system always post a notification about the
job, or does it try to figure out if there are dependencies at other systems
and only send to those systems?

If job A runs on system X and job B is meant to run on system Y once job A
finishes, what happens if job A finishes while system Y is down? If system X
simply sent a multicast notification when job A was finished, we're out of luck.
If we decide that the systems need to be smart enough to know who'll be
starting jobs after theirs finish, then system X could resend the message
until acknowledged by system Y.
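
The "resend until acknowledged" idea can be sketched in a few lines of
Python. The link here is simulated in memory, and the names (FlakyLink,
send_until_acked, max_attempts) are hypothetical. A real sender would also
pause between attempts, and the receiver would need to tolerate duplicate
messages.

```python
def send_until_acked(send, message, max_attempts):
    """Call send(message) until it returns True (an acknowledgement).

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if send(message):
            return attempt
    return None


class FlakyLink:
    """Stand-in for system Y: drops the first `down_for` deliveries."""

    def __init__(self, down_for):
        self.down_for = down_for
        self.received = []

    def deliver(self, message):
        if self.down_for > 0:
            self.down_for -= 1
            return False          # no acknowledgement: Y is still down
        self.received.append(message)
        return True               # acknowledged


system_y = FlakyLink(down_for=2)
attempts = send_until_acked(system_y.deliver, "job A: SUCCESS", max_attempts=5)
```

This only works if system X knows that system Y depends on job A, which is
exactly the extra knowledge the paragraph above says the systems would need.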

Using a centralized monitoring system like Brian suggested, updates to the
schedule could be distributed from there as well, since it would already
have knowledge of all the nodes in the schedule. And we could implement
something akin to AutoSys' global variables also, by distributing updates
to those variables the same way we distribute updates to the schedule. (I
think that's a particularly neat feature of AutoSys.)

We might want to think about breaking up systems into management units, so
that if a user has 500 systems, but only wants to monitor a schedule that
affects 50 of them, we don't force him to poll all the systems to get
the status of that schedule.
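
As a rough illustration of management units, the Python sketch below works
out which systems a monitor would have to contact for one schedule. The
schedule names, job names, and host names are made up, and the nested-dict
layout is only an assumption.

```python
# Hypothetical mapping: schedule (management unit) -> job -> host system.
SCHEDULES = {
    "billing": {"extract": "host001", "rate": "host002", "invoice": "host002"},
    "backup": {"dump_db": "host003", "copy_tapes": "host004"},
}


def hosts_to_poll(schedules, schedule_name):
    """Return the sorted list of hosts that run any job in the named schedule."""
    return sorted(set(schedules[schedule_name].values()))


# Monitoring "billing" touches two hosts, no matter how many others exist.
billing_hosts = hosts_to_poll(SCHEDULES, "billing")
```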

I'll have to do some reading about multicast, as I have no experience with
it.

-- 
Joel

----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Saturday, January 12, 2002 4:11 PM
Subject: [Clockwork-developers] Implementation of a decentralized schedule

> Don't get me wrong, I really like the idea of not needing a central server
> process and database (multiplied by two for redundancy), but I can't quite
> figure out how to make this work without them.

How about this:  a "client" (used for monitoring) that connects to each
server in an environment, based on either a predetermined configuration, or
some sort of autodiscovery method.  Yes, the client would need to talk to
each server, but I think some way to centralize monitoring is going to be a
requirement, whether the data is centralized or not.  The way I see it, we
have the following options if we want to decentralize things:

1) Allow the client to have a predetermined list of servers (possibly in a
configuration file), and have it make TCP connections to each server as
needed, sending all communications through these servers.  This isn't
difficult to implement, so I'd suggest that even if we go with a more
complex design, we might want to implement something like this,  even if
only really for testing purposes.

2) Have the client broadcast on its local networks (UDP), and autodiscover
servers.  This might work, especially, if we make each of the servers know
about each of the other servers in its environment.  If you think about it,
the servers will need to know which other servers are part of their
environment just to execute the schedule and to know if servers are down.
The client could broadcast on its local network, and then grab a server list
from one server who responds, then opening TCP connections to communicate
with the individual servers.

3) Implement two methods of communication:  a protocol for the client to
directly request information from the servers (this would be done using
unicast TCP connections), and then the usage of IP multicast for the sending
of event-related information between the servers and the clients.  This
would keep us from having to use broadcasts (to start up the client, method
1 or 2 could be used), and make it easy for all the servers to know about
events that occur.

4) Use broadcasts for communications which apply to more than one server,
and use unicast TCP for point-to-point communication.

Of these, if we want to decentralize the application, I'm most in favor of
the third option:  using multicast to send event-related messages, and TCP
unicast for point-to-point communications.  Here's how it would work in some
scenarios:

Scenario 1:  Job failure.  The server on which the job failed sends a
multicast message to the other servers in its environment (and any clients
which may be running) to inform them of the failure.  From there, servers
could respond to the failure as necessary...running other jobs, displaying
an alert (on the client), automatically notifying administrators, whatever.

Scenario 2:  Job dependency between two servers.  On completion of the first
job, its scheduling daemon sends a multicast message to the group informing
them of the completion of the first job and its status.  The server on which
the second job runs receives this information and begins the dependent job
if the first job was successful.

Scenario 3:  Force Start of a Job.  While monitoring the distributed
application, an administrator wants to start a job on demand.  The client
opens a unicast TCP connection to the server in question and issues the
command to start the job.  That server then sends a multicast message so
that the other servers can act on the starting of that job if necessary.
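
The three scenarios can be walked through with a small in-memory simulation
in Python. The Group class is only a stand-in for a multicast group (every
member sees every event), force_start() stands in for a command that arrived
over a unicast TCP connection, and the event fields and status strings are
invented for this sketch.

```python
class Group:
    """In-memory stand-in for a multicast group: all members see all events."""

    def __init__(self):
        self.members = []

    def publish(self, event):
        for member in self.members:
            member.on_event(event)


class Server:
    def __init__(self, name, group, depends_on=None):
        self.name = name
        self.group = group
        # local job -> (server, job) it waits for
        self.depends_on = depends_on or {}
        self.log = []
        group.members.append(self)

    def finish(self, job, status):
        """Announce the outcome of a local job to the whole group."""
        self.group.publish({"server": self.name, "job": job, "status": status})

    def force_start(self, job):
        """Scenario 3: start on demand, then announce it to the group."""
        self.log.append(f"started {job}")
        self.group.publish({"server": self.name, "job": job, "status": "STARTED"})

    def on_event(self, event):
        source = (event["server"], event["job"])
        for local_job, upstream in self.depends_on.items():
            if upstream != source:
                continue
            if event["status"] == "SUCCESS":      # scenario 2
                self.log.append(f"started {local_job}")
            elif event["status"] == "FAILURE":    # scenario 1
                self.log.append(f"alert: {local_job} blocked by {event['job']}")


group = Group()
a = Server("A", group)
b = Server("B", group, depends_on={"load": ("A", "extract")})

a.finish("extract", "FAILURE")   # scenario 1: B raises an alert
a.finish("extract", "SUCCESS")   # scenario 2: B starts its dependent job
b.force_start("report")          # scenario 3: on-demand start, then announce
```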

I hope this is enough for everyone to see how the idea might work.  Using
plain old TCP will work too, but will require a lot of connections between a
lot of servers.  If we can design two protocols so that we use multicast and
unicast together, each where it makes the most sense to do so, I think we
can accomplish the decentralized design.

I don't think relying solely on point-to-point communications will be very
scalable in a decentralized design.  The reason is that, for an environment
of n servers, we might need up to nC2 = n(n-1)/2 connections.  Using
multicast, each server listens to a single multicast group address, and a
lot of the traffic would go over that group.  When point-to-point connections
are required (which should be relatively infrequent), they can be opened and
then closed.
Multicast might be a little more difficult to implement, but it will reward
us in terms of scalability and resource usage.
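
For a sense of scale, the worst-case pair count is easy to tabulate. This
small Python sketch just evaluates n(n-1)/2; the server counts are arbitrary
examples, not figures from any real environment.

```python
def full_mesh_connections(n):
    """Links needed if every pair of n servers may connect: n choose 2."""
    return n * (n - 1) // 2


for n in (5, 50, 500):
    print(f"{n:>3} servers: up to {full_mesh_connections(n):>6} TCP connections, "
          f"versus {n} multicast group memberships")
```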

So, what does everyone think?

-Brian

This one time, at band camp, Joel Loudermilk wrote:
>
> scheduler, since we would want to distribute the database as well. In that
> design, does anybody have an idea how a GUI monitoring tool would be able
> to see the current state of the schedule? Wouldn't it have to poll lots
> and lots of systems?

Well, there's nothing that says the monitoring couldn't be centralized,
with a server process watching broadcast traffic.

However, I really don't like the idea of using broadcasts, since the
odds of missing a packet and thus having wrong information are high.

Does anyone have any idea of how a decentralized schedule might be
implemented? Shawn D. mentioned that the monitoring could be done with a
central server that all the clients periodically report their job status to.

But I was thinking about how the job flows would work. Assuming that each
system already has its portion of the schedule loaded, it's easy to handle
a job that starts at noon -- the system just runs it at noon. But what
about a job that runs based on the success of a job on another system? If
there's no central scheduler, then the systems would have to report their
statuses to whoever had dependencies on them. This could make updating a
live schedule hairy -- you'd have to make sure all the systems got the
updates pretty quickly.
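
One way to picture "report to whoever has dependencies on them" is to
derive, from the schedule itself, which other systems need to hear about
each job. The Python sketch below does that for a made-up schedule; the job
and system names and the tuple layout are assumptions.

```python
from collections import defaultdict

# Hypothetical schedule: job -> (system it runs on, jobs it waits for).
SCHEDULE = {
    "extract": ("sysX", []),
    "load": ("sysY", ["extract"]),
    "report": ("sysZ", ["load", "extract"]),
    "cleanup": ("sysX", ["extract"]),
}


def systems_to_notify(schedule):
    """For each job, list the other systems hosting a job that depends on it."""
    notify = defaultdict(set)
    for job, (system, upstream_jobs) in schedule.items():
        for upstream in upstream_jobs:
            upstream_system = schedule[upstream][0]
            if upstream_system != system:  # same-system deps need no network
                notify[upstream].add(system)
    return {job: sorted(systems) for job, systems in notify.items()}


targets = systems_to_notify(SCHEDULE)
```

Every edit to a live schedule would change this map on several systems at
once, which is where the update problem described above comes from.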

And even on the monitoring side, the data needs to be very current or an
operator might take an incorrect action. For example, if he thinks one
job is still running, he might delay or kill another job. But if that
information is out of date ...

Don't get me wrong, I really like the idea of not needing a central server
process and database (multiplied by two for redundancy), but I can't quite
figure out how to make this work without them.

-- 
Joel

I agree about NTP.  No sense re-inventing the wheel.  The folks who devised
NTP were undoubtedly pretty bright so we won't make any substantial
improvement, and the environment for our scheduler will likely already have
NTP configured.

If we decentralize the schedule (which I do favor) then we have a couple of
choices in terms of providing a central view.  $Universe does use a central
master server that keeps track of jobs on all systems, but if I remember
correctly it is passive;  it waits to be notified by the clients so little
or no polling occurs.  In this model you won't have an up-to-the-second view
of the state of the schedule, though you could make it pretty close by
adjusting how often updates are sent to the master.  Another way would be to
actively poll all the servers from the master, but this will probably cause
substantial processor and network overhead.  The polling wouldn't have to be
continual, only occurring when someone is viewing the status through the
GUI, but many shops will probably want the GUI open all of the time, so
you'd be taking the hit all the time.
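
The passive-master model can be sketched as follows, in Python. The master
never polls; it only remembers what each agent last reported, so the age of
each entry is bounded by the reporting interval while the agent is healthy.
The class name, the interval, and the "two missed reports means stale" rule
are all assumptions for illustration and are not taken from $Universe.

```python
class PassiveMaster:
    """Master that only records what agents send it (no polling)."""

    def __init__(self, report_interval):
        self.report_interval = report_interval  # seconds between agent reports
        self.last_report = {}                   # server -> (time, {job: status})

    def receive(self, server, now, job_status):
        self.last_report[server] = (now, dict(job_status))

    def view(self, now):
        """Return {server: (age in seconds, statuses, looks_stale)}."""
        result = {}
        for server, (when, statuses) in self.last_report.items():
            age = now - when
            # Arbitrary rule: more than two intervals of silence looks stale.
            result[server] = (age, statuses, age > 2 * self.report_interval)
        return result


master = PassiveMaster(report_interval=10)
master.receive("srv-a", now=100, job_status={"extract": "RUNNING"})
master.receive("srv-b", now=75, job_status={"load": "SUCCESS"})
snapshot = master.view(now=105)
```

Showing the age next to each status lets an operator judge how far to trust
what the GUI displays.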

If we go to a centralized schedule then we have easy access to the current
state.  However, AutoSys illustrates the trade-offs with this sort of
configuration, and I don't know if we want to accept the heavy compute
requirements and single-point failure possibilities that this entails.  If
we do go with a centralized design then we need to include elements in the
design to ensure that the scheduler will cleanly failover or, better yet,
support some sort of hot-standby.
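
A hot standby could be driven by heartbeats along these lines. This Python
sketch is a deliberately simplified assumption (one primary, one standby, a
fixed timeout); it ignores the split-brain case where the primary is alive
but unreachable, which a real design would have to handle.

```python
class Standby:
    """Standby scheduler that promotes itself when heartbeats stop."""

    def __init__(self, heartbeat_timeout):
        self.heartbeat_timeout = heartbeat_timeout  # seconds of silence allowed
        self.last_heartbeat = None
        self.active = False

    def on_heartbeat(self, now):
        self.last_heartbeat = now

    def tick(self, now):
        """Called periodically; returns True once this node is active."""
        if (not self.active and self.last_heartbeat is not None
                and now - self.last_heartbeat > self.heartbeat_timeout):
            self.active = True  # take over scheduling duties
        return self.active


standby = Standby(heartbeat_timeout=15)
standby.on_heartbeat(now=0)
standby.on_heartbeat(now=5)
before = standby.tick(now=12)  # 7 seconds since the last heartbeat
after = standby.tick(now=25)   # 20 seconds since the last heartbeat
```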

Shawn

----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Saturday, January 12, 2002 10:53 AM
Subject: Re: [Clockwork-developers] Architecture of job scheduler

>
> +- On Friday (1/11/2002 14:4) "Charles Brian Hill" <br...@do...> Wrote-
> | A valid question may be whether our tool should be responsible for time
> | synchronization.  The users could always use NTP to keep the clocks on their
> | systems synchronized (to a certain degree).
>
> Anything we would build to synchronize time would probably not be as good
> as NTP, because we've never tried that before and because it's just one
> component of our software.
>
> When I think of the intended users of this software, I see people whose
> scheduling needs have outgrown cron, either because they have too many
> systems or too complicated a schedule, or both. In a medium to large network
> like that, I think it's safe to assume that the administrators have already
> taken care of synchronizing the time.
>
> | If we choose to use
> | a full SQL database, we have two possible routes:  either choose a database
> | that everyone will have to use, or decide to support multiple database
> | platforms.
>
> I'm very much in favor of supporting multiple platforms. Even AutoSys can
> support at least Sybase and Oracle (and I don't know how many more).
>
> | If I wanted to use a full-fledged database, I would want to take advantage
> | of some of the more advanced features of the database engine (let's say we
> | had one that supports transactions, replication, triggers, etc).  To use
> | those features we'd need to go with a single database platform.  Trying to
> | use, for example, JDBC, and let the user choose the database platform would
> | mean that we wouldn't be able to use features that aren't commonly
> | supported.
>
> JDBC does support transactions, and aren't replication and triggers things
> that happen "behind the scenes" that we wouldn't be controlling through
> the database's API?
>
> If so, what about supporting any database for which the user can find a
> JDBC driver and that supports transactions and replication (and whatever
> other features we need to use). To set up the database, the user might
> have to do some database-specific stuff, since defining replication probably
> varies across database platforms, but that's just one-time setup, and we
> could probably even include scripts for the most popular databases.
>
> But I suppose all this is irrelevant if we decide to decentralize the
> scheduler, since we would want to distribute the database as well. In that
> design, does anybody have an idea how a GUI monitoring tool would be able
> to see the current state of the schedule? Wouldn't it have to poll lots
> and lots of systems?
>
> --
> Joel
>
> _______________________________________________
> Clockwork-developers mailing list
> Clo...@li...
> https://lists.sourceforge.net/lists/listinfo/clockwork-developers

I think we can easily accommodate the DB functionality we need by using very
generic SQL commands that would be compatible with most SQL DBs.  We should
have a primary DB that we will use for most of our development and testing.
My personal preference is Postgres because I'm more familiar with it and
because of its reputation for better data integrity.  We'll also need to
have access to other major DBs so we could do regression testing to ensure
compatibility, but I think we can manage that fairly easily, at least for
Oracle, Sybase, and MySQL.
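
As an illustration of "very generic SQL", the Python sketch below sticks to
plain CREATE TABLE, INSERT, and SELECT with common column types. It uses
Python's built-in sqlite3 module purely as a convenient stand-in, not
because SQLite was proposed in this thread, and the table and column names
are invented. The parameter placeholder style ("?" here) is one detail that
does differ between database drivers.

```python
import sqlite3

# Invented schema: only widely supported types and constraints are used.
SCHEMA = """
CREATE TABLE job (
    name    VARCHAR(64) PRIMARY KEY,
    host    VARCHAR(64) NOT NULL,
    status  VARCHAR(16) NOT NULL
)
"""


def load_jobs(conn, rows):
    """Create the table and insert (name, host, status) rows."""
    cur = conn.cursor()
    cur.execute(SCHEMA)
    cur.executemany("INSERT INTO job (name, host, status) VALUES (?, ?, ?)", rows)
    conn.commit()


def jobs_with_status(conn, status):
    """Return the names of all jobs in the given status, sorted by name."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM job WHERE status = ? ORDER BY name", (status,))
    return [row[0] for row in cur.fetchall()]


conn = sqlite3.connect(":memory:")
load_jobs(conn, [("extract", "srv-a", "SUCCESS"),
                 ("load", "srv-b", "RUNNING"),
                 ("report", "srv-b", "RUNNING")])
running = jobs_with_status(conn, "RUNNING")
```

Running the same statements against each supported database is one cheap
form of the regression testing mentioned above.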

Shawn

----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Saturday, January 12, 2002 10:59 AM
Subject: Re: [Clockwork-developers] Architecture of job scheduler

>
> +- On Friday (1/11/2002 14:18) Shawn McMahon <smc...@ei...> Wrote-
> | I think we have to do a free one if we do one at all, because we're a
> | free project.  We're cutting our own throats if the scheduler is free
> | but you have to pay Oracle $100k if you want to use it.  I agree that we
> | can't support every database, at least not in the beginning.
>
> I agree. Regardless of how many database platforms we support, at least
> one of them should be free.
>
> One reason I think we should support multiple platforms is that I can see
> a user who decides not to use our product, because he's got dozens of
> Oracle servers and could easily accommodate another couple of databases
> on them, but we would require him to run MySQL, or Sybase, or something else.
>
> That's something that always irritated me about Bugzilla. It's a great
> tool, but you can only use MySQL. They even wrote the thing in Perl and
> use DBI, so you'd think you could use any platform, but they then went
> and used features specific to MySQL (like enum data types and some other
> stuff).
>
> I think people are about as picky about their favorite database platform
> as they are about their favorite UNIX. So why make a program that runs on
> anyone's favorite UNIX, but only one database?
>
> --
> Joel

+- On Friday (1/11/2002 14:18) Shawn McMahon <smc...@ei...> Wrote-
| I think we have to do a free one if we do one at all, because we're a
| free project.  We're cutting our own throats if the scheduler is free
| but you have to pay Oracle $100k if you want to use it.  I agree that we
| can't support every database, at least not in the beginning.

I agree. Regardless of how many database platforms we support, at least
one of them should be free.

One reason I think we should support multiple platforms is that I can see
a user who decides not to use our product, because he's got dozens of
Oracle servers and could easily accommodate another couple of databases
on them, but we would require him to run MySQL, or Sybase, or something else.

That's something that always irritated me about Bugzilla. It's a great
tool, but you can only use MySQL. They even wrote the thing in Perl and
use DBI, so you'd think you could use any platform, but they then went
and used features specific to MySQL (like enum data types and some other
stuff).

I think people are about as picky about their favorite database platform
as they are about their favorite UNIX. So why make a program that runs on
anyone's favorite UNIX, but only one database?

-- 
Joel

+- On Friday (1/11/2002 14:4) "Charles Brian Hill" <br...@do...> Wrote-
| A valid question may be whether our tool should be responsible for time
| synchronization.  The users could always use NTP to keep the clocks on their
| systems synchronized (to a certain degree).

Anything we would build to synchronize time would probably not be as good
as NTP, because we've never tried that before and because it's just one
component of our software.

When I think of the intended users of this software, I see people whose
scheduling needs have outgrown cron, either because they have too many
systems or too complicated a schedule, or both. In a medium to large network
like that, I think it's safe to assume that the administrators have already
taken care of synchronizing the time.

| If we choose to use
| a full SQL database, we have two possible routes:  either choose a database
| that everyone will have to use, or decide to support multiple database
| platforms.

I'm very much in favor of supporting multiple platforms. Even AutoSys can
support at least Sybase and Oracle (and I don't know how much more).

| If I wanted to use a full-fledged database, I would want to take advantage
| of some of the more advanced features of the database engine (let's say we
| had one that supports transactions, replication, triggers, etc).  To use
| those features we'd need to go with a single database platform.  Trying to
| use, for example, JDBC, and let the user choose the database platform would
| mean that we wouldn't be able to use features that aren't commonly
| supported.

JDBC does support transactions, and aren't replication and triggers things
that happen "behind the scenes" that we wouldn't be controlling through
the database's API?

If so, what about supporting any database for which the user can find a
JDBC driver and that supports transactions and replication (and whatever
other features we need to use). To set up the database, the user might
have to do some database-specific stuff, since defining replication probably
varies across database platforms, but that's just one-time setup, and we
could probably even include scripts for the most popular databases.
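
A minimal sketch of the "any database with a driver and transactions" idea,
written in Python rather than Java. It assumes Python's DB-API stands in for
JDBC and uses the bundled sqlite3 module only as a stand-in driver; the table
names, column names, and function names are hypothetical. The `?` parameter
marker is what sqlite3 accepts; other drivers may use a different marker, so
that part is not portable as written.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with hypothetical scheduler tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (name TEXT PRIMARY KEY, status TEXT)")
    conn.execute("CREATE TABLE events (job_name TEXT, status TEXT NOT NULL)")
    conn.execute("INSERT INTO jobs VALUES ('nightly_backup', 'INACTIVE')")
    conn.commit()
    return conn


def record_status(conn, job_name, status):
    """Update a job's status and log the event as one transaction.

    Only cursor(), execute(), commit() and rollback() are used, which
    every DB-API driver provides, so the connection could come from a
    different database module.
    """
    cur = conn.cursor()
    try:
        cur.execute("UPDATE jobs SET status = ? WHERE name = ?",
                    (status, job_name))
        cur.execute("INSERT INTO events (job_name, status) VALUES (?, ?)",
                    (job_name, status))
        conn.commit()
    except Exception:
        # Either both statements take effect or neither does.
        conn.rollback()
        raise
    finally:
        cur.close()
```

Replication and triggers do not appear here at all, which matches the point
above: they would be configured once on the database side, outside this code.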

But I suppose all this is irrelevant if we decide to decentralize the
scheduler, since we would want to distribute the database as well. In that
design, does anybody have an idea how a GUI monitoring tool would be able
to see the current state of the schedule? Wouldn't it have to poll lots
and lots of systems?

-- 
Joel

This one time, at band camp, Charles Brian Hill wrote:
> If we go centralized, I think we might as well pick a full-fledged SQL
> database platform (hopefully a free one) and standardize on that.  A
> centralized system means we're dependent on a single server, and if that's
> the case, that server is liable to be very, very busy.  If we wanted to have

If we do that, we want to take advantage of ALL the features of a real
database, so we shouldn't pick a MySQL type of program that's fast but
doesn't care about data integrity; we'd look at more of a PostgreSQL,
where it doesn't wring the last bit of possible speed out, but it
assumes your data is precious.

I think we have to do a free one if we do one at all, because we're a
free project.  We're cutting our own throats if the scheduler is free
but you have to pay Oracle $100k if you want to use it.  I agree that we
can't support every database, at least not in the beginning.

----- Original Message -----
From: "Shawn McMahon" <smc...@ei...>
To: <clo...@li...>
Sent: Friday, January 11, 2002 10:03 AM
Subject: Re: [Clockwork-developers] Architecture of job scheduler

> Another concern would be clocks; it's easier to keep one machine synced
> than dozens.

A valid question may be whether our tool should be responsible for time
synchronization.  The users could always use NTP to keep the clocks on their
systems synchronized (to a certain degree).  NTP can generally keep system
clocks within a second of each other, so the question is whether it would be
valuable to keep system clocks more closely synchronized than is possible
with the standard tools like NTP.  As we're talking mostly about batch
processing, it seems relatively unlikely to me that I would run into a
situation where I need to start jobs on multiple systems with that much
accuracy.  Even supposing that were the case, the developer would likely
want to be using real-time programming techniques, which would make the use
of a scheduler like we're discussing out of the question.  In a
decentralized configuration like Joel suggested, I'm thinking it would
probably be enough to have the servers start jobs according to their own
system clocks, and let the system administrators worry about keeping the
system clocks as closely synchronized as they need.  Many routers
participate in NTP time synchronization, and I'd guess that most large
server network installations are configured for NTP as well.  (I even run a
server at my house to keep all of my PCs' clocks in sync.)

> > event processor (to borrow a term from AutoSys). And a SQL-based database
> > would make things easier to work with from a development perspective, but
> > not until now did I realize that it might make the system less attractive
> > for a user, since they would have to manage another database.
>
> No reason we can't make the database a part of the server program, and
> not use a full-blown SQL, is there?

That's certainly an option, but if we could achieve the same, or even
better, performance (one of Autosys' disadvantages) without having a
centralized server to depend on and without having a full-fledged SQL
database that humans have to manage, I'd be all for it.  If we choose to use
a full SQL database, we have two possible routes:  either choose a database
that everyone will have to use, or decide to support multiple database
platforms.

If I wanted to use a full-fledged database, I would want to take advantage
of some of the more advanced features of the database engine (let's say we
had one that supports transactions, replication, triggers, etc).  To use
those features we'd need to go with a single database platform.  Trying to
use, for example, JDBC, and let the user choose the database platform would
mean that we wouldn't be able to use features that aren't commonly
supported.

If we go centralized, I think we might as well pick a full-fledged SQL
database platform (hopefully a free one) and standardize on that.  A
centralized system means we're dependent on a single server, and if that's
the case, that server is liable to be very, very busy.  If we wanted to have
redundant servers be an option, we'd really need some database replication
to do it right, so there's really no alternative to a real database.
However, if we decide to decentralize the application, perhaps we should
investigate using something smaller, like maybe Berkeley DB, that may not be
SQL, but might have enough functionality to manage the jobs for a single
machine, and do a good job at it.  Alternatively, we could choose to store
the data in XML, and load it into the data structures we're using in
whatever language(s) we choose.  Or, if we could find an SQL engine that
doesn't require human management and is small enough to include with builds
of our application, that would be cool too.

Just my $0.02

-Brian

----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Thursday, January 10, 2002 9:46 PM
Subject: [Clockwork-developers] Architecture of job scheduler

> In the limited research I did looking at features of commercial job
> schedulers, I found an interesting idea. I had always taken for granted
> that any distributed job scheduler would need to have a central server
> process to manage the schedule and distribute the work of the schedule to
> other systems. But Orsyp's scheduler (called "Dollar Universe") has no
> central server -- although the schedule is still managed centrally.
>
> Does anyone have an opinion about this kind of design? Like I said, until
> I saw Dollar Universe, I had just assumed there would have to be a central
> event processor (to borrow a term from AutoSys). And a SQL-based database
> would make things easier to work with from a development perspective, but
> not until now did I realize that it might make the system less attractive
> for a user, since they would have to manage another database.

In my experience with Tibco software (namely the system monitoring tool they
call Hawk), I've encountered a similar configuration.  Hawk is a system
monitoring tool, designed to provide and act on statistical information
about systems.  It can be configured centrally, in other words, it's
possible to configure an entire environment using one GUI interface, but it
has no central server.  I'm not sure exactly how this is implemented, except
for the following bits of information:

1) Each server contains its own database of configuration data, and
presumably at least some configuration data for other servers.

2) The system itself is monitored by running an application which listens
for broadcast or multicast traffic from the agents running on each node.

3) Information about, say, a server going down is either discovered by any
client running at the time (Hey, this server went away!), or is reported by
agents on other nodes noticing that the server went away.
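
The monitoring side of points 2 and 3 can be sketched roughly as follows.
This is not how Hawk is implemented (the three points above are all that is
known); it only assumes that each agent periodically announces itself and
that a listener treats prolonged silence as "this server went away". The
class name, the timeout, and the node names are all hypothetical, and the
network transport (broadcast or multicast) is left out.

```python
class HeartbeatMonitor:
    """Track the last announcement heard from each agent."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of last announcement

    def record(self, node, now):
        """Call whenever an announcement from `node` is received."""
        self.last_seen[node] = now

    def missing(self, now):
        """Return the nodes that have been silent longer than the timeout."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

Any listener, whether a GUI client or an agent on another node, could run the
same check, which is one way both kinds of discovery in point 3 could work.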

My opinions on this type of design are as follows:

First off, we'd have to understand that in order to take away the need for a
user to manage a central database, we'd need to find a way to create (and
maintain automatically) individual databases on each node.  Perhaps we could
create an XML schema that would allow us to store the database information
on each server.  We'd want to keep in mind that using this type of design
will likely preclude us from using more advanced database features (such as
triggers, internalized locking mechanisms, etc.).  If we want these
features, and we want to have the information decentralized, we might look
into whether there are any "mini-database" systems available.  I believe I
recall reading some time ago about a version of MySQL that had been made for
implementations such as this -- where you wouldn't necessarily want to have
an actual database server running, but you might want to be able to use a
slimmed-down SQL database.

Given that we can come up with an acceptable database framework to use, I
think decentralizing the database is a pretty good idea.  The client-server
mechanism that Autosys uses seems somewhat flawed to me.  For starters,
what's the point of storing all the jobs that are going to run on server A
on server B?  What happens when server A goes down?  Server B has to deal
with this fact.  If the configurations were decentralized, and server A went
down, it would simply have to worry about recovering its own jobs.  Autosys
suffers in this regard.  We are all familiar with Autosys chase alarms
occurring when a server is down.  These are (in my opinion) a result of the
client-server design -- the server relies on a (pretty much) stateless
client to provide state information, which doesn't persist between reboots.
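
A minimal sketch of a node recovering only its own jobs after a crash,
assuming each agent persists its job states locally and reloads them on
startup. The state names and the `restart_on_recovery` set are hypothetical.

```python
def recover_after_crash(saved_states, restart_on_recovery=()):
    """Rebuild job states from what the agent saved before the crash.

    saved_states: dict of job name -> state as last written to local storage.
    restart_on_recovery: names of jobs that should be started again.
    Returns (new_states, jobs_to_restart).
    """
    new_states = {}
    jobs_to_restart = []
    for name, state in saved_states.items():
        if state == "RUNNING":
            # The job cannot have survived the crash, so it did not finish.
            new_states[name] = "FAILED"
            if name in restart_on_recovery:
                jobs_to_restart.append(name)
        else:
            new_states[name] = state
    return new_states, jobs_to_restart
```

No other machine has to be consulted, which is the contrast with the
client-server case described above.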

If there were no server, the "clients" (probably actually "agents") would
_have_ to maintain all their own relevant state information, so when the
system crashes, recovery would be no more difficult than noting which jobs
were running at the time of the crash, marking them as in a "failed" (or
"terminated") state, and optionally re-starting them.

We'd also want to think about how we'd implement a monitoring application.
How would it know which servers to connect to?  Maybe a configuration file?
How network-intensive will it be to monitor a large network of servers with
such a tool?  This might be one disadvantage of such an architecture.  Would
it be possible to generate an overall visual picture of job flows using this
architecture?

Before I shut up, I wanted to bring up another idea I had:  Could we include
the capability to have a single logical "job" run on multiple machines?  In
other words, considering all the jobs in use for CHRONOS, when we run jobs
in a "distributed" fashion on multiple application servers, could we allow
users somehow to consider those as a unit.  This might make visualization of
the schedule simpler.

OK, I'm done.

-Brian

This one time, at band camp, Joel Loudermilk wrote:
> I don't know exactly how this works, since I've never used their scheduler,
> but it's interesting. What's also interesting is that on their web site, they
> mention this only to say that if a system is isolated from the network,
> it will still run its jobs. I don't know about you, but if one of my systems
> was isolated from the network, I think I would prefer that it not attempt
> to run any jobs, since they most likely wouldn't work.

Another concern would be clocks; it's easier to keep one machine synced
than dozens.

Another would be stopping a job from running; what if the machine on
the other end is too busy to listen to you?

However, I don't know that these are insoluble problems.  For instance,
the "client" end could check with a central time server(s) before running
a job, and be configurable as to whether or not it cared if it couldn't
see it.  If you made that configurable per-job, you could actually remove
cron, instead of just supplementing it.
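
A minimal sketch of that per-job check, assuming the agent can ask a time
server for the current time and gets nothing back when the server is
unreachable. The `require_time_server` flag, the `max_drift_seconds`
tolerance, and the function name are illustrative assumptions, not part of
any existing design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobClockPolicy:
    # Hypothetical per-job settings; the names are illustrative only.
    require_time_server: bool       # True: refuse to run if the server is unreachable
    max_drift_seconds: float = 1.0  # assumed tolerance between local and server clocks


def may_run(policy: JobClockPolicy, local_time: float,
            server_time: Optional[float]) -> bool:
    """Decide whether a job may start, given the result of the clock check.

    server_time is None when the time server could not be reached.
    """
    if server_time is None:
        # Unreachable server: only jobs that do not care may run,
        # which is the cron-like behavior.
        return not policy.require_time_server
    # Server reachable: run only if the local clock agrees closely enough.
    return abs(local_time - server_time) <= policy.max_drift_seconds
```

A job with `require_time_server=False` behaves like a plain cron entry when
the network is down, which is what would let the scheduler replace cron
rather than sit beside it.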

> event processor (to borrow a term from AutoSys). And a SQL-based database
> would make things easier to work with from a development perspective, but
> not until now did I realize that it might make the system less attractive
> for a user, since they would have to manage another database.

No reason we can't make the database a part of the server program, and
not use a full-blown SQL, is there?

In the limited research I did looking at features of commercial job
schedulers, I found an interesting idea. I had always taken for granted
that any distributed job scheduler would need to have a central server
process to manage the schedule and distribute the work of the schedule to
other systems. But Orsyp's scheduler (called "Dollar Universe") has no
central server -- although the schedule is still managed centrally.

I don't know exactly how this works, since I've never used their scheduler,
but it's interesting. What's also interesting is that on their web site, they
mention this only to say that if a system is isolated from the network,
it will still run its jobs. I don't know about you, but if one of my systems
was isolated from the network, I think I would prefer that it not attempt
to run any jobs, since they most likely wouldn't work.

At work, TCS really likes Orsyp's scheduler, particularly because there's no
central server and no databases to manage. There's a lot less overhead for
the scheduler than with AutoSys, or at least it seems.

Does anyone have an opinion about this kind of design? Like I said, until
I saw Dollar Universe, I had just assumed there would have to be a central
event processor (to borrow a term from AutoSys). And a SQL-based database
would make things easier to work with from a development perspective, but
not until now did I realize that it might make the system less attractive
for a user, since they would have to manage another database.

If you want to see the list of features in Orsyp's Dollar Universe, see:
	http://www.orsyp.com/us/dollar_universe.asp

-- 
Joel

I've attempted to distill everyone's comments about features into a first
draft of a requirements document.

It's available on the Sourceforge project page in the DocManager. Here's
a URL to go directly to it:

https://sourceforge.net/docman/display_doc.php?docid=8470&group_id=40038

There are only nine items on the list, but some of them are pretty hefty.
I figured that once we can agree on all the requirements, we could
divide the software into a few (or more) major releases (or milestones,
or whatever you like to call them) and decide which features will be
implemented in which release.

The list is by no means final, so if I've missed or misstated anything,
please let me know.

-- 
Joel

Joel, Sorry you're receiving this twice.  My SMTP server was missing the
"postmaster" alias and SourceForge refused my e-mail.  That should be fixed
now, and I wanted to make sure this message hits the list archives.

----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Thursday, December 20, 2001 9:53 PM
Subject: Re: [Clockwork-developers] Kicking around some requirements

> | 1) Dynamic evaluation of schedule:  There should be a capability within the
> | scheduler to evaluate certain values dynamically.  In other words,
> | definitions of jobs could include, say, some variable, whose value would not
> | be evaluated until runtime.

> Can you give an example of this? I'm having a hard time understanding this
> feature.

As a part of the definition of a job, it should be possible to use
variables.  For example, in specifying what user account should be used to
run a job, it should be possible to, instead of directly specifying an
actual account, specify a variable which would contain the name of the user.
This variable should not be evaluated until runtime, which would allow for
easier and more efficient changes to the schedule by means of manipulating
the variables, rather than the definitions of the jobs themselves.  One idea
I had with regard to this is even allowing the variables to be evaluated on
each client if desired, rather than on the central server.  Does this make
more sense?

A more real-world example is something that we have been wanting to do with
Autosys.  Each release, as the application user changes, the entire schedule
must be reloaded with the new user coded in the job definition.  If we could
use a variable that is evaluated dynamically (as opposed to at the time the
schedule is loaded), we could just change the value of this variable, rather
than needing to reload the schedule.
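
A minimal sketch of the idea, assuming job definitions are simple field/value
mappings and variables are written as `$NAME`. The field names, the variable
name `APP_USER`, and the function name are hypothetical; Python's
`string.Template` is used only as a convenient stand-in for whatever syntax
the scheduler would really adopt.

```python
from string import Template

# Fields that must stay fixed (job name, start time) are never expanded.
EXEMPT_FIELDS = {"name", "start_time"}


def resolve_at_runtime(definition, variables):
    """Return a copy of the job definition with variables filled in.

    Called when the job is about to run, not when the schedule is loaded,
    so a change to `variables` takes effect without touching the definition.
    """
    resolved = {}
    for field, value in definition.items():
        if field in EXEMPT_FIELDS:
            resolved[field] = value
        else:
            # substitute() raises KeyError if a variable is undefined.
            resolved[field] = Template(value).substitute(variables)
    return resolved
```

With a stored definition such as `{"name": "load_orders", "user": "$APP_USER"}`,
changing the value of `APP_USER` at release time changes the account the job
runs under, and the stored definition is never reloaded.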

Let me know if this still doesn't make sense.  We probably don't want to
name it the way I did, because it is a bit confusing.

-Brian

+- On Tuesday (12/11/2001 12:50) "Charles Brian Hill" <br...@do...> Wrote-
| 1) Dynamic evaluation of schedule:  There should be a capability within the
| scheduler to evaluate certain values dynamically.  In other words,
| definitions of jobs could include, say, some variable, whose value would not
| be evaluated until runtime.

Can you give an example of this? I'm having a hard time understanding this
feature.

Another feature I think would be great is the ability to assign machines to
user-defined categories, and to have jobs run on systems that match
certain categories, in addition to scheduling jobs on individual systems.

For instance, if you assign all your Solaris systems to the "Solaris"
category, you can set up your Solaris backup job *once* to run on all the
"Solaris"-type systems, and when you add a new system, simply assigning
it the relevant categories means that your standard jobs will be executed.
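
A minimal sketch of category-based targeting, assuming each machine carries a
set of user-defined categories and a job runs wherever all of its categories
are present. Whether a job should match all of its categories or any one of
them is an open choice; "all" is assumed here. The machine and category names
are hypothetical.

```python
def targets_for(job_categories, machine_categories):
    """Return the machines that belong to every category the job names.

    machine_categories: dict of machine name -> iterable of categories.
    """
    wanted = set(job_categories)
    return sorted(
        machine
        for machine, categories in machine_categories.items()
        if wanted <= set(categories)
    )
```

Adding a new machine with the "Solaris" category to the mapping is then enough
for the Solaris backup job to pick it up on the next evaluation.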

I'm going to try to summarize everyone's wish list of features and put it
somewhere in the documentation section of the project web site sometime
soon.

-- 
Joel

----- Original Message -----
From: "Joel Loudermilk" <jo...@lo...>
To: <clo...@li...>
Sent: Monday, December 10, 2001 9:35 PM
Subject: [Clockwork-developers] Kicking around some requirements

> In doing a little research on the web by looking at commercial job schedulers,
> it seems that there are very different approaches to the architecture of a
> scheduler. But before I started thinking about that too much, I wanted to
> build a list of the things I wanted to see in a scheduler.
>
> Here's what I came up with:

<Snip>

> If you've got any thoughts along this line, please post them. I'm sure
> we've all had enough experience with AutoSys to at least have an opinion
> on what's important and what's not.

I have a couple of requirements I thought I'd throw out to see what everyone
thinks:

1) Dynamic evaluation of schedule:  There should be a capability within the
scheduler to evaluate certain values dynamically.  In other words,
definitions of jobs could include, say, some variable, whose value would not
be evaluated until runtime.  Obviously, certain fields would need to be
exempt from dynamic evaluation: job name, start time, etc., but things such
as user name, executable locations, etc., should be able to be evaluated by
the scheduler at runtime.  This is one feature that Autosys doesn't seem to
have.  It does support dynamic evaluation of variables on the target system
(i.e., you can include an environment variable in the executable name, to be
expanded by the shell when the job is run), but not as such in the
scheduler.  I think this feature would be quite valuable to schedule
administrators.

2) An efficient way of entering schedule changes:  It seems to me that there
should not only be a way to edit jobs directly, through, say, a GUI, but
there should also be a command-line method for entering changes.  Further,
this command-line method should support some syntax for making *changes* to
an existing job, rather than just supporting the replacing of one job
definition with a new one.  Perhaps a declarative language somewhat like SQL
could be used to interactively make changes to and display information about
existing jobs.

3) An efficient way of designing schedules:  Autosys lacks (big-time) a way
to design a schedule, and then translate that design into something the
scheduler can take as input.  A GUI application, perhaps, could allow the
user the capability to graphically design a schedule, and could then export
the schedule in some sensible format.  Much of the tedious part of working
in SchedEx must be the dual maintenance of spreadsheets and the actual
schedule.  This might be a difficult application to write, and it should
likely be placed in the somewhat-distant future as compared to other pieces
of the scheduler, but I think it's important.

4) Truly dynamic data storage:  This probably means that we need to
implement the schedule in some sort of database, but we can't use anything
that (for example) requires a HUP to be sent to daemons when changes are
made.  Direct file-based I/O might be okay too, provided we're interested in
developing and maintaining that code as well.  If we choose a database, I
propose that we make a (possibly feeble) attempt to support more than one
database platform.  Depending on what database features we need to use (e.g.
triggers), we might lock out some database platforms, but even so, our
database interactions ought to be standards (ANSI SQL) compliant if
possible.
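
To make item 2 concrete, here is a minimal sketch of a command that changes
fields of an existing job instead of replacing the whole definition. The
`update job NAME set field=value` syntax is invented for illustration and is
not a proposal for the final language; values containing commas are not
handled.

```python
import re

# Hypothetical syntax: update job <name> set <field>=<value>[, <field>=<value>...]
_CHANGE = re.compile(r"^\s*update\s+job\s+(\w+)\s+set\s+(.+?)\s*$", re.IGNORECASE)


def apply_change(jobs, command):
    """Apply one change command to the matching job and return the result.

    jobs: dict of job name -> dict of field -> value. Only the fields named
    in the command are touched; everything else in the job is kept.
    """
    match = _CHANGE.match(command)
    if not match:
        raise ValueError(f"cannot parse command: {command!r}")
    name, assignments = match.groups()
    if name not in jobs:
        raise KeyError(name)
    updated = dict(jobs[name])
    for pair in assignments.split(","):
        field, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected field=value, got: {pair!r}")
        updated[field.strip()] = value.strip()
    jobs[name] = updated
    return updated
```

For example, `update job backup set user=operator` changes only the user of
the `backup` job and leaves its command and schedule alone.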

--
C. Brian Hill
br...@do...


clockwork-developers — Discussion about development of Clockwork