Re: [Queue-developers] new design details


On Tue, 2005-05-10 at 08:56 -0700, wernerkrebs wrote:
> 
> Two comments.
> 
> 1. Regarding the protocol, GQ's protocols largely
> predated modern RPC standards, such as SOAP and XML. 
> 

I'm not sure any of these are worth their weight in a homogeneous
system. Communication within the GQ system as I have envisioned it is
pretty lightweight, and there is very little structure to the
information. In this case, I think using XML or SOAP for the
communication layer adds complexity, which is contrary to their
general purpose.

[snip]

> I would think some of the current features of the GQ
> TCP/IP protocol would be best done using some sort of
> SOAP implementation. For example, aspects of the
> initial authentication, and querying load information
> would be best done using SOAP.
> 

I don't think SOAP will do much for us regarding authentication. The
authentication here is really simple (to me). SOAP might help for load
information if a lot of detail is returned (say, everything ps would
return). As for authentication, it's already implemented as a simple
challenge handshake (initial authentication):

qd                              qm

auth/register request
(send nonce)          -------->

                                sign nonce with system key,
                      <-------  reply with our own nonce

verify response       --------> 
sign qm nonce

                      <--------  verify response, send session key

If either verification fails, the offended party stops the protocol.
Receipt of the session key indicates to qd that the challenge handshake
protocol completed successfully. After that, all communication between
the qd and qm comes with simple signatures using that key. The
complexity of generating and verifying signatures is already more or
less isolated from the logic of handling the message payload.
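
As a rough illustration of the exchange above, here is a minimal
in-process sketch in Python. It is not the actual qd/qm code: using
HMAC-SHA256 as the "signature", the 16-byte nonces, and the names
SYSTEM_KEY, sign, verify and handshake are all assumptions made for the
example. How the session key is protected on the wire is not covered
here.

```python
import hashlib
import hmac
import secrets

# Hypothetical pre-shared system key; the real key handling is not shown.
SYSTEM_KEY = b"shared-system-key"


def sign(key: bytes, data: bytes) -> bytes:
    """Sign data with key (HMAC-SHA256 is an assumption for this sketch)."""
    return hmac.new(key, data, hashlib.sha256).digest()


def verify(key: bytes, data: bytes, signature: bytes) -> bool:
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign(key, data), signature)


def handshake(qd_key: bytes, qm_key: bytes):
    """Run the four steps in one process; return a session key or None."""
    # Step 1: qd sends the auth/register request with its nonce.
    qd_nonce = secrets.token_bytes(16)

    # Step 2: qm signs qd's nonce and replies with its own nonce.
    qm_signature = sign(qm_key, qd_nonce)
    qm_nonce = secrets.token_bytes(16)

    # Step 3: qd verifies the response, then signs qm's nonce.
    if not verify(qd_key, qd_nonce, qm_signature):
        return None  # qd stops the protocol
    qd_signature = sign(qd_key, qm_nonce)

    # Step 4: qm verifies the response and issues a session key.
    if not verify(qm_key, qm_nonce, qd_signature):
        return None  # qm stops the protocol
    return secrets.token_bytes(32)
```

Once handshake() returns a key, each later message would carry
sign(session_key, payload) and the receiver would call verify() before
looking at the payload, which keeps the signature code separate from the
payload handling.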

> Also, since GQ was written, standard protocols for
> this type of thing have emerged. Look at Apst/Apstd
> system at SDSC (where, ironically, I used to work,
> although not on that project):
> 
> http://grail.sdsc.edu/projects/apst/
> 
> Apst is a meta-daemon for cluster daemons. It doesn't
> currently support starting jobs using GQ, but does
> support starting other (commercial) systems. GQ
> support would be fairly trivial for them to add, if
> they wanted to. SDSC (part of UCSD) receives grant
> money from a firm that makes a GQ-like commercial
> product, so it's not clear if that's a direction they
> want to go in. They do support the commercial product.
> However, the source code is available, so the
> community is free to add support for GQ as well.
> 
> Apst will query each cluster manager (this would be
> similar to the qm program you are proposing) and
> obtain load information via an XML file returned from
> the cluster manager. It will then decide how many jobs
> to start on that particular cluster (which it will
> start using a crude ssh command-line protocol to
> submit the jobs and scp to first transfer the relevant
> files into place). It's up to the cluster manager to
> then distribute the jobs to the cluster nodes.
> 
> Apst, which is C/C++ based (Apstd is available in
> Java) is similar to Nimrod, which is Java-based.
> Source code for all of these is available.

This sounds interesting. Whether GQ becomes my new proposed
implementation, remains as is, or turns into something else altogether,
contributing a "driver" (so to speak) so that this meta system can work
with it would be cool and might broaden the market for us.

> 
> 2. Regarding qm, a division of Texas Instruments
> actually contributed a SQL-based qm in C++. (It would
> require that an SQL database, preferably Open Source
> and free such as Postgresql, be running on a server).
> 

Cool. I was first thinking about job information being managed by a
mysql (or postgres) backend, where the SQL engine would handle things
like atomicity and persistent state information across failures. It
would have been a piece of cake if I had written qm in Perl (I am very
familiar with Perl DBI). The only thing I don't like about this is the
potentially high latency: one (or more) threads (qs) insert into the
job table while another thread (qm) polls the table for new rows.
Perhaps in postgres there is a way to install a trigger or something so
polling is unnecessary. I don't think there is a way to do that in
mysql. qm is actually unnecessary if the qd's can talk to the SQL
engine directly. SQL can handle authentication and atomicity, and the
qd's can just compete for jobs. That's kind of nice. I'm not sure it
will scale well, though: 1000 qd's, each with a persistent TCP
connection to the database, would tie up 1000 connection handlers
(threads or forked processes, depending on the engine) at the database
server.
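
To make the "qd's compete for jobs" idea concrete, here is a small
Python sketch. It uses the standard sqlite3 module purely as a stand-in
for mysql or postgres, and the jobs table, its columns, and the
make_queue, submit and claim helpers are all hypothetical names invented
for the example. The point is only that a conditional UPDATE lets the
database decide which qd wins a job.

```python
import sqlite3


def make_queue() -> sqlite3.Connection:
    """Create an in-memory job table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, command TEXT, owner TEXT)"
    )
    return conn


def submit(conn: sqlite3.Connection, command: str) -> int:
    """What qs would do: insert a new, unowned job row."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO jobs (command) VALUES (?)", (command,))
    return cur.lastrowid


def claim(conn: sqlite3.Connection, qd_name: str):
    """What a qd would do: try to take the oldest unowned job.

    Returns (id, command), or None if there is nothing to claim or
    another qd got there first.
    """
    with conn:
        row = conn.execute(
            "SELECT id, command FROM jobs WHERE owner IS NULL "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The "owner IS NULL" guard makes the claim atomic: if another
        # qd claimed this row in the meantime, no row is updated.
        cur = conn.execute(
            "UPDATE jobs SET owner = ? WHERE id = ? AND owner IS NULL",
            (qd_name, row[0]),
        )
        if cur.rowcount != 1:
            return None
    return row
```

A qd would still have to call claim() in a loop, so this does nothing
about the polling latency; it only shows that the atomicity part can be
left to the SQL engine.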

> This is part of the GQ distribution, but is optional
> and not compiled by default (due to C++ autoconf
> problems at the time, since resolved. Also, users wrote
> to me explaining their preference for a small, simple
> package with peer-to-peer behavior, rather than a
> centralized package with a manager that might crash,
> so the original behavior of GQ remained the default.)
> 
> Before writing a manager from scratch, you might
> want to look at the manager code and documentation
> that TI's subsidiary contributed.

OK, I'll try to have a look. The manager is almost entirely written
already, though, in my haste to flesh out the ideas rolling around in
my head. I'll post a tarball of the code shortly. First I want to add
at least rudimentary support for actually submitting a job to the
system and having it execute. While I'm doing that, we can get a better
feel for who is out there reading this list and what interest there is.

Thanks for your comments, Werner. I appreciate your insights greatly.

Cheers,
Koni