Re: [Queue-developers] new design details
From: Koni <mh...@co...> - 2005-05-11 00:05:38
On Tue, 2005-05-10 at 08:56 -0700, wernerkrebs wrote:
>
> Two comments.
>
> 1. Regarding the protocol, GQ's protocols largely
> predated modern RPC standards, such as SOAP and XML.
>

I'm not sure any of these things are worth their weight in a homogeneous system. The communication within the GQ system as I have envisioned it is pretty lightweight, and there is very little structure to the information. In this case, I think using XML or SOAP for a communication layer adds complexity (in my mind), which is contrary to their purpose in general.

[snip]

> I would think some of the current features of the GQ
> TCP/IP protocol would be best done using some sort of
> SOAP implementation. For example, aspects of the
> initial authentication, and querying load information
> would be best done using SOAP.
>

I don't think SOAP will do much for us regarding authentication. The authentication stuff here is really simple (to me). Perhaps it would help for load information if a lot of detail is returned (say, all the information ps would return).

As for authentication, it's already implemented as a simple challenge handshake (initial authentication):

    qd                                 qm

    auth/register request
    (send nonce)         -------->     sign nonce with system key,
                         <-------      reply with our own nonce
    verify response      -------->
    sign qm nonce        <--------     verify response, send session key

If either verification fails, the offended party stops the protocol. Receipt of the session key indicates to qd that the challenge handshake protocol completed successfully. After that, all communication between the qd and qm comes with simple signatures using that key. The complexity of generating and verifying signatures is already more or less isolated from the logic of handling the message payload.

> Also, since GQ was written, standard protocols for
> this type of thing have emerged.
> Look at Apst/Apstd
> system at SDSC (where, ironically, I used to work,
> although not on that project):
>
> http://grail.sdsc.edu/projects/apst/
>
> Apst is a meta demon for cluster demons. It doesn't
> currently support starting jobs using GQ, but does
> support starting other (commercial) systems. GQ
> support would be fairly trivial for them to add, if
> they wanted to. SDSC (part of UCSD) receives grant
> money from a firm that makes a GQ-like commercial
> product, so it's not clear if that's a direction they
> want to go in. They do support the commercial product.
> However, the source code is available, so the
> community is free to add support for GQ as well.
>
> Apst will query each cluster manager (this would be
> similar to the qm program you are proposing) and
> obtain load information via an XML file returned from
> the cluster manager. It will then decide how many jobs
> to start on that particular cluster (which it will
> start using a crude ssh command-line protocol to
> submit the jobs and scp to first transfer the relevant
> files into place). It's up to the cluster manager to
> then distribute the jobs to the cluster nodes.
>
> Apst, which is C/C++ based (Apstd is available in
> Java) is similar to Nimrod, which is Java-based.
> Source code for all of these is available.

This sounds interesting. Whether GQ becomes my proposed new implementation, remains as is, or turns into something else altogether, contributing a "driver" (so to speak) so that this meta system can work with it would be cool and might broaden the market for us.

>
> 2. Regarding qm, a division of Texas Instruments
> actually contributed a SQL-based qm in C++. (It would
> require that an SQL database, preferably Open Source
> and free such as Postgresql, be running on a server).
>

Cool.
I was first thinking about job information being managed by a MySQL (or Postgres) backend, where the SQL engine would handle things like atomicity and persistent state information across failure. It would have been cake if I wrote qm in Perl (I am very familiar with Perl-DBI). The only thing I don't like about this is the potentially high latency: one (or more) threads insert into the job table (qs) while another thread (qm) polls the table for new rows. Perhaps in Postgres there is a way to install a trigger or something so polling is unnecessary. I don't think there is a way to do that in MySQL.

qm is actually unnecessary if qd's can talk to the SQL engine directly. SQL can handle authentication and atomicity, and qd's can just compete for jobs. That's kind of nice. Not sure it will scale well, though: 1000 qd's, each with a persistent TCP connection to MySQL, would create 1000 forked processes at the database server.

> This is part of the GQ distribution, but is optional
> and not compiled by default (due to C++ autoconf
> problems at the time since resolved. Also, users wrote
> to me explaining their preference for a small, simple
> package with peer-to-peer behavior, rather than a
> centralized package with a manager that might crash,
> so the original behavior of GQ remained the default.)
>
> Before writing a manager from scratch, you might
> want to look at the manager code and documentation
> that TI's subsidiary contributed.

OK, I'll try to have a look. The manager is almost all written already, though, in my haste to flesh out the ideas rolling around in my head. I shall post a tarball of the code shortly. I want to add at least rudimentary support for actually submitting a job to the system and having it execute. While I'm doing that, we can get a better feel for who is out there reading this list and what interest there is.

Thanks for your comments, Werner. I appreciate your insights greatly.

Cheers,
Koni
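
For readers following the thread, the challenge handshake described in this message can be sketched roughly as below. This is only an illustration under stated assumptions, not the posted qd/qm code: the message does not say how nonces are "signed", so HMAC-SHA256 over a shared system key is assumed here; the class names `QueueManager` and `QueueDaemon`, the key value, and the nonce sizes are hypothetical; and the two sides call each other in-process instead of over TCP. How the session key is protected on the wire is not covered by the message and is not modeled.

```python
import hashlib
import hmac
import secrets

# Assumption: qd and qm share a pre-distributed system key.
SYSTEM_KEY = b"example-shared-system-key"


def sign(key: bytes, data: bytes) -> bytes:
    """Assumed signature scheme: HMAC-SHA256 of the data under the key."""
    return hmac.new(key, data, hashlib.sha256).digest()


def verify(key: bytes, data: bytes, signature: bytes) -> bool:
    """Constant-time check that the signature matches the data."""
    return hmac.compare_digest(sign(key, data), signature)


class QueueManager:
    """Hypothetical stand-in for qm."""

    def __init__(self, system_key: bytes):
        self.system_key = system_key
        self.nonce = None
        self.session_key = None

    def handle_register(self, qd_nonce: bytes):
        # Sign qd's nonce with the system key and reply with our own nonce.
        self.nonce = secrets.token_bytes(16)
        return sign(self.system_key, qd_nonce), self.nonce

    def handle_response(self, signed_qm_nonce: bytes):
        # Verify qd's signature over our nonce; stop the protocol on failure.
        if not verify(self.system_key, self.nonce, signed_qm_nonce):
            return None
        self.session_key = secrets.token_bytes(32)
        return self.session_key


class QueueDaemon:
    """Hypothetical stand-in for qd."""

    def __init__(self, system_key: bytes):
        self.system_key = system_key
        self.session_key = None

    def authenticate(self, qm: QueueManager) -> bool:
        # auth/register request: send a fresh nonce.
        nonce = secrets.token_bytes(16)
        signature, qm_nonce = qm.handle_register(nonce)
        # Verify qm's response; stop the protocol on failure.
        if not verify(self.system_key, nonce, signature):
            return False
        # Sign qm's nonce; receiving a session key means success.
        self.session_key = qm.handle_response(sign(self.system_key, qm_nonce))
        return self.session_key is not None


qm = QueueManager(SYSTEM_KEY)
qd = QueueDaemon(SYSTEM_KEY)
authenticated = qd.authenticate(qm)

# After the handshake, each message carries a signature under the session key.
payload = b"load report"
message_signature = sign(qd.session_key, payload)
message_ok = verify(qm.session_key, payload, message_signature)
```

Keeping `sign` and `verify` as separate helpers mirrors the point made in the message that signature handling is isolated from the logic that processes the message payload.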