Re: [Kestrelhpc-developers] Ubuntu 11.04 progress :)

Dear JonAn,

On Fri, Aug 19, 2011 at 7:22 PM, Jon Ander Hernandez <jo...@gm...> wrote:

> 2011/8/19 Eray Ozkural <exa...@gm...>:
> > All right, my pure Ubuntu 11.04 installation adventure goes on, I've now
> > managed to get a login on the slave node.
>
> Nice! :-)
>
> > I think it's time I start figuring out how to register nodes, because
> > I can't log in like this :) Since I've changed too many things I can't
> > reconfigure, and I don't think registering will work like this, I'm
> > going to read that bit of the code and figure out how it gets done. I'm
> > going to have to give the register option manually, I suppose. There
> > was a boot option "register=<name>" IIRC. I'm beginning to like
> > KestrelHPC; this is actually a better approach than Warewulf.
>
> Well, the register system is pretty simple. The
> register/connect/disconnect handling is done by a Python RPC service
> with a pluggable system, so it can be easily extended. When nodes boot
> they run /etc/init.d/kestrel_connect, a Python script that makes an
> RPC call to the frontend. To distinguish between connect and register
> events it simply reads /proc/cmdline and checks for the option
> "register=<name>".
>
> So if you have physical access to the nodes you can manually run
> /etc/init.d/kestrel_connect or modify it easily (because it is really
> simple).
>
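
For reference, the connect-versus-register check described above could be
sketched roughly as follows. This is only an illustration and not the actual
kestrel_connect code: the function names parse_register_option and
read_cmdline are made up here, and the sketch assumes the option appears as a
single whitespace-separated token of the form register=<name>.

```python
def parse_register_option(cmdline):
    """Return the <name> from a register=<name> token, or None if absent.

    `cmdline` is the text of a kernel command line, e.g. the contents
    of /proc/cmdline. Hypothetical helper, not part of KestrelHPC.
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Only an exact "register" key with a non-empty value counts.
        if key == "register" and sep and value:
            return value
    return None


def read_cmdline(path="/proc/cmdline"):
    """Read the kernel command line of the running system."""
    with open(path) as f:
        return f.read()
```

On a node, parse_register_option(read_cmdline()) returning a name would
indicate a register event, and None would indicate a plain connect.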

I do have physical access. I saw that's how it happens: it just modifies the
PXE boot options, and then, as you say, the client issues an RPC. Right now,
though, those RPCs freeze on my system. Have you seen such a thing?
Stopping and starting the daemon on the frontend didn't help. I suspect that
concurrent activation of new nodes was the culprit; the register logic
probably couldn't handle that. I also don't see any POST messages in
/var/log/kestrel_rpc.log (IIRC, or whatever its log was) anymore. This could
mean either that the nodes stopped submitting requests or that the server
process doesn't work. I think it's the latter, because I once saw a timeout
message on a node. I should try to boot the nodes with init=/bin/sh and
issue the kestrel_connect command manually, though I suspect that the
frontend daemon may be broken. Anyway, this is a bug; the RPC system is too
fragile. I need ipython, too :)
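
One way to tell the two cases apart from a node would be a plain TCP probe
with a timeout, so that the check itself cannot freeze. The sketch below is
only an assumption about how one might test this: frontend_reachable is a
hypothetical helper, and the host and port of the RPC daemon have to be
supplied by whoever runs it, since I am not stating them here.

```python
import socket


def frontend_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical diagnostic helper, not part of KestrelHPC.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
    conn.close()
    return True
```

A True result only shows that something is listening on that port; it does
not prove that the RPC handler actually answers requests. A False result
would point at the frontend daemon or the network rather than at the nodes.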

> BTW, I have seen that KestrelHPC 2.0 is pretty broken on Ubuntu 11.04.
> I'm really surprised to see that so many things broke down in this
> release... :-S
>

Uh, just needs some testing and fixing, though of course it's notoriously
difficult to test such software.

Though I would personally give priority to Ubuntu, because it's the most
popular system. It should "just work" on Ubuntu. We've used those terrible
distros before (fedora, centos, mandriva etc.) and I swore never again to
use them! Debian FTW :)

The problem with most cluster toolkits I've tried was that they were
error-prone and not portable enough. It's important for such toolkits to have
a lot of failsafe defaults and to just work on a bunch of standard distros.

Cheers,

-- 
Eray Ozkural, PhD candidate.  Comp. Sci. Dept., Bilkent University, Ankara
http://groups.yahoo.com/group/ai-philosophy
http://myspace.com/arizanesil http://myspace.com/malfunct