From: Murray, P. (HP L. Bristol) <pm...@hp...> - 2007-12-05 13:13:23
Just a couple of corrections:

> We also have an anubisdeployer component that exists to deploy anubis
> itself.

The AnubisDeployer is a SmartFrog deployer that uses Anubis to determine where to deploy things - it doesn't deploy Anubis.

> 3. It is currently dependent on the clocks being synchronised. If NTP
> is not working properly across all nodes, you have problems. This is
> something that could be fixed, as it is less fundamental to the design
> than multicasting.

This statement is not true (but once was). Anubis has two timing protocols:

- one times clock skew, comms delay and scheduling delay, so machines with clocks that drift become untimely relative to one another
- one times comms delay and scheduling delay only, so clocks are allowed to drift.

Paul.

Paul Murray
Hewlett-Packard Laboratories, Bristol

-----Original Message-----
From: sma...@li... [mailto:smartfrog-deve...@li...] On Behalf Of Steve Loughran
Sent: 05 December 2007 12:36
To: Zhang Qian; smartfrog-developer
Subject: Re: [Smartfrog-developer] Questions about SmartFrog

Zhang Qian wrote:
> Hi All,

Hello!

> I have two questions about SmartFrog:
>
> 1. Can SmartFrog be used to synchronize configuration on many hosts? I
> have a cluster which contains hundreds of hosts, so it's very
> important to make config changes synchronously on these hosts. Is it
> possible to write my own SmartFrog component whose responsibility is
> to communicate with my own daemons on all the hosts, and deliver
> config changes to them synchronously?

This is one of those really interesting areas where automated deployment gets both challenging and fun. I'm actually preparing some slides for a talk on that topic for presentation to undergraduates on Friday - though I won't be going into any details on how to get it to work.

Once you have that many hosts you can't assume that any small set of them will remain functional; if you have one or two nodes that are declared managers, you can be sure that eventually they will fail and your entire farm will go offline. If you have a simple hierarchy of deployed components, you can easily create such a failure point.

What you have to do instead is make every node standalone, sharing awareness of their role amongst their peers.

The Anubis component is what we use for this kind of farm management; here are the papers:

http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/components/anubis/doc/

Anubis is a partition-aware version of a tuple space; you insert facts into the space, and the machines talk by a heartbeat. A tick after you insert a fact, it is shared amongst all peers; a tick after that, they know you know that fact, and so on. If there is a partition event - a host goes away, or the network gets split - you get notified, and the tuple space(s) that now exist have to re-evaluate who is in there and who is not.

You bring up every node in the cluster 'unpurposed' and then let them decide - based on what else is live - what they are going to be. The first one could be the resource manager that allocates work to others; then you could bring up some as a database, a filestore, and then finally your application itself.
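That election step is easy to picture in code. The sketch below is a toy simulation of the idea only - RoleElection and its shared map are hypothetical stand-ins, not the real Anubis API, which propagates facts by heartbeat as described above:

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Toy sketch of "unpurposed" nodes choosing roles from shared facts.
     * The map is a hypothetical stand-in for the Anubis tuple space; in
     * the real system facts propagate via heartbeats, and partition
     * events force a re-evaluation of who is present.
     */
    public class RoleElection {
        private final String nodeId;
        private final Map<String, String> sharedFacts; // nodeId -> claimed role

        RoleElection(String nodeId, Map<String, String> sharedFacts) {
            this.nodeId = nodeId;
            this.sharedFacts = sharedFacts;
        }

        /** Claim resource manager if nobody live has it; else be a worker. */
        String chooseRole() {
            synchronized (sharedFacts) {
                String role = sharedFacts.containsValue("resource-manager")
                        ? "worker" : "resource-manager";
                sharedFacts.put(nodeId, role); // publish the claim as a fact
                return role;
            }
        }

        public static void main(String[] args) {
            Map<String, String> space = new HashMap<String, String>();
            for (String id : new String[]{"node1", "node2", "node3"}) {
                System.out.println(id + " -> "
                        + new RoleElection(id, space).chooseRole());
            }
        }
    }

On a partition event you would re-run chooseRole() against whatever facts survived on your side of the split.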
All the anubis components are in the redistributables, and the sf-anubis.rpm, where they can be pushed out to servers. We also have an anubisdeployer component that exists to deploy anubis itself. There's also a nice visualizer application that lets you see what is going on, and test failure recovery by triggering partition events.

There are some limitations of the (current) design I should flag early, two of which are based on the use of multicast IP to share information.

1. It doesn't run on Amazon EC2, as their network doesn't support multicasts.

2. It doesn't like farms where the machines are connected over long-haul networks. Single-site networks are OK, though the more complex the network, the bigger the TTL has to be and the slower you need to make the heartbeat.

3. It is currently dependent on the clocks being synchronised. If NTP is not working properly across all nodes, you have problems. This is something that could be fixed, as it is less fundamental to the design than multicasting.

I'd recommend you have a look at the papers and the examples; talk to us if you want more details or help bringing it up. We have used Anubis successfully for 500+ node deployments.

> 2. I am trying to write a sample SmartFrog component of my own, but I
> find it cannot take effect until sfDaemon restarts. I have written a
> java class which extends PrimImpl and implements Prim, and written a
> related .sf file to describe the config. At first, it works fine, but
> when I change some code in my java class, build it by ant, and run it
> by sfStart, the code I added will not take effect until I restart
> sfDaemon. Did I miss something or make some mistake?

It comes down to whether you are using dynamic code downloading, and JVM classloading quirks. If Java has loaded a class and there are still references to it around somewhere, it tends to keep the old classes loaded.

If you don't use dynamic classloading then yes, you must restart the JVM to get the JARs reloaded. If you do want to dynamically classload, then you must:

- create a list of URLs to the JAR files in the sfCodebase. There's some documentation on this, as you can specify a codebase for parsing the deployment files as a JVM property in the sfDeploy operation, and a codebase for the actual process classloading. All java URLs are supported: http:, https:, ftp:, file:

- for absolute reliability, start your deployments in a new JVM. This is as simple as using the sfProcessName attribute in a component/compound; it tells the runtime to put everything beneath there in a different process. A new process is dynamically created, which will pick up all the changed files. This is something to try if somehow the JVM is hanging on to old class definitions after the components are unloaded. (A descriptor sketch combining both options follows below.)
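As a rough sketch of how the two options above might look together in a deployment descriptor - hedged, since org.example.MyComponent, the JAR URL, the process name, and the exact attribute spellings (sfCodeBase in particular) are placeholders you should check against your SmartFrog version and its docs:

    #include "org/smartfrog/components.sf"

    sfConfig extends Compound {
        // sfProcessName asks the runtime to host everything below in a
        // separate JVM, so restarted deployments pick up fresh classes
        sfProcessName "myAppProcess";

        app extends Prim {
            // hypothetical component class and JAR location
            sfClass "org.example.MyComponent";
            sfCodeBase "http://buildhost/jars/mycomponent.jar";
        }
    }

Running sfStart against a description like this should deploy into a fresh process, loading the component classes from the codebase URL rather than from the daemon's original classpath.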
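Separately, if you want to confirm whether the daemon is still running a stale copy of your class, plain JDK calls will tell you which loader and which JAR a class came from; nothing here is SmartFrog-specific:

    import java.security.CodeSource;

    public class WhereLoaded {
        // Print which classloader and which JAR/directory a class came
        // from, to check whether an old definition is still live.
        public static void report(Class<?> c) {
            CodeSource src = c.getProtectionDomain().getCodeSource();
            System.out.println(c.getName() + " loaded by " + c.getClassLoader()
                    + " from " + (src == null ? "<bootstrap>" : src.getLocation()));
        }

        public static void main(String[] args) throws ClassNotFoundException {
            // pass your component's class name, e.g. org.example.MyComponent
            report(Class.forName(args[0]));
        }
    }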
We have ant tasks to do the start/deploy, tasks that can help set up the codebase. If you want some help setting up your build file, you can send us those bits of your build.xml and I'll take a look at them.

I just want to close by saying yes, big clusters are what SmartFrog can do; it's just that the changing nature of the hardware changes your deployment architecture in interesting ways.

-Steve