From: Jason H. <jh...@ap...> - 2010-06-04 02:14:59
master-master works fine. We used to do it here, but ran out of disk space a while back, turned it off, and then never got around to turning it back on.

More specifically, we have four etch servers total, two in each data center. We had master-master between a primary server in the two data centers, and then master-slave between the primary and secondary within each data center. We ran in that configuration for probably a year. We turned off the master-master replication due to disk space issues but still run the master-slave replication within each data center, so the two data centers are islands with regard to the database. It's mostly just status data, so it doesn't really matter that they aren't aware of each other.

Jason

On Jun 3, 2010, at 2:45 PM, Kenneth Williams wrote:

> Jason,
> I started spreading the hosts out using the wrapper and it's been running fine for a week now. Thanks!
>
> I've got another identical server that I want to balance between anyway for failover. It's currently doing just nventory; I'd like both servers to handle both services. You mentioned master-slave -- are there any known caveats to master-master mysql replication and pointing the etch-server Ruby app to mysql locally on each?
>
> On Thu, May 27, 2010 at 11:14 PM, Jason Heiss <jh...@ap...> wrote:
> Do you randomize the timing of etch running on your clients? I.e. when cron kicks off etch every 10 minutes, do all 200 clients try to contact the etch server simultaneously? You mentioned that you get a load spike when cron kicks off, which makes it sound like you aren't randomizing.
>
> We randomize our clients using the etch_cron_wrapper script that is included with the etch distribution. It seeds the random number generator with the client's hostname, so that each client runs at a consistent time, but the overall client load is randomized over time. We only run etch once an hour, so you'd have to adjust the calculation in the script.
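[The hostname-seeded splay described above can be sketched in a few lines of Ruby. This is a minimal standalone approximation, not the actual etch_cron_wrapper code: the `splay` helper and the use of CRC32 as the hostname-derived seed are illustrative.]

```ruby
require 'socket'
require 'zlib'

MAX_SLEEP = 600  # seconds; matches a 10-minute cron interval

# Map a hostname to a stable offset in [0, MAX_SLEEP). Hashing the
# hostname plays the role of seeding the RNG with it: each client
# always gets the same delay, but the fleet spreads across the window.
def splay(hostname, max_sleep = MAX_SLEEP)
  Zlib.crc32(hostname) % max_sleep
end

# A real wrapper would `sleep splay(Socket.gethostname)` and then
# exec etch; here we just show that the offsets are stable per host.
puts splay('web01.example.com')
puts splay('web02.example.com')
```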
> In our larger datacenter we have over 1200 clients hitting two etch servers. Each runs 20 unicorn processes. Our hardware is older than yours (a total of 6 cores between the two servers). Load average sits around 1.5 on the two servers, with average CPU utilization at about 35%. So your 200 clients running every 10 minutes should be about equivalent to our 1200 running hourly, which means your one box should still be able to handle things, although it would be fairly busy.
>
> (Well, that assumes your configuration repository is roughly the same size as ours. We have etch managing 191 files, based on running `find . -name config.xml | wc -l` in our repo.)
>
> So if you aren't currently randomizing your client timing, I'd encourage you to try that. Even just spreading things out over a few minutes might help. If that's not possible, or you're already doing it, then it might be time to get another server. In our case we run mysqld on the beefier server and point Rails on both servers to that mysqld instance. We also have mysqld running on the other server in slave mode, getting data updates via mysql replication, in case we ever need it.
>
> Jason
>
> PS: As an alternative to modifying etch_cron_wrapper, you can get a more generic version of the script that allows you to specify the max sleep interval on the command line here: https://sourceforge.net/apps/trac/tpkg/browser/trunk/pkgs/randomizer/randomizer
>
> On May 27, 2010, at 11:31 AM, Kenneth Williams wrote:
>
>> I switched from 90 unicorn processes to 60 yesterday and this error went away, but the latency between a config change and that change getting out to all hosts went up to a max of around 5 hours. Each host hits the etch server every 10 minutes; at that 10-minute mark the 5m load goes up to around 10 -- this is on a dual Xeon quad core.
>>
>> Do you have any recommendations on how many processes to run, or some additional details on how one might go about setting this up for a robust production environment?
>> Currently I have an Apache proxy as the entry point to just one server running the etch server with three unicorn processes: one with 8 workers serving as my "gui", the other two with 30 workers each serving connections from the client machines. I switched from sqlite to mysql, which no longer seems to be the bottleneck, and have the etch config files stored on a four-disk SAS 15k RAID 10. Do I need another server, or two? I only have ~200 client machines connecting to it.
>>
>> Thanks!!
>>
>> On Thu, May 27, 2010 at 9:05 AM, Jason Heiss <jh...@ap...> wrote:
>> I haven't seen this before, but it looks like you're probably hitting Ruby's HTTP client default read_timeout of 60 seconds. I.e. reading the response from the etch server took more than 60 seconds, so the client gave up. How's the load on your etch server(s)? How many unicorn (or whatever Ruby app server you use) processes are you running? What XML parser are you using on the server, REXML or LibXML?
>>
>> You could bump up read_timeout to a larger value, but that may just mask the problem. If you want to try adjusting read_timeout, around line 141 in etchclient.rb there's a line that looks like:
>>
>> http = Net::HTTP.new(@filesuri.host, @filesuri.port)
>>
>> Right below that you can add:
>>
>> http.read_timeout = 120
>>
>> Jason
>>
>> On May 26, 2010, at 3:15 PM, Kenneth Williams wrote:
>>
>>> I recently started seeing servers timing out when connecting to etch. The error I am seeing in the logs on the client is new to me:
>>>
>>> /usr/lib/ruby/site_ruby/1.8/facter/ec2.rb:8: warning: method redefined; discarding old can_connect?
>>> /usr/lib/ruby/site_ruby/1.8/facter/ec2.rb:16: warning: method redefined; discarding old metadata
>>> execution expired
>>> /usr/lib/ruby/1.8/net/protocol.rb:133:in `sysread': Connection reset by peer (Errno::ECONNRESET)
>>>     from /usr/lib/ruby/1.8/net/protocol.rb:133:in `rbuf_fill'
>>>     from /usr/lib/ruby/1.8/timeout.rb:62:in `timeout'
>>>     from /usr/lib/ruby/1.8/timeout.rb:93:in `timeout'
>>>     from /usr/lib/ruby/1.8/net/protocol.rb:132:in `rbuf_fill'
>>>     from /usr/lib/ruby/1.8/net/protocol.rb:116:in `readuntil'
>>>     from /usr/lib/ruby/1.8/net/protocol.rb:126:in `readline'
>>>     from /usr/lib/ruby/1.8/net/http.rb:2020:in `read_status_line'
>>>     from /usr/lib/ruby/1.8/net/http.rb:2009:in `read_new'
>>>     from /usr/lib/ruby/1.8/net/http.rb:1050:in `request'
>>>     from /usr/lib/ruby/gems/1.8/gems/etch-3.15.2/lib/etchclient.rb:438:in `process_until_done'
>>>     from /usr/lib/ruby/gems/1.8/gems/etch-3.15.2/bin/etch:99
>>>     from /usr/bin/etch:19:in `load'
>>>     from /usr/bin/etch:19
>>>
>>> Has anyone seen this error before, or know what may be causing it, or some things I can look for? Thanks!
>>>
>>> --
>>> Kenneth Williams
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> etch-users mailing list
>>> etc...@li...
>>> https://lists.sourceforge.net/lists/listinfo/etch-users
>>
>> --
>> Kenneth Williams
>
> --
> Kenneth Williams
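[The read_timeout adjustment suggested earlier in the thread can be tried in isolation. A minimal sketch: the host and port below are placeholders, and 120 seconds is just the example value from the thread. Net::HTTP opens no connection until a request is made, so this is safe to run anywhere.]

```ruby
require 'net/http'

# Placeholder host/port standing in for the etch server.
http = Net::HTTP.new('etch.example.com', 443)

# Net::HTTP's default read_timeout is 60 seconds; a response that
# takes longer than this raises a timeout on the client side.
puts http.read_timeout  # => 60

# Give the server up to two minutes to generate a response.
http.read_timeout = 120
puts http.read_timeout  # => 120
```

Raising the timeout only buys headroom; if responses routinely take over a minute, the server-side load problem still needs fixing.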