[etch-users] problems with "broken" hosts when moving to production use

Hi all!

I've started moving out of my test environment and into production use. As
part of that I've gone from running unicorn with one worker to testing four
workers behind an Apache proxy. Everything seems to work, and it scales
better when deploying to more hosts, as you'd expect, but the etch dashboard
reports hosts as broken with this setup. I've tested it in various
combinations: using just unicorn with multiple workers and no Apache, and
using Apache with multiple masters with only one worker. The only setup I
can get working without hosts being listed as broken is one master with one
worker. Unfortunately, as you could probably guess, it takes an eternity to
push changes using only one worker once you throw in more than a couple of
hosts... Apache as a proxy does not seem to make a difference: accessing
unicorn through its own port or through the Apache proxy makes no noticeable
change in the number of broken hosts. In the end I'd like Apache to proxy to
multiple unicorn masters on different hosts, but right now I'd settle for
being able to have more than one worker running ;)

The list of "broken" hosts steadily increases over the day, at the ten-minute
interval when the etch client kicks off from cron. It starts off with just a
few hosts in a pool of 40 listed as broken and goes up from there by one or
two hosts every ten minutes. It seems to level off at around 25 +/- 3
"broken" hosts, and which hosts are listed alternates at each ten-minute
interval. If I put a change in my etch source directory it does get pushed
out to the hosts, even the ones listed as broken, and if I log into a broken
host and run etch manually it runs fine, except for two warnings. Running
the etch client manually removes the host from the broken list, only for it
to be added back later. I've always ignored the warnings because they did
not seem to have any impact under the previous test setup. They seem to have
cropped up when I upgraded from 3.11 to the ruby gem 3.13 version. There are
two hosts still running the 3.11 client that don't produce these warnings,
but they're also subject to being listed as broken along with the others.
Just in case it's important, the warnings are:

/usr/lib/ruby/site_ruby/1.8/facter/ec2.rb:8: warning: method redefined; discarding old can_connect?
/usr/lib/ruby/site_ruby/1.8/facter/ec2.rb:16: warning: method redefined; discarding old metadata

I don't think this is related to my problem, though. The etch client command
I'm running that produces this is:

/usr/bin/etch --generate-all --server http://etch:8080/
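
The ten-minute cron schedule mentioned above would correspond to a crontab
entry along these lines. This is only a sketch: it assumes cron invokes the
same command directly, the schedule fields are illustrative, and the actual
crontab is not shown in this message.

```crontab
# Hypothetical crontab entry (assumption: cron calls the etch client directly).
# Fields: minute hour day-of-month month day-of-week command
*/10 * * * * /usr/bin/etch --generate-all --server http://etch:8080/
```

With an entry like this, every host would contact the server at the same
minute boundaries, which matches the broken list changing on the ten-minute
mark.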

Otherwise there are no errors produced by the etch client. Port 8080 goes
through the Apache proxy; behind it there is currently only one unicorn
master with 20 workers. I'm running etch client version 3.13 on the nodes,
and on the server I'm running 3.11. Please let me know if you need any
additional details; any help is truly appreciated. Thanks!!

-- 
Kenneth Williams