[Keepalived-devel] Re: Keepalived, UML and TONS of mail

Hi Diego,

Sorry for the delay :/ I am quite busy at the moment...

> I have recently set up a virtual cluster for my own testing purposes.
> It's comprised of 2 directors and 2 realservers.
> 
> The idea is to have active-passive failover on the directors, so that
> service coming from the realservers is not interrupted.
> 
> LVS and healthchecks would ensure that only the available realservers
> are used when requests come in.
> 
> Although I can get it to work, I'm trying to use TCP_CHECK as the
> healthcheck mechanism.

Ok nice.

> I'm exporting two virtual services: http and ssh, and my intention is to
> test the availability of each service by doing the TCP_CHECK to each
> corresponding port.
> 
> However, I get tons of e-mail notifying me that "Realserver xxxx:yy
> DOWN" and shortly thereafter "Realserver xxxx:yy UP"...this goes on and
> on...

Yes, this is because the final service (ssh, http) is flapping. This can
happen when delay_loop is too short and the final service (ssh, http)
gets flooded by healthchecks... but the 30s in your conf sounds good...
Strange...

> I have attached my configuration - maybe you can tell me which of the
> timeouts I have misconfigured, since I'm sure this is abnormal behavior.

The most important one is delay_loop: it drives the healthcheck
frequency... 30 sounds good... a TCP connection_timeout of 10 sounds good
too... hmm, there seems to be a problem with your listener... hmmm... it
may be that the servers are overloaded...

> On a separate note: congratulations on a great product!!  This is coming
> in very handy in planning for the three clusters we need to implement!

thanks :)

> I'll be sure and forward you the details of the implementation so you
> can use it in a "case studies"-type section in the website!

Any documentation is very welcome... This is a part of the website
that needs to be expanded :) So, if you can write something, feel free, I
will publish it on the website.

> Also, is there any documentation that describes the timeouts, and how
> they work in relation to each other?  This I think is important, since
> the existing docs (that I've seen) don't cover this.

Yes... there are only two things to know: delay_loop is the interval (in
seconds) at which the healthcheckers are launched... connection_timeout
is how long a check waits before considering the service failed (which
drives the decision to remove the realserver).
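
To make the relation between the two values concrete, here is a minimal
sketch in Python. It is not Keepalived's actual code: the function names
tcp_check and run_checks are made up for illustration, and the sleep
parameter exists only so the loop can be exercised without waiting. As far
as I know, the keyword inside a TCP_CHECK block in keepalived.conf is
spelled connect_timeout, so that is the name used here.

```python
import socket
import time


def tcp_check(host, port, connect_timeout):
    """Return True if a TCP connection succeeds within connect_timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: the service is considered failed.
        return False


def run_checks(host, port, delay_loop, connect_timeout, rounds, sleep=time.sleep):
    """Run `rounds` checks, one every delay_loop seconds.

    Returns the list of state changes ("UP"/"DOWN"). The first check always
    records the initial state; each later entry is one notification mail.
    """
    transitions = []
    state = None
    for i in range(rounds):
        up = tcp_check(host, port, connect_timeout)
        if up != state:
            transitions.append("UP" if up else "DOWN")
            state = up
        if i < rounds - 1:
            sleep(delay_loop)
    return transitions
```

With delay_loop 30 and a timeout of 10, a listener that answers within 10
seconds yields a single "UP" entry. A list that alternates between "DOWN"
and "UP" is the flapping described above: some connection attempts are
being refused or are timing out on the realserver side.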

Best regards,
Alexandre