We encountered a serious problem in xcatd: when the 'UDP listener' process is killed manually, the 'SSL listener' process uses up all the CPU. It also sometimes happens when restarting xcatd: if the 'UDP listener' process is stopped first, the 'SSL listener' gets into an infinite loop.
Based on my investigation, this issue is caused by the following part of the code in xcatd. When the UDP process has ended, the select->can_read() calls do not return what we expected. Any ideas?
1134 until ($quit) {
1135 $SIG{CHLD} = \&ssl_reaper; #set here to ensure that signal handler is not corrupted during loop
1136 while ($udpwatcher->can_read(0)) { # take an intermission to broker some state requests from udp traffic control
[in certain cases the program always enters this loop once the UDP process has stopped]
1137 eval {
1138 my $msg = fd_retrieve($udpctl);
1139 if ($msg->{req} eq 'get_client_count') {
1140 store_fd({'clientfudge'=>$sslfudgefactor, 'sslclientcount' => $sslclients}, $udpctl);
1141 } elsif ($msg->{req} eq 'set_fudge_factor') {
1142 $sslfudgefactor = $msg->{fudge};
1143 store_fd({'clientfudge'=>$sslfudgefactor, 'sslclientcount' => $sslclients}, $udpctl);
1144 }
1145 };
1146 }
1147 if (@pendingconnections) {
1148 while ($listenwatcher->can_read(0)) { #grab everything we can, but don't spend any time waiting for more
1149 $tconn = $listener->accept;
1150 unless ($tconn) { next; }
1151 push @pendingconnections,$tconn;
1152 }
1153 } else {
1154 $bothwatcher->can_read(30); [this returns immediately once the UDP process has stopped]
1155 if (not $listenwatcher->can_read(0)) { # check for udpctl messages since
1156 # we have no listen to hear
1157 next;
1158 }
There were two issues that showed the same symptom: ssl_listener used up all the CPU.
Issue 1:
Refer to line 1136 of xcatd. Sometimes when the udp_process is stopped gracefully (e.g. on an xcatd restart), the select.can_read(udp_fd) loop in the ssl_listener process becomes an infinite loop. In this case the udp_fd stays in the read-ready state (a descriptor whose peer has closed is reported readable, and reads on it return end-of-file), so the select always returns success. Commit f00a23aa5818ba19a91bdb6fbae57a7e31608808 fixed this issue by checking whether the udp_fd is broken after reading from it.
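To illustrate Issue 1 outside xcatd, here is a minimal sketch. It assumes a plain pipe as a stand-in for $udpctl; the names $reader, $writer, $watcher and $peer_gone are hypothetical and do not come from xcatd, and this is not the code of the actual fix. It only shows that IO::Select keeps reporting a handle as readable after the peer has closed, and that a zero-byte read is one way to tell that the peer is gone.

```perl
use strict;
use warnings;
use IO::Select;

# Hypothetical stand-in for $udpctl: the writer end plays the UDP listener.
pipe(my $reader, my $writer) or die "pipe failed: $!";
my $watcher = IO::Select->new($reader);

close($writer);    # simulate the UDP listener exiting

# After the peer closes, every poll reports the handle as readable,
# which is why a "while (can_read(0))" loop never ends.
my $ready_polls = 0;
for (1 .. 3) {
    my @ready = $watcher->can_read(0);
    $ready_polls++ if @ready;
}

# A read that returns 0 bytes means end-of-file: the peer is gone.
my $bytes = sysread($reader, my $buf, 1);
my $peer_gone = (defined $bytes && $bytes == 0) ? 1 : 0;

print "ready polls: $ready_polls, peer gone: $peer_gone\n";
# prints "ready polls: 3, peer gone: 1"
```

In a loop like the one at line 1136, a check of this kind after the read would give the loop a reason to stop polling the dead channel.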
Issue 2:
Refer to line 1154 of xcatd. When the udp_process is killed abruptly (e.g. kill -9), select.can_read(udp_fd and ssl_listener) returns immediately instead of waiting 30s. Without that wait, the outer loop spins with no sleep, which is the infinite loop. Commit 921b7d5ea3553ca6de61fe67caa6f11ee033366e detects the broken udp_fd and removes it from the fd set of $bothwatcher.
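The idea behind the Issue 2 fix can be sketched in the same way. This is an assumption-laden illustration, not the code from the commit: two pipes stand in for $udpctl and the SSL listen socket, and the names $ctl_r, $listen_r, $both, $before and $after are hypothetical. The timeout is shortened from 30s to 0.3s so the example finishes quickly. Note that the one-byte sysread probe would consume real data, so it is only safe here because nothing is ever written to the pipe.

```perl
use strict;
use warnings;
use IO::Select;
use Time::HiRes qw(time);

# Hypothetical stand-ins: $ctl_r plays $udpctl, $listen_r plays the listen socket.
pipe(my $ctl_r,    my $ctl_w)    or die "pipe failed: $!";
pipe(my $listen_r, my $listen_w) or die "pipe failed: $!";

my $both = IO::Select->new($ctl_r, $listen_r);
close($ctl_w);    # simulate kill -9 of the UDP listener

# With the dead handle in the set, can_read returns at once.
my $t0     = time;
my @ready  = $both->can_read(0.3);
my $before = time - $t0;

# Probe only the control channel; on end-of-file, stop watching it.
for my $fh (@ready) {
    next unless fileno($fh) == fileno($ctl_r);
    my $n = sysread($fh, my $buf, 1);
    $both->remove($fh) if defined $n && $n == 0;
}

# With the dead handle removed, can_read blocks for the full timeout again.
$t0    = time;
@ready = $both->can_read(0.3);
my $after = time - $t0;

printf "handles watched: %d, first wait: %.2fs, second wait: %.2fs\n",
    $both->count, $before, $after;
```

After the removal, only the listen handle remains in the set, so the wait at line 1154 would act as the intended sleep again.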