
#518 Memcached failure causes cascading database failures

workingwiki
open
None
5
2014-04-03
2014-03-31
Lee Worden
No

When the background jobs code can't reach memcached, it sleeps 1 second and polls again, forever. So we accumulate more and more Apache processes, all waiting and polling. Keeping all these Apache processes alive keeps their database connections open, I think, so pretty soon no new database connections can be opened and all wiki pages start to fail.

Discussion

  • Lee Worden

    Lee Worden - 2014-03-31
    • status: open --> closed
     
  • Lee Worden

    Lee Worden - 2014-03-31

    Solution: quit after polling 100 times. Done.

     
  • Lee Worden

    Lee Worden - 2014-03-31
    • status: closed --> open
     
  • Lee Worden

    Lee Worden - 2014-03-31

    Actually, let's quit after about 3 polls, because nobody wants to wait 100 seconds. It would be best to have it return an error if it never gets the job listing. Right now it just returns an empty list, which means the list of background jobs in the browser will temporarily vanish and reappear later. So I'll reopen this ticket until I get that done right.

     
  • Lee Worden

    Lee Worden - 2014-04-03

    Looks like the cascade of failed db connections may actually have been caused somehow by a disk failure on one of the cluster nodes, but regardless, it's good that I fixed this issue.

     
  • Lee Worden

    Lee Worden - 2014-04-03
     

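For reference, the approach discussed above (poll a few times, then report an error instead of an empty list) might look roughly like the sketch below. It is written in Python purely for illustration and is not the actual WorkingWiki code; `fetch_job_listing`, `JobListingUnavailable`, and the `fetch` callable are hypothetical names, and `fetch` is assumed to return `None` when memcached can't be reached.

```python
import time


class JobListingUnavailable(Exception):
    """Raised when the job listing can't be fetched within the retry budget."""


def fetch_job_listing(fetch, max_attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Poll `fetch` at most `max_attempts` times, then give up with an error.

    `fetch` is a hypothetical callable: it returns the job listing, or None
    when the cache can't be reached.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        listing = fetch()
        if listing is not None:
            # An empty list is a valid answer: it means "no jobs", not "failed".
            return listing
        if attempt < max_attempts:
            # Wait before the next poll, but not after the final attempt.
            sleep(delay_seconds)
    raise JobListingUnavailable(f"no job listing after {max_attempts} attempts")
```

Raising an exception keeps "no jobs" and "couldn't ask" distinct, so the caller can show an error rather than letting the job list vanish. Bounding the loop also means a process stuck on a dead cache is released after a couple of seconds instead of holding its database connection indefinitely.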