Memcached failure causes cascading database failures

Status: Beta

Brought to you by: worden

#518 Memcached failure causes cascading database failures

Milestone: workingwiki

Status: open

Owner: Lee Worden

Labels: None

Priority: 5

Updated: 2014-04-03

Created: 2014-03-31

Creator: Lee Worden

Private: No

When the background jobs code can't reach memcached, it sleeps 1 second and polls again, forever. So we accumulate more and more Apache processes, all waiting and polling. Keeping all these Apache processes alive keeps their database connections open, I think, so pretty soon no new database connections can be opened and all wiki pages start to fail.

Discussion

Lee Worden - 2014-03-31

status: open --> closed

Lee Worden - 2014-03-31

Solution: quit after polling 100 times. Done.

Lee Worden - 2014-03-31

status: closed --> open

Lee Worden - 2014-03-31

Actually let's quit after like 3 times, because nobody wants to wait 100 seconds. It would be best to have it return an error if it never gets the job listing. Right now it's just returning an empty list, which means in the browser the list of background jobs will temporarily vanish, and reappear later. So I think I'll reopen this ticket until I get that done right.
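
For illustration only, here is a minimal Python sketch of the bounded-polling idea described above. It is not the actual WorkingWiki code: `fetch_job_listing`, `JobListingUnavailable`, and the `fetch` callable are hypothetical names, and it assumes `fetch` returns `None` when the cache can't be reached. The point is that the caller gets an explicit error after a few attempts rather than an empty list.

```python
import time


class JobListingUnavailable(Exception):
    """Raised when the job listing could not be fetched within the retry limit."""


def fetch_job_listing(fetch, max_attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Poll `fetch` until it returns a listing, giving up after `max_attempts`.

    `fetch` is a hypothetical callable that returns the job listing, or None
    when the cache cannot be reached.
    """
    for attempt in range(1, max_attempts + 1):
        listing = fetch()
        if listing is not None:
            return listing
        # Sleep only between attempts, not after the last one.
        if attempt < max_attempts:
            sleep(delay_seconds)
    # Report the failure explicitly instead of returning an empty list,
    # so the caller can tell "no jobs" apart from "cache unreachable".
    raise JobListingUnavailable(f"no job listing after {max_attempts} attempts")
```

With these defaults a request waits at most about 2 seconds before failing, and a genuinely empty job list (`[]`) is still returned as a normal result.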

Lee Worden - 2014-04-03

Looks like the cascade of failed db connections may actually have been caused somehow by a disk failure on one of the cluster nodes, but regardless, it's good that I fixed this issue.

Lee Worden - 2014-04-03
