
#69 sr_audit Autosave our lives...

Milestone: Sarra Beta
Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2018-06-02
Created: 2017-07-20
Creator: psilva
Private: No

When a client is in trouble (goes down for an extended period), its queue on the broker will grow.
A manual intervention is now possible (sr_shovel -save -queue ) and then

We have had multiple incidents where queues built up to the extent that the broker was compromised: the memory fills up, and swapping and all manner of sadness ensue. It would be great if sr_audit could see when a queue is getting too big, and automatically drain it with save_queue.
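To make the request concrete, here is a minimal sketch of the decision sr_audit would need to make. It is not sr_audit code: the function name, the queue names, and the threshold value are all hypothetical, and how queue depths are obtained from the broker is left out.

```python
def queues_to_autosave(depths, max_messages):
    """Return the names of queues holding more than max_messages, deepest first.

    depths: dict mapping queue name -> number of messages waiting (hypothetical input).
    max_messages: depth above which a queue is considered "too big" (assumed threshold).
    """
    over = [(name, depth) for name, depth in depths.items() if depth > max_messages]
    # Drain the deepest queue first, since it threatens broker memory the most.
    over.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in over]


# Illustrative queue names and depths only.
depths = {"q_client_a": 120, "q_client_b": 250000, "q_client_c": 0}
print(queues_to_autosave(depths, max_messages=100000))
```

Each name returned would then be handed to whatever performs the save, which today is the manual sr_shovel intervention.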

A Nagios check should detect when an autosave has been done on a queue and tell the folks who are monitoring, but that's all: no intervention required. Not sure when we would restore, but at least it would avoid disaster.
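A rough sketch of what such a check could report, assuming something else supplies the list of queues that were autosaved (how that list is discovered is not specified here, and the function name is hypothetical). The only firm convention used is the standard Nagios plugin exit codes, where 0 means OK and 1 means WARNING.

```python
# Standard Nagios plugin exit codes.
OK, WARNING = 0, 1


def check_autosave(autosaved_queues):
    """Return (exit_code, status_line) for a Nagios-style check.

    autosaved_queues: iterable of queue names that have been autosaved
    (hypothetical input; discovering them is outside this sketch).
    """
    names = sorted(autosaved_queues)
    if names:
        # Informational only: the save already happened, nobody needs to act.
        return WARNING, "WARNING - autosave performed on: " + ", ".join(names)
    return OK, "OK - no queues autosaved"


code, line = check_autosave(["q_client_b"])
print(code, line)
```

A real plugin would print the status line and exit with the returned code. WARNING rather than CRITICAL is a design choice here, matching the intent that no intervention is required.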

Discussion

  • psilva

    psilva - 2017-07-29

    At some point we check by stopping the shovel and waiting to see if the queue builds up.
    If it does, go back to save mode. If the queue remains at zero, then start re-flowing the messages that were queued... easily said... needs baking.
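The probe described above could be sketched as follows. This is only an illustration of the decision, under the assumption that the queue depth is sampled a few times while the shovel is stopped; the function name and mode labels are hypothetical.

```python
def decide_after_probe(depth_samples):
    """Pick the next mode after pausing the shovel and sampling the queue depth.

    depth_samples: queue depths observed while the shovel was stopped.
    Returns "restore" if the queue stayed at zero (the client is consuming again),
    otherwise "save" (the queue is building up, so keep draining to disk).
    """
    if not depth_samples:
        # No evidence either way: stay conservative and keep saving.
        return "save"
    if all(depth == 0 for depth in depth_samples):
        return "restore"
    return "save"


print(decide_after_probe([0, 0, 0]))
print(decide_after_probe([0, 12, 40]))
```

How long to wait and how many samples to take before trusting the result is exactly the part that "needs baking".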

     
  • psilva

    psilva - 2017-12-15

    The automated re-flowing of messages was implemented as part of the overhaul of the retry methods in the 2.17.12 series. It is now implemented, but we haven't made it to large-scale deployment yet. Once it is fully deployed (so we know it is really working), we can close this bug.

     
  • psilva

    psilva - 2018-06-02

    This is all taken care of: in the 2017 versions by manual analyst intervention using save/restore, and in the 2018 versions by the automated retry logic.

    Either way: Done and Dusted!

     
  • psilva

    psilva - 2018-06-02
    • status: open --> closed
     