#47 Concurrent backups broken

1.x
open
nobody
None
2014-05-28
2014-05-28
No

The way the concurrent backups are implemented is fundamentally unsafe.

If barman crashes for some reason (such as simply running out of disk space) after running pgespresso_start_backup(), it will leave the server with a "leaked" reference counter, and the server will never exit backup mode. There needs to be protection against this, similar to how pg_basebackup does it. Just bypassing the safety features in the backend is very dangerous, especially for something as important as backups.

This needs to be fixed in coordination with pgespresso (I've filed a couple of bugs there as well). Proper reference counting needs to be implemented. I suggest a look at the implementation of the base backup protocol for an idea of how this could be done. Basically, you need server-side protection against a disconnected client, so that the server automatically terminates the backup when the client goes away.
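
To illustrate the kind of protection meant here, the following is a minimal toy model in Python. It is not PostgreSQL, pgespresso, or barman code: the class BackupServer, its methods, and the session names are all hypothetical and exist only for this sketch. The idea is that the reference count is tracked per client session, so that a disconnect can release whatever that session still holds instead of leaving the counter leaked.

```python
class BackupServer:
    """Toy model of a server that reference-counts concurrent backups.

    Hypothetical illustration only; not the actual backend implementation.
    """

    def __init__(self):
        # Maps a session id to the number of backups that session has open.
        self._active = {}

    def in_backup_mode(self):
        # The server is in backup mode while any session holds a reference.
        return sum(self._active.values()) > 0

    def start_backup(self, session_id):
        self._active[session_id] = self._active.get(session_id, 0) + 1

    def stop_backup(self, session_id):
        if self._active.get(session_id, 0) == 0:
            raise RuntimeError("session has no backup in progress")
        self._active[session_id] -= 1
        if self._active[session_id] == 0:
            del self._active[session_id]

    def on_disconnect(self, session_id):
        # Server-side safety net: drop every reference the session still
        # holds, whether or not the client ever called stop_backup().
        self._active.pop(session_id, None)


def demo():
    server = BackupServer()
    server.start_backup("client-a")
    server.start_backup("client-b")
    server.stop_backup("client-b")    # client-b finishes normally
    server.on_disconnect("client-a")  # client-a dies without stopping
    return server.in_backup_mode()
```

In this model demo() ends with the server out of backup mode even though client-a never stopped its backup. With a single global counter and no disconnect hook, the same sequence would leave the count at 1 until the server is restarted.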

Discussion

  • Gabriele Bartolini

    Hi Magnus,

    thanks for posting your bug submissions, both on Barman and pgespresso. And thanks for finally using Barman! ;)

    We have already started investigating them and are working on a general fix for pgespresso (in order to make it more robust).

    However, I might have misunderstood your bug report, since in my understanding the crash scenario you describe is not possible in case of an "out of disk space" error, or when Barman is terminated by SIGTERM; in these cases Barman manages the exception and correctly terminates the backup procedure.

    A backup would be left "hanging" only in case of abrupt shutdown, i.e. SIGKILL or similar, or in case of network issues.

    Given that the only use case for concurrent backups is to take backups from a standby, the workaround in such cases would be to restart the standby.

    In any case, backups produced via concurrent backup are valid and not broken.

    Ciao,
    Gabriele

     
  • Magnus Hagander - 2014-05-28

    I did see it in an out of space scenario, but that may have been a secondary effect. What did happen was that barman died/crashed, at which point it of course disconnected from the server, and left a "hanging" refcount. Unfortunately the output from barman wasn't collected, so it might have been an unhandled exception, or a SIGKILL or something like that - I'm not sure.

    If barman catches "normal" exceptions that's obviously good - but there is always a risk of "abnormal exceptions". (Another example that I guess would produce the same problem - barman loses the libpq connection to the server while it's running, and therefore can't call the stop backup?)

    And yes, the backup itself should be OK - but the internal state on the server is corrupt. Yes, restarting the server will bring it back, but that's a workaround and not a solution. But yes, it does mean that the issues in pgespresso are the bad ones, and the barman issue is the minor one :)

    Why would there not be a use case for concurrent backup on the master?

     
