#32 Command list-backup fails intermittently during cron when processing multiple databases

Damon Snyder

The list-backup command intermittently fails when it is executed while a cron command is running in parallel in a configuration with multiple databases. Sample log output can be found here.

I believe the issue is caused by IO buffering in the cron method in cli.py. In this loop:

with lockfile(filename) as locked:
    if not locked:
        yield "ERROR: Another cron is running"
        raise SystemExit, 1
    servers = [Server(conf) for conf in barman.__config__.servers()]
    for server in servers:
        for lines in server.cron(verbose=True):
            yield lines

server.cron() calls backup_manager.cron(), which does:

with self.server.xlogdb('a') as fxlogdb:
    # this is the only write to fxlogdb
    fxlogdb.write("%s\t%s\t%s\t%s\n" % (basename, size, time, self.config.compression))

The call to self.server.xlogdb('a') uses the context manager interface to manage the lock file around the open file. The file is opened with open(xlogdb, mode), which, by omitting the third (buffering) parameter, means IO on the file will be block-buffered.
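
For reference, this is how the third (buffering) argument to open() behaves under Python 2, which is what the quoted code targets; the file name here is just an example:

f = open("xlog.db", "a")        # buffering omitted: block-buffered, writes
                                # accumulate in a userspace buffer first
f = open("xlog.db", "a", 1)     # line-buffered: flushed at each newline
f = open("xlog.db", "a", 0)     # unbuffered: every write goes straight to the OS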

There is no closing or flushing of the opened xlogdb. If I'm not mistaken, once the for server in servers loop in cron completes, buffered data can still be waiting to be written to the xlogdb when the lock is relinquished and another loop begins.

I think this may be the cause of these errors. When I investigate them while no cron task is running, list-backup completes successfully and the xlog.db appears normal.


  • Damon Snyder

    I opened merge 5 as a possible solution to the problem.
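
    The core of the change is to flush the xlogdb and sync it to disk before the lock around it is released, so the next reader never sees a partially written line. A minimal sketch of the approach (simplified, not the literal patch; the flock-based lockfile below is a stand-in for barman's own lock helper):

    import fcntl
    import os
    from contextlib import contextmanager

    @contextmanager
    def lockfile(path):
        # minimal flock-based stand-in for barman's lockfile helper
        fd = os.open(path, os.O_CREAT | os.O_RDWR)
        try:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                yield True
            except IOError:
                yield False
        finally:
            os.close(fd)

    @contextmanager
    def xlogdb(path, mode='r'):
        # simplified stand-in for server.xlogdb()
        with lockfile(path + '.lock') as locked:
            if not locked:
                raise SystemExit(1)
            f = open(path, mode)
            try:
                yield f
            finally:
                if mode != 'r':
                    f.flush()             # push the userspace buffer to the kernel
                    os.fsync(f.fileno())  # and the kernel cache down to disk
                f.close()                 # closed while the lock is still held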

  • Damon Snyder

    I'm pretty confident that merge 5 fixes the issue. I can't reproduce the problem manually with the code change, and the errors haven't resurfaced in a week in my production environment.

    To try and manually reproduce the issue, I'm running list-backup repeatedly for 20+ servers while cron runs continuously (processing incoming WALs) in the background. Previously this caused a fatal error in list-backup a non-zero percentage of the time.
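
    The harness is roughly the following (the server names are placeholders for my configuration):

    import subprocess

    servers = ["server%d" % i for i in range(1, 21)]  # 20+ configured servers

    # hammer list-backup while "barman cron" runs in the background;
    # any non-zero exit status is the intermittent failure
    while True:
        for name in servers:
            if subprocess.call(["barman", "list-backup", name]) != 0:
                raise SystemExit("list-backup failed for %s" % name)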

    Intuitively this makes sense: if you buffer writes to a file and relinquish the lock before the buffer is flushed to disk, another reader will eventually encounter a partially written line.
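
    A standalone illustration of the effect (not barman code; the path and record are made up):

    import os

    path = "xlog.db.demo"
    open(path, "w").close()       # start with an empty file

    writer = open(path, "a")      # default block buffering, as in xlogdb()
    writer.write("000000010000000000000001\t16777216\t1364207539.0\tbzip2\n")

    # the record is still sitting in the writer's userspace buffer, so a
    # concurrent reader sees nothing yet (or, if a record straddles a
    # buffer boundary, only the first part of a line)
    print(os.path.getsize(path))  # 0

    writer.flush()
    print(os.path.getsize(path))  # 53 -- now the record is on disk
    writer.close()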

    I wonder if pgbarman should use an open-source, file-based database with ACID compliance (e.g. SQLite) for storing critical metadata. The data being stored is nearly as critical as the data being backed up: if you can't read your backup data, you can't restore your critical data. Thoughts?
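
    For example, the xlog records could live in a small SQLite table where each insert is atomic; the schema below is purely illustrative:

    import sqlite3

    conn = sqlite3.connect("xlog.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS xlog (
                        basename    TEXT PRIMARY KEY,
                        size        INTEGER,
                        time        REAL,
                        compression TEXT)""")

    # each record lands atomically: a concurrent reader sees either the
    # old state or the whole new row, never a partially written line
    with conn:
        conn.execute("INSERT INTO xlog VALUES (?, ?, ?, ?)",
                     ("000000010000000000000001", 16777216, 1364207539.0, "bzip2"))

    for row in conn.execute("SELECT basename, size FROM xlog"):
        print(row)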


    • status: open --> closed
    • assigned_to: Giulio Calacoci