#18 delete breaks because of missing wal-files

1.x
closed
delete wal (2)
2014-01-29
2012-11-12
No

Hi,

Today I discovered that my barman server is running low on disk space. I checked the log file and saw that many of my delete commands are failing because of missing WAL files or *.history files.

My environment looks like this: I have 24 databases configured to ship their WAL files to the barman server. Barman takes two base backups a day, and before it starts a new backup it deletes the oldest one. That last step is what seems to fail: it deletes the base backup, but while deleting the old WAL files it raises exceptions like this:
2012-11-12 03:29:08,227 root ERROR: ERROR: Unhandled exception. See log file for more details.
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/barman/cli.py", line 339, in main
p.dispatch(pre_call=global_config, output_file=_output_stream)
File "/usr/lib/python2.7/dist-packages/argh/helpers.py", line 381, in dispatch
return dispatch(self, args, *kwargs)
File "/usr/lib/python2.7/dist-packages/argh/helpers.py", line 270, in dispatch
for line in lines:
File "/usr/lib/python2.7/dist-packages/argh/helpers.py", line 354, in _execute_command
for line in result:
File "/usr/lib/python2.7/dist-packages/argh/helpers.py", line 342, in _call
for line in result:
File "/usr/lib/python2.7/dist-packages/barman/cli.py", line 259, in delete
for line in server.delete_backup(backup):
File "/usr/lib/python2.7/dist-packages/barman/backup.py", line 366, in delete_backup
self.delete_wal(name)
File "/usr/lib/python2.7/dist-packages/barman/backup.py", line 747, in delete_wal
os.unlink(os.path.join(hashdir, name))
OSError: [Errno 2] No such file or directory: '/var/lib/barman/fr0/wals/000000030000043C/000000030000043C00000032'

or:

2012-11-12 03:29:08,002 root ERROR: ERROR: Unhandled exception. See log file for more details.
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/barman/cli.py", line 339, in main
p.dispatch(pre_call=global_config, output_file=_output_stream)
File "/usr/lib/python2.7/dist-packages/argh/helpers.py", line 381, in dispatch
return dispatch(self, args, *kwargs)
File "/usr/lib/python2.7/dist-packages/argh/helpers.py", line 270, in dispatch
for line in lines:
File "/usr/lib/python2.7/dist-packages/argh/helpers.py", line 354, in _execute_command
for line in result:
File "/usr/lib/python2.7/dist-packages/argh/helpers.py", line 342, in _call
for line in result:
File "/usr/lib/python2.7/dist-packages/barman/cli.py", line 259, in delete
for line in server.delete_backup(backup):
File "/usr/lib/python2.7/dist-packages/barman/backup.py", line 366, in delete_backup
self.delete_wal(name)
File "/usr/lib/python2.7/dist-packages/barman/backup.py", line 747, in delete_wal
os.unlink(os.path.join(hashdir, name))
OSError: [Errno 2] No such file or directory: '/var/lib/barman/en0/wals/00000002.history'

When it fails like this, it does not continue erasing the WAL files, so my disk space is getting really low right now. For example:

du -ch --max-depth=1
788M ./000000030000039D
1.4G ./000000030000039E
1.5G ./000000030000039F
1.5G ./00000003000003A0
1.5G ./00000003000003A1
1.5G ./00000003000003A2
1.6G ./00000003000003A3
1.6G ./00000003000003A4
1.5G ./00000003000003A5
1.6G ./00000003000003A6
1.6G ./00000003000003A7
1.6G ./00000003000003A8
1.6G ./00000003000003A9
1.6G ./00000003000003AA
578M ./00000003000003AB

There should only be two directories, but it fails at the first one, so it doesn't delete the rest.

Right now I don't know why the WAL files are missing. I still have to investigate that, but finding the cause would not solve the immediate problem anyway. Is there any way to delete the old directories manually, or would that break everything?

I also don't understand why it needs the 00000002.history file. Most of the servers were already on timeline 4 when I set up barman, so I would assume that 00000002.history is not needed.

On the other hand, if WAL files are missing that should not be missing, that's a problem, OK. But if the WAL directory is so old that a restore would never need it, it should get deleted. Or am I wrong?

regards

Discussion

  • To give you some more information: I was just looking at my backup server to check whether I could manually delete some directories. Then I stumbled upon this:

    /var/lib/barman/es0/wals# ls -ltrh
    total 384K
    drwxr-xr-x 2 barman barman 12K Nov 4 04:05 00000003000002BB
    drwxr-xr-x 2 barman barman 12K Nov 5 04:01 00000003000002BC
    drwxr-xr-x 2 barman barman 12K Nov 6 01:43 00000003000002BD
    drwxr-xr-x 2 barman barman 12K Nov 6 12:49 00000003000002BA
    drwxr-xr-x 2 barman barman 12K Nov 7 01:08 00000003000002BE
    drwxr-xr-x 2 barman barman 12K Nov 7 22:59 00000003000002BF
    drwxr-xr-x 2 barman barman 12K Nov 8 21:49 00000003000002C0
    drwxr-xr-x 2 barman barman 12K Nov 9 20:34 00000003000002C1
    drwxr-xr-x 2 barman barman 12K Nov 10 18:57 00000003000002C2
    drwxr-xr-x 2 barman barman 12K Nov 11 17:33 00000003000002C3
    drwxr-xr-x 2 barman barman 12K Nov 12 16:20 00000003000002C4
    drwxr-xr-x 2 barman barman 12K Nov 13 09:30 00000003000002C5
    -rw-r--r-- 1 barman barman 157K Nov 13 09:30 xlog.db
    -rw-r--r-- 1 barman barman 0 Nov 13 09:33 xlog.db.new

    How is it possible that "2BA" comes after "2BD" and before "2BE"?

    Am I right that I just have to drop the lines from xlog.db and then remove the faulty directories? I just want a quick, maybe dirty, solution, because right now I can't continue with the backups while disk space is running low.

    regards
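
Before removing anything by hand, it may help to list which WAL files named in xlog.db are actually absent from disk. The sketch below is not part of Barman; `wal_path`, `read_names` and `find_missing_wals` are hypothetical helpers. It assumes the layout visible in the tracebacks above (a segment lives in a subdirectory named after the first 16 characters of its name, while .history files sit directly in wals/) and, as a further unverified assumption, that the first whitespace-separated field of each xlog.db line is the file name.

```python
import os


def wal_path(wals_dir, name):
    """Map a WAL file name to the location seen in the tracebacks."""
    if name.endswith(".history"):
        # History files are stored directly under wals/
        return os.path.join(wals_dir, name)
    # Segments are grouped in a directory named after their first 16 characters
    return os.path.join(wals_dir, name[:16], name)


def read_names(xlog_db):
    """Yield the first field of each non-empty line.

    Assumption: that field is the WAL file name.
    """
    with open(xlog_db) as fh:
        for line in fh:
            fields = line.split()
            if fields:
                yield fields[0]


def find_missing_wals(wals_dir, names):
    """Return the names whose expected file does not exist on disk."""
    return [n for n in names if not os.path.exists(wal_path(wals_dir, n))]
```

Calling `find_missing_wals(wals, read_names(os.path.join(wals, "xlog.db")))` with `wals` set to a server's wals directory would return the catalogue entries that have no file behind them. The check only reads, so it is safe to run before deciding what to clean up.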

     
    • labels: --> delete wal
    • status: open --> accepted
    • assigned_to: Gabriele Bartolini
     
  • Bernhard, thanks for opening the ticket. I have submitted a patch that does not stop the barman delete command when a WAL file is not found. It simply throws a warning in the log file and continues.

    Please check commit: 66904be
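
The behaviour described above (warn and continue when a file is already gone) can be sketched roughly as follows. This is not the code from the commit, only an illustration of the idea; `unlink_if_present` and the log message are hypothetical.

```python
import errno
import logging
import os

_logger = logging.getLogger(__name__)


def unlink_if_present(path):
    """Remove path; log a warning and return False if it does not exist."""
    try:
        os.unlink(path)
    except OSError as exc:
        # Only tolerate "No such file or directory"; re-raise anything else
        if exc.errno != errno.ENOENT:
            raise
        _logger.warning("Expected WAL file %s not found during delete", path)
        return False
    return True
```

Only ENOENT is swallowed, so a permission problem or an I/O error would still stop the delete instead of being hidden behind a warning.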

     
  • Works so far, thanks a lot.
    I have not tested the restore procedure yet.

     
    • status: accepted --> closed