Hello,
Barman has run flawlessly for years... literally 2 years without a reboot. All of a sudden, it got stuck with the following error:
Archiving segment 1 of 1 from file archival: star/00000001000004A80000002F
Error: 00000001000004A80000002F is already present in server star. File moved to errors directory.
I can confirm that this file is present twice on the system, with different contents:
find . | grep 00000001000004A80000002F
./errors/00000001000004A80000002F.20191203T073401Z.duplicate
./wals/00000001000004A8/00000001000004A80000002F
ls -lah ./errors/00000001000004A80000002F.20191203T073401Z.duplicate ./wals/00000001000004A8/00000001000004A80000002F
-rw------- 1 barman barman 16M Dec 3 08:33 ./errors/00000001000004A80000002F.20191203T073401Z.duplicate
-rw------- 1 barman barman 27K Jan 9 2018 ./wals/00000001000004A8/00000001000004A80000002F
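The size difference above already suggests the two copies differ. A byte-level check could be sketched as below; `same_wal` is a hypothetical helper name, not a Barman command, and it only relies on the standard `cmp` utility.

```shell
#!/bin/sh
# Hypothetical helper (not part of Barman): compare two copies of a WAL segment.
same_wal() {
    # cmp -s is silent and exits 0 only when both files are byte-identical;
    # any other exit status (difference or read error) is reported as "different".
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

For example: `same_wal ./errors/00000001000004A80000002F.20191203T073401Z.duplicate ./wals/00000001000004A8/00000001000004A80000002F`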
Moving the files out of the errors directory doesn't fix the issue: it reappears when the next WAL file gets pulled. I have tried barman switch-wal --force --archive star as well, but it doesn't help.
The detailed logs on the Barman side are:
Setup:
Barman running on Debian 8.11
barman --version
2.3
PostgreSQL running on Debian 9.9
Streaming replication setup:
cat /etc/postgresql/10/main/conf.d/archive.conf
wal_level = replica
archive_mode = on
archive_command = 'rsync -a %p barman@xxx:/var/lib/barman/star/incoming/%f'
archive_timeout = 60
Note that I have increased max_wal_size to 10GB in an attempt to fix the issue. When checking for anomalies, I saw this in the server logs:
2019-12-03 02:13:09.875 CET [27780] HINT: Consider increasing the configuration parameter "max_wal_size".
2019-12-03 02:13:23.247 CET [27780] LOG: checkpoints are occurring too frequently (14 seconds apart)
2019-12-03 02:13:23.247 CET [27780] HINT: Consider increasing the configuration parameter "max_wal_size".
2019-12-03 02:13:37.152 CET [27780] LOG: checkpoints are occurring too frequently (14 seconds apart)
2019-12-03 02:13:37.152 CET [27780] HINT: Consider increasing the configuration parameter "max_wal_size".
2019-12-03 02:13:51.370 CET [27780] LOG: checkpoints are occurring too frequently (14 seconds apart)
2019-12-03 02:13:51.370 CET [27780] HINT: Consider increasing the configuration parameter "max_wal_size".
But it doesn't seem to fix the issue.
Any idea on how to troubleshoot the issue? Or do you need more information?
Thank you :)
Hi,
What you see means that at some time in the past, your Postgres server restarted the WAL sequence. That usually happens when you use pg_upgrade or replace your cluster in another way without changing the name of the server in Barman. After some time you reach an old WAL name, hence the error you see.
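To illustrate why such a collision is possible: a WAL segment file name is 24 hex digits (timeline, log, segment, 8 digits each) and carries nothing that identifies the cluster that produced it, so a new cluster on the same timeline can reuse old names. A minimal sketch that decodes a name follows; `wal_parts` is a hypothetical helper, not a Barman or PostgreSQL command.

```shell
#!/bin/sh
# Hypothetical helper: split a 24-hex-digit WAL segment name into its parts.
wal_parts() {
    name=$1
    # Characters 1-8: timeline ID, 9-16: log number, 17-24: segment number
    tli=$(printf '%s' "$name" | cut -c1-8)
    log=$(printf '%s' "$name" | cut -c9-16)
    seg=$(printf '%s' "$name" | cut -c17-24)
    # printf %d accepts a 0x prefix and converts the hex value to decimal
    printf 'timeline=%d log=%d segment=%d\n' "0x$tli" "0x$log" "0x$seg"
}
```

For the segment in this report, `wal_parts 00000001000004A80000002F` prints `timeline=1 log=1192 segment=47`.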
Version 2.10 of Barman (it will be officially out on 5 December 2019) has a new mechanism to prevent this kind of issue.
The best way to fix the issue in your installation is to move the whole /var/lib/barman/star directory to a different name and restart from scratch with a new backup.
I had overlooked the creation date of the WAL files... one of them is more than 1 year old, and we actually have a bunch of them.
For some reason they were not purged by Barman - given the age of those files, I don't know if this was linked to a Barman upgrade or to some other reason.
Deleting those old WAL files has fixed the issue.
As far as I'm concerned this ticket can be closed.
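For anyone hitting the same problem, stale WAL files like the ones described above can be located by modification time before deciding what to delete. This is only a sketch: `old_wals` is a hypothetical helper, the 365-day threshold is an assumption taken from the "more than 1 year old" remark, and it lists files without removing anything.

```shell
#!/bin/sh
# Hypothetical helper: list files under a Barman wals directory that were
# last modified more than a given number of days ago. Nothing is deleted.
old_wals() {
    # $1 = wals directory, $2 = age threshold in days
    find "$1" -type f -mtime +"$2" | sort
}
```

For example: `old_wals /var/lib/barman/star/wals 365`. Review the list against your existing backups before removing any file.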