Hi,
I've encountered a problem with the restore of a PostgreSQL cluster. Here is some detailed information:
OS: Debian stable (wheezy)
Barman version: 1.3.2 (from the PostgreSQL repository)
PostgreSQL version: 9.2
Can the bug be reproduced: yes
My current setup is:
- one PG 9.2 in master mode, accepting R/W requests. It is reachable through a VIP (managed by Pacemaker/Corosync). The server does WAL archiving through rsync (in daemon/server mode).
- one PG 9.2 in standby mode, replicating the master asynchronously, with hot standby enabled. The server is set up to archive WAL to the same server as the master, but since it is currently in standby_mode, it is not doing anything. It can also be woken up using a trigger file.
- one backup server running Barman. A backup configuration called "pgcluster" has been created. Its SSH and conninfo settings point to the VIP, currently on the master.
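For reference, the Barman server definition for such a setup could look roughly like this - a minimal sketch, where the VIP hostname "pg-vip" and the file path are illustrative assumptions, not values from the actual config:

```ini
; /etc/barman.d/pgcluster.conf - hypothetical sketch
[pgcluster]
description = "PG 9.2 cluster reached through the Pacemaker VIP"
; Both settings point at the VIP, so Barman always talks to the current master.
ssh_command = ssh postgres@pg-vip
conninfo = host=pg-vip user=postgres
```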
Here are the steps:
- take a backup of the master. WAL files begin to accumulate on the backup server.
- shut down the master.
- At that point, the VIP switches to the slave.
- Wake up the slave (using the trigger file or by promoting it).
- The slave wakes up and increases its timeline (say, from 1 to 2). It can now handle R/W connections, and it also starts sending WAL to the backup server, using the same rsync connection as the master. The WAL files accumulate in a new folder, because of the timeline ID change.
- I decide to reinstall the master from the backup, using the "recover" command. So, on the backup server:
barman $> barman recover --remote-ssh-command "ssh postgres@pgmaster" pgcluster 20140609T182623 /var/lib/postgresql/9.2/main
The recover begins, but fails while transferring the last WAL file. For example, the master crashed on WAL "0000000100000000000000FC", and the slave took over with "0000000200000000000000FC" as its first WAL file.
Here's the error :
Processing xlog segments for pg92cluster1
        00000002000000010000001D
        00000002000000010000001E
Starting remote restore for server pg92cluster1 using backup 20140609T182623
Destination directory: /var/lib/postgresql/9.2/main
Copying the base backup.
Copying required wal segments.
EXCEPTION: [Errno 2] No such file or directory: '/tmp/barman_xlog-18Lz2a/0000000100000000000000FC'
The recover should send all WAL files from both timeline IDs, plus the timeline.history file, to the PostgreSQL server, so that it can restore up to the current old-slave/new-master point.
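As an aside, the timeline switch is visible directly in the WAL segment names: the first 8 hex digits of the 24-digit name are the timeline ID. A quick shell sketch, using the segment names from the report above:

```shell
# A WAL segment name is 24 hex digits: timeline (8) + log id (8) + segment (8).
wal_before_crash=0000000100000000000000FC   # archived by the old master
wal_after_promote=0000000200000000000000FC  # archived by the promoted slave

# Extract the timeline ID (first 8 characters).
echo "$wal_before_crash"  | cut -c1-8   # prints 00000001 (timeline 1)
echo "$wal_after_promote" | cut -c1-8   # prints 00000002 (timeline 2)
```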
I know that I can restore using the new-master server, but I want to :
- test the worst-case scenario (two servers crashing in a row)
- avoid bothering the new master if possible.
Thanks !
Sorry for double-posting the same content in the ticket body. I would be happy to correct it, but apparently I don't have the rights to do so.
Sorry about that.
Thank you for submitting this bug report. We have placed it in the current backlog.
Thanks !
I'm eagerly awaiting this fix / new version!
Cheers !
I've followed all the steps written by the user, but I'm unable to replicate the error.
I've tried different PostgreSQL and Barman versions, but I'm still unable to replicate it. Could you help me reproduce this error?
Hi,
as Giulio was saying, we'd need a reproducible test case.
I think a bit more information would be extremely useful, for example:
In general, it is good practice to take a new full base backup immediately after the switch.
Thanks,
Gabriele
OK, I'll try to provide additional information ASAP, but I have other things to do right now.
Will try to do this in the upcoming days.
Thanks,
You know what? I can't reproduce it either.
I started a master, ran pgbench on it, synced a slave to it, crashed the master and woke up the slave (which continued to archive WAL to the same folder on the Barman server).
Then I ran:
and no issue: all WAL files were transferred correctly, without the crash I had encountered before.
Now, a few comments:
Output:
Definitely something fishy here.
My aim is to restore the master using the backup taken on the master, plus all the WAL files from both timelines 1 and 2, to recover the master to a point very close to the "old-slave / new-master" state.
Thanks,
Hi,
we have released 1.3.3-alpha.1 and we believe the issue could be resolved now. Could you please try that version and let us know?
Thanks,
Gabriele
Tested today; it's a bit better, as there are no more crashes, but the WAL files are not correctly copied.
Thanks,
Hi,
OK, the weekend helped me see things more clearly. I tried again with the "--target-tli" option, and saw that you are using a temporary folder, "barman_xlog", to store all WAL files for both timelines 1 and 2.
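For completeness, here is the shape of the recover invocation with the target timeline added. The base command and backup ID are the ones from this ticket; the "--target-tli 2" value is my assumption, matching the timelines described above:

```shell
barman recover --remote-ssh-command "ssh postgres@pgmaster" \
    --target-tli 2 \
    pgcluster 20140609T182623 /var/lib/postgresql/9.2/main
```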
This works as expected: all WAL files are copied from the backup server to the PG server, and are replayed during the recovery phase.
I added the "trigger_file", "standby_mode" and "primary_conninfo" settings to the generated recovery.conf file, in order to start replication. The new server starts, recovers all the files from the "barman_xlog" folder, then connects to the master to start replication.
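The resulting recovery.conf would look roughly like this - a sketch, where the first three lines are the kind of entries Barman generates for the recovery, and the connection values and trigger path are illustrative assumptions:

```
restore_command = 'cp barman_xlog/%f "%p"'
recovery_end_command = 'rm -fr barman_xlog'
recovery_target_timeline = 2
# added by hand to re-attach to the new master:
standby_mode = 'on'
primary_conninfo = 'host=pg-vip user=replication'        # illustrative conninfo
trigger_file = '/var/lib/postgresql/9.2/main/trigger'    # illustrative path
```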
So, it's working fine !
The only minor issue I have is that the "barman_xlog" folder is not destroyed, because I start replication right after the initial WAL recovery, which keeps the engine in recovery state. So after the initial recovery, I have to edit the recovery.conf file, remove the now-useless options "restore_command", "recovery_end_command" and "recovery_target_timeline", and manually delete the "barman_xlog" folder.
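A sketch of that manual cleanup, as described above. The recovery.conf contents and the relative paths are illustrative; the point is just stripping the three options and removing the staging folder:

```shell
# Sample recovery.conf like the one described above (contents assumed).
cat > recovery.conf <<'EOF'
restore_command = 'cp barman_xlog/%f "%p"'
recovery_end_command = 'rm -fr barman_xlog'
recovery_target_timeline = 2
standby_mode = 'on'
primary_conninfo = 'host=pg-vip user=replication'
EOF

# Drop the three now-useless recovery options, keep the streaming ones.
sed -i -e '/^restore_command/d' \
       -e '/^recovery_end_command/d' \
       -e '/^recovery_target_timeline/d' recovery.conf

# Remove the leftover WAL staging folder by hand.
rm -rf barman_xlog
```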
Thanks for your help on this subject !
I can confirm that this is now working correctly - tested with PG 9.3.5 & Barman 1.3.3; almost identical setup as described in the ticket. Using the --target-tli option, I'm able to restore from:
- the last base backup from the previous master,
- WAL files from the master before the "crash",
- WAL files from the new master, with the new timeline,
with all data present.