Barman for PostgreSQL / Tickets / #48 Restore command failing on a master-slave configuration

Sterfield - 2014-06-09

Sorry to have double post the same content in the ticket body. I would be happy to correct it but apparently, I don't have the right to do so.

Sorry about that.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Description has changed:

Diff:

--- old
+++ new
@@ -10,30 +10,6 @@
 Postgresql versions : 9.2
 Can the bug be reproduced : yes

-
-How to reproduce:
------------------
-
-My current setup is :
- - one PG 9.2 in master mode, accepting R/W requests. It's accessible using a VIP (managed by Pacemaker / Corosync). The server is doing WAL archiving through rsync (in daemon / server mode).
- - one PG 9.2 in standby mode, replicating the master (async mode), and in hot standby. The server is setup to do WAL archiving on the same server as the master, but as it's currently in standby_mode, it's not doing anything. The server is also able to wake up using a trigger file.
- - one backup server, running barman. A backup configuration has been created, called "pgcluster". The SSH and the conninfo are pointing to the VIP, currently on the master.
-
-Here's the steps :
- - take a backup on the master engine. The WAL will begin to stack on the backup server.
- - shut down the master.
- - At that point, the VIP is switching to the slave.
- - Wake up the slave (using the trigger file or by promoting it).
- - The slave wake up and increase its timeline (let's say from 1 to 2). It's able to handle R/W connections. It's also beginning to send WAL to the backup server, using the same rsync connection as the master. The WAL are stacking in a new folder, because of the timeline ID modification.
- - I decide to reinstall the master from the backup, using the "recover" command. So, on the backup :
-    barman $> barman recover --remote-ssh-command "ssh postgres@pgmaster" pgcluster 20140609T182623 /var/lib/postgresql/9.2/main
-
-Current results:
-----------------
-
-The recover begins, but failed on the transfer of the last WAL file. For example, if the master crashed on the WAL "0000000100000000000000FC", and the slave take over with first WAL "0000000200000000000000FC", the recover will crash on the "Hi,
-
-I've encountered a problem with the restore of a Postgresql Cluster. Here's some detail information on this :

 Information:
 ------------

Gabriele Bartolini - 2014-06-10

labels: --> recovery, wal files

status: open --> accepted

assigned_to: Marco Nenciarini
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Gabriele Bartolini - 2014-06-10

Thank you for submitting this bug request. We place it in the current backlog.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sterfield - 2014-06-13

Thanks !

I'm eagerly waiting for this correction / new version !

Cheers !

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Giulio Calacoci - 2014-07-15

I've followed all the steps written by the user, but actually I'm unable to replicate the error.
I've tried using different postgres and barman versions, but still i'm unable to replicate the error. Could you help me to replicate this error?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Gabriele Bartolini - 2014-07-16

Hi,

as Giulio was saying, we'd need a reproducible test case.

I think a few more info would be extremely useful, for example:

did you specify any timeline ID for recovery?

could you please cut/paste the snippet from the xlogdb of that barman server around the switch WAL file?

could you also send us the output of 'barman diagnose'?

In general, it is good practice to take a new full base backup immediately after the switch.

Thanks,
Gabriele
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sterfield - 2014-07-16

OK, I'll try to put additional information asap, but I've other things to do right now.

Will try to do this in the upcoming days.

Thanks,

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sterfield - 2014-07-17

You know what ? I can't reproduce it either.

I started a master, pgbench on it, sync a slave on the master, crash the master and wake up the slave (which will continue to archive WAL on the same folder on barman server).

Then I did a :

barman recover --remote-ssh-command "ssh postgres@pgmaster" pg92cluster 20140717T121354 /var/lib/postgresql/9.2/main

and no issue, the transfer of all WAL is OK, without the previous crash I encountered.

Now, few comments :

the above command is restoring all WAL with the same timeline as the original backup (currently 1). The awaken slave is sending WAL to barman with the timeline 2, but the recover is not copying those WAL, nor the .history file.

if I set manually "--target-tli" to "2", I encounter a crash

Output:

Processing xlog segments for pg92cluster1 0000000100000000000000F4 000000020000000200000031 000000020000000200000032 000000020000000200000033 000000020000000200000034 000000020000000200000035 000000020000000200000036 Starting remote restore for server pg92cluster1 using backup 20140717T121354 Destination directory: /var/lib/postgresql/9.2/main Doing PITR. Recovery target timeline: '2' Copying the base backup. Copying required wal segments. Failure copying WAL files: data transfer failure while copying WAL files to directory '/var/lib/postgresql/9.2/main/barman_xlog' rsync error: ERROR: destination must be a directory when copying more than 1 file rsync error: errors selecting input/output files, dirs (code 3) at main.c(571) [Receiver=3.0.9] rsync: connection unexpectedly closed (9 bytes received so far) [sender] rsync error: error in rsync protocol data stream (code 12) at io.c(605) [sender=3.0.9]

Definitly something fishy here.

My aim is to restore the master using the backup done on the master, and all the WAL files of both timeline 1 & 2, to recover the master to a point very close to the "old-slave / new-master".

Thanks,
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Gabriele Bartolini - 2014-07-17

Hi,

we have releases 1.3.3-alpha.1 and we believe that that issue could be resolved now. Could you please try with that version and let us know?

Thanks,
Gabriele

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sterfield - 2014-07-18

Tested today, it's a bit better, as there's no more crashs, but the WAL are not correctly copied.

if there's no "--target-tli" option, the WAL of timeline 1 are being copied, but not the WAL of timeline 2 (which is, I guess, the correct behavior, as the backup.info file is mentionning "timeline=1").

if the option "--target-tli" is set to 2, no WAL are copied (the destination pg_xlog folder is empty), but there's no more application crash.

Thanks,
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sterfield - 2014-07-21

Hi,

OK, the week-end helped me to see things more clearly. I tried again with "--target-tli" options, and see that you are using a temporary folder "barman_xlog" to store all WAL files for both timeline 1 & 2.

This is working as expecting, all WAL files are being copied from the backup server to the PG server, and are recovered during the recovery phase.

I managed to put in the generated recovery.conf file the "trigger_file", "standby_mode" and "primary_conninfo", in order to start the replication. The new server starts, recover all the files from the "barman_xlog" folder, then connects to the master in order to start the replication.

So, it's working fine !

The only minor issue I have, is that the "barman_xlog" folder is not destroyed, because I'm starting the replication right after the initial WAL recovery, thus keeping the engine in recovery state. So after the initial recovery, I have to open the recovery.conf file, removing the now useless options "restore_command", "recovery_end_command" and "recovery_target_timeline", and manually remove the "barman_xlog".

Thanks for your help on this subject !

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Anonymous - 2014-12-19

I can confirm that this is now working correctly - tested with PG 9.3.5 & Barman 1.3.3 ; almost identical setup as described in ticket. Using the --target-tli option, I'm able to restore from:
last base backup from previous master
WAL files from master before "crash"
* WAL files from new master, with new timeline.

with all data present.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Gabriele Bartolini - 2015-01-13

status: accepted --> closed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Restore command failing on a master-slave configuration

Backup and Recovery Manager for PostgreSQL

Milestone

Searches

Help

#48 Restore command failing on a master-slave configuration

Information:

Information:

How to reproduce:

Current results:

Expected results:

Notes:

Discussion