#48 Restore command failing on a master-slave configuration

1.x
closed
2015-01-29
2014-06-09
Sterfield
No

Hi,

I've encountered a problem restoring a PostgreSQL cluster. Here is some detailed information:

Information:

OS : Debian stable (wheezy)
Barman version : 1.3.2 (from the Postgresql repository)
Postgresql versions : 9.2
Can the bug be reproduced : yes

How to reproduce:

My current setup is:
- one PG 9.2 instance in master mode, accepting R/W requests. It is reachable through a VIP (managed by Pacemaker / Corosync). The server archives WAL files via rsync (running in daemon / server mode).
- one PG 9.2 instance in standby mode, replicating from the master (async mode) with hot standby enabled. It is set up to archive WAL files to the same server as the master, but as it's currently in standby_mode, it's not archiving anything. It can also be woken up using a trigger file.
- one backup server running Barman. A backup configuration called "pgcluster" has been created; both the SSH command and the conninfo point to the VIP, currently on the master.
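A Barman server definition matching this setup might look like the following sketch (the host name `pgcluster-vip` and the user are assumptions, not taken from the ticket):

```ini
; /etc/barman.conf (sketch; host name and user are illustrative)
[pgcluster]
description = "PG 9.2 cluster reached through the Pacemaker VIP"
ssh_command = ssh postgres@pgcluster-vip
conninfo = host=pgcluster-vip user=postgres
```

Because both ssh_command and conninfo point at the VIP, Barman keeps talking to whichever node currently holds it, which is what makes the failover scenario below possible.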

Here are the steps:
- Take a backup of the master engine. WAL files begin to accumulate on the backup server.
- Shut down the master.
- At that point, the VIP switches to the slave.
- Wake up the slave (using the trigger file or by promoting it).
- The slave wakes up and increases its timeline (say, from 1 to 2). It can now handle R/W connections, and it starts sending WAL files to the backup server over the same rsync connection the master used. The WAL files accumulate in a new folder, because of the timeline ID change.
- I decide to reinstall the master from the backup, using the "recover" command. So, on the backup server:

barman $> barman recover --remote-ssh-command "ssh postgres@pgmaster" pgcluster 20140609T182623 /var/lib/postgresql/9.2/main

Current results:

The recovery begins but fails while transferring the last WAL file. For example, the master crashed on WAL "0000000100000000000000FC", and the slave took over with "0000000200000000000000FC" as its first WAL file.

Here's the error :

Processing xlog segments for pg92cluster1
    00000002000000010000001D
    00000002000000010000001E
Starting remote restore for server pg92cluster1 using backup 20140609T182623 
Destination directory: /var/lib/postgresql/9.2/main
Copying the base backup.
Copying required wal segments.
EXCEPTION: [Errno 2] No such file or directory: '/tmp/barman_xlog-18Lz2a/0000000100000000000000FC'

Expected results:

The recovery should send all WAL files from both timeline IDs, plus the timeline .history file, to the PostgreSQL server, so that it can be restored up to the current old-slave/new-master point.
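The ".history" file mentioned here is the small timeline history file the promoted slave writes when it switches timeline (here it would be named 00000002.history). As a sketch, in PostgreSQL 9.3 and later each line records the parent timeline ID, the switch point and a reason, roughly like this (the LSN is purely illustrative; older releases record the switch point in a slightly different form):

```
1	0/FC000000	no recovery target specified
```

Without this file in pg_xlog, a restored instance cannot follow the timeline switch from 1 to 2.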

Notes:

I know that I could restore using the new master, but I want to:
- test the worst-case scenario (two servers crashing in a row)
- avoid disturbing the new master if possible.

Thanks !

Discussion

  • Sterfield
    2014-06-09

    Sorry for double-posting the same content in the ticket body. I would be happy to correct it, but apparently I don't have the rights to do so.

    Sorry about that.

     
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -10,30 +10,6 @@
     Postgresql versions : 9.2
     Can the bug be reproduced : yes
    
    -
    -How to reproduce:
    ------------------
    -
    -My current setup is :
    - - one PG 9.2 in master mode, accepting R/W requests. It's accessible using a VIP (managed by Pacemaker / Corosync). The server is doing WAL archiving through rsync (in daemon / server mode).
    - - one PG 9.2 in standby mode, replicating the master (async mode), and in hot standby. The server is setup to do WAL archiving on the same server as the master, but as it's currently in standby_mode, it's not doing anything. The server is also able to wake up using a trigger file.
    - - one backup server, running barman. A backup configuration has been created, called "pgcluster". The SSH and the conninfo are pointing to the VIP, currently on the master.
    -
    -Here's the steps :
    - - take a backup on the master engine. The WAL will begin to stack on the backup server.
    - - shut down the master.
    - - At that point, the VIP is switching to the slave.
    - - Wake up the slave (using the trigger file or by promoting it).
    - - The slave wake up and increase its timeline (let's say from 1 to 2). It's able to handle R/W connections. It's also beginning to send WAL to the backup server, using the same rsync connection as the master. The WAL are stacking in a new folder, because of the timeline ID modification.
    - - I decide to reinstall the master from the backup, using the "recover" command. So, on the backup :
    -    barman $> barman recover --remote-ssh-command "ssh postgres@pgmaster" pgcluster 20140609T182623 /var/lib/postgresql/9.2/main
    -
    -Current results:
    -----------------
    -
    -The recover begins, but failed on the transfer of the last WAL file. For example, if the master crashed on the WAL "0000000100000000000000FC", and the slave take over with first WAL "0000000200000000000000FC", the recover will crash on the "Hi,
    -
    -I've encountered a problem with the restore of a Postgresql Cluster. Here's some detail information on this :
    
     Information:
     ------------
    
     
    • labels: --> recovery, wal files
    • status: open --> accepted
    • assigned_to: Marco Nenciarini
     
  • Thank you for submitting this bug report. We have placed it in the current backlog.

     
  • Sterfield
    2014-06-13

    Thanks !

    I'm eagerly waiting for this correction / new version !

    Cheers !

     
  • I've followed all the steps described by the user, but I'm actually unable to replicate the error.
    I've tried different PostgreSQL and Barman versions, but I still cannot replicate it. Could you help me reproduce this error?

     
  • Hi,

    as Giulio was saying, we'd need a reproducible test case.

    I think a bit more information would be extremely useful, for example:

    • did you specify any timeline ID for recovery?
    • could you please cut/paste the snippet from the xlogdb of that barman server around the switch WAL file?
    • could you also send us the output of 'barman diagnose'?

    In general, it is good practice to take a new full base backup immediately after the switch.

    Thanks,
    Gabriele

     
  • Sterfield
    2014-07-16

    OK, I'll try to provide the additional information as soon as possible, but I have other things to do right now.

    Will try to do this in the upcoming days.

    Thanks,

     
  • Sterfield
    2014-07-17

    You know what? I can't reproduce it either.

    I started a master, ran pgbench on it, synced a slave from the master, crashed the master and woke up the slave (which then continued to archive WAL files to the same folder on the Barman server).

    Then I ran:

    barman recover --remote-ssh-command "ssh postgres@pgmaster" pg92cluster 20140717T121354 /var/lib/postgresql/9.2/main
    

    and there was no issue: all WAL files were transferred correctly, without the crash I had encountered previously.

    Now, a few comments:

    • the above command restores all WAL files with the same timeline as the original backup (currently 1). The promoted slave is sending WAL files to Barman with timeline 2, but the recovery copies neither those WAL files nor the .history file.
    • if I manually set "--target-tli" to "2", I encounter a crash

    Output:

    Processing xlog segments for pg92cluster1
    
    0000000100000000000000F4
    000000020000000200000031
    000000020000000200000032
    000000020000000200000033
    000000020000000200000034
    000000020000000200000035
    000000020000000200000036
    Starting remote restore for server pg92cluster1 using backup 20140717T121354 
    Destination directory: /var/lib/postgresql/9.2/main
    Doing PITR. Recovery target timeline: '2'
    Copying the base backup.
    Copying required wal segments.
    Failure copying WAL files: data transfer failure while copying WAL files to directory '/var/lib/postgresql/9.2/main/barman_xlog'
    rsync error:
    ERROR: destination must be a directory when copying more than 1 file
    rsync error: errors selecting input/output files, dirs (code 3) at main.c(571) [Receiver=3.0.9]
    rsync: connection unexpectedly closed (9 bytes received so far) [sender]
    rsync error: error in rsync protocol data stream (code 12) at io.c(605) [sender=3.0.9]
    

    Definitely something fishy here.

    My aim is to restore the master using the backup taken on the master, plus all the WAL files of both timelines 1 and 2, in order to recover the master to a point very close to the old-slave/new-master state.

    Thanks,

     
  • Hi,

    we have released 1.3.3-alpha.1 and we believe that this issue could be resolved now. Could you please try that version and let us know?

    Thanks,
    Gabriele

     
  • Sterfield
    2014-07-18

    Tested today; it's a bit better, as there are no more crashes, but the WAL files are still not copied correctly.

    • if the "--target-tli" option is not given, the WAL files of timeline 1 are copied, but not those of timeline 2 (which is, I guess, the correct behavior, as the backup.info file mentions "timeline=1").
    • if "--target-tli" is set to 2, no WAL files are copied at all (the destination pg_xlog folder is empty), but there is no longer an application crash.

    Thanks,

     
  • Sterfield
    2014-07-21

    Hi,

    OK, the week-end helped me see things more clearly. I tried again with the "--target-tli" option and saw that you are using a temporary folder, "barman_xlog", to store all WAL files for both timelines 1 and 2.

    This is working as expected: all WAL files are copied from the backup server to the PG server, and they are replayed during the recovery phase.

    I added the "trigger_file", "standby_mode" and "primary_conninfo" options to the generated recovery.conf file in order to start replication. The new server starts, recovers all the files from the "barman_xlog" folder, then connects to the master to start streaming replication.
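    A sketch of what the edited recovery.conf could look like (the first two lines are what barman recover generates; the rest was added by hand, and the host name and trigger path are hypothetical, not taken from this ticket):

```ini
# recovery.conf -- generated by barman recover, then edited by hand
# (primary_conninfo host and trigger_file path are illustrative)
restore_command = 'cp barman_xlog/%f %p'
recovery_end_command = 'rm -fr barman_xlog'
recovery_target_timeline = '2'
standby_mode = 'on'
primary_conninfo = 'host=pgnewmaster port=5432 user=postgres'
trigger_file = '/var/lib/postgresql/9.2/main/trigger'
```

    Since standby_mode keeps the server in recovery, recovery_end_command never fires, which is why the temporary barman_xlog folder is left behind in this scenario.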

    So, it's working fine !

    The only minor issue is that the "barman_xlog" folder is not removed, because I start replication right after the initial WAL recovery, which keeps the engine in recovery state. So after the initial recovery I have to edit recovery.conf, remove the now-useless "restore_command", "recovery_end_command" and "recovery_target_timeline" options, and delete "barman_xlog" manually.

    Thanks for your help on this subject !

     
  • Apertoso
    2014-12-19

    I can confirm that this is now working correctly - tested with PG 9.3.5 & Barman 1.3.3; almost identical setup as described in the ticket. Using the --target-tli option, I'm able to restore from:
    • the last base backup from the previous master
    • WAL files from the master before the "crash"
    • WAL files from the new master, with the new timeline

    with all data present.

     
    • status: accepted --> closed