DUMP IS ABORTED from NetApp

Help
2007-03-11
2013-05-20
  • Frode Tennebø

    Frode Tennebø - 2007-03-11

    I really just need a confirmation that something other than rmt is at fault here:  I have a NetApp (running OnTap 6.5.6) which I had planned to backup by dumping to a RAID-5 array on a Linux server, coruscant.  However, when I try to do the actual dump, the dump is never finished and I get a "DUMP IS ABORTED" message.  Here is an example:

    netapp> dump 0f coruscant:/mnt/backup/test4 /
    DUMP: creating "/vol/vol0/../snapshot_for_backup.68" snapshot.
    DUMP: Using Full Volume Dump
    DUMP: Dumping tape file 1 on /mnt/backup/test4
    DUMP: Date of this level 0 dump: Sun Mar 11 15:16:37 2007.
    DUMP: Date of last level 0 dump: the epoch.
    DUMP: Dumping /vol/vol0 to coruscant
    DUMP: mapping (Pass I)[regular files]
    DUMP: mapping (Pass II)[directories]
    DUMP: estimated 19382290 KB.
    DUMP: dumping (Pass III) [directories]
    DUMP: dumping (Pass IV) [regular files]
    DUMP: Sun Mar 11 15:22:33 2007 : We have written 315321 KB.
    DUMP: Sun Mar 11 15:27:33 2007 : We have written 666167 KB.
    DUMP: Sun Mar 11 15:32:33 2007 : We have written 1016260 KB.
    DUMP: Sun Mar 11 15:37:33 2007 : We have written 1366540 KB.
    DUMP: Sun Mar 11 15:42:33 2007 : We have written 1717137 KB.
    DUMP: Sun Mar 11 15:47:33 2007 : We have written 2068045 KB.
    DUMP: Sun Mar 11 15:52:33 2007 : We have written 2418447 KB.
    DUMP: Remote write failed: RMT protocol response error
    DUMP: Sun Mar 11 16:01:32 2007 : We have written 2748379 KB.
    DUMP: DUMP IS ABORTED
    DUMP: Deleting "/vol/vol0/../snapshot_for_backup.68" snapshot.

    This is dumping to the server when the actual disks resides.  strace on the /etc/rmt process gives:

    read(0, "W", 1)                         = 1
    read(0, "6", 1)                         = 1
    read(0, "4", 1)                         = 1
    read(0, "5", 1)                         = 1
    read(0, "1", 1)                         = 1
    read(0, "2", 1)                         = 1
    read(0, "\n", 1)                        = 1
    read(0, "\377@\0\1\1@\377@@\377@\0\1\1@\377@@\377@\0\1\1@\377@@"..., 64512) = 8688
    read(0, "\t\1\t\0\10P\337\377\200\0\v\1\t\0\10P\337\377\200\0\t"..., 55824) = 7240
    read(0, "\257\0\317\257\0\0\17\1\22\0\0P\3770P\3770\0\0\r\1\22\0"..., 48584) = 7240
    read(0, "\0\217\357\0\217\357\7\0\v\1\10\217\357\0\217\357\5\0\t"..., 41344) = 7240
    read(0, "\6p\377\20\0\17\1\5\0\10\357\217\0\357\217\21\1\7\0\4p"..., 34104) = 7240
    read(0, "\357\v\377\0\217\t\0\21\1\4\0\20\237\t\377\2\257@\v\0\r"..., 26864) = 8688
    read(0, "\0\20\5@\4 \0\0\7\1\t\0\00600\0\0\v\1\t\0\00200\5\0\v\1"..., 18176) = 7240
    read(0, "\0\0\1\5\0\0\1\5\0\0\1\377\0\377\0\377\0\377\0\377\0\'"..., 10936) = 7240
    read(0, "\0\7\1\20\200\377\200\0@\377\277\0\1\377\0\377\0\377\0"..., 3696) = 2896
    read(0, "\16\377\377\0\0\277\377@\0\7\1\2\377\377\7\0\6\277\377"..., 800) = 800
    write(3, "\377@\0\1\1@\377@@\377@\0\1\1@\377@@\377@\0\1\1@\377@@"..., 64512) = 64512
    write(1, "A64512\n", 7)                 = 7
    read(0, "W", 1)                         = 1
    read(0, "6", 1)                         = 1
    read(0, "4", 1)                         = 1
    read(0, "5", 1)                         = 1
    read(0, "1", 1)                         = 1
    read(0, "2", 1)                         = 1
    read(0, "\n", 1)                        = 1
    read(0, "\4\377\377\0\v\1\0\277\21\377\2\277\0\v\1\0\200\23\377"..., 64512) = 1448
    read(0, "\377\200\v\0\4\377\377\0\t\1\35\0\4\1\1p\5\377\n\257P\0"..., 63064) = 46336
    read(0, "\5\1\21\0\25\1\r\0\4\217\377p\t\0\10\20\377\357\0\0\5\1"..., 16728) = 4344
    read(0, 0x9e7cae0, 12384)               = -1 ECONNRESET (Connection reset by peer)
    exit_group(2)                           = ?

    Apparently, either the connection is lost or closed by the netapp server.  Just for good measure I set up a Sparc Solaris 10 server, vader, for test.  It has the coruscant:/mnt/backup mounted at /mnt.  I get:

    netapp> dump 0f vader:/mnt/test3 /
    DUMP: creating "/vol/vol0/../snapshot_for_backup.69" snapshot.
    DUMP: Using Full Volume Dump
    DUMP: Dumping tape file 1 on /mnt/test3
    DUMP: Date of this level 0 dump: Sun Mar 11 16:46:46 2007.
    DUMP: Date of last level 0 dump: the epoch.
    DUMP: Dumping /vol/vol0 to vader
    DUMP: mapping (Pass I)[regular files]
    DUMP: mapping (Pass II)[directories]
    DUMP: estimated 19384212 KB.
    DUMP: dumping (Pass III) [directories]
    DUMP: dumping (Pass IV) [regular files]
    DUMP: Sun Mar 11 16:52:40 2007 : We have written 306759 KB.
    DUMP: Sun Mar 11 16:57:40 2007 : We have written 641030 KB.
    DUMP: Sun Mar 11 17:02:40 2007 : We have written 976318 KB.
    DUMP: Sun Mar 11 17:07:40 2007 : We have written 1312926 KB.
    DUMP: Sun Mar 11 17:12:40 2007 : We have written 1649725 KB.
    DUMP: Sun Mar 11 17:17:40 2007 : We have written 1985639 KB.
    DUMP: Sun Mar 11 17:22:40 2007 : We have written 2322940 KB.
    DUMP: Sun Mar 11 17:27:40 2007 : We have written 2659739 KB.
    DUMP: Remote write failed: RMT protocol response error
    DUMP: DUMP IS ABORTED
    DUMP: Deleting "/vol/vol0/../snapshot_for_backup.69" snapshot.

    The truss output looked like this:

    read(0, " W", 1)                                = 1
    read(0, " 6", 1)                                = 1
    read(0, " 4", 1)                                = 1
    read(0, " 5", 1)                                = 1
    read(0, " 1", 1)                                = 1
    read(0, " 2", 1)                                = 1
    read(0, "\n", 1)                                = 1
    read(0, "\0\0FF\0\001\v\0 701\bFF".., 64512)    = 7240
    read(0, " G\0\0FF01\0FF G\0 GFF01".., 57272)    = 50680
    read(0, "80\0070102FF8007\004FF80".., 6592)     = 6592
    kill(2999, SIGPOLL)                             = 0
        Received signal #21, SIGURG, in sigsuspend() [caught]
    sigsuspend(0x00022864)                          Err#91 ERESTART
    lwp_sigmask(SIG_SETMASK, 0x0031E000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    setcontext(0xFFBFF7F0)
    write(3, "\0\0FF\0\001\v\0 701\bFF".., 64512)   = 64512
    kill(2999, SIGURG)                              = 0
        Received signal #14, SIGALRM, in sigsuspend() [caught]
    sigsuspend(0x00022874)                          Err#91 ERESTART
    lwp_sigmask(SIG_SETMASK, 0x0031E000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    setcontext(0xFFBFF7F0)
    write(1, " A 6 4 5 1 2\n", 7)                   = 7
    kill(2999, SIGALRM)                             = 0
        Received signal #22, SIGPOLL, in sigsuspend() [caught]
    sigsuspend(0x000228A4)                          Err#91 ERESTART
    lwp_sigmask(SIG_SETMASK, 0x00302000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    setcontext(0xFFBFF7F0)
    read(0, " W", 1)                                = 1
    read(0, " 6", 1)                                = 1
    read(0, " 4", 1)                                = 1
    read(0, " 5", 1)                                = 1
    read(0, " 1", 1)                                = 1
    read(0, " 2", 1)                                = 1
    read(0, "\n", 1)                                = 1
    read(0, "FFFF\0\t0107\00680FF80\0".., 64512)    = 2896
    read(0, 0x00024630, 61616)      (sleeping...)
    read(0, 0x00024630, 61616)                      Err#131 ECONNRESET
    _exit(2)

    Pretty much the same behaviour.  If nothing is done on netapp between two consecutive dumps the resulting dump file-sizes are identical.

    Dumping from linux->solaris and vice versa works fine.  Furthermore, if I on coruscant (the Linux machine) do a "rsh netapp dump 0f - / | cat > /mnt/backup/test1" (i.e. avoid rmt completely), the results is (similarly for the Solaris machine):

    DUMP: creating "/vol/vol0/../snapshot_for_backup.70" snapshot.
    DUMP: Using Full Volume Dump
    DUMP: Date of this level 0 dump: Sun Mar 11 20:19:21 2007.
    DUMP: Date of last level 0 dump: the epoch.
    DUMP: Dumping /vol/vol0 to standard output
    DUMP: mapping (Pass I)[regular files]
    DUMP: mapping (Pass II)[directories]
    DUMP: estimated 19391089 KB.
    DUMP: dumping (Pass III) [directories]
    DUMP: dumping (Pass IV) [regular files]
    DUMP: Sun Mar 11 20:25:15 2007 : We have written 1596060 KB.
    DUMP: Error writing to standard output: Broken pipe
    DUMP: Sun Mar 11 20:31:48 2007 : We have written 2748492 KB.
    DUMP: DUMP IS ABORTED
    DUMP: Deleting "/vol/vol0/../snapshot_for_backup.70" snapshot.

    Now, I find it hard to believe that there actually is a bug in the dump infrastructure of NetApp, but all evidence point to that.  The source code of the rmt client is not complex at all, and I find very little that can go wrong.  I'm not familiar with the inner workings of the dump/rmt protocol, but is it possible that the netapp is aborting due to some (missing) stimuli from the linux/solaris machines?

     
    • Stelian Pop

      Stelian Pop - 2007-03-12

      Hi,

      Let's analyse a bit the strace:
      > read(0, "W", 1) = 1
      > read(0, "6", 1) = 1
      > read(0, "4", 1) = 1
      > read(0, "5", 1) = 1
      > read(0, "1", 1) = 1
      > read(0, "2", 1) = 1
      > read(0, "\n", 1) = 1

      (remote) dump says it will send 64512 bytes over to rmt

      > read(0, "\377@\0\1\1@\377@@\377@\0\1\1@\377@@\377@\0\1\1@\377@@"..., 64512) = 8688
      > read(0, "\t\1\t\0\10P\337\377\200\0\v\1\t\0\10P\337\377\200\0\t"..., 55824) = 7240
      > read(0, "\257\0\317\257\0\0\17\1\22\0\0P\3770P\3770\0\0\r\1\22\0"..., 48584) = 7240
      > read(0, "\0\217\357\0\217\357\7\0\v\1\10\217\357\0\217\357\5\0\t"..., 41344) = 7240
      > read(0, "\6p\377\20\0\17\1\5\0\10\357\217\0\357\217\21\1\7\0\4p"..., 34104) = 7240
      > read(0, "\357\v\377\0\217\t\0\21\1\4\0\20\237\t\377\2\257@\v\0\r"..., 26864) = 8688
      > read(0, "\0\20\5@\4 \0\0\7\1\t\0\00600\0\0\v\1\t\0\00200\5\0\v\1"..., 18176) = 7240
      > read(0, "\0\0\1\5\0\0\1\5\0\0\1\377\0\377\0\377\0\377\0\377\0\'"..., 10936) = 7240
      > read(0, "\0\7\1\20\200\377\200\0@\377\277\0\1\377\0\377\0\377\0"..., 3696) = 2896
      > read(0, "\16\377\377\0\0\277\377@\0\7\1\2\377\377\7\0\6\277\377"..., 800) = 800

      8688*2+7240*6+2896+800 = 64512, ok, all the data rmt was waiting for was received

      > write(3, "\377@\0\1\1@\377@@\377@\0\1\1@\377@@\377@\0\1\1@\377@@"..., 64512) = 64512

      so rmt will write the data into the output file

      > write(1, "A64512\n", 7) = 7

      And reply to dump the success status.

      Next, we start over:

      > read(0, "W", 1) = 1
      > read(0, "6", 1) = 1
      > read(0, "4", 1) = 1
      > read(0, "5", 1) = 1
      > read(0, "1", 1) = 1
      > read(0, "2", 1) = 1
      > read(0, "\n", 1) = 1

      Remote dump says it will send 64512 bytes again.

      > read(0, "\4\377\377\0\v\1\0\277\21\377\2\277\0\v\1\0\200\23\377"..., 64512) = 1448
      > read(0, "\377\200\v\0\4\377\377\0\t\1\35\0\4\1\1p\5\377\n\257P\0"..., 63064) = 46336
      > read(0, "\5\1\21\0\25\1\r\0\4\217\377p\t\0\10\20\377\357\0\0\5\1"..., 16728) = 4344

      It sends 1448+46336+4344 = 52128 bytes, then

      > read(0, 0x9e7cae0, 12384) = -1 ECONNRESET (Connection reset by peer)

      oops, the remote dump ended the connection.

      > Now, I find it hard to believe that there actually is a bug in the dump
      > infrastructure of NetApp, but all evidence point to that.

      It surely does.

      > The source code of
      > the rmt client is not complex at all, and I find very little that can go wrong.
      > I'm not familiar with the inner workings of the dump/rmt protocol, but is it
      > possible that the netapp is aborting due to some (missing) stimuli from the
      > linux/solaris machines?

      I don't believe so, it really looks like a bug in NetApp dump. It seems to break when
      reaching some 32 bit limit (errors are constantly reported around 2^31 bytes, which is
      the max value of an signed 32 bit value), so it could be related to that. Or not.

      Stelian.

       

Log in to post a comment.

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:

JavaScript is required for this form.





No, thanks