
Crashed during sync, how to get SR to continue syncing

Help
therealjmc
2014-11-07
2014-11-10
  • therealjmc

    therealjmc - 2014-11-07

    Hi,

    After restoring all drives one after another, and later moving a lot of files, I recreated my SnapRAID array 2 days ago. Today I had to move a bunch of files from one drive to another and ran a sync afterwards. During the sync the power went out, and of course my UPS didn't protect the external HDD case where my parity drives are stored (it seems I didn't pay much attention to where I plugged it in). After the power came back a few seconds later, the sync had stopped. But I can't get SR to continue syncing; it always reports that it has nothing to do. Moving 3-4 files results in just those files getting synced (the interrupted sync still had some hours to go, and this one took just 2 minutes). So SR behaves as if the parity were OK, but it isn't fully synced.

    Switching between 7.0 Beta and 6.3 made no difference. Any suggestions?

    Thanks
    Peter

     
  • John

    John - 2014-11-07

    Assuming you are not starting a recovery (in which case the safest move would be to first image anything that you might change) and you are comfortable that the originals are fine, I think you need to start with a fsck on the parity filesystem. After that, run "snapraid -p 100 scrub" to see what errors it finds (they should be only in parity if the array is otherwise fully synced and OK). After that, IF AND ONLY IF THE ERRORS ARE ONLY IN PARITY, you can fix them with "snapraid -e fix".

    I hope I didn't forget anything, and I have never tested this. So I strongly recommend waiting for confirmation from other users.
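
The order of steps John describes can be sketched as a small Python helper. This is only an illustration under assumptions: the function name `recovery_plan`, the use of `chkdsk` as the Windows counterpart of fsck, and the drive letter `U:` are not from his post; the two `snapraid` commands are the ones he quotes. The helper only builds the command list and runs nothing, so the fix step stays behind a manual decision.

```python
def recovery_plan(parity_drive, errors_only_in_parity=False):
    """Return the ordered commands for checking an array after a bad shutdown.

    parity_drive: drive letter of the parity filesystem (hypothetical, e.g. "U:").
    errors_only_in_parity: set to True only after reading the scrub report yourself.
    """
    plan = [
        # 1. Check the parity filesystem first. Without /F, chkdsk only
        #    reports problems and changes nothing.
        ["chkdsk", parity_drive],
        # 2. Scrub 100% of the array to see which errors exist.
        ["snapraid", "-p", "100", "scrub"],
    ]
    if errors_only_in_parity:
        # 3. The fix command quoted in the post, added only on request.
        plan.append(["snapraid", "-e", "fix"])
    return plan


if __name__ == "__main__":
    for command in recovery_plan("U:"):
        print(" ".join(command))
```

Keeping the fix step optional mirrors the "if and only if" condition in the post: the scrub report has to be read by a person before anything is rewritten.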

     
  • Andrea Mazzoleni

    Hi therealjmc,

    Just do what John is recommending. Anyway, if SnapRAID says that it has nothing to do, I expect that everything really is OK.

    SnapRAID is designed to handle these kinds of crashes too. It saves the content file only after having updated the parity. So if, according to the content file, everything is up to date, the parity should be as well.

    Ciao,
    Andrea
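
The ordering Andrea describes (parity first, content file afterwards) can be illustrated with a toy model. This is not SnapRAID's code; the class `ToyArray` and its fields are invented for the example, and block counts stand in for real files. It only shows why a crash in the middle leaves the content file claiming less than the parity holds, never more.

```python
class ToyArray:
    """Toy model: parity is written first, the content file is saved afterwards."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks
        self.parity_blocks = 0   # blocks actually written to the parity file
        self.content_blocks = 0  # blocks the saved content file claims are synced

    def sync(self, crash_after=None):
        """Sync block by block; optionally simulate a crash after N parity writes."""
        for block in range(self.content_blocks, self.total_blocks):
            if crash_after is not None and block >= crash_after:
                return False  # crash: the content file was never updated
            self.parity_blocks = block + 1
        # Reached only if every parity write succeeded.
        self.content_blocks = self.parity_blocks
        return True


if __name__ == "__main__":
    array = ToyArray(total_blocks=100)
    array.sync(crash_after=61)
    print(array.parity_blocks, array.content_blocks)
    array.sync()
    print(array.parity_blocks, array.content_blocks)
```

In this model a rerun after the crash starts again from what the content file records, so the worst case is repeated work rather than parity that is silently missing.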

     
  • therealjmc

    therealjmc - 2014-11-07

    Hi Andrea,

    I started over again with clean parity drives and deleted the content files. BUT:

    61%, 2945189 MiB, 234 MiB/s, CPU 12%, 2:23 ETA
    61%, 2945378 MiB, 233 MiB/s, CPU 12%, 2:23 ETA
    61%, 2945548 MiB, 232 MiB/s, CPU 12%, 2:24 ETA
    Stopping at block 11801256
    Saving state to C:/SnapRAID.content...
    Saving state to W:/SnapRAID.content...
    Saving state to X:/SnapRAID.content...
    Saving state to Y:/SnapRAID.content...
    Saving state to Z:/SnapRAID.content...
    WARNING! Ignoring special 'system-directory' file 'Z:/PoolPart.984f4aaa-62bd-44e1-9bda-543831aa1ef8/.covefs'
    WARNING! Ignoring special 'system-directory' file 'Y:/PoolPart.40c4d6e8-c7b3-4409-9f04-1f8108aa0c30/.covefs'
    WARNING! Ignoring special 'system-directory' file 'X:/PoolPart.67efccb5-2a58-4b81-96d1-19b9227067b8/.covefs'
    Unexpected Windows error 21.
    Error writing file 'U:/SnapRAID.parity'. Input/output error [5/21].

    Since I ran the sync with the PowerShell script, I have a logfile from it. As you can see, it reports an ETA of about 2.5 hours. Running sync after power to the parity drives was back led to "nothing to do". Moving 3-4 files resulted in a parity update for just those 3-4 files.

    Please don't get me wrong, I think SR is great. But maybe there is a little bug where this case isn't handled?
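
To read the last progress point out of a saved log like the one above, a short Python sketch could look like this. The regular expression is written against the three progress lines quoted in this post only, and it assumes the ETA field is hours:minutes, which is how the post reads it; other SnapRAID versions may print something different. The function name `last_progress` is made up for the example.

```python
import re

# Pattern for lines such as:
#   61%, 2945548 MiB, 232 MiB/s, CPU 12%, 2:24 ETA
PROGRESS = re.compile(
    r"(?P<percent>\d+)%, (?P<mib>\d+) MiB, (?P<speed>\d+) MiB/s, "
    r"CPU (?P<cpu>\d+)%, (?P<eta>\d+:\d+) ETA"
)


def last_progress(log_text):
    """Return the last progress record in a sync log, or None if there is none."""
    matches = list(PROGRESS.finditer(log_text))
    if not matches:
        return None
    last = matches[-1]
    # Assumption: the ETA is printed as hours:minutes.
    hours, minutes = last.group("eta").split(":")
    return {
        "percent": int(last.group("percent")),
        "mib_done": int(last.group("mib")),
        "eta_minutes": int(hours) * 60 + int(minutes),
    }
```

Fed the log excerpt from this post, it returns the 61% record, which is the figure to compare against what a later status or sync run reports.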

    Thanks!
    Peter

     
  • John

    John - 2014-11-07

    There is something wrong with that disk (U:). Try HD Tune on it (the long check, last tab), assuming Windows of course. If it doesn't show any "red" blocks, format the partition. If at any point you suspect SnapRAID, just fill the partition with something else and see if you have any issues (I suspect you will).

     
  • therealjmc

    therealjmc - 2014-11-07

    I don't have any problems with the drive. As I stated above, it disappeared because the power failed for this drive; it's an external USB3 drive containing the parity data. I'm monitoring SMART data etc. It's just a plain power failure. Please don't get me wrong, I know you want to help, but that drive is not the point of the thread and is 100% fine ;) The drives are healthy and the filesystem is fine, too. It's just the fact that SR doesn't continue and doesn't complain that worries me a bit.

    As you can see, it aborted at 61 percent, and I found no way to get it to recognize that it had aborted, besides maybe running a full scrub or check. But I think SR should be able to recognize that it aborted and continue from there (and of course pick up any changes that happened between the crashed run and the new run), since Andrea stated that it saves the state.

     
    • Leifi Plomeros

      Leifi Plomeros - 2014-11-07

      So the suspected bug would be that the content file was incorrectly saved with information that all changed files had been calculated into the parity file, when in reality only 61% had been, so that a new sync finds no differences from what is declared in the content file?

      Is there any chance that you could actually run "snapraid -p 100 scrub" to determine whether there really is a bug or whether there must be some other explanation? (Possible other explanations could be, for example, an incorrect log or a scheduled sync that completed the operation without you noticing.)

       
    • Andrea Mazzoleni

      Hi therealjmc,

      The reported Windows error 21 is ERROR_NOT_READY, which means that the disk is refusing the parity write operation because of some hardware error.
      When such unexpected errors happen, SnapRAID just aborts.

      I understand that you are confident in that drive. But error 21 suggests a hardware problem more than a SnapRAID bug. In fact, SnapRAID is just writing a file, and Windows fails the write operation with this error. I don't see how this could come from a SnapRAID bug, as any malfunction in SnapRAID would result in a different error.
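
For reference, the bracketed pair at the end of the error line in the log can be decoded with a few lines of Python. This sketch assumes the bracket holds the C `errno` value followed by the Windows error code; that reading is a guess from the log, supported only by 5 being `EIO` ("Input/output error") and 21 being the code named above. The table is a small hand-picked excerpt of Windows system error codes, and `decode_error` is a made-up name.

```python
import errno
import re

# Small excerpt of Windows system error codes (names as in winerror.h).
WIN32_ERRORS = {
    5: "ERROR_ACCESS_DENIED",
    21: "ERROR_NOT_READY",
    1117: "ERROR_IO_DEVICE",
}


def decode_error(line):
    """Decode a trailing "[errno/windows_code]" pair, or return None if absent."""
    match = re.search(r"\[(\d+)/(\d+)\]", line)
    if match is None:
        return None
    c_errno, win_code = int(match.group(1)), int(match.group(2))
    return (
        errno.errorcode.get(c_errno, "unknown"),  # symbolic errno name
        WIN32_ERRORS.get(win_code, "unknown"),    # symbolic Windows name
    )


if __name__ == "__main__":
    print(decode_error(
        "Error writing file 'U:/SnapRAID.parity'. Input/output error [5/21]."
    ))
```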

      As for having SnapRAID continue from where it stopped, you have to enable the autosave feature of the content file. Then SnapRAID will continue from the latest saved state.
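
Whether autosave is actually enabled can be checked by looking for the `autosave` option in the configuration file, which takes a size in gigabytes. The Python sketch below does that on the text of a config file. The function name `autosave_gigabytes` and the sample configuration are illustrative assumptions: the paths are borrowed from the log earlier in the thread, and the value 250 is just an example.

```python
def autosave_gigabytes(config_text):
    """Return the autosave interval in GB from snapraid.conf text, or None if unset."""
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if parts[0] == "autosave" and len(parts) >= 2 and parts[1].isdigit():
            return int(parts[1])
    return None


# Hypothetical excerpt of a configuration file, for illustration only.
SAMPLE_CONFIG = """\
parity U:/SnapRAID.parity
content C:/SnapRAID.content
autosave 250
"""

if __name__ == "__main__":
    print(autosave_gigabytes(SAMPLE_CONFIG))
```

A result of `None` means that an interrupted sync has no intermediate saved state to continue from.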

      Ciao,
      Andrea

       
      • therealjmc

        therealjmc - 2014-11-10

        Hi Andrea,

        I enabled the autosave feature from the beginning (every 250 GB). I don't think the error 21 is an SR bug; the 21 is probably there because my TrueCrypt-mounted parity and 2-parity drives dropped out when power to that case was lost. I just couldn't get SR to sync the missing files; that is the bug I'm thinking of. But I'll have some time today to reproduce this, and I'll be back ( ;) ) with some testing done. I still have the SR logfile from the crashed sync, and I'll take another look to see if anything else was mentioned. Too bad I didn't save the one where it reported nothing to do...

         
  • therealjmc

    therealjmc - 2014-11-08

    It didn't complete, as I was actually watching the sync via console on the server at the very moment it happened. I can't run the scrub because I started over again with new content and parity files. I'll try to reproduce this in my lab next week.

     
  • Leifi Plomeros

    Leifi Plomeros - 2014-11-08

    I just did some experiments to see if I could repeat the problem but no luck.

    Whenever I remove a disk (data/parity) during sync I get an error message like this:
    DANGER! Unexpected read error in a data disk, it isn't possible to sync.
    or like this:
    DANGER! Write error in the Parity disk, it isn't possible to sync.

    SnapRaid status gives this:
    WARNING! The array is NOT fully synced.
    You have a sync in progress at 60%.
    No rehash is in progress or needed.
    No silent error detected.

    When I run SnapRaid Sync again it completes the operation.
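
A script that wants to notice an interrupted sync could look for the status line shown above. This Python sketch matches only the exact wording quoted in this post; other SnapRAID versions may word it differently, and the function name `sync_in_progress` is made up for the example.

```python
import re


def sync_in_progress(status_text):
    """Return the percentage of an interrupted sync, or None if none is reported."""
    match = re.search(r"You have a sync in progress at (\d+)%", status_text)
    return int(match.group(1)) if match else None


# Status output as quoted in this post.
SAMPLE_STATUS = """\
WARNING! The array is NOT fully synced.
You have a sync in progress at 60%.
No rehash is in progress or needed.
No silent error detected.
"""

if __name__ == "__main__":
    print(sync_in_progress(SAMPLE_STATUS))
```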

    I used the beta 7.0 version that was available 2 weeks ago.
    I used 2 data disks and single parity
    I tried all scenarios I can think of, including moving almost all files from one data disk to another.

    So, yes, I think it would be very interesting to find out if you are able to recreate the situation and in which conditions it occurs.

     
  • therealjmc

    therealjmc - 2014-11-10

    I've more or less recreated the situation, but sync continues as it should. I don't get it; maybe the file moves between the aborted sync and the new one got in the way. I ran diff, which pointed out no differences (maybe this is not a real problem, but a notice in the diff output when a sync is in progress would be nice?), and I'm sure I ran status on the array with the problem, too. But maybe I got confused by the 2 windows, didn't read carefully enough, and took the diff output for the status output, seeing just the lines saying no difference. I'll have to admit, then, that the fault seems to be between the chair and the monitor. :)

     
  • John

    John - 2014-11-10

    Keep in mind that on any modern computer, writes to a filesystem go to memory first and are only later synced to the disk (and even the disk itself has some cache).

    Any "unsafe ejection" can (and very likely will) leave some things out of sync. At least at two levels:

    • filesystem inconsistencies - those might or might not be fixed next time the filesystem is mounted
    • inconsistencies between content and parity file

    Therefore we should probably add to the "best practices" a mandatory step of fsck/chkdsk followed by a full scrub (in this order) after each such "bad" event.
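
The write-caching point above can be illustrated in Python: a plain write only hands data to the operating system, and an explicit flush plus `os.fsync` is needed to ask that it be pushed to the device. This is a generic sketch and says nothing about how SnapRAID itself writes its files; the function name `write_durably` is invented for the example.

```python
import os
import tempfile


def write_durably(path, data):
    """Write bytes and ask the OS to push them to the device before returning."""
    with open(path, "wb") as handle:
        handle.write(data)
        handle.flush()             # Python's buffer -> operating system cache
        os.fsync(handle.fileno())  # operating system cache -> storage device


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "state.bin")
        write_durably(target, b"example state")
        print(os.path.getsize(target))
```

Even after `os.fsync` returns, a drive's own cache may still hold the data for a moment, which is one more reason for the filesystem check and full scrub after a power loss.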

     
