Menu

WARNING! All the files previously present in disk ... are now missing or rewritten!

Help
rubylaser
2015-12-31
2016-01-03
  • rubylaser

    rubylaser - 2015-12-31

    Hello,

    I checked my email from my SnapRAID script last night and realized the sync didn't run and I'm greeted with this error.

    WARNING! All the files previously present in disk 'd10' at dir '/mnt/data/HIT-MJ1323YNG1JYVC/'
    are now missing or rewritten!

    I didn't make any config file changes and the disk is still mounted and healthly and contains files. Any ideas to correct this?

    Here's the counters at the end of a diff.

    WARNING! All the files previously present in disk 'd10' at dir '/mnt/data/HIT-MJ1323YNG1JYVC/'
    are now missing or rewritten!
    
       74633 equal
           0 moved
           0 copied
           0 restored
        5263 updated
         124 removed
         183 added
    There are differences!
    

    Here's the smartmontools output for anyone that's interested...

    smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.3.0-040300-generic] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Model Family:     Hitachi Deskstar 5K3000
    Device Model:     Hitachi HDS5C3030ALA630
    Serial Number:    MJ1323YNG1JYVC
    LU WWN Device Id: 5 000cca 228c0b40c
    Firmware Version: MEAOA580
    User Capacity:    3,000,592,982,016 bytes [3.00 TB]
    Sector Size:      512 bytes logical/physical
    Rotation Rate:    5700 rpm
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 4
    SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Thu Dec 31 07:44:38 2015 EST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                        was completed without error.
                        Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                        without error or no self-test has ever
                        been run.
    Total time to complete Offline
    data collection:        (39068) seconds.
    Offline data collection
    capabilities:            (0x5b) SMART execute Offline immediate.
                        Auto Offline data collection on/off support.
                        Suspend Offline collection upon new
                        command.
                        Offline surface scan supported.
                        Self-test supported.
                        No Conveyance Self-test supported.
                        Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                        power-saving mode.
                        Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                        General Purpose Logging supported.
    Short self-test routine
    recommended polling time:    (   1) minutes.
    Extended self-test routine
    recommended polling time:    ( 651) minutes.
    SCT capabilities:          (0x003d) SCT Status supported.
                        SCT Error Recovery Control supported.
                        SCT Feature Control supported.
                        SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   135   135   054    Pre-fail  Offline      -       108
      3 Spin_Up_Time            0x0007   126   126   024    Pre-fail  Always       -       552 (Average 551)
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       673
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   135   135   020    Pre-fail  Offline      -       31
      9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       31856
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       76
    192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1635
    193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1635
    194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 18/51)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%     31852         -
    # 2  Short offline       Completed without error       00%     31830         -
    # 3  Short offline       Completed without error       00%     31820         -
    # 4  Short offline       Completed without error       00%     31780         -
    # 5  Extended offline    Interrupted (host reset)      10%     31760         -
    # 6  Short offline       Completed without error       00%     31708         -
    # 7  Short offline       Completed without error       00%     31660         -
    # 8  Short offline       Completed without error       00%     31637         -
    # 9  Short offline       Completed without error       00%     31611         -
    #10  Short offline       Completed without error       00%     31587         -
    #11  Extended offline    Completed without error       00%     31577         -
    #12  Short offline       Completed without error       00%     31563         -
    #13  Short offline       Completed without error       00%     31543         -
    #14  Short offline       Completed without error       00%     31525         -
    #15  Short offline       Completed without error       00%     31491         -
    #16  Short offline       Completed without error       00%     31467         -
    #17  Short offline       Completed without error       00%     31443         -
    #18  Extended offline    Completed without error       00%     31423         -
    #19  Short offline       Completed without error       00%     31395         -
    #20  Short offline       Completed without error       00%     31373         -
    #21  Short offline       Completed without error       00%     31347         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    
     

    Last edit: rubylaser 2015-12-31
  • Leifi Plomeros

    Leifi Plomeros - 2015-12-31

    To stop the script (and any other automated stuff concerning the disk) and make a backup of the content file (preferably the entire directory) is probably a good start.

    The added and removed file mostly belong to other disks?
    The file count on d10 is ~5.000?
    Does the file count match snapraid status?
    Can you try adding a single file to d10 and confirm that diff now shows one more file.

    Have you investigated the files on d10 to confirm that the file names, timestamps and file sizes looks legit in general?

    If the files look weird upon closer inspection, it is reasonable to assume something bad happend to the disk and that snapraid is correct to state all of them are modified. If not it seems reasonable to assume something bad has happend to the content file.

    Are you using the betas with ability to rename disks? If yes, have you earlier renamed any disk? Is the diff result the same if you downgrade to latest non-beta?

    As a last step I would try to fix a single file to a different disk and compare with original. Does snapraid even succeed in doing it or report an error?

     
    • rubylaser

      rubylaser - 2015-12-31

      Thanks for the ideas! Here are answers to your questions.

      I'm on SnapRAID 9.1 and not the newest betas supporting renaming disks (and, I have not renamed any disks).

      Yes, the added and removed files all belong to other disks (not d10).

      The filecount on d10 is less than 100.

      No the counts do not match, SnapRAID status shows a slightly lower number of files than the actual count for d10 (37 v. 61).

      SnapRAID status report:
      
         Files Fragmented Excess  Wasted  Used    Free  Use Name
                  Files  Fragments  GB      GB      GB
         13446      23      47       -    2881      67  97% d01
          5463      14      37       -    3924      72  98% d02
         12420      34      78       -    3914      72  98% d03
          7170       4       9       -    2879      52  98% d04
          9884       2       8       -    1922      46  97% d05
          3055      43      53       -    2869      80  97% d06
         24105       1       1       -    2736     210  92% d07
          1013      67      92       -    1584    1411  52% d08
          2077      65      84       -    2271    1727  56% d09
            37       0       0       -     100    2897   3% d10
            40       5       5       -      74    2925   2% d11
           258      57     100       -     656    2341  21% d12
          1052      37      48       -     685    3167  17% d13
       ------------------------------------------------------
         80020     352     562     0.0   26502   15071  63%
      
      root@fileserver:/mnt/data/HIT-MJ1323YNG1JYVC# find . -type f | wc -l
      61
      

      If I add a file, the count in diff does reflect the increase.

      touch testfile
      # snapraid diff
      WARNING! All the files previously present in disk 'd10' at dir '/mnt/data/HIT-MJ1323YNG1JYVC/'
      are now missing or rewritten!
      
         74633 equal
             0 moved
             0 copied
             0 restored
          5263 updated
           124 removed
           184 added
      There are differences!
      

      All files appear to be named appropriately, have correct timestamps, and the media files are all playable.

      At this point, I don't want to do a fix until I know what the problem is as everything appears to be okay other than snapraid.

       

      Last edit: rubylaser 2015-12-31
  • Leifi Plomeros

    Leifi Plomeros - 2015-12-31

    From the looks of it snapraid is correctly reading everything in the array, but is having a different expectation for more than 5.000 files, not isolated to d10.

    If the files had been "touched" by some application you would have spotted that they were all modified between the last successful sync and the failed one. So I guess that can be ruled out.

    If they had been moved from one disk to another by some kind of disk balancer they would have showed up as removed at the original location... Unless they were moved back and forth... Which should have resulted in a new time stamp or presented them as restored (at least a I think files only have inode changed are interpreted like that) I am out of logical explanations.

    I still think the best option at this point is to do a limited fix to a different disk. Just make sure that you have backups of the content file. The fix operation should not change any parity so it should be completely safe to do.

    After the fix you would either get an error message confirming that the content file is broken.
    Or you would have files that you can compare to find out what is actually different.

    That is probably as far as you can diagnose the problem without help from Andrea.

     

    Last edit: Leifi Plomeros 2015-12-31
    • rubylaser

      rubylaser - 2015-12-31

      Okay, I just did a fix on one file, and it completed without an issue. Here's what it looks like.

      root@fileserver:/mnt/data/SEA-Z7P0027C/documents/Test Disc/Bitrate# snapraid fix -f /documents/Test\ Disc/Bitrate/bird60.mkv
      Self test...
      Loading state from /var/snapraid/content...
      Searching disk d01...
      Searching disk d02...
      Searching disk d03...
      Searching disk d04...
      Searching disk d05...
      Searching disk d06...
      Searching disk d07...
      Searching disk d08...
      Searching disk d09...
      Searching disk d10...
      Searching disk d11...
      Searching disk d12...
      Searching disk d13...
      Filtering...
      Using 1751 MiB of memory.
      Initializing...
      Fixing...
      100% completed, 1838 MB processed in 0:01
      Everything OK
      
       
  • Leifi Plomeros

    Leifi Plomeros - 2015-12-31

    Name, timestamp and size match?

    Can you do a binary compare or create checksums for both files to confirm that they are identical? (according to google that is really easy to do in linux) :)

     
    • rubylaser

      rubylaser - 2015-12-31

      Sure, I could do an md5sum, but I would assume that a fix would have reported an error if it's hash didn't match. Here's an md5sum before and after a fix on the file, they are exactly the same before and after. Thanks for the continued ideas :)

      Before

      root@fileserver:/mnt/data/SEA-Z7P0027C/documents/Test Disc/Bitrate# md5sum bird50.mkv
      dd1f3013d3dbe2d5b9c000d23b490916  bird50.mkv
      

      Fix

      root@fileserver:/mnt/data/SEA-Z7P0027C/documents/Test Disc/Bitrate# snapraid fix -f /documents/Test\ Disc/Bitrate/bird50.mkv
      Self test...
      Loading state from /var/snapraid/content...
      Searching disk d01...
      Searching disk d02...
      Searching disk d03...
      Searching disk d04...
      Searching disk d05...
      Searching disk d06...
      Searching disk d07...
      Searching disk d08...
      Searching disk d09...
      Searching disk d10...
      Searching disk d11...
      Searching disk d12...
      Searching disk d13...
      Filtering...
      Using 1751 MiB of memory.
      Initializing...
      Fixing...
      100% completed, 1548 MB processed in 0:01
      Everything OK
      

      md5sum After

      root@fileserver:/mnt/data/SEA-Z7P0027C/documents/Test Disc/Bitrate# md5sum bird50.mkv
      dd1f3013d3dbe2d5b9c000d23b490916  bird50.mkv
      
       
  • Leifi Plomeros

    Leifi Plomeros - 2015-12-31

    The good part is:

    1. Nothing seems wrong with your files
    2. Nothing seems wrong with the fixed versions either
    3. You could chose between sync or fix to correct the issue

    The bad part:

    1. Snapraid thinks you have updated thousands of files since last sync.
    2. When snapraid thinks a file is updated during sync it discards the hashes and creates new ones based on the updated file... Which introduces the possibility of undiscovered corruption in those files.
    3. This could have happend on a daily basis for an unknown period of time. (The only reason you discovered it was because it affected all files on a nearly empty disk)

    From a somewhat logical point of view:
    It is difficult to imagine an undiscovered corruption of the content file on disk or in memory, only affecting time stamps, file size, file name or inode informationm and not triggering an error when snapraid reads the file.

    Name and filesize can easily be ruled out since you were able to restore a file with correct name and file size.

    Still these are, as far as I know, the only things snapraid looks at during diff. Which pretty much leaves only time stamps and inodes as possible mismatches.

    If only inode was wrong I assume that snapraid diff would report the file as restored.

    Which leaves only the timestamp... Is it really identical for both the original and the fixed file?

    If it is then it is a complete mystery... Which you could probably make even more mysterious by doing this:

    1. Move a file on d10 to a folder outside the array on d10.
    2. Fix all missing files on d10 only (which would be only 1)
    3. Run diff to see if the fixed file is again counted as updated or if you suddenly have 1 restored file.

    If the file is reported as restored then obviously snapraid is working as designed and the only explanation left is that snapraid has correctly identified something different with the file attributes. (or that snapraid uses some mechanism to keep track of files it has actually fixed)

    If the file is reported as updated, then everything points to some really strange bug in snapraid.

    Another test you can do is to just wait a few days and see if the number of updated files increase. If that happens then you know for sure that something is in fact modifying the file attributes.

     
  • rubylaser

    rubylaser - 2015-12-31

    I'll have to try the rest of these tests tomorrow, because a sync doesn't want to run without forcing it. Thanks again for your help!

    dd1f3013d3dbe2d5b9c000d23b490916  bird50.mkv
    root@fileserver:~# snapraid sync
    Self test...
    Loading state from /var/snapraid/content...
    Scanning disk d01...
    Scanning disk d02...
    Scanning disk d03...
    Scanning disk d04...
    Scanning disk d05...
    Scanning disk d06...
    Scanning disk d07...
    Scanning disk d08...
    Scanning disk d09...
    Scanning disk d10...
    Scanning disk d11...
    Scanning disk d12...
    Scanning disk d13...
    WARNING! All the files previously present in disk 'd10' at dir '/mnt/data/HIT-MJ1323YNG1JYVC/'
    are now missing or rewritten!
    If you want to 'sync' anyway, use 'snapraid --force-empty sync'
    
     
  • rubylaser

    rubylaser - 2016-01-02

    Is there a way to see specifically what files SnapRAID thinks are missing or rewritten? I'd like to complete a sync if possible.

     
  • Leifi Plomeros

    Leifi Plomeros - 2016-01-02

    Snapraid diff does that by default?

     
    • rubylaser

      rubylaser - 2016-01-02

      True, durr, I guess I was looking for more info than a basic diff I guess. At this point all the files seem okay, so I guess I'll just force-empty on the array.

       
  • Andrea Mazzoleni

    Hi rubylaser,

    It's indeed a strange condition. It seems that inodes or timestamp of your files changed for some reason.

    Anyway, if fix reported no error, your files are OK. The fix command doesn't use timestamp/inodes, but only the file path. So, if no error is reported it means that all the hashes are matching. At this point you can safely run the --force-empty command.

    But I'm still interested to understand what happened.

    What filesystem are you using for your disks ?
    Does your files have subsecond timestamp ?

    You can check this with the stat command like:

    $ stat snapraid.1 
    File: 'snapraid.1'
    ...  
    Modify: 2016-01-02 16:59:54.686573281 +0100
    

    In the "Modify:" line, the subsecond number is ".686573281".
    In some case, it could be ".0", and then SnapRAID doesn't use timestamp to match files, but only inodes.

    Ciao,
    Andrea

     
    • Jessie Taylor

      Jessie Taylor - 2016-01-02

      Maybe you could add an option to "diff" to be more verbose, and report why each file is different: inode, timestamp (subsecond timestamp), etc.

      If a problem like this comes up in the future, it might help to debug it.

       
    • rubylaser

      rubylaser - 2016-01-03

      Thanks Andrea, I'm running --force-empty now. I'm using ext4 for both data and parity disks, so it should be good go. I do get subsecond accuracy as well.

      ~~~~
      root@fileserver:/storage/documents/Test Disc/Bitrate# stat bird60.mkv
      File: ‘bird60.mkv’
      Size: 167019983 Blocks: 326216 IO Block: 4096 regular file
      Device: 1bh/27d Inode: 297552 Links: 1
      Access: (0777/-rwxrwxrwx) Uid: ( 1000/ zack) Gid: ( 1000/ zack)
      Access: 2015-12-31 14:58:52.686252107 -0500
      Modify: 2010-05-09 13:49:00.000000000 -0400
      Change: 2015-11-28 09:45:44.444903309 -0500
      Birth: -
      ~~~~

       

Log in to post a comment.

MongoDB Logo MongoDB