I checked the email from my SnapRAID script last night and realized the sync didn't run. Instead, I was greeted with this error:
WARNING! All the files previously present in disk 'd10' at dir '/mnt/data/HIT-MJ1323YNG1JYVC/'
are now missing or rewritten!
I didn't make any config file changes, and the disk is still mounted, healthy, and contains files. Any ideas on how to correct this?
Here are the counters at the end of a diff.
WARNING! All the files previously present in disk 'd10' at dir '/mnt/data/HIT-MJ1323YNG1JYVC/'
are now missing or rewritten!
74633 equal
0 moved
0 copied
0 restored
5263 updated
124 removed
183 added
There are differences!
Here's the smartmontools output for anyone that's interested...
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.3.0-040300-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Hitachi Deskstar 5K3000
Device Model: Hitachi HDS5C3030ALA630
Serial Number: MJ1323YNG1JYVC
LU WWN Device Id: 5 000cca 228c0b40c
Firmware Version: MEAOA580
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5700 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Dec 31 07:44:38 2015 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled.
Self-test execution status: (0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (39068) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: (1) minutes.
Extended self-test routine recommended polling time: (651) minutes.
SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 135 135 054 Pre-fail Offline - 108
3 Spin_Up_Time 0x0007 126 126 024 Pre-fail Always - 552 (Average 551)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 673
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 135 135 020 Pre-fail Offline - 31
9 Power_On_Hours 0x0012 096 096 000 Old_age Always - 31856
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 76
192 Power-Off_Retract_Count 0x0032 099 099 000 Old_age Always - 1635
193 Load_Cycle_Count 0x0012 099 099 000 Old_age Always - 1635
194 Temperature_Celsius 0x0002 214 214 000 Old_age Always - 28 (Min/Max 18/51)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 31852 -
# 2 Short offline Completed without error 00% 31830 -
# 3 Short offline Completed without error 00% 31820 -
# 4 Short offline Completed without error 00% 31780 -
# 5 Extended offline Interrupted (host reset) 10% 31760 -
# 6 Short offline Completed without error 00% 31708 -
# 7 Short offline Completed without error 00% 31660 -
# 8 Short offline Completed without error 00% 31637 -
# 9 Short offline Completed without error 00% 31611 -
#10 Short offline Completed without error 00% 31587 -
#11 Extended offline Completed without error 00% 31577 -
#12 Short offline Completed without error 00% 31563 -
#13 Short offline Completed without error 00% 31543 -
#14 Short offline Completed without error 00% 31525 -
#15 Short offline Completed without error 00% 31491 -
#16 Short offline Completed without error 00% 31467 -
#17 Short offline Completed without error 00% 31443 -
#18 Extended offline Completed without error 00% 31423 -
#19 Short offline Completed without error 00% 31395 -
#20 Short offline Completed without error 00% 31373 -
#21 Short offline Completed without error 00% 31347 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Last edit: rubylaser 2015-12-31
A good start is probably to stop the script (and any other automation touching the disk) and make a backup of the content file (preferably the entire directory).
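For the backup step, something along these lines might do. It is only a sketch: it assumes the config file lists content files as `content /path/to/file` lines with no spaces in the paths, and the function name and example paths are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: copy every SnapRAID content file named in a config file into a
# backup directory, with a timestamp suffix.
# Assumptions: "content /path" lines, no spaces in the paths.

backup_content_files() {
    local conf="$1" dest="$2" stamp path
    stamp="$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || return 1
    # Take the second field of each line whose first field is "content".
    awk '$1 == "content" { print $2 }' "$conf" | while read -r path; do
        if [ -f "$path" ]; then
            # Flatten the path so copies from different disks do not collide.
            cp -p "$path" "$dest/$(printf '%s' "$path" | tr '/' '_').$stamp"
        else
            echo "missing content file: $path" >&2
        fi
    done
}

# Example (placeholder paths):
# backup_content_files /etc/snapraid.conf /root/snapraid-content-backup
```

Keeping one copy per content file, rather than only the first, makes it possible to compare them with each other later.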
Do the added and removed files mostly belong to other disks?
Is the file count on d10 ~5,000?
Does the file count match snapraid status?
Can you try adding a single file to d10 and confirm that diff now shows one more file?
Have you investigated the files on d10 to confirm that the file names, timestamps and file sizes look legit in general?
If the files look weird upon closer inspection, it is reasonable to assume something bad happened to the disk and that snapraid is correct to state that all of them are modified. If not, it seems reasonable to assume something bad has happened to the content file.
Are you using the betas with the ability to rename disks? If yes, have you renamed any disk earlier? Is the diff result the same if you downgrade to the latest non-beta?
As a last step I would try to fix a single file to a different disk and compare with original. Does snapraid even succeed in doing it or report an error?
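To get the on-disk file count for the comparison with snapraid status, a small helper like this could be used. It is a rough sketch: the function name is made up, and it ignores any exclude rules in the SnapRAID config, so the two numbers can legitimately differ a little.

```shell
# Sketch: count regular files under a data disk's mount point.
# Assumption: no file names contain newlines (each would be counted twice).

count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Example, using the mount point from the warning:
# count_files /mnt/data/HIT-MJ1323YNG1JYVC/
```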
If I add a file, the count in diff does reflect the increase.
touch testfile
# snapraid diff
WARNING! All the files previously present in disk 'd10' at dir '/mnt/data/HIT-MJ1323YNG1JYVC/'
are now missing or rewritten!
74633 equal
0 moved
0 copied
0 restored
5263 updated
124 removed
184 added
There are differences!
All files appear to be named appropriately, have correct timestamps, and the media files are all playable.
At this point, I don't want to do a fix until I know what the problem is as everything appears to be okay other than snapraid.
Last edit: rubylaser 2015-12-31
From the looks of it, snapraid is correctly reading everything in the array but has a different expectation for more than 5,000 files, not isolated to d10.
If the files had been "touched" by some application you would have spotted that they were all modified between the last successful sync and the failed one. So I guess that can be ruled out.
If they had been moved from one disk to another by some kind of disk balancer, they would have shown up as removed at the original location... unless they were moved back and forth... which should have resulted in a new timestamp or presented them as restored (at least I think files that only have their inode changed are interpreted like that). I am out of logical explanations.
I still think the best option at this point is to do a limited fix to a different disk. Just make sure that you have backups of the content file. The fix operation should not change any parity so it should be completely safe to do.
After the fix you would either get an error message confirming that the content file is broken, or you would have files that you can compare to find out what is actually different.
That is probably as far as you can diagnose the problem without help from Andrea.
Last edit: Leifi Plomeros 2015-12-31
Can you do a binary compare or create checksums for both files to confirm that they are identical? (According to Google that is really easy to do in Linux.) :)
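A binary compare can be done with `cmp`, and `md5sum` gives digests that are easy to paste into a post. A minimal sketch, with a made-up function name and placeholder paths:

```shell
# Sketch: check that an original file and its restored copy are
# byte-identical. cmp -s is silent and only sets the exit status.

same_content() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        return 1
    fi
}

# Example (placeholder paths):
# same_content /mnt/data/disk/file.mkv /tmp/restore/file.mkv
# md5sum /mnt/data/disk/file.mkv /tmp/restore/file.mkv
```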
Sure, I could do an md5sum, but I would assume that a fix would have reported an error if its hash didn't match. Here's an md5sum before and after a fix on the file; they are exactly the same before and after. Thanks for the continued ideas :)
The good part is:
Nothing seems wrong with the fixed versions either.
You could choose between sync or fix to correct the issue.
The bad part:
Snapraid thinks you have updated thousands of files since last sync.
When snapraid thinks a file is updated during sync it discards the hashes and creates new ones based on the updated file... Which introduces the possibility of undiscovered corruption in those files.
This could have happened on a daily basis for an unknown period of time. (The only reason you discovered it was that it affected all files on a nearly empty disk.)
From a somewhat logical point of view:
It is difficult to imagine an undiscovered corruption of the content file on disk or in memory that only affects timestamps, file size, file name or inode information and does not trigger an error when snapraid reads the file.
Name and filesize can easily be ruled out since you were able to restore a file with correct name and file size.
Still these are, as far as I know, the only things snapraid looks at during diff. Which pretty much leaves only time stamps and inodes as possible mismatches.
If only the inode were wrong, I assume that snapraid diff would report the file as restored.
Which leaves only the timestamp... Is it really identical for both the original and the fixed file?
If it is then it is a complete mystery... Which you could probably make even more mysterious by doing this:
Move a file on d10 to a folder outside the array on d10.
Fix all missing files on d10 only (which would be only 1)
Run diff to see if the fixed file is again counted as updated or if you suddenly have 1 restored file.
If the file is reported as restored then obviously snapraid is working as designed and the only explanation left is that snapraid has correctly identified something different with the file attributes. (or that snapraid uses some mechanism to keep track of files it has actually fixed)
If the file is reported as updated, then everything points to some really strange bug in snapraid.
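The move / fix / diff test could be scripted roughly like this. It is a sketch under assumptions: the disk is called d10 in the config, `snapraid fix -m -d NAME` limits the fix to missing files on that disk (as far as I know from the SnapRAID manual, so check `snapraid --help` for your version), and the holding directory is excluded from the array. The function name and the `SNAPRAID` override variable are made up for illustration.

```shell
# Sketch of the move / fix / diff test.
# SNAPRAID can be overridden, for example with "echo snapraid", to print
# the commands instead of running them.

SNAPRAID="${SNAPRAID:-snapraid}"

move_fix_diff() {
    local file="$1" holding_dir="$2" disk="$3"
    mkdir -p "$holding_dir" || return 1
    # Step 1: move the file out of the array so SnapRAID sees it as missing.
    mv -- "$file" "$holding_dir/" || return 1
    # Step 2: restore only the missing files on the given disk.
    $SNAPRAID fix -m -d "$disk" || return 1
    # Step 3: check whether the fixed file is counted as updated or restored.
    $SNAPRAID diff
}

# Example (placeholder paths):
# move_fix_diff /mnt/data/HIT-MJ1323YNG1JYVC/some/file /mnt/data/HIT-MJ1323YNG1JYVC/outside-array d10
```

Doing a dry run first with `SNAPRAID="echo snapraid"` shows exactly which commands would be issued before anything touches parity or content files.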
Another test you can do is to just wait a few days and see if the number of updated files increases. If that happens then you know for sure that something is in fact modifying the file attributes.
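For the wait-and-see test, it may help to record the file attributes now and compare them again in a few days, independently of snapraid. A sketch, assuming GNU stat and file names without newlines; the function names and paths are made up:

```shell
# Sketch: record inode, size and modification time (with nanoseconds) for
# every file under a directory, one line per file, in a stable order.

snapshot_attrs() {
    find "$1" -type f -exec stat -c '%i %s %y %n' {} + | LC_ALL=C sort
}

# Print the lines that appear only in the newer snapshot, i.e. files that
# are new or whose inode, size or mtime changed since the older snapshot.
changed_files() {
    LC_ALL=C comm -13 "$1" "$2"
}

# Example (placeholder paths):
# snapshot_attrs /mnt/data/HIT-MJ1323YNG1JYVC/ > /root/d10-attrs.before
# ... a few days later ...
# snapshot_attrs /mnt/data/HIT-MJ1323YNG1JYVC/ > /root/d10-attrs.after
# changed_files /root/d10-attrs.before /root/d10-attrs.after
```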
Hello,
Thanks for the ideas! Here are the answers to your questions.
I'm on SnapRAID 9.1, not the newest betas that support renaming disks (and I have not renamed any disks).
Yes, the added and removed files all belong to other disks (not d10).
The file count on d10 is less than 100.
No, the counts do not match. snapraid status shows a slightly lower number of files than the actual count for d10 (37 vs. 61).
Okay, I just did a fix on one file, and it completed without an issue. Here's what it looks like.
Name, timestamp and size match?
I'll have to try the rest of these tests tomorrow, because a sync doesn't want to run without forcing it. Thanks again for your help!
Is there a way to see specifically what files SnapRAID thinks are missing or rewritten? I'd like to complete a sync if possible.
Snapraid diff does that by default?
True, durr. I guess I was looking for more info than a basic diff. At this point all the files seem okay, so I guess I'll just sync the array with --force-empty.
Hi rubylaser,
It's indeed a strange condition. It seems that the inodes or timestamps of your files changed for some reason.
Anyway, if fix reported no error, your files are OK. The fix command doesn't use timestamps or inodes, only the file path. So if no error is reported, it means that all the hashes match. At this point you can safely run sync with --force-empty.
But I'm still interested in understanding what happened.
What filesystem are you using for your disks?
Do your files have subsecond timestamps?
You can check this with the stat command.
In the "Modify:" line of its output, the subsecond number looks like ".686573281".
In some cases it could be ".0", and then SnapRAID doesn't use the timestamp to match files, only inodes.
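To check a whole disk rather than one file, something like the following could list the files whose modification time has a zero subsecond part. This is a sketch that assumes GNU stat, whose `%y` format prints the time as `YYYY-MM-DD HH:MM:SS.NNNNNNNNN +ZZZZ`; the function name is made up.

```shell
# Sketch: list files under a directory whose mtime has no subsecond part.

zero_subsecond_files() {
    find "$1" -type f -exec stat -c '%y %n' {} + |
        # Field 2 is the time of day; keep lines where the fraction is all
        # zeros, then strip date, time and zone so only the path remains.
        awk '$2 ~ /\.0+$/ { sub(/^[^ ]+ [^ ]+ [^ ]+ /, ""); print }'
}

# Example:
# zero_subsecond_files /mnt/data/HIT-MJ1323YNG1JYVC/
```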
Ciao,
Andrea
Maybe you could add an option to "diff" to be more verbose, and report why each file is different: inode, timestamp (subsecond timestamp), etc.
If a problem like this comes up in the future, it might help to debug it.
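Until something like that exists, the attributes of two copies of a file can be compared by hand. The sketch below is not SnapRAID code and does not reflect how SnapRAID itself decides; it only reports which of the values shown by GNU stat differ. The function name is made up.

```shell
# Sketch: report which attributes differ between two files.
# Assumption: GNU stat (-c).

explain_diff() {
    local a="$1" b="$2" reasons=""
    [ "$(stat -c %i "$a")" = "$(stat -c %i "$b")" ] || reasons="$reasons inode"
    [ "$(stat -c %s "$a")" = "$(stat -c %s "$b")" ] || reasons="$reasons size"
    # %Y is the mtime in whole seconds; %y also includes the nanoseconds.
    if [ "$(stat -c %Y "$a")" != "$(stat -c %Y "$b")" ]; then
        reasons="$reasons mtime"
    elif [ "$(stat -c %y "$a")" != "$(stat -c %y "$b")" ]; then
        reasons="$reasons mtime-subsecond"
    fi
    [ -n "$reasons" ] || reasons=" none"
    # Drop the leading space before printing.
    echo "${reasons# }"
}

# Example (placeholder paths):
# explain_diff /mnt/data/disk/file.mkv /tmp/restore/file.mkv
```

A restored copy on another disk will always report `inode`, so the interesting part is whether `size`, `mtime` or `mtime-subsecond` shows up as well.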
Thanks Andrea, I'm running --force-empty now. I'm using ext4 for both data and parity disks, so it should be good to go. I do get subsecond accuracy as well.
~~~~
root@fileserver:/storage/documents/Test Disc/Bitrate# stat bird60.mkv
File: ‘bird60.mkv’
Size: 167019983 Blocks: 326216 IO Block: 4096 regular file
Device: 1bh/27d Inode: 297552 Links: 1
Access: (0777/-rwxrwxrwx) Uid: ( 1000/ zack) Gid: ( 1000/ zack)
Access: 2015-12-31 14:58:52.686252107 -0500
Modify: 2010-05-09 13:49:00.000000000 -0400
Change: 2015-11-28 09:45:44.444903309 -0500
Birth: -
~~~~