Silent error but files are OK?

Skibbi
2014-01-30
2014-02-06
  • Skibbi

    Skibbi - 2014-01-30

    Hello,
    Yesterday snapraid scrub found multiple silent errors in some of my files (WARNING! Unexpected data error in a data disk! The block is now marked as bad!). I've checked the affected files and they seem to be OK (most of them are rar files, which I unpacked without errors). SMART data for the drives is also OK. The hardware looks OK too; at least memtest didn't report any issues.
    Is it safe to run the suggested snapraid -e fix command, or might there be a bug in the latest 5.2 version?

     
  • jwill42

    jwill42 - 2014-01-30

    That would be risky. Your report is sorely lacking in details, but there is a distinct possibility that you have (or had) some hardware faults during checksum computation or verification.

     
    Last edit: jwill42 2014-01-30
  • Skibbi

    Skibbi - 2014-01-30

    The recovered files are broken. But I made a backup before running snapraid fix, so I restored all the files. The question is: why did snapraid corrupt my files' checksums? I'm pretty sure the hardware is OK; it's not a brand new computer, but its uptime is months. Is it possible to manually unmark the bad blocks?

     
  • Andrea Mazzoleni

    Hi Skibbi,

    Likely the machine had some kind of silent error when SnapRAID read those files the first time, and it then computed the parity and the hash from the wrong data.
    The fix command just restored that wrong data, as SnapRAID originally read it.

    If you still have the files, it would be interesting to see in which parts they differ, and with what content.
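
    A minimal sketch of such a comparison, assuming the restored original and the "fixed" copy are both still on disk. It walks the two files in 128 KiB steps (the block size mentioned in this thread) and reports which block indexes differ. The function names are illustrative only and are not part of SnapRAID.

```python
BLOCK_SIZE = 128 * 1024  # assumption: the 128k block size mentioned in this thread


def differing_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Return the indexes of the blocks in which the two files differ."""
    bad = []
    index = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files are exhausted
            if a != b:
                bad.append(index)  # also catches a length mismatch
            index += 1
    return bad


def to_ranges(indexes):
    """Collapse sorted block indexes into inclusive (first, last) ranges."""
    ranges = []
    for i in indexes:
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)
        else:
            ranges.append((i, i))
    return ranges
```

    Calling to_ranges(differing_blocks(original, fixed)) gives a compact list of damaged block ranges that can be compared against the positions in the SnapRAID log.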

    We already had a report of this kind, and the most likely cause is a latent hardware problem triggered by the heavy load that SnapRAID is generating on the machine during sync operations.

    That this kind of problem happens only during sync is not so strange, because only in sync are all disks spinning, reading or writing, with the CPU almost fully busy.

    Anyway, I'm planning to address this issue in the 6.0 release by moving the hash computation before the parity computation, one disk at a time, to avoid putting too much pressure on the machine.

    This will allow detecting this kind of problem before computing wrong parity, but the only real fix is to find the hardware problem, for example by replacing the disk cabling, the power supply, the memory, or whatever else.

    Ciao,
    Andrea

     
  • Skibbi

    Skibbi - 2014-01-31

    Hi Andrea,
    Unfortunately I removed the corrupted files, did a new sync, and restored the original files, so I cannot analyze the differences in detail. But according to the log the errors were located in multiple blocks (I use a 128k block size):

    DANGER! In the array there are 3861 silent errors!

    Some files were corrupted from position 0 to 381. A few other files were corrupted in the 0-114 range. In total there were about 10 affected files. Files with the same corruption range belonged to the same directory.
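
    To put those ranges in perspective, here is a small sketch of the arithmetic. It assumes that a logged "position" is a block index within the file and that a block is 128 KiB; both are assumptions about the log format, and the helper names are made up for illustration.

```python
BLOCK_SIZE = 128 * 1024  # assumption: 128 KiB blocks, positions are block indexes


def blocks_for_size(size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold a file of size_bytes (rounded up)."""
    return -(-size_bytes // block_size)


def byte_span(first_block, last_block, block_size=BLOCK_SIZE):
    """Inclusive byte offsets covered by blocks first_block..last_block."""
    return first_block * block_size, (last_block + 1) * block_size - 1
```

    Under those assumptions, blocks 0 to 114 are 115 blocks, which is exactly what blocks_for_size(15000000) returns. So a 0-114 range on a file of about 15 MB would mean the whole file was flagged rather than a small part of it.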

    I'm just wondering: wouldn't it be possible to compute the hash twice, once before sync and once after? Or maybe run a few hash computations just to make sure no hardware errors are present?
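
    The idea can be sketched outside SnapRAID as follows. This is only an illustration: SHA-256 from the Python standard library stands in for whatever hash SnapRAID uses internally, and the function names are hypothetical. Note that a second pass may be served from the operating system's page cache rather than from the disk, so agreement between passes does not prove that the disk read itself was correct.

```python
import hashlib


def file_digest(path, chunk_size=1024 * 1024):
    """Hash one full read of the file (SHA-256 used as a stand-in)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def stable_digest(path, passes=2):
    """Hash the file several times; return the digest only if all passes agree.

    Returns None when two passes disagree, which would point at an
    unreliable read path (memory, cabling, controller, ...).
    """
    digests = {file_digest(path) for _ in range(passes)}
    return digests.pop() if len(digests) == 1 else None
```

    A None result on any file would be a strong hint to look at the hardware before trusting a new sync.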

    In fact, right now I'm not sure about my parity file: there might be more silent errors, and with bad luck I wouldn't be able to verify the source files. Perhaps I should do a rescan/rehash of the whole array?

    Regards

     
  • Mitchell Deoudes

    Is there any facility within Snapraid currently (or could there be in the future) to "re-run" a sync, checking only changed files to verify both the hashes and the parity?

    I suppose you could hack this together with scripts and the check command - though it would probably mean invoking Snapraid many times, since Snapraid (I believe) only accepts command-line filters, rather than a filter file.
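
    A rough sketch of that kind of script: it collects the files changed since a given time and only builds the command lines, without running anything. It relies on the -f (--filter) option of snapraid check; the exact pattern form used here (a leading slash plus the path relative to the disk root) and the batch size are assumptions that should be checked against the SnapRAID manual before use.

```python
import os


def changed_files(root, since_epoch):
    """Paths under root (relative to root) modified at or after since_epoch."""
    result = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            if os.path.getmtime(full) >= since_epoch:
                result.append(os.path.relpath(full, root))
    return sorted(result)


def check_commands(rel_paths, batch_size=50):
    """Build 'snapraid check' argument lists, batch_size filters per call.

    Assumption: each filter is '/' + path relative to the disk root.
    File names containing glob characters would need extra care.
    """
    commands = []
    for start in range(0, len(rel_paths), batch_size):
        cmd = ["snapraid", "check"]
        for rel in rel_paths[start:start + batch_size]:
            cmd += ["-f", "/" + rel.replace(os.sep, "/")]
        commands.append(cmd)
    return commands
```

    Each returned list could then be passed to subprocess.run once the filter form has been confirmed, which keeps the number of Snapraid invocations down to one per batch.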

     
  • Gary Snow

    Gary Snow - 2014-02-03

    It has been my experience (through the hard knocks of troubleshooting this exact problem) that this is a HARDWARE error/problem. Most likely it is a MEMORY problem. Do NOT trust what MEMTEST is telling you because it is usually wrong and doesn't test the memory in a way that will expose the error. If the option is available to you I would replace the memory and then try again. At the very least pull some of your memory out and swap it around to see if you can find the bad stick.

    Like I said DO NOT TRUST what MEMTEST is telling you. I did the same thing and ended up swapping most (actually ALL) of my hardware out and in the end it turned out to be bad memory.

    Gary

     
  • Skibbi

    Skibbi - 2014-02-06

    Hello,
    Another week and another bunch of silent errors detected. But I found an interesting thing: all corrupted files are reported as duplicates of other files in the same directory ('snapraid dup'):

    15000000 /mnt/data/Documents/archive.r02 = /mnt/data/Documents/archive.r03
    15000000 /mnt/data/Documents/archive.r07 = /mnt/data/Documents/archive.r08

    archive.r03 and archive.r08 are corrupted according to snapraid. Of course both files are totally fine - the problem is in the computed hash.
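
    One way to confirm that such pairs are not real duplicates is to compare them byte for byte. The sketch below parses lines in the form quoted above (size, first path, " = ", second path; the meaning of the leading number as a size is an assumption) and uses filecmp from the Python standard library for the comparison. The function names are illustrative.

```python
import filecmp


def parse_dup_line(line):
    """Split a 'SIZE PATH_A = PATH_B' line into (size, path_a, path_b).

    Assumption: the format matches the lines quoted in this thread and
    the first path does not itself contain ' = '.
    """
    size, rest = line.strip().split(None, 1)
    first, second = rest.split(" = ", 1)
    return int(size), first, second


def really_identical(path_a, path_b):
    """True only if the two files have exactly the same content."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```

    If really_identical returns False for a pair that 'snapraid dup' lists, the two stored hashes match even though the files differ, which suggests that at least one hash was computed from data other than what is on disk now.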

    According to 'snapraid dup' there are a few other mysterious duplicates, but it seems that all new files added to the disks have correct hashes. I'm going to rehash those files and monitor snapraid's behaviour. I plan to test my computer with different memory, as suggested, just to make sure that's not the cause.

    Could some bug in the Linux kernel trigger these issues?

    Regards

     
    Last edit: Skibbi 2014-02-06
