#802 gfarm2fs: reproducible data corruption

gfarm-2.6.1
closed
nobody
None
gfarm_v2
blocker
2.6.0
defect
2015-02-10
2014-12-25
onlyjob
No

With FUSE client 1.2.9.7 I've reproduced undetected silent data corruption in a test setup with two replicas, where one GFSD keeps its spool on unreliable storage.

2.6.0 release notes stated:

End-to-end data integrity to detect silent data corruption.

So I would expect at least detection of the corruption and, ideally, silent repair, i.e. fallback to a good replica.

Also, I cannot figure out how to use the "gfcksum" utility: it always complains with either "operation not supported" or "no checksum". Does the integrity checking feature need to be activated somehow?

Integrity of data is very important.

Discussion

  • Osamu Tatebe

    Osamu Tatebe - 2014-12-25

    End-to-end data integrity is enabled by specifying digest in gfmd.conf and client_digest_check in gfarm2.conf. For example:

    Add digest to gfmd.conf:

    digest md5

    Add client_digest_check to gfarm2.conf or ~/.gfarm2rc:

    client_digest_check enable

    If digest is in effect, gfcksum displays the digest.

    Regards,
    Osamu

     
  • onlyjob

    onlyjob - 2014-12-25

    Please forgive my lack of experience with GFarmFS. I found that with

    digest md5
    

    in "/etc/gfmd.conf" and

    direct_local_access disable
    client_digest_check enable
    

    in "/etc/gfarm2.conf", "gfcksum" works and "gfarm2fs" detects corruption perfectly and even reports the name of the file where the error was detected to syslog. Awesome. :)

    However, there is no fallback to a good replica: the client merely returns an I/O error.
    It should be possible to fix read errors automatically by transparently falling back to a healthy replica.

    Therefore the following questions remain:

    • How can a corrupted file be repaired, provided that a healthy replica is available?
    • If a minimum of two replicas is enforced, is it always true that replicas are created in parallel (to avoid replicating the only corrupted copy)?
    • How can I make sure that a healthy replica exists?
    • Feature request: please implement
      • automatic repair in the client (fallback to a healthy replica when corruption is detected)
      • automatic removal of the corrupted replica if a healthy replica exists.

    Thanks.

     
  • onlyjob

    onlyjob - 2015-01-05

    After fixing the authentication problem (#803) I tried this test again and found that on a checksum mismatch gfarm2fs just returns a read error without trying to fetch another replica.
    I also confirmed that healthy replicas exist by stopping the problematic GFSD and running the integrity check again.

    I think the correct behaviour would be to try fetching another replica on a checksum mismatch. If a healthy replica is found, remove the corrupted replica (to initiate replication). Return a read error only if all replicas are corrupted.
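
The requested behaviour can be illustrated with a minimal sketch. It is not Gfarm code: replicas are modelled as an in-memory dict, and `read_with_fallback`, `AllReplicasCorrupted`, and the host names are hypothetical names chosen for illustration only.

```python
import hashlib


class AllReplicasCorrupted(IOError):
    """Raised when no replica matches the recorded digest."""


def read_with_fallback(replicas, expected_md5):
    """Return (data, corrupted_hosts) from the first replica whose MD5 matches.

    replicas: dict mapping a hypothetical host name to the bytes stored there.
    corrupted_hosts lists the replicas that failed the check before a good
    one was found; a real system could schedule those for removal.
    """
    corrupted = []
    for host, data in replicas.items():
        if hashlib.md5(data).hexdigest() == expected_md5:
            return data, corrupted
        corrupted.append(host)
    # Only report a read error when every replica failed the digest check.
    raise AllReplicasCorrupted(
        "all replicas failed the digest check: %s" % corrupted
    )


good = b"important data"
digest = hashlib.md5(good).hexdigest()
replicas = {"gfsd1": b"important dat\x00", "gfsd2": good}
data, bad_hosts = read_with_fallback(replicas, digest)
```

Here the first replica fails the check and is reported in `bad_hosts`, while the caller still receives the data from the second one.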

     
  • Osamu Tatebe

    Osamu Tatebe - 2015-01-10

    I appreciate your comment. Until now, corrupted files had to be removed manually, which is rather complicated. Regarding your request, it is not possible to fetch another replica at that point, since a digest error can only be detected once the whole file content has been read. Instead, the corrupted replica is now automatically moved to lost+found by [r9407]. We think this is the best approach.
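
To illustrate why a whole-file digest only reveals corruption at end of file, here is a minimal sketch, assuming a streaming reader that hands chunks to the application while updating a running MD5. It is not the Gfarm implementation; `read_verified`, `consume`, and `DigestMismatch` are hypothetical names.

```python
import hashlib
import io


class DigestMismatch(IOError):
    """Raised at end of file when the running digest differs from the expected one."""


def read_verified(stream, expected_md5, chunk_size=4):
    """Yield chunks as they are read; compare the digest only after the last chunk."""
    md5 = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
        yield chunk  # the caller already has this data before any check is possible
    if md5.hexdigest() != expected_md5:
        raise DigestMismatch("digest mismatch detected at end of file")


def consume(data, expected_md5):
    """Return (chunks_delivered, error_seen) for an in-memory replica."""
    delivered = []
    try:
        for chunk in read_verified(io.BytesIO(data), expected_md5):
            delivered.append(chunk)
    except DigestMismatch:
        return delivered, True
    return delivered, False


good = b"abcdefgh"
digest = hashlib.md5(good).hexdigest()
ok_chunks, ok_error = consume(good, digest)
bad_chunks, bad_error = consume(b"abcdefgX", digest)
```

In the corrupted case every chunk, including the damaged one, has already been delivered by the time the error is raised, so a transparent switch to another replica in the middle of the read is no longer possible.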

    Thanks,
    Osamu

     

    Related

    Commit: [r9407]

  • Osamu Tatebe

    Osamu Tatebe - 2015-02-10
    • status: new --> closed
     
  • Osamu Tatebe

    Osamu Tatebe - 2015-02-10

    This feature can be disabled by specifying 'spool_digest_error_check disable' on file system nodes [r9415]. Thanks.

     

    Related

    Commit: [r9415]
