#802 gfarm2fs: reproducible data corruption

Milestone: gfarm-2.6.1

Status: closed

Owner: nobody

Labels: None

Resolution:

Component: gfarm_v2

Priority: blocker

Version: 2.6.0

Type: defect

Updated: 2015-02-10

Created: 2014-12-25

Creator: onlyjob

Private: No

With FUSE client 1.2.9.7 I've reproduced undetected, silent data corruption in a test setup with two replicas, where one GFSD keeps its spool on unreliable storage.

The 2.6.0 release notes state:

End-to-end data integrity to detect silent data corruption.

So I would expect at least detection of corruption and, ideally, silent repair, i.e. fallback to a good replica.

Also, I cannot figure out how to use the "gfcksum" utility: it always complains either "operation not supported" or "no checksum". Does the integrity checking feature need to be activated somehow?

Integrity of data is very important.

Discussion

Osamu Tatebe - 2014-12-25

End-to-end data integrity is enabled by specifying digest in gfmd.conf and client_digest_check in gfarm2.conf. For example:

Add digest to gfmd.conf:

digest md5

Add client_digest_check to gfarm2.conf or ~/.gfarm2rc:

client_digest_check enable

If digest is in effect, gfcksum displays the digest.

Regards,
Osamu

onlyjob - 2014-12-25

Please forgive my lack of experience with Gfarm. I found that with

digest md5

in "/etc/gfmd.conf" and

direct_local_access disable
client_digest_check enable

in "/etc/gfarm2.conf", "gfcksum" works and "gfarm2fs" detects corruption perfectly, and even reports to syslog the name of the file in which the error was detected. Awesome. :)

However, there is no fallback to a good replica: the client merely returns an I/O error.
It should be possible to fix read errors automatically by transparently falling back to a healthy replica.

Therefore the following questions remain:

How can a corrupted file be repaired, provided that a healthy replica is available?

If a minimum of two replicas is enforced, are the replicas always created in parallel (to avoid replicating from a single copy that is already corrupted)?

How can I make sure that a healthy replica exists?

Feature request: please implement

automatic repair in the client (fallback to a healthy replica when corruption is detected);

automatic removal of the corrupted replica if a healthy replica exists.

Thanks.

onlyjob - 2015-01-05

After fixing the authentication problem (#803) I tried this test again and found that on a checksum mismatch gfarm2fs just returns a read error without trying to fetch another replica.
I also confirmed that healthy replicas exist by stopping the problematic GFSD and running the integrity check again.

I think the correct behaviour would be to try fetching another replica on a checksum mismatch. If a healthy replica is found, remove the corrupted replica (to initiate replication). Return a read error only if all replicas are corrupted.
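
The behaviour proposed above can be sketched in a few lines of Python. This is only an illustration of the requested logic, not how gfarm2fs works: the `read_with_fallback` function, the replica host names, and the idea of passing each replica as a fetch callable are all hypothetical, and MD5 is used because it is the digest configured earlier in this ticket.

```python
import hashlib
from typing import Callable, Iterable, List, Tuple


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of a whole file's content."""
    return hashlib.md5(data).hexdigest()


def read_with_fallback(
    replicas: Iterable[Tuple[str, Callable[[], bytes]]],
    expected_md5: str,
) -> Tuple[bytes, List[str]]:
    """Return (data, corrupted_hosts) from the first replica whose digest matches.

    `replicas` is a hypothetical list of (host, fetch) pairs. Hosts whose copy
    failed the check are collected so the caller could remove those replicas
    and let replication recreate them. IOError is raised only when every
    replica fails the check.
    """
    corrupted: List[str] = []
    for host, fetch in replicas:
        data = fetch()
        if md5_hex(data) == expected_md5:
            return data, corrupted
        corrupted.append(host)
    raise IOError("all replicas failed the digest check: " + ", ".join(corrupted))


# Example with one corrupted and one healthy in-memory "replica".
good = b"hello gfarm"
demo_replicas = [
    ("gfsd-bad", lambda: b"hellX gfarm"),  # simulated bit rot
    ("gfsd-good", lambda: good),
]
data, corrupted = read_with_fallback(demo_replicas, md5_hex(good))
```

Note that this sketch reads each replica in full before returning anything, which is what makes the fallback possible in the first place.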

Osamu Tatebe - 2015-01-10

I appreciate your comment. Until now, corrupted files had to be removed manually, which is rather complicated. Regarding your request, it is not possible to try fetching another replica, since a digest error can only be detected once the whole file content has been read. Instead, the corrupted replica is now automatically moved to lost+found by [r9407]. We think this is the best approach.

Thanks,
Osamu
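
As a rough illustration of the point made in this reply, the following Python sketch shows why a whole-file digest can only give a verdict after the last byte: every chunk has already been handed to the reader by the time the mismatch is known. It is a minimal model under assumptions, not Gfarm code; `verified_stream` and `DigestMismatch` are hypothetical names.

```python
import hashlib
from typing import Iterable, Iterator


class DigestMismatch(IOError):
    """Raised when the streamed content does not match the expected digest."""


def verified_stream(chunks: Iterable[bytes], expected_md5: str) -> Iterator[bytes]:
    """Yield chunks as they arrive while updating a running MD5.

    The comparison can happen only after the final chunk, so a mismatch is
    reported after all data has already been delivered to the caller.
    """
    md5 = hashlib.md5()
    for chunk in chunks:
        md5.update(chunk)
        yield chunk  # delivered before any verdict is possible
    if md5.hexdigest() != expected_md5:
        raise DigestMismatch("digest mismatch detected at end of file")
```

A caller that consumes this stream sees the error only at the end of the file, which is consistent with the client returning a read error rather than switching replicas mid-read.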

Related

Commit: [r9407]

Osamu Tatebe - 2015-02-10

status: new --> closed

Osamu Tatebe - 2015-02-10

This feature can be disabled by specifying 'spool_digest_error_check disable' on the file system nodes [r9415]. Thanks.

Related

Commit: [r9415]
