signature md5sum to check ncrcat

PBrockmann
2011-11-17
2013-10-17
  • PBrockmann

    PBrockmann - 2011-11-17

    Hi All,

    How can I check that an ncrcat command has done its work
    correctly and has really concatenated the files?

    In my experience, there are cases where the command displays
    no error but the resulting file is incorrect.

    I am wondering whether there is an existing option to calculate
    an md5sum or a signature on the data part (not the metadata) of each
    file to concatenate, and to compare it to the signature of the data
    part of the resulting concatenated file?

    Let me know if I have missed something on this topic.
    We want to reduce the number of inodes in our data archive, but
    we need to be 100% sure that the concatenation of monthly files
    into yearly files is done correctly.

    Patrick
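The check asked for above can be sketched on raw byte streams. This is only an illustration of the principle, not an NCO feature: the `.bin` files are hypothetical stand-ins for the data part of each file, `stream_md5` is a helper name made up for this sketch, and `md5sum` from GNU coreutils is assumed. Real netCDF files also carry headers, so an actual check would have to digest each variable's data rather than the whole file.

```shell
# Digest of several files streamed in order, as one hex string.
# Assumes md5sum from GNU coreutils is available.
stream_md5() {
    cat "$@" | md5sum | cut -d ' ' -f 1
}

# Hypothetical stand-ins for the data payload of three monthly files.
tmp=$(mktemp -d)
printf 'jan-data\n' > "$tmp/m01.bin"
printf 'feb-data\n' > "$tmp/m02.bin"
printf 'mar-data\n' > "$tmp/m03.bin"

# Stand-in for the payload of the concatenated file.
cat "$tmp/m01.bin" "$tmp/m02.bin" "$tmp/m03.bin" > "$tmp/year.bin"

# Digest of the inputs taken in order, and digest of the result.
sum_in=$(stream_md5 "$tmp/m01.bin" "$tmp/m02.bin" "$tmp/m03.bin")
sum_out=$(stream_md5 "$tmp/year.bin")

if [ "$sum_in" = "$sum_out" ]; then
    echo "OK: digests match ($sum_out)"
else
    echo "MISMATCH: output differs from the concatenated inputs" >&2
fi
rm -rf "$tmp"
```

The comparison is sensitive to the order of the inputs, so the files must be listed in the same order as on the concatenation command line.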

     
  • Charlie Zender

    Charlie Zender - 2011-11-17

    There's no automated way to verify the data integrity after concatenation.
    Of course we hope the integrity is maintained or that the program fails with an error.
    However, recent problems in the netCDF library layer (the "NOFILL bug") resulted in undetected corruption
    which the feature you suggest would have detected. So it's reasonable to defensively program in
    anticipation of a repeat of that. This feature is now TODO nco1027.

     
  • PBrockmann

    PBrockmann - 2012-02-17

    Hi,
    I am coming back to this issue because my computing center is starting
    a very large migration from a tape-based file system
    to a Lustre file system.
    It concerns more than 100 million inodes representing 5000 TB!

    The strategy is to reduce the number of inodes by concatenation using ncrcat.
    So the possibility to check that ncrcat has done its work correctly is now crucial.
    Could TODO nco1027 (check md5sum of data before/after concatenation)
    become a high priority?

    Another point: I would like ncrcat to display a warning when
    the sets of variables in the files to be concatenated differ.
    For now, there is an error only when the first file
    has fewer variables than the other files.
    A warning when the first file has more
    variables would also help.

    Thanks for your work
    Patrick

     
  • Charlie Zender

    Charlie Zender - 2012-02-18

    Hello Patrick,
    This is an interesting request.
    The data scale is large, and we want NCO to work well in the datacenter.
    What data center is this, if I may ask?
    I will "up" the priority of your md5sum request.

    I thought ncrcat would die with an error if the first file has more record variables than subsequent files.
    Is this not true? Does it really proceed without complaining? In that case you're right there should be a warning.

    cz

     
  • Charlie Zender

    Charlie Zender - 2012-02-18

    it seems to die as expected. see below. please clarify what you want.

    zender@roulee:~$ ncks -O -v one_dmn_rec_var ~/nco/data/in.nc ~/foo1.nc
    zender@roulee:~$ ncks -O -v one_dmn_rec_var,two_dmn_rec_var ~/nco/data/in.nc ~/foo2.nc
    zender@roulee:~$ ncrcat -O ~/foo2.nc ~/foo1.nc ~/foo.nc
    ERROR: nco_inq_varid() reports requested variable "two_dmn_rec_var" is not in input file
    nco_err_exit(): ERROR Short NCO-generated message (usually name of function that triggered error): nco_inq_varid()
    nco_err_exit(): ERROR Error code is -49. Translation into English with nc_strerror(-49) is "NetCDF: Variable not found"
    nco_err_exit(): ERROR NCO will now exit with system call abort()
    Abandon

     
  • Charlie Zender

    Charlie Zender - 2012-02-20

    MD5 digests are now in the main trunk of ncks, ncecat, and ncrcat.
    Provisional documentation is at
    http://nco.sf.net/nco.html#md5
    Please post any feedback.
    cz

     
  • PBrockmann

    PBrockmann - 2012-02-27

    Hi,

    To get back to the issue of warnings/errors with ncrcat
    when the set of variables differs between the files to concatenate.

    1) When the set of variables in foo1.nc is larger than in foo2.nc,
    an error occurs.
    -> OK, ncrcat does well.

    2) When the set of variables in foo1.nc is smaller than in foo2.nc,
    no warning occurs.
    -> I propose to warn the user, because the resulting file will
    not contain all the variables from the files to be concatenated,
    and the user should be advised of that.

    Just:

    $ ncrcat -O foo1.nc foo2.nc foo.nc
    

    -> no warning

    I will now test the md5 checking you have added. Great.

    PS: I belong to the IPSL (http://icmc.ipsl.fr/)
    and work on the http://www.genci.fr/?lang=en

    Patrick
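Until such a warning exists in ncrcat, case 2 can be caught by a small wrapper run before the concatenation. The sketch below is an assumption-laden illustration, not part of NCO: `warn_missing_vars` is a hypothetical helper, and it expects plain text files with one variable name per line, which could be produced from the output of `ncdump -h` for each input file.

```shell
# Warn about variables that appear in later files but not in the first one.
# Each argument is a text file with one variable name per line
# (hypothetical input format, not something NCO produces directly).
warn_missing_vars() {
    first=$1
    shift
    status=0
    for other in "$@"; do
        # Keep the names listed in $other that match no whole line of $first.
        missing=$(grep -F -x -v -f "$first" "$other" | sort | paste -s -d ' ' -)
        if [ -n "$missing" ]; then
            echo "WARNING: in $other but not in $first: $missing" >&2
            status=1
        fi
    done
    # Non-zero when at least one later file has extra variables.
    return "$status"
}
```

A call such as `warn_missing_vars foo1.vars foo2.vars` prints a warning on standard error and returns 1 when foo2 lists variables absent from foo1, so a script can refuse to run the concatenation in that case.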

     
  • Charlie Zender

    Charlie Zender - 2012-02-28

    OK, I understand what you mean about that warning now. It's on the TODO list.
    cz

     
  • PBrockmann

    PBrockmann - 2012-03-21

    Hi,

    I have tested --md5_digest and it gives a very good check,
    so I can be confident in what is done. A big thanks for this.
    The overhead of this check does not seem too expensive,
    which is another good point.

    Two more requests:
    - the warning to fire when the set of variables in the first file
    is smaller than in the following ones. Could it be added in the 4.1.0 release?

    - could you also add a small tool to list the variables of a file?
    This command could be named "nclist".
    For now I use

    ncdump -h file.nc | grep -E 'float|double' | cut -f 1 -d '(' | cut -f 2 -d ' ' | sort
    

    But it would be nice to have this as a nco operator.

    Regards
    Patrick

     
  • Charlie Zender

    Charlie Zender - 2012-03-23

    > Could it be added in the 4.1.0 release?

    No. I looked into this and it would be a large task. Will not be in 4.1.0.

    > But it would be nice to have this as a nco operator.

    Yes, this already exists as the NCO operator

    ncks -m in.nc | grep -E ': type' | cut -f 1 -d ' ' | sed 's/://' | sort

    which is so much easier to remember :)

    Put it in your .bashrc as a standard Bash function

    function nclist { ncks -m "${1}" | grep -E ': type' | cut -f 1 -d ' ' | sed 's/://' | sort; }

     
  • PBrockmann

    PBrockmann - 2012-03-23

    Hi,

    If the --md5_digest option is not scheduled for 4.1.0,
    can I be confident in the CVS trunk then?
    What problems does this option raise?
    Could you tell me whether this option could be officially released within one month?

    OK for nclist. There are many ways to do this kind of thing.
    What about: ncdump -h file.nc | gawk '{if (match($0, /(byte|char|short|int|float|double) (.*)\(/, arr)) print arr[2] }'

    I have noticed that yours does not work with 4.0.1 because ncks -m does not
    print ": type".

    Patrick
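Since the output of `ncks -m` differs between versions, another option is to parse the CDL header printed by `ncdump -h` with plain awk. This is a minimal sketch under stated assumptions: `cdl_vars` is a hypothetical helper name, it reads the header from standard input, and it only recognises the classic types (byte, char, short, int, float, double) declared as `<type> <name>(dims) ;` or `<type> <name> ;` inside the `variables:` section.

```shell
# Print the variable names declared in a CDL header read from stdin.
# Only classic netCDF types are recognised; user-defined netCDF-4 types
# are not handled by this sketch.
cdl_vars() {
    awk '
        /^variables:/ { in_vars = 1; next }                 # section starts
        /^(data:|\/\/ global attributes:)/ { in_vars = 0 }  # section ends
        in_vars && $1 ~ /^(byte|char|short|int|float|double)$/ {
            name = $2
            sub(/\(.*/, "", name)   # drop the dimension list if present
            print name
        }
    ' | sort
}
```

It would be used as `ncdump -h file.nc | cdl_vars`. Restricting the match to the `variables:` section avoids picking up dimension or attribute lines that happen to contain a type name.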

     
  • Charlie Zender

    Charlie Zender - 2012-03-23

    let me clarify:

    --md5_digest _will be_ in 4.1.0, which will be released next week, I think.

    The warning to fire when the set of variables in the first file is smaller than in the following ones _will not_.

     
  • PBrockmann

    PBrockmann - 2012-03-23

    Ok perfect.
    I can handle the warning on my side.

    Thank you again Charles.
    Regards
    Patrick

     
