Hi All,
How can I check that an ncrcat command has done its work correctly
and has really concatenated the files?
In my experience, there are cases where the command displays
no error yet the resulting file is incorrect.
I am wondering whether there is an existing option to calculate
an md5sum or a signature on the data part (not the metadata) of each file
to be concatenated and compare it to the signature of the data part of
the resulting concatenated file.
Let me know if I have missed something on this topic.
We want to reduce the number of inodes in our data archive, but
we want to be 100% sure that the concatenation of monthly files into
yearly files is done correctly.
Patrick
There's no automated way to verify the data integrity after concatenation.
Of course we hope the integrity is maintained or that the program fails with an error.
However, recent problems in the netCDF library layer (the "NOFILL bug") resulted in undetected corruption
which the feature you suggest would have detected. So it's reasonable to defensively program in
anticipation of a repeat of that. This feature is now TODO nco1027.
Hi,
I am coming back to this issue because my computing center is starting
a very large migration from a tape-based file system
to a Lustre file system.
It concerns more than 100 million inodes representing 5000 TB!
The strategy is to reduce the number of inodes by concatenating files with ncrcat.
So the possibility of checking that ncrcat has done its work correctly is now crucial.
Could TODO nco1027 (check the md5sum of the data before and after concatenation)
become a high priority?
Another point: I would like ncrcat to display a warning when
the sets of variables in the files to be concatenated differ.
For now, there is an error only when the first file
has fewer variables than the other files.
A warning when the first file has more
variables would also help.
Thanks for your work
Patrick
Hello Patrick,
This is an interesting request.
The data scale is large, and we want NCO to work well in the datacenter.
What data center is this, if I may ask?
I will "up" the priority of your md5sum request.
I thought ncrcat would die with an error if the first file has more record variables than subsequent files.
Is this not true? Does it really proceed without complaining? In that case you're right there should be a warning.
cz
It seems to die as expected. See below. Please clarify what you want.
zender@roulee:~$ ncks -O -v one_dmn_rec_var ~/nco/data/in.nc ~/foo1.nc
zender@roulee:~$ ncks -O -v one_dmn_rec_var,two_dmn_rec_var ~/nco/data/in.nc ~/foo2.nc
zender@roulee:~$ ncrcat -O ~/foo2.nc ~/foo1.nc ~/foo.nc
ERROR: nco_inq_varid() reports requested variable "two_dmn_rec_var" is not in input file
nco_err_exit(): ERROR Short NCO-generated message (usually name of function that triggered error): nco_inq_varid()
nco_err_exit(): ERROR Error code is -49. Translation into English with nc_strerror(-49) is "NetCDF: Variable not found"
nco_err_exit(): ERROR NCO will now exit with system call abort()
Abandon
MD5 digests are now in the main trunk of ncks, ncecat, and ncrcat.
Provisional documentation is at
http://nco.sf.net/nco.html#md5
Please post any feedback.
cz
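For a bulk migration, the MD5 option can be wrapped so that a yearly file is only published when ncrcat finishes cleanly. The following is a minimal sketch, not an official NCO recipe. It assumes the long-option spelling `--md5_digest` and assumes that ncrcat exits with a non-zero status when a digest check fails; both should be confirmed against the documentation linked above. The function name `concat_checked` and the `NCRCAT` override variable are hypothetical names introduced only for this illustration.

```shell
#!/usr/bin/env bash
# Sketch only: concatenate with MD5 checking and keep the result only on success.
# NCRCAT is a hypothetical override for the ncrcat executable (not an NCO feature).

concat_checked() {
  # Usage: concat_checked OUTPUT INPUT [INPUT...]
  local out=$1
  shift
  local tmp="${out}.tmp.$$"

  # Assumption: with --md5_digest, ncrcat exits non-zero when the digest of a
  # variable read from the input differs from the digest read back from the output.
  if "${NCRCAT:-ncrcat}" -O --md5_digest "$@" "$tmp"; then
    # Publish the concatenated file only after a clean run.
    mv "$tmp" "$out"
  else
    echo "ERROR: ncrcat failed for $out; inputs left untouched" >&2
    rm -f "$tmp"
    return 1
  fi
}
```

With hypothetical file names, a year could then be built with `concat_checked 2011.nc 2011-01.nc 2011-02.nc`, and the monthly files removed only if the function returns 0.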
Hi,
To get back to the issue of warnings/errors with ncrcat
when the set of variables differs between the files to concatenate:
1) When the set of variables in foo1.nc is larger than in foo2.nc,
an error occurs.
-> OK, ncrcat does well.
2) When the set of variables in foo1.nc is smaller than in foo2.nc,
no warning occurs.
-> I propose warning the user, because the resulting file will
not contain all the variables from the files being concatenated,
and the user should be told.
Just:
$ ncrcat -O foo1.nc foo2.nc foo.nc
-> no warning
I will now test the md5 checking you have added. Great.
PS: I belong to IPSL (http://icmc.ipsl.fr/)
and work on http://www.genci.fr/?lang=en
Patrick
OK, I understand what you mean about that warning now. It's on the TODO list.
cz
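Until ncrcat issues such a warning itself, the variable sets can be compared before concatenation. The following is a rough sketch under stated assumptions: it relies on `ncdump -h` being installed and printing a standard CDL header, and it only handles flat (non-group) files. The function names `cdl_var_names` and `check_var_sets` are made up for this example and are not part of NCO or netCDF.

```shell
#!/usr/bin/env bash
# Sketch only: warn when a later input file has variables missing from the first.

# Read a CDL header (as printed by "ncdump -h") on stdin and print the
# variable names, one per line, sorted.
cdl_var_names() {
  awk '
    /^variables:/        { in_vars = 1; next }
    /^data:/ || /^[}]/   { in_vars = 0 }
    # A declaration looks like "float name(dim, ...) ;" or "double name ;".
    # Attribute lines are skipped because their first field contains a colon.
    in_vars && NF >= 2 && $1 ~ /^[A-Za-z_][A-Za-z0-9_]*$/ && $2 !~ /[:=]/ {
      name = $2
      sub(/[(;].*/, "", name)
      if (name != "") print name
    }
  ' | sort -u
}

# Usage: check_var_sets FIRST OTHER [OTHER...]
# Returns 1 if any OTHER file holds variables that FIRST lacks.
check_var_sets() {
  local first=$1
  shift
  local ref f extra status=0
  ref=$(mktemp)
  ncdump -h "$first" | cdl_var_names > "$ref"
  for f in "$@"; do
    # comm -13 keeps lines found only in the second input (here: stdin).
    extra=$(ncdump -h "$f" | cdl_var_names | comm -13 "$ref" -)
    if [ -n "$extra" ]; then
      echo "WARNING: $f has variables absent from $first:" $extra >&2
      status=1
    fi
  done
  rm -f "$ref"
  return $status
}
```

For the case described above, `check_var_sets foo1.nc foo2.nc` would print a warning and return 1, so a script can stop before calling ncrcat.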
Hi,
I have tested the -md5_digest option, and it gives a very good check
for being confident in what has been done. So a big thanks for this.
The overhead of this check does not seem too expensive,
which is another good point.
Two more requests:
- The warning to fire when the set of variables in the first file
is smaller than in the following ones. Could it be added in the 4.1.0 release?
- Could you also add a small tool to list the variables of a file?
This command could be named "nclist".
For now I use
But it would be nice to have this as an NCO operator.
Regards
Patrick
> Could it be added in the 4.1.0 release?
No. I looked into this and it would be a large task. It will not be in 4.1.0.
> But it would be nice to have this as an NCO operator.
Yes, this already exists as the NCO operator
ncks -m in.nc | grep -E ': type' | cut -f 1 -d ' ' | sed 's/://' | sort
which is so much easier to remember :)
Put it in your .bashrc as a standard Bash function
function nclist { ncks -m "${1}" | grep -E ': type' | cut -f 1 -d ' ' | sed 's/://' | sort ; }
Hi,
If the -md5_digest option is not scheduled for 4.1.0,
can I be confident in the CVS trunk in the meantime?
What problems does this option raise?
Could you tell me whether this option could be officially released within one month?
OK for nclist. There are many ways to do this kind of thing.
What about: ncdump -h file.nc | gawk '{if (match($0, /(byte|char|short|int|float|double) (.*)\(/, arr)) print arr[2] }'
I have noticed that yours does not work with 4.0.1 because ncks -m does not
give ": type".
Patrick
Let me clarify:
-md5_digest _will be_ in 4.1.0, which will be released next week, I think.
The warning to fire when the set of variables in the first file is smaller than in the following ones _will not_.
Ok perfect.
I can handle the warning on my side.
Thank you again Charles.
Regards
Patrick