From: Jeffrey S W. <Jef...@no...> - 2004-11-09 16:02:37
Hi: I just noticed that compression doesn't seem to be working right (for me
at least) in 0.9. Here's an example:

with pytables 0.9

[mac28:~/python] jsw% nctoh5 --complevel=6 -o test.nc test.h5
[mac28:~/python] jsw% ls -l test.nc test.h5
-rw-r--r--  1 jsw  jsw  12089048  9 Nov 08:59 test.h5
-rw-r--r--  1 jsw  jsw  26355656  4 Nov 17:10 test.nc

with pytables 0.8.1

[mac28:~/python] jsw% ls -l test.nc test.h5
-rw-r--r--  1 jsw  jsw   5344279  9 Nov 09:00 test.h5
-rw-r--r--  1 jsw  jsw  26355656  4 Nov 17:10 test.nc

No matter what netcdf file I use as input, the resulting h5 file is about
twice as large using 0.9 as it is in 0.8.1.

BTW: the test.nc file I used here can be found at
ftp://ftp.cdc.noaa.gov/Public/jsw.

-Jeff

--
Jeffrey S. Whitaker          Phone  : (303)497-6313
Meteorologist                FAX    : (303)497-6449
NOAA/OAR/CDC R/CDC1          Email  : Jef...@no...
325 Broadway                 Web    : www.cdc.noaa.gov/~jsw
Boulder, CO, USA 80303-3328  Office : Skaggs Research Cntr 1D-124
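For anyone reproducing this, a quick way to see what actually changed between
the two converted files is to inspect the chunk shape and compression filters
that ended up on each array. Below is a minimal sketch assuming the modern
PyTables API (open_file, Leaf.chunkshape, Leaf.filters); the 0.9-era spelling
differed, and "test.h5" is just the file from the nctoh5 run above:

    import tables

    # Report, for every array in the converted file, its shape, the HDF5
    # chunk shape it was written with, and the compression filters applied.
    with tables.open_file("test.h5", mode="r") as h5f:
        for leaf in h5f.walk_nodes("/", classname="Leaf"):
            print(leaf._v_pathname, leaf.shape, leaf.chunkshape, leaf.filters)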
From: Francesc A. <fa...@py...> - 2004-11-09 20:57:58
Hi Jeff,

Yep, it seems that some rework of the buffer size calculation in 0.9 has made
the chunk sizes for compression much smaller, and hence hurt the compression
ratio. Please try to apply the following patch and tell me if it works better:

--- pytables-0.9/tables/EArray.py  2004-10-05 14:30:31.000000000 +0200
+++ EArray.py  2004-11-09 21:51:11.000000000 +0100
@@ -254,7 +254,7 @@
         if maxTuples > 10:
             # Yes. So the chunk sizes for the non-extendeable dims will be
             # unchanged
-            chunksizes[extdim] = maxTuples // 10
+            chunksizes[extdim] = maxTuples
         else:
             # No. reduce other dimensions until we get a proper chunksizes
             # shape
@@ -268,7 +268,7 @@
                     break
                 chunksizes[j] = 1
             # Compute the chunksizes correctly for this j index
-            chunksize = maxTuples // 10
+            chunksize = maxTuples
             if j < len(chunksizes):
                 # Only modify chunksizes[j] if needed
                 if chunksize < chunksizes[j]:

If it works better, I'll have to double-check that indexation performance
won't suffer because of this change. To tell the truth, I don't quite
remember why I reduced the chunksizes by a factor of 10, although I want to
believe that there was a good reason :-/

Cheers,

On Tuesday 09 November 2004 17:02, Jeffrey S Whitaker wrote:
> Hi:
>
> I just noticed that compression doesn't seem to be working right (for me
> at least) in 0.9. Here's an example:
> [...]
> No matter what netcdf file I use as input, the resulting h5 file is
> about twice as large using 0.9 as it is in 0.8.1.

--
Francesc Altet
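The link between chunk size and compression ratio can be checked directly:
HDF5 compresses each chunk independently with zlib, so many tiny chunks give
the compressor less data to work with and add per-chunk overhead. A minimal
sketch of the effect, assuming the modern PyTables API (create_earray with an
explicit chunkshape, not the 0.9-era createEArray); the data and the two
chunk shapes are made up, and the exact sizes depend on the data:

    import os
    import numpy as np
    import tables

    # Synthetic, fairly compressible data: random values rounded to 2 digits.
    data = np.random.normal(size=(100, 73, 144)).round(2)
    filters = tables.Filters(complevel=6, complib="zlib")

    # Write the same data with very small chunks and with larger chunks,
    # then compare the resulting file sizes.
    for chunkshape in [(1, 4, 144), (16, 73, 144)]:
        fname = "chunk_test.h5"
        with tables.open_file(fname, mode="w") as h5f:
            earr = h5f.create_earray("/", "var", atom=tables.Float64Atom(),
                                     shape=(0, 73, 144), filters=filters,
                                     chunkshape=chunkshape)
            earr.append(data)
        print(chunkshape, os.path.getsize(fname), "bytes")
        os.remove(fname)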
From: Jeffrey S W. <Jef...@no...> - 2004-11-09 21:10:21
Francesc Altet wrote:
> Yep, it seems that some rework of the buffer size calculation in 0.9 has
> made the chunk sizes for compression much smaller, and hence hurt the
> compression ratio. Please try to apply the following patch and tell me if
> it works better:
> [...]
> If it works better, I'll have to double-check that indexation performance
> won't suffer because of this change.

Francesc:  That helped a little bit. Now I get

[mac28:~/python] jsw% ls -l test.nc test.h5
-rw-r--r--  1 jsw  jsw   9281104  9 Nov 14:04 test.h5
-rw-r--r--  1 jsw  jsw  26355656  4 Nov 17:10 test.nc

Still a long way from the 0.8.1 result of 5344279 though.

-Jeff

--
Jeffrey S. Whitaker          Phone  : (303)497-6313
Meteorologist                FAX    : (303)497-6449
NOAA/OAR/CDC R/CDC1          Email  : Jef...@no...
325 Broadway                 Web    : www.cdc.noaa.gov/~jsw
Boulder, CO, USA 80303-3328  Office : Skaggs Research Cntr 1D-124
From: Francesc A. <fa...@py...> - 2004-11-09 21:20:54
On Tuesday 09 November 2004 22:10, Jeffrey S Whitaker wrote:
> Francesc:  That helped a little bit. Now I get
>
> [mac28:~/python] jsw% ls -l test.nc test.h5
> -rw-r--r--  1 jsw  jsw   9281104  9 Nov 14:04 test.h5
> -rw-r--r--  1 jsw  jsw  26355656  4 Nov 17:10 test.nc
>
> Still a long way from the 0.8.1 result of 5344279 though.

OK, but we are on the right track. Let me study the problem a bit more
carefully, and I'll get back to you with an answer.

Cheers,

--
Francesc Altet
From: Francesc A. <fa...@py...> - 2004-11-10 10:16:21
Hi again,

I've been looking deeper into the problem, and it seems I have a solution.
The problem is that I made a mistake while implementing indexation: the
parameters for the EArray chunk size computation were left over from my early
tests at optimizing chunksizes just for indexes. I later moved the
computation of optimum index chunksizes out of the EArray module, but forgot
to re-establish the correct values for general EArrays :-/

Please check with the following patch (against the original 0.9 sources):

--- /home/falted/PyTables/exports/pytables-0.9/tables/EArray.py  2004-10-05 14:30:31.000000000 +0200
+++ EArray.py  2004-11-10 11:08:22.000000000 +0100
@@ -224,7 +224,7 @@
         #bufmultfactor = int(1000 * 2) # Is a good choice too,
         # specially for very large tables and large available memory
         #bufmultfactor = int(1000 * 1) # Optimum for sorted object
-        bufmultfactor = int(1000 * 1) # Optimum for sorted object
+        bufmultfactor = int(1000 * 100) # Optimum for sorted object
         rowsizeinfile = rowsize
         expectedfsizeinKb = (expectedrows * rowsizeinfile) / 1024

That should get the 0.8.1 compression ratios back. You can increase
bufmultfactor still more and you will get better ratios, but I'm afraid this
will make access to small portions of the EArray slower (much more data has
to be read than the desired range requires). Please tell me about your
findings and I'll fix this in CVS afterwards.

Cheers,

--
Francesc Altet
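The caveat about large chunks slowing down small reads follows from how HDF5
chunked storage works: reading any element of a chunk forces the whole chunk
to be read and decompressed. A rough sketch of that effect, under the same
assumptions as before (modern PyTables API, synthetic data, made-up chunk
shapes); the timings will vary by machine and data:

    import time
    import numpy as np
    import tables

    data = np.random.normal(size=(365, 73, 144))   # synthetic 3-D dataset
    filters = tables.Filters(complevel=6, complib="zlib")

    # Write the same data with modest chunks and with one huge chunk, then
    # time how long it takes to read a single 73x144 slab from each file.
    for chunkshape in [(4, 73, 144), (365, 73, 144)]:
        with tables.open_file("slice_test.h5", mode="w") as h5f:
            earr = h5f.create_earray("/", "var", atom=tables.Float64Atom(),
                                     shape=(0, 73, 144), filters=filters,
                                     chunkshape=chunkshape)
            earr.append(data)
        with tables.open_file("slice_test.h5", mode="r") as h5f:
            t0 = time.time()
            _ = h5f.root.var[100]   # one row: decompresses every chunk it touches
            print(chunkshape, "read one row in %.4f s" % (time.time() - t0))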
From: Francesc A. <fa...@py...> - 2004-11-10 10:50:10
I've ended up with a rewrite of the EArray._calcBufferSize method, which I'm
including at the end of this message. Please play with the different values
in the lines:

        #bufmultfactor = int(1000 * 10)  # Conservative value
        bufmultfactor = int(1000 * 20)   # Medium value
        #bufmultfactor = int(1000 * 50)  # Aggressive value
        #bufmultfactor = int(1000 * 100) # Very aggressive value

and tell me your feedback.

--
Francesc Altet

    def _calcBufferSize(self, atom, extdim, expectedrows, compress):
        """Calculate the buffer size and the HDF5 chunk size.

        The logic to do that is based purely on experiments playing with
        different buffer sizes, chunksize and compression flag. It is
        obvious that using big buffers optimizes the I/O speed. This might
        (should) be further optimized by doing more experiments.

        """

        rowsize = atom.atomsize()
        #bufmultfactor = int(1000 * 10)  # Conservative value
        bufmultfactor = int(1000 * 20)   # Medium value
        #bufmultfactor = int(1000 * 50)  # Aggressive value
        #bufmultfactor = int(1000 * 100) # Very aggressive value
        rowsizeinfile = rowsize
        expectedfsizeinKb = (expectedrows * rowsizeinfile) / 1024

        if expectedfsizeinKb <= 100:
            # Values for files less than 100 KB of size
            buffersize = 5 * bufmultfactor
        elif (expectedfsizeinKb > 100 and expectedfsizeinKb <= 1000):
            # Values for files less than 1 MB of size
            buffersize = 10 * bufmultfactor
        elif (expectedfsizeinKb > 1000 and
              expectedfsizeinKb <= 20 * 1000):
            # Values for sizes between 1 MB and 20 MB
            buffersize = 20 * bufmultfactor
        elif (expectedfsizeinKb > 20 * 1000 and
              expectedfsizeinKb <= 200 * 1000):
            # Values for sizes between 20 MB and 200 MB
            buffersize = 40 * bufmultfactor
        elif (expectedfsizeinKb > 200 * 1000 and
              expectedfsizeinKb <= 2000 * 1000):
            # Values for sizes between 200 MB and 2 GB
            buffersize = 50 * bufmultfactor
        else:  # Greater than 2 GB
            buffersize = 60 * bufmultfactor

        # Max tuples to fill the buffer
        maxTuples = buffersize // rowsize
        chunksizes = list(atom.shape)
        # Check if at least 1 tuple fits in the buffer
        if maxTuples > 1:
            # Yes. So the chunk sizes for the non-extendable dims will be
            # unchanged
            chunksizes[extdim] = maxTuples
        else:
            # No. Reduce other dimensions until we get a proper chunksizes
            # shape
            chunksizes[extdim] = 1  # Only one row in the extendable dimension
            for j in range(len(chunksizes)):
                newrowsize = atom.itemsize
                for i in chunksizes[j+1:]:
                    newrowsize *= i
                maxTuples = buffersize // newrowsize
                if maxTuples > 1:
                    break
                chunksizes[j] = 1
            # Compute the chunksizes correctly for this j index
            chunksize = maxTuples
            if j < len(chunksizes):
                # Only modify chunksizes[j] if needed
                if chunksize < chunksizes[j]:
                    chunksizes[j] = chunksize
            else:
                chunksizes[-1] = 1  # very large itemsizes!
            # Compute the correct maxTuples number
            newrowsize = atom.itemsize
            for i in chunksizes:
                newrowsize *= i
            maxTuples = buffersize // newrowsize

        return (buffersize, maxTuples, chunksizes)
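To make the sizing arithmetic above concrete, here is a worked example with
hypothetical numbers, none of which come from the thread: a Float64 atom of
shape (0, 73, 144), expectedrows=1000, and the 'Medium' bufmultfactor of
20000:

    rowsize = 73 * 144 * 8                       # 84096 bytes per row of the EArray
    expectedfsizeinKb = (1000 * rowsize) / 1024  # ~82125 KB, the 20 MB - 200 MB bracket
    buffersize = 40 * 20000                      # 800000 bytes for that bracket
    maxTuples = buffersize // rowsize            # 9 rows fit in the buffer
    chunksizes = [maxTuples, 73, 144]            # ~740 KiB per chunk before compression
    print(chunksizes)                            # [9, 73, 144]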
From: Jeff W. <jef...@no...> - 2004-11-10 13:23:55
Francesc Altet wrote:
> I've ended up with a rewrite of the EArray._calcBufferSize method, which
> I'm including at the end of this message. Please play with the different
> values in the lines:
>
>         #bufmultfactor = int(1000 * 10)  # Conservative value
>         bufmultfactor = int(1000 * 20)   # Medium value
>         #bufmultfactor = int(1000 * 50)  # Aggressive value
>         #bufmultfactor = int(1000 * 100) # Very aggressive value
>
> and tell me your feedback.

Francesc:  That new version of _calcBufferSize fixed the problem -
compression ratios with the 'medium' value of bufmultfactor are back to what
they were in 0.8.1. Changing to any of the other values has very little
effect on file size. Thanks for your prompt attention to this problem!

BTW: I've added a couple more command line options to nctoh5 - the new
version is at http://whitaker.homeunix.org/~jeff/nctoh5. The new switches
are:

--unpackshort=(0|1) -- unpack short integer variables to float variables
  using the scale_factor and add_offset netCDF variable attributes
  (not active by default).
--quantize=(0|1) -- quantize data to improve compression using the
  least_significant_digit netCDF variable attribute (not active by default).
  See http://www.cdc.noaa.gov/cdc/conventions/cdc_netcdf_standard.shtml
  for further explanation of what this attribute means.

-Jeff

--
Jeffrey S. Whitaker          Phone  : (303)497-6313
NOAA/OAR/CDC R/CDC1          FAX    : (303)497-6449
325 Broadway                 Web    : http://www.cdc.noaa.gov/~jsw
Boulder, CO, USA 80305-3328  Office : Skaggs Research Cntr 1D-124
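For reference, both switches boil down to simple array transformations. The
sketch below shows one plausible NumPy implementation; the function names are
mine, and treating this as exactly what nctoh5 does internally is an
assumption (the quantization follows the least_significant_digit convention
described at the cdc.noaa.gov link above):

    import numpy as np

    def unpack_short(packed, scale_factor, add_offset):
        # What --unpackshort enables: expand packed short integers to floats
        # using the netCDF scale_factor/add_offset attributes.
        return np.asarray(packed, dtype="f4") * np.float32(scale_factor) \
               + np.float32(add_offset)

    def quantize(data, least_significant_digit):
        # What --quantize enables: round the data so that only
        # `least_significant_digit` decimal digits are retained, which makes
        # the array much more compressible by zlib.
        scale = 2.0 ** np.ceil(np.log2(10.0 ** least_significant_digit))
        return np.around(scale * data) / scale

    # Example: keep 2 decimal digits of some synthetic data.
    data = np.random.normal(size=(73, 144)).astype("f4")
    print(quantize(data, 2)[0, :4])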
From: Francesc A. <fa...@py...> - 2004-11-10 14:05:34
On Wednesday 10 November 2004 14:23, Jeff Whitaker wrote:
> Francesc:  That new version of _calcBufferSize fixed the problem -
> compression ratios with the 'medium' value of bufmultfactor are back to
> what they were in 0.8.1. Changing to any of the other values has very
> little effect on file size. Thanks for your prompt attention to this
> problem!

Good. The patch has been uploaded to the Release-0.9_patches branch in CVS.
It will hopefully appear in the next 0.9.1 release.

> BTW: I've added a couple more command line options to nctoh5 - the new
> version is at http://whitaker.homeunix.org/~jeff/nctoh5. The new switches
> are:
> [...]

Great! However, I can't get the file. Is the above URL correct?

Cheers,

--
Francesc Altet
From: Jeff W. <jef...@no...> - 2004-11-10 14:50:14
Francesc Altet wrote:
>> BTW: I've added a couple more command line options to nctoh5 - the new
>> version is at http://whitaker.homeunix.org/~jeff/nctoh5.
>
> Great! However, I can't get the file. Is the above URL correct?

Sorry - it's there now.

-Jeff

--
Jeffrey S. Whitaker          Phone  : (303)497-6313
NOAA/OAR/CDC R/CDC1          FAX    : (303)497-6449
325 Broadway                 Web    : http://www.cdc.noaa.gov/~jsw
Boulder, CO, USA 80305-3328  Office : Skaggs Research Cntr 1D-124
From: Francesc A. <fa...@py...> - 2004-11-11 10:27:59
On Wednesday 10 November 2004 15:50, Jeff Whitaker wrote:
> >> BTW: I've added a couple more command line options to nctoh5 - the new
> >> version is at http://whitaker.homeunix.org/~jeff/nctoh5.
>
> Sorry - it's there now.

OK, so I've checked in your improvements to the nctoh5 utility.

Cheers,

--
Francesc Altet