From: Thadeus B. <tha...@th...> - 2013-03-08 01:14:28
Thank you for the information. I will run a few more tests over the next
couple of days: one day with no compression, and one day with a chunksize
similar to what will be appended each cycle. Hopefully I will get a chance
to report back. A ptrepack into a file with no compression is half the size
of its append/compress/lots-of-unused-space counterpart. The reason for
using compression is to reduce the IO required from the network-backed
storage, not necessarily to reduce disk space, although that is a plus.

--
Thadeus


On Thu, Mar 7, 2013 at 5:40 PM, Anthony Scopatz <sc...@gm...> wrote:

> Hi Thadeus,
>
> HDF5 does not guarantee that the data is contiguous on disk between
> blocks. That is, there may be empty space in your file. Furthermore,
> compression really messes with HDF5's ability to predict how large blocks
> will end up being. To avoid accidental data loss, HDF5 tends to
> over-predict the empty buffer space needed.
>
> Thus my guess is that by having this tight loop around open/append/close,
> you keep accidentally triggering extraneous buffer space. You basically
> have two options:
>
> 1. Turn off compression; size prediction is exact without it.
> 2. Periodically run ptrepack (every 10, 100, 1000 cycles? at the end of
>    the day?).
>
> Hope this helps.
> Be Well
> Anthony
>
>
> On Thu, Mar 7, 2013 at 5:26 PM, Thadeus Burgess <tha...@th...> wrote:
>
>> I have a PyTables file that receives many appends to a Table throughout
>> the day: the file is opened, a small bit of data is appended, and the
>> file is closed. The open/append/close can happen many times in a minute.
>> Anywhere from 1-500 rows are appended at any given time. By the end of
>> the day, this file is expected to have roughly 66000 rows. Chunkshape is
>> set to 1500 for no particular reason (it doesn't seem to make a
>> difference, and some other files can reach 5 million rows/day). BLOSC
>> with level 9 compression is used on the table. Data is never deleted
>> from the table. There are roughly 12 columns on the Table.
>>
>> The problem is that at the end of the day this file is 1GB in size. I
>> don't understand why the file is growing so big. The tbl.size_on_disk
>> attribute shows a meager 20MB.
>>
>> I have used ptrepack with --keep-source-filters and --chunkshape=keep.
>> The new file is only 30MB in size, which is reasonable.
>> I have also used ptrepack with --chunkshape=auto, and although it set
>> the chunkshape to around 388, there was no significant change in file
>> size from the chunkshape of 1500.
>>
>> Is PyTables not re-using chunks on new appends? When 50 rows are
>> appended, is it still writing a chunk sized for 1500 rows? When the next
>> append comes along, does it write a brand new chunk instead of opening
>> the old chunk and appending the data?
>>
>> Should my chunksize really be "expected rows to append each time"
>> instead of "expected total rows"?
>>
>> --
>> Thadeus
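
For concreteness, a rough sketch (using the current PyTables snake_case API)
of the setup Anthony's first option points at: compression off and the
chunkshape sized to a typical append rather than the expected daily total.
The table description, column names, file name, and sample rows below are
all hypothetical stand-ins for the real ~12-column table.

import tables as tb

# Hypothetical three-column stand-in for the real ~12-column table.
class Reading(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0)
    price = tb.Float64Col(pos=1)
    volume = tb.Int32Col(pos=2)

FILENAME = "daily.h5"  # hypothetical file on the network-backed storage

# One-time creation: no compression (complevel=0) and a chunkshape near the
# size of a typical append (1-500 rows) instead of the expected total rows.
with tb.open_file(FILENAME, mode="w") as f:
    f.create_table("/", "readings", Reading,
                   filters=tb.Filters(complevel=0),
                   chunkshape=(500,))

# The tight cycle that runs many times per minute: open, append, close.
def append_rows(rows):
    with tb.open_file(FILENAME, mode="a") as f:
        f.root.readings.append(rows)

append_rows([(1362700800, 101.5, 300), (1362700801, 101.6, 120)])

Re-creating the table with the original setup would simply mean passing
filters=tb.Filters(complevel=9, complib='blosc') instead; that is the
configuration whose chunk-size prediction Anthony describes as inexact.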
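
Anthony's second option, periodic repacking, can be done with the ptrepack
flags Thadeus already mentions, or scripted from Python with
tables.copy_file, which rewrites every node into a fresh file and so drops
the unused chunk space. A rough sketch with a hypothetical file name; it
swaps the packed copy into place, so it should only run while no writer has
the file open.

import os
import tables as tb

def repack(filename):
    # Copy every node into a fresh file (compacting it, much like ptrepack),
    # then swap the packed copy into place of the original.
    packed = filename + ".packed"
    tb.copy_file(filename, packed, overwrite=True)
    os.replace(packed, filename)

repack("daily.h5")  # e.g. at the end of the day, or every N append cycles

copy_file also accepts keyword arguments that are forwarded to the node
copies (filters among them), so compression could be changed during the
repack as well.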