From: Jacob B. <jac...@gm...> - 2012-07-17 20:30:31
|
Hello PyTables Users & Contributors,

Just a quick question. Let's say that I have certain identifiers that each link to a set of data. Would lookup generally be faster if each set of data were a separate table with the id as the table's name, or if the id were added as another column to one universal table, letting the in-kernel search query only the data with a specific id?

To put it concretely: would 1,000 tables of 100,000 records each be better for searching than 1 table with 100 million records and one extra id column?

Thanks,
Jacob Bennett

--
Jacob Bennett
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Class of 2014 | ben...@mi...
From: Anthony S. <sc...@gm...> - 2012-07-17 21:20:30
|
On Tue, Jul 17, 2012 at 3:30 PM, Jacob Bennett <jac...@gm...> wrote:

> Would it generally be faster for lookup to have each set of data as a
> separate table with an id as the table's name or to add this id as another
> column to a universal table of data and then let the in-kernel search query
> data only with a specific id?

I think that in general it is faster to have more tables with ids as names. For very small data, searching through a single larger table might be quicker than node access... but even then I doubt it.

> would 1,000 tables of 100,000 records each be better for searching than
> 1 table with 100 million records and one extra id column?

For these data sizes, more tables is probably faster.

(It should also be noted that in the many-tables case the data is actually smaller, because you can eliminate the id column.)

Be Well
Anthony
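For readers comparing the two layouts, here is a minimal sketch, assuming the PyTables 3.x API (open_file, create_table, read_where) and NumPy. The column names (series_id, timestamp, value), the "id_0003" naming scheme, and the tiny sizes are illustrative assumptions, not anything from the thread. It only shows how each lookup is expressed; it does not measure which one is faster.

```python
import os
import tempfile

import numpy as np
import tables

N_IDS, ROWS_PER_ID = 5, 1000  # tiny stand-ins for 1,000 ids x 100,000 rows

per_id_dtype = np.dtype([("timestamp", "i8"), ("value", "f8")])
universal_dtype = np.dtype([("series_id", "i4"), ("timestamp", "i8"), ("value", "f8")])


def make_rows(series_id):
    """Build deterministic example rows for one identifier (hypothetical data)."""
    rows = np.zeros(ROWS_PER_ID, dtype=per_id_dtype)
    rows["timestamp"] = np.arange(ROWS_PER_ID)
    rows["value"] = np.arange(ROWS_PER_ID) * (series_id + 1.0)
    return rows


path = os.path.join(tempfile.mkdtemp(), "layouts.h5")
with tables.open_file(path, mode="w") as h5:
    # Layout A: one table per identifier, named after the id (no id column).
    group = h5.create_group("/", "by_id")
    for i in range(N_IDS):
        h5.create_table(group, f"id_{i:04d}", obj=make_rows(i))

    # Layout B: one universal table with an extra id column.
    universal = h5.create_table("/", "universal", description=universal_dtype)
    for i in range(N_IDS):
        rows = make_rows(i)
        block = np.zeros(ROWS_PER_ID, dtype=universal_dtype)
        block["series_id"] = i
        block["timestamp"] = rows["timestamp"]
        block["value"] = rows["value"]
        universal.append(block)
    universal.flush()

with tables.open_file(path, mode="r") as h5:
    # Layout A lookup: open the node by name and read it.
    rows_a = h5.get_node("/by_id", "id_0003").read()
    # Layout B lookup: in-kernel query on the id column.
    rows_b = h5.root.universal.read_where("series_id == 3")
```

Table names are prefixed with "id_" because PyTables warns about node names that are not valid Python identifiers, so a bare numeric id would not work well with natural naming.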
From: Jacob B. <jac...@gm...> - 2012-07-17 21:55:34
|
Thanks for the input Anthony!

-Jake
From: Ümit S. <uem...@gm...> - 2012-07-18 06:22:53
|
Just to add to what Anthony said:
In the end it also depends on how unrelated your data is and how you want to access it. If the access scenario is that you usually only search or select within a specific dataset, then splitting up the datasets and putting them into separate tables is the way to go. In RDBMS terms this is, by the way, called sharding.

I have such a use case, with around 30,000 datasets (each of them with around 5 million rows). I am only interested in one dataset at a time, so I created 30,000 tables. It works really well.

And in case you want to access the data across the datasets (for aggregating or calculating averages), you can take a MapReduce approach, which should work very well with this layout.
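The cross-dataset aggregation mentioned above can be sketched in plain Python. This is only an illustration of the map/reduce idea, assuming each shard can be iterated as a sequence of numbers (with PyTables that could be one column read from each table). The function names and the sample data are hypothetical.

```python
from functools import reduce


def map_shard(values):
    """Map step: reduce one dataset to a (sum, count) partial result."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total, count


def combine(a, b):
    """Reduce step: merge two (sum, count) partial results."""
    return a[0] + b[0], a[1] + b[1]


def global_mean(shards):
    """Mean over all shards without ever concatenating them."""
    total, count = reduce(combine, (map_shard(s) for s in shards), (0.0, 0))
    return total / count if count else float("nan")


# Hypothetical shards standing in for per-dataset tables.
shards = {"dataset1": [1.0, 2.0, 3.0], "dataset2": [4.0, 5.0]}
mean = global_mean(shards.values())
```

Because each map step only touches one shard, the partial results could be computed by separate processes and combined afterwards.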
From: Jacob B. <jac...@gm...> - 2012-07-18 10:49:34
|
I really like this way of going about it; however, would it be better to use the built-in hierarchy to separate the tables or to write to separate hdf5 files? In my current experiments with concurrent read/write operations on a shared hdf5 file without a hierarchy, the only errors I get are occasional read errors (which aren't much of a problem for me). So I am wondering: could there be a way to reduce the metadata within an hdf5 file and, at the same time, use a multi-table approach to solve my problem?

Thanks,
Jacob
From: Ümit S. <uem...@gm...> - 2012-07-18 11:22:22
|
I think it depends, and there are different ways to do it.
However, concurrent writes to one HDF5 file are not really supported (not even by the underlying HDF5 library, unless you use the MPI version). So in case you want to write from different threads/processes, you probably have to use separate hdf5 files.
Writing from one process and reading from another, on the other hand, is not much of an issue.

Having everything in one hdf5 file has its advantages, as does putting everything in separate hdf5 files.
Filesystems can usually cope with one huge file much better than with millions of small files (copying, listing, etc.).
Of course, if you have the datasets in separate hdf5 files it's easier to copy/move just single datasets compared to having everything in one hdf5 file (though that's also possible using ptrepack).

You could also create one hdf5 file for the meta information and create separate hdf5 files for each dataset. Then you can use external links (hard links only work within a single file) to connect the hdf5 file containing the meta-information to the hdf5 files for the datasets.

I usually tend to put everything in one hdf5 file.
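The "one meta file plus one file per dataset" idea can be sketched as follows, assuming the PyTables 3.x API (create_external_link, and calling the link object to dereference it). The file names, node names, and data are hypothetical, chosen only for illustration.

```python
import os
import tempfile

import numpy as np
import tables

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "dataset1.h5")  # hypothetical per-dataset file
meta_path = os.path.join(workdir, "meta.h5")      # hypothetical meta-information file

# The dataset lives in its own file.
with tables.open_file(data_path, mode="w") as h5:
    h5.create_array("/", "values", obj=np.arange(10))

# The meta file only holds a link of the form "filename:/path/to/node".
with tables.open_file(meta_path, mode="w") as h5:
    h5.create_external_link("/", "dataset1", data_path + ":/values")

with tables.open_file(meta_path, mode="r") as h5:
    link = h5.root.dataset1       # an ExternalLink, not the target node itself
    node = link()                 # dereference: opens the target file
    total = int(node.read().sum())
    link.umount()                 # close the external file again
```

An absolute path is used for the target here to keep the sketch simple; moving the dataset file afterwards would leave the link dangling.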
From: Jacob B. <jac...@gm...> - 2012-07-18 11:32:24
|
Sounds awesome, thanks for the help. I also have two more concerns.

#1 - I will never write concurrently; I only have to worry about one writer with many readers. Will the hdf5 metadata for a tree-like structure be able to hold up in this scenario?

#2 - When you have around 30,000 tables in your hdf5 file, you do not want to have every node directly linked to root (plus I don't think hdf5 can support that); however, I have no other natural grouping besides this. Could this be a concern also?

If you could help me out with these two items, I think I will have enough knowledge under my belt to know what I need to do. Thanks again! ;)
From: Ümit S. <uem...@gm...> - 2012-07-18 11:55:09
|
On Wed, Jul 18, 2012 at 1:32 PM, Jacob Bennett <jac...@gm...> wrote:
> Sounds awesome, thanks for the help, I also have two more concerns.
>
> #1 - I will never concurrently write, I only have to worry about one write
> with many reads, will the hdf5 metadata for a tree-like structure be able to
> hold up in this scenario?

To be honest, I haven't really tried the concurrent-read, single-write use case.
In my case I had a CherryPy Python web server (which uses multiple processes to handle requests); usually I write from one request, and reading is done from the same or another one. But I don't think I ever had the case where I read and wrote at the same time.
However, I had to keep the files open because of the way PyTables handles files (it caches them as singleton objects without a lock).
For example, if you close the file after you have finished writing while another thread/process is still reading from it, the read will raise an exception because it loses the file handle.
So you probably have to take care of this yourself in your code.

> #2 - When you have around 30,000 tables in your hdf5 file, you do not want
> to have every node directly linked to root (plus I don't think hdf5 can
> support that); however, I have no other natural grouping besides this, could
> this be a concern also.

Well, in my case my datasets consisted not only of one table but also of additional data (CArray, etc.).
So I naturally created a group for each dataset and stored meta-information as attributes on the group. These groups could sometimes contain additional groups, plus the actual data in the form of tables and CArrays. It looked something like this:

- root
  - data
    - dataset1
      - table
      - transformation
        - table
        - CArray
    - dataset2
    .
    .
    .
    - dataset30000

|
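A rough sketch of the "take care of this yourself" advice above, for the case where writer and readers are threads in one process sharing a single open file object. The class name `SharedHandle` and its `opener` argument are made up for illustration and are not part of PyTables; the idea is only that the file is closed when the last user is done with it, not when the writer finishes. It does nothing for coordination between separate processes.

```python
import threading


class SharedHandle:
    """Reference-counted wrapper around one shared file object.

    The file is opened on the first acquire() and closed only when the
    last user calls release(), so a writer that finishes early cannot
    pull the handle out from under a reader.
    """

    def __init__(self, opener):
        # `opener` is any zero-argument callable returning an object with
        # a close() method, e.g. lambda: tables.open_file(path, "a")
        # (hypothetical usage, not tested against a specific PyTables version).
        self._opener = opener
        self._lock = threading.Lock()
        self._handle = None
        self._users = 0

    def acquire(self):
        """Return the shared handle, opening it if nobody holds it yet."""
        with self._lock:
            if self._handle is None:
                self._handle = self._opener()
            self._users += 1
            return self._handle

    def release(self):
        """Give the handle back; close it once no users remain."""
        with self._lock:
            if self._users == 0:
                raise RuntimeError("release() called without acquire()")
            self._users -= 1
            if self._users == 0:
                self._handle.close()
                self._handle = None
```

Each request handler would call `acquire()` before touching the file and `release()` in a `finally` block afterwards. The lock here only protects the open/close bookkeeping; it does not serialize the reads and writes themselves.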
From: Jacob B. <jac...@gm...> - 2012-07-18 12:05:10
|
Good to hear. Were you able to get away with having 30,000 datasets directly linked to the same node (in this case, data)? I seem to have a problem attaching that many nodes to one root.

-Jacob

-- Jacob Bennett Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Class of 2014| ben...@mi... |
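For reference, the layout being asked about (one group per dataset, all attached to a single `data` group) might be built roughly as follows. This is only a sketch: it uses the snake_case names of current PyTables releases (`open_file`, `create_group`, ...), whereas the releases in use at the time of this thread spelled them in camelCase, and the `Row` schema, the `note` attribute and the node name `carray` are invented for the example.

```python
import os
import tempfile

import numpy as np
import tables


class Row(tables.IsDescription):
    # Hypothetical two-column schema, only for illustration.
    position = tables.Int64Col()
    score = tables.Float64Col()


def build_layout(path, n_datasets=3):
    """Create /data/datasetN groups, each holding a table and a sub-group."""
    with tables.open_file(path, mode="w") as h5:
        data = h5.create_group("/", "data")
        for i in range(1, n_datasets + 1):
            ds = h5.create_group(data, "dataset%d" % i)
            # Meta-information lives in attributes on the dataset's group.
            ds._v_attrs.note = "dataset number %d" % i
            table = h5.create_table(ds, "table", Row)
            table.append([(1, 0.5), (2, 1.5)])
            # Nested group with its own table and a CArray.
            tr = h5.create_group(ds, "transformation")
            h5.create_table(tr, "table", Row)
            h5.create_carray(tr, "carray", obj=np.zeros((4, 4)))


def read_one(path, name):
    """Open only the table of one dataset; also report children of /data."""
    with tables.open_file(path, mode="r") as h5:
        table = h5.get_node("/data/%s/table" % name)
        return table.nrows, h5.root.data._v_nchildren
```

Calling `build_layout(path)` and then `read_one(path, "dataset2")` touches only the one dataset of interest, which is the access pattern described earlier in the thread. Whether 30,000 children under one group behave well is exactly the open question here, so the sketch deliberately uses a small `n_datasets`.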
From: Ümit S. <uem...@gm...> - 2012-07-18 12:07:49
|
I actually had 30,000 groups attached to the data group. But I guess it doesn't really matter whether it is a table or a group; both are nodes.
>> >> > >> >> > >> >> > >> >> > >> >> > ------------------------------------------------------------------------------ >> >> > Live Security Virtual Conference >> >> > Exclusive live event will cover all the ways today's security and >> >> > threat landscape has changed and how IT managers can respond. >> >> > Discussions >> >> > will include endpoint security, mobile security and the latest in >> >> > malware >> >> > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> >> > _______________________________________________ >> >> > Pytables-users mailing list >> >> > Pyt...@li... >> >> > https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > >> >> >> >> >> >> >> >> ------------------------------------------------------------------------------ >> >> Live Security Virtual Conference >> >> Exclusive live event will cover all the ways today's security and >> >> threat landscape has changed and how IT managers can respond. >> >> Discussions >> >> will include endpoint security, mobile security and the latest in >> >> malware >> >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> >> _______________________________________________ >> >> Pytables-users mailing list >> >> Pyt...@li... >> >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> > >> > >> > >> > >> > -- >> > Jacob Bennett >> > Massachusetts Institute of Technology >> > Department of Electrical Engineering and Computer Science >> > Class of 2014| ben...@mi... >> > >> > >> > >> > ------------------------------------------------------------------------------ >> > Live Security Virtual Conference >> > Exclusive live event will cover all the ways today's security and >> > threat landscape has changed and how IT managers can respond. >> > Discussions >> > will include endpoint security, mobile security and the latest in >> > malware >> > threats. 
http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> > _______________________________________________ >> > Pytables-users mailing list >> > Pyt...@li... >> > https://lists.sourceforge.net/lists/listinfo/pytables-users >> > >> >> >> ------------------------------------------------------------------------------ >> Live Security Virtual Conference >> Exclusive live event will cover all the ways today's security and >> threat landscape has changed and how IT managers can respond. Discussions >> will include endpoint security, mobile security and the latest in malware >> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > -- > Jacob Bennett > Massachusetts Institute of Technology > Department of Electrical Engineering and Computer Science > Class of 2014| ben...@mi... > > > ------------------------------------------------------------------------------ > Live Security Virtual Conference > Exclusive live event will cover all the ways today's security and > threat landscape has changed and how IT managers can respond. Discussions > will include endpoint security, mobile security and the latest in malware > threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
From: Jacob B. <jac...@gm...> - 2012-07-18 12:10:34
|
Cool, thanks again for your help!

-Jake

On Wed, Jul 18, 2012 at 7:07 AM, Ümit Seren <uem...@gm...> wrote:
> I actually had 30.000 groups attached to the data group. But I guess
> it doesn't really matter whether it is a table or a group. They both
> are nodes.
>
> On Wed, Jul 18, 2012 at 2:04 PM, Jacob Bennett
> <jac...@gm...> wrote:
> > Good to hear, were you able to get away with having 30,000 datasets
> > directly linked to a similar node (in this case, data)? I seem to have
> > a problem putting that many nodes from one root.
> >
> > -Jacob
> >
> > On Wed, Jul 18, 2012 at 6:54 AM, Ümit Seren <uem...@gm...> wrote:
> >>
> >> On Wed, Jul 18, 2012 at 1:32 PM, Jacob Bennett
> >> <jac...@gm...> wrote:
> >> > Sounds awesome, thanks for the help, I also have two more concerns.
> >> >
> >> > #1 - I will never concurrently write, I only have to worry about one
> >> > write with many reads, will the hdf5 metadata for a tree-like
> >> > structure be able to hold up in this scenario?
> >>
> >> To be honest I haven't really tried the concurrent read and single
> >> write use case.
> >> In my case I had a cherrypy python web-server (which uses multiple
> >> processes to handle requests) and usually I write from one request and
> >> reading is done from the same or another. But I don't think I ever had
> >> the use case where I read and wrote at the same time.
> >> However I had to keep the files open because of the way PyTables
> >> handles files (it caches them as singleton objects without a lock).
> >> For example if you close the file after you finished writing and at
> >> the same time you are reading from another process it will cause an
> >> exception in the read thread/process because it loses the file handle.
> >> So you probably have to take care of this yourself in your code.
> >>
> >> > #2 - When you have around 30,000 tables in your hdf5 file, you do not
> >> > want to have every node directly linked to root (plus I don't think
> >> > hdf5 can support that); however, I have no other natural grouping
> >> > besides this, could this be a concern also.
> >>
> >> Well in my case my datasets consisted not only of one table but also
> >> additional data (CArray, etc).
> >> So I naturally created groups for each dataset and stored
> >> meta-information as attributes on the group. These groups could
> >> sometimes contain additional groups and the actual data in form of
> >> tables and CArrays. It looked something like this:
> >>
> >> - root
> >>   - data
> >>     - dataset1
> >>       - table
> >>       - transformation
> >>         - table
> >>         - CArray
> >>     - dataset2
> >>     .
> >>     .
> >>     .
> >>     - dataset30.000
> >>
> >> > If you could help me out with these two items, I think I will have
> >> > enough knowledge under my belt to know what I need to do. Thanks
> >> > again! ;)
> >> >
> >> > On Wed, Jul 18, 2012 at 6:21 AM, Ümit Seren <uem...@gm...> wrote:
> >> >>
> >> >> I think it depends and there are different ways to do it.
> >> >> But concurrent writes to one HDF5 file is not really supported (not
> >> >> even by the underlying HDF5 library unless you use the MPI version).
> >> >> So in case you want to write from different threads/processes you
> >> >> probably have to use separate hdf5 files.
> >> >> However writing from one process and reading from another is not
> >> >> much of an issue.
> >> >>
> >> >> Having everything in one hdf5 file has its advantages as well as
> >> >> putting everything in separate hdf5 files.
> >> >> Filesystems can usually cope with one huge file much better than
> >> >> with millions of small files (copying, listing, etc).
> >> >> Of course if you have the datasets in separate hdf5 files it's
> >> >> easier to copy/move just single datasets compared to having
> >> >> everything in one hdf5 file (though that's also possible using
> >> >> ptrepack).
> >> >>
> >> >> You could also create one hdf5 file for the meta information and
> >> >> create separate hdf5 files for each dataset. Then you can use
> >> >> hardlinks to connect the hdf5 file containing the meta-information
> >> >> to the hdf5 files for the datasets.
> >> >>
> >> >> I usually tend to put everything in one hdf5 file. |
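The quoted advice about not closing a file while someone else is still reading from it can be sketched in plain Python. This is only an illustration under stated assumptions: `SharedHandle` and `FakeFile` are hypothetical names (not part of PyTables), the wrapper only coordinates threads inside a single process, and the `opener` callable stands in for whatever actually opens the HDF5 file.

```python
import threading


class SharedHandle:
    """Reference-counted wrapper (hypothetical helper, not a PyTables class).

    The underlying object is opened on first use and closed only when the
    last user has released it, so one user finishing cannot pull the handle
    away from another user that is still reading.
    """

    def __init__(self, opener):
        self._opener = opener          # zero-argument callable returning an object with .close()
        self._lock = threading.Lock()  # protects _handle and _users
        self._handle = None
        self._users = 0

    def acquire(self):
        with self._lock:
            if self._handle is None:
                self._handle = self._opener()
            self._users += 1
            return self._handle

    def release(self):
        with self._lock:
            if self._users == 0:
                raise RuntimeError("release() called without a matching acquire()")
            self._users -= 1
            if self._users == 0:
                self._handle.close()
                self._handle = None


class FakeFile:
    """Stand-in for an open file object, used only for this demonstration."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


shared = SharedHandle(FakeFile)
writer_view = shared.acquire()   # the writer opens the file
reader_view = shared.acquire()   # a reader gets the same open handle
shared.release()                 # writer is done; file stays open for the reader
still_open = not reader_view.closed
shared.release()                 # last user gone; now the file is closed
```

In a multi-process setup such as the web server described in the quoted message, an in-process lock like this would not be enough on its own; each process would need its own handle or some cross-process coordination.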
From: Francesc A. <fa...@py...> - 2012-07-18 13:55:18
|
On 7/18/12 2:07 PM, Ümit Seren wrote: > I actually had 30.000 groups attached to the data group. But I guess > it doesn't really matter whether it is a table or a group. They both > are nodes. 30.000 datasets attached to the same group? I'm interested in knowing if you detected performance problems because of this. My experience is that it is better to split the datasets in different groups, so that you don't exceed, say, 1000 per each group. But I might be wrong... -- Francesc Alted |
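The suggestion above (keep each group to roughly 1000 children) can be sketched without touching the HDF5 file at all, by computing a bucketed path for each dataset. This is a minimal sketch, not something proposed in the thread: the function name `bucketed_path`, the `bucket`/`dataset` naming scheme, and the `/data` root are all assumptions made for illustration.

```python
from collections import Counter


def bucketed_path(dataset_index, per_group=1000, root="/data"):
    """Return an HDF5-style path that puts dataset_index into a sub-group
    holding at most per_group children (hypothetical naming scheme)."""
    if dataset_index < 0:
        raise ValueError("dataset_index must be non-negative")
    bucket = dataset_index // per_group
    return f"{root}/bucket{bucket:04d}/dataset{dataset_index:05d}"


# Spread 30,000 datasets over intermediate groups instead of one flat group.
paths = [bucketed_path(i) for i in range(30000)]

# Count how many datasets land in each intermediate group.
children_per_bucket = Counter(p.rsplit("/", 1)[0] for p in paths)
```

With these defaults, 30,000 datasets end up in 30 intermediate groups of 1000 children each, so no single group comes near the sizes discussed in the thread. The resulting strings could then be used as node paths when creating the groups in the file.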
From: Ümit S. <uem...@gm...> - 2012-07-18 14:11:50
|
Actually I had 30.000 groups in a parent group.
Each of the 30.000 groups had maybe 3 datasets.
So to be honest I never had 30.000 datasets in a single group.
I guess you will probably have to disable the LRU cache in that case right?

On Wed, Jul 18, 2012 at 3:55 PM, Francesc Alted <fa...@py...> wrote:
> On 7/18/12 2:07 PM, Ümit Seren wrote:
>> I actually had 30.000 groups attached to the data group. But I guess
>> it doesn't really matter whether it is a table or a group. They both
>> are nodes.
>
> 30.000 datasets attached to the same group? I'm interested in knowing
> if you detected performance problems because of this. My experience is
> that it is better to split the datasets in different groups, so that you
> don't exceed, say, 1000 per each group. But I might be wrong...
>
> --
> Francesc Alted |
From: Francesc A. <fa...@py...> - 2012-07-18 14:39:53
|
On 7/18/12 4:11 PM, Ümit Seren wrote:
> Actually I had 30.000 groups in a parent group.
> Each of the 30.000 groups had maybe 3 datasets.
> So to be honest I never had 30.000 datasets in a single group.
> I guess you will probably have to disable the LRU cache in that case right?

Okay. So I'd say that having 30.000 entries in one group (no matter if they are groups or datasets) would be a bad practice for performance in general, but maybe there is a difference between groups and datasets (i.e. it affects datasets more than groups)? Just curious: did PyTables not complain when you created 30.000 groups in the same group?

Regarding the LRU cache, no, I don't think this is the problem, but rather how HDF5 implements the 'inodes' (or whatever they call that). This is a big issue in general (inodes in filesystems have similar problems too), and it is what hurts performance in this case.

--
Francesc Alted |
From: Ümit S. <uem...@gm...> - 2012-07-18 14:48:08
|
Actually it did complain that it is over a certain limit and it also suggested a flag with which I can turn off the warning.
But performance seemed fine. So if I randomly accessed any of the 30.000 groups I got the group handle in a fraction of a second.

On 18.07.2012 16:40, "Francesc Alted" <fa...@py...> wrote:
> On 7/18/12 4:11 PM, Ümit Seren wrote:
> > Actually I had 30.000 groups in a parent group.
> > Each of the 30.000 groups had maybe 3 datasets.
> > So to be honest I never had 30.000 datasets in a single group.
> > I guess you will probably have to disable the LRU cache in that case
> > right?
>
> Okay. So I'd say that having 30.000 entries (no matter if they are
> groups or datasets) would be a bad performance practice in general, but
> maybe it is a difference between groups and datasets (i.e. it affects
> more to datasets than groups)?. Just curious, PyTables did not complain
> when you created 30.000 groups in the same group?
>
> Regarding the LRU cache, no, I don't think this is the problem, but
> rather how HDF5 implements the 'inodes' (or whatever they call that).
> This is a big issue in general (inodes in filesystems have similar
> problems too), and what hurts performance in this case.
>
> --
> Francesc Alted |