From: Ümit S. <uem...@gm...> - 2012-07-18 12:07:49
I actually had 30,000 groups attached to the data group. But I guess it
doesn't really matter whether it is a table or a group. Both are nodes.

On Wed, Jul 18, 2012 at 2:04 PM, Jacob Bennett <jac...@gm...> wrote:
> Good to hear. Were you able to get away with having 30,000 datasets directly
> linked to a similar node (in this case, data)? I seem to have a problem
> putting that many nodes under one root.
>
> -Jacob
>
>
> On Wed, Jul 18, 2012 at 6:54 AM, Ümit Seren <uem...@gm...> wrote:
>>
>> On Wed, Jul 18, 2012 at 1:32 PM, Jacob Bennett
>> <jac...@gm...> wrote:
>> > Sounds awesome, thanks for the help. I also have two more concerns.
>> >
>> > #1 - I will never write concurrently; I only have to worry about one
>> > writer with many readers. Will the hdf5 metadata for a tree-like
>> > structure be able to hold up in this scenario?
>>
>> To be honest, I haven't really tried the concurrent-read, single-write
>> use case.
>> In my case I had a CherryPy Python web server (which uses multiple
>> processes to handle requests), and usually I write from one request and
>> reading is done from the same or another one. But I don't think I ever
>> had the use case where I read and wrote at the same time.
>> However, I had to keep the files open because of the way PyTables
>> handles files (it caches them as singleton objects without a lock).
>> For example, if you close the file after you have finished writing while
>> another process is reading from it, it will cause an exception in the
>> reading thread/process because it loses the file handle.
>> So you probably have to take care of this yourself in your code.
>>
>>
>> > #2 - When you have around 30,000 tables in your hdf5 file, you do not
>> > want to have every node directly linked to root (plus I don't think
>> > hdf5 can support that); however, I have no other natural grouping
>> > besides this. Could this be a concern also?
>>
>>
>> Well, in my case my datasets consisted not only of one table but also of
>> additional data (CArray, etc.).
>> So I naturally created a group for each dataset and stored
>> meta-information as attributes on the group. These groups could
>> sometimes contain additional groups and the actual data in the form of
>> tables and CArrays. It looked something like this:
>>
>> - root
>>   - data
>>     - dataset1
>>       - table
>>       - transformation
>>         - table
>>         - CArray
>>     - dataset2
>>     .
>>     .
>>     .
>>     - dataset30000
>>
>>
>> > If you could help me out with these two items, I think I will have
>> > enough knowledge under my belt to know what I need to do. Thanks again! ;)
>> >
>> >
>> > On Wed, Jul 18, 2012 at 6:21 AM, Ümit Seren <uem...@gm...>
>> > wrote:
>> >>
>> >> I think it depends, and there are different ways to do it.
>> >> But concurrent writes to one HDF5 file are not really supported (not
>> >> even by the underlying HDF5 library unless you use the MPI version).
>> >> So in case you want to write from different threads/processes you
>> >> probably have to use separate hdf5 files.
>> >> However, writing from one process and reading from another is not much
>> >> of an issue.
>> >>
>> >> Having everything in one hdf5 file has its advantages, as does
>> >> putting everything in separate hdf5 files.
>> >> Filesystems can usually cope with one huge file much better than with
>> >> millions of small files (copying, listing, etc.).
>> >> Of course if you have the datasets in separate hdf5 files it's easier
>> >> to copy/move just single datasets compared to having everything in one
>> >> hdf5 file (though that's also possible using ptrepack).
>> >>
>> >> You could also create one hdf5 file for the meta information and
>> >> create separate hdf5 files for each dataset. Then you can use
>> >> external links to connect the hdf5 file containing the
>> >> meta-information to the hdf5 files for the datasets.
>> >>
>> >> I usually tend to put everything in one hdf5 file.
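
[Illustration, not part of the original messages.] The group-per-dataset
layout described above could be sketched roughly as follows. This assumes
the PyTables 3.x snake_case API (open_file, create_group, ...); in 2012 the
same calls were spelled in camelCase. The Measurement columns, the attribute
name, and the exact nesting of the CArray under "transformation" are
hypothetical, since the thread does not spell them out.

```python
import os
import tempfile

import numpy as np
import tables


class Measurement(tables.IsDescription):
    # Hypothetical row layout; the thread does not describe the columns.
    position = tables.Int64Col()
    score = tables.Float64Col()


def build_file(path, n_datasets=3):
    """Create /data/datasetN groups, each with a table and a sub-group."""
    with tables.open_file(path, mode="w") as h5:
        data = h5.create_group("/", "data")
        for i in range(1, n_datasets + 1):
            group = h5.create_group(data, f"dataset{i}")
            # Meta-information is stored as attributes on the group.
            group._v_attrs.description = f"example dataset {i}"
            table = h5.create_table(group, "table", Measurement)
            # Columns are ordered alphabetically: (position, score).
            table.append([(j, j * 0.5) for j in range(10)])
            table.flush()
            transformation = h5.create_group(group, "transformation")
            h5.create_carray(transformation, "carray", obj=np.arange(10.0))


def read_scores(path, dataset_name):
    """Open one dataset by name and run an in-kernel query on its table."""
    with tables.open_file(path, mode="r") as h5:
        table = h5.get_node(f"/data/{dataset_name}/table")
        return [float(row["score"]) for row in table.where("position >= 8")]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "example.h5")
    build_file(demo_path)
    print(read_scores(demo_path, "dataset2"))
```

Only the one dataset that is looked up gets touched on read; the other
groups are never loaded, which is the point of the per-dataset split.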
>> >>
>> >> On Wed, Jul 18, 2012 at 12:49 PM, Jacob Bennett
>> >> <jac...@gm...> wrote:
>> >> > I really like this way of going about it; however, would it be
>> >> > better to use the built-in hierarchy for separation of the tables or
>> >> > to write to separate hdf5 files? When I experiment with concurrent
>> >> > read/write operations on a shared hdf5 file w/o hierarchy, I notice
>> >> > that the only errors I get are occasional read errors (which isn't
>> >> > much of a problem for me), so I am wondering: could there be a way to
>> >> > reduce the metadata within an hdf5 file and at the same time use a
>> >> > multi-table approach to solve my problem?
>> >> >
>> >> > Thanks,
>> >> > Jacob
>> >> >
>> >> >
>> >> > On Wed, Jul 18, 2012 at 1:22 AM, Ümit Seren <uem...@gm...>
>> >> > wrote:
>> >> >>
>> >> >> Just to add to what Anthony said:
>> >> >> In the end it also depends on how unrelated your data is and how you
>> >> >> want to access it. If the access scenario is that you usually only
>> >> >> search or select within a specific dataset, then splitting up the
>> >> >> datasets and putting them into separate tables is the way to go. In
>> >> >> RDBMS terms this is, by the way, called sharding.
>> >> >> I have such a use case where I have around 30,000 datasets (each
>> >> >> of them with around 5 million rows). I am only interested in one
>> >> >> dataset at a time, so I created 30,000 tables. It works really well.
>> >> >> And in case you want to access the data across the datasets (for
>> >> >> aggregating or calculating averages) you can take a MapReduce
>> >> >> approach, which should work very well with this layout.
>> >> >>
>> >> >>
>> >> >> On Tue, Jul 17, 2012 at 11:55 PM, Jacob Bennett
>> >> >> <jac...@gm...> wrote:
>> >> >> > Thanks for the input Anthony!
>> >> >> >
>> >> >> > -Jake
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Jul 17, 2012 at 4:20 PM, Anthony Scopatz
>> >> >> > <sc...@gm...> wrote:
>> >> >> >>
>> >> >> >> On Tue, Jul 17, 2012 at 3:30 PM, Jacob Bennett
>> >> >> >> <jac...@gm...> wrote:
>> >> >> >>>
>> >> >> >>> Hello PyTables Users & Contributors,
>> >> >> >>>
>> >> >> >>> Just a quick question: let's say that I have certain identifiers
>> >> >> >>> that link to a set of data. Would it generally be faster for
>> >> >> >>> lookup to have each set of data as a separate table with an id
>> >> >> >>> as the table's name, or to add this id as another column to a
>> >> >> >>> universal table of data and then let the in-kernel search query
>> >> >> >>> only the data with a specific id?
>> >> >> >>
>> >> >> >>
>> >> >> >> I think that in general it is faster to have more tables with ids
>> >> >> >> as names. For very small data, searching through a single larger
>> >> >> >> table might be quicker than node access...but even then I doubt it.
>> >> >> >>
>> >> >> >>>
>> >> >> >>> I hope you can understand my question: would 1,000 tables of
>> >> >> >>> 100,000 records each be better for searching than 1 table with
>> >> >> >>> 100 million records and one extra id column?
>> >> >> >>
>> >> >> >>
>> >> >> >> For these data sizes more tables is probably faster.
>> >> >> >>
>> >> >> >> (It should also be noted that in the more-tables case the data
>> >> >> >> is actually smaller, because you can eliminate the id column.)
>> >> >> >>
>> >> >> >> Be Well
>> >> >> >> Anthony
>> >> >> >>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> Thanks,
>> >> >> >>> Jacob Bennett
>> >> >> >>>
>> >> >> >>> --
>> >> >> >>> Jacob Bennett
>> >> >> >>> Massachusetts Institute of Technology
>> >> >> >>> Department of Electrical Engineering and Computer Science
>> >> >> >>> Class of 2014| ben...@mi...
>> >> >> >>>
>> >> >> >>> ------------------------------------------------------------------------------
>> >> >> >>> Live Security Virtual Conference
>> >> >> >>> Exclusive live event will cover all the ways today's security
>> >> >> >>> and threat landscape has changed and how IT managers can respond.
>> >> >> >>> Discussions will include endpoint security, mobile security and
>> >> >> >>> the latest in malware threats.
>> >> >> >>> http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>> >> >> >>> _______________________________________________
>> >> >> >>> Pytables-users mailing list
>> >> >> >>> Pyt...@li...
>> >> >> >>> https://lists.sourceforge.net/lists/listinfo/pytables-users
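
[Illustration, not part of the original messages.] The MapReduce idea
mentioned in the thread for aggregating across per-dataset tables can be
sketched with plain Python, under the assumption that each dataset can be
reduced independently to a small partial result. The names map_partial,
combine and global_mean are hypothetical, and in-memory lists stand in for
the per-dataset tables. Each dataset is mapped to a (sum, count) pair, and
the pairs are then merged, so no dataset ever has to be concatenated with
another.

```python
from functools import reduce


def map_partial(rows):
    """Map step: reduce one dataset to a (sum, count) pair."""
    values = list(rows)
    return sum(values), len(values)


def combine(left, right):
    """Reduce step: merge two (sum, count) pairs into one."""
    return left[0] + right[0], left[1] + right[1]


def global_mean(datasets):
    """Mean over all rows of all datasets, computed from partial results."""
    partials = (map_partial(rows) for rows in datasets.values())
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count if count else float("nan")


# In-memory stand-ins for the per-dataset tables.
datasets = {
    "dataset1": [1.0, 2.0, 3.0],
    "dataset2": [10.0],
    "dataset3": [4.0, 6.0],
}

if __name__ == "__main__":
    print(global_mean(datasets))
```

Carrying the count along with the sum matters: averaging the three
per-dataset means directly would weight the single-row dataset as heavily
as the three-row one and give a different answer.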