Re: [Pytables-users] Faster Performance: A set of nodes vs A new column that ranges within a set?

Good to hear. Were you able to get away with having 30,000 datasets
directly linked to a single node (in this case, data)? I seem to have a
problem hanging that many nodes off one root.

-Jacob

On Wed, Jul 18, 2012 at 6:54 AM, Ümit Seren <uem...@gm...> wrote:

> On Wed, Jul 18, 2012 at 1:32 PM, Jacob Bennett
> <jac...@gm...> wrote:
> > Sounds awesome, thanks for the help. I also have two more concerns.
> >
> > #1 - I will never write concurrently; I only have to worry about one
> > writer with many readers. Will the HDF5 metadata for a tree-like
> > structure be able to hold up in this scenario?
>
> To be honest, I haven't really tried the concurrent-read, single-write
> use case.
> In my case I had a CherryPy Python web server (which uses multiple
> processes to handle requests), and usually I write from one request and
> read from the same or another one. But I don't think I ever had
> the use case where I read and wrote at the same time.
> However, I had to keep the files open because of the way PyTables
> handles files (it caches them as singleton objects without a lock).
> For example, if you close the file after you have finished writing while
> another process is reading from it, this will cause an
> exception in the read thread/process because it loses the file handle.
> So you probably have to take care of this yourself in your code.
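
One way to "take care of this yourself" is to count users of the shared handle and only really close it when the last one is done. The sketch below is a minimal illustration of that idea using only the standard library. `SharedHandle` and `FakeFile` are hypothetical names, not part of PyTables, and the lock only protects the open/close bookkeeping for threads inside one process. Coordinating several processes would need something else.

```python
import threading
from contextlib import contextmanager


class SharedHandle:
    """Keep one handle open until the last user is done with it."""

    def __init__(self, opener):
        self._opener = opener          # callable returning an object with .close()
        self._lock = threading.Lock()  # guards the bookkeeping below
        self._handle = None
        self._users = 0

    @contextmanager
    def use(self):
        # Open lazily on first use and register this user.
        with self._lock:
            if self._handle is None:
                self._handle = self._opener()
            self._users += 1
            handle = self._handle
        try:
            yield handle
        finally:
            # Deregister; close only when nobody else is using the handle.
            with self._lock:
                self._users -= 1
                if self._users == 0:
                    handle.close()
                    self._handle = None


class FakeFile:
    """Stand-in for a real file object so the sketch runs anywhere."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


shared = SharedHandle(FakeFile)
with shared.use() as writer_view:
    with shared.use() as reader_view:
        same_object = writer_view is reader_view
    # The "reader" has finished, but the "writer" still holds the handle.
    still_open_after_reader = not writer_view.closed
closed_after_last_user = writer_view.closed
```

In real code the opener could be something like `lambda: tables.open_file(path, "a")`; that substitution is an assumption and has not been tested against the caching behaviour described above.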
>
>
> > #2 - When you have around 30,000 tables in your HDF5 file, you do not
> > want to have every node directly linked to root (plus I don't think HDF5
> > can support that); however, I have no other natural grouping besides
> > this. Could this be a concern also?
>
>
> Well, in my case my datasets consisted not only of one table but also of
> additional data (CArray, etc.).
> So I naturally created a group for each dataset and stored
> meta-information as attributes on the group. These groups could
> sometimes contain additional groups and the actual data in the form of
> tables and CArrays. It looked something like this:
>
>  - root
>     - data
>         - dataset1
>             - table
>             - transformation
>                 - table
>                 - CArray
>         - dataset2
>         .
>         .
>         .
>         - dataset30000
>
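
A layout like the one above could be built roughly as follows. This is a minimal sketch, not the original code: it uses the PyTables 3.x snake_case names (`open_file`, `create_group`, ...) rather than the camelCase API that was current in 2012, and the `Row` description, the `description` attribute, the node name `carray`, and the file name are all made up for illustration. Only three datasets are created to keep it quick.

```python
import os
import tempfile

import numpy as np
import tables


class Row(tables.IsDescription):
    # Hypothetical columns; the thread does not say what the tables hold.
    position = tables.Int64Col(pos=0)
    score = tables.Float64Col(pos=1)


path = os.path.join(tempfile.mkdtemp(), "datasets.h5")

with tables.open_file(path, mode="w") as h5:
    data = h5.create_group("/", "data")
    for i in range(1, 4):
        # One group per dataset, with meta-information as group attributes.
        ds = h5.create_group(data, f"dataset{i}")
        ds._v_attrs.description = f"example dataset {i}"

        table = h5.create_table(ds, "table", Row)
        table.append([(0, 0.5), (1, 1.5)])

        # Nested group holding derived data: a table and a CArray.
        trans = h5.create_group(ds, "transformation")
        h5.create_table(trans, "table", Row)
        h5.create_carray(trans, "carray", obj=np.zeros((2, 2)))

with tables.open_file(path, mode="r") as h5:
    # Access goes straight to one dataset by path.
    n_rows = h5.get_node("/data/dataset2/table").nrows
    description = h5.get_node("/data/dataset2")._v_attrs.description
```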
>
> > If you could help me out with these two items, I think I will have enough
> > knowledge under my belt to know what I need to do. Thanks again! ;)
> >
> >
> > On Wed, Jul 18, 2012 at 6:21 AM, Ümit Seren <uem...@gm...> wrote:
> >>
> >> I think it depends and there are different ways to do it.
> >> But concurrent writes to one HDF5 file are not really supported (not
> >> even by the underlying HDF5 library unless you use the MPI version).
> >> So in case you want to write from different threads/processes you
> >> probably have to use separate hdf5 files.
> >> However writing from one process and reading from another is not much
> >> of an issue.
> >>
> >> Having everything in one HDF5 file has its advantages, as does
> >> putting everything in separate HDF5 files.
> >> Filesystems can usually cope with one huge file much better than with
> >> millions of small files (copying, listing, etc.).
> >> Of course, if you have the datasets in separate HDF5 files it's easier
> >> to copy/move just single datasets compared to having everything in one
> >> HDF5 file (though that's also possible using ptrepack).
> >>
> >> You could also create one HDF5 file for the meta information and
> >> create separate HDF5 files for each dataset. Then you can use
> >> external links (hard links only work within a single file) to connect
> >> the HDF5 file containing the meta-information to the HDF5 files for
> >> the datasets.
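
Linking from one HDF5 file into another is done in PyTables with external links. A minimal sketch of the meta-file idea, assuming PyTables 3.x names and made-up file and node names (`meta.h5`, `dataset1.h5`, `/values`):

```python
import os
import tempfile

import tables

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "dataset1.h5")
meta_path = os.path.join(workdir, "meta.h5")

# A separate file holding one dataset.
with tables.open_file(data_path, mode="w") as h5:
    h5.create_array("/", "values", obj=[1, 2, 3])

# The meta file only stores a link of the form "filename:/path/to/node".
with tables.open_file(meta_path, mode="w") as h5:
    group = h5.create_group("/", "datasets")
    h5.create_external_link(group, "dataset1", f"{data_path}:/values")

with tables.open_file(meta_path, mode="r") as h5:
    link = h5.get_node("/datasets/dataset1")
    target = link()                 # calling the link opens the other file
    values = [int(v) for v in target.read()]
    target._v_file.close()          # close the externally opened file
```

Each dataset file can then be copied or moved on its own, at the cost of having to keep the link targets valid.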
> >>
> >> I usually tend to put everything in one hdf5 file.
> >>
> >> On Wed, Jul 18, 2012 at 12:49 PM, Jacob Bennett
> >> <jac...@gm...> wrote:
> >> > I really like this way of going about it; however, would it be better
> >> > to use the built-in hierarchy for separation of the tables or to write
> >> > to separate HDF5 files? When I experiment with concurrent
> >> > read/write operations on a shared HDF5 file without hierarchy, I notice
> >> > that the only errors I get are occasional read errors (which aren't
> >> > much of a problem for me). So I am wondering: could there be a way to
> >> > reduce the metadata within an HDF5 file and, at the same time, use a
> >> > multi-table approach to solve my problem?
> >> >
> >> > Thanks,
> >> > Jacob
> >> >
> >> >
> >> > On Wed, Jul 18, 2012 at 1:22 AM, Ümit Seren <uem...@gm...>
> >> > wrote:
> >> >>
> >> >> Just to add to what Anthony said:
> >> >> In the end it also depends on how unrelated your data is and how you
> >> >> want to access it. If the access scenario is that you usually only
> >> >> search or select within a specific dataset, then splitting up the
> >> >> datasets and putting them into separate tables is the way to go. In
> >> >> RDBMS terms this is, by the way, called sharding.
> >> >> I have such a use case, where I have around 30,000 datasets (each of
> >> >> them with around 5 million rows). I am only interested in one dataset
> >> >> at a time, so I created 30,000 tables. It works really well.
> >> >> And in case you want to access the data across the datasets (for
> >> >> aggregating or calculating averages), you can take a MapReduce
> >> >> approach, which should work very well with this layout.
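
The MapReduce idea for cross-dataset aggregates can be sketched without any HDF5 at all. In the example below, plain Python lists stand in for the per-dataset tables (the names and values are invented). Each dataset is summarized independently in the map step, and the small partial results are merged in the reduce step to get an overall mean.

```python
from functools import reduce

# Hypothetical stand-in for per-dataset tables: name -> column values.
datasets = {
    "dataset1": [1.0, 2.0, 3.0],
    "dataset2": [10.0, 20.0],
    "dataset3": [5.0],
}


def map_dataset(values):
    """Map step: summarize one dataset as (sum, count)."""
    return (sum(values), len(values))


def combine(a, b):
    """Reduce step: merge two partial (sum, count) summaries."""
    return (a[0] + b[0], a[1] + b[1])


partials = [map_dataset(values) for values in datasets.values()]
total, count = reduce(combine, partials, (0.0, 0))
overall_mean = total / count
```

Because each map step touches only one dataset, it only ever needs one table in memory at a time, and the map calls could be distributed across workers.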
> >> >>
> >> >>
> >> >> On Tue, Jul 17, 2012 at 11:55 PM, Jacob Bennett
> >> >> <jac...@gm...> wrote:
> >> >> > Thanks for the input Anthony!
> >> >> >
> >> >> > -Jake
> >> >> >
> >> >> >
> >> >> > On Tue, Jul 17, 2012 at 4:20 PM, Anthony Scopatz <sc...@gm...>
> >> >> > wrote:
> >> >> >>
> >> >> >> On Tue, Jul 17, 2012 at 3:30 PM, Jacob Bennett
> >> >> >> <jac...@gm...>
> >> >> >> wrote:
> >> >> >>>
> >> >> >>> Hello PyTables Users & Contributors,
> >> >> >>>
> >> >> >>> Just a quick question: let's say that I have certain identifiers
> >> >> >>> that link to a set of data. Would it generally be faster for lookup
> >> >> >>> to have each set of data as a separate table with an id as the
> >> >> >>> table's name, or to add this id as another column to a universal
> >> >> >>> table of data and then let the in-kernel search query data only
> >> >> >>> with a specific id?
> >> >> >>
> >> >> >>
> >> >> >> I think that in general it is faster to have more tables with ids
> >> >> >> as names.  For very small data, searching through a single larger
> >> >> >> table might be quicker than node access... but even then I doubt it.
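
The two layouts being compared can be sketched side by side. This is an illustrative example under assumptions: PyTables 3.x names, invented column names (`dataset_id`, `value`), and a tiny amount of data, so it shows the access pattern rather than the performance difference.

```python
import os
import tempfile

import tables


class WithId(tables.IsDescription):
    dataset_id = tables.Int32Col(pos=0)
    value = tables.Float64Col(pos=1)


class NoId(tables.IsDescription):
    value = tables.Float64Col()


path = os.path.join(tempfile.mkdtemp(), "layouts.h5")

with tables.open_file(path, mode="w") as h5:
    # Layout A: one universal table with an extra id column.
    universal = h5.create_table("/", "universal", WithId)
    universal.append([(i % 3, float(i)) for i in range(9)])
    universal.flush()

    # Layout B: one table per id, named after the id.
    by_id = h5.create_group("/", "by_id")
    for ident in range(3):
        table = h5.create_table(by_id, f"id{ident}", NoId)
        table.append([(float(i),) for i in range(9) if i % 3 == ident])
        table.flush()

    # Layout A lookup: an in-kernel query filters rows of the big table.
    from_universal = [row["value"] for row in universal.where("dataset_id == 1")]

    # Layout B lookup: go straight to the node and read the whole table.
    from_node = h5.get_node("/by_id/id1").read()["value"].tolist()
```

Both lookups return the same values. The difference is that layout A has to examine the id column of the universal table, while layout B resolves a node path and reads only the rows that belong to that id.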
> >> >> >>
> >> >> >>>
> >> >> >>> I hope you can understand my question: would 1,000 tables of
> >> >> >>> 100,000 records each be better for searching than 1 table with
> >> >> >>> 100 million records and one extra id column?
> >> >> >>
> >> >> >>
> >> >> >> For these data sizes more tables is probably faster.
> >> >> >>
> >> >> >> (It should also be noted that in the more-tables case, the data
> >> >> >> is actually smaller, because you can eliminate the id column.)
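
To put a rough number on the size remark: assuming the id would be stored as a 4-byte integer (the thread does not say which type), the id column alone takes the following amount of uncompressed space for 100 million records.

```python
records = 100_000_000          # 1,000 tables x 100,000 records
id_bytes = 4                   # assumption: 32-bit integer id

saved_bytes = records * id_bytes
saved_mib = saved_bytes / 2**20

print(f"{saved_bytes:,} bytes, about {saved_mib:.0f} MiB")
```

With compression enabled the real saving would likely be smaller, since a highly repetitive id column compresses well.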
> >> >> >>
> >> >> >> Be Well
> >> >> >> Anthony
> >> >> >>
> >> >> >>>
> >> >> >>>
> >> >> >>> Thanks,
> >> >> >>> Jacob Bennett
> >> >> >>>
> >> >> >>> --
> >> >> >>> Jacob Bennett
> >> >> >>> Massachusetts Institute of Technology
> >> >> >>> Department of Electrical Engineering and Computer Science
> >> >> >>> Class of 2014| ben...@mi...
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> ------------------------------------------------------------------------------
> >> >> >>> Live Security Virtual Conference
> >> >> >>> Exclusive live event will cover all the ways today's security and
> >> >> >>> threat landscape has changed and how IT managers can respond.
> >> >> >>> Discussions will include endpoint security, mobile security and the
> >> >> >>> latest in malware threats.
> >> >> >>> http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> >> >> >>> _______________________________________________
> >> >> >>> Pytables-users mailing list
> >> >> >>> Pyt...@li...
> >> >> >>> https://lists.sourceforge.net/lists/listinfo/pytables-users
> >> >> >>>

-- 
Jacob Bennett
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Class of 2014| ben...@mi...