Re: [Pytables-users] Faster Performance: A set of nodes vs A new column that ranges within a set?

Sounds awesome, thanks for the help. I also have two more concerns.

#1 - I will never write concurrently; I only have to worry about one writer
with many readers. Will the hdf5 metadata for a tree-like structure be able
to hold up in this scenario?
#2 - When you have around 30,000 tables in your hdf5 file, you do not want
to have every node directly linked to root (plus I don't think hdf5 can
support that); however, I have no other natural grouping besides this.
Could this also be a concern?

If you could help me out with these two items, I think I will have enough
knowledge under my belt to know what I need to do. Thanks again! ;)
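
For concern #2, one possible way to avoid hanging every table directly off
root when there is no natural grouping is to derive an artificial group from
a hash of the table id. The following is only a sketch of the path layout,
not a tested PyTables recipe: the helper name bucket_path, the bucket count
of 256 and the "bucket_"/"t_" prefixes are all assumptions made up for
illustration.

```python
import hashlib
from collections import Counter

N_BUCKETS = 256  # hypothetical bucket count; tune to the number of tables


def bucket_path(table_id, n_buckets=N_BUCKETS):
    """Map a table id to a node path of the form /bucket_NNN/t_<id>.

    The bucket is derived from a stable hash of the id, so the same id
    always resolves to the same group without any lookup table.
    """
    digest = hashlib.md5(str(table_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_buckets
    # The "t_" prefix keeps purely numeric ids valid as Python identifiers.
    return "/bucket_{:03d}/t_{}".format(bucket, table_id)


# Spread 30,000 illustrative ids over the buckets and inspect group widths.
paths = [bucket_path(i) for i in range(30000)]
sizes = Counter(p.split("/")[1] for p in paths)
print("groups used:", len(sizes), "largest group:", max(sizes.values()))
```

Each path can then be split into a parent group and a table name when the
table is created; as far as I know, PyTables can create the missing parent
group on the fly if createparents=True is passed to the table-creation call.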

On Wed, Jul 18, 2012 at 6:21 AM, Ümit Seren <uem...@gm...> wrote:

> I think it depends and there are different ways to do it.
> But concurrent writes to one HDF5 file are not really supported (not
> even by the underlying HDF5 library unless you use the MPI version).
> So in case you want to write from different threads/processes you
> probably have to use separate hdf5 files.
> However writing from one process and reading from another is not much
> of an issue.
>
> Having everything in one hdf5 file has its advantages, as does
> putting everything in separate hdf5 files.
> Filesystems can usually cope with one huge file much better than with
> millions of small files (copying, listing, etc).
> Of course if you have the datasets in separate hdf5 files it's easier
> to copy/move just single datasets compared to having everything in one
> hdf5 file (though that's also possible using ptrepack).
>
> You could also create one hdf5 file for the meta information and
> create separate hdf5 files for each dataset. Then you can use
> external links to connect the hdf5 file containing the meta-information to
> the hdf5 files for the datasets.
>
> I usually tend to put everything in one hdf5 file.
>
> On Wed, Jul 18, 2012 at 12:49 PM, Jacob Bennett
> <jac...@gm...> wrote:
> > I really like this way of going about it; however, would it be better to
> > use the built-in hierarchy for separation of the tables or to write to
> > separate hdf5 files? When I experiment with concurrent
> > read/write operations on a shared hdf5 file w/o hierarchy, I notice that the
> > only errors that I get are occasional read errors (which isn't much of a
> > problem for me). So I am wondering: could there be a way to reduce the
> > metadata within an hdf5 file and, at the same time, use a multi-table
> > approach to solve my problem?
> >
> > Thanks,
> > Jacob
> >
> >
> > On Wed, Jul 18, 2012 at 1:22 AM, Ümit Seren <uem...@gm...> wrote:
> >>
> >> Just to add to what Anthony said:
> >> In the end it also depends on how unrelated your data is and how you want
> >> to access it. If the access scenario is that you usually only search
> >> or select within a specific dataset, then splitting up the datasets and
> >> putting them into separate tables is the way to go. In RDBMS terms
> >> this is, btw, called sharding.
> >> I have such a use case where I have around 30,000 datasets (each of
> >> them with around 5 million rows). I am only interested in one dataset
> >> at a time, so I created 30,000 tables. It works really well.
> >> And in case you want to access the data across the datasets (for
> >> aggregating or calculating averages) you can take a MapReduce approach,
> >> which should work very well with this layout.
> >>
> >>
> >> On Tue, Jul 17, 2012 at 11:55 PM, Jacob Bennett
> >> <jac...@gm...> wrote:
> >> > Thanks for the input Anthony!
> >> >
> >> > -Jake
> >> >
> >> >
> >> > On Tue, Jul 17, 2012 at 4:20 PM, Anthony Scopatz <sc...@gm...>
> >> > wrote:
> >> >>
> >> >> On Tue, Jul 17, 2012 at 3:30 PM, Jacob Bennett
> >> >> <jac...@gm...>
> >> >> wrote:
> >> >>>
> >> >>> Hello PyTables Users & Contributors,
> >> >>>
> >> >>> Just a quick question: let's say that I have certain identifiers that
> >> >>> link to a set of data. Would it generally be faster for lookup to have
> >> >>> each set of data as a separate table with the id as the table's name,
> >> >>> or to add this id as another column to a universal table of data and
> >> >>> then let the in-kernel search query only the data with a specific id?
> >> >>
> >> >>
> >> >> I think that in general it is faster to have more tables with ids as
> >> >> names.  For very small data, searching through a single larger table
> >> >> might be quicker than node access...but even then I doubt it.
> >> >>
> >> >>>
> >> >>> I hope you can understand my question: would 1,000 tables of 100,000
> >> >>> records each be better for searching than 1 table with 100 million
> >> >>> records and one extra id column?
> >> >>
> >> >>
> >> >> For these data sizes, more tables is probably faster.
> >> >>
> >> >> (It should also be noted that in the more-tables case the data is
> >> >> actually smaller, because you can eliminate the id column.)
> >> >>
> >> >> Be Well
> >> >> Anthony
> >> >>
> >> >>>
> >> >>>
> >> >>> Thanks,
> >> >>> Jacob Bennett
> >> >>>
> >> >>> --
> >> >>> Jacob Bennett
> >> >>> Massachusetts Institute of Technology
> >> >>> Department of Electrical Engineering and Computer Science
> >> >>> Class of 2014| ben...@mi...
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> ------------------------------------------------------------------------------
> >> >>> Live Security Virtual Conference
> >> >>> Exclusive live event will cover all the ways today's security and
> >> >>> threat landscape has changed and how IT managers can respond. Discussions
> >> >>> will include endpoint security, mobile security and the latest in malware
> >> >>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> >> >>> _______________________________________________
> >> >>> Pytables-users mailing list
> >> >>> Pyt...@li...
> >> >>> https://lists.sourceforge.net/lists/listinfo/pytables-users
> >> >>>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> >
> >>
> >
> >
>
>

-- 
Jacob Bennett
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Class of 2014| ben...@mi...