Re: [Pytables-users] details of user's guide performance plots

Hi Damon,

On Friday 25 February 2005 20:44, damon fasching wrote:
> I wonder if someone can shed a little light on Figures
> 6.1 and 6.2 in the User's Guide.  The horizontal axis
> is labeled "Number of rows".  I assume from the scale
> on that axis that this is "Number of rows in table"
> and not "Number of rows accessed".  That isn't stated
> directly anywhere in the text, though it is suggested
> by the wording of the 4th paragraph in section 6.2.2.
> (I assume a 900MHz machine could not access 6e+08 rows
> in a second...)

Yes, it's "Number of rows in table".

> If the axis is actually "Number of rows in table",
> does anyone know roughly how many rows were accessed
> for each data point?  Is the number of rows accessed
> the same for all data points or does the number of
> rows satisfying "table.where(table.cols.var1 <= 20)"
> grow with table size?  If it grows with table size, is
> it linear?

Well, I chose a normal distribution (see bench/search-bench.py and
bench/search-bench-rnd.sh), with the aim that the number of selected
rows would remain more or less constant regardless of the table
size. I must say, however, that I've had only moderate success with
that. The only thing I can assure you of is that the number of selected
rows is very small compared with the total number of rows, especially
for very large tables. In the pytables 1.0 branch, I've reworked the
benchmark so that I can better control the number of selected rows.
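
To illustrate why a fixed distribution does not by itself keep the
selection constant, here is a minimal sketch. It is not the code in
bench/search-bench.py; the function names and the mean, spread and
target values are made up for illustration. With a fixed normal
distribution and a fixed threshold, the expected number of matching
rows is simply proportional to the table size, so keeping it roughly
constant means changing the spread (or the threshold) as the table
grows.

```python
from statistics import NormalDist


def expected_selected(nrows, threshold, mean, sigma):
    """Expected number of rows with var1 <= threshold, var1 ~ N(mean, sigma)."""
    return nrows * NormalDist(mean, sigma).cdf(threshold)


def sigma_for_target(nrows, threshold, mean, target):
    """Spread that puts about `target` rows at or below `threshold`.

    Assumes target / nrows < 0.5 and threshold < mean (hypothetical setup).
    """
    # z-score of the standard normal whose lower tail has mass target/nrows
    z = NormalDist().inv_cdf(target / nrows)
    return (threshold - mean) / z


if __name__ == "__main__":
    # Hypothetical numbers: mean 1000, threshold 20, aim for ~100 matches.
    for nrows in (10**4, 10**6, 10**8):
        sigma = sigma_for_target(nrows, 20, 1000.0, 100)
        print(nrows, round(sigma, 1),
              round(expected_selected(nrows, 20, 1000.0, sigma)))
```

Under these assumptions the required spread shrinks as the table
grows, which gives an idea of why hitting a constant count across many
table sizes is fiddly.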

> Finally, for the table being accessed, are the rows
> ordered by "var1", in which case there is no disk
> seeking going on, or are rows with various values of
> "var1" scattered throughout the table, in which case
> the index is accessed sequentially, but the data would
> be more or less randomly accessed, i.e. would require
> disk seeks.

The values should be scattered throughout the tables, as I chose
a *random* normal distribution to fill the table.
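
A small sketch of what "scattered" means here, using only the Python
standard library rather than PyTables itself. The mean, spread, seed
and function name are assumptions for illustration, not the values
used in the benchmark. Rows are filled in random order, so the rows
that satisfy var1 <= 20 end up at non-adjacent positions.

```python
import random


def scattered_hits(nrows, threshold=20.0, mean=100.0, sigma=40.0, seed=0):
    """Return the row numbers whose random normal value is <= threshold."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    var1 = [rng.gauss(mean, sigma) for _ in range(nrows)]
    return [i for i, v in enumerate(var1) if v <= threshold]


hits = scattered_hits(100_000)
# Distance between consecutive matching rows; all 1s would mean contiguous.
gaps = [b - a for a, b in zip(hits, hits[1:])]
print(len(hits), "matching rows; largest gap between neighbours:", max(gaps))
```

Reading the matching rows from disk in this layout means jumping
between positions instead of reading one contiguous block.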

> I can only make sense of the figures if I first assume
> that the horizontal axis should read "Number of rows
> in table".  If this is correct (or even if it is not
> correct) perhaps it could be clarified in the text
> and/or axis label.

I'll try to clarify this point in forthcoming benchmarks.

> Given that, I have to make one further assumption to
> understand the figures, but in this case I'm not sure
> what is correct.  I can either assume that a large
> number of rows is accessed, that this number grows
> linearly with table size and that rows in the table
> appear in order of variable "var1".  Or, I can assume
> that a small number of rows is accessed, that this
> number may grow (probably sublinearly) with table
> size, and that rows with a given value of var1 are
> scattered throughout the table.  Which of these
> assumptions, if either, is correct?

The second one is the correct one.

Cheers,

-- 
>qo<   Francesc Altet     http://www.carabos.com/
V  V   Cárabos Coop. V.   Enjoy Data
 ""