Re: [Modeling-users] Summary objects and Proxy objects
From: Ernesto R. <er...@si...> - 2003-06-13 01:18:58
Examples for totals (sums) and summaries. As we study economic data to
decide whether a company is doing well, here are some cases:

* Total revenues by month (group by year-month):

    select month(date) as mes, sum(totalAmount)
    from invoices
    where date between '2003-01-01' and '2003-12-31'
    group by month(date)

  or, for more than a year:

    select year(date) as ano, month(date) as mes, sum(totalAmount)
    from invoices
    where date between '2003-01-01' and '2003-12-31'
    group by year(date), month(date)

  The database does this in 2 seconds for 20,000 records.

* Costs, revenues and the difference (profit) by month.

* Customers who bought products with a total amount of more than X
  (send them some gifts):

    select customer, sum(totalAmount) as total
    from invoices
    where date between ...
    group by customer
    having total > 10000

* Costs to produce each order of the month (hours, materials, etc.).

* We use summaries a lot when we have to compare numbers as a whole or in
  groups, for example reconciling accounting with invoices: invoices
  generate accounting entries, but users change invoices after the entries
  have been made, and change other data elsewhere, so we check that the
  accounting is more or less OK and the company can pay VAT to the
  government.

* Consistency checks, e.g. to check that everybody really has one address
  (the query returns the violators):

    select person.id
    from person inner join address on person.id = address.FK_person
    group by person.id
    having count(*) > 1

* Best-sellers:

    select article, count(*) as numberSold
    from productLine
    group by article
    having numberSold > 20000

OK, this could all be done in Python, but I think that would be quite a bit
slower. One of our customers has something like 50,000 delivery notes a
year, and that is not much, as bigger companies (we only have 'small'
customers) can have hundreds of thousands of documents a year. It's amazing
how fast some databases are: they only take 2 or 3 seconds to answer those
questions. I think Python would take at least 10 times longer and consume a
lot of memory.

The implication for vertical inheritance mapping is that we have to
retrieve subtotals for each table and total them in Python as a final step.
There are other operations like average, max, min, etc., but these are
easily done. To really get knowledge out of the data, the result must not
be too big, because we are only human -- perhaps 100-200 rows max? So if
each table summary is 100 records and we sum them for each class of the
inheritance tree, we would have to process fewer than 1000 numbers in
Python, which is fast enough. (In real life, queries will often be much
smaller, with fewer than 100 numbers to process.) A sketch of this merging
step appears after the GROUP BY example below.

The idea is that the query contains no raw SQL, as we need to process the
summary operations at both levels, first in SQL, then in Python. The
'having' operation should also be done at both levels (so the database
returns less data, and Python only has to filter out the final result).
The 'where' condition could be raw, but Modeling already processes it the
right way, so nothing has to be done there.

SQL's GROUP BY returns a tuple for each combination of the grouping values,
but the result really expresses a tree (a list of lists, or a dictionary of
dictionaries). E.g., how many units of each product has every customer
bought, on a per-month basis?

    cust1 product1 month1 100
    cust1 product1 month2 130
    cust1 product1 month4  90   # 0 for month3 doesn't appear in SQL output
    ...
    cust1 product2 month1 150
    cust1 product2 month7 130
    cust1 product3 month2  10
    ...
    cust4 product1 month3  10
    ...
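To illustrate the tree form, here is a minimal Python sketch (the data is
made up, and this is plain Python, not the Modeling API) that folds such
flat rows into a dictionary of dictionaries and then projects one dimension
away by summing:

    # Fold flat GROUP BY rows into a tree: customer -> product -> month -> units.
    rows = [('cust1', 'product1', 'month1', 100),
            ('cust1', 'product1', 'month2', 130),
            ('cust1', 'product1', 'month4',  90),
            ('cust1', 'product2', 'month1', 150),
            ('cust4', 'product1', 'month3',  10)]

    tree = {}
    for customer, product, month, units in rows:
        tree.setdefault(customer, {}).setdefault(product, {})[month] = units

    # Summing the month dimension away is a projection: units per
    # (customer, product), whatever the number of months.
    totals = {}
    for customer, products in tree.items():
        for product, months in products.items():
            totals[(customer, product)] = sum(months.values())

The same loop shape works whichever dimension is dropped, which is what
makes the tree form convenient for summaries.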
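And for the vertical inheritance point above, a minimal sketch of the final
merging step in Python (the table names and figures are invented for
illustration):

    # One list of (month, subtotal) rows per table in the inheritance tree,
    # e.g. the result of one GROUP BY query per class.
    subtotals_per_table = [
        [('2003-01', 1000.0), ('2003-02', 1500.0)],  # e.g. table 'Invoice'
        [('2003-01',  250.0), ('2003-03',  400.0)],  # e.g. table 'CreditNote'
    ]

    # Merge the per-table subtotals into the final totals per month.
    totals = {}
    for table_rows in subtotals_per_table:
        for month, amount in table_rows:
            totals[month] = totals.get(month, 0.0) + amount

    # totals == {'2003-01': 1250.0, '2003-02': 1500.0, '2003-03': 400.0}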
If this were a multi-dimensional structure (customer, product, date, units
bought), every summary would be a projection into a space with fewer
dimensions. This is just what OLAP and data warehousing do: they pick up
data from several tables, several databases, etc., put it into a
multidimensional structure (effectively getting hypercubes) and compute
projections to answer strategic business questions.

This is not for the 0.9 release (nor perhaps for the 1.0 release, I don't
know). Better to go a bit slower but be more solid. (We'll need at least
one more year for our software, although we have to pick all the basic
components now: PyGTK & XML-RPC (client) - Modeling & Webware or Zope /
XML-RPC (middle tier) - PostgreSQL (back-end) - ReportLab (PDF generation)
- Cheetah, etc.)

Erny

----- Original Message -----
From: "Sebastien Bigaret" <sbi...@us...>
To: "modeling-users" <mod...@li...>
Sent: Friday, June 13, 2003 1:55 AM
Subject: Re: [Modeling-users] Summary objects and Proxy objects

"Ernesto Revilla" <er...@si...> wrote:
> Don't take me too seriously, I'm new to this list.
>
> I agree that, especially for summaries like statistics (counts, sums,
> averages, etc.), raw SQL access may be needed; I have seen direct SQL
> calls in some existing products. But they return raw rows, i.e. an
> array: no middle-tier functionality. On the other hand, it may be
> interesting to encapsulate this for later enhancements.
>
> In one of the papers I studied on this theme
> (http://www.ambysoft.com/persistenceLayer.pdf, p. 7, point 8) I saw the
> use of proxy objects, which are lightweight objects with the most basic
> attributes (some user key, a name and a few more). This idea could be
> extended, telling the fetch machinery to return objects with some
> additional, user-requested attributes.

The idea of proxies is quite interesting, and notably pretty
straightforward to implement. Supposing the ability to fetch raw rows is
implemented, the common idiom would be (not speaking about the fetch()
shortcut) something like this:

>>> qual=qualifierWithQualifierFormat('title ilike "%t%"')
>>> fs=FetchSpecification('Book', qual)
>>> fs.setFetchesRawRows(True)
>>> fs.setRawRowsKeyPath('id', 'title', 'author.lastName')
>>> rows=ec.objectsWithFetchSpecification(fs)
>>> print rows
({'id': 1, 'title': 'Gargantua', 'author_lastName': 'Rabelais'},
 {'id': 2, 'title': 'Bouge ton pied...', 'author_lastName': 'Dard'},
 {'id': 4, 'title': "T'assieds pas...", 'author_lastName': 'Dard'},
)

(Note: I consider dicts better than arrays as a result set.) Then it is
simply a matter of declaring:

>>> class GenericProxy: pass

after which the creation and initialization of the proxies is just:

>>> proxies=[GenericProxy() for i in range(len(rows))]
>>> [proxies[i].__dict__.update(rows[i]) for i in range(len(rows))]

That's all! And that's a really nice idea.

> On the other hand, we could have summary objects, because it is very
> common to get sums, counts and averages. The idea is that the result
> set would be an encapsulated collection of summary objects:
>
> objects=ec.fetchSummary(sourceEntityName="Book",
>                         what="count(*), avg(price), sum(toInvoiceLines.amount)",
>                         where="author.name like 'John*'",
>                         groupBy="author, genre",
>                         resultEntityName="BookSummary")
>
> This may look like SQL (OQL is very close), but it has advantages
> over SQL:
> * the dotted notation is allowed (get away from ugly inner joins).
> * it would process inheritance trees
> * you don't have to repeat the grouping fields in the select clause,
>   they are added automatically
>
> The only purpose of the entityName is to give an 'entry entity' to which
> the attributes are related. The result could be just an instance of
> 'SummaryObject', or perhaps an instance of a special BookSummary which
> encapsulates special behaviour.

* You make me realize that my previous answer to Mario, saying ``you'll
  have to be able to distinguish between table alias t1 and table alias
  t2 when declaring attribute author.age, and that's a problem, blah
  blah'', was stupid: the dotted notation obviously solves this naturally
  (the distinction between author.age and author.pygmalion.age is clear).

* You're partially answering the question of real-life examples --
  although I still need to familiarize myself with 'group by' and
  'having' statements.

* This puts what Mario formerly proposed in a different light (a
  different API, but the very same idea as far as I can see) -- just
  because we use a different method here (fetchSummary instead of fetch);
  sorry, sometimes I can't see the forest for the trees because of such
  details :|

  Now that the concept (summaries) is clearer to me, I'm beginning to see
  the ideas behind it. And I guess I'm not far from thinking that, okay,
  fetchSummary() does contain raw SQL pieces, but maybe that's not such a
  big deal if it covers common needs.

* Could this be made clearer: although I cannot really see the
  implications for the moment, I have the feeling that 'group by'
  combined with multiple fetches (walking through the inheritance tree)
  is not only a matter of getting the results and assembling them one
  after the other, is it?

> Should the result be a flat list, or a tree-like list or dictionary?

I cannot answer that: as I said, I'm not familiar enough with this kind
of query to see which form would be the most accurate, or whether it
could depend on the query itself. I'd need different examples and your
opinion here.

BTW, I suggest that we do not consider this as part of the current API
change proposal, but rather as a new feature request.

Last, you wrote:
> On the other hand, we could have summary objects, because it is very
> common to get sums, counts and averages.

I'd like to hear more about those common cases; if you could give some
examples, that would help -- help me, at least ;) since my personal
approach to that kind of thing would be, e.g., to treat the raw rows on
the Python side rather than to ask the RDBMS to do the job.

-- Sébastien.
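For illustration, treating raw rows on the Python side as suggested above
could look like this minimal sketch (the row layout, key names and the
10000 threshold mirror the earlier invoice examples and are invented, not
Modeling's raw-row format):

    # Group raw rows by customer, then compute count/sum/avg per group.
    rows = [{'customer': 'cust1', 'totalAmount': 100.0},
            {'customer': 'cust1', 'totalAmount': 130.0},
            {'customer': 'cust2', 'totalAmount':  90.0}]

    groups = {}
    for row in rows:
        groups.setdefault(row['customer'], []).append(row['totalAmount'])

    summaries = [{'customer': c, 'count': len(a), 'sum': sum(a),
                  'avg': sum(a) / len(a)}
                 for c, a in groups.items()]

    # A 'having' clause is then just a filter over the summaries:
    big_spenders = [s for s in summaries if s['sum'] > 10000]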