Re: [Modeling-users] Summary objects and Proxy objects
From: Ernesto R. <er...@si...> - 2003-06-13 01:18:58
Examples for totals (sums) and summaries. As we study economic data to
decide whether a company is doing well, here are some cases:

* Total revenues by month (group by year-month):

    select month(date) as mes, sum(totalAmount)
    from invoices
    where date between '2003-01-01' and '2003-12-31'
    group by month(date)

  or, for more than a year:

    select year(date) as ano, month(date) as mes, sum(totalAmount)
    from invoices
    where date between '2003-01-01' and '2003-12-31'
    group by year(date), month(date)

  The database does this in 2 seconds for 20,000 records.

* Costs, revenues and the difference (profit) by month.

* Customers who bought products with a total amount of more than X
  (send them some gifts):

    select customer, sum(totalAmount) as total
    from invoices
    where date between ...
    group by customer
    having total > 10000

* Costs to produce each order of the month (hours, materials, etc.).

* We use summaries a lot when we have to compare numbers as a whole or in
  groups, for example reconciling accounting with invoices: invoices
  generate accounting entries, but users change invoices after the entries
  have been made, and change other data elsewhere, so we check that the
  accounting is more or less OK and the company can pay VAT to the
  government.

* Consistency checks, e.g. to check that everybody really has one address
  (the query returns the violators):

    select person.id
    from person inner join address on person.id = address.FK_person
    group by person.id
    having count(*) > 1

* Best-sellers:

    select article, count(*) as numberSold
    from productLine
    group by article
    having numberSold > 20000

OK, this could all be done in Python, but I think that would be quite a bit
slower. One of our customers has something like 50,000 delivery notes a
year, and that is not much, as bigger companies (we only have 'small'
customers) can have hundreds of thousands of documents a year. It's amazing
how fast some databases are: they only take 2 or 3 seconds to answer those
questions. I think Python would take at least 10 times longer and consume a
lot of memory.

The implication for vertical inheritance mapping is that we have to
retrieve subtotals for each table and total them in Python as a final step.
There are other operations like average, max, min, etc., but these are
easily done. To really get knowledge out of the data, the result must not
be too big, because we are only human -- perhaps 100-200 rows max? So if
each table summary is 100 records and we sum them for each class of the
inheritance tree, we would have to process fewer than 1000 numbers in
Python, which is fast enough. (In real life, queries will often be much
smaller, with fewer than 100 numbers to process.) A sketch of this merging
step appears after the GROUP BY example below.

The idea is that the query contains no raw SQL, as we need to process the
summary operations at both levels, first in SQL, then in Python. The
'having' operation should also be done at both levels (so the database
returns less data, and Python only has to filter out the final result).
The 'where' condition could be raw, but Modeling already processes it the
right way, so nothing has to be done there.

SQL's GROUP BY returns a tuple for each combination of the grouping values,
but the result really expresses a tree (a list of lists, or a dictionary of
dictionaries). E.g., how many units of each product has every customer
bought, on a per-month basis?

    cust1 product1 month1 100
    cust1 product1 month2 130
    cust1 product1 month4  90   # 0 for month3 doesn't appear in SQL output
    ...
    cust1 product2 month1 150
    cust1 product2 month7 130
    cust1 product3 month2  10
    ...
    cust4 product1 month3  10
    ...
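To illustrate the tree form, here is a minimal Python sketch (the data is
made up, and this is plain Python, not the Modeling API) that folds such
flat rows into a dictionary of dictionaries and then projects one dimension
away by summing:

    # Fold flat GROUP BY rows into a tree: customer -> product -> month -> units.
    rows = [('cust1', 'product1', 'month1', 100),
            ('cust1', 'product1', 'month2', 130),
            ('cust1', 'product1', 'month4',  90),
            ('cust1', 'product2', 'month1', 150),
            ('cust4', 'product1', 'month3',  10)]

    tree = {}
    for customer, product, month, units in rows:
        tree.setdefault(customer, {}).setdefault(product, {})[month] = units

    # Summing the month dimension away is a projection: units per
    # (customer, product), whatever the number of months.
    totals = {}
    for customer, products in tree.items():
        for product, months in products.items():
            totals[(customer, product)] = sum(months.values())

The same loop shape works whichever dimension is dropped, which is what
makes the tree form convenient for summaries.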
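And for the vertical inheritance point above, a minimal sketch of the final
merging step in Python (the table names and figures are invented for
illustration):

    # One list of (month, subtotal) rows per table in the inheritance tree,
    # e.g. the result of one GROUP BY query per class.
    subtotals_per_table = [
        [('2003-01', 1000.0), ('2003-02', 1500.0)],  # e.g. table 'Invoice'
        [('2003-01',  250.0), ('2003-03',  400.0)],  # e.g. table 'CreditNote'
    ]

    # Merge the per-table subtotals into the final totals per month.
    totals = {}
    for table_rows in subtotals_per_table:
        for month, amount in table_rows:
            totals[month] = totals.get(month, 0.0) + amount

    # totals == {'2003-01': 1250.0, '2003-02': 1500.0, '2003-03': 400.0}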
If this were a multi-dimensional structure (customer, product, date, units
bought), every summary would be a projection into a space with fewer
dimensions. This is just what OLAP and data warehousing do: they pick up
data from several tables, several databases, etc., put it into a
multidimensional structure (effectively getting hypercubes) and compute
projections to answer strategic business questions.

This is not for the 0.9 release (nor perhaps for the 1.0 release, I don't
know). Better to go a bit slower but be more solid. (We'll need at least
one more year for our software, although we have to pick all the basic
components now: PyGTK & XML-RPC (client) - Modeling & Webware or Zope /
XML-RPC (middle tier) - PostgreSQL (back-end) - ReportLab (PDF generation)
- Cheetah, etc.)

Erny

----- Original Message -----
From: "Sebastien Bigaret" <sbi...@us...>
To: "modeling-users" <mod...@li...>
Sent: Friday, June 13, 2003 1:55 AM
Subject: Re: [Modeling-users] Summary objects and Proxy objects

"Ernesto Revilla" <er...@si...> wrote:
> Don't take me too seriously, I'm new to this list.
>
> I agree that, especially for summaries like statistics (counts, sums,
> averages, etc.), raw SQL access may be needed; I have seen direct SQL
> calls in some existing products. But they return raw rows, i.e. an
> array: no middle-tier functionality. On the other hand, it may be
> interesting to encapsulate this for later enhancements.
>
> In one of the papers I studied on this theme
> (http://www.ambysoft.com/persistenceLayer.pdf, p. 7, point 8) I saw the
> use of proxy objects, which are lightweight objects with the most basic
> attributes (some user key, a name and a few more). This idea could be
> extended, telling the fetch machinery to return objects with some
> additional, user-requested attributes.

The idea of proxies is quite interesting, and notably pretty
straightforward to implement. Supposing the ability to fetch raw rows is
implemented, the common idiom would be (not speaking about the fetch()
shortcut) something like this:

>>> qual=qualifierWithQualifierFormat('title ilike "%t%"')
>>> fs=FetchSpecification('Book', qual)
>>> fs.setFetchesRawRows(True)
>>> fs.setRawRowsKeyPath('id', 'title', 'author.lastName')
>>> rows=ec.objectsWithFetchSpecification(fs)
>>> print rows
({'id': 1, 'title': 'Gargantua', 'author_lastName': 'Rabelais'},
 {'id': 2, 'title': 'Bouge ton pied...', 'author_lastName': 'Dard'},
 {'id': 4, 'title': "T'assieds pas...", 'author_lastName': 'Dard'},
)

(Note: I consider dicts better than arrays as a result set.) Then it is
simply a matter of declaring:

>>> class GenericProxy: pass

after which the creation and initialization of the proxies is just:

>>> proxies=[GenericProxy() for i in range(len(rows))]
>>> [proxies[i].__dict__.update(rows[i]) for i in range(len(rows))]

That's all! And that's a really nice idea.

> On the other hand, we could have summary objects, because it is very
> common to get sums, counts and averages. The idea is that the result
> set would be an encapsulated collection of summary objects:
>
> objects=ec.fetchSummary(sourceEntityName="Book",
>                         what="count(*), avg(price), sum(toInvoiceLines.amount)",
>                         where="author.name like 'John*'",
>                         groupBy="author, genre",
>                         resultEntityName="BookSummary")
>
> This may look like SQL (OQL is very close), but it has advantages
> over SQL:
> * the dotted notation is allowed (get away from ugly inner joins).
> * it would process inheritance trees
> * you don't have to repeat the grouping fields in the select clause,
>   they are added automatically
>
> The only purpose of the entityName is to give an 'entry entity' to which
> the attributes are related. The result could be just an instance of
> 'SummaryObject', or perhaps an instance of a special BookSummary which
> encapsulates special behaviour.

* You make me realize that my previous answer to Mario, saying ``you'll
  have to be able to distinguish between table alias t1 and table alias
  t2 when declaring attribute author.age, and that's a problem, blah
  blah'', was stupid: the dotted notation obviously solves this naturally
  (the distinction between author.age and author.pygmalion.age is clear).

* You're partially answering the question of real-life examples --
  although I still need to familiarize myself with 'group by' and
  'having' statements.

* This puts what Mario formerly proposed in a different light (a
  different API, but the very same idea as far as I can see) -- just
  because we use a different method here (fetchSummary instead of fetch);
  sorry, sometimes I can't see the forest for the trees because of such
  details :|

  Now that the concept (summaries) is clearer to me, I'm beginning to see
  the ideas behind it. And I guess I'm not far from thinking that, okay,
  fetchSummary() does contain raw SQL pieces, but maybe that's not such a
  big deal if it covers common needs.

* Could this be made clearer: although I cannot really see the
  implications for the moment, I have the feeling that 'group by'
  combined with multiple fetches (walking through the inheritance tree)
  is not only a matter of getting the results and assembling them one
  after the other, is it?

> Should the result be a flat list, or a tree-like list or dictionary?

I cannot answer that: as I said, I'm not familiar enough with this kind
of query to see which form would be the most accurate, or whether it
could depend on the query itself. I'd need different examples and your
opinion here.

BTW, I suggest that we do not consider this as part of the current API
change proposal, but rather as a new feature request.

Last, you wrote:
> On the other hand, we could have summary objects, because it is very
> common to get sums, counts and averages.

I'd like to hear more about those common cases; if you could give some
examples, that would help -- help me, at least ;) since my personal
approach to that kind of thing would be, e.g., to treat the raw rows on
the Python side rather than to ask the RDBMS to do the job.

-- Sébastien.
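For illustration, treating raw rows on the Python side as suggested above
could look like this minimal sketch (the row layout, key names and the
10000 threshold mirror the earlier invoice examples and are invented, not
Modeling's raw-row format):

    # Group raw rows by customer, then compute count/sum/avg per group.
    rows = [{'customer': 'cust1', 'totalAmount': 100.0},
            {'customer': 'cust1', 'totalAmount': 130.0},
            {'customer': 'cust2', 'totalAmount':  90.0}]

    groups = {}
    for row in rows:
        groups.setdefault(row['customer'], []).append(row['totalAmount'])

    summaries = [{'customer': c, 'count': len(a), 'sum': sum(a),
                  'avg': sum(a) / len(a)}
                 for c, a in groups.items()]

    # A 'having' clause is then just a filter over the summaries:
    big_spenders = [s for s in summaries if s['sum'] > 10000]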