Thread: [Modeling-users] Modeling performance for large number of objects?
Status: Abandoned
From: Wolfgang K. <wol...@gm...> - 2004-12-20 11:09:57
Hello,

given the significant penalty for the creation of Python objects indicated by most benchmarks I have seen so far, I wonder how, and how well, Modeling deals with this issue...?

TIA, best regards,
Wolfgang Keller
From: John L. <jl...@gm...> - 2004-12-20 13:05:32
On Mon, 20 Dec 2004 11:43:07 +0100, Wolfgang Keller <wol...@gm...> wrote:
> given the significant penalty for the creation of Python objects indicated
> by most benchmarks I have seen so far, I wonder how and how well Modeling
> deals with this issue...?

Usually, the penalty does not interfere with the job at hand. However, when you actually need to manipulate a large number of objects, the Modeling overhead can be quite noticeable; when said manipulation is only for querying, using rawRows avoids most of the creation overhead, but when you have to modify the objects in question, you might find yourself waiting quite a long time for saveChanges to complete.

One reference point you might find useful is that when loading 3000 objects from a database, modifying them, and then saving the changes, on a 700MHz P3 notebook, the loading took about 40 seconds, and the saving, 200. That's 20 times what a direct SQL script would've taken. On the other hand, loading 100000 objects using rawRows takes about 20 seconds on this same machine. That's 4 times what the straight SQL would've taken.

Of course, in both cases, writing the SQL script would've taken a *lot* longer than the difference in run time, for me. However, it's obvious that there are cases where the runtime difference overpowers the developer time difference...

--
John Lenton (jl...@gm...) -- Random fortune: bash: fortune: command not found
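To make the two access patterns above concrete, here is a minimal sketch; the 'Book' entity, its accessors and the exact fetch() signature are assumptions to be checked against the Modeling User Guide, not verbatim API.

    from Modeling.EditingContext import EditingContext

    ec = EditingContext()

    # Query-only path: rawRows returns plain dictionaries instead of fully
    # initialized custom objects, which avoids most of the creation overhead.
    rows = ec.fetch('Book', rawRows=1)
    titles = [row['title'] for row in rows]

    # Read-modify-write path: full objects are fetched, changed in memory, and
    # written back with a single saveChanges() call -- the slow path timed above.
    books = ec.fetch('Book')
    for book in books:
        book.setTitle(book.getTitle().strip())
    ec.saveChanges()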
From: Wolfgang K. <wol...@gm...> - 2004-12-20 13:28:31
Hello,

and thanks for your reply.

> Usually, the penalty does not interfere with the job at hand.

Oops, why? Obviously, when all objects get read into memory at startup of the application server, and written back only at night, then...

> One reference point you might find useful is that when loading 3000
> objects from a database, modifying them, and then saving the changes,
> on a 700MHz P3 notebook, the loading took about 40 seconds, and the
> saving, 200. That's 20 times what a direct SQL script would've taken.

This gives me an idea, thanks. A multiplier of 20 is quite significant IMHO.

> Of course, in both cases, writing the SQL script would've taken a
> *lot* longer than the difference in run time, for me. However, it's
> obvious that there are cases where the runtime difference overpowers
> the developer time difference...

I was wondering whether Modeling would be suitable as a persistence layer for a Python application that needs to process (create - transform - store) rather large amounts of data.

The question for me is whether Modeling tries to cut the hourglass-display time to the unavoidable minimum (dependent on the database), and/or whether there would be some other way to do so, by some kind of smart caching of objects or by maintaining some kind of pool of pre-created object instances.

Best regards,
Wolfgang Keller
From: John L. <jl...@gm...> - 2004-12-20 13:44:10
On Mon, 20 Dec 2004 14:21:40 +0100, Wolfgang Keller <wol...@gm...> wrote:
> Hello,
>
> and thanks for your reply.
>
> > Usually, the penalty does not interfere with the job at hand.
>
> Oops, why?
>
> Obviously, when all objects get read into memory at startup of the
> application server, and written back only at night, then...

Because this (reading in all the objects, modifying them all, and saving them all) is not the usual use case. Usually you might *display* all the objects (where rawRows comes in handy), and then the user selects one of these objects to actually modify (so you fault the raw object into a real one, work on it, and saveChanges). You still have a 20x penalty, but it's much less than a second in this use case.

> > One reference point you might find useful is that when loading 3000
> > objects from a database, modifying them, and then saving the changes,
> > on a 700MHz P3 notebook, the loading took about 40 seconds, and the
> > saving, 200. That's 20 times what a direct SQL script would've taken.
>
> This gives me an idea, thanks. A multiplier of 20 is quite significant
> IMHO.
>
> > Of course, in both cases, writing the SQL script would've taken a
> > *lot* longer than the difference in run time, for me. However, it's
> > obvious that there are cases where the runtime difference overpowers
> > the developer time difference...
>
> I was wondering whether Modeling would be suitable as a persistence layer
> for a Python application that needs to process (create - transform - store)
> rather large amounts of data.
>
> The question for me is whether Modeling tries to cut the
> hourglass-display time to the unavoidable minimum (dependent on the
> database), and/or whether there would be some other way to do so, by some
> kind of smart caching of objects or by maintaining some kind of pool of
> pre-created object instances.

Modeling does cache the objects, and only saves those objects that have effectively changed, so depending on your actual use cases you might be surprised at how well it works. The loading, modifying and saving of all the objects is pretty much the worst case; Modeling isn't meant (AFAICT) for that kind of batch processing. It certainly is convenient, though :)

Of course, maybe Sébastien has a trick up his sleeve as to how one could go about using Modeling for batch processing...

--
John Lenton (jl...@gm...) -- Random fortune: bash: fortune: command not found
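A sketch of that display-then-edit pattern; the entity, its accessors and the qualifier syntax are again assumptions rather than verbatim Modeling API, so check the User Guide for the exact forms.

    from Modeling.EditingContext import EditingContext

    ec = EditingContext()

    # Display: cheap raw rows, one dictionary per database row.
    rows = ec.fetch('Book', rawRows=1)

    # The user picks one row; fetch just that row as a full object so the
    # editing context can track the modification (qualifier syntax assumed).
    picked = rows[0]
    book = ec.fetch('Book', qualifier='id == %s' % picked['id'])[0]

    # Only this single object pays the full creation/saving overhead.
    book.setTitle('Revised title')
    ec.saveChanges()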
From: Wolfgang K. <wol...@gm...> - 2004-12-20 19:23:03
> > > Usually, the penalty does not interfere with the job at hand.
> >
> > Oops, why?
*snip*
> Usually you might
> *display* all the objects (where rawRows comes in handy), and then the
> user selects one of these objects to actually modify (so you fault the
> raw object into a real one, work on it, and saveChanges).

*click* Of course.

> Of course, maybe Sébastien has a trick up his sleeve as to how one
> could go about using Modeling for batch processing...

Not just for batch-processing... In fact my question was raised when I read the article about ERP5 on pythonology.org, with the performance values they claimed for ZODB with their ZSQLCatalog add-on. I would guess that their performance claims are only valid if all the queried objects are in fact in memory...?

Best regards
Wolfgang Keller
From: Sebastien B. <sbi...@us...> - 2004-12-21 14:05:21
Hi Wolfgang, John and all,

Thanks John for giving the figures for the overhead induced by the framework when creating/manipulating objects. I'm currently away from my computer and I only have web access, so it's hard to try and compare anything in those conditions ;)

Wolfgang Keller <wol...@gm...> wrote:
> The question for me is whether Modeling tries to cut the
> hourglass-display time to the unavoidable minimum (dependent on the
> database), and/or whether there would be some other way to do so, by some
> kind of smart caching of objects or by maintaining some kind of pool of
> pre-created object instances.

For the time being the framework does not use any kind of pool of pre-created objects. On the other hand, it caches database snapshots so that data that has already been fetched is not fetched again (this avoids a time-expensive round-trip to the database).

John Lenton <jl...@gm...> wrote:
> Modeling [...] only saves those objects that have effectively changed,
> so depending on your actual use cases you might be surprised at how
> well it works. The loading, modifying and saving of all the objects is
> pretty much the worst case; Modeling isn't meant (AFAICT) for that
> kind of batch processing. It certainly is convenient, though :)

Right: when saving changes the framework uses that cache to save only the objects that were actually modified/deleted. And I definitely agree w/ John, in that performance will highly depend on your particular use-case -- maybe you could be more explicit about it? When you say that you need "to process (create - transform - store) rather large amounts of data", do you mean that every single fetched object will be updated and stored back in the database? If this is the case, as John already pointed out, this is the worst case and the most time-consuming process you'll have w/ the framework.

John Lenton <jl...@gm...> wrote:
> Of course, in both cases, writing the SQL script would've taken a
> *lot* longer than the difference in run time, for me. However, it's
> obvious that there are cases where the runtime difference overpowers
> the developer time difference...

...and when runtime matters, you can also use the framework on sample data, extract the generated SQL statements and then directly use those statements in the real batch.

Wolfgang Keller <wol...@gm...> wrote:
> In fact my question was raised when I read the article about ERP5 on
> pythonology.org, with the performance values they claimed for ZODB with
> their ZSQLCatalog add-on. I would guess that their performance claims are
> only valid if all the queried objects are in fact in memory...?

I didn't read that article (didn't search for it either, I admit -- do you have the hyperlink at hand?), but I suspect that the performance mostly comes from the fact that the ZODB.Persistent mixin class is written in C: while the overhead for object creation is probably still the same, the process of fully initializing an object (assigning values to attributes) is much quicker (as far as I remember, it directly sets the object's __dict__, so yes, that's fast ;)

The framework spends most of its initialization time in KeyValueCoding (http://modeling.sourceforge.net/UserGuide/customobject-key-value-coding.html), examining objects and finding the correct way of setting the attributes. While this allows a certain flexibility, I now tend to believe that most applications pay the price for a feature they do not need (for example, the way attributes' values are assigned by the framework should probably be cached per class rather than systematically determined for every object). [1]

John Lenton <jl...@gm...> wrote:
> Of course, maybe Sébastien has a trick up his sleeve as to how one
> could go about using Modeling for batch processing...

Well, that's always hard to tell without specific use-cases, but the general advice is:

- use the latest Python,
- use new-style classes rather than old-style ones,
- activate MDL_ENABLE_SIMPLE_METHOD_CACHE (http://modeling.sourceforge.net/UserGuide/env-vars-core.html),
- specifically for batch processing: be sure to read http://modeling.sourceforge.net/UserGuide/ec-discard-changes.html (a sketch of that pattern follows below).

And of course, we'll be happy to examine your particular use-cases to help you optimize the process.

-- Sébastien.

[1] And thinking a little more about this, I now realize that the way this is done in the framework at initialization time is pretty stupid (the KVC mechanism should definitely be cached somehow at this point: since the framework creates the object before initializing it, there is absolutely no reason for different objects of the same class to behave differently wrt KVC at this point)... I'll get back on this for sure. For the curious, this is done in DatabaseContext.initializeObject(), lines 1588-1594. For the record, I'll add that the prepareForInitializationWithKeys() stuff is also not needed.
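To illustrate the batch-processing advice above, here is a rough sketch of chunked creation with a short-lived editing context per chunk; the Record model class, its fromSource() helper and the chunk size are purely illustrative assumptions, and the exact discard mechanism is the one described on the ec-discard-changes page.

    from Modeling.EditingContext import EditingContext
    from MyApp.Record import Record   # hypothetical class generated from your model

    CHUNK = 500

    def save_chunk(raw_rows):
        ec = EditingContext()                 # fresh, empty context for this chunk
        for raw in raw_rows:
            record = Record.fromSource(raw)   # hypothetical transform step
            ec.insert(record)
        ec.saveChanges()                      # one round of INSERTs for this chunk
        # 'ec' is dropped here, releasing its cached snapshots so memory stays flat

    def store(source_rows):
        pending = []
        for raw in source_rows:
            pending.append(raw)
            if len(pending) == CHUNK:
                save_chunk(pending)
                pending = []
        if pending:
            save_chunk(pending)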
From: Wolfgang K. <wol...@gm...> - 2004-12-21 16:43:39
Hello,

and thanks for your reply.

> > In fact my question was raised when I read the article about ERP5 on
> > pythonology.org, with the performance values they claimed for ZODB with
> > their ZSQLCatalog add-on. I would guess that their performance claims are
> > only valid if all the queried objects are in fact in memory...?
>
> I didn't read that article (didn't search for it either, I admit -- do you
> have the hyperlink at hand?)

http://www.pythonology.org/success&story=nexedi

They claim that "Reading the Zope object database is 10 to 100 times faster than retrieving a row from the fastest relational databases available on the market today".

And about ZSQLCatalog in particular: "a Zope database with more than 2,000,000 objects can be queried with statistical methods in a few milliseconds".

Best regards,
Wolfgang Keller
From: Wolfgang K. <wol...@gm...> - 2004-12-22 18:15:16
> And I definitely agree w/ John, in that performance will highly depend on
> your particular use-case -- maybe you could be more explicit about it?

The specific envisaged application case would be as a persistence framework for a toolkit for extracting, transforming and forwarding data from various sources to various sinks - a re-implementation in 100% pure (C)Python of the Retic toolkit, which is currently implemented in Jython.

But the question was also about general "enterprise" application cases, like an ERP or CMMS system.

Best regards,
Wolfgang Keller