Re: [Openinteract-dev] Questions about new has_a/has_many/links_to

Hi Simon,

This is going to be a quick, first-pass response to the issues you 
raise ...

For number 1, let me respond with a general comment about caching and 
keeping multiple Perl objects in sync, etc, since I think that is the 
core issue. First, let me say that I have not used caching at the SPOPS 
level, but I have used caching of SPOPS objects at the application 
level in my work. In my opinion, there are two ways of handling this 
issue.

1. SPOPS assumes that the application is keeping track of how many 
copies of an object are in memory and which ones have unsaved changes, 
etc. The only MASTER copy of the object is the saved one. In this case 
SPOPS should not do anything special to try to keep things in sync, 
that's the job of the application level.

2. SPOPS always assumes a 1 to 1 correspondence between the Perl object 
and the object in the database. Do caching at the SPOPS level with 
SPOPS making sure there is never more than one copy of the object in 
memory. Copies of the object are simply multiple references to a single 
cached object. This is the approach used by Tangram if I'm not 
mistaken. Unless I'm missing something, this seems pretty clean and 
straightforward. However, it doesn't address all of the 
consistency issues in contexts where you have multiple processes 
running simultaneously (e.g. multiple Apache children), where there is 
one copy of an object in the database, but multiple copies in the 
memories of various processes, in their individual caches. To address 
this at the SPOPS caching level you have to use some sort of a shared 
memory cache with synchronization/locking mechanisms, which in my 
opinion takes you back to handling the issue at the application level 
again.
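
To make approach (2) concrete, here is a minimal sketch of a per-process 
cache that hands out references to a single object per class and id. The 
IdentityCache package, its fetch() signature, and the loader callback are 
all hypothetical names made up for illustration; none of this is SPOPS 
code, and it only covers the single-process case.

```perl
use strict;
use warnings;

# Hypothetical identity-map cache; not part of SPOPS.
package IdentityCache;

my %CACHE;    # "Class:id" => the single in-memory object

# Return the cached object if one exists; otherwise load it through the
# supplied code ref and remember it for later callers.
sub fetch {
    my ( $class, $obj_class, $id, $loader ) = @_;
    my $key = "$obj_class:$id";
    $CACHE{$key} = $loader->($id) unless exists $CACHE{$key};
    return $CACHE{$key};
}

package main;

my $loads  = 0;
my $loader = sub { $loads++; return { id => $_[0], name => 'original' } };

my $first  = IdentityCache->fetch( 'A', 1, $loader );
my $second = IdentityCache->fetch( 'A', 1, $loader );

# Both variables hold the same reference, so a change through one is
# visible through the other.
$second->{name} = 'changed';
print "same object: ", ( $first == $second ? "yes" : "no" ), "\n";
print "name seen via first: $first->{name}\n";
print "loads: $loads\n";
```

Note that %CACHE lives in one process only, so a second Apache child would 
build its own copy, which is exactly the limitation described above.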

So the bottom line for me is that, unless you are in a context where you 
only have one process running at a time (not the case for my apps), you 
ALWAYS have to handle the issue at the application level anyway. Having 
SPOPS do the caching as in (2) can help you with that, but assumptions 
can never be made at the SPOPS level that even a single cached object 
is necessarily in sync with the database since some other process may 
have changed it behind your back.

On Apr 25, 2004, at 6:29 PM, Vsevolod (Simon) Ilyushchenko wrote:
> 1a. The whole thing won't work properly without caching turned on. 
> Assume that an A has many X'es, and the table X has a column 'a_id'. 
> When I pass a list of ids to the function A->list_of_x_add (similar to 
> linksto_add), the application may have various copies of X'es with 
> those ids floating around, some of whose field values may have been 
> changed. Since (at least in the auto-save case) it does not make sense 
> to just update the 'a_id' field in the database and not the rest of 
> the fields, we need to save the relevant X objects. But we don't know 
> about them without caching!

My proposal here (found in last section under "Fetch" in my 7/3/01 
post) was to pass in the X objects, not the ids ...

For auto_by and lazy_by, two additional methods are created in A, one 
for adding objects to its list of X's and one for removing objects from 
it. These can only be used after the A object has been saved. Their 
primary purpose is to keep the list in memory in sync with what's in 
the database, so when using auto_by or lazy_by it's a good idea to use 
only these methods to add or remove corresponding X's. If the 'name' 
parameter is present, the methods are named add_<name> and 
remove_<name>. If the 'name' parameter is not present they are named 
add_to_<list_field> and remove_from_<list_field>. The method to add X's 
takes an X object or an arrayref of X objects as inputs and returns the 
same object or arrayref to the objects after saving them. The method to 
remove X's takes an id or arrayref of ids and returns the number of X's 
successfully removed.
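
As a rough illustration of the paragraph above, the generated pair of 
methods might behave like the following sketch. It assumes no 'name' 
parameter and a list field called list_of_x, so the methods come out as 
add_to_list_of_x and remove_from_list_of_x. The A and X packages, 
is_saved(), and the fake save() methods are hypothetical stand-ins, not 
SPOPS code, and the datastore side of a removal is left out.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the A and X classes; not SPOPS code.
package A;

sub new {
    my ( $class, %f ) = @_;
    my $self = { %f, list_of_x => [] };
    return bless $self, $class;
}
sub is_saved { return defined $_[0]{id} }
sub save { my ($self) = @_; $self->{id} //= 1; return $self }  # fake id

# Add one X or an arrayref of X's; refuse if this A has not been saved.
sub add_to_list_of_x {
    my ( $self, $x ) = @_;
    die "add_to_list_of_x: save the A object first\n"
        unless $self->is_saved;
    for my $item ( ref $x eq 'ARRAY' ? @$x : ($x) ) {
        $item->{a_id} = $self->{id};    # point the X at its parent
        $item->save;
        push @{ $self->{list_of_x} }, $item;
    }
    return $x;    # same object or arrayref that was passed in
}

# Remove by id or arrayref of ids; return how many X's were dropped.
# Only the in-memory list is touched in this sketch.
sub remove_from_list_of_x {
    my ( $self, $ids ) = @_;
    die "remove_from_list_of_x: save the A object first\n"
        unless $self->is_saved;
    my %drop = map { ( $_ => 1 ) } ( ref $ids eq 'ARRAY' ? @$ids : ($ids) );
    my @keep = grep { !$drop{ $_->{id} } } @{ $self->{list_of_x} };
    my $removed = @{ $self->{list_of_x} } - @keep;
    $self->{list_of_x} = \@keep;
    return $removed;
}

package X;

sub new { my ( $class, %f ) = @_; my %self = %f; return bless \%self, $class }
sub save { my ($self) = @_; $self->{saved} = 1; return $self }  # fake save

package main;

my $parent = A->new;
my $x1     = X->new( id => 10 );
my $x2     = X->new( id => 11 );

# Calling the method on an unsaved A throws, as the proposal implies.
my $ok = eval { $parent->add_to_list_of_x($x1); 1 };
print "unsaved parent accepted? ", ( $ok ? "yes" : "no" ), "\n";

$parent->save;
$parent->add_to_list_of_x( [ $x1, $x2 ] );
my $removed = $parent->remove_from_list_of_x(10);
print "removed: $removed, remaining: ",
    scalar @{ $parent->{list_of_x} }, "\n";
```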

> 1b. However, even the current cache is inadequate for the task. Right 
> now, the first time an object is retrieved, it's saved in the cache. 
> If it's retrieved again, a copy of the object is returned. Thus, 
> whoever asked for the object first, has the "master" copy, meaning 
> that everybody else will see his changes. But if other requestors make 
> changes and the first requestor's copy is saved, their changes will be 
> lost.

I'm not really familiar with SPOPS caching and admit I haven't paid 
attention to your previous posts on caching, so educate me here. My 
understanding is that SPOPS doesn't implement caching, it just provides 
hooks to do it. So is this returning of a new copy of a cached object, 
instead of a reference to the existing cached object, a feature (bug) of 
the hooks in SPOPS or of a particular implementation of caching?

As I mentioned above, I think any caching at the SPOPS level should 
make sure there is only ONE copy of the object in memory. I don't see 
the purpose of having a "master" copy in memory with other copies of 
it. The only "master" copy of the object is the one in the datastore.

> Here is a sample code that illustrates the problem. Assume that A 
> still has many X'es.
>
> my $a = A->fetch(1);
> my $x = $a->list_of_x->[0];
> my $a1 = $x->myA;
>
> Here, logically, $a and $a1 refer to the same object with the same ID. 
> But they are different Perl objects. If I change $a1 and save $a, my 
> changes to $a1 will be lost.
> Is there a reason the cache does not simply return the stored object?

I agree. An SPOPS level cache should always return the same object, not 
a copy.

> 1c. Normally, calling $a->list_of_x_add($x) will make sure that the 
> changes to the 'a_id' field in the X table are saved. There is a fun 
> special case, though - what if $a has been just created and not saved 
> yet? There are two possible behaviors: a) save $a behind the scenes to 
> obtain a_id, or b) throw an error requiring the user to call save() 
> explicitly. Variant a) makes list_of_x_add() behave similarly to the 
> normal case, but does something that the user may not want. Variant 
> b), conversely, exposes some inner workings of SPOPS to the user, but 
> does not do a potentially undesirable save. What is preferable here?

I say (b). Quoting from the same paragraph of my proposal again: "These 
can only be used after the A object has been saved." ... implying that 
it throws an exception otherwise. I don't think this necessarily 
exposes inner workings of SPOPS to the user. I think it just needs to 
be documented that these methods throw exceptions if called for objects 
that are not saved.

> 2. You may have noticed that I used 'has_many', not 'has_a' as Ray 
> originally suggested. I do think it's cleaner to separate them, but if 
> you insist, I will eventually roll them back into one - I just 
> separate them now for the ease of coding.

I'm not sure I've seen how you're using 'has_many' in the 
configuration. It sounds to me though that it's putting the definition 
of the relationship at the other end, that's all. Does this then 
replace the manual_by|auto_by|lazy_by configuration syntax?

I guess I would vote for sticking with only the 'has_a' unless and 
until I see the full detail of the syntax spelled out and can see that 
it doesn't bring up new issues. I spent a lot of time on the syntax I 
proposed and am fairly comfortable that it is general and consistent.

> 3. For the many-to-many 'links_to' case (where A has-many Bs via the 
> linking table X), Ray suggested having the configuration hash in the X 
> class, not in the A class where 'links_to' lives now. This has the 
> added benefit of adding more fields to X if necessary, but IMO also a 
> major drawback of changing the API. Why don't we try to keep the API 
> as constant as possible and leave the 'links_to' stanza in A? We can 
> add new hash keys to specify extra X fields and to create a Perl class 
> corresponding to X if necessary.

On this point (and the previous one now that I think about it), my 
approach regarding where to put the configuration hash was to put it in 
the class which has the fields. The configuration hash for a class 
defines the meaning of each of its fields. It can also add behavior 
related to those fields to other classes. I think it's essential that 
we are consistent about where we put configuration. You propose putting 
the configuration in A ... but why A and not B?

> 4. Ray also suggested two different APIs for the simple has_a case (an 
> X has one A). If a dependent object is autofetched, $x->myA returns an 
> instance of A. However, if the fetch is manual, $x->myA returns a_id, 
> and only $x->fetch_myA returns an actual object. Is there a reason to 
> do it differently?

My thought here was that if myA is an auto or lazy-fetched field, then 
you always assume that $x->{myA} is an object. Otherwise, you always 
assume that $x->{myA} is an id. You still have a convenience method to 
fetch the corresponding object if you need it, but even after fetching 
the object, $x->{myA} is still just the id. It just seemed the most 
consistent to me. Otherwise, for manual fetches you end up with the 
case where you don't know when you access $x->{myA} whether to expect 
an id or an object, since it depends on whether or not you've done the 
manual fetch.
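
A small sketch of the manual-fetch case may make the distinction clearer. 
The A and X packages, the in-memory %A_TABLE standing in for the 
datastore, and the body of fetch_myA are all assumptions for illustration 
only. The point is just that $x->{myA} holds an id both before and after 
the convenience method is called.

```perl
use strict;
use warnings;

# Hypothetical classes; %A_TABLE stands in for the datastore.
package A;

my %A_TABLE = ( 7 => { id => 7, name => 'parent seven' } );

sub fetch {
    my ( $class, $id ) = @_;
    my $row = $A_TABLE{$id} or return;
    my %copy = %$row;
    return bless \%copy, $class;
}

package X;

sub new { my ( $class, %f ) = @_; my %self = %f; return bless \%self, $class }

# Manual fetch: look up the A object on request and return it,
# without storing it back into the myA field.
sub fetch_myA {
    my ($self) = @_;
    return A->fetch( $self->{myA} );
}

package main;

my $x     = X->new( myA => 7 );
my $a_obj = $x->fetch_myA;

print "fetched: $a_obj->{name}\n";
print "field is still an id: ", ( ref $x->{myA} ? "no" : "yes" ), "\n";
```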

> 5. The issue of avoiding circular saves can be addressed simply by 
> setting a certain flag after an object is saved and checking for this 
> flag each time an object is reached in the relationship graph during 
> the save. (Obviously, this will require full caching as described 
> above.) Let me know if this for some reason won't work.

Why does this require full caching? Maybe an example would help. I 
don't think any of what I proposed requires caching, just the 
assumption that consistency is being maintained, with or without a 
cache, by the application level logic.
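
For instance, here is one way the "already saved" check could be done 
without any cache, by tracking the Perl objects reached during a single 
save pass. This is only a sketch under my own assumptions: save_graph(), 
the 'related' key, and the @saved_order log are hypothetical and not part 
of SPOPS or of either proposal.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Hypothetical save routine; "related" holds references to linked objects.
my @saved_order;

sub save_graph {
    my ( $obj, $seen ) = @_;
    $seen ||= {};                            # one visited set per save pass
    return if $seen->{ refaddr($obj) }++;    # already reached in this pass
    push @saved_order, $obj->{name};         # stand-in for the real save
    save_graph( $_, $seen ) for @{ $obj->{related} || [] };
    return;
}

my $a_obj = { name => 'a', related => [] };
my $x_obj = { name => 'x', related => [$a_obj] };
push @{ $a_obj->{related} }, $x_obj;         # a -> x -> a, a cycle

save_graph($a_obj);
print "saved: @saved_order\n";
```

The visited set is keyed on the Perl object itself, so two separate 
in-memory copies of the same row would each be saved once. Avoiding that 
is the consistency question from point 1, not a circularity problem.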

Thanks again, Simon, for all your work on this area ...

Ray Zimmerman
Director, Laboratory for Experimental Economics and Decision Research
428-B Phillips Hall, Cornell University, Ithaca, NY 14853
phone:  (607) 255-9645       fax: (815) 377-3932