Re: [Gramps-devel] Database performance enhancements for 2.2

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

I forgot about the fact that the person maintaines references to the source
already. In that case, could the source also maintain back-references to each
person that references it? That way you would not need a new table at all. The
process to remove a source would be as follows:

1)Look up the source in the source table
  for each person referenced by the source:
  2)Look up the person in the person table
  3)Remove the source reference from the person
4)Remove the source from the source table

That seems simple. What am I missing?

~Brian

--- Alex Roitman <sh...@gr...> wrote:

> I am not sure which message to respond to :-)
> So I am replying to the last one so far, but I really
> mean to continue a broader discussion.
> 
> I just thought I'd clear the details of what we discussed
> earlier today on IRC, and I think it addresses most of
> the concerns.
> 
> What is being proposed is the table with entries like this:
> key: a source_handle, such as 'JHGSDFHSDF2342SJDHFS'
> data: a (pickled) Set of the following form:
>    [ ('Person','IUN78VUYV786V87U'),
>      ('Place', 'OIU87TY65ADS98ADS'),
>      ('Family','WI0WF0WF564WF87FE'),
>      ('Person','blah'),
>      ('Event','blah2'), ]
> and so on. I may have switched the type/handle order in a tuple,
> but this is not really important.
> 
> These would identify all the primary objects that
> refer to the source. Adding a reference to this source
> will set-add a tuple to this record (set-add meaning
> that if such tuple already exists then there's no actual
> addition -- union operation). Removing will have to be
> smarter and make sure that referring object from which
> the reference is removed does not have any remaining
> references to that source.
> 
> Same table would also hold referring objects' handles
> for other references such as Media objects, Places,
> Events, Person, and Family. Handles are unique, so
> any handle can be the key without worrying of mixing
> them up. Example:
> 
> key: a media_handle, such as 'media_handle'
> data: a (pickled) Set of the following form:
>    [ ('Person','234IU234N78VUYV786V87U'),
>      ('Place', '456OIU87TY65ADS98ADS'),
>      ('Source','WI0WF0WF564WF87FE'),
>      ('Person','blah234'),
>      ('Event','blah234'), ]
> 
> Now, two important concerns are being addressed as follows:
> 1. A person 'blah' may refer a given source 15 times. Smart
>    removal will skip removing the tuple from the set in the
>    above reference_map data unless the last reference is being
>    removed.
> 2. A person 'blah' may refer the source not only from
>    the actual primary Person object, but also from its
>    secondary members such as Name, Address, Attribute, etc.
>    Currently this is handled by recursive checks through
>    all the members and their members etc. We will leave
>    this exactly as it is. This should not have any noticeable
>    performance penalty, because the primary object is
>    already in memory and no DB lookup is needed. In essence,
>    this would be just checking selected fields of selected
>    members.
> 
> So Julio's concern that multiple references will be mixed up
> seems not to really be a concern here -- all we need is
> from 100000 people to identify 50 that refer to this source.
> Then we can quickly search the 50 correct people for details
> of specific refs (can be 15 per person, both from different
> members as well as the Person object itself). If all 100000
> people refer to that source then we don't gain anything,
> but it's tough luck, but then no other optimization can go
> around that either. Our goal is to make search scale as O(K)
> where K is the number of existing refs, as opposed to O(N)
> where N is the database size. The proposed reference_map achieves
> that goal.
> 
> As for Brian's concern: I think you mis-placed the keys in
> your example. When Person refers to the source, Person object
> already stores source handle (along with more reference-related
> data like page number etc). So for a given source we know
> exactly which sources it refers. The problem is how to efficiently
> do the opposite, i.e. given the source find people referring
> it. At present we walk the db (bad). The proposed table would
> only store such back references, so given the source we would
> be able to identify the persons referring it with a single lookup.
> 
> So, to delete a source we do a single lookup, find handles
> of all N objects referring to the source, and then modify those
> objects (clean up the refs) and delete the source. We need
> as many DB lookups and edits as there are records referencing
> the source, and this is a reasonable thing. Plus one initial
> lookup and one final edit in a reference_map, for any number of
> referring objects. In an alternative setup, we just transferring
> same number of edits from Person table to this associate table:
> each reference has to be found and deleted, right?
> 
> Finally, Benny's suggestion of Person-Source reference table
> would work just fine for the relational database. We're in a
> situation both better and worse than that. Better because we
> are not limited to a one-to-one relationship of relational
> databases. Our Person object can hold as many references
> to any number of source as needed, with no additional effort.
> This is impossible in a relational db with the fixed number
> of columns in a table, hence people usually resort to the
> associated tables. We don't have to.
> 
> Worse because we can't ask a simple SQL question of "give
> me people referenced by source "blah" and have DB do the
> work for us. The proposed reference_map would help us there.
> For other kinds of queries we'd have to do some more footwork.
> Either additional tables or smart indexes or something
> else.
> 
> In any case, having Person-Source table would be OK, except
> (1) we don't need it for Person->Source references, and
> (2) we would need similar tables for other reference types.
> A single table with all the references is arguably better.
> Especially since a single lookup is needed for any
> reference-related query. Also, with the Person-Source table
> multiple references become a problem. Finally, this means
> rewriting/rethinking lots of reference-related code,
> while the proposed setup is actually adding some code.
> 
> Disclaimer: I know very little about databases, just what
> I learned while hacking on gramps, plus some FoxPro hacking
> over ten years ago :-)
> 
> Alex
> 
> -- 
> Alexander Roitman   http://www.gramps-project.org
> 

Re: [Gramps-devel] Database performance enhancements for 2.2

Gramps, the open source genealogy program

Re: [Gramps-devel] Database performance enhancements for 2.2