Thread: [Gramps-devel] Storing data from large sources


I want to open up a discussion about how best to store data from large
sources.  By "large" I mean sources such as registers, logs, family
bibles, censuses, member lists and other things that contain many
entries (millions, in the case of a census).  Usually, each entry in
such a document has several columns of data.  For example, a marriage
register at a church will have names of the bride, groom, witnesses
and possibly parents, date, officiating minister's name and other
things.  A death register may include cause of death, place and other
things.  A census may have all sorts of data, including whether the
house was brick or frame and how many stories it had, if the person
was an employer (and if so, how many employees he had) or employee,
whether he was deaf, dumb, blind, crazy (or "lunatic") and how many
sheep he had.

One challenge with this sort of data is the document in which it is
found.  If we treat it as a source (which seems natural to me) then we
have a problem storing all the bits of data we find for one
individual.  You can't use source attributes, since in Gramps those
belong to the source itself and are shared by every entry that cites
it.  We could treat the document as a repository and the
entries as sources from that repository, but that seems unnatural when
you are looking at a book or a bible or a microfilm.  Also, since a
repository can't have a repository (i.e. you can't nest them) and the
real repository (building, web site, etc.) may house many such
sources, how can you bring all these pseudo-repositories together
under that real repository?

Another challenge is all the bits of data we find in this document.
Some should no doubt find their way to other objects: the cause of
death to a death event attribute; the witness names to a marriage
event attribute; the house construction and size to a residence
attribute, perhaps.  However, it is still good (very good, I feel) to
also keep all this data together and tie it to the source in which it
is found.

The Census gramplet addresses this by using event reference
attributes.  So the fact that the person was a lunatic is recorded in
an attribute of the event reference -- Lunatic: yes -- and similarly,
the other attributes.  This solves the immediate problem for censuses
but may not be generally extensible to other documents -- especially
if there are multiple documents for some event that disagree with each
other.  Furthermore, it is not clear to me that this is the best way
to handle this data.  The data is not really an attribute of the
event but of the source document, since that is where it was recorded
at the time of the event.
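
To make the concern about disagreeing documents concrete, here is a
minimal sketch in plain Python.  The classes Attribute and EventRef
are hypothetical stand-ins for illustration only, not the actual
Gramps classes, and the event id is made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    """Hypothetical key/value pair hung on an event reference."""
    key: str
    value: str


@dataclass
class EventRef:
    """Hypothetical link from a person to an event, with attributes."""
    event_id: str
    attributes: List[Attribute] = field(default_factory=list)


def values_for(ref: EventRef, key: str) -> List[str]:
    """Return every value recorded under `key` on this event reference."""
    return [a.value for a in ref.attributes if a.key == key]


# One person, one census event, two documents that disagree.
ref = EventRef("E0001")
ref.attributes.append(Attribute("Lunatic", "yes"))  # read from document A
ref.attributes.append(Attribute("Lunatic", "no"))   # read from document B

# Both values sit side by side; nothing in the structure itself records
# which document each one came from.
conflicting = values_for(ref, "Lunatic")
```

In this sketch the only trace of where each value came from is the
code comment, which is the gap I am describing.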

I'm now wondering if we should add attributes to source references
analogous to event references.  If available, this would be a natural
place to store all the bits of data for each entry while keeping one
source object (the book, film, etc.) at the repositories where it can
be found. On the other hand, introducing source reference attributes
may introduce challenges for GEDCOM exports and imports.
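
As a rough sketch of what I have in mind, again in plain Python: the
Source and SourceRef classes, their fields, and the sample values
below are all assumptions made up for illustration.  They are not the
current Gramps data model, and they say nothing about how this would
map to GEDCOM.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Source:
    """Hypothetical source object: the book, film, register, etc."""
    title: str
    repositories: List[str] = field(default_factory=list)


@dataclass
class SourceRef:
    """Hypothetical source reference carrying per-entry attributes."""
    source: Source
    page: str
    attributes: Dict[str, str] = field(default_factory=dict)


# One source object, kept at the repository where it can be found.
census = Source("Example census (microfilm)", ["Example archive"])

# Each entry in the document gets its own reference and its own data.
entry_a = SourceRef(census, "entry 1", {"Lunatic": "yes", "Sheep": "12"})
entry_b = SourceRef(census, "entry 2", {"Lunatic": "no"})

# The entries share the source but not the attributes.
same_source = entry_a.source is entry_b.source
```

Here the per-entry data stays with the reference, so two people cited
from the same census do not overwrite each other's values, and a
second document that disagrees would simply be another reference with
its own attributes.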

So let's discuss!  What creative ways can we devise to handle these
sorts of source documents?  Should we extend our data model and, if
so, how?  If we extend the data model, what are the repercussions?

-- 
Gerald Britton
