2010/11/29 Gerald Britton <gerald.britton@gmail.com>
I want to open up a discussion about how best to store data from large
sources.  By "large" I mean sources such as registers, logs, family
bibles, censuses, member lists and other things that contain many
entries (millions, in the case of a census).  Usually, each entry in
such a document has several columns of data.  For example, a marriage
register at a church will have names of the bride, groom, witnesses
and possibly parents, date, officiating minister's name and other
things.  A death register may include cause of death, place and other
things.  A census may have all sorts of data, including whether the
house was brick or frame and how many stories it had, if the person
was an employer (and if so, how many employees he had) or employee,
whether he was deaf, dumb, blind, crazy (or "lunatic") and how many
sheep he had.

One challenge with this sort of data is the document in which it is
found.  If we treat it as a source (which seems natural to me) then we
have a problem storing all the bits of data we find for one
individual.  You can't use source attributes, since those are shared
in gramps.  We could treat the document as a repository and the
entries as sources from that repository, but that seems unnatural when
you are looking at a book or a bible or a microfilm.  Also, since a
repository can't have a repository (i.e. you can't nest them) and the
real repository (building, web site, etc.) may house many such
sources, how can you bring all these pseudo-repositories together
under that real repository?

Another challenge is all the bits of data we find in this document.
Some should no doubt find their way to other objects: the cause of
death to a death event attribute; the witness names to a marriage
event attribute; the house construction and size to a residence
attribute, perhaps.  However, it is still good (very good, I feel) to
also keep all this data together and tie it to the source in which it
is found.

The Census gramplet addresses this by using event reference
attributes.  So the fact that the person was a lunatic is recorded in
an attribute of the event reference -- Lunatic: yes -- and similarly,
the other attributes.  This solves the immediate problem for censuses
but may not be generally extensible to other documents -- especially
if there are multiple documents for some event that disagree with each
other.  Furthermore, it is not clear to me that this is the best way
to handle this data: it is not really an attribute of the event but
of the source document, since it was recorded in the source document
at the time of the event.

I'm now wondering if we should add attributes to source references
analogous to event references.  If available, this would be a natural
place to store all the bits of data for each entry while keeping one
source object (the book, film, etc.) at the repositories where it can
be found. On the other hand, introducing source reference attributes
may introduce challenges for GEDCOM exports and imports.
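
To make the idea concrete, here is a minimal sketch in plain Python of
what attributes on a source reference could look like.  The class and
field names (Source, SourceRef, attributes) and the sample values are
hypothetical and for illustration only; this is not the Gramps API or
its actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """One shared object per document (the book, film, etc.)."""
    title: str


@dataclass
class SourceRef:
    """A per-entry reference to a shared source (hypothetical model)."""
    source: Source
    page: str = ""
    # Proposed addition: data recorded for this one entry only.
    attributes: dict = field(default_factory=dict)


# Invented sample data, purely for illustration.
census = Source("Example census")
ref_a = SourceRef(census, page="p. 12",
                  attributes={"Lunatic": "yes", "Sheep": "4"})
ref_b = SourceRef(census, page="p. 13",
                  attributes={"Lunatic": "no"})

# Both references point at the same source object,
# but each one keeps its own bits of data.
print(ref_a.source is ref_b.source)
print(ref_a.attributes["Lunatic"], ref_b.attributes["Lunatic"])
```

In this sketch the source stays a single shared object, so it can still
be attached once to the repositories where it is found.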

So let's discuss!  What creative ways can we devise to handle these
sorts of source documents?  Should we extend our data model and, if
so, how?  If we extend the data model, what are the repercussions?

For me,

1. Repository is where you find a source. We should not misuse it.

2. Source is the book/register, or a part of it. The source holds information, and literal transcripts of a source should hence be stored in this object.
A source does _not_ have what we call attributes, a source has "Data". This is not exported to GEDCOM. The Data is not shared, the source is what is shared.

3. An event is something happening to a person/family at a certain time/place. The census event is the census taker passing by and writing info in the census source.

4. You learn from a source information about a person or family, so you want to add information about the person/family in the person/family object. You add this information, e.g. an attribute: Description, Blue eyes.  Source of this attribute is the census source.

I don't see problems here, except for the fact that you can only store the census data in the source as a note if you want it stored. So there is no 'database scheme' for it. You can use Source Data for key-value pairs.

Now, the other way around. You have a person, and you see a source saying green eyes. You go to attributes and you see blue eyes. You wonder whether there is an error. You click on the attribute to see from what source you have this information, you open the census source, and you look at the data inside of it. If you used a note for the data in the census, you can share it in the source reference, and you know what the census said. If you are uncertain and you want to recheck the census, you go to the repository tab and you see where this census is stored, so you can check the source again in the repository.
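
As a rough illustration of that lookup, here is a small sketch in plain Python. All class and field names (Repository, Source, Attribute, Person, data, notes) and the sample values are made up for this example; it does not use the Gramps API and only mirrors the relationships described above.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    name: str


@dataclass
class Source:
    title: str
    data: dict = field(default_factory=dict)          # key-value pairs
    notes: list = field(default_factory=list)         # literal transcripts
    repositories: list = field(default_factory=list)  # where to find it


@dataclass
class Attribute:
    key: str
    value: str
    source: Source  # where this claim came from


@dataclass
class Person:
    name: str
    attributes: list = field(default_factory=list)


# Invented sample data, purely for illustration.
archive = Repository("Example archive")
census = Source(
    "Example census",
    data={"Eye colour": "blue"},
    notes=["Transcript of the entry, copied literally."],
    repositories=[archive],
)
person = Person("Example Person")
person.attributes.append(Attribute("Description", "Blue eyes", census))

# Walk from the attribute back to the source, its data, and its repository.
attr = person.attributes[0]
print(attr.source.title)
print(attr.source.data["Eye colour"])
print([repo.name for repo in attr.source.repositories])
```

Note that the walk never touches an event object: the attribute leads to the source, and the source leads to the repository.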

So, in all this, you normally _don't_ check the census event! It seems stupid to me to store data obtained in the census taking there. At most, I would share a note with the transcript there.

So, in my view, the way the Census gramplet works is wrong.


Gerald Britton

Gramps-devel mailing list