Re: [Gramps-devel] Storing data from large sources

Benny Malengier wrote:
>
>
> 2010/11/29 Gerald Britton <ger...@gm... 
> <mailto:ger...@gm...>>
>
>     I want to open up a discussion about how best to store data from large
>     sources.  By "large" I mean sources such as registers, logs, family
>     bibles, censuses, member lists and other things that contain many
>     entries (millions, in the case of a census).  Usually, each entry in
>     such a document has several columns of data.  For example, a marriage
>     register at a church will have names of the bride, groom, witnesses
>     and possibly parents, date, officiating minister's name and other
>     things.  A death register may include cause of death, place and other
>     things.  A census may have all sorts of data, including whether the
>     house was brick or frame and how many stories it had, if the person
>     was an employer (and if so, how many employees he had) or employee,
>     whether he was deaf, dumb, blind, crazy (or "lunatic") and how many
>     sheep he had.
>
>     One challenge with this sort of data is the document in which it is
>     found.  If we treat it as a source (which seems natural to me) then we
>     have a problem storing all the bits of data we find for one
>     individual.  You can't use source attributes, since those are shared
>     in gramps.  We could treat the document as a repository and the
>     entries as sources from that repository, but that seems unnatural when
>     you are looking at a book or a bible or a microfilm.  Also, since a
>     repository can't have a repository (i.e. you can't nest them) and the
>     real repository (building, web site, etc.) may house many such
>     sources, how can you bring all these pseudo-repositories together
>     under that real repository?
>
>     Another challenge is all the bits of data we find in this document.
>     Some should no doubt find their way to other objects: the cause of
>     death to a death event attribute; the witness names to a marriage
>     event attribute; the house construction and size to a residence
>     attribute, perhaps.  However, it is still good (very good, I feel) to
>     also keep all this data together and tie it to the source in which it
>     is found.
>
>     The Census gramplet addresses this by using event reference
>     attributes.  So the fact that the person was a lunatic is recorded in
>     an attribute of the event reference -- Lunatic: yes -- and similarly,
>     the other attributes.  This solves the immediate problem for censuses
>     but may not be generally extensible to other documents -- especially
>     if there are multiple documents for some event that disagree with each
>     other.  Furthermore, it is not clear to me that this is the best way
>     to handle this data since the data is not really an attribute of the
>     event but of the source document since the data was recorded in the
>     source document at the time of the event.
>
>     I'm now wondering if we should add attributes to source references
>     analogous to event references.  If available, this would be a natural
>     place to store all the bits of data for each entry while keeping one
>     source object (the book, film, etc.) at the repositories where it can
>     be found. On the other hand, introducing source reference attributes
>     may introduce challenges for GEDCOM exports and imports.
>
>     So let's discuss!  What creative ways can we devise to handle these
>     sorts of source documents?  Should we extend our data model and, if
>     so, how?  If we extend the data model, what are the repercussions?
>
>
> For me,
>
> 1. Repository is where you find a source. We should not misuse it

I agree.  The repository for a census might be "National Archives".

>
> 2. Source is the book/register, or a part of it. The source holds 
> information, and literal transcripts of a source should hence be 
> stored in this object.

Yes.  The source for a census might be "1851 England Census".

You could store literal transcripts here, but you would have a large 
number of them.  Wouldn't it be better to store them in a Source 
Reference where you would only have the transcript of a page?

> A source does _not_ have what we call attributes, a source has "Data". 
> This is not exported to GEDCOM. The Data is not shared, the source is 
> what is shared.
>
> 3. An event is something happening to a person/family at a certain 
> time/place. A census event is the census taker passing by and writing 
> info in the census source.

Yes.  A census event will in general have several people attached to 
it.  It will also have a census source with the source reference 
containing a full reference to its page, and possibly a transcript.  (On 
my ToDo list).

>
> 4. You learn from a source information about a person or family, so 
> you want to add information about the person/family in the 
> person/family object. You add this information, eg an attribute: 
> Description, Blue eyes.  Source of this attribute is the census source.

OK, this is where we have a problem.   One of the reasons that I wrote 
the census add-ons is that it is common to get contradictory 
information.  You want to record all this information against a 
Person/Census combination.   The natural place to store this is as an 
attribute on the event reference object.
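
As a rough illustration of that idea, here is a minimal sketch in plain
Python.  The class and function names are hypothetical and are not the
Gramps API, and the second census name is made up for the example.  It
only shows that attributes kept on each person/event link can disagree
without overwriting each other.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EventRef:
    """Hypothetical link between one person and one event.

    The attributes live on the link, not on the person or the event,
    so each Person/Census combination keeps its own values.
    """
    person: str
    event: str
    attributes: Dict[str, str] = field(default_factory=dict)


def values_for(refs: List[EventRef], person: str, key: str) -> Dict[str, str]:
    """Collect one attribute for a person, keyed by event name."""
    return {
        ref.event: ref.attributes[key]
        for ref in refs
        if ref.person == person and key in ref.attributes
    }


# Illustrative data only: two censuses that contradict each other.
refs = [
    EventRef("Person A", "1851 England Census", {"Description": "Blue eyes"}),
    EventRef("Person A", "1861 England Census", {"Description": "Green eyes"}),
]

eye_colours = values_for(refs, "Person A", "Description")
```

Both values survive side by side, which is the point of recording them
per event reference rather than as a single attribute on the person.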

>
> I don't see problems here, except for the fact that you can only store 
> the census data in the source as a note if you want it stored. So 
> there is no 'database scheme' for it. You can use Source Data for 
> key-value pairs.

I don't like the idea of using Source Data.  Storing a transcript 
against a source and/or source reference as a shared Note is a good idea.

>
> Now, the other way around. You have a person, and you see a source 
> saying green eyes. You go to attributes and you see blue eyes. You 
> wonder if there is no error.

Good example.  You might have added it from a census or from another source.

> You click on the attribute to see from what source you have this 
> information, you open the census source, and you look at the data 
> inside of it. If you used a note for the data in the census, you can 
> share it in the source reference, and you know what the census said. 
> If you are uncertain and you want to recheck the census, you go to the 
> repository tab and you see where this census is stored to check in the 
> repository the source again.

Well at this point you probably want to stop editing and run some 
reports to examine your data.  The Census report is written just for 
this purpose - it allows you to compare all census data for a person in 
a structured way.  Once you have evaluated your data you can go 
back and edit the record.
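
To make "compare in a structured way" concrete, here is a minimal
sketch in plain Python.  It is not the code of the Census report; the
function name, the census labels and the sample values are assumptions
for illustration.  It lines up one person's attributes per census and
returns only the keys where the censuses disagree.

```python
from typing import Dict, Optional


def find_conflicts(
    records: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Return the attributes whose recorded values differ between censuses.

    `records` maps a census name to the attributes recorded for one
    person in that census.  A census that lacks a key contributes None
    and is ignored when deciding whether the values conflict.
    """
    keys = sorted({key for attrs in records.values() for key in attrs})
    conflicts = {}
    for key in keys:
        row = {census: attrs.get(key) for census, attrs in records.items()}
        distinct = {value for value in row.values() if value is not None}
        if len(distinct) > 1:
            conflicts[key] = row
    return conflicts


# Illustrative data only.
records = {
    "Census A": {"Description": "Blue eyes", "Lunatic": "yes"},
    "Census B": {"Description": "Green eyes", "Lunatic": "yes"},
    "Census C": {"Lunatic": "yes"},
}

conflicts = find_conflicts(records)
```

Here only "Description" is reported, since "Lunatic" agrees everywhere
it was recorded.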

>
> So, in all this, you normally _don't_ check the census event!

It's not really a matter of checking an event.  We want the data stored 
in a structured manner so that we can run reports to analyse the data.

> It seems stupid to me to store data obtained in the census taking there.

I was suggesting storing data such as "number of rooms" as attributes of 
a census event.  Again, this is a natural place to store the data and 
allows convenient access for the Census report and Census editor.

> At most, I would share a note with the transcript there.

I would prefer for transcripts to be stored on the Source Reference 
rather than Event.  I only suggested storing an image on the Event 
because it is not possible to store it on the Source Reference.

>
> So, in my view, the way census gramplet works is wrong.

Well, I see it as: transcripts and images go on the Source Reference or 
Source, maybe shared.  Data extracted from that source goes on People, 
Families and Events.

Nick.

>
> Benny
>
> --
>
>     Gerald Britton
>
>     _______________________________________________
>     Gramps-devel mailing list
>     Gra...@li...
>     <mailto:Gra...@li...>
>     https://lists.sourceforge.net/lists/listinfo/gramps-devel
