[Gramps-devel] Database compare and merge II

Hello again, developers,

and first of all, congratulations on the 3.3.1 release. Keep up the excellent
work.

Please find attached a new version of the Python script for comparing data
in two Gramps XML files:
http://gramps.1791082.n4.nabble.com/file/n3866064/GrampsCompareV02.py

Building on the first version, I found a way to improve the matching of data
in two different database files. The new version completely disregards the
database IDs and other database-internal data. It is a step towards being
able to compare databases that no longer share the same base (for example, a
database that was given away and returned with added information).

The calling parameters are unchanged: First db file (uncompressed),
reference person ID, second db file (uncompressed) and compared person ID.
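
A minimal sketch of how the four parameters could be read, in the order given above. The argument names and the use of argparse are assumptions for illustration; the attached script may parse its arguments differently.

```python
import argparse

def parse_args(argv):
    """Parse the four positional parameters in the order described above.

    All argument names here are hypothetical; only their order and
    meaning come from the description in this mail.
    """
    parser = argparse.ArgumentParser(
        description="Compare two uncompressed Gramps XML files")
    parser.add_argument("ref_file", help="first db file (uncompressed)")
    parser.add_argument("ref_person_id", help="reference person ID")
    parser.add_argument("cmp_file", help="second db file (uncompressed)")
    parser.add_argument("cmp_person_id", help="compared person ID")
    return parser.parse_args(argv)

args = parse_args(["old.gramps", "I0001", "new.gramps", "I0044"])
```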

Ken, this version should be able to cope with your data containing different
IDs in both databases. Please give it a try.

While playing around with the old version I noticed that some data matching
between the reference and the compared database failed because the reference
data was processed sequentially. Mismatches occurred whenever the data in the
compared database was sorted in a different order. So I changed the matching
for each node: it now first collects all similar nodes in both the reference
data and the compared data, then looks for the best match among all the
combinations, and pairs only those.
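
The idea of scoring all combinations first and pairing only the best ones could be sketched as follows. The greedy selection and the names best_matches and score are assumptions for illustration, not necessarily what the script does.

```python
from itertools import product

def best_matches(ref_nodes, cmp_nodes, score):
    """Pair reference and compared nodes independently of their order.

    Every combination is scored first; pairs are then accepted from the
    highest score downwards, and each node is used at most once.
    (Greedy selection is an assumption made for this sketch.)
    """
    scored = sorted(
        ((score(r, c), i, j)
         for (i, r), (j, c) in product(enumerate(ref_nodes),
                                       enumerate(cmp_nodes))),
        key=lambda t: -t[0],
    )
    used_ref, used_cmp, result = set(), set(), []
    for s, i, j in scored:
        if s <= 0 or i in used_ref or j in used_cmp:
            continue  # no similarity, or one side is already paired
        used_ref.add(i)
        used_cmp.add(j)
        result.append((i, j, s))
    return result

# The same two entries, stored in a different order in each database.
pairs = best_matches(["Smith", "Jones"], ["Jones", "Smith"],
                     lambda a, b: 1 if a == b else 0)
```

With a purely sequential comparison the first entries ("Smith" and "Jones") would be compared with each other and reported as a mismatch; here each entry is paired with its counterpart regardless of position.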

For the similarity check I use a counter approach. The subnodes of two
candidate nodes that are known to represent the same information in both
databases are iterated, and all similar nodes are compared. (Example: if
there are 2 name entries for a person in the first database, birth name and
married name, both are compared to all name entries in the compared
database.) Counting the equal data entries gives a score that grows with the
number of matches found. The initial match is given by the user through the
person IDs passed as program parameters.
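
A minimal sketch of such a counter, assuming that "equal data entries" means subnodes with the same tag and the same text. The function name similarity and the simplified name elements are made up for the example.

```python
import xml.etree.ElementTree as ET

def similarity(ref_node, cmp_node):
    """Count (tag, text) entries below ref_node that also occur below cmp_node.

    Each entry of the compared node can be counted only once, and
    entries without text are ignored.
    """
    cmp_entries = [(e.tag, (e.text or "").strip()) for e in cmp_node.iter()]
    count = 0
    for e in ref_node.iter():
        entry = (e.tag, (e.text or "").strip())
        if entry[1] and entry in cmp_entries:
            cmp_entries.remove(entry)
            count += 1
    return count

ref = ET.fromstring(
    "<name><first>Anna</first><surname>Meier</surname></name>")
same = ET.fromstring(
    "<name><surname>Meier</surname><first>Anna</first></name>")
other = ET.fromstring(
    "<name><first>Anna</first><surname>Schmidt</surname></name>")

score_same = similarity(ref, same)    # both entries equal
score_other = similarity(ref, other)  # only the first name equal
```

The candidate with the higher count is the better match, so the order of the subnodes does not matter.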

As said in an earlier mail, maybe it would be a start to make a report out
of this?

I added a "tagDict", a tag dictionary in which you can specify that the tag
"last" was changed to "surname", so that "last" entries in the reference
database are accepted as matches for "surname" entries in the compared
database.
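
How such a dictionary might be consulted, as a sketch: only the name tagDict and the "last"/"surname" pair come from the script description; the helper tags_match and the plain old-tag-to-new-tag layout are assumptions.

```python
# Maps a tag used in the reference database to its renamed
# counterpart in the compared database (layout assumed).
tagDict = {"last": "surname"}

def tags_match(ref_tag, cmp_tag, tag_dict=tagDict):
    """Accept identical tags, or a reference tag renamed via tag_dict."""
    return ref_tag == cmp_tag or tag_dict.get(ref_tag) == cmp_tag

renamed_ok = tags_match("last", "surname")
unrelated = tags_match("last", "first")
```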

I did not optimize the code for performance (yet). At this stage readability
is more important, so that you can follow what the code does. I added lots
of comments for the same reason. If you set the flag dbg to 1 you will get a
lot of debug output showing how the program works.

Jerome suggested using lxml, but I stuck with Python's built-in ElementTree:
lxml has to be installed separately (which did not work on my box anyway),
and this way it is easier for you to do some first evaluations without
having to install extra packages for Python.
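
For those who want to experiment with ElementTree themselves, here is a small sketch of locating a person by ID using only the standard library. The sample document is heavily simplified and its namespace URI is a placeholder; the helper names are made up. A real file would be loaded with ET.parse(filename).getroot() instead of ET.fromstring.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Drop the '{namespace}' prefix ElementTree puts in front of tags."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root

def find_person(root, gramps_id):
    """Return the person element with the given id attribute, or None."""
    for person in root.iter("person"):
        if person.get("id") == gramps_id:
            return person
    return None

# Simplified stand-in for a Gramps XML file; the namespace is a placeholder.
SAMPLE = """<database xmlns="http://example.org/hypothetical-ns/">
  <people>
    <person id="I0044" handle="_abc"><gender>M</gender></person>
  </people>
</database>"""

root = strip_namespaces(ET.fromstring(SAMPLE))
person = find_person(root, "I0044")
```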

I tested the program on three sets of data: my own data (different backups
over time; the comparison shows the added and changed data), the Gramps
"example" database files in the repositories of versions 3.2 and 3.3 (which
show an abundance of added description fields and some other changes, esp.
for the person with ID I44), and my data compared to a database imported
from GEDCOM (which is not yet completely satisfying and may need manual
assistance in the end).

Please play around with it, look at the code (some comments are still to be
added) and tell me what you think.

Kind regards
Heinz

--
View this message in context: http://gramps.1791082.n4.nabble.com/Database-compare-and-merge-II-tp3866064p3866064.html
Sent from the GRAMPS - Dev mailing list archive at Nabble.com.
