From: derHeinzi <hei...@ya...> - 2011-10-02 23:12:35
Hello again, developers,

First of all, congratulations on the 3.3.1 release. Keep up the excellent work.

Please find attached a new version of the Python script for comparing data in two Gramps XML files:
http://gramps.1791082.n4.nabble.com/file/n3866064/GrampsCompareV02.py

Building on the first version, I found a way to improve the matching of data between two different database files. The new version completely disregards the database IDs and other database-internal data. It is a step toward being able to compare databases that do not share the same base (for example, a database given away and returned with added information).

The calling parameters are unchanged: first db file (uncompressed), reference person ID, second db file (uncompressed), and compared person ID.

Ken, this version should be able to cope with your data containing different IDs in the two databases. Please give it a try.

While playing around with the old version I noticed that some matches between the reference and the compared database failed because the reference data was processed sequentially: there were mismatches if the data in the compared database was sorted in a different order. So I changed the matching for each node. It now first finds all similar nodes in the reference data as well as in the compared data, then finds the best match among all the combinations and matches only those.

For the similarity check I use a counter approach. The subnodes of two candidate nodes known to represent the same information in both databases are iterated, and all similar nodes are compared. (Example: if a person has two name entries in the first database, birth name and married name, both are compared to all name entries in the compared database.) Counting the number of equal data entries in both databases gives a higher count the more matches are found. The first match is given by the user as a program parameter.
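To make the counter idea above a bit more concrete, here is a minimal sketch of how it could look with ElementTree. This is not the code from GrampsCompareV02.py: the function names (local_tag, similarity, best_pairs) are made up for illustration, only element text is counted (attributes are ignored), and the greedy "highest score first" pairing is my assumption about how a best match among all combinations might be chosen.

```python
import xml.etree.ElementTree as ET
from itertools import product


def local_tag(elem):
    """Drop an XML namespace prefix so '{ns}name' compares equal to 'name'."""
    return elem.tag.rsplit("}", 1)[-1]


def similarity(ref, cmp):
    """Count data entries (tag plus non-empty text) present in both subtrees."""
    remaining = [(local_tag(e), (e.text or "").strip()) for e in cmp.iter()]
    score = 0
    for e in ref.iter():
        entry = (local_tag(e), (e.text or "").strip())
        if entry[1] and entry in remaining:
            remaining.remove(entry)  # each compared entry may be counted once
            score += 1
    return score


def best_pairs(ref_nodes, cmp_nodes):
    """Score every combination, then pair nodes greedily, highest score first.

    Assumption: a greedy choice is good enough for illustration; the real
    script may select the best combination differently.
    """
    scored = sorted(
        ((similarity(r, c), i, j)
         for (i, r), (j, c) in product(enumerate(ref_nodes), enumerate(cmp_nodes))),
        key=lambda t: -t[0],
    )
    used_ref, used_cmp, pairs = set(), set(), []
    for score, i, j in scored:
        if score > 0 and i not in used_ref and j not in used_cmp:
            pairs.append((i, j, score))
            used_ref.add(i)
            used_cmp.add(j)
    return pairs


# Hypothetical data: the same person, with the two names stored in a different order.
ref_person = ET.fromstring(
    "<person>"
    "<name><first>Anna</first><surname>Schmidt</surname></name>"
    "<name><first>Anna</first><surname>Meier</surname></name>"
    "</person>"
)
cmp_person = ET.fromstring(
    "<person>"
    "<name><first>Anna</first><surname>Meier</surname></name>"
    "<name><first>Anna</first><surname>Schmidt</surname></name>"
    "</person>"
)

ref_names = ref_person.findall("name")
cmp_names = cmp_person.findall("name")
pairs = best_pairs(ref_names, cmp_names)
print(pairs)  # [(0, 1, 2), (1, 0, 2)]
```

A purely sequential comparison would pair the first name entry of each file and find only the shared first name. Scoring all four combinations pairs each name with its real counterpart, regardless of the order in the file.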
As said in an earlier mail, maybe it would be a start to make a report out of this?

I added a "tagDict", a tag dictionary in which you can specify that the tag "last" changed to "surname", so that "last" names in the reference database are accepted as matches for entries tagged "surname".

I did not optimize the code for performance (yet). At the beginning readability is more important, so that you can follow what the code does. I added lots of comments for that reason too. If you set the flag dbg to 1 you will get a lot of debug output showing how the program works.

Jerome suggested using lxml, but I stuck to Python's built-in ElementTree, since lxml has to be installed separately (which did not work on my box anyway) and it is easier for you to do some first evaluations without having to install extra Python packages.

I tested the program on my own data (different backups over time; the compare shows the added and changed data), on the Gramps "example" database files in the repositories of versions 3.2 and 3.3 (which shows an abundance of added description fields and some other changes, especially for the person with ID I44), and on my data compared to a database imported from GEDCOM (which is not yet completely satisfying and may need manual assistance in the end).

Please play around with it, look at the code (some comments are still to be done) and tell me what you think.

Kind regards
Heinz

--
View this message in context: http://gramps.1791082.n4.nabble.com/Database-compare-and-merge-II-tp3866064p3866064.html
Sent from the GRAMPS - Dev mailing list archive at Nabble.com.