starting merge with date matching

  • Daniel Kionka

    Daniel Kionka - 2004-12-13

    I finally did some work on my long term goal, merging.  I think it is a new approach to deciding what is a match.  Combining 2 records is not that hard, but having it figure out which records are close enough to match is a big challenge.

    The first step is coming up with as many aspects to compare as possible.  I worked on dates.  The first thing you can do is divide it into day, month, and year, and then compare each piece.  Then there are levels for each piece, like exact match vs. being in a range or an ABT.

    What I think is new is that instead of just counting up points, I keep track in 3 dimensions: positive, negative, and unknown.  A conflicting day should not simply cancel out a matching month.  You want to know how many conflicts you have as well as how many matches.  And missing info is not the same as a conflict; it is a level of uncertainty.
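
    As an illustration of the three tallies, here is a minimal Java sketch.  It is not the GDBI code: the class and field names are hypothetical, every piece carries a weight of 1 (the real aspects are weighted), and ranges and ABT dates are left out.

```java
// Hypothetical sketch: names and unit weights are illustrative,
// not taken from the GDBI source.
final class DateTally {
    /** A date whose pieces may be missing (null); "1930" has no day or month. */
    static final class PartialDate {
        final Integer day;
        final Integer month;
        final Integer year;

        PartialDate(Integer day, Integer month, Integer year) {
            this.day = day;
            this.month = month;
            this.year = year;
        }
    }

    int positive;  // pieces present on both sides and equal
    int negative;  // pieces present on both sides and different
    int unknown;   // pieces missing on at least one side

    /** Score one piece: a missing value is uncertainty, not a conflict. */
    private void score(Integer a, Integer b) {
        if (a == null || b == null) {
            unknown++;
        } else if (a.equals(b)) {
            positive++;
        } else {
            negative++;
        }
    }

    /** Compare day, month, and year separately and keep all three tallies. */
    static DateTally compare(PartialDate a, PartialDate b) {
        DateTally t = new DateTally();
        t.score(a.day, b.day);
        t.score(a.month, b.month);
        t.score(a.year, b.year);
        return t;
    }
}
```

    Comparing a full date against a year-only date with the same year gives 1 positive, 0 negative, and 2 unknown under these assumptions, so the missing day and month never count against the match.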

    Of course 3-dimensional points don't answer the question of whether 2 dates match.  So I map it back to what will be a colored bar, but for now is made of the characters "Y+.-N".  It is divided up into 5 sections:  positive, non-negative, unknown, non-positive, and negative.  The order may seem strange, but they are in good to bad order.  Maybe 3 sections are good enough (positive, unknown, and negative), but for now I am distinguishing true positive aspects from false negative aspects.  (A conflict is either negative, unknown, or non-negative; being non-negative is not necessarily positive.)

    Even the multi-colored bar does not give me a thumbs up/down, so the next trick is to mark where the midpoint is, which is the middle of the unknown section.  That can be expressed as a percent match, which is finally a simple number you can compare.

    Since a picture is worth a thousand words, here are some examples.  The aspects are weighted, and "|" marks the midpoint.  (It only makes sense with fixed width characters.)

    |---------NNNNNNNN 1/2/1930 4/5/1960
    ....|....-----NNNN 1/2/1930 1960
    ........|........- 1/2/1930
    YYYY++++....|....- 1/2/1930 1930
    YYYYYYYYY++++++++| 1/2/1930 1/2/1930
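
    The bars above can be reproduced with a short Java sketch.  The section widths passed in are read off the examples, not computed from real weights, and the class and method names are hypothetical.  The percent formula (everything left of the midpoint divided by the total) and the 50% result for an empty bar are assumptions about what "percent match" means here.

```java
// Hypothetical sketch of the "Y+.-N" bar; not the GDBI implementation.
final class MatchBar {
    /** Append character c to sb n times. */
    private static void fill(StringBuilder sb, char c, int n) {
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
    }

    /** Draw the five sections in good-to-bad order with "|" in the middle of the unknown section. */
    static String render(int pos, int nonNeg, int unknown, int nonPos, int neg) {
        StringBuilder sb = new StringBuilder();
        fill(sb, 'Y', pos);
        fill(sb, '+', nonNeg);
        fill(sb, '.', unknown / 2);
        sb.append('|');
        fill(sb, '.', unknown - unknown / 2);
        fill(sb, '-', nonPos);
        fill(sb, 'N', neg);
        return sb.toString();
    }

    /** Position of the midpoint as a percentage of the whole bar. */
    static double percentMatch(int pos, int nonNeg, int unknown, int nonPos, int neg) {
        int total = pos + nonNeg + unknown + nonPos + neg;
        if (total == 0) {
            return 50.0; // assumption: nothing known either way
        }
        return 100.0 * (pos + nonNeg + unknown / 2.0) / total;
    }
}
```

    With widths 4, 4, 8, 1, 0, render() gives the "1/2/1930 1930" bar above, and percentMatch() puts the midpoint at 12 of 17 units, roughly 71%.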

    For more info you can see the new directory I created, src/net/sourceforge/gdbi/prototype/merge.

    • Daniel Kionka

      Daniel Kionka - 2004-12-18

      There was a discussion of merging on the PGV forum, so I summarized my plan there, and linked back to this forum.  See:

    • Daniel Kionka

      Daniel Kionka - 2004-12-22

      I simplified the model above a little.  Instead of 5 fields, it now works like 2 tri-state variables, for agreement and conflict.  Each variable is true, false, or unknown.

      This may end up being more complex than necessary, but intuitively I feel that a conflict is different from not agreeing (and agreement is different from lack of conflict).
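
      One way to picture the two tri-state variables is the Java sketch below.  The type names are hypothetical, and the mapping back to the "Y+.-N" characters is only a guess at how the two models relate.  In particular, letting a known conflict win over a known agreement is an assumption, not something stated in this thread.

```java
// Hypothetical sketch; the mapping to bar characters is an assumption.
enum TriState { TRUE, FALSE, UNKNOWN }

final class AspectResult {
    final TriState agreement; // do the two values agree?
    final TriState conflict;  // do the two values contradict each other?

    AspectResult(TriState agreement, TriState conflict) {
        this.agreement = agreement;
        this.conflict = conflict;
    }

    /** Map the pair to one of the five bar sections. */
    char barChar() {
        if (conflict == TriState.TRUE) {
            return 'N'; // negative: a real conflict
        }
        if (agreement == TriState.TRUE) {
            return 'Y'; // positive: a real match
        }
        if (conflict == TriState.FALSE) {
            return '+'; // non-negative: no conflict, but no proven agreement
        }
        if (agreement == TriState.FALSE) {
            return '-'; // non-positive: no agreement, but no proven conflict
        }
        return '.';     // unknown on both counts
    }
}
```

      Keeping the variables separate is what lets "no conflict" and "agreement" produce different results.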

      I also started working on name matching.

    • Daniel Kionka

      Daniel Kionka - 2005-01-02

      I have been pounding out code like never before, and after writing 2000 new lines, I have a simple program for matching gedcoms.  It doesn't merge anything, just recursively compares and matches INDIs and FAMs from 2 databases.

      What makes it special is that it uses the comparison described above and automatically matches record pairs that have a high enough score.

      Along the way I added a 4th database.  Instead of using jLifelines, GenJ, or PGV, I extended the gedcom parser to work as a GdbiGedcom database.  That means you can use a gedcom text file directly, and keep all the records in memory (which is how GenJ works).

      The typical usage will be to use 1 of the 3 standard databases as your primary db, and to leave the import gedcom as a text file.  It writes persistent data to the import gedcom so you don't clutter up your primary db.  It has to save the results because it could take several sessions to find all the matches.

      To see what I have so far, run "make test" in gdbi/prototype/merge.  It uses built-in dummy data by default.  To use 2 gedcom files, you need to give the 2 file names and starting records, e.g.

      make RUN_OPT='primary.GED import.GED I26 I0818' test

      It does everything with text I/O.  It is just a basic test.  It doesn't even save the import gedcom yet with all the internal _GDBI data it wrote.  But it does match records in 2 databases I have wanted to merge for 3 years!

      The next steps are: write output, refine matches, add a GUI, select events to import/ignore, and merge the data.

      • Daniel Kionka

        Daniel Kionka - 2005-01-10

        I wanted to give an update on the "next steps" I mentioned above.  I enhanced the new text-file/in-memory database, called rawdb, to write the output back to the file.  I also began a GUI for matching people/families.

        I wrote a new web page describing match/merge.  From the gdbi home page you can go to the FAQ and then to merge.  The direct link is:

    • John Finlay

      John Finlay - 2005-01-10

      It is looking good.  I am really excited about this.


    • Daniel Kionka

      Daniel Kionka - 2005-01-16

      It is a good thing we discussed threads recently.  We discussed it for BKEdit, but it ended up helping on the matching.

      I have had the text I/O version working, but it is hard to convert it to a GUI.  It recursively compares pairs of INDIs/FAMs, and when you are nested several levels down, it is easy to ask the user to type y/n, but you can't ask for a button to be pressed.

      After pulling out my big Java textbooks, I finally figured out what the wait()/notify() stuff is all about.  Java builds semaphores into the language.  Basically you put the y/n char in an object with synchronized get/put methods.  The recursive compare runs in a thread, and when it needs a char, the get() calls notify() and wait().  When the GUI gets a button click, it sends it to put(), which calls notify().  The compare thread then ends the wait and picks up the results.
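
      A minimal Java sketch of that handoff object follows.  The class name AnswerBox is hypothetical and this is not the GDBI code.  It also differs slightly from the description above: get() waits in a loop and does not call notify() itself, and put() uses notifyAll().

```java
// Hypothetical sketch of the y/n handoff between the GUI and the compare thread.
final class AnswerBox {
    private Character answer; // null means no answer is waiting

    /** Called by the compare thread; blocks until the GUI supplies a character. */
    synchronized char get() throws InterruptedException {
        while (answer == null) {
            wait(); // releases the lock until put() calls notifyAll()
        }
        char c = answer;
        answer = null; // consume the answer so the next get() waits again
        return c;
    }

    /** Called by the GUI thread, for example from a button handler. */
    synchronized void put(char c) {
        answer = c;
        notifyAll(); // wake the compare thread blocked in get()
    }
}
```

      The recursive compare would call get() wherever the text version read a y/n character, and each button's listener would call put('y') or put('n').  The loop around wait() guards against spurious wakeups and against put() running before get().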

      That means I now have a GUI that shows family members from 2 GEDCOMS side by side and lets the user select if each person is the same or not.  In some cases it only "verifies" they are the same because it has already figured out they are the same by comparing their attributes.

      I have been doing everything bottom-up.  Now I need a window to select the people you want to start with, and then a window to find the GEDCOM to merge in.  (I am doing that with command line arguments now.)

    • Daniel Kionka

      Daniel Kionka - 2005-01-25

      Last time I said matching was working, but there was no GUI to get to it...

      There is now a new top-level window that keeps track of all open databases, lets you run programs on them, and lets you open new databases.  (Adding this was discussed on a separate forum thread.)

      One of the programs you can run is "Merge".  (The BKEdit button doesn't work yet.)  Merge will bring up a window that lets you select the import database and the 2 starting INDIs.  You then start the matching, which I described above.

      Now the only thing left for merging is the actual merge part.  I have to add a window where you can select what to take/skip.  Then finally I will add the copy routine.

    • Daniel Kionka

      Daniel Kionka - 2005-02-09

      I thought the step of combining data would be easy, but it is pretty hard to do it "right".  For example, if the 2 gedcoms have conflicting birth dates for a person, you can add a 2nd BIRT, but if they have the same birth date, and the import gedcom has the missing place, you want to put it under the original BIRT.
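
      The BIRT rule in that example could be sketched in Java as below.  The Birth class and merge() method are hypothetical stand-ins for real GEDCOM records, dates are compared as plain strings, and what to do when the dates match but the places differ is not covered in this thread (the sketch keeps the original place).

```java
import java.util.List;

// Hypothetical sketch of the BIRT merge rule; not the GDBI copy routine.
final class BirthMerge {
    /** A simplified BIRT event with a DATE and an optional PLAC. */
    static final class Birth {
        final String date;
        String place; // null when the place is missing

        Birth(String date, String place) {
            this.date = date;
            this.place = place;
        }
    }

    /** Merge one imported BIRT into the primary record's list of BIRT events. */
    static void merge(List<Birth> primary, Birth imported) {
        for (Birth b : primary) {
            if (b.date != null && b.date.equals(imported.date)) {
                // Same date: fill in the missing place under the original BIRT.
                if (b.place == null) {
                    b.place = imported.place;
                }
                return;
            }
        }
        // Conflicting (or no) date: keep both by adding a 2nd BIRT.
        primary.add(new Birth(imported.date, imported.place));
    }
}
```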

      The good news is that it now merges some data after doing all the matching.  The GUI allows you to select options, but they are all ignored.  All the major pieces are there, and now it is the slow process of making it all work correctly.

    • Daniel Kionka

      Daniel Kionka - 2005-03-03

      I had given a series of updates on this thread, and since I have stopped working on merging for a while, I should give the final status.

      Merging basically works, but it doesn't work the way I wanted.  It has a few options, and that might be enough in some cases, but it doesn't give me enough control to use the full merge on my own data.  The option to take minimal data works fine, though.

      It will take a lot of careful review to get it just the way I want, so I decided to release it as-is with the usual disclaimers about brand new code.

      So now I am going back and cleaning up the rest of GDBI.  Along the way I created a whole new top-level window that opens subsystems like BKEdit and merge, but it is still rough.  The most noticeable difference is that instead of doing everything from a single BKEdit window, you have a database list window that starts up separate BKEdit and merge windows.

      When the new top-level window is working smoothly, I will release another version of GDBI.  After that I will go back to refining the merge system.

    • Wes Groleau

      Wes Groleau - 2007-08-10

      Just some wild ideas--if they were already mentioned, forgive me for not seeing them.

      If Joe has no birth date and Joseph does, there may be things like marriage date of parents, birthdates of children of children, dates of other events, etc. which could assist in determining a probability that they have the same birthdate.  Similar for other event dates.

      Recursion: If each has a list of children, siblings, parents, for each pair of those that has a 50% or better match probability (and the same relationship), add a tenth of that probability to the Joe/Joseph match.
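
      As a rough Java sketch of that rule: the names are hypothetical, probabilities are taken as values from 0 to 1, the caller is assumed to have already paired relatives by relationship, and capping the result at 1.0 is an added assumption.

```java
// Hypothetical sketch of the "add a tenth" rule for relatives.
final class RelativeBoost {
    /**
     * base            match probability for the two people themselves (0..1)
     * relativeMatches match probabilities for relative pairs with the same relationship
     */
    static double adjust(double base, double[] relativeMatches) {
        double score = base;
        for (double p : relativeMatches) {
            if (p >= 0.5) {
                score += p / 10.0; // add a tenth of each 50%-or-better match
            }
        }
        return Math.min(score, 1.0); // assumption: never exceed certainty
    }
}
```

      For example, a base of 0.4 with relative matches of 0.8, 0.3, and 0.5 comes out at about 0.53, since only the 0.8 and 0.5 pairs qualify.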

      Optional name equivalence: If the user can provide a list of alternative name spellings for their particular database, that could be used to slightly adjust the probability.  For example, Tomkiewicz = Thompson, or Groleau = Groslot might be on the list.  These are equivalences you could never get from Soundex.

      • Daniel Kionka

        Daniel Kionka - 2007-08-13

        Those are good ideas.  My original plan was to "look around" and see whether 2 people matched based on their close relatives.  The infrastructure is there, but the challenge was coming up with a long list of rules like yours about adding a tenth.

        As you can see from the dates of the posts, it has been over 2 years since I worked on merging.  I moved on to other features, fsdb for local storage and drawing descendancy trees, and I never really finished any of them.  Interest in GDBI has gone way down.  It used to be popular for editing PGV, but now you can do most things directly on the web site.

