On 02/01/2005 01:40:44 PM, Julio Sanchez wrote:
> > (a) Add duplicate record with another handle
> > (b) Merge information from two duplicate objects together
> > (c) Replace one record with another
> Merge them. Without hesitation.
I think we may follow Eero's suggestion and ask the user, since
the user may well know that one db should override the other
in all conflicts.
One way or the other, both (b) and (c) are tricky, because of
all the relations you mention below.
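The three options can be sketched in a few lines, assuming each database is a plain dict mapping handles to record dicts. The names and structures here are illustrative only, not the real GRAMPS API:

```python
# Sketch of the three strategies for a colliding handle.
# "db" and "incoming" are hypothetical handle -> record dicts.

def resolve_duplicate(db, incoming, handle, strategy):
    """Apply one of the three strategies to a handle present in both dbs."""
    if strategy == "duplicate":            # (a) keep both, under a new handle
        new_handle = handle + "-dup"       # a real impl. would generate a fresh unique id
        db[new_handle] = incoming[handle]
    elif strategy == "merge":              # (b) combine field-by-field
        merged = dict(db[handle])
        for key, value in incoming[handle].items():
            merged.setdefault(key, value)  # existing fields win on conflict
        db[handle] = merged
    elif strategy == "replace":            # (c) the incoming record wins
        db[handle] = incoming[handle]
```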
> I have myself code that does sort of that: it bumps match scores when
> some condition is met. But it is not nearly enough: merging is still
> a very burdensome process.
I hear you :-)
> It would improve if there was an easy way to merge the ancestor chain
> of a given match first, i.e. two merge candidates are found and are
> determined to be the same. However, their parents are not the same
> (though they have similar names and are obviously the same people).
> If you choose a parents set now, the other set gets disconnected from
> the child and has to be found later or not. Sometimes you miss them
> completely because their names were incomplete in that pair (happens
> all the time). However, the non-chosen parents may have the family
> record that contains marriage info, etc. and you do not know that from
> the merge window.
> So what you do is stop the merge of the child now, find the parents,
> merge them first and, then, go back to the child and merge all its
> instances. If the parents happen to be in the same situation, you
> have to stop there, deal with grandparents first and so on. On a
> large tree, the number of eggs being juggled doubles at each step.
> Moreover, the case for merging is very obvious at the child,
> grandchild, etc. level but when you go up the ancestry you get
> disoriented and may end up merging the wrong people because you do not
> see the children while merging their parents. When you have someone
> who married two siblings in sequence you merge them by mistake more
> often than not. While from the children level it is very obvious what
> is going on.
> The whole process is long and error-prone. Merging whole trees is a
> nightmare. Merging them repeatedly is probably akin to what Sisyphus
> went through.
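The ancestor-first ordering described above can be sketched as a small recursion: before merging a matching pair, settle their matching parents, so the parent links stay attached. The person dicts, the positional parent pairing, and the `looks_same` matcher are all hypothetical simplifications, not GRAMPS code:

```python
# Sketch of "merge the ancestor chain first".

def looks_same(a, b):
    # Placeholder matcher; a real one would score names, dates, places.
    return a["name"] == b["name"]

def merge_pair(a, b, merged):
    """Merge person b into person a, parents first.
    'merged' maps each absorbed id to its surviving id."""
    if b["id"] in merged:
        return
    for pa, pb in zip(a.get("parents", []), b.get("parents", [])):
        if looks_same(pa, pb):
            merge_pair(pa, pb, merged)     # settle the ancestors first
    for key, value in b.items():
        a.setdefault(key, value)           # then merge the pair itself
    merged[b["id"]] = a["id"]
```

Doing the recursion top-down like this keeps the "non-chosen" parents from ever being disconnected, because by the time a child pair is merged, its parents already share one surviving record.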
Yes. On the other hand, with duplicate handles some of these problems
are not present. Namely, the duplicate handles mean that people *are*
the same. If you want to merge duplicates and are just trying to sort out
real ones from the false positives, the duplicate handles boost your
confidence to 100%. There's simply no chance to get duplicate handles
any other way than forking the same record.
Still, even with the 100% certainty the merge process is tedious,
because even if the two trees are almost completely duplicated, somewhere
there may be non-duplicate people added by separate edits. These probably
should not be merged without the user's intervention.
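Because a handle collision can only come from forking the same record, the intersection of the two handle sets is exactly the set of guaranteed duplicates. A minimal sketch (function name and shape are assumptions, not existing code):

```python
# Measure how much of an import is guaranteed-duplicate, based purely
# on shared handles.

def duplicate_extent(db_handles, import_handles):
    """Return the shared handles and the fraction of the import they cover."""
    import_handles = set(import_handles)
    shared = set(db_handles) & import_handles
    fraction = len(shared) / len(import_handles) if import_handles else 0.0
    return shared, fraction
```

Everything in `shared` can be merged with full confidence; only the remainder needs the usual fuzzy matching.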
> It is better to have a way to mark info (not records!) as duplicate.
> Most info in records admits several instances. Just use notes or some
> other kind of mark and let a filter find the conflicting info. Maybe
> use source references to help in tagging the origin of information.
> Because when merging such trees, nearly all info is identical nearly
> every time. So optimize for this; don't optimize for the case where
> most info is conflicting.
Yeah -- optimizing is still a big word for what we're doing :-) At the
moment, we are just halting everything on the first duplicate handle,
which is not a good solution.
> As you know, I have commented on how much I wanted to minimise the
> effort required for remerging data that can be determined to be match
> reliably. But I got sidetracked. I want to come back to this, but it
> may be a few weeks before I do. And first I need to forward port some
> patches that I have pending.
I think that this would be a very useful thing to do. I also think it will
take a lot of work, and probably the short-term thing to do now is this:
1. Check for duplicate handles prior to import
2. Warn the user of the duplication extent
3. If the user consents, go ahead and import everything,
either replacing entries from one db with ones from another db,
or adding duplicates using new handles.
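The three steps above can be sketched as one function; `prompt_user` and `regenerate_handle` are hypothetical stand-ins for the UI and handle-generation code, and the dict-based db is an assumption:

```python
# Sketch of the proposed short-term import flow.

def import_with_check(db, incoming, prompt_user, regenerate_handle):
    # 1. Check for duplicate handles prior to import.
    shared = set(db) & set(incoming)
    # 2. Warn the user of the extent of the duplication.
    choice = prompt_user(len(shared), len(incoming))  # "replace", "keep-both", or "abort"
    if choice == "abort":
        return
    # 3. If the user consents, import everything.
    for handle, record in incoming.items():
        if handle in shared and choice == "keep-both":
            db[regenerate_handle(handle)] = record    # add as a duplicate, new handle
        else:
            db[handle] = record                       # replace, or plain add
```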
This way we will let the user find and merge duplicates whenever the
user wants to, and will not lose any info, except for the duplicate
handles themselves. I think this is better than poor, error-prone
automatic merging with a lot more data loss. The reason I want a short-term fix
is I want to release HEAD as soon as reasonably possible. It has been
stable enough for some time, with lots of enhancements and almost
all other features from STABLE present. The proposed way to handle
duplicates is no worse than what we have in STABLE. It just does not
use the possible advantage of unique handles, but we will get it
right in the next release.
The reality is, HEAD will never get enough testing until it is released.
Whenever we ask people to test HEAD, we only get responses from a
few people, and I don't blame the rest :-)
But it's a drag to maintain two versions. Not only the maintenance
itself, but also the fact that the majority of user contributions, patches,
etc. come only for STABLE, which is a dead end, so a lot of work
is simply wasted.
I realize that this is another story altogether, but the import/export
was the last step before we could think of releasing 1.2.
Alexander Roitman http://ebner.neuroscience.umn.edu/people/alex.html
Dept. of Neuroscience, Lions Research Building
2001 6th Street SE, Minneapolis, MN 55455
Tel (612) 625-7566 FAX (612) 626-9201