Duplicated Contacts Distinct URI

  • Anonymous

    Anonymous - 2011-07-04

    Hi, I'm having trouble with duplicate data again,
    this time with Contact URIs:
    when I run the ImapCrawler I get duplicate Contacts.

    E.g. I have a User Six who appears in three emails, and the crawler generates three URIs:

    emailperson:xxx1 a Contact;
       fullname "User Six";
       hasEmailAddress <mailto:user6@example.com> .

    emailperson:xxx2 a Contact;
       fullname "User Six";
       hasEmailAddress <mailto:user6@example.com> .

    emailperson:xxx3 a Contact;
       fullname "User Six";
       hasEmailAddress <mailto:user6@example.com> .

    <mailto:user6@example.com> a EmailAddress;
       emailAddress "user6@example.com"

    Is there any way to avoid this problem and use only one URI?

  • Antoni Mylka

    Antoni Mylka - 2011-07-06

    The simple answer is: if you need one URI per person, you'll need to post-process the data yourself. Choose one of the URIs (or generate a new one), make all the links point at the chosen URI and remove all other contacts. Aperture won't do that for you.
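    The post-processing described above could be sketched roughly as follows. This is a minimal illustration and not part of Aperture: it assumes the crawled statements have been exported as plain (subject, predicate, object) tuples, that each contact carries a single address (as in the example from the question), and that the first contact seen for an address is the one to keep. The prefixed predicate names and the function name are placeholders.

```python
# Hypothetical predicate name; a real graph would use the full NCO URI.
HAS_EMAIL = "nco:hasEmailAddress"


def merge_contacts_by_address(triples):
    """Collapse contacts that share an email address into one URI.

    Assumes one address per contact; the first contact seen for an
    address is kept and every other contact is rewritten to it.
    """
    first_for_address = {}  # address URI -> contact URI that is kept
    canonical = {}          # every contact URI -> contact URI that is kept
    for s, p, o in triples:
        if p == HAS_EMAIL:
            keep = first_for_address.setdefault(o, s)
            canonical.setdefault(s, keep)

    merged, seen = [], set()
    for s, p, o in triples:
        # Redirect links in both subject and object position.
        t = (canonical.get(s, s), p, canonical.get(o, o))
        if t not in seen:  # the removed contacts' statements become duplicates
            seen.add(t)
            merged.append(t)
    return merged


triples = [
    ("emailperson:xxx1", "rdf:type", "nco:Contact"),
    ("emailperson:xxx1", "nco:fullname", "User Six"),
    ("emailperson:xxx1", HAS_EMAIL, "mailto:user6@example.com"),
    ("emailperson:xxx2", "rdf:type", "nco:Contact"),
    ("emailperson:xxx2", "nco:fullname", "User Six"),
    ("emailperson:xxx2", HAS_EMAIL, "mailto:user6@example.com"),
    ("email:1", "nmo:from", "emailperson:xxx1"),
    ("email:2", "nmo:from", "emailperson:xxx2"),
]
merged = merge_contacts_by_address(triples)
```

    After the call, both emails point at emailperson:xxx1 and no statement about emailperson:xxx2 remains. Whether a shared address really means "same person" is still your decision to make.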

    The not-so-simple answer is:

    1. Once upon a time (before rev 2085, committed on 30.09.2009) the emailperson URIs looked like "emailperson:Antoni+Mylka", i.e. they contained the URL-escaped name. This caused problems when someone sent an email to him/herself, but to a different account. It's quite common for people to cc themselves on every email they send. In Aperture you'd get an RDF graph where one email URI was connected by two edges to one person URI, and the person URI then had two email addresses, but the information about which address was from: and which was to: was lost.

    2. The workaround implemented in 2009 has been invalidated by the improvements in the NIE ontology (which got taken over by the freedesktop people). In the current version of NIE the range of nmo:from isn't nco:Contact but nco:EmailAddress. This was a very controversial issue back in 2007 in NEPOMUK, and when the NEPOMUK project ended, it got changed. We haven't updated the NIE version in use for a long while. I just couldn't get round to it during the last two years.

    3. When we update to the latest version of NIE we can theoretically allow for:
      nmo:from <mailto:antoni@home.com>;
      nmo:to <mailto:antoni@work.com> .

      <emailperson:Antoni+Mylka> nco:hasEmailAddress <mailto:antoni@home.com>, <mailto:antoni@work.com> .

    We will not do that though, because in the general case the rule "if two emails are sent to two different addresses, but both addresses carry the same name, then it's the same person" does not necessarily hold. Two people can have the same name, and one person's name can have multiple spellings. It's up to the user to decide; we can't do that at the framework level.

    Indeed, if two contacts have the same email address, they are most likely the same person (though not always, e.g. my accountant has 6 employees and they all share the same company address), but the same name with two different addresses does not reliably indicate the same person.
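    One way to respect that distinction in your own post-processing is to treat a shared name only as a hint and leave the decision to a human. The sketch below assumes the contacts have already been reduced to (name, address) pairs; the function name and the data layout are made up for illustration and are not an Aperture API.

```python
from collections import defaultdict


def name_collisions(contacts):
    """Return the names that occur with more than one distinct address.

    These are only candidates for merging: two people can share a name,
    so the result is meant for manual review, not automatic merging.
    """
    addresses_by_name = defaultdict(set)
    for name, address in contacts:
        addresses_by_name[name].add(address)
    return {
        name: sorted(addresses)
        for name, addresses in addresses_by_name.items()
        if len(addresses) > 1
    }


contacts = [
    ("Antoni Mylka", "mailto:antoni@home.com"),
    ("Antoni Mylka", "mailto:antoni@work.com"),
    ("User Six", "mailto:user6@example.com"),
    ("User Six", "mailto:user6@example.com"),
]
candidates = name_collisions(contacts)
```

    Here only "Antoni Mylka" is reported, with both addresses, because "User Six" always appears with the same address and could be merged on the address alone.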

    Deduplication is a complex topic. If you need something simple, you can hack it yourself in 10 lines of code, but the same 10 lines of code will never satisfy everyone.

