#112 Articles with duplicate pubmed ids are getting into the db

closed-fixed
Danny Yoo
Bugs (45)
9
2005-10-18
2005-07-11
Anonymous
No

When trying to input scanned date for this paper:
http://tesuque.stanford.edu:8080/pub/DisplayArticle?article_id=32922

I get an error message saying the pubmed id is already
in use by another paper. The link is to:
http://tesuque.stanford.edu:8080/pub/DisplayArticle?article_id=31182

This should have been caught earlier. They are the same
paper, yet both are in Pub.
Leonore

Discussion

  • Danny Yoo
    2005-08-09

    Logged In: YES
    user_id=49843

    Ok.

    The issue appears to be that duplicate checking does not
    treat articles with the same pubmed id as duplicates
    of each other.

    First, how extensive is this problem?

    #######
    mysql> select a.id
        -> from pub_article a, pub_article b
        -> where a.id < b.id
        ->   and a.pubmed_id = b.pubmed_id
        ->   and a.pubmed_id is not null
        ->   and a.pubmed_id != ''
        ->   and a.is_obsolete = 'n'
        ->   and b.is_obsolete = 'n';
    #######

    We're getting about 32 candidates here.

    Second, how do we correct existing damage to the database?
    We need to arbitrarily choose one of the articles to
    obsolete, and merge information as necessary. I'll look at
    the articles and see if this is a particularly simple thing
    to do.

    Finally, how do we prevent this from happening again?
    Offhand, we should have pubmed_id be a unique index, to
    prevent duplicate articles from having the same pubmed_id.
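
A minimal sketch of the unique-index idea, not the actual PubSearch schema change. It assumes Python's built-in `sqlite3` as a stand-in for the MySQL database, a `pub_article` table reduced to the three columns the query above mentions, and a hypothetical index name. Because the query above has to exclude both NULL and `''`, the sketch first converts empty strings to NULL: a unique index treats `''` as an ordinary value, while NULLs never collide with each other.

```python
import sqlite3

# In-memory stand-in for the real database (assumption: SQLite instead of MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pub_article ("
    " id INTEGER PRIMARY KEY,"
    " pubmed_id TEXT,"
    " is_obsolete TEXT NOT NULL DEFAULT 'n')"
)
conn.executemany(
    "INSERT INTO pub_article (id, pubmed_id) VALUES (?, ?)",
    [(1, "1001"), (2, "1002"), (3, None), (4, ""), (5, "")],
)

# Store "no PubMed id" as NULL so that rows without an id do not collide.
conn.execute("UPDATE pub_article SET pubmed_id = NULL WHERE pubmed_id = ''")

# Hypothetical index name; this fails if duplicate ids are still present.
conn.execute(
    "CREATE UNIQUE INDEX idx_pub_article_pubmed_id ON pub_article (pubmed_id)"
)


def try_insert(article_id, pubmed_id):
    """Return True if the row was stored, False if the unique index rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO pub_article (id, pubmed_id) VALUES (?, ?)",
                (article_id, pubmed_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert(6, "1003"))  # True: new PubMed id
print(try_insert(7, "1001"))  # False: already used by article 1
print(try_insert(8, None))    # True: articles without a PubMed id are unaffected
```

Under this sketch an obsoleted article that keeps its pubmed_id still occupies that value, so the obsoleted half of each existing duplicate pair would need its pubmed_id cleared before the index could be created.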

     
  • Logged In: YES
    user_id=579762

    When merging, please do NOT do this arbitrarily but retain
    the article that has annotations associated with it. If
    neither article has annotations, it is fine to keep either one.
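
The merge rule above could be sketched roughly as follows. The function name, the `annotation_counts` mapping, and the article ids are hypothetical. The comment does not say what to do when both articles carry annotations, so this sketch assumes that case is left for a curator to decide.

```python
def choose_survivor(article_a, article_b, annotation_counts):
    """Pick which article of a duplicate pair to keep.

    annotation_counts maps article id -> number of annotations (hypothetical input).
    Returns (keep_id, obsolete_id), or None when both articles have
    annotations and a curator should decide (an assumption of this sketch).
    """
    a_count = annotation_counts.get(article_a, 0)
    b_count = annotation_counts.get(article_b, 0)
    if a_count and b_count:
        return None  # both annotated: no automatic choice
    if b_count:
        return (article_b, article_a)  # only b is annotated, keep b
    # Either only a is annotated, or neither is; keeping a is fine in both cases.
    return (article_a, article_b)


# Illustrative ids and counts only, not data from the real database.
print(choose_survivor(1, 2, {2: 3}))        # (2, 1)
print(choose_survivor(1, 2, {}))            # (1, 2)
print(choose_survivor(1, 2, {1: 1, 2: 4}))  # None
```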

     
  • Logged In: YES
    user_id=579762

    I believe this is the item that describes the duplicate PMID
    problem. There needs to be some troubleshooting on how the
    duplicates are getting in and what criteria we need to use
    to merge them. I think Aleksey will be of some help when
    attacking this problem, at the very least in providing some
    background and some examples.

     
  • Logged In: YES
    user_id=579762

    I wanted to move this guy up to priority 9 but since I don't
    have ownership of the item, I can't do that. Can you,
    Danny? This is actually a very pressing problem and if I
    could, I'd make this priority 10.

    Thanks,

    Tanya

     
  • Danny Yoo
    2005-10-13

    • priority: 5 --> 9
     
  • Danny Yoo
    2005-10-14

    • summary: cannot validate paper --> Articles with duplicate pubmed ids are getting into the db
     
  • Danny Yoo
    2005-10-18

    Logged In: YES
    user_id=49843

    I corrected the existing data problems in the database and
    modified the bulk pubmed loader to take pubmed id into
    consideration when loading.

    This still doesn't resolve the issue of pulling updated
    information from pubmed into pubsearch, but I think that
    should be considered a separate (but related) issue. I'll
    close this bug, and open a new one that describes the problem.

    Closing bug.
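
For illustration only, a loader that takes the pubmed id into consideration might look something like the sketch below. This is not the actual bulk loader. It assumes Python's built-in `sqlite3` in place of the real database, a hypothetical `load_articles` function, a `title` column added for the example, and that a record whose pubmed id already belongs to a non-obsolete article is skipped rather than merged or updated.

```python
import sqlite3


def load_articles(conn, records):
    """Insert records unless their PubMed id is already in use.

    Returns a list of (pubmed_id, existing_article_id) for skipped records.
    """
    skipped = []
    for record in records:
        # Treat a missing or empty id as "no PubMed id".
        pubmed_id = record.get("pubmed_id") or None
        existing = None
        if pubmed_id is not None:
            existing = conn.execute(
                "SELECT id FROM pub_article"
                " WHERE pubmed_id = ? AND is_obsolete = 'n'",
                (pubmed_id,),
            ).fetchone()
        if existing is not None:
            skipped.append((pubmed_id, existing[0]))
            continue
        conn.execute(
            "INSERT INTO pub_article (pubmed_id, title) VALUES (?, ?)",
            (pubmed_id, record["title"]),
        )
    conn.commit()
    return skipped


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pub_article ("
    " id INTEGER PRIMARY KEY,"
    " pubmed_id TEXT,"
    " title TEXT,"
    " is_obsolete TEXT NOT NULL DEFAULT 'n')"
)

# Illustrative records only.
batch = [
    {"pubmed_id": "1001", "title": "First paper"},
    {"pubmed_id": "1001", "title": "First paper (loaded again)"},
    {"pubmed_id": "", "title": "Paper without a PubMed id"},
]
skipped = load_articles(conn, batch)
print(skipped)  # [('1001', 1)]
```

Skipping leaves the stored article untouched, which is consistent with the remaining problem noted above: updated information from pubmed is not pulled into an article that already exists.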

     
  • Danny Yoo
    2005-10-18

    • status: open --> closed-fixed