#51 List of External databases used in Chado (standardization of database prefixes)

next-load
open
None
1
2013-09-24
2011-12-30
No

In this list we have
ChEBI/CHEBI
PomBse
PubMed/PMID
Rfam/RFAM
TAIT possibly TAIR?
Uniprot/UniProKB/ UniProtKB/Swiss-Prot
url/URL

Need to
1. Identify correct ones
2. Identify where errors are coming from
3. Fix
4. Implement checks to prevent incorrect IDs (dbxrefs list = PomBase-specific/in-house?)

Discussion

1 2 > >> (Page 1 of 2)
  • Valerie Wood

    Valerie Wood - 2011-12-30
    • assigned_to: gomidori --> kim_rutherford
     
  • Midori Harris

    Midori Harris - 2012-01-03

    some thoughts ...

    For databases, correct abbreviations and prefixes should all be in the GO.xrf_abbs file. That will cover these:
    BFO
    CHEBI
    ECO
    EMBL
    GO
    GOA
    GOC
    GO_REF
    InterPro
    KEGG
    MOD*
    OBI
    OBO_REL
    PATO
    Pfam
    PhenoScape
    PomBase
    PMID (for PubMed)
    RESID
    Rfam
    SO
    TAIR
    UniMod

    Note 1: for "MOD" the primary abbreviation in the GO file is PSI-MOD, but they use the prefix MOD so that's listed as a synonym in the file.

    Note 2: Some aren't in yet, e.g. ECO, FYPO, but I'll add them today.

    I hope we don't end up having to maintain a separate list for PomBase because it will go out of sync with GO's at the slightest provocation. But I doubt that everything on this list would fit the scope of the GO file.

    This lot are used in PSI-MOD property_value tags, and don't identify databases in their own right:

    DeltaMass
    DiffAvg
    DiffFormula
    DiffMono
    FormalCharge
    Formula
    MassAvg
    MassMono
    Source
    TermSpec

    I assume they turn up on the db list due to how PSI-MOD is being stored, and I'd be really surprised if we have any reason to use them anywhere else.

    A few other notes:

    - PomBse is obviously a typo

    - ChEBI, PFAM, RFAM should be corrected to CHEBI, Pfam, Rfam respectively

    - Uniprot should be corrected to UniProtKB (don't get me started on why-t-f they had to add the damn "KB" ...)

    - The UniProt folks like the "UniProtKB/Swiss-Prot" designation, but the rest of the world thinks using a slash in a db prefix is more annoying than clever, so GO uses "Swiss-Prot" as the primary abbreviation. "UniProtKB/Swiss-Prot" is a synonym, though, so we can keep using it if you want.

    - feature_cvtermprop_type looks like a table name

    - 'internal' and 'null' also look like, well, internal gubbins

    - I can't tell where GI, _global, OMSSA, Origin, SOFP, TAIT, url or URL come from, but most of them don't look like databases.

    cheers,
    m

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-04

    Thanks for doing the investigation Midori.

    I'll try to track down which input files the bogus db names come from, and then update the database visible from the curation tool.

    The GMOD Chado database creator adds some of the more obtuse ones, like "null". Some of them (like "TAIT") are gone in my latest load, so they must have been temporary typos.

    The others mostly come from one of the OBO files, BioGRID or GOA. For example "OMSSA" comes from PSI-MOD.obo. Comparing them against GO.xrf_abbs might not help us much since there are quite a lot of non-GO files that we load.

    It's going to take a bit of work to get a totally clean list, but we can do the easy ones now and delete any extras after the final load of Chado. It's very easy to do once it's in.

     
  • Valerie Wood

    Valerie Wood - 2012-01-04

    >GO.xrf_abbs might not help us much

    I have a feeling Midori is the arbiter of this file ;)

     
  • Midori Harris

    Midori Harris - 2012-01-04

    Well, there have been actual updates by actual other people since I left ;)

    The real limitation with the GO.xrf_abbs file isn't that we have a lot of other things to load, it's that not all of the things getting loaded correspond to databases. GO.xrf_abbs doesn't actively exclude things that aren't used directly by GO, and it's getting some use as a de-facto abbreviation set for general MOD uses in the ongoing absence of anything better/more centralized/more generic.

    Missed OMSSA in the psi-dev.obo file yesterday. I've plunked it into the abbs file so people will know what it stands for (and there are some other pretty sparse entries in, so the precedent's established). I think I oughta draw the line at the PSI-MOD property_value gubbins, though; I can't really make a decent case for putting those in the abbs file.

     
  • Valerie Wood

    Valerie Wood - 2012-01-04

    OK I guess we stick as much in the file as is reasonable, and then see what's left....

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-04

    I'll look into checking against GO.xrf_abbs when loading.
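
    A minimal sketch of where such a check might start, assuming GO.xrf_abbs is laid out as blank-line-separated stanzas containing an "abbreviation:" line, optional "synonym:" lines, and "!" comment lines. The function name parse_xrf_abbs and the sample text are hypothetical and are not taken from the loader:

```python
from typing import Dict, Iterable, Optional


def parse_xrf_abbs(lines: Iterable[str]) -> Dict[str, str]:
    """Map every abbreviation and synonym to its primary abbreviation.

    Assumption: stanzas are separated by blank lines, each has one
    'abbreviation:' line and zero or more 'synonym:' lines, and lines
    starting with '!' are comments. Other keys are ignored.
    """
    canonical: Dict[str, str] = {}
    current: Optional[str] = None  # primary abbreviation of the current stanza
    for raw in lines:
        line = raw.strip()
        if not line:
            current = None  # a blank line ends the stanza
            continue
        if line.startswith("!"):
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "abbreviation":
            current = value
            canonical[value] = value
        elif key == "synonym" and current is not None:
            canonical[value] = current
    return canonical


# Hypothetical excerpt, written to match the two cases described in this thread.
SAMPLE = """\
! illustration only
abbreviation: PSI-MOD
synonym: MOD

abbreviation: Swiss-Prot
synonym: UniProtKB/Swiss-Prot
"""

PREFIXES = parse_xrf_abbs(SAMPLE.splitlines())
```

    With a map like this, a prefix such as "MOD" resolves to "PSI-MOD", and anything absent from the map (for example "Uniprot") can be flagged for review.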

     
  • Midori Harris

    Midori Harris - 2012-01-04

    Thanks - it'll help to have an updated list of what's missing, so I can add any others that should be in the file.

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-09

    I've made some changes to the curation tool Chado viewer to make these extra database names easier to track down. I've only changed the test version so far:
    http://oliver0.sysbiol.cam.ac.uk/test/chado

    The database list now has a column that is the number of xrefs/accession numbers for the database:
    http://oliver0.sysbiol.cam.ac.uk/test/view/list/db?model=chado

    If you look at the details of a database, e.g. BioCyc, it now has a list of the xrefs for the database:
    http://oliver0.sysbiol.cam.ac.uk/test/view/object/db/113?model=chado

    If the "cvterm" column has a value it means that the accession is the primary term ID of a term/cvterm. If the cvterm column is empty it means that no term has that ID, which usually means that the ID is a secondary-ish ID for the term.

    If you view the details of an accession, e.g. PWY-5271 on the BioCyc page, you'll see "Terms referenced via cvterm_dbxref". Sorry about the name. That's a list of the terms that have the accession as a secondary-ish ID. For PWY-5271, it lists the GO process term that has this accession as an xref.

    None of this works in the main tool yet. I wanted to make sure I haven't broken anything before I update the main one.

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-09

    I've also updated the Chado database that the main tool and test tool are reading from. It's now the latest load of Chado.

    Disappointingly the number of databases in the list has increased by 2 to 165.

    Added are: PO_REF and PubChem, referenced from the process ontology.

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-09

    Now I think about it, we can query the number of times an accession from each database is referenced from each cv. That will help us work out where the bogus ones come from.

    Firstly here's the list of database names, CV names and the counts. Most seem to be legitimate DB cross references. It's in Dropbox as: Dropbox/pombase/Chado/queries/counts_of_db_accessions_per_cv.txt:

    select db.name, cv.name, count(cd.cvterm_dbxref_id)
      from cv, cvterm t, cvterm_dbxref cd, dbxref x, db
     where t.cv_id = cv.cv_id
       and t.cvterm_id = cd.cvterm_id
       and cd.dbxref_id = x.dbxref_id
       and x.db_id = db.db_id
     group by db.name, cv.name
     order by db.name, cv.name;

    Next is a table of those DBs that don't have any accessions that are actually referenced. Dropbox/pombase/Chado/queries/unreferenced_dbs.txt:

    select name, description
      from db
     where name not in (
             select db.name
               from cv, cvterm t, cvterm_dbxref cd, dbxref x, db
              where t.cv_id = cv.cv_id
                and t.cvterm_id = cd.cvterm_id
                and cd.dbxref_id = x.dbxref_id
                and x.db_id = db.db_id)
       and db_id not in (
             select db_id
               from dbxref x, cvterm t
              where t.dbxref_id = x.dbxref_id)
     order by name;

    It looks like most of the unreferenced DBs are added by the GMOD script that initialises the Chado database. I've fixed that now I know where it happens. If we query again, but exclude the DBs added by GMOD, we get a very short list - just 6. I'll investigate them: Dropbox/pombase/Chado/queries/non_gmod_unreferenced_dbs.txt

    select *
      from db
     where name not in (
             select db.name
               from cv, cvterm t, cvterm_dbxref cd, dbxref x, db
              where t.cv_id = cv.cv_id
                and t.cvterm_id = cd.cvterm_id
                and cd.dbxref_id = x.dbxref_id
                and x.db_id = db.db_id)
       and db_id not in (
             select db_id
               from dbxref x, cvterm t
              where t.dbxref_id = x.dbxref_id)
       and name not in (
             'ATCC', 'Affymetrix_U133', 'Affymetrix_U133PLUS', 'Affymetrix_U95',
             'EMBL', 'GFF_source', 'GR', 'GenBank_protein', 'LocusLink', 'OMIM',
             'PFAM', 'PIR', 'PRINTS', 'PRODOM', 'PROFILE', 'RefSNP',
             'RefSeq_protein', 'SGD', 'SMART', 'SUPERFAMILY', 'Swiss', 'TIGR',
             'TIGRFAMs', 'TSC', 'genbank', 'genbank:mrna', 'genbank:protein',
             'locuslink', 'omim', 'pfam', 'refseq', 'refseq:mrna',
             'refseq:protein', 'swissprot:display', 'ucla', 'ucsc', 'unigene',
             'uniprot')
     order by name;

    Results:
    GI
    GOA
    Rfam
    SPD
    Uniprot
    UniProtKB/Swiss-Prot
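
    For anyone wanting to try the "unreferenced DBs" idea without a Chado instance, here is a rough sketch using Python's sqlite3 module. The schema is a heavily cut-down, hypothetical subset of the tables (only the columns the query needs), the sample rows are made up apart from the PWY-5271 accession mentioned earlier, and the query uses NOT EXISTS rather than the NOT IN form above, so treat it as an approximation and not as the query that produced these results:

```python
import sqlite3

# Hypothetical, minimal stand-in for the relevant Chado tables.
SCHEMA = """
CREATE TABLE db (db_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, db_id INTEGER, accession TEXT);
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, dbxref_id INTEGER);
CREATE TABLE cvterm_dbxref (cvterm_dbxref_id INTEGER PRIMARY KEY,
                            cvterm_id INTEGER, dbxref_id INTEGER);
"""

UNREFERENCED = """
SELECT db.name
FROM db
WHERE NOT EXISTS (          -- no accession used as a secondary xref of a term
        SELECT 1 FROM dbxref x
        JOIN cvterm_dbxref cd ON cd.dbxref_id = x.dbxref_id
        WHERE x.db_id = db.db_id)
  AND NOT EXISTS (          -- no accession used as a primary term ID
        SELECT 1 FROM dbxref x
        JOIN cvterm t ON t.dbxref_id = x.dbxref_id
        WHERE x.db_id = db.db_id)
ORDER BY db.name;
"""


def unreferenced_dbs(conn: sqlite3.Connection) -> list:
    """Return names of dbs with no accession referenced by any term."""
    return [row[0] for row in conn.execute(UNREFERENCED)]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO db VALUES (?, ?)",
                 [(1, "GO"), (2, "BioCyc"), (3, "Uniprot")])
conn.executemany("INSERT INTO dbxref VALUES (?, ?, ?)",
                 [(10, 1, "0000001"), (11, 2, "PWY-5271")])
conn.execute("INSERT INTO cvterm VALUES (100, 10)")            # GO accession is a term ID
conn.execute("INSERT INTO cvterm_dbxref VALUES (1, 100, 11)")  # BioCyc xref on that term

RESULT = unreferenced_dbs(conn)
```

    In this toy data only "Uniprot" has no referenced accessions, so it is the only name returned. Like the queries above, the sketch ignores references from features (feature_dbxref).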

     
  • Valerie Wood

    Valerie Wood - 2012-01-09

    In contigs:
    1.
    fixed
    mating_type_region.contig:FT /db_xref="GOA:P10841"
    mating_type_region.contig:FT /db_xref="GOA:P10842"

    which was duplicated as UniprotKB, so deleted

    2.
    GI is a couple of refs in the mating type contig. These aren't important (whatever they are, they come from the contributing EMBL entries). I deleted them from our contig.

    3.
    SPD
    is in our contigs. It comes from the mappings to the pombe localization data
    /db_xref="SPD:08/08E08"
    so this one needs adding to dbxrefs.
    The URL is:
    http://www.riken.jp/SPD/50/50D08.html

    The others,

    Rfam
    Uniprot
    UniProtKB/Swiss-Prot

    are not in the EMBL flat files; might they be coming from the external GO data?

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-09

    I've updated the test curation tool to show the dbxrefs referenced from features in Chado. Eg.
    http://oliver0.sysbiol.cam.ac.uk/test/view/object/dbxref/109676?model=chado

    If a dbxref is referenced by a feature it must come from the EMBL contig files.

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-09

    Here are the counts of the databases referenced from the EMBL file (from the last load).

    select db.name, count(fx.feature_dbxref_id)
      from feature_dbxref fx, dbxref x, db
     where fx.dbxref_id = x.dbxref_id
       and x.db_id = db.db_id
     group by db.name
     order by db.name;

    name      | count
    ----------+-------
    EMBL      |   434
    GI        |     4
    GOA       |     4
    InterPro  |    10
    KEGG      |   520
    PMID      |  3166
    Rfam      |    24
    SPD       | 10023
    UniProtKB |    24

     
  • Valerie Wood

    Valerie Wood - 2012-01-09

    Just to confirm,
    there are now no instances of
    Uniprot
    or
    UniProtKB/Swiss-Prot
    coming from *anywhere*?

    (external or internal?)

    Val

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-09

    There is a db in Chado with the name "Uniprot" and one with the name "UniProtKB/Swiss-Prot". Neither is referenced by a cvterm or a feature, so they aren't created from loading the OBO, BioGRID or EMBL files.

    Most likely the GMOD initialisation script is creating them. I'll dig around.

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-09

    Sorry, that wasn't right. Neither of those Uniprot DBs has any dbxrefs/accessions in Chado. It's the dbxrefs that get referenced by a cvterm or a feature.

    It doesn't seem to be GMOD that's creating them. Odd.

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-10

    After discussion, we will implement the checks against GO.xrf_abbs as a database consistency check later on, so no warnings while loading.
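
    One possible shape for such a post-load consistency check, as a sketch only. It takes the db names found in Chado plus a map from accepted spellings to primary prefixes, and returns a report instead of emitting warnings during loading. The map here is hard-coded from the cases discussed earlier in this thread; in practice it would presumably be built from GO.xrf_abbs. The name check_db_names is hypothetical:

```python
from typing import Dict, Iterable, List, Tuple

# Illustrative map only: keys are accepted spellings, values the primary prefix.
KNOWN: Dict[str, str] = {
    "CHEBI": "CHEBI",
    "PSI-MOD": "PSI-MOD",
    "MOD": "PSI-MOD",
    "Swiss-Prot": "Swiss-Prot",
    "UniProtKB/Swiss-Prot": "Swiss-Prot",
    "UniProtKB": "UniProtKB",
    "PMID": "PMID",
}


def check_db_names(
    names: Iterable[str], known: Dict[str, str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split db names into (synonym, primary) pairs and unknown names.

    Names that already equal their primary prefix are fine and not reported.
    Matching is exact, so case variants such as 'ChEBI' count as unknown.
    """
    synonyms: List[Tuple[str, str]] = []
    unknown: List[str] = []
    for name in sorted(set(names)):
        primary = known.get(name)
        if primary is None:
            unknown.append(name)
        elif primary != name:
            synonyms.append((name, primary))
    return synonyms, unknown


SYNONYMS, UNKNOWN = check_db_names(
    ["CHEBI", "ChEBI", "MOD", "TAIT", "UniProtKB"], KNOWN
)
```

    Here "MOD" is reported as a synonym of "PSI-MOD", while "ChEBI" and "TAIT" come back as unknown and need a human decision (fix the input file, or add an entry to the abbreviations file).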

     
  • Kim Rutherford

    Kim Rutherford - 2012-01-10

    Removing the random databases loaded by the GMOD initialisation code and re-loading reduces the number of DBs to 130.

     
  • Valerie Wood

    Valerie Wood - 2012-10-20

    Which URL gives the list of DBs?

     
  • Valerie Wood

    Valerie Wood - 2012-10-23

    That's the one. I am always confused about how this is populated.
    It has entries like
    ChEBI 102
    CHEBI 517

    goc 1
    GOC 224

    lots of spellings for Uniprot

    And it has entries where we haven't used the database much but the number is quite high
    CL 105

    and some that I am not aware we have used at all (BRENDA)

    Does it need a purge, or don't we need to worry about it?
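
    A quick way to surface pairs like ChEBI/CHEBI and goc/GOC is to group the db names case-insensitively. This is a sketch under the assumption that the names have already been pulled out of the db table into a list; case_variant_groups is a hypothetical helper, not part of the curation tool:

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def case_variant_groups(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group names that differ only by letter case; keep groups of 2 or more."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    return {
        key: sorted(variants)
        for key, variants in sorted(groups.items())
        if len(variants) > 1
    }


# Names taken from the examples in this comment.
DB_NAMES = ["ChEBI", "CHEBI", "goc", "GOC", "CL", "BRENDA"]
VARIANTS = case_variant_groups(DB_NAMES)
```

    For the names above this yields one group for chebi and one for goc; CL and BRENDA have no case variants and are left out. It only catches differences in case, so typos such as "PomBse" would still need a separate check.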

     
  • Valerie Wood

    Valerie Wood - 2012-10-23

    Lowering priority anyway as it doesn't appear to affect us....

     
  • Valerie Wood

    Valerie Wood - 2013-09-03

    Transferred from duplicate
    https://sourceforge.net/p/pombase/chado/64/


    I am confused by the list of databases
    http://oliver0/pombe/view/list/db?numrows=200&page=1&model=chado

    This contains lots of duplicates and non-standard database prefixes.
    (We use standard prefixes as described here: http://www.geneontology.org/cgi-bin/xrefs.cgi. I think we found that not all of our prefixes were in this file, so we will need to supplement this list. However, there should only be one way of referring to any given database, and this file should be the default.)

    e.g.
    uniprot (10)
    Uniprot (0)
    UniProt (3)
    UniProtkB/Swiss-Prot
    UniProtKB (14) This is the correct version. We have used this at least 2258 times, so what do these numbers refer to?

     
  • Valerie Wood

    Valerie Wood - 2013-09-03
    • summary: List of all dbs --> List of External databases used in Chado (standardization of database prefixes)
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,3 @@
    -
    
     In this list we have
     ChEBI/CHEBI
    
    • Group: --> next-load
     