In this list we have
ChEBI/CHEBI
PomBse
PubMed/PMID
Rfam/RFAM
TAIT possibly TAIR?
Uniprot/UniProKB/ UniProtKB/Swiss-Prot
url/URL
Need to
1. Identify correct ones
2. Identify where errors are coming from
3. Fix
4. Implement checks to prevent incorrect IDs (dbxrefs list = PomBase-specific/in-house?)
some thoughts ...
For databases, correct abbreviations and prefixes should all be in the GO.xrf_abbs file. That will cover these:
BFO
CHEBI
ECO
EMBL
GO
GOA
GOC
GO_REF
InterPro
KEGG
MOD*
OBI
OBO_REL
PATO
Pfam
PhenoScape
PomBase
PMID (for PubMed)
RESID
Rfam
SO
TAIR
UniMod
Note 1: for "MOD" the primary abbreviation in the GO file is PSI-MOD, but they use the prefix MOD so that's listed as a synonym in the file.
Note 2: Some aren't in yet, e.g. ECO, FYPO, but I'll add them today.
I hope we don't end up having to maintain a separate list for PomBase because it will go out of sync with GO's at the slightest provocation. But I doubt that everything on this list would fit the scope of the GO file.
This lot are used in PSI-MOD property_value tags, and don't identify databases in their own right:
DeltaMass
DiffAvg
DiffFormula
DiffMono
FormalCharge
Formula
MassAvg
MassMono
Source
TermSpec
I assume they turn up on the db list due to how PSI-MOD is being stored, and I'd be really surprised if we have any reason to use them anywhere else.
A few other notes:
- PomBse is obviously a typo
- ChEBI, PFAM, RFAM should be corrected to CHEBI, Pfam, Rfam respectively
- Uniprot should be corrected to UniProtKB (don't get me started on why-t-f they had to add the damn "KB" ...)
- The UniProt folks like the "UniProtKB/Swiss-Prot" designation, but the rest of the world thinks using a slash in a db prefix is more annoying than clever, so GO uses "Swiss-Prot" as the primary abbreviation. "UniProtKB/Swiss-Prot" is a synonym, though, so we can keep using it if you want.
- feature_cvtermprop_type looks like a table name
- 'internal' and 'null' also look like, well, internal gubbins
- I can't tell where GI, _global, OMSSA, Origin, SOFP, TAIT, url or URL come from, but most of them don't look like databases.
cheers,
m
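A rough Python sketch of how the corrections in the notes above could be captured in a lookup. The `CANONICAL` table is hand-copied from this thread rather than read from GO.xrf_abbs, and the case-insensitive matching and the name `canonical_db_name` are assumptions made for illustration only.

```python
# Canonical prefixes with the alternative spellings mentioned in this thread.
# Illustrative subset only; the authoritative list would be GO.xrf_abbs.
CANONICAL = {
    "CHEBI": [],
    "Pfam": [],
    "Rfam": [],
    "PomBase": [],
    "TAIR": [],
    "PMID": ["PubMed"],
    "UniProtKB": ["Uniprot"],
    "Swiss-Prot": ["UniProtKB/Swiss-Prot"],
    "PSI-MOD": ["MOD"],
}


def build_lookup(canonical):
    """Map every lower-cased primary name and synonym to the primary name."""
    lookup = {}
    for primary, synonyms in canonical.items():
        for name in [primary, *synonyms]:
            lookup[name.lower()] = primary
    return lookup


LOOKUP = build_lookup(CANONICAL)


def canonical_db_name(name):
    """Return the canonical prefix for name, or None if it is not recognised."""
    return LOOKUP.get(name.strip().lower())


if __name__ == "__main__":
    for seen in ["ChEBI", "PFAM", "RFAM", "Uniprot", "PomBse"]:
        print(seen, "->", canonical_db_name(seen))
```

Case-insensitive matching catches ChEBI/PFAM/RFAM, but a genuine typo such as PomBse comes back as `None` and still needs a human to look at it.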
Thanks for doing the investigation Midori.
I'll try to track down which input files the bogus db names come from, and then update the database visible from the curation tool.
The GMOD Chado database creator adds some of the more obscure ones like "null". Some of them (like "TAIT") are gone in my latest load, so they must have been temporary typos.
The others mostly come from one of the OBO files, BioGRID or GOA. For example "OMSSA" comes from PSI-MOD.obo. Comparing them against GO.xrf_abbs might not help us much since there are quite a lot of non-GO files that we load.
It's going to take a bit of work to get a totally clean list, but we can do the easy ones now and delete any extras after the final load of Chado. It's very easy to do once it's in.
>GO.xrf_abbs might not help us much
I have a feeling Midori is the arbiter of this file ;)
Well, there have been actual updates by actual other people since I left ;)
The real limitation with the GO.xrf_abbs file isn't that we have a lot of other things to load, it's that not all of the things getting loaded correspond to databases. GO.xrf_abbs doesn't actively exclude things that aren't used directly by GO, and it's getting some use as a de-facto abbreviation set for general MOD uses in the ongoing absence of anything better/more centralized/more generic.
Missed OMSSA in the psi-dev.obo file yesterday. I've plunked it into the abbs file so people will know what it stands for (and there are some other pretty sparse entries in, so the precedent's established). I think I oughta draw the line at the PSI-MOD property_value gubbins, though; I can't really make a decent case for putting those in the abbs file.
OK I guess we stick as much in the file as is reasonable, and then see what's left....
I'll look into checking against GO.xrf_abbs when loading.
Thanks - it'll help to have an updated list of what's missing, so I can add any others that should be in the file.
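For what it's worth, a check against the abbreviations file might look roughly like the Python sketch below. It assumes the file is a series of blank-line-separated stanzas with `abbreviation:` and `synonym:` keys and `!` comment lines; that layout, the `SAMPLE` excerpt, and the function names are assumptions for illustration, not the real file contents.

```python
# Hypothetical excerpt in the assumed stanza layout (not copied from the real file).
SAMPLE = """\
! illustration only
abbreviation: PSI-MOD
synonym: MOD

abbreviation: Swiss-Prot
synonym: UniProtKB/Swiss-Prot

abbreviation: CHEBI
"""


def parse_abbs(text):
    """Map every abbreviation and synonym to its primary abbreviation."""
    lookup = {}
    primary = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            primary = None  # a blank line ends the current stanza
            continue
        if line.startswith("!"):
            continue  # comment line
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key: value line
        key, value = key.strip(), value.strip()
        if key == "abbreviation":
            primary = value
            lookup[value] = primary
        elif key == "synonym" and primary is not None:
            lookup[value] = primary
    return lookup


def unknown_db_names(db_names, lookup):
    """Return the names that are neither a primary abbreviation nor a synonym."""
    return sorted(name for name in db_names if name not in lookup)


if __name__ == "__main__":
    lookup = parse_abbs(SAMPLE)
    print(unknown_db_names(["CHEBI", "ChEBI", "OMSSA", "MOD"], lookup))
```

The comparison here is deliberately exact, so a case variant like ChEBI is reported as unknown instead of being silently accepted.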
I've made some changes to the curation tool Chado viewer to make these extra database names easier to track down. I've only changed the test version so far:
http://oliver0.sysbiol.cam.ac.uk/test/chado
The database list now has a column that is the number of xrefs/accession numbers for the database:
http://oliver0.sysbiol.cam.ac.uk/test/view/list/db?model=chado
If you look at the details of a database, eg. BioCyc, it now has a list of the xrefs for the database:
http://oliver0.sysbiol.cam.ac.uk/test/view/object/db/113?model=chado
If the "cvterm" column has a value it means that the accession is the primary term ID of a term/cvterm. If the cvterm column is empty it means that no term has that ID, which usually means that the ID is a secondary-ish ID for the term.
If you view the details of an accession, eg PWY-5271 on the BioCyc page you'll see "Terms referenced via cvterm_dbxref". Sorry about the name. That's a list of the terms that have the accession as a secondary-ish ID. For PWY-5271, it lists the GO process term that has this accession as an xref.
None of this works in the main tool yet. I wanted to make sure I haven't broken anything before I update the main one.
I've also updated the Chado database that the main tool and test tool are reading from. It's now the latest load of Chado.
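To make the primary versus secondary-ish ID distinction above concrete, here is a small Python sketch against a toy SQLite copy of the relevant tables. Only the table and column names come from Chado; the cut-down schema, the toy rows (apart from the PWY-5271 accession), the use of SQLite instead of our real database, and the function `classify_accession` are assumptions for illustration.

```python
import sqlite3

# Cut-down stand-in for the Chado tables involved (toy schema, not the real DDL).
SCHEMA = """
CREATE TABLE db (db_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, db_id INTEGER, accession TEXT);
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT, dbxref_id INTEGER);
CREATE TABLE cvterm_dbxref (cvterm_dbxref_id INTEGER PRIMARY KEY,
                            cvterm_id INTEGER, dbxref_id INTEGER);
"""


def build_toy():
    """Create an in-memory database with a handful of invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO db VALUES (?, ?)", [(1, "GO"), (2, "BioCyc")])
    conn.executemany("INSERT INTO dbxref VALUES (?, ?, ?)", [
        (1, 1, "toy-primary-1"),  # primary ID of the toy term
        (2, 2, "PWY-5271"),       # xref attached to the toy term
        (3, 2, "toy-unused"),     # accession that nothing points at
    ])
    conn.execute("INSERT INTO cvterm VALUES (1, 'toy process term', 1)")
    conn.execute("INSERT INTO cvterm_dbxref VALUES (1, 1, 2)")
    return conn


def classify_accession(conn, db_name, accession):
    """Return (kind, term names), kind in primary/secondary/unused/unknown."""
    row = conn.execute(
        "SELECT x.dbxref_id FROM dbxref x JOIN db ON x.db_id = db.db_id "
        "WHERE db.name = ? AND x.accession = ?", (db_name, accession)).fetchone()
    if row is None:
        return ("unknown", [])
    dbxref_id = row[0]
    # Primary: cvterm.dbxref_id points straight at the accession.
    primary = [r[0] for r in conn.execute(
        "SELECT name FROM cvterm WHERE dbxref_id = ? ORDER BY name", (dbxref_id,))]
    if primary:
        return ("primary", primary)
    # Secondary-ish: the accession is linked to a term via cvterm_dbxref.
    secondary = [r[0] for r in conn.execute(
        "SELECT t.name FROM cvterm t "
        "JOIN cvterm_dbxref cd ON cd.cvterm_id = t.cvterm_id "
        "WHERE cd.dbxref_id = ? ORDER BY t.name", (dbxref_id,))]
    return ("secondary", secondary) if secondary else ("unused", [])


if __name__ == "__main__":
    print(classify_accession(build_toy(), "BioCyc", "PWY-5271"))
```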
Disappointingly the number of databases in the list has increased by 2 to 165.
Added are: PO_REF and PubChem, referenced from the process ontology.
Now I think about it, we can query the number of times an accession from each database is referenced from each cv. That will help us work out where the bogus ones come from.
Firstly here's the list of database names, CV names and the counts. Most seem to be legitimate DB cross references. It's in Dropbox as: Dropbox/pombase/Chado/queries/counts_of_db_accessions_per_cv.txt:
select db.name, cv.name, count(cd.cvterm_dbxref_id)
  from cv, cvterm t, cvterm_dbxref cd, dbxref x, db
 where t.cv_id = cv.cv_id
   and t.cvterm_id = cd.cvterm_id
   and cd.dbxref_id = x.dbxref_id
   and x.db_id = db.db_id
 group by db.name, cv.name
 order by db.name, cv.name;
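For anyone who wants to see what that query counts without access to Chado, here is a Python sketch that runs it on a toy in-memory SQLite database. The cut-down schema and every row except the PWY-5271 accession are invented, and SQLite is only a stand-in for the real database, so treat it as an illustration of the joins and nothing more.

```python
import sqlite3

# Cut-down stand-in for the five Chado tables the query touches (toy schema).
SCHEMA = """
CREATE TABLE db (db_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, db_id INTEGER, accession TEXT);
CREATE TABLE cv (cv_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, cv_id INTEGER,
                     name TEXT, dbxref_id INTEGER);
CREATE TABLE cvterm_dbxref (cvterm_dbxref_id INTEGER PRIMARY KEY,
                            cvterm_id INTEGER, dbxref_id INTEGER);
"""

# Same joins as the query in this comment.
COUNT_QUERY = """
select db.name, cv.name, count(cd.cvterm_dbxref_id)
  from cv, cvterm t, cvterm_dbxref cd, dbxref x, db
 where t.cv_id = cv.cv_id
   and t.cvterm_id = cd.cvterm_id
   and cd.dbxref_id = x.dbxref_id
   and x.db_id = db.db_id
 group by db.name, cv.name
 order by db.name, cv.name
"""


def toy_counts():
    """Load invented rows and return (db name, cv name, xref count) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO db VALUES (?, ?)",
                     [(1, "GO"), (2, "BioCyc"), (3, "OMSSA"), (4, "MOD")])
    conn.executemany("INSERT INTO cv VALUES (?, ?)",
                     [(1, "biological_process"), (2, "PSI-MOD")])
    conn.executemany("INSERT INTO dbxref VALUES (?, ?, ?)", [
        (1, 1, "toy-primary-1"), (2, 2, "PWY-5271"), (3, 2, "toy-xref-2"),
        (4, 3, "toy-xref-3"), (5, 4, "toy-primary-2")])
    conn.executemany("INSERT INTO cvterm VALUES (?, ?, ?, ?)", [
        (1, 1, "toy process term", 1), (2, 2, "toy modification term", 5)])
    # Term 1 carries two BioCyc xrefs, term 2 carries one OMSSA xref.
    conn.executemany("INSERT INTO cvterm_dbxref VALUES (?, ?, ?)",
                     [(1, 1, 2), (2, 1, 3), (3, 2, 4)])
    return conn.execute(COUNT_QUERY).fetchall()


if __name__ == "__main__":
    for row in toy_counts():
        print(row)
```

Note that only xrefs attached through cvterm_dbxref are counted; the primary IDs of the terms (GO and MOD in the toy data) do not show up in the output.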
Next is a table of those DBs that don't have any accessions that are actually referenced. Dropbox/pombase/Chado/queries/unreferenced_dbs.txt:
select name, description
  from db
 where name not in (select db.name
                      from cv, cvterm t, cvterm_dbxref cd, dbxref x, db
                     where t.cv_id = cv.cv_id
                       and t.cvterm_id = cd.cvterm_id
                       and cd.dbxref_id = x.dbxref_id
                       and x.db_id = db.db_id)
   and db_id not in (select db_id
                       from dbxref x, cvterm t
                      where t.dbxref_id = x.dbxref_id)
 order by name;
It looks like most of the unreferenced DBs are added by the GMOD script that initialises the Chado database. I've fixed that now I know where it happens. If we query again, but exclude the DBs added by GMOD, we get a very short list - just 6. I'll investigate them: Dropbox/pombase/Chado/queries/non_gmod_unreferenced_dbs.txt
select *
  from db
 where name not in (select db.name
                      from cv, cvterm t, cvterm_dbxref cd, dbxref x, db
                     where t.cv_id = cv.cv_id
                       and t.cvterm_id = cd.cvterm_id
                       and cd.dbxref_id = x.dbxref_id
                       and x.db_id = db.db_id)
   and db_id not in (select db_id
                       from dbxref x, cvterm t
                      where t.dbxref_id = x.dbxref_id)
   and name not in ('ATCC', 'Affymetrix_U133', 'Affymetrix_U133PLUS', 'Affymetrix_U95', 'EMBL', 'GFF_source', 'GR', 'GenBank_protein', 'LocusLink', 'OMIM', 'PFAM', 'PIR', 'PRINTS', 'PRODOM',
                    'PROFILE', 'RefSNP', 'RefSeq_protein', 'SGD', 'SMART', 'SUPERFAMILY', 'Swiss', 'TIGR', 'TIGRFAMs', 'TSC', 'genbank', 'genbank:mrna', 'genbank:protein', 'locuslink', 'omim', 'pfam',
                    'refseq', 'refseq:mrna', 'refseq:protein', 'swissprot:display', 'ucla', 'ucsc', 'unigene', 'uniprot')
 order by name;
Results:
GI
GOA
Rfam
SPD
Uniprot
UniProtKB/Swiss-Prot
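If we end up re-running the unreferenced-DB query often, the exclusion list could be passed in as a parameter. The Python sketch below shows the idea on a toy in-memory SQLite database; the cut-down schema, the toy rows, the short `GMOD_DBS` subset, and the simplified subqueries (which assume the foreign keys are intact) are all assumptions for illustration, not what is actually run against Chado.

```python
import sqlite3

# Short illustrative subset of the GMOD-created names listed in the query above.
GMOD_DBS = ["ATCC", "OMIM", "SGD"]

# Cut-down stand-in for the Chado tables involved (toy schema, not the real DDL).
SCHEMA = """
CREATE TABLE db (db_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, db_id INTEGER, accession TEXT);
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT, dbxref_id INTEGER);
CREATE TABLE cvterm_dbxref (cvterm_dbxref_id INTEGER PRIMARY KEY,
                            cvterm_id INTEGER, dbxref_id INTEGER);
"""


def make_toy_db():
    """Create an in-memory database with a handful of invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO db VALUES (?, ?)", [
        (1, "GO"), (2, "BioCyc"), (3, "GI"), (4, "Uniprot"), (5, "ATCC")])
    conn.executemany("INSERT INTO dbxref VALUES (?, ?, ?)", [
        (1, 1, "toy-primary-1"), (2, 2, "PWY-5271"), (3, 3, "toy-gi-1")])
    conn.execute("INSERT INTO cvterm VALUES (1, 'toy process term', 1)")
    conn.execute("INSERT INTO cvterm_dbxref VALUES (1, 1, 2)")
    return conn


def unreferenced_dbs(conn, exclude):
    """Names of DBs with no accession referenced by a cvterm, minus exclude.

    exclude must be a non-empty list of names.
    """
    placeholders = ", ".join("?" for _ in exclude)
    sql = f"""
        SELECT name FROM db
         WHERE db_id NOT IN (SELECT x.db_id FROM cvterm_dbxref cd
                               JOIN dbxref x ON cd.dbxref_id = x.dbxref_id)
           AND db_id NOT IN (SELECT x.db_id FROM cvterm t
                               JOIN dbxref x ON t.dbxref_id = x.dbxref_id)
           AND name NOT IN ({placeholders})
         ORDER BY name
    """
    return [row[0] for row in conn.execute(sql, exclude)]


if __name__ == "__main__":
    print(unreferenced_dbs(make_toy_db(), GMOD_DBS))
```

In the toy data GI has an accession but no cvterm points at it, which is the same situation as the feature-only databases in the results above.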
In contigs:
1. Fixed:
mating_type_region.contig:FT /db_xref="GOA:P10841"
mating_type_region.contig:FT /db_xref="GOA:P10842"
These were duplicated as UniProtKB, so I deleted them.
2. GI is a couple of refs in the mating type contig. These aren't important (whatever they are, they come from the contributing EMBL entries). I deleted them from our contig.
3. SPD is in our contigs. It comes from the mappings to the pombe localization data:
/db_xref="SPD:08/08E08"
so this one needs adding to dbxrefs.
The URL is:
http://www.riken.jp/SPD/50/50D08.html
The others,
Rfam
Uniprot
UniProtKB/Swiss-Prot
are not in the EMBL flat files. Might they be coming from the external GO data?
I've updated the test curation tool to show the dbxrefs referenced from features in Chado. Eg.
http://oliver0.sysbiol.cam.ac.uk/test/view/object/dbxref/109676?model=chado
If a dbxref is referenced by a feature it must come from the EMBL contig files.
Here are the counts of the databases referenced from the EMBL file (from the last load).
select db.name, count(fx.feature_dbxref_id) from feature_dbxref fx, dbxref x, db where fx.dbxref_id = x.dbxref_id and x.db_id = db.db_id group by db.name order by db.name;
   name    | count
-----------+-------
 EMBL      |   434
 GI        |     4
 GOA       |     4
 InterPro  |    10
 KEGG      |   520
 PMID      |  3166
 Rfam      |    24
 SPD       | 10023
 UniProtKB |    24
Just to confirm,
there are now no instances of
Uniprot
or
UniProtKB/Swiss-Prot
coming from *anywhere*?
(external or internal?)
Val
There is a db in Chado with the name "Uniprot" and one with the name "UniProtKB/Swiss-Prot". Neither is referenced by a cvterm or a feature, so they aren't created from loading the OBO, BioGRID or EMBL files.
Most likely the GMOD initialisation script is creating them. I'll dig around.
Sorry, that wasn't right. Neither of those Uniprot DBs has any dbxrefs/accessions in Chado. It's the dbxrefs that get referenced by a cvterm or a feature.
It doesn't seem to be GMOD that's creating them. Odd.
After discussion, we will implement the checks against GO.xrf_abbs as a database consistency check later on, so no warnings while loading.
Removing the random databases loaded by the GMOD initialisation code and re-loading reduces the number of DBs to 130.
Which URL gives the list of DBs?
Is this the list you need?:
http://curation.pombase.org/pombe/view/list/db?model=chado
That's the one. I am always confused about how this is populated.
It has entries like
ChEBI 102
CHEBI 517
goc 1
GOC 224
lots of spellings for Uniprot
And it has entries where we haven't used the database much but the number is quite high
CL 105
and some that I am not aware we have used at all (BRENDA)
Does it need a purge, or don't we need to worry about it?
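Case-only duplicates like the ones listed above are easy to spot mechanically. A minimal Python sketch, using the names and counts quoted in this comment as sample input; the function name `case_variant_groups` and the idea of grouping on the lower-cased name are assumptions for illustration.

```python
from collections import defaultdict


def case_variant_groups(db_counts):
    """Group database names that differ only by letter case.

    db_counts maps each database name to its xref count. Only groups with
    more than one spelling are returned, keyed by the lower-cased name.
    """
    groups = defaultdict(dict)
    for name, count in db_counts.items():
        groups[name.lower()][name] = count
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Names and counts as quoted from the database list above.
    counts = {"ChEBI": 102, "CHEBI": 517, "goc": 1, "GOC": 224, "CL": 105}
    for key, names in case_variant_groups(counts).items():
        print(key, names)
```

This only finds spellings that differ by case; variants such as Uniprot versus UniProtKB would still need a synonym list.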
Lowering priority anyway as it doesn't appear to affect us....
Transferred from duplicate
https://sourceforge.net/p/pombase/chado/64/
I am confused by the list of databases
http://oliver0/pombe/view/list/db?numrows=200&page=1&model=chado
This contains lots of duplicates and non standard database prefixes.
(We use standard prefixes as described here: http://www.geneontology.org/cgi-bin/xrefs.cgi. I think we found that not all of our prefixes were in this file, so we will need to supplement this list. However, there should only be one way of referring to any given database, and this file should be the default.)
e.g.
uniprot (10)
Uniprot (0)
UniProt (3)
UniProtkB/Swiss-Prot
UniProtKB (14) This is the correct version, we have used this at least 2258 times so what do these numbers refer to?