As per Skype call:
Numbers of annotations made total/ number of annotations per paper per year
Also, it would be useful if we could get an overview of the phenotype annotation:
i) How many phenotype annotations have we made so far (and using how many papers)
We may have further questions about phenotype data later.
For the manuscript, we'd also like a breakdown of the number of annotations for each of the top-level terms, and for any other term that has 500 or more annotations, counting both direct and indirect (i.e. inherited by transitivity).
These are the top-level terms:
altered effect on growth medium FYPO:0001155
biological process phenotype FYPO:0000300
cell phenotype FYPO:0000002
cell population phenotype FYPO:0000003
molecular function phenotype FYPO:0000652
normal phenotype FYPO:0000257
*abnormal phenotype
[this one doesn't exist yet; I'll put the ID here when it is in] FYPO:0001985
Does that all make sense?
cheers,
m
Last edit: Midori Harris 2013-02-14
p.s. for the annotation counts, can you let me know the date of the FYPO version you used?
(so I can do other stats from the same date)
ta
m
I've made a start on this. Here's the annotation counts by year and type:
Dropbox/pombase/Chado/queries/annotation_counts_by_year.tsv
I'll do the phenotype query next.
There are 9606 FYPO annotations in v32 from 858 publications.
Queries:
Querying the indirect children of the top level terms for FYPO may prove tricky.
We currently use a script from GMOD called "gmod_make_cvtermpath.pl" to find all the direct and indirect children and parents of each term (transitive closure?) and then store the results in Chado. The results go in a table called "cvtermpath". That works OK for GO, probably, as it is mostly used on GO.
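For illustration, here is a minimal sketch of what "transitive closure" means here: for each term, collect every direct and indirect parent. It is not the GMOD script's algorithm. The term names and the `direct_parents` map are hypothetical, and every edge is treated as a plain parent/child link, which is exactly the simplification that may not hold for the more complicated relation types.

```python
def transitive_closure(direct_parents):
    """Map each term to the set of all its direct and indirect parents."""
    closure = {}
    for term in direct_parents:
        seen = set()
        stack = list(direct_parents.get(term, ()))
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                # Follow the parent's own parents to reach indirect ancestors.
                stack.extend(direct_parents.get(parent, ()))
        closure[term] = seen
    return closure


# Hypothetical toy ontology (illustrative names, not real FYPO structure).
direct_parents = {
    "cell phenotype": [],
    "abnormal phenotype": [],
    "cell morphology phenotype": ["cell phenotype"],
    "abnormal cell morphology": ["cell morphology phenotype", "abnormal phenotype"],
    "elongated cell": ["abnormal cell morphology"],
}

closure = transitive_closure(direct_parents)
for term in sorted(closure):
    print(term, "->", sorted(closure[term]))
```

Each (term, ancestor) pair in `closure` corresponds roughly to one row of a path table such as cvtermpath. A term with two parents, like the hypothetical "abnormal cell morphology", ends up under more than one top-level term.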
According to this long thread:
http://gmod.827538.n3.nabble.com/filling-cvtermpath-td1824246.html
that script takes a few shortcuts and is likely to produce incorrect results for ontologies that use some of the more complicated sorts of relations. Perhaps FYPO has some of those relations.
For now, the loader only runs gmod_make_cvtermpath.pl for the GO ontologies as we haven't needed the indirect children in Chado for anything else (so far). The main use of the indirect children at the moment is the code that filters the GO annotation to remove annotation where there is a more specific annotation. We don't do that for FYPO (yet?).
I'll change the loader to run the GMOD script for FYPO too so that the indirect children are available to query, but there's a chance that there will be subtle (or unsubtle) problems.
The longer term solution is to use Oort to calculate the parents and children using a fancy ontology reasoner, then load the results. It would probably be faster too (the GMOD script takes hours per ontology). I thought there was a ticket about that but I can't find it. I'll add one.
These data seem to be the dates the annotations were made, rather than the dates of the publications they came from:
Dropbox/pombase/Chado/queries/annotation_counts_by_year.tsv
This wasn't very explicit was it ;)
What I was thinking was
Do this query for each year:
Number of publications / number of annotations supported by those publications
considering only the data from the curation tool (i.e. the "fully" curated papers)
(we might need to make adjustments for a few HTP papers, but I can probably do this manually, or identify the papers and filter them later)
So for example
For each year
i) Get the number of papers from the year && curated in the curation tool
ii) Get the number of annotations to those papers (can do the breakdown by GO/PRO/FYPO too, I didn't consider that but it will also be useful)
iii) divide ii by i (i.e. annotations per paper)
This is to demonstrate how much more data the later papers contain and how this is increasing, rather than the fact that we can capture more data types now.
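The per-year steps above can be sketched as follows. This is only an illustration: the rows are made-up placeholders, not real curation tool data, and it assumes the ratio wanted is annotations per paper (annotation count divided by paper count).

```python
from collections import defaultdict

# Hypothetical rows: (publication id, publication year, annotation count).
# In practice these would come from the curation tool database.
papers = [
    ("PMID:A", 2005, 4),
    ("PMID:B", 2005, 6),
    ("PMID:C", 2012, 20),
    ("PMID:D", 2012, 30),
    ("PMID:E", 2012, 10),
]


def annotations_per_paper_by_year(rows):
    """Return {year: average number of annotations per paper}."""
    paper_count = defaultdict(int)
    annotation_count = defaultdict(int)
    for _pub_id, year, n_annotations in rows:
        paper_count[year] += 1
        annotation_count[year] += n_annotations
    return {
        year: annotation_count[year] / paper_count[year]
        for year in sorted(paper_count)
    }


per_year = annotations_per_paper_by_year(papers)
for year, average in per_year.items():
    print(year, average)
```

HTP papers could be dropped from `papers` before calling the function, which matches the idea of filtering them out manually later.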
Val
Can we get the numbers from V 33 instead (much more data)
Val
That makes more sense than the annotation date.
Unfortunately the publication dates aren't in Chado so I can't easily
query all the annotation and dates.
If you just need the data from the curation tool, that's easier. The
publication dates are stored in the curation tool database.
I'll give it a go.
For the latest load (which will be v33 more or less) we have 10197 annotations from 908 papers.
That's impressive!
(~5000 are from the genome deletion paper, but it's still a lot!)
908 papers seems a bit odd, but I think this is ALL papers used so far for any curation, not just the ones which we have done phenotypes for, but that's fine
Why is 908 odd? The query is definitely returning the count of papers where there is at least one FYPO annotation in Chado.
There are 2406 papers in Chado that have annotation.
It's because we have only done tool curation of 500 odd papers. I guess the rest will come from the embl files?
Ah right, it seemed a lot because there are 530 with annotation in the curation tool, and these will not all have phenotype data, which means that the old annotations in Artemis came from 300 publications. I didn't think it was that many, but it could easily be now I think about it...
Going back to the much earlier comment about paths etc. ...
FYPO does have some "more complicated" relations, but not many so far. It is a very good idea to switch to Oort/reasoning/etc. for dealing with paths, but for the most immediate need (i.e. the FYPO manuscript) I think we'll be OK even if the counts aren't totally precise. We really just need to give a decent, fairly accurate idea of how much we've used the ontology for annotation, and the numbers will be shockingly out of date by the time the paper sees the light of day no matter what.
m
OK, the next load will have the paths filled in for FYPO, so we can see how it looks.
The new load is done including the "paths" for FYPO
Here are the FYPO top level counts from the latest Chado (2013-02-19):
The numbers look rather large for some of those. Do they look too large?
Forgot to add: I used this code for generating the queries:
Actually, the numbers look pretty reasonable, assuming they're counting individual annotations (as opposed to annotated genes or alleles). I would expect them to add up to more than the 9K or 10K total annotations because of multiple paths in the ontology.
Would it be easy to get the number of genes and/or number of alleles annotated too?
m
Yep, those are counts of annotations, not genes or alleles. I'm glad the numbers look OK.
These are the counts of alleles:
altered effect on growth medium - 14
biological process phenotype - 1391
cell phenotype - 6262
cell population phenotype - 54
molecular function phenotype - 115
normal phenotype - 333
abnormal phenotype - 1200
(Made by replacing "count(feature_cvterm_id)" with "count(distinct feature_id)" in the code above)
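A small sketch of the difference between the two counts, using an in-memory SQLite database as a stand-in for Chado. The two tables are cut down to the columns needed here, the data are invented, and the query is an assumption about the general shape, not the query actually used for the numbers above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature_cvterm (
    feature_cvterm_id INTEGER PRIMARY KEY,
    feature_id INTEGER,
    cvterm_id INTEGER
);
-- Simplified path table: subject_id is the descendant, object_id the ancestor.
CREATE TABLE cvtermpath (subject_id INTEGER, object_id INTEGER);

-- Term 1 stands for a top-level term; terms 2 and 3 are its descendants.
INSERT INTO cvtermpath VALUES (2, 1), (3, 1), (3, 2);

-- Feature 10 has three annotations, feature 11 has one.
INSERT INTO feature_cvterm (feature_id, cvterm_id)
VALUES (10, 1), (10, 2), (10, 3), (11, 3);
""")

QUERY = """
SELECT count(feature_cvterm_id), count(DISTINCT feature_id)
  FROM feature_cvterm
 WHERE cvterm_id = :top
    OR cvterm_id IN (SELECT subject_id FROM cvtermpath WHERE object_id = :top)
"""

annotation_count, allele_count = conn.execute(QUERY, {"top": 1}).fetchone()
print("annotations:", annotation_count, "distinct features:", allele_count)
```

Counting rows gives every direct and indirect annotation under the term, while `count(DISTINCT feature_id)` collapses repeated annotations of the same feature into one.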
I've queried for each term and the count of annotation to that term and all children.
It's here: Dropbox/pombase/Chado/queries/child_annotation_counts-v33.tsv
Is that any use?
This is the SQL that makes the table:
Interesting.
I thought there would be more rows in this list though?
It includes all ontologies, right? And indirect/direct annotations?
Currently, for GO alone we use 4134 terms (current GAF), (direct only)
and for phenotype it should be more than the diff (485)
in fact if I grep on "phenotype" in this file I get 572
maybe I am misunderstanding what the child counts are?
v
The query should include all terms that have annotation or have a child term with annotation, so I'll investigate.
I've just run it on the database I sent to Mark today (the one with the slightly dodgy cell cycle changes). I got 15563 rows this time - much better. I must have been using an incomplete copy of the Chado database. I don't know how that could happen.
I'll run it again once the next load (with the fixed cell cycle stuff) is done.
That sounds more like it....
How easy is the average annotations per publication /per year query?
If we get that today Antonia might be able to put a graph in her poster (I think she needs to print it on Thursday)
Not to worry if it is tricky, we can use it next time.
Val
The average number of annotations per paper is: 14.1447
That number is slightly skewed by two papers that have thousands of annotations:
(I removed the GO_REF and null "publications" from the list.)
There are lots of "null" publications in that list, one for each annotation that doesn't have a publication. Most of those come from /controlled_curation annotations that have no db_xref.
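As a rough sketch of that calculation: count annotations per publication, skip GO_REF and null entries, then average. The identifiers below are placeholders, not real data, and the median is included only to show how a paper with a very large annotation count pulls the mean up.

```python
from collections import Counter
from statistics import median

# Hypothetical list with one publication identifier per annotation.
annotation_pubs = (
    ["PMID:1"] * 10
    + ["PMID:2"] * 2
    + ["PMID:3"] * 3
    + ["GO_REF:0000000", None, None]
)


def per_paper_counts(pubs):
    """Count annotations per real paper, ignoring GO_REF and missing entries."""
    return Counter(
        pub for pub in pubs
        if pub is not None and not pub.startswith("GO_REF")
    )


counts = per_paper_counts(annotation_pubs)
mean_per_paper = sum(counts.values()) / len(counts)
median_per_paper = median(counts.values())
print(mean_per_paper, median_per_paper)
```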
The number of annotations per year will take longer to work out because our Chado doesn't have any publication details except for the PubMed ID. I wasn't planning on loading the publication details unless we really need them, as it's another dataset to maintain.