PomBase / Chado / #157 stats wanted

Midori Harris - 2013-02-12

For the manuscript, we'd also like a breakdown of the number of annotations for each of the top-level terms, and for any other term that has 500 or more annotations, counting both direct and indirect (i.e. inherited by transitivity) annotations.

These are the top-level terms:

altered effect on growth medium FYPO:0001155
biological process phenotype FYPO:0000300
cell phenotype FYPO:0000002
cell population phenotype FYPO:0000003
molecular function phenotype FYPO:0000652
normal phenotype FYPO:0000257
abnormal phenotype FYPO:0001985

Does that all make sense?

cheers,
m

Last edit: Midori Harris 2013-02-14

Midori Harris - 2013-02-14

p.s. for the annotation counts, can you let me know the date of the FYPO version you used?

(so I can do other stats from the same date)

ta
m

Kim Rutherford - 2013-02-20

I've made a start on this. Here are the annotation counts by year and type:
Dropbox/pombase/Chado/queries/annotation_counts_by_year.tsv

I'll do the phenotype query next.

There are 9606 FYPO annotations in v32 from 858 publications.

Queries:

select count(distinct pub.uniquename)
  from feature_cvterm fc, pub, cvterm t, cv
 where t.cv_id = cv.cv_id
   and fc.cvterm_id = t.cvterm_id and fc.pub_id = pub.pub_id
   and cv.name = 'fission_yeast_phenotype';

select count(fc.feature_cvterm_id)
  from feature_cvterm fc, pub, cvterm t, cv
 where t.cv_id = cv.cv_id and fc.cvterm_id = t.cvterm_id
   and fc.pub_id = pub.pub_id and cv.name = 'fission_yeast_phenotype';

Kim Rutherford - 2013-02-20

Querying the indirect children of the top level terms for FYPO may prove tricky.

We currently use a script from GMOD called "gmod_make_cvtermpath.pl" to find all the direct and indirect children and parents of each term (the transitive closure) and then store the results in Chado. The results go in a table called "cvtermpath". That probably works OK for GO, as GO is what the script is mostly used on.
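
To make the "transitive closure" idea concrete, here is a minimal Python sketch over a toy term graph. It is not what gmod_make_cvtermpath.pl does internally: the term relationships and the `closure` function are made up for illustration, only a single is_a-style relation is assumed, and the output mirrors just the subject, object and pathdistance values that the queries in this thread rely on.

```python
from collections import deque

# Hypothetical toy ontology: child term -> list of direct parents.
# The structure is illustrative only, not the real FYPO graph.
PARENTS = {
    "elongated cell": ["abnormal cell shape"],
    "abnormal cell shape": ["cell phenotype", "abnormal phenotype"],
    "cell phenotype": [],
    "abnormal phenotype": [],
}


def closure(parents):
    """Return {(subject, object): shortest path distance} for every
    term and each of its direct or indirect ancestors."""
    paths = {}
    for term in parents:
        # Breadth-first walk upwards, so the first visit is the shortest path.
        queue = deque((p, 1) for p in parents[term])
        while queue:
            ancestor, dist = queue.popleft()
            if (term, ancestor) in paths:
                continue  # already reached by a path at least as short
            paths[(term, ancestor)] = dist
            queue.extend((p, dist + 1) for p in parents.get(ancestor, []))
    return paths


if __name__ == "__main__":
    for (subject, obj), dist in sorted(closure(PARENTS).items()):
        print(subject, "->", obj, "distance", dist)
```

A reasoner-based approach would also have to handle relation types other than is_a, which this sketch deliberately ignores.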

According to this long thread:
http://gmod.827538.n3.nabble.com/filling-cvtermpath-td1824246.html
that script takes a few shortcuts and is likely to produce incorrect results for ontologies that use some of the more complicated sorts of relations. Perhaps FYPO has some of those relations.

For now, the loader only runs gmod_make_cvtermpath.pl for the GO ontologies as we haven't needed the indirect children in Chado for anything else (so far). The main use of the indirect children at the moment is the code that filters the GO annotation to remove annotation where there is a more specific annotation. We don't do that for FYPO (yet?).

I'll change the loader to run the GMOD script for FYPO too so that the indirect children are available to query, but there's a chance that there will be subtle (or unsubtle) problems.

The longer-term solution is to use Oort to calculate the parents and children using a fancy ontology reasoner, then load the results. It would probably be faster too (the GMOD script takes hours per ontology). I thought there was a ticket about that but I can't find it. I'll add one.

Valerie Wood - 2013-02-20

These data seem to be the dates the annotations were made, rather than the date of the publication they were from
Dropbox/pombase/Chado/queries/annotation_counts_by_year.tsv

This wasn't very explicit was it ;)

Numbers of annotations made total/ number of annotations per paper per year

What I was thinking was
Do this query for each year:
Number of publications / number of annotations supported by those publications
considering only the data from the curation tool (i.e. the "fully" curated papers)
(we might need to make adjustments for a few HTP papers, but I can probably do this manually, or identify the papers and filter them later)

So for example
For each year
i) Get the number of papers from the year && curated in the curation tool
ii) Get the number of annotations to those papers (can do the breakdown by GO/PRO/FYPO too, I didn't consider that but it will also be useful)
iii) divide ii by i (to get annotations per paper)

This is to demonstrate how much more data the later papers contain and how this is increasing, rather than the fact that we can capture more data types now.
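
The per-year calculation could be sketched roughly as follows. This reads the final step as annotations per paper (annotation count divided by paper count), which is an interpretation of the request. The input rows and the function name are hypothetical stand-ins for whatever the curation tool database actually provides.

```python
from collections import defaultdict

# Hypothetical (pubmed_id, publication_year, annotation_count) rows,
# standing in for data exported from the curation tool (not real data).
ROWS = [
    ("PMID:1", 2005, 4),
    ("PMID:2", 2005, 6),
    ("PMID:3", 2012, 20),
    ("PMID:4", 2012, 30),
    ("PMID:5", 2012, 10),
]


def annotations_per_paper_by_year(rows):
    """Return {year: (paper_count, annotation_count, annotations_per_paper)}."""
    papers = defaultdict(set)
    annotations = defaultdict(int)
    for pub_id, year, count in rows:
        papers[year].add(pub_id)      # step i: papers published in that year
        annotations[year] += count    # step ii: annotations from those papers
    return {
        year: (len(papers[year]), annotations[year],
               annotations[year] / len(papers[year]))
        for year in sorted(papers)
    }


if __name__ == "__main__":
    for year, stats in annotations_per_paper_by_year(ROWS).items():
        print(year, stats)
```

HTP papers could be dropped from the input rows before calling the function if they distort a particular year.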

Val

Valerie Wood - 2013-02-20

There are 9606 FYPO annotations in v32 from 858 publications.

Can we get the numbers from v33 instead (much more data)?

Val

Kim Rutherford - 2013-02-20

These data seem to be the dates the annotations were made, rather
than the date of the publication they were from

That makes more sense than the annotation date.

Unfortunately the publication dates aren't in Chado so I can't easily
query all the annotation and dates.

considering only the data from the curation tool (i.e. the "fully"
curated papers)

If you just need the data from the curation tool, that's easier. The
publication dates are stored in the curation tool database.

I'll give it a go.

Kim Rutherford - 2013-02-20

For the latest load (which will be v33 more or less) we have 10197 annotations from 908 papers.

Valerie Wood - 2013-02-20

That's impressive!
(~5000 are from the genome deletion paper, but it's still a lot!)
908 papers seems a bit odd, but I think this is ALL papers used so far for any curation, not just the ones which we have done phenotypes for, but that's fine.

Kim Rutherford - 2013-02-20

Why is 908 odd? The query is definitely returning the count of papers where there is at least one FYPO annotation in Chado.

There are 2406 papers in Chado that have annotation.

Antonia Lock - 2013-02-20

It's because we have only done tool curation of 500-odd papers. I guess the rest will come from the EMBL files?

Valerie Wood - 2013-02-20

Ah right, it seemed a lot because there are 530 with annotation in the curation tool, and these will not all have phenotype data, which means that the old annotations in Artemis came from 300 publications. I didn't think it was that many, but it could easily be now I think about it...

Midori Harris - 2013-02-20

Going back to the much earlier comment about paths etc. ...

FYPO does have some "more complicated" relations, but not many so far. It is a very good idea to switch to Oort/reasoning/etc. for dealing with paths, but for the most immediate need (i.e. the FYPO manuscript) I think we'll be OK even if the counts aren't totally precise. We really just need to give a decent, fairly accurate idea of how much we've used the ontology for annotation, and the numbers will be shockingly out of date by the time the paper sees the light of day no matter what.

m

Kim Rutherford - 2013-02-20

OK, the next load will have the paths filled in for FYPO, so we can see how it looks.

Kim Rutherford - 2013-02-22

The new load is done, including the "paths" for FYPO.

Here are the FYPO top level counts from the latest Chado (2013-02-19):

altered effect on growth medium - 15
biological process phenotype - 3436
cell phenotype - 9984
cell population phenotype - 74
molecular function phenotype - 166
normal phenotype - 777
abnormal phenotype - 2407

The numbers look rather large for some of those. Do they look too large?

Forgot to add: I used this code for generating the queries:

for my $name (
"altered effect on growth medium",
"biological process phenotype",
"cell phenotype",
"cell population phenotype",
"molecular function phenotype",
"normal phenotype",
"abnormal phenotype",
) {
  print qq|select '$name', count(feature_cvterm_id) from feature_cvterm where
    cvterm_id in (select subject_id from cvtermpath where object_id in (select
    cvterm_id from cvterm where name = '$name') and pathdistance > 0 UNION
    select cvterm_id from cvterm where name = '$name');\n|;
};
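
For anyone wanting to try the shape of that generated query without a Chado instance, here is a small sketch that runs it against an in-memory SQLite database from Python. Everything about the toy database is an assumption: the tables are cut down to the columns the query touches, the rows are invented, and the real database is PostgreSQL, not SQLite. The term name is passed as a bound parameter instead of being interpolated into the SQL string.

```python
import sqlite3

# Same shape as the generated query above, with the term name as a parameter.
QUERY = """
select count(feature_cvterm_id) from feature_cvterm
 where cvterm_id in (
       select subject_id from cvtermpath
        where object_id in (select cvterm_id from cvterm where name = :name)
          and pathdistance > 0
       union
       select cvterm_id from cvterm where name = :name)
"""


def make_toy_db():
    """Build a tiny, made-up database with only the columns the query needs."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        create table cvterm (cvterm_id integer primary key, name text);
        create table cvtermpath (subject_id integer, object_id integer,
                                 pathdistance integer);
        create table feature_cvterm (feature_cvterm_id integer primary key,
                                     feature_id integer, cvterm_id integer);
        insert into cvterm values (1, 'cell phenotype'),
                                  (2, 'abnormal cell shape'),
                                  (3, 'elongated cell');
        insert into cvtermpath values (2, 1, 1), (3, 2, 1), (3, 1, 2);
        insert into feature_cvterm values (10, 100, 1), (11, 100, 3),
                                          (12, 101, 3), (13, 102, 2);
    """)
    return db


def count_annotations(db, name):
    """Count annotations to the named term plus all of its descendants."""
    return db.execute(QUERY, {"name": name}).fetchone()[0]


if __name__ == "__main__":
    db = make_toy_db()
    for term in ("cell phenotype", "abnormal cell shape", "elongated cell"):
        print(term, count_annotations(db, term))
```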

Midori Harris - 2013-02-22

Actually, the numbers look pretty reasonable, assuming they're counting individual annotations (as opposed to annotated genes or alleles). I would expect them to add up to more than the 9K or 10K total annotations because of multiple paths in the ontology.
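
The posted top-level counts do sum to 16859, well above the roughly 10K total. A minimal sketch of why that happens, using invented terms and an invented mapping from each term to its top-level ancestors (none of this is taken from the real FYPO graph):

```python
from collections import Counter

# Hypothetical mapping: annotated term -> top-level terms it descends from.
TOP_LEVEL = {
    "elongated cell": {"cell phenotype", "abnormal phenotype"},
    "normal growth": {"cell population phenotype", "normal phenotype"},
    "abnormal septation": {"cell phenotype", "biological process phenotype",
                           "abnormal phenotype"},
}

# Hypothetical annotations, one list entry per annotation.
ANNOTATED_TERMS = ["elongated cell", "elongated cell",
                   "normal growth", "abnormal septation"]


def top_level_counts(annotated_terms, top_level):
    """Count each annotation once under every top-level ancestor of its term."""
    counts = Counter()
    for term in annotated_terms:
        for top in top_level[term]:
            counts[top] += 1
    return counts


if __name__ == "__main__":
    counts = top_level_counts(ANNOTATED_TERMS, TOP_LEVEL)
    print(dict(counts))
    print("annotations:", len(ANNOTATED_TERMS),
          "sum of top-level counts:", sum(counts.values()))
```

Because each annotation is counted under every top-level ancestor, the per-term counts cannot simply be added up to recover the total.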

Would it be easy to get the number of genes and/or number of alleles annotated too?

m

Kim Rutherford - 2013-02-24

Yep, those are counts of annotations, not genes or alleles. I'm glad the numbers look OK.

These are the counts of alleles:
altered effect on growth medium - 14
biological process phenotype - 1391
cell phenotype - 6262
cell population phenotype - 54
molecular function phenotype - 115
normal phenotype - 333
abnormal phenotype - 1200

(Made by replacing "count(feature_cvterm_id)" with "count(distinct feature_id)" in the code above)

I've queried for each term and the count of annotations to that term and all its children.
It's here: Dropbox/pombase/Chado/queries/child_annotation_counts-v33.tsv

Is that any use?

This is the SQL that makes the table:

create temp table all_cvtermpath as select subject_id, object_id from
   cvtermpath where pathdistance > 0;
insert into all_cvtermpath
   select cvterm_id as subject_id, cvterm_id object_id from cvterm;
create index all_cvtermpath_object on all_cvtermpath(object_id);
create index all_cvtermpath_subject on all_cvtermpath(subject_id);
select count(distinct feature_cvterm_id), t.name, cv.name
   from feature_cvterm fc,cvterm t, all_cvtermpath ap, cv
   where t.cv_id = cv.cv_id and t.cvterm_id = ap.object_id and
     fc.cvterm_id = subject_id
   group by cv.name, t.name  order by count desc;

Valerie Wood - 2013-03-02

Is that any use?

Interesting.
I thought there would be more rows in this list though?
It includes all ontologies right? and indirect/direct annotations?
Currently, for GO alone we use 4134 terms (current GAF, direct only)

and for phenotype it should be more than the diff (485)
in fact if I grep on "phenotype" in this file I get 572
maybe I am misunderstanding what the child counts are?

v

Kim Rutherford - 2013-03-02

The query should include all terms that have annotation or have a child term with annotation, so I'll investigate.

Kim Rutherford - 2013-03-06

I've just run it on the database I sent to Mark today (the one with the slightly dodgy cell cycle changes). I got 15563 rows this time - much better. I must have been using an incomplete copy of the Chado database. I don't know how that could happen.

I'll run it again once the next load (with the fixed cell cycle stuff) is done.

Valerie Wood - 2013-03-06

That sounds more like it....

How easy is the average annotations per publication /per year query?
If we get that today Antonia might be able to put a graph in her poster (I think she needs to print it on Thursday)
Not to worry if it is tricky, we can use it next time.
Val

Kim Rutherford - 2013-03-06

The average number of annotations per paper is: 14.1447

That number is slightly skewed by two papers that have thousands of annotations:

Dropbox/pombase/Chado/queries/top_papers_by_annotation_count.txt
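
To illustrate how a couple of very large papers pull the average up, here is a small sketch comparing the mean with the median and with the mean after dropping the two largest papers. The per-paper counts are invented to mimic the shape described above and are not taken from the file.

```python
from statistics import mean, median

# Hypothetical per-paper annotation counts: mostly small papers plus two
# very large ones (not real data).
COUNTS = [3, 5, 8, 10, 12, 15, 20, 2500, 5000]


def summarize(counts):
    """Return the mean, the median, and the mean without the two largest papers."""
    return {
        "mean": mean(counts),
        "median": median(counts),
        "mean_without_top_two": mean(sorted(counts)[:-2]),
    }


if __name__ == "__main__":
    for label, value in summarize(COUNTS).items():
        print(label, round(value, 2))
```

Reporting the median alongside the mean, or excluding the HTP papers, would give a figure closer to what a typical paper contributes.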

(I removed the GO_REF and null "publications" from the list.)

There are lots of "null" publications in that list, one for each annotation that doesn't have a publication. Most of those come from /controlled_curation annotations that have no db_xref.

The number of annotations per year will take longer to work out because our Chado doesn't have any publication details except for the PubMed ID. I wasn't planning on loading the publication details unless we really need them, as it's another dataset to maintain.