
#157 stats wanted

None
open
nobody
None
1
2013-03-15
2013-02-09
No

As per Skype call:
Numbers of annotations made total/ number of annotations per paper per year

Also, it would be useful if we could get an overview of the phenotype annotation:
i) How many phenotype annotations have we made so far (and using how many papers)?
We may have further questions about phenotype data later.

Discussion

1 2 > >> (Page 1 of 2)
  • Midori Harris

    Midori Harris - 2013-02-12

    For the manuscript, we'd also like a breakdown of the number of annotations for each of the top-level terms, and for any other term that has 500 or more annotations, counting both direct and indirect (i.e. inherited by transitivity).

    These are the top-level terms:

    altered effect on growth medium FYPO:0001155
    biological process phenotype FYPO:0000300
    cell phenotype FYPO:0000002
    cell population phenotype FYPO:0000003
    molecular function phenotype FYPO:0000652
    normal phenotype FYPO:0000257
    *abnormal phenotype [this one doesn't exist yet; I'll put the ID here when it is in] FYPO:0001985

    Does that all make sense?

    cheers,
    m

     

    Last edit: Midori Harris 2013-02-14
  • Midori Harris

    Midori Harris - 2013-02-14

    p.s. for the annotation counts, can you let me know the date of the FYPO version you used?

    (so I can do other stats from the same date)

    ta
    m

     
  • Kim Rutherford

    Kim Rutherford - 2013-02-20

    I've made a start on this. Here's the annotation counts by year and type:
    Dropbox/pombase/Chado/queries/annotation_counts_by_year.tsv

    I'll do the phenotype query next.

     
  • Kim Rutherford

    Kim Rutherford - 2013-02-20

    There are 9606 FYPO annotations in v32 from 858 publications.

    Queries:

    select count(distinct pub.uniquename)
      from feature_cvterm fc, pub, cvterm t, cv
     where t.cv_id = cv.cv_id
       and fc.cvterm_id = t.cvterm_id and fc.pub_id = pub.pub_id
       and cv.name = 'fission_yeast_phenotype';
    
    select count(fc.feature_cvterm_id)
      from feature_cvterm fc, pub, cvterm t, cv
     where t.cv_id = cv.cv_id and fc.cvterm_id = t.cvterm_id
       and fc.pub_id = pub.pub_id and cv.name = 'fission_yeast_phenotype';
    
     
  • Kim Rutherford

    Kim Rutherford - 2013-02-20

    Querying the indirect children of the top level terms for FYPO may prove tricky.

    We currently use a script from GMOD called "gmod_make_cvtermpath.pl" to find all the direct and indirect children and parents of each term (transitive closure?) and then store the results in Chado. The results go in a table called "cvtermpath". That works OK for GO, probably, as it is mostly used on GO.

    According to this long thread:
    http://gmod.827538.n3.nabble.com/filling-cvtermpath-td1824246.html
    that script takes a few shortcuts and is likely to produce incorrect results for ontologies that use some of the more complicated sorts of relations. Perhaps FYPO has some of those relations.

    For now, the loader only runs gmod_make_cvtermpath.pl for the GO ontologies as we haven't needed the indirect children in Chado for anything else (so far). The main use of the indirect children at the moment is the code that filters the GO annotation to remove annotation where there is a more specific annotation. We don't do that for FYPO (yet?).

    I'll change the loader to run the GMOD script for FYPO too so that the indirect children are available to query, but there's a chance that there will be subtle (or unsubtle) problems.

    The longer term solution is to use Oort to calculate the parents and children using a fancy ontology reasoner, then load the results. It would probably be faster too (the GMOD script takes hours per ontology). I thought there was a ticket about that but I can't find it. I'll add one.
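
To illustrate what "direct and indirect parents" means here, a minimal sketch of a transitive closure over a single relation follows. It is not the GMOD script or Oort: the term names are hypothetical, only one is_a-style relation is handled, and only the shortest distance is kept per pair, so it sidesteps exactly the "more complicated" relations discussed in the linked thread.

```python
from collections import defaultdict

def transitive_closure(edges):
    """Map (term, ancestor) -> shortest path length, given (child, parent) edges."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    closure = {}
    for start in list(parents):
        # Breadth-first walk upwards from each term, one level at a time.
        frontier = [start]
        seen = {start}
        distance = 0
        while frontier:
            distance += 1
            next_frontier = []
            for term in frontier:
                for parent in parents.get(term, ()):
                    if parent not in seen:
                        seen.add(parent)
                        closure[(start, parent)] = distance
                        next_frontier.append(parent)
            frontier = next_frontier
    return closure

# Hypothetical toy ontology: C has two parents (B and D), both under A.
edges = [("C", "B"), ("B", "A"), ("C", "D"), ("D", "A")]
closure = transitive_closure(edges)
for (term, ancestor), dist in sorted(closure.items()):
    print(term, ancestor, dist)
```

Each (term, ancestor, distance) triple plays roughly the role of a subject/object/pathdistance row, under the simplifications stated above.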

     
  • Valerie Wood

    Valerie Wood - 2013-02-20

    These data seem to be the dates the annotations were made, rather than the date of the publication they were from
    Dropbox/pombase/Chado/queries/annotation_counts_by_year.tsv

    This wasn't very explicit was it ;)

    Numbers of annotations made total/ number of annotations per paper per year

    What I was thinking was
    Do this query for each year:
    Number of publications / number of annotations supported by those publications
    considering only the data from the curation tool (i.e. the "fully" curated papers)
    (we might need to make adjustments for a few HTP papers, but I can probably do this manually, or identify the papers and filter them later)

    So for example
    For each year
    i) Get the number of papers from the year && curated in the curation tool
    ii) Get the number of annotations to those papers (can do the breakdown by GO/PRO/FYPO too, I didn't consider that but it will also be useful)
    iii) divide ii by i (i.e. annotations per paper)

    This is to demonstrate how much more data the later papers contain and how this is increasing, rather than the fact that we can capture more data types now.
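
The per-year calculation in the steps above can be sketched as follows, taking the ratio as annotations per paper since that is what the stated aim suggests. The records are invented for illustration; the tuple layout and the function name are assumptions, not the curation tool's schema.

```python
from collections import defaultdict

# Hypothetical records: (publication ID, publication year, annotation count).
papers = [
    ("PMID:1", 2005, 4),
    ("PMID:2", 2005, 6),
    ("PMID:3", 2012, 20),
    ("PMID:4", 2012, 30),
    ("PMID:5", 2012, 10),
]

def annotations_per_paper_by_year(records):
    """Return {year: total annotations / number of papers} for each year."""
    paper_count = defaultdict(int)
    annotation_count = defaultdict(int)
    for _pub_id, year, n_annotations in records:
        paper_count[year] += 1
        annotation_count[year] += n_annotations
    return {
        year: annotation_count[year] / paper_count[year]
        for year in sorted(paper_count)
    }

per_year = annotations_per_paper_by_year(papers)
print(per_year)
```

A breakdown by GO/PRO/FYPO would only need an extra key alongside the year.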

    Val

     
  • Valerie Wood

    Valerie Wood - 2013-02-20

    There are 9606 FYPO annotations in v32 from 858 publications.

    Can we get the numbers from v33 instead (much more data)?

    Val

     
  • Kim Rutherford

    Kim Rutherford - 2013-02-20

    These data seem to be the dates the annotations were made, rather
    than the date of the publication they were from

    That makes more sense than the annotation date.

    Unfortunately the publication dates aren't in Chado so I can't easily
    query all the annotation and dates.

    considering only the data from the curation tool (i.e. the "fully"
    curated papers)

    If you just need the data from the curation tool, that's easier. The
    publication dates are stored in the curation tool database.

    I'll give it a go.

     
  • Kim Rutherford

    Kim Rutherford - 2013-02-20

    For the latest load (which will be v33 more or less) we have 10197 annotations from 908 papers.

     
  • Valerie Wood

    Valerie Wood - 2013-02-20

    That's impressive!
    (~5000 are from the genome deletion paper, but it's still a lot!)
    908 papers seems a bit odd, but I think this is ALL papers used so far for any curation, not just the ones which we have done phenotypes for, but that's fine

     
  • Kim Rutherford

    Kim Rutherford - 2013-02-20

    Why is 908 odd? The query is definitely returning the count of papers where there is at least one FYPO annotation in Chado.

    There are 2406 papers in Chado that have annotation.

     
  • Antonia Lock

    Antonia Lock - 2013-02-20

    It's because we have only done tool curation of 500-odd papers. I guess the rest will come from the embl files?

     
  • Valerie Wood

    Valerie Wood - 2013-02-20

    Ah right, it seemed a lot because there are 530 with annotation in the curation tool, and these will not all have phenotype data, which means that the old annotations in Artemis came from 300 publications. I didn't think it was that many, but it could easily be now I think about it...

     
  • Midori Harris

    Midori Harris - 2013-02-20

    Going back to the much earlier comment about paths etc. ...

    FYPO does have some "more complicated" relations, but not many so far. It is a very good idea to switch to Oort/reasoning/etc. for dealing with paths, but for the most immediate need (i.e. the FYPO manuscript) I think we'll be OK even if the counts aren't totally precise. We really just need to give a decent, fairly accurate idea of how much we've used the ontology for annotation, and the numbers will be shockingly out of date by the time the paper sees the light of day no matter what.

    m

     
  • Kim Rutherford

    Kim Rutherford - 2013-02-20

    OK, the next load will have the paths filled in for FYPO, so we can see how it looks.

     
  • Kim Rutherford

    Kim Rutherford - 2013-02-22

    The new load is done, including the "paths" for FYPO.

    Here are the FYPO top level counts from the latest Chado (2013-02-19):

    altered effect on growth medium -   15
    biological process phenotype - 3436
    cell phenotype - 9984
    cell population phenotype -   74
    molecular function phenotype -  166
    normal phenotype -  777
    abnormal phenotype - 2407
    

    The numbers look rather large for some of those. Do they look too large?

     
  • Kim Rutherford

    Kim Rutherford - 2013-02-22

    Forgot to add: I used this code for generating the queries:

    for my $name (
    "altered effect on growth medium",
    "biological process phenotype",
    "cell phenotype",
    "cell population phenotype",
    "molecular function phenotype",
    "normal phenotype",
    "abnormal phenotype",
    ) {
      print qq|select '$name', count(feature_cvterm_id) from feature_cvterm where
        cvterm_id in (select subject_id from cvtermpath where object_id in (select
        cvterm_id from cvterm where name = '$name') and pathdistance > 0 UNION
        select cvterm_id from cvterm where name = '$name');\n|;
    };
    
     
  • Midori Harris

    Midori Harris - 2013-02-22

    Actually, the numbers look pretty reasonable, assuming they're counting individual annotations (as opposed to annotated genes or alleles). I would expect them to add up to more than the 9K or 10K total annotations because of multiple paths in the ontology.

    Would it be easy to get the number of genes and/or number of alleles annotated too?

    m

     
  • Kim Rutherford

    Kim Rutherford - 2013-02-24

    Yep, those are counts of annotations, not genes or alleles. I'm glad the numbers look OK.

    These are the counts of alleles:
    altered effect on growth medium - 14
    biological process phenotype - 1391
    cell phenotype - 6262
    cell population phenotype - 54
    molecular function phenotype - 115
    normal phenotype - 333
    abnormal phenotype - 1200

    (Made by replacing "count(feature_cvterm_id)" with "count(distinct feature_id)" in the code above)
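
A small self-contained sketch of the two counts, run against an in-memory SQLite database rather than Chado. The three tables are cut down to the columns the query touches, and the terms and annotations are made up; it is only meant to show how annotation counts and distinct-feature counts differ, and why per-term counts can add up to more than the total when a term has more than one parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cvterm (cvterm_id integer primary key, name text);
    create table cvtermpath (subject_id integer, object_id integer, pathdistance integer);
    create table feature_cvterm (feature_cvterm_id integer primary key,
                                 feature_id integer, cvterm_id integer);

    -- Hypothetical terms: 'child' sits under both top-level terms.
    insert into cvterm values (1, 'top A'), (2, 'top B'), (3, 'child');
    insert into cvtermpath values (3, 1, 1), (3, 2, 1);

    -- Three annotations on two alleles (feature_id 10 and 11).
    insert into feature_cvterm values (100, 10, 1), (101, 10, 3), (102, 11, 3);
""")

QUERY = """
    select count({what}) from feature_cvterm where cvterm_id in (
        select subject_id from cvtermpath
         where object_id in (select cvterm_id from cvterm where name = :name)
           and pathdistance > 0
        union
        select cvterm_id from cvterm where name = :name)
"""

def count_for(name, what):
    # `what` is one of two fixed strings below; the term name is bound as a parameter.
    return conn.execute(QUERY.format(what=what), {"name": name}).fetchone()[0]

names = ("top A", "top B")
annotation_counts = {n: count_for(n, "feature_cvterm_id") for n in names}
allele_counts = {n: count_for(n, "distinct feature_id") for n in names}
total = conn.execute("select count(*) from feature_cvterm").fetchone()[0]
print(annotation_counts, allele_counts, total)
```

Here the two annotations to 'child' are counted under both top-level terms, so the per-term annotation counts sum to more than the three annotations in the table.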

     
  • Kim Rutherford

    Kim Rutherford - 2013-02-27

    I've queried, for each term, the count of annotations to that term and all its children.
    It's here: Dropbox/pombase/Chado/queries/child_annotation_counts-v33.tsv

    Is that any use?

    This is the SQL that makes the table:

    create temp table all_cvtermpath as select subject_id, object_id from
       cvtermpath where pathdistance > 0;
    insert into all_cvtermpath
       select cvterm_id as subject_id, cvterm_id object_id from cvterm;
    create index all_cvtermpath_object on all_cvtermpath(object_id);
    create index all_cvtermpath_subject on all_cvtermpath(subject_id);
    select count(distinct feature_cvterm_id), t.name, cv.name
       from feature_cvterm fc,cvterm t, all_cvtermpath ap, cv
       where t.cv_id = cv.cv_id and t.cvterm_id = ap.object_id and
         fc.cvterm_id = subject_id
       group by cv.name, t.name  order by count desc;
    
     
  • Valerie Wood

    Valerie Wood - 2013-03-02

    Is that any use?

    Interesting.
    I thought there would be more rows in this list though?
    It includes all ontologies right? and indirect/direct annotations?
    Currently, for GO alone we use 4134 terms (current GAF), (direct only)

    and for phenotype it should be more than the diff (485)
    in fact if I grep on "phenotype" in this file I get 572
    maybe I am misunderstanding what the child counts are?

    v

     
  • Kim Rutherford

    Kim Rutherford - 2013-03-02

    The query should include all terms that have annotation or have a child term with annotation, so I'll investigate.

     
  • Kim Rutherford

    Kim Rutherford - 2013-03-06

    I've just run it on the database I sent to Mark today (the one with the slightly dodgy cell cycle changes). I got 15563 rows this time - much better. I must have been using an incomplete copy of the Chado database. I don't know how that could happen.

    I'll run it again once the next load (with the fixed cell cycle stuff) is done.

     
  • Valerie Wood

    Valerie Wood - 2013-03-06

    That sounds more like it....

    How easy is the average annotations per publication /per year query?
    If we get that today Antonia might be able to put a graph in her poster (I think she needs to print it on Thursday)
    Not to worry if it is tricky, we can use it next time.
    Val

     
  • Kim Rutherford

    Kim Rutherford - 2013-03-06

    The average number of annotations per paper is: 14.1447

    That number is slightly skewed by two papers that have thousands of annotations:

    Dropbox/pombase/Chado/queries/top_papers_by_annotation_count.txt
    

    (I removed the GO_REF and null "publications" from the list.)

    There are lots of "null" publications in that list, one for each annotation that doesn't have a publication. Most of those come from /controlled_curation annotations that have no db_xref.
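
As a rough illustration of the skew, here is a sketch with invented counts (none of these numbers come from Chado, and the IDs are placeholders). It drops the null and GO_REF entries first, then compares the mean with the median, which is much less sensitive to a couple of very large papers.

```python
from statistics import mean, median

# Hypothetical per-publication annotation counts; None stands for annotations
# with no publication attached.
counts = {
    "PMID:1": 5,
    "PMID:2": 10,
    "PMID:3": 15,
    "PMID:4": 5000,          # stand-in for a high-throughput paper
    "GO_REF:0000001": 300,   # placeholder ID for illustration
    None: 200,
}

def is_paper(pub_id):
    """Keep real publications; skip missing IDs and GO_REF entries."""
    return pub_id is not None and not pub_id.startswith("GO_REF")

paper_counts = [n for pub_id, n in counts.items() if is_paper(pub_id)]
average = mean(paper_counts)
middle = median(paper_counts)
print(average, middle)
```

Reporting both figures, or the mean with the largest papers excluded, would make the effect of the outliers explicit.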

    The number of annotations per year will take longer to work out because our Chado doesn't have any publication details except for the PubMed ID. I wasn't planning on loading the publication details unless we really need them, as it's another dataset to maintain.

     