#1120 complete and ref in sync for GOA human?

Milestone: GOA
Status: open
Owner: rach_huntley
Labels: GAF (1)
Priority: 5
Updated: 2013-11-30
Created: 2013-11-16
Creator: Chris Mungall
Private: No

The complete file has annotations to postsynaptic density (GO:0014069) for the MYD88 gene. One annotation is to the ref (J3KQJ6).

$ grep MYD88 14-Oct-2013/gene_association.goa_human | grep GO:0014069
UniProtKB H0Y4G9 MYD88 GO:0014069 GO_REF:0000019 IEA Ensembl:ENSRNOP00000018341 C Myeloid differentiation primary response protein MyD88 H0Y4G9_HUMAN|MYD88 protein taxon:9606 20131012 Ensembl
UniProtKB J3KQJ6 MYD88 GO:0014069 GO_REF:0000019 IEA Ensembl:ENSRNOP00000018341 C Myeloid differentiation primary response protein MyD88 J3KQJ6_HUMAN|MYD88 protein taxon:9606 20131012 Ensembl

However, this annotation is not present in the ref file.
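
The same check can be generalised beyond one gene and one term. The sketch below is only an illustration, assuming both files are tab-separated GAF 2.0 (column 3 is the object symbol, column 5 is the GO ID). The function names `gaf_pairs` and `missing_from_ref`, the demo helper `row`, and the placeholder term on the Q99836 row are all hypothetical and are not taken from the GOA files.

```python
from typing import Iterable, Set, Tuple


def gaf_pairs(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect (gene symbol, GO ID) pairs from tab-separated GAF 2.0 lines."""
    pairs = set()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # not a well-formed annotation row
        pairs.add((cols[2], cols[4]))  # column 3 = symbol, column 5 = GO ID
    return pairs


def missing_from_ref(complete: Iterable[str], ref: Iterable[str]) -> Set[Tuple[str, str]]:
    """Return (symbol, GO ID) pairs annotated in complete but absent from ref."""
    return gaf_pairs(complete) - gaf_pairs(ref)


def row(accession: str, go_id: str) -> str:
    """Build a shortened MYD88 row for the demo (first five GAF columns only)."""
    return "\t".join(["UniProtKB", accession, "MYD88", "", go_id])


# The two accessions come from the grep output above.
complete_rows = [row("H0Y4G9", "GO:0014069"), row("J3KQJ6", "GO:0014069")]
# Placeholder term for illustration only; it is not a real annotation.
ref_rows = [row("Q99836", "GO:PLACEHOLDER")]

print(missing_from_ref(complete_rows, ref_rows))
```

With real data, both arguments could be open file handles for the complete and ref GAF files. Comparing on gene symbol rather than accession is a deliberate choice here, since the two files may use different accessions for the same gene.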

Related

Annotation issues: #1120

Discussion

  • rach_huntley
    2013-11-29

    Hi Chris,

    This is what I've found out so far.
    The annotations for the term 'postsynaptic density' (GO:0014069) are coming from rat gene Ensembl:ENSRNOP00000018341. The human ortholog is ENSG00000172936. In Ensembl this identifier has cross-references to the UniProt accessions Q99836, B4E3D6, J3KPU4, H0Y4G9, J3KQJ6 and J3KQ87. Q99836 is the Swiss-Prot entry (which is the one in the ref file) and the rest are from TrEMBL. However, only J3KQJ6 and H0Y4G9 have a projected annotation to postsynaptic density from Ensembl:ENSRNOP00000018341.
    When I asked Ensembl why this is, they responded that they make their orthologies and project GO terms at the transcript/protein level using only the "canonical" transcript (which is usually the longest translation, but is purely an internal definition), as they want to be conservative with their projections. For your example the "canonical" transcript is MYD88-201 (ENST00000417037 / ENSP00000401399), which maps to H0Y4G9 and J3KQJ6. Therefore, only these two have the GO annotations.

    Have you looked into how many other cases like this there are?

    Rachael.

     
    • Valerie Wood
      2013-11-29

      Wouldn't it be much less confusing if Ensembl mapped only to the UniProt
      reference entry?
      Val

  • rach_huntley
    2013-11-29

    Hi Val,

    I did ask whether they could project to a Swiss-Prot entry by default, and to the longest transcript only if no Swiss-Prot entry is available. They said they only have orthologies at the level of the longest transcript, and that it would be dangerous to project to other transcripts in the orthologous gene because these may have very different translations and therefore functions.

    If we can get an idea of how many annotations this affects then we can see if it's worth the effort to find a way around this.

    Rachael.
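
One way to estimate the scale is to count projected rows in the complete file that sit on accessions outside the reference set. This is a rough sketch and not an agreed method. It assumes tab-separated GAF 2.0 rows, treats rows with reference GO_REF:0000019 and "Ensembl" in column 15 as the projections in question (as in the MYD88 rows quoted in this ticket), and takes the reference accessions as a set supplied by the caller. The names `count_offref_projections` and `myd88_row` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set

PROJECTION_REF = "GO_REF:0000019"  # reference on the projected rows quoted in this ticket


def count_offref_projections(lines: Iterable[str], ref_accessions: Set[str]) -> Counter:
    """Count, per gene symbol, projected rows on accessions outside ref_accessions."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 15:
            continue  # GAF rows need at least the 15 mandatory columns
        accession, symbol = cols[1], cols[2]
        reference, assigned_by = cols[5], cols[14]
        if (reference == PROJECTION_REF and assigned_by == "Ensembl"
                and accession not in ref_accessions):
            counts[symbol] += 1
    return counts


def myd88_row(accession: str) -> str:
    """Rebuild one of the MYD88 rows from this ticket as a tab-separated line."""
    return "\t".join([
        "UniProtKB", accession, "MYD88", "", "GO:0014069", "GO_REF:0000019", "IEA",
        "Ensembl:ENSRNOP00000018341", "C",
        "Myeloid differentiation primary response protein MyD88",
        f"{accession}_HUMAN|MYD88", "protein", "taxon:9606", "20131012", "Ensembl",
    ])


sample = [myd88_row("H0Y4G9"), myd88_row("J3KQJ6")]
affected = count_offref_projections(sample, {"Q99836"})
print(affected)
```

Summing the counter's values gives the total number of affected rows, and its length gives the number of affected genes. The reference set could be built from column 2 of the ref file.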

     
  • Chris Mungall
    2013-11-30

    I think Ensembl's worry is misplaced. In fact there may be a misunderstanding about generic vs specific isoforms.

    For example, the original annotation here is an IDA to RGD:735043. This is not to a specific isoform - c17 is blank in the RGD-supplied GAF. Ensembl then presumably map this to ENSRNOP00000018341 (is this documented? I assume they choose the longest transcript?). They have already made some assumption about a specific form based on a generic form. It seems odd to then worry about whether a projection to another specific form is dangerous given that an assumption is already made.

    I think the solution is simple - the UniProtKB canonical protein should be interpreted as being generic. It's never been clear to me if UniProt explicitly commit to these semantics (one advantage of PRO is that it makes this explicit), but at least for GO this is what we intend. If info is known about a specific form then it should go in c17. If Ensembl map orthologs derived from information about genes or generic gene-level proteins then it should be to the generic/canonical UniProtKB ID (and preferably the reference proteome one, if multiple are available).
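
The rule proposed above can be written down as a small sketch. It is only an illustration of the idea, not Ensembl's or GOA's actual pipeline. The function `projection_target` and the `canonical` lookup table are hypothetical, and treating Q99836 as the generic entry for ENSG00000172936 is an assumption based on it being the Swiss-Prot entry in the ref file. Skipping isoform-specific sources is a further conservative assumption that the comment above does not prescribe.

```python
from typing import Dict, Optional


def projection_target(gene_id: str, source_form_id: str,
                      canonical: Dict[str, str]) -> Optional[str]:
    """Pick the accession that should receive a projected annotation.

    source_form_id is column 17 of the source GAF row. A blank value means
    the source annotation is generic rather than isoform-specific.
    """
    if source_form_id:
        # Isoform-specific source: this sketch declines to project it
        # (an assumption, chosen to stay conservative).
        return None
    # Generic source: target the generic/canonical accession for the gene.
    return canonical.get(gene_id)


# Hypothetical lookup from Ensembl gene ID to the generic UniProtKB accession.
canonical = {"ENSG00000172936": "Q99836"}

# The RGD source annotation has a blank column 17, so it is treated as generic.
print(projection_target("ENSG00000172936", "", canonical))
```

Under this rule the MYD88 projection would land on Q99836 instead of the TrEMBL accessions H0Y4G9 and J3KQJ6.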

     
  • Chris Mungall
    2013-11-30

    "If we can get an idea of how many annotations this affects then we can see if it's worth the effort to find a way around this."

    I suspect it's not a huge number, and as such this should not be a blocker.

    But I think it's useful to pursue this a little more. For me at least, it's useful to know the details of the Ensembl projection method; I wasn't aware of their strategy or assumptions until pulling at this particular thread (possibly I should have been aware and need to go back and read some papers).

     
  • Valerie Wood
    2013-11-30

    I have always thought the current strategy a little odd. The annotations being projected belong on the generic isoform unless there is isoform-specific information in column 17. But don't we only require one entry per locus in the GO database?

    Would the problem go away if Ensembl made the GO annotation projection at the level of the gene (single ID) instead of the transcript (multiple IDs)? That would make more sense based on the available biological data.

     
  • Valerie Wood
    2013-11-30

    I just read Chris's comment and I agree.
    Val