Re: [SO-devel] C/V/D/J_gene_segment ontology

Hi Terence,

I understand a very similar discussion took place at the RDF
Summit last year https://github.com/dbcls/rdfsummit/wiki with
EBI and DBCLS participants regarding how to best represent
"join" locations in RDF/FALDO.

https://github.com/JervenBolleman/FALDO
http://dx.doi.org/10.1101/002121

One design goal of FALDO was to be able to represent all the existing
INSDC records in EMBL/GenBank/DDBJ, and to that end the
compound "join" and "order" locations can be represented using
RDF lists and bags (order-aware and order-ignoring, respectively).

However, dealing with simple locations described by just
start/end span (and strand) makes querying the RDF much
easier. It is my understanding (second hand, not being at the
RDF Summit meeting) that the consensus was that best
practice would be to make the sub-regions of any compound
location (e.g. each exon in a CDS "join" location) into explicitly
named features (each of which then has a simple location,
i.e. a simple faldo:Region). The CDS is then described as
the combination of these new named parts.
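
As a rough illustration of that idea (not FALDO itself, and not anything
agreed at the meeting), here is a minimal Python sketch that splits an INSDC
"join" string into named parts, each with a simple start/end/strand. The
function name split_join and the "_partN" naming scheme are made up for this
example, and it only handles plain "a..b" spans (no complement, fuzzy ends
or single-base positions).

```python
import re


def split_join(location, parent_name, strand="+"):
    """Split 'join(a..b,c..d)' into simple named regions (hypothetical helper)."""
    text = location.strip()
    match = re.fullmatch(r"join\((.*)\)", text)
    # A plain 'a..b' location is treated as a one-part join.
    spans = (match.group(1) if match else text).split(",")
    regions = []
    for index, span in enumerate(spans, start=1):
        start, end = (int(value) for value in span.strip().split(".."))
        regions.append(
            {
                "name": f"{parent_name}_part{index}",  # assumed naming scheme
                "start": start,
                "end": end,
                "strand": strand,
            }
        )
    return regions


# The CDS location from Terence's example record.
parts = split_join("join(9381..9429,9550..9859)", "IGLV4-69_CDS")
for part in parts:
    print(part["name"], part["start"], part["end"], part["strand"])
```

The parent feature would then be described as the combination of the
returned parts, each of which can be queried by its own simple span.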

I think this is very much the same issue you are describing in
GFF3: while multi-line features have long been part of the
specification, they are complex to deal with and some tools
reject them. Turning each line of a multi-line feature into
an independent feature with its own ID seems to solve this neatly.
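
To make that concrete, here is a small Python sketch under my own
assumptions: rows are already parsed into 9-field tuples with the attributes
as a dict, CDS rows are left alone (since shared CDS IDs are widely
accepted), and the ".1", ".2" ID suffixes are just a hypothetical naming
scheme, not anything from the GFF3 specification.

```python
from collections import Counter


def make_ids_unique(rows):
    """Give each row of a multi-line, non-CDS feature its own ID (sketch)."""
    totals = Counter(row[8]["ID"] for row in rows if "ID" in row[8])
    seen = Counter()
    result = []
    for row in rows:
        attrs = dict(row[8])  # copy so the input rows are not modified
        feature_id = attrs.get("ID")
        if feature_id is not None and totals[feature_id] > 1 and row[2] != "CDS":
            seen[feature_id] += 1
            # Hypothetical suffix scheme; it follows the order of the rows.
            attrs["ID"] = f"{feature_id}.{seen[feature_id]}"
        result.append(row[:8] + (attrs,))
    return result


# Rows taken from the same-ID/multiple-rows example quoted below.
rows = [
    ("NG_000002.1", "RefSeq", "gene", 9381, 9859, ".", "+", ".", {"ID": "gene2"}),
    ("NG_000002.1", "RefSeq", "CDS", 9381, 9429, ".", "+", "0",
     {"ID": "cds1", "Parent": "gene2"}),
    ("NG_000002.1", "RefSeq", "CDS", 9550, 9859, ".", "+", "2",
     {"ID": "cds1", "Parent": "gene2"}),
    ("NG_000002.1", "RefSeq", "V_gene_segment", 9381, 9429, ".", "+", ".",
     {"ID": "id562", "Parent": "gene2"}),
    ("NG_000002.1", "RefSeq", "V_gene_segment", 9550, 9859, ".", "+", ".",
     {"ID": "id562", "Parent": "gene2"}),
]

unique_rows = make_ids_unique(rows)
for row in unique_rows:
    print(row[2], row[3], row[4], row[8]["ID"])
```

Note that the suffix only records the order in which the rows appeared; it
does not by itself tell a parser that the two rows belong together.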

The only catch is how to make their order explicit, but
that is a long-standing GFF3 problem - see the Part tag
discussion.

(In RDF this is made simpler because the elements themselves
can have added information, including their own sequence,
e.g. the exons do not have an order per se, they have an
order inside a splice variant.)

I have CC'd Jerven Bolleman (FALDO lead) and Toshiaki
Katayama (interested in using FALDO for representing the
INSDC sequences in RDF) who were at this meeting, and
should be able to clarify if I have misunderstood anything.

Regards,

Peter

On Thu, Jan 22, 2015 at 8:18 PM, Murphy, Terence (NIH/NLM/NCBI) [C]
<mur...@nc...> wrote:
> Hi All,
>
>
>
> I have a couple of questions about C/V/D/J_gene_segment features, which may
> be split over multiple intervals. Here’s an example in GenBank flatfile
> format (from NG_000002.1):
>
>
>
>      gene            9381..9859
>
>                      /gene="IGLV4-69"
>
>                      /gene_synonym="IGLV469; V5-6"
>
>                      /note="immunoglobulin lambda variable 4-69"
>
>                      /db_xref="GeneID:28784"
>
>                      /db_xref="HGNC:HGNC:5921"
>
>                      /db_xref="IMGT/GENE-DB:IGLV4-69"
>
>      CDS             join(9381..9429,9550..9859)
>
>                      /gene="IGLV4-69"
>
>                      /gene_synonym="IGLV469; V5-6"
>
>                      /exception="rearrangement required for product"
>
>                      /codon_start=1
>
>                      /db_xref="GeneID:28784"
>
>                      /db_xref="HGNC:HGNC:5921"
>
>                      /db_xref="IMGT/GENE-DB:IGLV4-69"
>
>      V_segment       join(9381..9429,9550..9859)
>
>                      /gene="IGLV4-69"
>
>                      /gene_synonym="IGLV469; V5-6"
>
>                      /standard_name="IGLV4-69"
>
>
>
> Our GFF3 writer is currently using the same-ID/multiple-rows format for
> these segment features, like this:
>
>
>
> NG_000002.1     RefSeq  gene    9381    9859    .       +       .       ID=gene2
> NG_000002.1     RefSeq  CDS     9381    9429    .       +       0       ID=cds1;Parent=gene2
> NG_000002.1     RefSeq  CDS     9550    9859    .       +       2       ID=cds1;Parent=gene2
> NG_000002.1     RefSeq  V_gene_segment  9381    9429    .       +       .       ID=id562;Parent=gene2;part=1/2
> NG_000002.1     RefSeq  V_gene_segment  9550    9859    .       +       .       ID=id562;Parent=gene2;part=2/2
>
>
>
> I know some parsers expect IDs to be unique per row except for CDS rows, so
> this could be a problem. We’re looking at switching to using a parent/child
> format using exon features, mimicking mRNAs, like this:
>
>
>
> NG_000002.1     RefSeq  gene    9381    9859    .       +       .       ID=gene2
> NG_000002.1     RefSeq  CDS     9381    9429    .       +       0       ID=cds1;Parent=gene2
> NG_000002.1     RefSeq  CDS     9550    9859    .       +       2       ID=cds1;Parent=gene2
> NG_000002.1     RefSeq  V_gene_segment  9381    9859    .       +       .       ID=id564;Parent=gene2
> NG_000002.1     RefSeq  exon    9381    9429    .       +       .       ID=id565;Parent=id564
> NG_000002.1     RefSeq  exon    9550    9859    .       +       .       ID=id566;Parent=id564
>
>
>
> I expect this would solve the duplicate ID problem.
>
>
>
> I have two questions:
>
> 1)      From my read of the SO terms it doesn’t look like exon would be a
> valid child for gene segments. Am I mis-reading the SO definitions? Should
> they be changed to allow exon part_of *_gene_segment?
>
> 2)      Similarly, the CDS feature in this case seems like it should be a
> child of the V_gene_segment, and again CDS doesn’t appear to be a valid
> child. Should that be added as well?
>
>
>
>
>
> -Terence
>
>
>
> -----
>
> Terence Murphy, Ph.D.
>
> Staff Scientist
>
> NCBI/NLM/NIH/DHHS
>
>
>
> --
>
> Have you seen NIH’s new genomes FTP site, including >45k assemblies?
>
> http://www.ncbi.nlm.nih.gov/news/08-26-2014-new-genomes-FTP-live/
>
>
>
>
> _______________________________________________
> SOng-devel mailing list
> SOn...@li...
> https://lists.sourceforge.net/lists/listinfo/song-devel
>