From: Peter C. <p.j...@go...> - 2015-01-23 13:25:10
Hi Terence,

I understand a very similar discussion took place at the RDF Summit last year (https://github.com/dbcls/rdfsummit/wiki), with EBI and DBCLS participants, regarding how best to represent "join" locations in RDF/FALDO:

https://github.com/JervenBolleman/FALDO
http://dx.doi.org/10.1101/002121

In designing FALDO, being able to represent all the existing INSDC records in EMBL/GenBank/DDBJ was a goal, and to that end the compound "join" and "order" locations can be represented using RDF lists and bags (order aware and order ignoring, respectively). However, dealing with simple locations described by just a start/end span (and strand) makes querying the RDF much easier.

It is my understanding (second hand, not having been at the RDF Summit meeting) that the consensus was that best practice would be to make the sub-regions of any compound location (e.g. each exon in a CDS "join" location) into explicitly named features, each of which then has a simple location, i.e. a simple faldo:Region. The CDS is then described as the combination of these new named parts.

I think this is very much the same issue you are describing in GFF3: while multi-line features have long been part of the specification, they are complex to deal with and some tools reject them. Turning each line of a compound location into an independent feature line seems to solve this neatly. The only catch is how to make their order explicit, but that is a long-standing GFF3 problem; see the Part tag discussion. (In RDF this is made simpler because the elements themselves can carry added information, including their own sequence; e.g. the exons do not have an order per se, they have an order inside a splice variant.)

I have CC'd Jerven Bolleman (FALDO lead) and Toshiaki Katayama (interested in using FALDO for representing the INSDC sequences in RDF), who were at this meeting and should be able to clarify if I have misunderstood anything.
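To make the "one independent line per part" idea concrete, here is a minimal Python sketch, not a reference implementation. It assumes tab-separated GFF3 rows; the function names, the child type "exon" (the type floated in the quoted message, whose validity under SO is exactly the open question) and the ".1", ".2" ID suffixes are all hypothetical choices for illustration. Children are ordered by start coordinate, which is only an assumption about the intended order.

```python
def parse_attributes(column):
    """Parse a GFF3 attributes column into a dict (no URL-decoding)."""
    attrs = {}
    for field in column.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            attrs[key] = value
    return attrs


def format_attributes(attrs):
    return ";".join(f"{key}={value}" for key, value in attrs.items())


def split_multirow_features(lines, child_type="exon", keep_types=("CDS",)):
    """Turn same-ID/multiple-row features into one parent row plus child rows.

    Rows whose type is in keep_types (CDS by default) are passed through
    unchanged. child_type and the generated child IDs are assumptions.
    """
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        rows.append((cols, parse_attributes(cols[8])))

    # Collect the rows that share an ID (ignoring the exempt types).
    parts = {}
    for cols, attrs in rows:
        if "ID" in attrs and cols[2] not in keep_types:
            parts.setdefault(attrs["ID"], []).append((cols, attrs))

    out, done = [], set()
    for cols, attrs in rows:
        fid = attrs.get("ID")
        group = parts.get(fid, [])
        if len(group) < 2:
            out.append("\t".join(cols))  # single-row feature: unchanged
            continue
        if fid in done:
            continue  # later rows of a group were emitted with the first
        done.add(fid)
        start = min(int(c[3]) for c, _ in group)
        end = max(int(c[4]) for c, _ in group)
        parent_attrs = {k: v for k, v in attrs.items() if k != "part"}
        out.append("\t".join(cols[:3] + [str(start), str(end)] + cols[5:8]
                             + [format_attributes(parent_attrs)]))
        # Assumption: children are ordered by start coordinate.
        for n, (c, _) in enumerate(sorted(group, key=lambda g: int(g[0][3])), 1):
            child_attrs = f"ID={fid}.{n};Parent={fid}"
            out.append("\t".join(c[:2] + [child_type] + c[3:8] + [child_attrs]))
    return out


EXAMPLE_ROWS = ["\t".join(fields) for fields in [
    ("NG_000002.1", "RefSeq", "gene", "9381", "9859", ".", "+", ".", "ID=gene2"),
    ("NG_000002.1", "RefSeq", "CDS", "9381", "9429", ".", "+", "0", "ID=cds1;Parent=gene2"),
    ("NG_000002.1", "RefSeq", "CDS", "9550", "9859", ".", "+", "2", "ID=cds1;Parent=gene2"),
    ("NG_000002.1", "RefSeq", "V_gene_segment", "9381", "9429", ".", "+", ".",
     "ID=id562;Parent=gene2;part=1/2"),
    ("NG_000002.1", "RefSeq", "V_gene_segment", "9550", "9859", ".", "+", ".",
     "ID=id562;Parent=gene2;part=2/2"),
]]

if __name__ == "__main__":
    for row in split_multirow_features(EXAMPLE_ROWS):
        print(row)
```

Applied to the same-ID rows from the quoted message, this yields one V_gene_segment row spanning 9381..9859 followed by two child rows, each with its own ID and a single start/end span.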
Regards,

Peter

On Thu, Jan 22, 2015 at 8:18 PM, Murphy, Terence (NIH/NLM/NCBI) [C] <mur...@nc...> wrote:
> Hi All,
>
> I have a couple of questions about C/V/D/J_gene_segment features, which may
> be split over multiple intervals. Here’s an example in GenBank flatfile
> format (from NG_000002.1):
>
>      gene            9381..9859
>                      /gene="IGLV4-69"
>                      /gene_synonym="IGLV469; V5-6"
>                      /note="immunoglobulin lambda variable 4-69"
>                      /db_xref="GeneID:28784"
>                      /db_xref="HGNC:HGNC:5921"
>                      /db_xref="IMGT/GENE-DB:IGLV4-69"
>      CDS             join(9381..9429,9550..9859)
>                      /gene="IGLV4-69"
>                      /gene_synonym="IGLV469; V5-6"
>                      /exception="rearrangement required for product"
>                      /codon_start=1
>                      /db_xref="GeneID:28784"
>                      /db_xref="HGNC:HGNC:5921"
>                      /db_xref="IMGT/GENE-DB:IGLV4-69"
>      V_segment       join(9381..9429,9550..9859)
>                      /gene="IGLV4-69"
>                      /gene_synonym="IGLV469; V5-6"
>                      /standard_name="IGLV4-69"
>
> Our GFF3 writer is currently using the same-ID/multiple-rows format for
> these segment features, like this:
>
> NG_000002.1  RefSeq  gene            9381  9859  .  +  .  ID=gene2
> NG_000002.1  RefSeq  CDS             9381  9429  .  +  0  ID=cds1;Parent=gene2
> NG_000002.1  RefSeq  CDS             9550  9859  .  +  2  ID=cds1;Parent=gene2
> NG_000002.1  RefSeq  V_gene_segment  9381  9429  .  +  .  ID=id562;Parent=gene2;part=1/2
> NG_000002.1  RefSeq  V_gene_segment  9550  9859  .  +  .  ID=id562;Parent=gene2;part=2/2
>
> I know some parsers expect IDs to be unique per row except for CDS rows, so
> this could be a problem. We’re looking at switching to using a parent/child
> format using exon features, mimicking mRNAs, like this:
>
> NG_000002.1  RefSeq  gene            9381  9859  .  +  .  ID=gene2
> NG_000002.1  RefSeq  CDS             9381  9429  .  +  0  ID=cds1;Parent=gene2
> NG_000002.1  RefSeq  CDS             9550  9859  .  +  2  ID=cds1;Parent=gene2
> NG_000002.1  RefSeq  V_gene_segment  9381  9859  .  +  .  ID=id564;Parent=gene2
> NG_000002.1  RefSeq  exon            9381  9429  .  +  .  ID=id565;Parent=id564
> NG_000002.1  RefSeq  exon            9550  9859  .  +  .  ID=id566;Parent=id564
>
> I expect this would solve the duplicate ID problem.
>
> I have two questions:
>
> 1) From my read of the SO terms it doesn’t look like exon would be a
> valid child for gene segments. Am I mis-reading the SO definitions? Should
> they be changed to allow exon part_of *_gene_segment?
>
> 2) Similarly, the CDS feature in this case seems like it should be a
> child of the V_gene_segment, and again CDS doesn’t appear to be a valid
> child. Should that be added as well?
>
> -Terence
>
> -----
> Terence Murphy, Ph.D.
> Staff Scientist
> NCBI/NLM/NIH/DHHS
>
> --
> Have you seen NIH’s new genomes FTP site, including >45k assemblies?
> http://www.ncbi.nlm.nih.gov/news/08-26-2014-new-genomes-FTP-live/
>
> _______________________________________________
> SOng-devel mailing list
> SOn...@li...
> https://lists.sourceforge.net/lists/listinfo/song-devel
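For anyone wanting to check whether a given GFF3 file would trip up parsers that expect IDs to be unique per row except for CDS rows, a small Python sketch follows. It is only an illustration under stated assumptions: rows are tab-separated with nine columns, attribute values are not URL-decoded, and the function name and the `exempt_types` parameter are hypothetical.

```python
from collections import Counter


def ids_shared_across_rows(lines, exempt_types=("CDS",)):
    """Return {ID: row count} for IDs used on more than one row.

    Rows whose type (column 3) is in exempt_types are ignored, mirroring
    parsers that only tolerate repeated IDs on CDS rows.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] in exempt_types:
            continue
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key == "ID":
                counts[value] += 1
    return {fid: n for fid, n in counts.items() if n > 1}


def make_row(ftype, start, end, phase, attributes):
    """Build one tab-separated row for the NG_000002.1 example."""
    return "\t".join(["NG_000002.1", "RefSeq", ftype, str(start), str(end),
                      ".", "+", phase, attributes])


# The same-ID/multiple-rows layout from the quoted message.
MULTI_ROW = [
    make_row("gene", 9381, 9859, ".", "ID=gene2"),
    make_row("CDS", 9381, 9429, "0", "ID=cds1;Parent=gene2"),
    make_row("CDS", 9550, 9859, "2", "ID=cds1;Parent=gene2"),
    make_row("V_gene_segment", 9381, 9429, ".", "ID=id562;Parent=gene2;part=1/2"),
    make_row("V_gene_segment", 9550, 9859, ".", "ID=id562;Parent=gene2;part=2/2"),
]

# The proposed parent/child layout from the quoted message.
PARENT_CHILD = [
    make_row("gene", 9381, 9859, ".", "ID=gene2"),
    make_row("CDS", 9381, 9429, "0", "ID=cds1;Parent=gene2"),
    make_row("CDS", 9550, 9859, "2", "ID=cds1;Parent=gene2"),
    make_row("V_gene_segment", 9381, 9859, ".", "ID=id564;Parent=gene2"),
    make_row("exon", 9381, 9429, ".", "ID=id565;Parent=id564"),
    make_row("exon", 9550, 9859, ".", "ID=id566;Parent=id564"),
]

if __name__ == "__main__":
    print(ids_shared_across_rows(MULTI_ROW))
    print(ids_shared_across_rows(PARENT_CHILD))
```

With the two layouts from the message, the check flags id562 in the first and nothing in the second, which matches the expectation that the parent/child form avoids the duplicate ID problem.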