From: Murphy, T. (NIH/NLM/N. [C] <mur...@nc...> - 2015-01-28 01:30:19
Hi Karen,

> We could reverse the relation to be has_part if every X_gene_segment has_part exon

I think that's roughly what I had in mind, although I'm hesitant to start strictly requiring an exon part for every X_gene_segment. I haven't spent much time working through the SO tree. Is that the relationship of exon to mRNA? That's not clear to me from this page (likely due to my ignorance):
http://www.sequenceontology.org/browser/current_svn/term/SO:0000234

-Terence

________________________________
From: Karen Eilbeck [kei...@ge...]
Sent: Thursday, January 22, 2015 3:41 PM
To: SO developers
Subject: Re: [SO-devel] C/V/D/J_gene_segment ontology

Hi Terrance

Great suggestions and questions.

Off the top of my head, without diving into the ontology, exon part_of x_gene_segment? would read as every exon part of some X_gene_segment, which is not true.

We could reverse the relation to be has_part if every X_gene_segment has_part exon

We could also subtype exon, X_exon part_of X_gene_segment

I need to dive into this some more. I will get back to you.

--Karen

On Jan 22, 2015, at 1:18 PM, Murphy, Terence (NIH/NLM/NCBI) [C] wrote:

Hi All,

I have a couple of questions about C/V/D/J_gene_segment features, which may be split over multiple intervals.
Here’s an example in GenBank flatfile format (from NG_000002.1):

     gene            9381..9859
                     /gene="IGLV4-69"
                     /gene_synonym="IGLV469; V5-6"
                     /note="immunoglobulin lambda variable 4-69"
                     /db_xref="GeneID:28784"
                     /db_xref="HGNC:HGNC:5921"
                     /db_xref="IMGT/GENE-DB:IGLV4-69"
     CDS             join(9381..9429,9550..9859)
                     /gene="IGLV4-69"
                     /gene_synonym="IGLV469; V5-6"
                     /exception="rearrangement required for product"
                     /codon_start=1
                     /db_xref="GeneID:28784"
                     /db_xref="HGNC:HGNC:5921"
                     /db_xref="IMGT/GENE-DB:IGLV4-69"
     V_segment       join(9381..9429,9550..9859)
                     /gene="IGLV4-69"
                     /gene_synonym="IGLV469; V5-6"
                     /standard_name="IGLV4-69"

Our GFF3 writer is currently using the same-ID/multiple-rows format for these segment features, like this:

NG_000002.1  RefSeq  gene            9381  9859  .  +  .  ID=gene2
NG_000002.1  RefSeq  CDS             9381  9429  .  +  0  ID=cds1;Parent=gene2
NG_000002.1  RefSeq  CDS             9550  9859  .  +  2  ID=cds1;Parent=gene2
NG_000002.1  RefSeq  V_gene_segment  9381  9429  .  +  .  ID=id562;Parent=gene2;part=1/2
NG_000002.1  RefSeq  V_gene_segment  9550  9859  .  +  .  ID=id562;Parent=gene2;part=2/2

I know some parsers expect IDs to be unique per row except for CDS rows, so this could be a problem.
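To make the concern concrete, here is a minimal Python sketch of the kind of check a strict parser might apply. It is not taken from our GFF3 writer or from any particular parser; the names parse_attributes and find_shared_ids are made up for illustration, and the rows are the ones from the example above with tabs restored. It groups rows by ID and reports the non-CDS features whose ID appears on more than one row.

```python
from collections import defaultdict

# The five example rows, tab-delimited as GFF3 requires.
GFF3_ROWS = (
    "NG_000002.1\tRefSeq\tgene\t9381\t9859\t.\t+\t.\tID=gene2\n"
    "NG_000002.1\tRefSeq\tCDS\t9381\t9429\t.\t+\t0\tID=cds1;Parent=gene2\n"
    "NG_000002.1\tRefSeq\tCDS\t9550\t9859\t.\t+\t2\tID=cds1;Parent=gene2\n"
    "NG_000002.1\tRefSeq\tV_gene_segment\t9381\t9429\t.\t+\t.\t"
    "ID=id562;Parent=gene2;part=1/2\n"
    "NG_000002.1\tRefSeq\tV_gene_segment\t9550\t9859\t.\t+\t.\t"
    "ID=id562;Parent=gene2;part=2/2\n"
)


def parse_attributes(column9):
    """Split a GFF3 attribute column (tag=value;tag=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attrs[tag] = value
    return attrs


def find_shared_ids(gff3_text):
    """Return {ID: [(type, start, end), ...]} for IDs used on more than one row."""
    rows_by_id = defaultdict(list)
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            rows_by_id[attrs["ID"]].append((cols[2], int(cols[3]), int(cols[4])))
    return {fid: rows for fid, rows in rows_by_id.items() if len(rows) > 1}


shared = find_shared_ids(GFF3_ROWS)
# Multi-row CDS features are widely tolerated; anything else may be rejected.
non_cds = {fid: rows for fid, rows in shared.items() if rows[0][0] != "CDS"}
for fid, rows in non_cds.items():
    print(fid, rows)
```

On the example rows, both cds1 and id562 are shared across rows, and only id562 (the V_gene_segment) survives the non-CDS filter, which is the case a unique-ID parser would trip over.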
We’re looking at switching to using a parent/child format using exon features, mimicking mRNAs, like this:

NG_000002.1  RefSeq  gene            9381  9859  .  +  .  ID=gene2
NG_000002.1  RefSeq  CDS             9381  9429  .  +  0  ID=cds1;Parent=gene2
NG_000002.1  RefSeq  CDS             9550  9859  .  +  2  ID=cds1;Parent=gene2
NG_000002.1  RefSeq  V_gene_segment  9381  9859  .  +  .  ID=id564;Parent=gene2
NG_000002.1  RefSeq  exon            9381  9429  .  +  .  ID=id565;Parent=id564
NG_000002.1  RefSeq  exon            9550  9859  .  +  .  ID=id566;Parent=id564

I expect this would solve the duplicate ID problem. I have two questions:

1) From my read of the SO terms it doesn’t look like exon would be a valid child for gene segments. Am I mis-reading the SO definitions? Should they be changed to allow exon part_of *_gene_segment?

2) Similarly, the CDS feature in this case seems like it should be a child of the V_gene_segment, and again CDS doesn’t appear to be a valid child. Should that be added as well?

-Terence

-----
Terence Murphy, Ph.D.
Staff Scientist
NCBI/NLM/NIH/DHHS
--
Have you seen NIH’s new genomes FTP site, including >45k assemblies?
http://www.ncbi.nlm.nih.gov/news/08-26-2014-new-genomes-FTP-live/

_______________________________________________
SOng-devel mailing list
SOn...@li...<mailto:SOn...@li...>
https://lists.sourceforge.net/lists/listinfo/song-devel

Karen Eilbeck
Associate Professor
Department of Biomedical Informatics, University of Utah