From: Cook, M. <ME...@St...> - 2007-04-11 15:02:44
Lincoln and Jannick and fellow GFF/SO wanderers,

Just recently, I've been going through this exact set of issues myself as applied to correctly inferring SO-compliant features for splice_donor_site and splice_acceptor_site given a gene model.

I hope the following simplified example is useful for understanding the issue, and I would appreciate your comments as to whether you agree with how I interpret the GFF3 and SO specifications in this regard.

EXAMPLE
============================

Given this simplified gene model containing two exons, each 3 bp long:

    123456789
    EEEIIIEEE
    >>>--->>>

and given these SO definitions:

    splice_donor_site: The junction between the 3 prime end of an exon and the following intron.
    <http://www.broad.mit.edu/annotation/genome/ontology/Sequence_Ontology/Term.html?sp=SSO%3A0000163>

    splice_acceptor_site: The junction between the 3 prime end of an intron and the following exon.
    <http://www.broad.mit.edu/annotation/genome/ontology/Sequence_Ontology/Term.html?sp=SSO%3A0000164>

...we should encode the gene as:

    exon(1,3,+)
    splice_donor_site(3,3,+)
    intron(4,6,+)
    splice_acceptor_site(6,6,+)
    exon(7,9,+)

HOWEVER, if the gene codes the other way, viz.

    123456789
    EEEIIIEEE
    <<<---<<<

...we should encode it as:

    exon(7,9,-)
    splice_donor_site(6,6,-)
    intron(4,6,-)
    splice_acceptor_site(3,3,-)
    exon(1,3,-)

Note that the coordinates of the exons and intron are the same in both encodings, only the strand is different; AND the coordinates of the splice sites are also the same between encodings, due to understanding "to the right of the indicated base in the direction of the landmark."
as "1 plus the indicated base, in interbase coordinates".

It is this understanding that I am trying to clarify by this example, and I would in particular appreciate confirmation that the splice sites should NOT be encoded in the second model as:

    splice_donor_site(7,7,-)
    splice_acceptor_site(4,4,+)

Thanks,

Malcolm Cook
Stowers Institute for Medical Research - Kansas City, Missouri

________________________________

From: son...@li... [mailto:son...@li...] On Behalf Of Lincoln Stein
Sent: Wednesday, April 11, 2007 8:46 AM
To: Jannick D. Bendtsen
Cc: SO developers; ls...@cs...
Subject: Re: [SO-devel] GFF 3

Hi,

Zero-length and one-length features are both represented using coordinates start==end. One uses the ontology to determine whether this is a zero-length feature or a one-length feature. A zero-length feature will inherit from the "junction" parent class, while a one-length feature will inherit from the "region" parent class.

I would much rather use interbase coordinates (in which the numbers refer to the positions between bases), but legacy requires GFF3 to use base coordinates.

Lincoln

On 4/11/07, Jannick D. Bendtsen <jbe...@cl...> wrote:

Dear Lincoln,

I'm trying to parse a GFF file and reading http://www.sequenceontology.org/gff3.shtml left me with a few questions which I hope you will take the time to answer.

-- snip --
Columns 4 & 5: "start" and "end"
The start and end of the feature, in 1-based integer coordinates, relative to the landmark given in column 1. Start is always less than or equal to end. For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark.
-- snip --

From this it is clear that insertion sites, cleavage sites etc. can be mapped onto a sequence simply by this

    ctg123 . gene 1000 1000 . + . ID=gene00001;Name=EDEN

But what is the region syntax for just covering one position?
1000 1001 will cover two positions??

Thanks for your help.

Best wishes
Jannick

--
______________________________

Jannick D. Bendtsen
Senior scientific officer

CLC bio
Gustav Wieds Vej 10
8000 Aarhus C
Denmark

www.clcbio.com

jbe...@cl...

Contact numbers:
Telephone: +45 70 22 32 44
Fax: +45 70 22 55 19
Mobile: +45 51 20 96 94

CLC bio A/S Disclaimer:

-------------------------------------------------------------------
Any information contained in this e-mail and/or any attachments is confidential, and only intended for reception and use by the specified person. If you are not the intended recipient, please return the email to the sender and delete it afterwards. In this case any copying, forwarding, printing, disclosure and use is strictly prohibited.
-------------------------------------------------------------------

--
Lincoln D. Stein
Cold Spring Harbor Laboratory
1 Bungtown Road
Cold Spring Harbor, NY 11724
(516) 367-8380 (voice)
(516) 367-8389 (fax)
FOR URGENT MESSAGES & SCHEDULING,
PLEASE CONTACT MY ASSISTANT,
SANDRA MICHELSEN, AT mic...@cs...
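
The splice-site encoding proposed in the first message of this thread can be sketched in Python. This is only an illustration of that reading (start == end, with the junction taken to lie to the right of the indicated base in the direction of the landmark), not a confirmed interpretation of the GFF3 specification. The function name `splice_sites` and the tuple layout are hypothetical and do not come from any GFF3 library.

```python
def splice_sites(intron_start, intron_end, strand):
    """Return (type, start, end, strand) tuples for the two splice junctions
    flanking an intron, following the reading proposed in this thread.

    Coordinates are 1-based and relative to the landmark, as in GFF3
    columns 4 and 5. A junction is written as start == end on the base
    immediately to its LEFT in landmark direction, regardless of strand.
    """
    left_junction = intron_start - 1   # between last exon base and first intron base
    right_junction = intron_end        # between last intron base and next exon base

    if strand == "+":
        donor, acceptor = left_junction, right_junction
    elif strand == "-":
        # On the minus strand the roles swap, but the coordinates do not move.
        donor, acceptor = right_junction, left_junction
    else:
        raise ValueError("strand must be '+' or '-'")

    return [
        ("splice_donor_site", donor, donor, strand),
        ("splice_acceptor_site", acceptor, acceptor, strand),
    ]


# The 9 bp example from the message: exons at 1-3 and 7-9, intron at 4-6.
for strand in ("+", "-"):
    for feature in splice_sites(4, 6, strand):
        print(feature)
```

Under this reading the two junctions sit at bases 3 and 6 on either strand; only which one is the donor and which the acceptor changes.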