From: Jayaraman, P. <pja...@mc...> - 2010-10-15 00:33:47
This is from build 36.2 and build 36.3. It does detect that it is a dicistronic gene, but then it messes up the entire gff3 file by replacing all "exon" tags with "mRNA" tags, with multiple IDs (sometimes more than one of these features has the same set of IDs), like this:

NT_010799 GenBank gene 344463 344957 . + . ID=LOC654170;Dbxref=GeneID:654170;Note=Derived by automated computational analysis using gene prediction method: GNOMON. Supporting evidence includes similarity to: 2 Proteins;gene=LOC654170
NT_010799 GenBank mRNA 344463 344957 . + . ID=LOC654170.t01,LOC654170.t02;Parent=LOC654170,LOC654170;Dbxref=GI:169211091,GeneID:654170;Note=Derived by automated computational analysis using gene prediction method: GNOMON. Supporting evidence includes similarity to: 2 Proteins;gene=LOC654170;product=similar to hCG1643342;transcript_id=XM_001723743.1
NT_010799 GenBank mRNA 344463 344957 . + . ID=LOC654170.t01,LOC654170.t02;Parent=LOC654170,LOC654170;Dbxref=GI:169211091,GeneID:654170;Note=Derived by automated computational analysis using gene prediction method: GNOMON. Supporting evidence includes similarity to: 2 Proteins;gene=LOC654170;product=similar to hCG1643342;transcript_id=XM_001723743.1
NT_010799 GenBank CDS 344463 344957 . + . ID=LOC654170.p01,LOC654170.p02,LOC654170.p02;Parent=LOC654170.t01,LOC654170.t02,LOC654170.t02;Dbxref=GI:169211092,GeneID:654170;codon_start=1;gene=LOC654170;product=similar to hCG1643342;protein_id=XP_001723795.1
NT_010799 GenBank CDS 344463 344957 . + . ID=LOC654170.p01,LOC654170.p02,LOC654170.p02;Parent=LOC654170.t01,LOC654170.t02,LOC654170.t02;Dbxref=GI:169211092,GeneID:654170;codon_start=1;gene=LOC654170;product=similar to hCG1643342;protein_id=XP_001723795.1
NT_010799 GenBank CDS 344463 344957 . + . ID=LOC654170.p01,LOC654170.p02,LOC654170.p02;Parent=LOC654170.t01,LOC654170.t02,LOC654170.t02;Dbxref=GI:169211092,GeneID:654170;codon_start=1;gene=LOC654170;product=similar to hCG1643342;protein_id=XP_001723795.1

Is this the expected behavior?

Pushkala Jayaraman
Programmer/Analyst
Rat Genome Database
Human and Molecular Genetics Center
Medical College of Wisconsin
Email: pja...@mc...
Work: 414-955-2229
www.rgd.mcw.edu

-----Original Message-----
From: Chris Mungall [mailto:CJM...@lb...]
Sent: Thursday, October 14, 2010 5:45 PM
To: Jayaraman, Pushkala
Cc: Don Gilbert; Scott Cain; gmod-devel list
Subject: Re: [GMOD-devel] bp_genbank2gff3- Unflattening error - solution/quickfix

[removed gbrowse from cc]

this is a dicistronic gene. ideally the unflattener would detect this and make the cassette gene extend over.

this is from hs17? Which version? When I look here

http://www.ncbi.nlm.nih.gov/nuccore/NT_010799.15?from=9047686&to=9066094&report=genbank&strand=true

the genes all contain the mRNA - however, there are other oddities such as CCL14 and CCL15 being co-located.

On Oct 14, 2010, at 3:27 PM, Jayaraman, Pushkala wrote:

> Hello,
> I think I have found a quickfix to the problem without changing the code.
> I'm guessing Don had already addressed this in an earlier post.
>
> The Unflattener.pm module reports an error where it finds strange data or
> finds tags it has not yet been taught to parse correctly.
> I noticed that when it reports errors for a certain .gbk file, it also
> ends up messing up the entire file format, i.e. when it finds a
> dicistronic gene (a gene with a read-through mRNA that spans more
> than the gene), it reports an error with a SEVERITY value. The gff file
> that it creates will have ID=XXX.t01,XXX.t02;Parent=XXX,XXX; etc. for
> an mRNA feature. It also seems to list all exon features as mRNA
> and give them all the same IDs. Even the CDS features seem to get more
> than 2 IDs, of which the 2nd and 3rd are identical.
>
> There is an option in the bp_genbank2gff3.pl script that allows users to
> set the error_threshold. If you set the error_threshold relatively high,
> i.e. >2, then it ensures that Unflattener.pm doesn't report any
> errors and reports the converted gbk to gff3 file as is.
>
> This seems to be a more common case in the human .gbk files. So a
> quick-fix is to set the option -e 3 so that the gff3 files can be
> correctly parsed:
>
> bp_genbank2gff3.pl -e 3 ***.gbk
>
> Just wanted to let you guys know... didn't want anyone else to break
> their head over this and wonder why their gff3 files are turning out
> all weird.
> I wasn't able to post this on the BioPerl forum as my mail is still
> awaiting moderator approval.
>
> Thanks,
> Pushkala Jayaraman
> Programmer/Analyst
> Rat Genome Database
> Human and Molecular Genetics Center
> Medical College of Wisconsin
> Email: pja...@mc...
> Work: 414-955-2229
> www.rgd.mcw.edu
>
> -----Original Message-----
> From: Don Gilbert [mailto:gil...@cr...]
> Sent: Thursday, October 07, 2010 4:09 PM
> To: Jayaraman, Pushkala; sc...@sc...
> Cc: gmo...@li...; gmo...@li...
> Subject: Re: [Gmod-gbrowse] [GMOD-devel] FW: bp_genbank2gff3- Unflattening error
>
> Pushkala,
>
> Scott seems right, your Genbank entry may be biologically correct, but
> it is computationally in error because it says one mRNA extends beyond its
> enclosing gene boundaries:
>
>   mRNA: complement(join(9047672...9065992)) /gene="CCL14"
>   gene: complement(9047672..9050719) /gene="CCL14"
>                             ^^^^^^^ shorter than mRNA
>
> versus this gene span that encloses above mRNA:
>   gene: complement(9047672..9065992) /gene="CCL14-CCL15"
>
> - Don
> -- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
> -- gil...@in...--http://marmot.bio.indiana.edu/
>
> _______________________________________________
> Gmod-devel mailing list
> Gmo...@li...
> https://lists.sourceforge.net/lists/listinfo/gmod-devel
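
The two symptoms discussed in this thread can be checked for mechanically. The Python sketch below is only an illustration and is not part of bp_genbank2gff3.pl or Unflattener.pm. It flags GFF3 features whose ID attribute carries more than one comma-separated value (the GFF3 specification allows several values for Parent, but a feature is expected to have a single ID), and it tests whether an mRNA span lies inside its gene span, using the CCL14 coordinates from Don's message. The function names and the shortened, tab-delimited sample records are assumptions made for this example.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of tag -> list of values."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" not in field:
            continue  # skip empty or malformed fields
        tag, value = field.split("=", 1)
        attrs[tag.strip()] = [v.strip() for v in value.split(",")]
    return attrs


def find_multi_id_features(gff3_lines):
    """Return (line_number, feature_type, ids) for features with more than one ID value."""
    problems = []
    for number, line in enumerate(gff3_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # blank lines, comments and directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        ids = parse_attributes(columns[8]).get("ID", [])
        if len(ids) > 1:
            problems.append((number, columns[2], ids))
    return problems


def is_enclosed(child_span, parent_span):
    """True if child (start, end) lies entirely within parent (start, end)."""
    return parent_span[0] <= child_span[0] and child_span[1] <= parent_span[1]


# Hypothetical sample: shortened versions of the records quoted in this thread,
# written with tab separators as GFF3 requires.
sample = [
    "NT_010799\tGenBank\tgene\t344463\t344957\t.\t+\t.\tID=LOC654170;gene=LOC654170\n",
    "NT_010799\tGenBank\tmRNA\t344463\t344957\t.\t+\t.\t"
    "ID=LOC654170.t01,LOC654170.t02;Parent=LOC654170,LOC654170\n",
]

if __name__ == "__main__":
    for number, feature_type, ids in find_multi_id_features(sample):
        print(f"line {number}: {feature_type} has {len(ids)} IDs: {', '.join(ids)}")
    # CCL14 mRNA span versus the CCL14 gene span, then versus the CCL14-CCL15 gene span
    print(is_enclosed((9047672, 9065992), (9047672, 9050719)))
    print(is_enclosed((9047672, 9065992), (9047672, 9065992)))
```

Running a check like this on the converter's output, with and without -e 3, would be one way to confirm whether the higher error threshold avoids the multi-ID records for a given .gbk file.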