From: Susan W. <ax...@me...> - 2008-11-06 19:40:47
Scott,

Thanks for the explanation. I've seen similar stuff loading custom gff
files into the Bio::DB::SeqFeature::Store mysql schema. I'm going to try
the complete Chado flybase genome in postgres dumpfile format.

Susan

On Nov 6, 2008, at 12:27 PM, Scott Cain wrote:

> Hi Susan,
>
> The sequence-region line is ignored by the loader; since it isn't
> typed (i.e., the loader doesn't know if it is a contig, a chromosome or
> what), it can't be loaded into Chado.
>
> The problem is that the reference sequence line (the one that has
> ID=2L) is not before features that appear on 2L. While this is valid
> GFF3, the Chado loader can't deal with it. There is a GFF
> preprocessor (gmod_gff3_preprocessor.pl) that will fix problems like
> this, although if the only problem with the file is that the ID=2L
> line needs to be moved to the top, it would be faster to do that
> manually. Another problem that requires the preprocessor is when
> child features appear before their parent (like when an exon appears
> before the mRNA that it is a part of).
>
> Scott
>
> On Thu, Nov 6, 2008 at 2:06 PM, Susan Wilson <ax...@me...> wrote:
>> Back again.
>>
>> Sorry to have to have my hand held through all this, but I think there
>> is still a problem with the gff file:
>>
>> $ perl gmod_bulk_load_gff3.pl --recreate_cache --dbname dev_chado_01c
>> --dbxref GeneID --organism fromdata --gff /oracle/flybase-dmel_r5.9/dmel-2L-r5.12.gff
>> (Re)creating the uniquename cache in the database...
>> Creating table...
>> Populating table...
>> Creating indexes...Done.
>> Preparing data for inserting into the dev_chado_01c database
>> (This may take a while ...)
>> Unable to find srcfeature 2L in the database.
>> Perhaps you need to rerun your data load with the '--recreate_cache'
>> option.
>> at /oracle/genbank2chado/lib/Bio/GMOD/DB/Adapter.pm line 3887
>>
>> Bio::GMOD::DB::Adapter::src_second_chance('Bio::GMOD::DB::Adapter=HASH(0x89b419c)',
>> 'Bio::SeqFeature::Annotated=HASH(0x89a2074)') called at
>> gmod_bulk_load_gff3.pl line 692
>> Issuing rollback() for database handle being DESTROY'd without
>> explicit disconnect().
>>
>> $ head dmel-2L-r5.12.gff
>> ##gff-version 3
>> ##sequence-region 2L -204333 23011544
>> 2L FlyBase chromosome_band -204333 1326937 . + . ID=band-21_chromosome_band;Name=band-21
>> 2L FlyBase chromosome_band -204333 22221 . + . ID=band-21A_chromosome_band;Name=band-21A
>> 2L FlyBase chromosome_band -204333 -153714 . + . ID=band-21A1_chromosome_band;Name=band-21A1
>> 2L FlyBase chromosome_band -153713 -101818 . + . ID=band-21A2_chromosome_band;Name=band-21A2
>> 2L FlyBase chromosome_band -101817 -66427 . + . ID=band-21A3_chromosome_band;Name=band-21A3
>> 2L FlyBase chromosome_band -66426 -22869 . + . ID=band-21A4_chromosome_band;Name=band-21A4
>> 2L FlyBase chromosome_band -22868 22221 . + . ID=band-21A5_chromosome_band;Name=band-21A5
>> 2L FlyBase chromosome_arm 1 23011544 . . . ID=2L;Dbxref=GB:AE014134
>>
>> I tried single # on the sequence-region line. Tried deleting the
>> sequence-region line. Same difference....
>>
>> Susan
>>
>> On Nov 6, 2008, at 11:37 AM, Scott Cain wrote:
>>
>>> Hi Susan,
>>>
>>> There are two problems: most immediately, the bulk load script doesn't
>>> uncompress files for you, so you'll need to ungzip the file:
>>>
>>> gzip -d dmel-all-r5.12.gff.gz
>>>
>>> Second, the bulk loader doesn't deal well with really huge files (like
>>> a whole fly genome), so it would be best to use the individual arm
>>> files and load them separately.
>>>
>>> Scott
>>>
>>>
>>> On Thu, Nov 6, 2008 at 1:32 PM, axiom7 <ax...@me...> wrote:
>>>>
>>>> Hi again,
>>>>
>>>> I downloaded dmel-all-r5.12.gff.gz from flybase, but now I have the
>>>> following problem:
>>>>
>>>> perl gmod_bulk_load_gff3.pl --dbname dev_chado_01c --dbxref GeneID
>>>> --organism fromdata --gff /oracle/flybase-dmel_r5.9/dmel-all-r5.12.gff.gz
>>>> Preparing data for inserting into the dev_chado_01c database
>>>> (This may take a while ...)
>>>> Use of uninitialized value in pattern match (m//) at gmod_bulk_load_gff3.pl line 661, <GEN0> line 1.
>>>> Use of uninitialized value in pattern match (m//) at gmod_bulk_load_gff3.pl line 679, <GEN0> line 1.
>>>> no cvterm for at /oracle/genbank2chado/lib/Bio/GMOD/DB/Adapter.pm line 3911, <GEN0> line 1.
>>>> Issuing rollback() for database handle being DESTROY'd without explicit disconnect().
>>>>
>>>> My GMOD_ROOT is created by following instructions at
>>>> http://gmod.org/wiki/Chado:
>>>>
>>>> cvs -d:pserver:ano...@gm...:/cvsroot/gmod login
>>>>
>>>> Enter blank password. Then do:
>>>>
>>>> cvs -d:pserver:ano...@gm...:/cvsroot/gmod co schema
>>>>
>>>> and then doing a make;make install
>>>>
>>>> Susan
>>>>
>>>> axiom7 wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have filed the anomaly in the gmod project as you suggested. I didn't
>>>>> use the flybase data source, as I was following the directions from gmod
>>>>> for the genbank2chado package. I will try the other source(s) you
>>>>> suggested and get back to you.
>>>>>
>>>>> Thanks Scott.
>>>>> Susan
>>>>>
>>>>>
>>>>> Scott Cain-3 wrote:
>>>>>>
>>>>>> Hi Susan,
>>>>>>
>>>>>> I can certainly see what is wrong; the fix is another matter: GFF3
>>>>>> lines are only allowed to have a single ID, but the mRNA line you
>>>>>> pointed to has two: CG17683.t01 and CG17683.t06. Why this happened is
>>>>>> not clear to me; I would have to assume a bug in bp_genbank2gff3.pl.
>>>>>> If you could file this as a bug in the gmod project (as part of
>>>>>> Chado), I should be able to look at it in the next few days:
>>>>>>
>>>>>> https://sourceforge.net/tracker2/?group_id=27707&atid=391291
>>>>>>
>>>>>> On another track, why aren't you using the Dmel GFF3 from flybase:
>>>>>>
>>>>>> ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r5.9_FB2008_06/gff/
>>>>>>
>>>>>> (Full disclosure: I haven't tried to load the flybase GFF into a Chado
>>>>>> instance recently, so I can't comment on whether it will really work
>>>>>> or not--but it has a much better chance). Or, using the flybase
>>>>>> database dump of Chado:
>>>>>>
>>>>>> ftp://ftp.flybase.net/releases/current/psql/
>>>>>>
>>>>>> Scott
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 6, 2008 at 11:07 AM, axiom7 <ax...@me...> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have downloaded the Drosophila melanogaster *.gbk.gz files from
>>>>>>> bio-mirror.net/biomirror/ncbigenomes/Drosophila_melanogaster and run
>>>>>>> bp_genbank2gff3.pl on them to create the *.gbk.gz.gff files. However,
>>>>>>> the load fails immediately:
>>>>>>>
>>>>>>> perl bin/gmod_bulk_load_gff3.pl --dbname dev_chado_01c -dbxref GeneID
>>>>>>> --organism fromdata --gff
>>>>>>> data/Drosophila_melanogaster/CHR_2/NT_033778.gbk.gz.gff
>>>>>>> (Re)creating the uniquename cache in the database...
>>>>>>> Creating table...
>>>>>>> Populating table...
>>>>>>> Creating indexes...Done.
>>>>>>> Preparing data for inserting into the dev_chado_01c database
>>>>>>> (This may take a while ...)
>>>>>>> Organism Drosophila melanogaster from data
>>>>>>>
>>>>>>> ------------- EXCEPTION: Bio::Root::Exception -------------
>>>>>>> MSG: Error in line:
>>>>>>> NT_033778 GenBank mRNA 18442 18629 . + .
>>>>>>> ID=CG17683.t01,CG17683.t06;Parent=CG17683,CG17683;locus_tag=Dmel_CG17683;gene=CG17683;product=CG17683-RA%2C transcript variant A;Dbxref=GI:116007463,FLYBASE:FBgn0040002,GeneID:3355011;transcript_id=NM_001042963.1
>>>>>>>
>>>>>>> A feature may have at most one ID value
>>>>>>> STACK: Error::throw
>>>>>>> STACK: Bio::Root::Root::throw /oracle/genbank2chado/lib/Bio/Root/Root.pm:359
>>>>>>> STACK: Bio::FeatureIO::gff::_handle_feature /oracle/genbank2chado/lib/Bio/FeatureIO/gff.pm:696
>>>>>>> STACK: Bio::FeatureIO::gff::next_feature /oracle/genbank2chado/lib/Bio/FeatureIO/gff.pm:165
>>>>>>> STACK: bin/gmod_bulk_load_gff3.pl:819
>>>>>>> -----------------------------------------------------------
>>>>>>> Issuing rollback() for database handle being DESTROY'd without explicit disconnect().
>>>>>>>
>>>>>>> The "head" command on the file is as follows, which shows the script
>>>>>>> failing on the first mRNA line:
>>>>>>>
>>>>>>> head data/Drosophila_melanogaster/CHR_2/NT_033778.gbk.gz.gff
>>>>>>> ##gff-version 3
>>>>>>> # sequence-region NT_033778 1 21146708
>>>>>>> # conversion-by bp_genbank2gff3.pl
>>>>>>> # organism Drosophila melanogaster
>>>>>>> # date 14-MAY-2008
>>>>>>> # Note Drosophila melanogaster chromosome 2R.
>>>>>>> NT_033778 GenBank chromosome 1 21146708 . + . ID=NT_033778;mol_type=genomic DNA;date=14-MAY-2008;comment1=REVIEWED REFSEQ: This record has been curated by FlyBase. The reference sequence was derived from AE013599. On Oct 10%2C 2006 this sequence version replaced gi:56407907. COMPLETENESS: full length. ;Note=Drosophila melanogaster chromosome 2R.;Alias=2R;chromosome=2R;Dbxref=taxon:7227;organism=Drosophila melanogaster
>>>>>>> NT_033778 GenBank region 1 1285689 .
>>>>>>> + . ID=GenBank:region:NT_033778:1:1285689;Note=Heterochromatic sequence
>>>>>>> NT_033778 GenBank gene 18442 20468 . + . ID=CG17683;locus_tag=Dmel_CG17683;gene=CG17683;Note=CG17683%3B Annotated by Drosophila Heterochromatin Genome Project%2C Lawrence Berkeley National Lab%2C http://www.dhgp.org;Dbxref=FLYBASE:FBgn0040002,GeneID:3355011
>>>>>>> NT_033778 GenBank mRNA 18442 18629 . + . ID=CG17683.t01,CG17683.t06;Parent=CG17683,CG17683;locus_tag=Dmel_CG17683;gene=CG17683;product=CG17683-RA%2C transcript variant A;Dbxref=GI:116007463,FLYBASE:FBgn0040002,GeneID:3355011;transcript_id=NM_001042963.1
>>>>>>>
>>>>>>> I obtained the scripts from
>>>>>>> rsync://eugenes.org/argos/gmod/web/gmod/genbank2chado:
>>>>>>>
>>>>>>> head bin/bp_genbank2gff3.pl
>>>>>>> #!/usr/bin/perl -w
>>>>>>>
>>>>>>> #$Id: genbank2gff3.PLS,v 1.11 2007/03/19 16:42:05 bosborne Exp $;
>>>>>>>
>>>>>>>
>>>>>>> head bin/gmod_bulk_load_gff3.pl
>>>>>>> #!/usr/bin/perl
>>>>>>>
>>>>>>>
>>>>>>> =item dgg notes, 2007 march
>>>>>>>
>>>>>>> Can anybody see what is wrong with this?
>>>>>>>
>>>>>>> Thanks.
>>>>>>> Susan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://www.nabble.com/gmod_bulk_load_gff3-of-Drosophila-melanogaster-fails-tp20364068p20364068.html
>>>>>>> Sent from the gmod-devel mailing list archive at Nabble.com.
>>>>>>>
>>>>>>>
>>>>>>> -------------------------------------------------------------------------
>>>>>>> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
>>>>>>> Build the coolest Linux based applications with Moblin SDK & win great prizes
>>>>>>> Grand prize is a trip for two to an Open Source event anywhere in the world
>>>>>>> http://moblin-contest.org/redirect.php?banner_id=100&url=/
>>>>>>> _______________________________________________
>>>>>>> Gmod-devel mailing list
>>>>>>> Gmo...@li...
>>>>>>> https://lists.sourceforge.net/lists/listinfo/gmod-devel
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ------------------------------------------------------------------------
>>>>>> Scott Cain, Ph. D. scott at scottcain dot net
>>>>>> GMOD Coordinator (http://gmod.org/) 216-392-3087
>>>>>> Ontario Institute for Cancer Research
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> View this message in context: http://www.nabble.com/gmod_bulk_load_gff3-of-Drosophila-melanogaster-fails-tp20364068p20367047.html
>>>> Sent from the gmod-devel mailing list archive at Nabble.com.
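Two of the failures in this thread can be checked before running the loader: the reference sequence line (ID=2L) appearing after features located on 2L, and an ID attribute holding more than one value. The following is only a rough Python sketch of such a check, not what gmod_gff3_preprocessor.pl does. The function names are made up for illustration, and it assumes that a reference sequence line is one whose ID equals its own first column and that the file is tab-separated (the tabs were lost in the quoted output above).

```python
"""Sketch: pre-flight checks for a GFF3 file before a Chado bulk load.

All function names here are hypothetical; this is not part of GMOD.
"""


def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of tag -> raw value."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            attrs[tag.strip()] = value.strip()
    return attrs


def feature_columns(line):
    """Return the nine tab-separated columns of a feature line, else None."""
    if line.startswith("#") or not line.strip():
        return None
    cols = line.rstrip("\n").split("\t")
    return cols if len(cols) == 9 else None


def hoist_reference_lines(lines):
    """Move reference sequence lines directly after the leading directives.

    Assumption: a reference sequence line is a feature whose ID equals
    its own seqid, such as the chromosome_arm line with ID=2L on 2L.
    """
    header, refs, rest = [], [], []
    seen_feature = False
    for line in lines:
        cols = feature_columns(line)
        if cols is None:
            # Directives and comments before the first feature stay on top.
            (rest if seen_feature else header).append(line)
            continue
        seen_feature = True
        if parse_attributes(cols[8]).get("ID") == cols[0]:
            refs.append(line)
        else:
            rest.append(line)
    return header + refs + rest


def lines_with_multiple_ids(lines):
    """Return 1-based numbers of lines whose ID is a comma-separated list."""
    hits = []
    for number, line in enumerate(lines, start=1):
        cols = feature_columns(line)
        if cols and "," in parse_attributes(cols[8]).get("ID", ""):
            hits.append(number)
    return hits
```

To use it, read the file with readlines(), pass the list to hoist_reference_lines() and write the result to a new file; lines_with_multiple_ids() only reports offending lines (such as the CG17683.t01,CG17683.t06 mRNA line) and does not try to repair them. It does not reorder child features that appear before their parents, which is the other case Scott mentions for the preprocessor.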