From: Giles V. <gv...@sa...> - 2009-10-14 15:47:34
Hi,

I am looking to possibly speed up the generation of GFFs for consumption by the JBrowse prepare scripts, and wanted to check with you first to see which attributes (tags in the last column) are absolutely necessary for JBrowse. I am guessing:

- ID
- Name
- Parent
- Derives_from

Would that be correct?

Regards,
Giles

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
From: Giles V. <gv...@sa...> - 2009-10-14 23:18:35
Further to this question, I decided to investigate. I adapted the SQL code from one of the Chado views (gff3view) to build a custom mini-GFF exporter that leaves out the annotations. A typical set of lines looks like this:

Lbr.chr1 chado gene 1272 4166 . - . ID=3775871;Name=LbrM01_V2.0010
Lbr.chr1 chado exon 1272 4166 . - . ID=3775872;Name=LbrM01_V2.0010:exon:1;Parent=LbrM01_V2.0010:mRNA
Lbr.chr1 chado mRNA 1272 4166 . - . ID=3775873;Name=LbrM01_V2.0010:mRNA;Parent=LbrM01_V2.0010

JBrowse appears to handle this fine for my purposes. I haven't done any benchmarks, but it's significantly faster than exporting the complete GFF. In terms of size alone, the export has gone from 60M to 36M for the same dataset.

If anyone would like to know more on this topic, please feel free to get in touch.

Regards,
Giles
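For readers who want to try something similar, a minimal exporter along the lines Giles describes might look like the sketch below. This is not the actual chado2miniGFF code: it assumes a stock Chado schema (feature, featureloc, cvterm, feature_relationship) on PostgreSQL, the DSN and user are placeholders, and only ID, Name and Parent end up in column 9.

#!/usr/bin/env perl
# Minimal Chado -> "mini-GFF" exporter, sketched from the description above.
# NOT the actual chado2miniGFF code; schema assumed to be stock Chado,
# connection details are placeholders.
use strict;
use warnings;
use DBI;

my $src_name = shift @ARGV
    or die "usage: $0 <top-level feature uniquename, e.g. Lbr.chr1>\n";

my $dbh = DBI->connect('dbi:Pg:dbname=chado', 'chado_reader', '',
                       { RaiseError => 1 });

# One row per located feature; a feature with several part_of/derives_from
# parents will yield one GFF line per parent, which is fine for a sketch.
my $sql = <<'SQL';
SELECT src.uniquename  AS seqid,
       t.name          AS type,
       fl.fmin + 1     AS start,   -- Chado fmin is 0-based interbase
       fl.fmax         AS stop,
       CASE fl.strand WHEN 1 THEN '+' WHEN -1 THEN '-' ELSE '.' END AS strand,
       COALESCE(fl.phase::text, '.') AS phase,
       f.feature_id    AS id,
       f.uniquename    AS name,
       parent.uniquename AS parent
  FROM feature f
  JOIN featureloc fl  ON fl.feature_id  = f.feature_id AND fl.rank = 0
  JOIN feature src    ON src.feature_id = fl.srcfeature_id
  JOIN cvterm t       ON t.cvterm_id    = f.type_id
  LEFT JOIN feature_relationship fr
         ON fr.subject_id = f.feature_id
        AND fr.type_id IN (SELECT cvterm_id FROM cvterm
                            WHERE name IN ('part_of', 'derives_from'))
  LEFT JOIN feature parent ON parent.feature_id = fr.object_id
 WHERE src.uniquename = ?
 ORDER BY fl.fmin
SQL

print "##gff-version 3\n";
my $sth = $dbh->prepare($sql);
$sth->execute($src_name);
while (my $r = $sth->fetchrow_hashref) {
    my $attrs = "ID=$r->{id};Name=$r->{name}";
    $attrs .= ";Parent=$r->{parent}" if defined $r->{parent};
    print join("\t", $r->{seqid}, 'chado', $r->{type}, $r->{start},
               $r->{stop}, '.', $r->{strand}, $r->{phase}, $attrs), "\n";
}
$dbh->disconnect;

Run as, for example, "perl mini-gff.pl Lbr.chr1 > Lbr.chr1.gff"; the script name and invocation are illustrative only.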
From: Mitch S. <mit...@be...> - 2009-10-15 01:28:23
This is cool; sorry for the slow reply on my part. I wasn't sure about the "Derives_from" that you mentioned in your first email. I'm not aware of any reason JBrowse would use it, but I don't know the Bio::DB::SeqFeature::Store code in detail, which would be the place to look to answer this question definitively.

Also, if the user is using the "extraData" JBrowse option, then whatever is being extracted from the Perl objects that way would be relevant, although I suppose the user already knows if that's the case.

Out of curiosity, are you generating GFF from a Chado database, or is it another schema?

Regards,
Mitch
From: Giles V. <gv...@sa...> - 2009-10-15 10:52:13
On 15 Oct 2009, at 02:28, Mitch Skinner wrote:

> This is cool; sorry for the slow reply on my part. I wasn't sure about the "Derives_from" that you mentioned in your first email. I'm not aware of any reason JBrowse would use it, but I don't know the Bio::DB::SeqFeature::Store code in detail, which would be the place to look to answer this question definitively.

I completely forgot about Derives_from when I was working on it, and have stuck the relevant upward relationships (derives_from and part_of) into the Parent qualifier for the time being.

> Also, if the user is using the "extraData" JBrowse option, then whatever is being extracted from the Perl objects that way would be relevant, although I suppose the user already knows if that's the case.

I do use that, but I'm only pulling the Name, so that's fine for me.

> Out of curiosity, are you generating GFF from a Chado database, or is it another schema?

Yes, I am using this to pull data out of GeneDB here at Sanger, which is a Chado database. I have put it up here:

http://github.com/gv1/chado2miniGFF

It's mostly SQL, inside a light Python wrapper.

We usually use Artemis to bulk-export GFF, but in this case we don't need all the annotations for the time being, and even if we do use some in the future (for use in extraData) the approach would be to selectively put them in.

Regards,
Giles
From: Mitch S. <mit...@be...> - 2009-10-15 23:22:14
Giles Velarde wrote:
> Yes, I am using this to pull data out of GeneDB here at Sanger, which is a Chado database. I have put it up here:
>
> http://github.com/gv1/chado2miniGFF
>
> It's mostly SQL, inside a light Python wrapper.
>
> We usually use Artemis to bulk-export GFF, but in this case we don't need all the annotations for the time being, and even if we do use some in the future (for use in extraData) the approach would be to selectively put them in.

Given the difficulty you had with Bio::DB::Das::Chado, and given that I'd like to be able to generate JBrowse JSON from the UCSC database (and, potentially, other databases), I've been kicking around the idea of having something like an sql-to-json.pl. You could give it a database connection and an SQL query (and probably some Perl callbacks for munging the results) and it would generate the JBrowse JSON for you.

Simple cases would be really easy; the JBrowse JSON is already sort of tabular. I'm not sure yet how to deal with subfeatures, though, given that different databases deal with those quite differently.

FWIW,
Mitch
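To make the idea concrete, the interface of such a script might look roughly like the sketch below. Nothing like this ships with JBrowse; the option names, the callback signature and especially the output layout are illustrative assumptions, not the real JBrowse JSON format.

#!/usr/bin/env perl
# Rough sketch of the sql-to-json.pl interface described above.  Hypothetical
# throughout: option names, column names in the callback, and the JSON layout
# are placeholders, not the actual JBrowse on-disk format.
use strict;
use warnings;
use DBI;
use JSON;
use Getopt::Long;

my ($dsn, $user, $pass, $query, $out) = ('', '', '', '', 'features.json');
GetOptions(
    'dsn=s'   => \$dsn,
    'user=s'  => \$user,
    'pass=s'  => \$pass,
    'query=s' => \$query,
    'out=s'   => \$out,
) or die "bad options\n";
die "need --dsn and --query\n" unless $dsn && $query;

# Per-row callback for munging results; a real script would let the user
# supply this (the column names here are assumptions).
my $munge = sub {
    my ($row) = @_;                                   # hashref of one result row
    return [ @{$row}{qw(start end strand name)} ];
};

my $dbh = DBI->connect($dsn, $user, $pass, { RaiseError => 1 });
my $sth = $dbh->prepare($query);
$sth->execute;

my @features;
while (my $row = $sth->fetchrow_hashref) {
    push @features, $munge->($row);
}

open my $fh, '>', $out or die "cannot write $out: $!";
print {$fh} to_json({ featureCount => scalar @features,
                      features     => \@features }, { pretty => 1 });
close $fh;

The interesting design question is the $munge callback: that is where the site-specific column names and any subfeature handling would have to live.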
From: Scott C. <sc...@sc...> - 2009-10-16 03:10:43
Hi Mitch and Giles,

I must have missed the conversation about Bio::DB::Das::Chado; what was the nature of the problems? I can't say I'm terribly surprised that there were problems, since GeneDB has been using Chado for a long time, and so Chado and the GBrowse adaptor have no doubt evolved away from the way GeneDB is using it.

Anyway, is there anything I can do?

Scott

--
Scott Cain, Ph. D.                        scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)       216-392-3087
Ontario Institute for Cancer Research
From: Mitch S. <mit...@be...> - 2009-10-16 03:46:15
I was talking about this thread:

http://www.nabble.com/getting-JBrowse-to-run-off-a-Chado-database-where-feature.name-field-is-not-used-td25491225.html

That thread ended a little inconclusively, I thought; Scott had suggested a database change ("set feature.name = feature.uniquename") and then Giles started working on genedb->gff->jbrowse rather than straight genedb->jbrowse. And then performance became an issue; it sounds like Giles solved it in his case, but in general I hope to avoid making people do as much work as he has done.

Part of the reason I was thinking about having an sql-to-json.pl (that would take an SQL query as a parameter) is this site-specific variation in Chado usage. I'm not sure how much of it there is, though.

Other reasons were:
* allowing other kinds of queries than the ones supported by Bio::DasI
* not having to write Bio::DB::Das::UCSC (unless someone has already written this?)
* not having the Bio object intermediate (when I profiled a BED->json conversion, the biggest CPU user was Bio::Root::RootI::_rearrange)

On the other hand, you could argue (as Chris Mungall did to me today) that it would be better to work on improving the middleware rather than trying to avoid it. And then there are the reasons that people have wanted middleware in the past (not writing m*n mappings between m data sources and n consumers, but m+n mappings to/from the middleware). I think that would be an interesting discussion to have; each of those m*n mappings can be simpler than the mappings to and from the middleware, and can also take advantage of unique features of the source and destination.

Mitch
From: Giles V. <gv...@sa...> - 2009-10-18 15:46:16
Attachments:
0001-quick-fix-to-help-generate-larger-refseqs.patch
On 16 Oct 2009, at 04:45, Mitch Skinner wrote:

> That thread ended a little inconclusively, I thought; Scott had suggested a database change ("set feature.name = feature.uniquename") and then Giles started working on genedb->gff->jbrowse rather than straight genedb->jbrowse.

Mitch, Scott,

Many thanks for looking into this.

Yes, sorry about not ending that thread. Mitch summed it up quite well. I did try the modifications you suggested, Scott, and this did get the script to go a bit further, but then it got stuck somewhere a little further down the line. At the same time we are developing considerable infrastructure around GFF, so we decided to look at that too, and it yielded tangible results more easily.

Scott, thanks for your offer to look into it. It doesn't look like we'll need to use that adapter for now, mainly because the GFF middleware approach is working fine for us, and producing all those auto-exported GFF files can have other unforeseen uses. But if you're keen on expanding the scope of the adapter, I could point you to our public read-only snapshot for further testing and would be happy to help you with it.

> And then performance became an issue; it sounds like Giles solved it in his case, but in general I hope to avoid making people do as much work as he has done.

Speaking of which, I ran into another performance issue, this time inside prepare-refseqs.pl. It has to do with preparing refseqs for several thousand GFF files. When I looked into it I saw that refSeqs.js was being opened, parsed, appended to, and closed for every GFF. As this file got larger, the opening and appending step got more expensive.

I submit a patch (attached) that fixes it for me. In order not to change any original functionality, I have added a couple of new parameters:

-gffs, which allows you to supply a comma-separated list of files
-gfffolder, which allows you to supply a folder with GFF files in it (it does not check to see whether they really are GFFs)

This allows prepare-refseqs.pl to process more than one GFF file at once, which was a serious bottleneck for me. As before, I am not expecting you to necessarily apply the patch; it's just for you to see how I got around the problem. I am sure you can come up with a more elegant way! :-)

Oh, and another optimization I have had to do is deploy an instance of JBrowse per organism. biodb-to-json.pl seems to fall over when the refseqs file gets too large. This also happens to help keep the chromosome/contig-picking drop-down box populated with a sensible number of choices, and it seems to make sense to me to have a separate page per organism.

> Part of the reason I was thinking about having an sql-to-json.pl (that would take an SQL query as a parameter) is this site-specific variation in Chado usage. I'm not sure how much of it there is, though.
>
> Other reasons were:
> * allowing other kinds of queries than the ones supported by Bio::DasI
> * not having to write Bio::DB::Das::UCSC (unless someone has already written this?)
> * not having the Bio object intermediate (when I profiled a BED->json conversion, the biggest CPU user was Bio::Root::RootI::_rearrange)

I must say I have had to continually fight the temptation to cut through the middleman and just generate the JSON directly! So I do indeed see advantages in your suggestion. Of course, you would need to defensively validate other people's SQL results, which is never fun. Alternatively, a simple, abstracted and documented programmatic API for generating the JSON might be a good way, e.g.:

my $jbrowse = JBrowseConfig->new( $jsonConfigurations );
my $track = $jbrowse->addTrack({ name => "contig01", ... });

foreach my $feature (@myCustomFeatureImplementationList)
{
    $jbrowse->addFeature( $feature->toJSon() );
}

$jbrowse->generate_refseqs();
$jbrowse->generate_jsons();
$jbrowse->generate_names();

Looking at the bin scripts, you're nearly there right now.

> On the other hand, you could argue (as Chris Mungall did to me today) that it would be better to work on improving the middleware rather than trying to avoid it. And then there are the reasons that people have wanted middleware in the past (not writing m*n mappings between m data sources and n consumers, but m+n mappings to/from the middleware).

If by middleware this includes parsing formats like GFF, then I would tend to agree with him. It's easier for me to justify my time developing around standards, because of the potential for reuse. The minimal GFF really does seem like a good solution for us: the files are still valid, and they take an order of magnitude less time to produce (because we have so much annotation in our db).

Thanks again, guys. I really appreciate your willingness to help. In our particular case we have some rather phenomenal scaling issues, so it was never going to be easy; rest assured I'll be in touch again if/when I run into more interesting eventualities.

Kind Regards,
Giles
From: Giles V. <gv...@sa...> - 2009-10-19 09:18:09
Apologies for the boilerplate that got inserted into the middle of my last email, which is where I dragged the attachment in.

Regards,
Giles
From: Scott C. <sc...@sc...> - 2009-10-21 20:23:55
Hi Giles,

Don't feel bad about the untimely death of that thread; I seem to do that more than my fair share.

I would be interested in trying to get the GBrowse Chado adaptor working with your publicly facing read-only database, though not until November; I'm booked solid until then.

I'm glad you have something working, and having a GFF3 dumper is a good thing anyway.

Scott

--
Scott Cain, Ph. D.                        scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)       216-392-3087
Ontario Institute for Cancer Research
From: Mitch S. <mit...@be...> - 2009-10-19 22:17:32
Giles Velarde wrote:
> Speaking of which, I ran into another performance issue, this time inside prepare-refseqs.pl. It has to do with preparing refseqs for several thousand GFF files. When I looked into it I saw that refSeqs.js was being opened, parsed, appended to, and closed for every GFF. As this file got larger, the opening and appending step got more expensive.
>
> I submit a patch (attached) that fixes it for me.

The patch looks good to me; I was just about to apply it, but I wanted to ask some questions first.

How are your GFF files organized? Do you have a GFF file per organism, per refseq, or something else? How many organisms do you have, and how many refseqs do they have? Do your JBrowse instances have sequence data?

If you're using GFF files with prepare-refseqs.pl, then one option is to create one GFF file per organism with all of the sequence-region lines in it (and nothing else). Running prepare-refseqs.pl on that file should be pretty fast.

Did you write that patch before you started making per-organism JBrowse instances?

> Alternatively, a simple, abstracted and documented programmatic API for generating the JSON might be a good way [...]
>
> Looking at the bin scripts, you're nearly there right now.

A simple API for generating the JSON has been the goal, but yeah, it should probably be simpler and documented. If anyone is interested in using that API, feel free to speak up about your use case and how you want the API to work.

Regards,
Mitch
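For the per-organism refseq file Mitch suggests, something as small as the following would do. It is only a sketch: it assumes the exported per-contig GFFs really do carry ##sequence-region pragmas, and the file and directory names are placeholders.

#!/usr/bin/env perl
# Collect the ##sequence-region pragmas from a directory of per-contig GFF3
# files into one per-organism file that prepare-refseqs.pl can read quickly.
use strict;
use warnings;

my ($gff_dir, $out_file) = @ARGV;
die "usage: $0 <gff-directory> <output.gff>\n" unless $gff_dir && $out_file;

open my $out, '>', $out_file or die "cannot write $out_file: $!";
print {$out} "##gff-version 3\n";

for my $gff (glob "$gff_dir/*.gff") {
    open my $in, '<', $gff or die "cannot read $gff: $!";
    while (<$in>) {
        print {$out} $_ if /^##sequence-region\b/;
    }
    close $in;
}
close $out;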
From: Giles V. <gv...@sa...> - 2009-10-20 10:58:09
On 19 Oct 2009, at 23:17, Mitch Skinner wrote:

> How are your GFF files organized? Do you have a GFF file per organism, per refseq, or something else? How many organisms do you have, and how many refseqs do they have? Do your JBrowse instances have sequence data?

In our case we export one GFF per top-level source feature in the database. This can be a chromosome, but is more often a contig (and there are lots of those). They could be concatenated into one big GFF per organism, of course (GFF3 does allow more than one reference sequence in a single file).

> If you're using GFF files with prepare-refseqs.pl, then one option is to create one GFF file per organism with all of the sequence-region lines in it (and nothing else). Running prepare-refseqs.pl on that file should be pretty fast.

Good idea; that makes sense, as that's the only thing that gets parsed at that point (though I didn't know that at the time when I first started looking).

> Did you write that patch before you started making per-organism JBrowse instances?

No. Once I wrote that patch I got past prepare-refseqs.pl, but then found it fell over later. I can't remember where exactly, but I think it was in biodb-to-json.

Regards,
Giles
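If concatenation ever becomes preferable to per-contig files, a minimal merge is also straightforward. The sketch below is hypothetical (script and file names are placeholders); it keeps a single ##gff-version pragma and drops any embedded ##FASTA sections so sequence data isn't repeated mid-file.

#!/usr/bin/env perl
# Concatenate per-contig GFF3 files (given on the command line) into one
# per-organism GFF on stdout.
use strict;
use warnings;

print "##gff-version 3\n";
for my $gff (@ARGV) {
    open my $in, '<', $gff or die "cannot read $gff: $!";
    my $in_fasta = 0;
    while (<$in>) {
        next if /^##gff-version/;       # already printed once
        $in_fasta = 1 if /^##FASTA/;    # skip trailing FASTA blocks, if any
        next if $in_fasta;
        print;
    }
    close $in;
}

Used as, say, "perl concat-gff.pl export/Lbr.chr*.gff > Lbr.gff".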
From: Mitch S. <mit...@be...> - 2009-10-21 03:51:02
I thought about this some more, and your use case seems like a reasonable one to me. I tried out your patch in my working copy, and I noticed two things:

1. There's no documentation for those options in the help message.
2. The "if" statement that prints out the usage message doesn't check for $gfffolder.

If you prepare another patch that addresses those things, I'll apply it to jbrowse master.

Thanks for the patch,
Mitch
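For reference, the kind of change being asked for might look like the sketch below. It reuses the option names from Giles's patch (--gffs, --gfffolder), but the real prepare-refseqs.pl takes other options and has its own usage text, so treat this purely as an illustration of documenting the new flags and widening the usage check.

use strict;
use warnings;
use Getopt::Long;

my ($gff, $gffs, $gfffolder);
GetOptions(
    'gff=s'       => \$gff,        # a single GFF file (original behaviour)
    'gffs=s'      => \$gffs,       # comma-separated list of GFF files
    'gfffolder=s' => \$gfffolder,  # directory of GFF files
);

# The usage test needs to accept any one of the three input styles.
unless (defined $gff or defined $gffs or defined $gfffolder) {
    die <<"USAGE";
Usage: $0 [options]
    --gff <file>          single GFF file with sequence-region lines
    --gffs <f1,f2,...>    comma-separated list of GFF files
    --gfffolder <dir>     directory containing GFF files
USAGE
}

my @gff_files = defined $gffs      ? split(/,/, $gffs)
              : defined $gfffolder ? glob("$gfffolder/*")
              :                      ($gff);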
From: Giles V. <gv...@sa...> - 2009-10-20 11:07:49
> > Did you write that patch before you started making per-organism JBrowse instances?
>
> No. Once I wrote that patch I got past prepare-refseqs.pl, but then found it fell over later. I can't remember where exactly, but I think it was in biodb-to-json.

It just occurred to me that I have isolated an error in our database, where an orphan mRNA was being exported into the GFF. It happens to occur in the organism that was failing, and it causes:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Smp_196060 doesn't have a primary id
STACK: Error::throw
STACK: Bio::Root::Root::throw /software/pathogen/external/lib/perl/Bio/Root/Root.pm:328
STACK: Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree_in_tables /software/pathogen/external/lib/perl/lib/site_perl/5.8.8/Bio/DB/SeqFeature/Store/GFF3Loader.pm:668
STACK: Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree /software/pathogen/external/lib/perl/lib/site_perl/5.8.8/Bio/DB/SeqFeature/Store/GFF3Loader.pm:647
STACK: Bio::DB::SeqFeature::Store::GFF3Loader::finish_load /software/pathogen/external/lib/perl/lib/site_perl/5.8.8/Bio/DB/SeqFeature/Store/GFF3Loader.pm:318
STACK: Bio::DB::SeqFeature::Store::Loader::load_fh /software/pathogen/external/lib/perl/lib/site_perl/5.8.8/Bio/DB/SeqFeature/Store/Loader.pm:322
STACK: Bio::DB::SeqFeature::Store::Loader::load /software/pathogen/external/lib/perl/lib/site_perl/5.8.8/Bio/DB/SeqFeature/Store/Loader.pm:219
STACK: Bio::DB::SeqFeature::Store::memory::post_init /software/pathogen/external/lib/perl/lib/site_perl/5.8.8/Bio/DB/SeqFeature/Store/memory.pm:166
STACK: Bio::DB::SeqFeature::Store::new /software/pathogen/external/lib/perl/lib/site_perl/5.8.8/Bio/DB/SeqFeature/Store.pm:366
STACK: /nfs/pathdata2/GFF/jbrowse_deployments_test/bin/biodb-to-json.pl:44
-----------------------------------------------------------
Could not open database:
[the same exception and stack trace are then printed a second time]

For some reason I missed that in the logs originally.

So I think what's happened is that an error in the exported GFF was causing biodb-to-json to die. By separating things out into different organisms, I was able to get deployments for all the rest of our datasets, and then isolate what was going wrong with this one.

Regards,
Giles
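Since the underlying problem was an orphan feature in the exported GFF, and the BioPerl message doesn't say which file it came from, a small pre-load sanity check can save some digging. The sketch below is not part of JBrowse; it simply flags Parent values that never appear as an ID (or, because the export above points Parent at the parent's Name rather than its ID, as a Name) anywhere in the input.

#!/usr/bin/env perl
# Flag GFF3 features whose Parent never appears as an ID (or Name) in the
# same input, the kind of orphan that Bio::DB::SeqFeature::Store only reports
# as a cryptic "doesn't have a primary id" exception.  Two-pass sketch; it
# only inspects ID=, Name= and Parent= tags.
use strict;
use warnings;

my (%ids, @lines);

while (<>) {
    chomp;
    next if /^#/ || !/\t/;
    push @lines, $_;
    my ($attrs) = (split /\t/)[8];
    next unless defined $attrs;
    $ids{$1} = 1 if $attrs =~ /(?:^|;)ID=([^;]+)/;
    # Record Name too, since Parent here references the parent's Name.
    $ids{$1} = 1 if $attrs =~ /(?:^|;)Name=([^;]+)/;
}

for my $line (@lines) {
    my ($attrs) = (split /\t/, $line)[8];
    next unless defined $attrs && $attrs =~ /(?:^|;)Parent=([^;]+)/;
    for my $parent (split /,/, $1) {
        print "orphan Parent '$parent': $line\n" unless $ids{$parent};
    }
}

Something like "perl check-orphan-parents.pl exported/*.gff" (the script name is made up) would likely have flagged the problem feature before biodb-to-json.pl ever ran.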
From: Mitch S. <mit...@be...> - 2009-10-21 03:57:52
Thanks for following up on this; it doesn't seem to me that this is a JBrowse bug, though. It's definitely happening deep within the BioPerl code; the JBrowse line in the trace is just the one that creates the database object. I'm not really sure how to address this; I can't even think of a more useful error message for JBrowse to give.

Mitch