From: Aaron J. M. <am...@pc...> - 2005-07-15 13:33:39
Got it, thanks. Something else to keep on our minds while we implement
central dogma handling (that two instances of a Gene may in fact be the
same physical instance represented in two coordinate spaces). Note that
we will ultimately have at least three coordinate spaces
(contig<->scaffold<->chromosome), and possibly central dogma related
coordinate mappings (protein <-> mRNA <-> DNA).

-Aaron

On Jul 14, 2005, at 6:13 PM, Chris Stoeckert wrote:

> No, these are different features because they are spans on
> different sequences (one scaffold and one virtual) so you won't get
> two locations based on this for the same na_feature_id. NAFeature
> has the na_sequence_id which tells you whether it is the scaffold
> or virtual sequence. If these are Gene, RNA, or Protein features
> then you can say that they are the same conceptual feature through
> the central dogma and instance tables. If they are features like
> Exon, then you could infer this as you say by parent_id, source_id,
> etc.
>
> Chris
>
> On Jul 14, 2005, at 5:52 PM, Aaron J. Mackey wrote:
>
>> Exactly. No logic is required, because we simply copy any and all
>> NALocation objects attached to the sequences and generate new
>> NALocation objects that point to the virtual sequence, with new
>> coordinate/strand, but all other foreign keys remain the same
>> (i.e. children of the same feature).
>>
>> Hmm, that means that if you blindly pull locations for a given
>> feature, you will get two locations, not just one (so you'll need
>> to specify which reference sequence you wish to obtain the
>> location on).
>>
>> -Aaron
>>
>> On Jul 14, 2005, at 5:41 PM, Chris Stoeckert wrote:
>>
>>> Let's see if I understand your proposal. Generate features and
>>> locations based on the static scaffold sequence coordinates. Then
>>> at the end of the pipeline generate the same (conceptual)
>>> features with locations based on the virtual sequence
>>> coordinates. That makes sense to me.
>>> The advantage is that you
>>> have both, one that is stable (scaffold) and one that can be
>>> regenerated as needed (virtual) but stored for convenience. I
>>> don't really see a disadvantage - sure it's twice as many rows
>>> but if you materialize a view you are adding these anyway.
>>>
>>> Chris
>>>
>>> On Jul 14, 2005, at 3:50 PM, Aaron J. Mackey wrote:
>>>
>>>> As we struggle to use GUS the "right way", this is throwing us
>>>> for a loop. On the one hand, our GUS client applications want
>>>> to see features in the coordinate system of the assembly (i.e.
>>>> the virtual sequence) -- on the other hand, it makes sense from
>>>> a data integrity viewpoint to only load/store feature
>>>> coordinates with respect to the static underlying scaffold
>>>> coordinates, since the scaffold-to-chromosome mapping (as
>>>> defined by DoTS.SequencePiece) may change over time.
>>>>
>>>> One option is to instantiate a read-only materialized view of
>>>> the NALocation for clients to use.
>>>>
>>>> A second option (which we've just discussed, and people seem to
>>>> like) is for the InsertVirtualSequenceFromMapping plugin we just
>>>> wrote to (re)generate duplicate versions of all NALocations
>>>> attached to a given SequencePiece in the new coordinate system
>>>> (requiring the virtual sequence building to be the last step in
>>>> our pipeline, instead of the first).
>>>>
>>>> -Aaron
>>>>
>>>> On Jul 14, 2005, at 2:53 PM, Chris Stoeckert wrote:
>>>>
>>>>> Hi Aaron,
>>>>> I don't have a strong argument for either way. In terms of
>>>>> coordinate mapping utilities, I'm not aware of one so certainly
>>>>> would welcome yours (but if others know of ones please speak up).
>>>>>
>>>>> Chris
>>>>>
>>>>> On Jul 14, 2005, at 11:13 AM, Aaron J. Mackey wrote:
>>>>>
>>>>>> Thanks Chris, I got it.
>>>>>>
>>>>>> If we are going to start hanging features off these, should we
>>>>>> hang them off the virtual chromosome sequence entries, or the
>>>>>> scaffold entries in externalnasequence? Would it make sense
>>>>>> to "codify" this usage with associated PL/SQL code to
>>>>>> reconstruct virtual sequence and associated features in the
>>>>>> virtual coordinate space? I guess one way to do this would be
>>>>>> to have Virtual*Feature read-only views (and thus target
>>>>>> everything to the "real" coordinate system such that future
>>>>>> rebuilds of the virtual sequence would not require
>>>>>> recalculation of feature locations)?
>>>>>>
>>>>>> Relatedly, is there coordinate mapping code already in some
>>>>>> GUS utility module (if not, I'm happy to contribute mine,
>>>>>> based on BioPerl's powerful Bio::Coordinate::Map framework)?
>>>>>>
>>>>>> -Aaron
>>>>>>
>>>>>> On Jul 14, 2005, at 11:05 AM, Chris Stoeckert wrote:
>>>>>>
>>>>>>> Hi Aaron,
>>>>>>>
>>>>>>>> 1) VirtualSequence has a required sequence_version attribute
>>>>>>>> - what is this for? Is this redundant to
>>>>>>>> external_database_release_id?
>>>>>>>
>>>>>>> This is a superclass attribute inherited by all NASequence
>>>>>>> views. My recollection is that individual GenBank sequence
>>>>>>> entries have version tags at the end of accessions as in
>>>>>>> "DQ094190.1" for Toxoplasma gondii ATP-binding cassette
>>>>>>> protein subfamily B member 3 (found in VERSION field).
>>>>>>>
>>>>>>>> 2) VirtualSequence has a clob for storing the assembled
>>>>>>>> sequence (I suspect), but the Perl object layer doesn't use
>>>>>>>> this slot, instead rebuilding the sequence from the sequence
>>>>>>>> pieces.
>>>>>>>> Am I correct in this usage, or should I not, in
>>>>>>>> fact, be storing the assembled sequence in VirtualSequence?
>>>>>>>
>>>>>>> Again this is a superclass attribute. I think using it is
>>>>>>> optional. Reasons not to use it are that the virtual sequence
>>>>>>> is hard to represent as a single entity (e.g., contains gaps)
>>>>>>> or is very large and has a significant overhead cost of
>>>>>>> storing what can be easily regenerated (and avoid
>>>>>>> denormalization). Reasons to use are for convenience and
>>>>>>> efficiency of retrieving the sequence without the need to
>>>>>>> rebuild.
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -Aaron
>>>>>>>>
>>>>>>>> --
>>>>>>>> Aaron J. Mackey, Ph.D.
>>>>>>>> Project Manager, ApiDB Bioinformatics Resource Center
>>>>>>>> Penn Genomics Institute, University of Pennsylvania
>>>>>>>> email:  am...@pc...
>>>>>>>> office: 215-898-1205
>>>>>>>> fax:    215-746-6697
>>>>>>>> postal: Penn Genomics Institute
>>>>>>>>         Goddard Labs 212
>>>>>>>>         415 S. University Avenue
>>>>>>>>         Philadelphia, PA 19104-6017
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Gusdev-gusdev mailing list
>>>>>>>> Gus...@li...
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev

--
Aaron J. Mackey, Ph.D.
Project Manager, ApiDB Bioinformatics Resource Center
Penn Genomics Institute, University of Pennsylvania
email:  am...@pc...
office: 215-898-1205
fax:    215-746-6697
postal: Penn Genomics Institute
        Goddard Labs 212
        415 S. University Avenue
        Philadelphia, PA 19104-6017
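
For readers following the thread, here is a minimal sketch of the
location-duplication idea discussed above: each scaffold-based location
is copied into virtual-sequence coordinates, with a new
coordinate/strand but the same feature key, and a lookup must then name
the reference sequence it wants. It is written in Python rather than
Perl or PL/SQL. The Piece and Location classes, their field names, and
the 1-based inclusive coordinate convention are assumptions made for
illustration only; they are not the GUS schema, the
InsertVirtualSequenceFromMapping plugin, or the Bio::Coordinate API.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Piece:
    """Hypothetical placement of one scaffold on a virtual sequence."""
    scaffold_id: int
    virtual_id: int
    offset: int            # 1-based virtual coordinate of the scaffold's first placed base
    length: int            # scaffold length in bases
    reverse: bool = False  # True if the scaffold is placed reverse-complemented


@dataclass(frozen=True)
class Location:
    """Hypothetical location row: a span of one feature on one sequence."""
    feature_id: int
    sequence_id: int
    start: int   # 1-based, inclusive, start <= end
    end: int
    strand: int  # +1 or -1


def to_virtual(loc: Location, piece: Piece) -> Location:
    """Copy a scaffold-based location into virtual-sequence coordinates."""
    if loc.sequence_id != piece.scaffold_id:
        raise ValueError("location is not on this scaffold")
    if piece.reverse:
        # Mirror the span within the scaffold, then shift it into place.
        start = piece.offset + (piece.length - loc.end)
        end = piece.offset + (piece.length - loc.start)
        strand = -loc.strand
    else:
        start = piece.offset + loc.start - 1
        end = piece.offset + loc.end - 1
        strand = loc.strand
    # Only the sequence key, coordinates, and strand change;
    # feature_id stays the same, so both rows belong to one feature.
    return replace(loc, sequence_id=piece.virtual_id,
                   start=start, end=end, strand=strand)


def locations_on(locations, feature_id, sequence_id):
    """Return a feature's locations on one chosen reference sequence."""
    return [l for l in locations
            if l.feature_id == feature_id and l.sequence_id == sequence_id]
```

Because the original and the copied location share a feature_id,
pulling locations by feature alone returns both rows; filtering by
sequence_id, as locations_on does, is one way to pick the scaffold or
the virtual coordinate space explicitly.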