From: Aaron J. M. <am...@pc...> - 2005-07-16 14:38:00
We solved it by just copying the feature trees directly at the
NAFeatureImp table ... 5 or so lines of code, and no big deal.

Remember that we may want to actually do this coordinate mapping along
multiple alternative coordinate systems, so the post-processing is
really more of an entirely separate InsertNewCoordinateSystem.pm plugin
that takes (source, target) tuples that identify a given mapping found
in SequencePiece. Better plugin name suggestions welcome.

-Aaron

steve wrote:

> i agree that we need to project copies of the scaffold features onto the
> virtual chromosome.
>
> but, i want to point out that this may be a bit tricky if done as a
> post-process. the reason is that a "feature" spans multiple tables.
> so, the copying of a feature means the traversal of a tree of its child
> objects. how does the post-process program know what that tree is?
>
> one way is for the programmer of that program to use human knowledge of
> the schema to produce the possible tree, and traverse it.
>
> another way is to use schema information to generate the tree.
>
> an alternative approach would be:
> 1. create the virtual sequence as a pre-process that does not write
> features.
> 2. any plugin that writes features has the option to take a virtual
> sequence. if given that, it would read all the virtual sequence's
> pieces to determine their offset. it would use their source_id to
> correlate them with the input, and as the features are created,
> project them simultaneously on both the piece and the virtual sequence.
>
> that sounds kind of complicated, so probably the post-process is better.
>
> it's kind of late and i'm kind of foggy...
>
> steve
>
> Chris Stoeckert wrote:
>
>> No, these are different features because they are spans on different
>> sequences (one scaffold and one virtual) so you won't get two
>> locations based on this for the same na_feature_id.
>> NAFeature has the na_sequence_id which tells you whether it is the
>> scaffold or virtual sequence. If these are Gene, RNA, or Protein
>> features then you can say that they are the same conceptual feature
>> through the central dogma and instance tables. If they are features
>> like Exon, then you could infer this as you say by parent_id,
>> source_id, etc.
>>
>> Chris
>>
>> On Jul 14, 2005, at 5:52 PM, Aaron J. Mackey wrote:
>>
>>> Exactly. No logic is required, because we simply copy any and all
>>> NALocation objects attached to the sequences and generate new
>>> NALocation objects that point to the virtual sequence, with new
>>> coordinate/strand, but all other foreign keys remain the same (i.e.
>>> children of the same feature).
>>>
>>> Hmm, that means that if you blindly pull locations for a given
>>> feature, you will get two locations, not just one (so you'll need to
>>> specify which reference sequence you wish to obtain the location on).
>>>
>>> -Aaron
>>>
>>> On Jul 14, 2005, at 5:41 PM, Chris Stoeckert wrote:
>>>
>>>> Let's see if I understand your proposal. Generate features and
>>>> locations based on the static scaffold sequence coordinates. Then
>>>> at the end of the pipeline generate the same (conceptual) features
>>>> with locations based on the virtual sequence coordinates. That
>>>> makes sense to me. The advantage is that you have both, one that is
>>>> stable (scaffold) and one that can be regenerated as needed
>>>> (virtual) but stored for convenience. I don't really see a
>>>> disadvantage - sure it's twice as many rows but if you materialize
>>>> a view you are adding these anyway.
>>>>
>>>> Chris
>>>>
>>>> On Jul 14, 2005, at 3:50 PM, Aaron J. Mackey wrote:
>>>>
>>>>> As we struggle to use GUS the "right way", this is throwing us for
>>>>> a loop. On the one hand, our GUS client applications want to see
>>>>> features in the coordinate system of the assembly (i.e. the
>>>>> virtual sequence) -- on the other hand, it makes sense from a data
>>>>> integrity viewpoint to only load/store feature coordinates with
>>>>> respect to the static underlying scaffold coordinates, since the
>>>>> scaffold-to-chromosome mapping (as defined by DoTS.SequencePiece)
>>>>> may change over time.
>>>>>
>>>>> One option is to instantiate a read-only materialized view of the
>>>>> NALocation for clients to use.
>>>>>
>>>>> A second option (which we've just discussed, and people seem to
>>>>> like) is for the InsertVirtualSequenceFromMapping plugin we just
>>>>> wrote to (re)generate duplicate versions of all NALocations
>>>>> attached to a given SequencePiece in the new coordinate system
>>>>> (requiring the virtual sequence building to be the last step in
>>>>> our pipeline, instead of the first).
>>>>>
>>>>> -Aaron
>>>>>
>>>>> On Jul 14, 2005, at 2:53 PM, Chris Stoeckert wrote:
>>>>>
>>>>>> Hi Aaron,
>>>>>> I don't have a strong argument for either way. In terms of
>>>>>> coordinate mapping utilities, I'm not aware of one so certainly
>>>>>> would welcome yours (but if others know of ones please speak up).
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> On Jul 14, 2005, at 11:13 AM, Aaron J. Mackey wrote:
>>>>>>
>>>>>>> Thanks Chris, I got it.
>>>>>>>
>>>>>>> If we are going to start hanging features off these, should we
>>>>>>> hang them off the virtual chromosome sequence entries, or the
>>>>>>> scaffold entries in externalnasequence? Would it make sense to
>>>>>>> "codify" this usage with associated PL/SQL code to reconstruct
>>>>>>> virtual sequence and associated features in the virtual
>>>>>>> coordinate space? I guess one way to do this would be to have
>>>>>>> Virtual*Feature read-only views (and thus target everything to
>>>>>>> the "real" coordinate system such that future rebuilds of the
>>>>>>> virtual sequence would not require recalculation of feature
>>>>>>> locations)?
>>>>>>>
>>>>>>> Relatedly, is there coordinate mapping code already in some GUS
>>>>>>> utility module (if not, I'm happy to contribute mine, based on
>>>>>>> BioPerl's powerful Bio::Coordinate::Map framework)?
>>>>>>>
>>>>>>> -Aaron
>>>>>>>
>>>>>>> On Jul 14, 2005, at 11:05 AM, Chris Stoeckert wrote:
>>>>>>>
>>>>>>>> Hi Aaron,
>>>>>>>>
>>>>>>>>> 1) VirtualSequence has a required sequence_version attribute -
>>>>>>>>> what is this for? Is this redundant to
>>>>>>>>> external_database_release_id?
>>>>>>>>
>>>>>>>> This is a superclass attribute inherited by all NASequence
>>>>>>>> views. My recollection is that individual GenBank sequence
>>>>>>>> entries have version tags at the end of accessions as in
>>>>>>>> "DQ094190.1" for Toxoplasma gondii ATP-binding cassette protein
>>>>>>>> subfamily B member 3 (found in VERSION field).
>>>>>>>>
>>>>>>>>> 2) VirtualSequence has a clob for storing the assembled
>>>>>>>>> sequence (I suspect), but the Perl object layer doesn't use
>>>>>>>>> this slot, instead rebuilding the sequence from the sequence
>>>>>>>>> pieces. Am I correct in this usage, or should I not, in fact,
>>>>>>>>> be storing the assembled sequence in VirtualSequence?
>>>>>>>>
>>>>>>>> Again this is a superclass attribute. I think using it is
>>>>>>>> optional. Reasons not to use it are that the virtual sequence
>>>>>>>> is hard to represent as a single entity (e.g., contains gaps)
>>>>>>>> or is very large and has a significant overhead cost of storing
>>>>>>>> what can be easily regenerated (and avoid denormalization).
>>>>>>>> Reasons to use are for convenience and efficiency of retrieving
>>>>>>>> the sequence without the need to rebuild.
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -Aaron
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Aaron J. Mackey, Ph.D.
>>>>>>>>> Project Manager, ApiDB Bioinformatics Resource Center
>>>>>>>>> Penn Genomics Institute, University of Pennsylvania
>>>>>>>>> email: am...@pc...
>>>>>>>>> office: 215-898-1205
>>>>>>>>> fax: 215-746-6697
>>>>>>>>> postal: Penn Genomics Institute
>>>>>>>>> Goddard Labs 212
>>>>>>>>> 415 S. University Avenue
>>>>>>>>> Philadelphia, PA 19104-6017
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Gusdev-gusdev mailing list
>>>>>>>>> Gus...@li...
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
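
The projection discussed in this thread (copying scaffold-based locations into the coordinate system of the virtual sequence) comes down to a small amount of arithmetic per sequence piece. What follows is a minimal Python sketch of that arithmetic only, not the GUS plugin code. The Piece and Location classes, their field names, and the 1-based inclusive coordinate convention are all assumptions made for illustration; they do not mirror the actual DoTS.SequencePiece or NALocation columns, and the sketch takes no position on whether the feature rows themselves are copied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """Hypothetical stand-in for one scaffold-to-virtual mapping row."""
    scaffold_id: int        # sequence the original locations refer to
    virtual_id: int         # virtual sequence the piece is placed on
    offset: int             # bases of the virtual sequence before this piece
    length: int             # length of the scaffold
    is_reversed: bool = False  # True if the scaffold is placed reverse-complemented


@dataclass(frozen=True)
class Location:
    """Hypothetical stand-in for one location row (1-based, inclusive)."""
    sequence_id: int
    start: int
    end: int
    is_reversed: bool


def project(loc: Location, piece: Piece) -> Location:
    """Return a copy of loc expressed in the virtual sequence's coordinates."""
    if loc.sequence_id != piece.scaffold_id:
        raise ValueError("location is not on this piece's scaffold")
    if not (1 <= loc.start <= loc.end <= piece.length):
        raise ValueError("location falls outside the scaffold")

    if piece.is_reversed:
        # A reversed piece mirrors positions within the piece and flips strand.
        start = piece.offset + piece.length - loc.end + 1
        end = piece.offset + piece.length - loc.start + 1
        strand = not loc.is_reversed
    else:
        start = piece.offset + loc.start
        end = piece.offset + loc.end
        strand = loc.is_reversed
    return Location(piece.virtual_id, start, end, strand)


if __name__ == "__main__":
    forward = Piece(scaffold_id=1, virtual_id=99, offset=1000, length=500)
    flipped = Piece(scaffold_id=2, virtual_id=99, offset=1500, length=500,
                    is_reversed=True)
    print(project(Location(1, 10, 20, False), forward))
    print(project(Location(2, 10, 20, False), flipped))
```

Because the original scaffold-based location is left untouched and a new object is returned, the projected copies can be thrown away and regenerated whenever the scaffold-to-chromosome mapping changes, which is the property the thread is after. Supporting several alternative coordinate systems would then amount to calling project once per (source, target) mapping.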