Re: [GUSDEV] using VirtualSequence for scaffolding assemblies (not EST assemblies!)

We solved it by just copying the feature trees directly at the 
NAFeatureImp table ... 5 or so lines of code, and no big deal.
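
For readers following along, here is a minimal sketch of what "copying a feature tree" can look like. It is not the code we ran: the rows are plain Python dicts rather than GUS objects, only the na_feature_id, parent_id and na_sequence_id columns mentioned in this thread are modelled, and the function name and the "name" key are hypothetical.

```python
from itertools import count


def copy_feature_tree(rows, root_id, target_sequence_id, next_id):
    """Copy the feature rooted at root_id, with all descendants, onto another sequence.

    rows maps na_feature_id -> dict with at least "parent_id" and
    "na_sequence_id" (a hypothetical in-memory stand-in for NAFeatureImp).
    next_id is an iterator that yields unused feature ids.
    Returns the new rows only; the input is left untouched.
    """
    # Index children by parent so the tree can be walked top-down.
    children = {}
    for feature_id, row in rows.items():
        children.setdefault(row["parent_id"], []).append(feature_id)

    copies = {}
    stack = [(root_id, None)]  # (old feature id, new parent id)
    while stack:
        old_id, new_parent = stack.pop()
        new_id = next(next_id)
        # Same attributes, but re-parented inside the copy and re-pointed
        # at the target (e.g. virtual) sequence.
        copies[new_id] = {
            **rows[old_id],
            "parent_id": new_parent,
            "na_sequence_id": target_sequence_id,
        }
        for child_id in children.get(old_id, []):
            stack.append((child_id, new_id))
    return copies
```

The copied root gets no parent in this sketch, and any coordinates attached to the rows would still have to be recalculated for the target sequence.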

Remember that we may want to actually do this coordinate mapping along 
multiple alternative coordinate systems, so the post-processing is 
really more of an entirely separate InsertNewCoordinateSystem.pm plugin 
that takes (source, target) tuples that identify a given mapping found 
in SequencePiece.  Better plugin name suggestions welcome.
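
A rough sketch of the projection such a plugin would do for each (source, target) mapping, assuming 1-based inclusive coordinates. The Piece fields below (offset, length, is_reversed) are illustrative assumptions and not the actual SequencePiece columns, and the sequence names in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """Hypothetical placement of one source sequence on one target sequence."""
    offset: int        # 1-based target position of the piece's leftmost base
    length: int        # length of the source (scaffold) sequence
    is_reversed: bool  # True if the piece is placed in reverse orientation


def project(piece, start, end, strand):
    """Map a 1-based, inclusive source location onto the target sequence."""
    if not (1 <= start <= end <= piece.length):
        raise ValueError("location falls outside the source sequence")
    if piece.is_reversed:
        # Source position p lands on offset + length - p, so the ends swap
        # and the strand flips.
        return (piece.offset + piece.length - end,
                piece.offset + piece.length - start,
                -strand)
    return piece.offset + start - 1, piece.offset + end - 1, strand


# One entry per (source, target) tuple, so alternative coordinate
# systems can sit side by side.
mappings = {
    ("scaffold_1", "chrI"): Piece(offset=1001, length=500, is_reversed=False),
    ("scaffold_1", "chrI_alt"): Piece(offset=1001, length=500, is_reversed=True),
}
forward = project(mappings[("scaffold_1", "chrI")], 10, 20, 1)
reverse = project(mappings[("scaffold_1", "chrI_alt")], 10, 20, 1)
```

Keying the mappings by (source, target) keeps each projection independent of the others, so one target coordinate system can be rebuilt without touching the rest.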

-Aaron

steve wrote:
> i agree that we need to project copies of the scaffold features onto the 
> virtual chromosome.
> 
> but, i want to point out that this may be a bit tricky if done as a 
> post-process.  the reason is that a "feature" spans multiple tables, 
> so copying a feature means traversing the tree of its child 
> objects.  how does the post-process program know what that tree is?
> 
> one way is for the programmer of that program to use human knowledge of 
> the schema to produce the possible tree, and traverse it.
> 
> another way is to use schema information to generate the tree.
> 
> an alternative approach would be:
>  1. create the virtual sequence as a pre-process that does not write 
> features.
>  2. any plugin that writes features has the option to take a virtual 
> sequence.  if given one, it would read all the virtual sequence's 
> pieces to determine their offsets.  it would use their source_id to 
> correlate them with the input and, as the features are created, 
> project them simultaneously onto both the piece and the virtual sequence.
> 
> that sounds kind of complicated, so probably the post-process is better.
> 
> it's kind of late and i'm kind of foggy...
> 
> steve
> 
> Chris Stoeckert wrote:
> 
>> No, these are different features because they are spans on different  
>> sequences (one scaffold and one virtual) so you won't get two  
>> locations based on this for the same na_feature_id. NAFeature has the  
>> na_sequence_id which tells you whether it is the scaffold or virtual  
>> sequence. If these are Gene, RNA, or Protein features then you can  
>> say that they are the same conceptual feature through the central  
>> dogma and instance tables. If they are features like Exon, then you  
>> could infer this as you say by parent_id, source_id, etc.
>>
>> Chris
>>
>> On Jul 14, 2005, at 5:52 PM, Aaron J. Mackey wrote:
>>
>>>
>>> Exactly.  No logic is required, because we simply copy any and all  
>>> NALocation objects attached to the sequences and generate new  
>>> NALocation objects that point to the virtual sequence, with new  
>>> coordinate/strand, but all other foreign keys remain the same (i.e.  
>>> children of the same feature).
>>>
>>> Hmm, that means that if you blindly pull locations for a given
>>> feature, you will get two locations, not just one (so you'll need to
>>> specify which reference sequence you wish to obtain the location on).
>>>
>>> -Aaron
>>>
>>> On Jul 14, 2005, at 5:41 PM, Chris Stoeckert wrote:
>>>
>>>
>>>> Let's see if I understand your proposal. Generate features and  
>>>> locations based on the static scaffold sequence coordinates. Then  
>>>> at the end of the pipeline generate the same (conceptual) features  
>>>> with locations based on the virtual sequence coordinates. That  
>>>> makes sense to me. The advantage is that you have both: one that is
>>>> stable (scaffold) and one that can be regenerated as needed
>>>> (virtual) but is stored for convenience. I don't really see a
>>>> disadvantage. Sure, it's twice as many rows, but if you materialize
>>>> a view you are adding these anyway.
>>>>
>>>> Chris
>>>>
>>>> On Jul 14, 2005, at 3:50 PM, Aaron J. Mackey wrote:
>>>>
>>>>
>>>>
>>>>>
>>>>> As we struggle to use GUS the "right way", this is throwing us for
>>>>> a loop.  On the one hand, our GUS client applications want to see
>>>>> features in the coordinate system of the assembly (i.e. the
>>>>> virtual sequence) -- on the other hand, it makes sense from a data
>>>>> integrity viewpoint to only load/store feature coordinates with
>>>>> respect to the static underlying scaffold coordinates, since the
>>>>> scaffold-to-chromosome mapping (as defined by DoTS.SequencePiece)
>>>>> may change over time.
>>>>>
>>>>> One option is to instantiate a read-only materialized view of the  
>>>>> NALocation for clients to use.
>>>>>
>>>>> A second option (which we've just discussed, and people seem to  
>>>>> like) is for the InsertVirtualSequenceFromMapping plugin we just  
>>>>> wrote to (re)generate duplicate versions of all NALocations  
>>>>> attached to a given SequencePiece in the new coordinate system  
>>>>> (requiring the virtual sequence building to be the last step in  
>>>>> our pipeline, instead of the first).
>>>>>
>>>>> -Aaron
>>>>>
>>>>> On Jul 14, 2005, at 2:53 PM, Chris Stoeckert wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Hi Aaron,
>>>>>> I don't have a strong argument for either way. In terms of  
>>>>>> coordinate mapping utilities, I'm not aware of one so certainly  
>>>>>> would welcome yours (but if others know of ones please speak up).
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> On Jul 14, 2005, at 11:13 AM, Aaron J. Mackey wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks Chris, I got it.
>>>>>>>
>>>>>>> If we are going to start hanging features off these, should we  
>>>>>>> hang them off the virtual chromosome sequence entries, or the  
>>>>>>> scaffold entries in externalnasequence?  Would it make sense to  
>>>>>>> "codify" this usage with associated PL/SQL code to reconstruct the
>>>>>>> virtual sequence and associated features in the virtual
>>>>>>> coordinate space?  I guess one way to do this would be to have  
>>>>>>> Virtual*Feature read-only views (and thus target everything to  
>>>>>>> the "real" coordinate system such that future rebuilds of the  
>>>>>>> virtual sequence would not require recalculation of feature  
>>>>>>> locations)?
>>>>>>>
>>>>>>> Relatedly, is there coordinate mapping code already in some GUS  
>>>>>>> utility module (if not, I'm happy to contribute mine, based on  
>>>>>>> BioPerl's powerful Bio::Coordinate::Map framework)?
>>>>>>>
>>>>>>> -Aaron
>>>>>>>
>>>>>>> On Jul 14, 2005, at 11:05 AM, Chris Stoeckert wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Hi Aaron,
>>>>>>>>
>>>>>>>>> 1) VirtualSequence has a required sequence_version attribute -
>>>>>>>>> what is this for?  Is this redundant with
>>>>>>>>> external_database_release_id?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> This is a superclass attribute inherited by all NASequence
>>>>>>>> views. My recollection is that individual GenBank sequence
>>>>>>>> entries have version tags at the end of accessions, as in
>>>>>>>> "DQ094190.1" for Toxoplasma gondii ATP-binding cassette protein
>>>>>>>> subfamily B member 3 (found in the VERSION field).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>> 2) VirtualSequence has a clob for storing the assembled
>>>>>>>>> sequence (I suspect), but the Perl object layer doesn't use
>>>>>>>>> this slot, instead rebuilding the sequence from the sequence
>>>>>>>>> pieces.  Am I correct in this usage, or should I not, in fact,
>>>>>>>>> be storing the assembled sequence in VirtualSequence?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> Again this is a superclass attribute. I think using it is
>>>>>>>> optional. Reasons not to use it are that the virtual sequence
>>>>>>>> is hard to represent as a single entity (e.g., contains gaps)
>>>>>>>> or is very large, so there is a significant overhead cost in
>>>>>>>> storing what can be easily regenerated (and skipping it avoids
>>>>>>>> denormalization). Reasons to use it are the convenience and
>>>>>>>> efficiency of retrieving the sequence without the need to rebuild.
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -Aaron
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Aaron J. Mackey, Ph.D.
>>>>>>>>> Project Manager, ApiDB Bioinformatics Resource Center
>>>>>>>>> Penn Genomics Institute, University of Pennsylvania
>>>>>>>>> email:  am...@pc...
>>>>>>>>> office: 215-898-1205
>>>>>>>>> fax:    215-746-6697
>>>>>>>>> postal: Penn Genomics Institute
>>>>>>>>>         Goddard Labs 212
>>>>>>>>>         415 S. University Avenue
>>>>>>>>>         Philadelphia, PA  19104-6017
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Gusdev-gusdev mailing list
>>>>>>>>> Gus...@li...
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
> 
>