Re: [GUSDEV] using VirtualSequence for scaffolding assemblies (not EST assemblies!)

Let's see if I understand your proposal. Generate features and  
locations based on the static scaffold sequence coordinates. Then at  
the end of the pipeline generate the same (conceptual) features with  
locations based on the virtual sequence coordinates. That makes sense  
to me. The advantage is that you have both, one that is stable  
(scaffold) and one that can be regenerated as needed (virtual) but  
stored for convenience. I don't really see a disadvantage. Sure, it's
twice as many rows, but if you materialize a view you are adding these
anyway.

Chris
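
A minimal sketch of the second option discussed in this thread: keep feature locations in scaffold coordinates and regenerate copies in virtual-sequence coordinates once the scaffold placements are known. It is written in Python for illustration, not in the Perl/BioPerl code mentioned in the thread, and it is not the InsertVirtualSequenceFromMapping plugin. The `Piece` and `Location` classes and their field names are hypothetical stand-ins, not the actual DoTS.SequencePiece or NALocation columns. Coordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Piece:
    """Hypothetical placement of one scaffold on the virtual sequence."""
    scaffold_id: str
    virtual_start: int         # leftmost virtual position the scaffold occupies
    length: int                # scaffold length in bases
    is_reversed: bool = False  # True if the scaffold is reverse-complemented


@dataclass(frozen=True)
class Location:
    """Hypothetical feature location (1-based, inclusive, start <= end)."""
    seq_id: str
    start: int
    end: int
    strand: int  # +1 or -1


def to_virtual(loc: Location, piece: Piece, virtual_id: str) -> Location:
    """Map one scaffold-based location into virtual-sequence coordinates."""
    if loc.seq_id != piece.scaffold_id:
        raise ValueError("location and piece refer to different scaffolds")
    if not (1 <= loc.start <= loc.end <= piece.length):
        raise ValueError("location falls outside the scaffold")
    if not piece.is_reversed:
        offset = piece.virtual_start - 1
        return Location(virtual_id, loc.start + offset, loc.end + offset,
                        loc.strand)
    # Reversed piece: scaffold base i lands at virtual_start + length - i,
    # so start and end swap roles and the strand flips.
    edge = piece.virtual_start + piece.length
    return Location(virtual_id, edge - loc.end, edge - loc.start, -loc.strand)


def regenerate_virtual_locations(locations: Iterable[Location],
                                 pieces: Iterable[Piece],
                                 virtual_id: str) -> List[Location]:
    """Produce virtual-coordinate copies of all locations on placed scaffolds.

    Locations on scaffolds without a placement are skipped; the stable
    scaffold-based originals are never modified.
    """
    by_scaffold = {p.scaffold_id: p for p in pieces}
    return [to_virtual(loc, by_scaffold[loc.seq_id], virtual_id)
            for loc in locations if loc.seq_id in by_scaffold]


# Example: two scaffolds separated by a 100-base gap, the second reversed.
pieces = [Piece("scaffold_A", 1, 1000),
          Piece("scaffold_B", 1101, 500, is_reversed=True)]
features = [Location("scaffold_A", 200, 300, 1),
            Location("scaffold_B", 10, 50, 1)]
virtual = regenerate_virtual_locations(features, pieces, "chrI")
```

Because the scaffold-based rows are left untouched, the virtual copies can be dropped and regenerated whenever the scaffold-to-chromosome mapping changes, which is why this step would run last in the pipeline.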

On Jul 14, 2005, at 3:50 PM, Aaron J. Mackey wrote:

>
> As we struggle to use GUS the "right way", this is throwing us for  
> a loop.  On the one hand, our GUS client applications want to see  
> features in the coordinate system of the assembly (i.e. the virtual  
> sequence) -- on the other hand, it makes sense from a data  
> integrity viewpoint to only load/store feature coordinates with  
> respect to the static underlying scaffold coordinates, since the  
> scaffold-to-chromosome mapping (as defined by DoTS.SequencePiece)  
> may change over time.
>
> One option is to instantiate a read-only materialized view of the  
> NALocation for clients to use.
>
> A second option (which we've just discussed, and people seem to  
> like) is for the InsertVirtualSequenceFromMapping plugin we just  
> wrote to (re)generate duplicate versions of all NALocations  
> attached to a given SequencePiece in the new coordinate system  
> (requiring the virtual sequence building to be the last step in our  
> pipeline, instead of the first).
>
> -Aaron
>
> On Jul 14, 2005, at 2:53 PM, Chris Stoeckert wrote:
>
>
>> Hi Aaron,
>> I don't have a strong argument either way. In terms of
>> coordinate mapping utilities, I'm not aware of one, so I would
>> certainly welcome yours (but if others know of any, please speak up).
>>
>> Chris
>>
>> On Jul 14, 2005, at 11:13 AM, Aaron J. Mackey wrote:
>>
>>
>>
>>>
>>> Thanks Chris, I got it.
>>>
>>> If we are going to start hanging features off these, should we  
>>> hang them off the virtual chromosome sequence entries, or the  
>>> scaffold entries in externalnasequence?  Would it make sense to  
>>> "codify" this usage with associated PL/SQL code to reconstruct the
>>> virtual sequence and associated features in the virtual
>>> coordinate space?  I guess one way to do this would be to have  
>>> Virtual*Feature read-only views (and thus target everything to  
>>> the "real" coordinate system such that future rebuilds of the  
>>> virtual sequence would not require recalculation of feature  
>>> locations)?
>>>
>>> Relatedly, is there coordinate mapping code already in some GUS  
>>> utility module (if not, I'm happy to contribute mine, based on  
>>> BioPerl's powerful Bio::Coordinate::Map framework)?
>>>
>>> -Aaron
>>>
>>> On Jul 14, 2005, at 11:05 AM, Chris Stoeckert wrote:
>>>
>>>
>>>
>>>
>>>> Hi Aaron,
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> 1) VirtualSequence has a required sequence_version attribute -  
>>>>> what is this for?  Is this redundant to  
>>>>> external_database_release_id?
>>>>>
>>>>>
>>>>>
>>>>>
>>>> This is a superclass attribute inherited by all NASequence  
>>>> views. My recollection is that individual GenBank sequence  
>>>> entries have version tags at the end of accessions, as in
>>>> "DQ094190.1" for Toxoplasma gondii ATP-binding cassette protein  
>>>> subfamily B member 3 (found in VERSION field).
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> 2) VirtualSequence has a clob for storing the assembled  
>>>>> sequence (I suspect), but the Perl object layer doesn't use  
>>>>> this slot, instead rebuilding the sequence from the sequence  
>>>>> pieces.  Am I correct in this usage, or should I not, in fact,  
>>>>> be storing the assembled sequence in VirtualSequence?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> Again, this is a superclass attribute. I think using it is
>>>> optional. Reasons not to use it are that the virtual sequence is
>>>> hard to represent as a single entity (e.g., it contains gaps), or
>>>> that it is very large, so storing what can easily be regenerated
>>>> carries significant overhead (leaving it empty also avoids
>>>> denormalization). Reasons to use it are the convenience and
>>>> efficiency of retrieving the sequence without needing to rebuild it.
>>>>
>>>> Chris
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Aaron
>>>>>
>>>>> --
>>>>> Aaron J. Mackey, Ph.D.
>>>>> Project Manager, ApiDB Bioinformatics Resource Center
>>>>> Penn Genomics Institute, University of Pennsylvania
>>>>> email:  am...@pc...
>>>>> office: 215-898-1205
>>>>> fax:    215-746-6697
>>>>> postal: Penn Genomics Institute
>>>>>         Goddard Labs 212
>>>>>         415 S. University Avenue
>>>>>         Philadelphia, PA  19104-6017
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Gusdev-gusdev mailing list
>>>>> Gus...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>