Re: [GUSDEV] using VirtualSequence for scaffolding assemblies (not EST assemblies!)


As we struggle to use GUS the "right way", this is throwing us for a  
loop.  On the one hand, our GUS client applications want to see  
features in the coordinate system of the assembly (i.e. the virtual  
sequence) -- on the other hand, it makes sense from a data integrity  
viewpoint to only load/store feature coordinates with respect to the  
static underlying scaffold coordinates, since the scaffold-to-chromosome  
mapping (as defined by DoTS.SequencePiece) may change over time.

One option is to instantiate a read-only materialized view of  
NALocation, expressed in virtual sequence coordinates, for clients to use.

A second option (which we've just discussed, and people seem to like)  
is for the InsertVirtualSequenceFromMapping plugin we just wrote to  
(re)generate duplicate versions of all NALocations attached to a  
given SequencePiece in the new coordinate system (requiring the  
virtual sequence building to be the last step in our pipeline,  
instead of the first).
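
To make the coordinate arithmetic concrete, here is a minimal sketch of what that regeneration step would have to compute for each location. It is an illustration only, written in Python rather than in the plugin's own code, and every name in it (Piece, Location, to_virtual, and their fields) is hypothetical: these are not the actual DoTS.SequencePiece or NALocation columns. It assumes 1-based inclusive coordinates, a single start/end pair per location, and that each scaffold is placed whole on the virtual sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """Hypothetical stand-in for one scaffold-to-chromosome mapping row."""
    scaffold_id: str
    offset: int    # number of virtual-sequence bases to the left of this scaffold
    length: int    # scaffold length in bases
    flipped: bool  # True if the scaffold is placed in reverse orientation


@dataclass(frozen=True)
class Location:
    """Simplified feature location: 1-based, inclusive, start <= end."""
    start: int
    end: int
    is_reversed: bool


def to_virtual(loc: Location, piece: Piece) -> Location:
    """Return a copy of a scaffold-based location in virtual coordinates."""
    if not (1 <= loc.start <= loc.end <= piece.length):
        raise ValueError("location falls outside the scaffold")
    if piece.flipped:
        # A flipped scaffold mirrors positions around its midpoint:
        # scaffold base 1 lands on the last virtual base of the piece,
        # and the feature's strand is inverted.
        start = piece.offset + piece.length - loc.end + 1
        end = piece.offset + piece.length - loc.start + 1
        return Location(start, end, not loc.is_reversed)
    # A forward scaffold is a plain shift by its offset.
    return Location(piece.offset + loc.start, piece.offset + loc.end,
                    loc.is_reversed)
```

The stored scaffold-based locations are never modified here. If the mapping changes, the duplicates can be dropped and regenerated by calling to_virtual again with the new Piece rows.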

-Aaron

On Jul 14, 2005, at 2:53 PM, Chris Stoeckert wrote:

> Hi Aaron,
> I don't have a strong argument either way. As for coordinate  
> mapping utilities, I'm not aware of any, so I would certainly  
> welcome yours (but if others know of one, please speak up).
>
> Chris
>
> On Jul 14, 2005, at 11:13 AM, Aaron J. Mackey wrote:
>
>
>>
>> Thanks Chris, I got it.
>>
>> If we are going to start hanging features off these, should we  
>> hang them off the virtual chromosome sequence entries, or the  
>> scaffold entries in externalnasequence?  Would it make sense to  
>> "codify" this usage with associated PL/SQL code to reconstruct the  
>> virtual sequence and associated features in the virtual coordinate  
>> space?  I guess one way to do this would be to have  
>> Virtual*Feature read-only views (and thus target everything to the  
>> "real" coordinate system such that future rebuilds of the virtual  
>> sequence would not require recalculation of feature locations)?
>>
>> Relatedly, is there coordinate mapping code already in some GUS  
>> utility module (if not, I'm happy to contribute mine, based on  
>> BioPerl's powerful Bio::Coordinate::Map framework)?
>>
>> -Aaron
>>
>> On Jul 14, 2005, at 11:05 AM, Chris Stoeckert wrote:
>>
>>
>>
>>> Hi Aaron,
>>>
>>>
>>>
>>>
>>>> 1) VirtualSequence has a required sequence_version attribute -  
>>>> what is this for?  Is this redundant to  
>>>> external_database_release_id?
>>>>
>>>>
>>>>
>>> This is a superclass attribute inherited by all NASequence views.  
>>> My recollection is that individual GenBank sequence entries have  
>>> version tags at the end of their accessions, as in "DQ094190.1" for  
>>> Toxoplasma gondii ATP-binding cassette protein subfamily B member  
>>> 3 (found in the VERSION field).
>>>
>>>
>>>
>>>
>>>> 2) VirtualSequence has a clob for storing the assembled sequence  
>>>> (I suspect), but the Perl object layer doesn't use this slot,  
>>>> instead rebuilding the sequence from the sequence pieces.  Am I  
>>>> correct in this usage, or should I not, in fact, be storing the  
>>>> assembled sequence in VirtualSequence?
>>>>
>>>>
>>>>
>>>
>>> Again, this is a superclass attribute. I think using it is  
>>> optional. Reasons not to use it are that the virtual sequence is  
>>> hard to represent as a single entity (e.g., it contains gaps), or  
>>> that it is very large, so storing it carries significant overhead  
>>> (and denormalizes the data) for something that can easily be  
>>> regenerated. Reasons to use it are the convenience and efficiency  
>>> of retrieving the sequence without needing to rebuild it.
>>>
>>> Chris
>>>
>>>
>>>
>>>
>>>
>>>>
>>>> Thanks,
>>>>
>>>> -Aaron
>>>>
>>>> --
>>>> Aaron J. Mackey, Ph.D.
>>>> Project Manager, ApiDB Bioinformatics Resource Center
>>>> Penn Genomics Institute, University of Pennsylvania
>>>> email:  am...@pc...
>>>> office: 215-898-1205
>>>> fax:    215-746-6697
>>>> postal: Penn Genomics Institute
>>>>         Goddard Labs 212
>>>>         415 S. University Avenue
>>>>         Philadelphia, PA  19104-6017
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Gusdev-gusdev mailing list
>>>> Gus...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>

--
Aaron J. Mackey, Ph.D.
Project Manager, ApiDB Bioinformatics Resource Center
Penn Genomics Institute, University of Pennsylvania
email:  am...@pc...
office: 215-898-1205
fax:    215-746-6697
postal: Penn Genomics Institute
         Goddard Labs 212
         415 S. University Avenue
         Philadelphia, PA  19104-6017