Re: [GUSDEV] using VirtualSequence for scaffolding assemblies (not EST assemblies!)


Exactly.  No extra logic is required: we simply copy every NALocation
object attached to the scaffold sequences and generate new NALocation
objects that point to the virtual sequence.  The copies get new
coordinates and strand, but all other foreign keys stay the same (i.e.
they remain children of the same feature).
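
A minimal sketch of that copy step, in Python rather than the GUS Perl
object layer.  The Location and Piece records and their field names are
hypothetical stand-ins (not the actual NALocation or SequencePiece
columns), and it assumes 1-based inclusive coordinates with each
scaffold placed whole on the virtual sequence:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Location:
    """Hypothetical stand-in for a location row; not the real GUS schema."""
    feature_id: int    # parent feature; left untouched by the copy
    sequence_id: int   # sequence the coordinates refer to
    start: int         # 1-based, inclusive
    end: int           # 1-based, inclusive
    is_reversed: bool


@dataclass(frozen=True)
class Piece:
    """Hypothetical placement of one scaffold on a virtual sequence."""
    scaffold_id: int
    virtual_id: int
    offset: int        # virtual coordinate of the piece's leftmost base
    length: int        # scaffold length in bases
    is_reversed: bool  # True if the scaffold is reverse-complemented


def to_virtual(loc: Location, piece: Piece) -> Location:
    """Return a copy of loc expressed in virtual sequence coordinates."""
    if loc.sequence_id != piece.scaffold_id:
        raise ValueError("location is not on this piece's scaffold")
    if piece.is_reversed:
        # Scaffold base p lands on virtual base offset + length - p.
        start = piece.offset + piece.length - loc.end
        end = piece.offset + piece.length - loc.start
    else:
        # Scaffold base p lands on virtual base offset + p - 1.
        start = piece.offset + loc.start - 1
        end = piece.offset + loc.end - 1
    return replace(
        loc,
        sequence_id=piece.virtual_id,
        start=start,
        end=end,
        is_reversed=loc.is_reversed != piece.is_reversed,
    )
```

Only the sequence reference, coordinates and strand change; feature_id
is carried over as-is, which is why the copy needs no feature-specific
logic.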

Hmm, that means that if you blindly pull locations for a given  
feature, you will get two locations, not just one (so you'll need to  
specify which reference sequence you wish to obtain the location on).
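
To illustrate the lookup side, here is a small Python sketch of picking
the location on one reference sequence.  The Location tuple and the
location_on helper are hypothetical names for illustration, not part of
GUS:

```python
from typing import Iterable, NamedTuple


class Location(NamedTuple):
    """Hypothetical stand-in for a location row; not the real GUS schema."""
    feature_id: int
    sequence_id: int  # reference sequence the coordinates refer to
    start: int
    end: int


def location_on(locations: Iterable[Location], feature_id: int,
                sequence_id: int) -> Location:
    """Return the single location of a feature on the given sequence."""
    matches = [
        loc for loc in locations
        if loc.feature_id == feature_id and loc.sequence_id == sequence_id
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected 1 location for feature {feature_id} on sequence "
            f"{sequence_id}, found {len(matches)}"
        )
    return matches[0]
```

Filtering on the feature alone would return both the scaffold copy and
the virtual copy, so the reference sequence has to be part of the
lookup.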

-Aaron

On Jul 14, 2005, at 5:41 PM, Chris Stoeckert wrote:

> Let's see if I understand your proposal. Generate features and  
> locations based on the static scaffold sequence coordinates. Then  
> at the end of the pipeline generate the same (conceptual) features  
> with locations based on the virtual sequence coordinates. That  
> makes sense to me. The advantage is that you have both, one that is  
> stable (scaffold) and one that can be regenerated as needed  
> (virtual) but stored for convenience. I don't really see a  
> disadvantage. Sure, it's twice as many rows, but if you materialize
> a view you are adding these anyway.
>
> Chris
>
> On Jul 14, 2005, at 3:50 PM, Aaron J. Mackey wrote:
>
>>
>> As we struggle to use GUS the "right way", this is throwing us for  
>> a loop.  On the one hand, our GUS client applications want to see  
>> features in the coordinate system of the assembly (i.e. the  
>> virtual sequence) -- on the other hand, it makes sense from a data  
>> integrity viewpoint to only load/store feature coordinates with  
>> respect to the static underlying scaffold coordinates, since the  
>> scaffold-to-chromosome mapping (as defined by DoTS.SequencePiece)  
>> may change over time.
>>
>> One option is to instantiate a read-only materialized view of the  
>> NALocation for clients to use.
>>
>> A second option (which we've just discussed, and people seem to  
>> like) is for the InsertVirtualSequenceFromMapping plugin we just  
>> wrote to (re)generate duplicate versions of all NALocations  
>> attached to a given SequencePiece in the new coordinate system  
>> (requiring the virtual sequence building to be the last step in  
>> our pipeline, instead of the first).
>>
>> -Aaron
>>
>> On Jul 14, 2005, at 2:53 PM, Chris Stoeckert wrote:
>>
>>> Hi Aaron,
>>> I don't have a strong argument either way. As for coordinate
>>> mapping utilities, I'm not aware of one, so I would certainly
>>> welcome yours (but if others know of any, please speak up).
>>>
>>> Chris
>>>
>>> On Jul 14, 2005, at 11:13 AM, Aaron J. Mackey wrote:
>>>
>>>>
>>>> Thanks Chris, I got it.
>>>>
>>>> If we are going to start hanging features off these, should we  
>>>> hang them off the virtual chromosome sequence entries, or the  
>>>> scaffold entries in externalnasequence?  Would it make sense to
>>>> "codify" this usage with associated PL/SQL code to reconstruct the
>>>> virtual sequence and its associated features in the virtual
>>>> coordinate space?  I guess one way to do this would be to have  
>>>> Virtual*Feature read-only views (and thus target everything to  
>>>> the "real" coordinate system such that future rebuilds of the  
>>>> virtual sequence would not require recalculation of feature  
>>>> locations)?
>>>>
>>>> Relatedly, is there coordinate mapping code already in some GUS  
>>>> utility module (if not, I'm happy to contribute mine, based on  
>>>> BioPerl's powerful Bio::Coordinate::Map framework)?
>>>>
>>>> -Aaron
>>>>
>>>> On Jul 14, 2005, at 11:05 AM, Chris Stoeckert wrote:
>>>>
>>>>> Hi Aaron,
>>>>>
>>>>>> 1) VirtualSequence has a required sequence_version attribute -  
>>>>>> what is this for?  Is this redundant to  
>>>>>> external_database_release_id?
>>>>>>
>>>>> This is a superclass attribute inherited by all NASequence
>>>>> views. My recollection is that individual GenBank sequence
>>>>> entries have version tags at the end of their accessions, as in
>>>>> "DQ094190.1" for Toxoplasma gondii ATP-binding cassette protein
>>>>> subfamily B member 3 (found in the VERSION field).
>>>>>
>>>>>> 2) VirtualSequence has a clob for storing the assembled  
>>>>>> sequence (I suspect), but the Perl object layer doesn't use  
>>>>>> this slot, instead rebuilding the sequence from the sequence  
>>>>>> pieces.  Am I correct in this usage, or should I not, in fact,  
>>>>>> be storing the assembled sequence in VirtualSequence?
>>>>>>
>>>>>
>>>>> Again this is a superclass attribute. I think using it is
>>>>> optional. Reasons not to use it are that the virtual sequence
>>>>> is hard to represent as a single entity (e.g., it contains gaps),
>>>>> or that it is very large, so storing what can easily be
>>>>> regenerated carries significant overhead (skipping it also avoids
>>>>> denormalization). Reasons to use it are convenience and the
>>>>> efficiency of retrieving the sequence without rebuilding it.
>>>>>
>>>>> Chris
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -Aaron
>>>>>>
>>>>>> --
>>>>>> Aaron J. Mackey, Ph.D.
>>>>>> Project Manager, ApiDB Bioinformatics Resource Center
>>>>>> Penn Genomics Institute, University of Pennsylvania
>>>>>> email:  am...@pc...
>>>>>> office: 215-898-1205
>>>>>> fax:    215-746-6697
>>>>>> postal: Penn Genomics Institute
>>>>>>         Goddard Labs 212
>>>>>>         415 S. University Avenue
>>>>>>         Philadelphia, PA  19104-6017
>>>>>>
>>>>>> _______________________________________________
>>>>>> Gusdev-gusdev mailing list
>>>>>> Gus...@li...
>>>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev
>>>>>>
>>>>
>>
>

--
Aaron J. Mackey, Ph.D.
Project Manager, ApiDB Bioinformatics Resource Center
Penn Genomics Institute, University of Pennsylvania
email:  am...@pc...
office: 215-898-1205
fax:    215-746-6697
postal: Penn Genomics Institute
         Goddard Labs 212
         415 S. University Avenue
         Philadelphia, PA  19104-6017