Re: [GUSDEV] using VirtualSequence for scaffolding assemblies (not EST assemblies!)

Got it, thanks.  Something else to keep in mind while we
implement central dogma handling: two instances of a Gene may in
fact be the same physical instance represented in two coordinate
spaces.  Note that we will ultimately have at least three coordinate
spaces (contig <-> scaffold <-> chromosome), and possibly central
dogma related coordinate mappings (protein <-> mRNA <-> DNA).

-Aaron
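
To make the chained coordinate spaces concrete, here is a minimal sketch of mapping a span upward through contig -> scaffold -> chromosome. It is an illustration only, not GUS or BioPerl code: the `Piece` class, its field names, and the `map_up` and `map_chain` helpers are hypothetical, and it assumes 1-based inclusive coordinates with a strand of +1 or -1.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Piece:
    """Placement of one child sequence on its parent.

    Hypothetical stand-in for a sequence-piece record; the field names
    are assumptions, not the real GUS column names.
    """
    child_id: str
    parent_id: str
    parent_start: int      # 1-based position on the parent where the piece begins
    length: int            # length of the child sequence
    reverse: bool = False  # True if the child lies on the parent's minus strand


def map_up(piece: Piece, start: int, end: int, strand: int) -> Tuple[int, int, int]:
    """Map a 1-based inclusive span on the child into parent coordinates."""
    if not (1 <= start <= end <= piece.length):
        raise ValueError("span falls outside the piece")
    if piece.reverse:
        # Child position p lands at parent_start + length - p, so the
        # span's ends swap and the strand flips.
        return (piece.parent_start + piece.length - end,
                piece.parent_start + piece.length - start,
                -strand)
    offset = piece.parent_start - 1
    return (start + offset, end + offset, strand)


def map_chain(pieces: Iterable[Piece], start: int, end: int, strand: int) -> Tuple[int, int, int]:
    """Apply map_up once per level, innermost coordinate space first."""
    for piece in pieces:
        start, end, strand = map_up(piece, start, end, strand)
    return (start, end, strand)


# A contig placed forward on a scaffold, and the scaffold placed in
# reverse orientation on a chromosome.
contig = Piece("contig1", "scaffold1", parent_start=101, length=200)
scaffold = Piece("scaffold1", "chr1", parent_start=1001, length=500, reverse=True)

print(map_chain([contig, scaffold], 10, 20, 1))  # (1381, 1391, -1)
```

Because each level is a separate step, a change to the scaffold-to-chromosome placement only requires re-running the last step; the stored contig and scaffold coordinates stay untouched.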

On Jul 14, 2005, at 6:13 PM, Chris Stoeckert wrote:

> No, these are different features because they are spans on  
> different sequences (one scaffold and one virtual) so you won't get  
> two locations based on this for the same na_feature_id. NAFeature  
> has the na_sequence_id which tells you whether it is the scaffold  
> or virtual sequence. If these are Gene, RNA, or Protein features  
> then you can say that they are the same conceptual feature through  
> the central dogma and instance tables. If they are features like  
> Exon, then you could infer this as you say by parent_id, source_id,  
> etc.
>
> Chris
>
> On Jul 14, 2005, at 5:52 PM, Aaron J. Mackey wrote:
>
>> Exactly.  No logic is required, because we simply copy any and all  
>> NALocation objects attached to the sequences and generate new  
>> NALocation objects that point to the virtual sequence, with new  
>> coordinate/strand, but all other foreign keys remain the same  
>> (i.e. children of the same feature).
>>
>> Hmm, that means that if you blindly pull locations for a given  
>> feature, you will get two locations, not just one (so you'll need  
>> to specify which reference sequence you wish to obtain the  
>> location on).
>>
>> -Aaron
>>
>> On Jul 14, 2005, at 5:41 PM, Chris Stoeckert wrote:
>>
>>> Let's see if I understand your proposal. Generate features and  
>>> locations based on the static scaffold sequence coordinates. Then  
>>> at the end of the pipeline generate the same (conceptual)  
>>> features with locations based on the virtual sequence  
>>> coordinates. That makes sense to me. The advantage is that you  
>>> have both, one that is stable (scaffold) and one that can be  
>>> regenerated as needed (virtual) but stored for convenience. I  
>>> don't really see a disadvantage - sure it's twice as many rows  
>>> but if you materialize a view you are adding these anyway.
>>>
>>> Chris
>>>
>>> On Jul 14, 2005, at 3:50 PM, Aaron J. Mackey wrote:
>>>
>>>> As we struggle to use GUS the "right way", this is throwing us  
>>>> for a loop.  On the one hand, our GUS client applications want  
>>>> to see features in the coordinate system of the assembly (i.e.  
>>>> the virtual sequence) -- on the other hand, it makes sense from  
>>>> a data integrity viewpoint to only load/store feature  
>>>> coordinates with respect to the static underlying scaffold  
>>>> coordinates, since the scaffold-to-chromosome mapping (as  
>>>> defined by DoTS.SequencePiece) may change over time.
>>>>
>>>> One option is to instantiate a read-only materialized view of  
>>>> the NALocation for clients to use.
>>>>
>>>> A second option (which we've just discussed, and people seem to  
>>>> like) is for the InsertVirtualSequenceFromMapping plugin we just  
>>>> wrote to (re)generate duplicate versions of all NALocations  
>>>> attached to a given SequencePiece in the new coordinate system  
>>>> (requiring the virtual sequence building to be the last step in  
>>>> our pipeline, instead of the first).
>>>>
>>>> -Aaron
>>>>
>>>> On Jul 14, 2005, at 2:53 PM, Chris Stoeckert wrote:
>>>>
>>>>> Hi Aaron,
>>>>> I don't have a strong argument either way. In terms of
>>>>> coordinate mapping utilities, I'm not aware of one, so I would
>>>>> certainly welcome yours (but if others know of any, please speak up).
>>>>>
>>>>> Chris
>>>>>
>>>>> On Jul 14, 2005, at 11:13 AM, Aaron J. Mackey wrote:
>>>>>
>>>>>> Thanks Chris, I got it.
>>>>>>
>>>>>> If we are going to start hanging features off these, should we  
>>>>>> hang them off the virtual chromosome sequence entries, or the  
>>>>>> scaffold entries in externalnasequence?  Would it make sense  
>>>>>> to "codify" this usage with associated PL/SQL code to
>>>>>> reconstruct virtual sequence and associated features in the  
>>>>>> virtual coordinate space?  I guess one way to do this would be  
>>>>>> to have Virtual*Feature read-only views (and thus target  
>>>>>> everything to the "real" coordinate system such that future  
>>>>>> rebuilds of the virtual sequence would not require  
>>>>>> recalculation of feature locations)?
>>>>>>
>>>>>> Relatedly, is there coordinate mapping code already in some  
>>>>>> GUS utility module (if not, I'm happy to contribute mine,  
>>>>>> based on BioPerl's powerful Bio::Coordinate::Map framework)?
>>>>>>
>>>>>> -Aaron
>>>>>>
>>>>>> On Jul 14, 2005, at 11:05 AM, Chris Stoeckert wrote:
>>>>>>
>>>>>>> Hi Aaron,
>>>>>>>
>>>>>>>> 1) VirtualSequence has a required sequence_version attribute  
>>>>>>>> - what is this for?  Is this redundant to  
>>>>>>>> external_database_release_id?
>>>>>>>>
>>>>>>> This is a superclass attribute inherited by all NASequence  
>>>>>>> views. My recollection is that individual GenBank sequence  
>>>>>>> entries have version tags at the end of accessions as in
>>>>>>> "DQ094190.1" for Toxoplasma gondii ATP-binding cassette  
>>>>>>> protein subfamily B member 3 (found in VERSION field).
>>>>>>>
>>>>>>>> 2) VirtualSequence has a clob for storing the assembled  
>>>>>>>> sequence (I suspect), but the Perl object layer doesn't use  
>>>>>>>> this slot, instead rebuilding the sequence from the sequence  
>>>>>>>> pieces.  Am I correct in this usage, or should I not, in  
>>>>>>>> fact, be storing the assembled sequence in VirtualSequence?
>>>>>>>>
>>>>>>> Again this is a superclass attribute. I think using it is  
>>>>>>> optional. Reasons not to use it are that the virtual sequence  
>>>>>>> is hard to represent as a single entity (e.g., contains gaps)  
>>>>>>> or is very large and has a significant overhead cost of  
>>>>>>> storing what can be easily regenerated (skipping it also avoids
>>>>>>> denormalization). Reasons to use it are convenience and
>>>>>>> efficiency of retrieving the sequence without the need to  
>>>>>>> rebuild.
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -Aaron
>>>>>>>>
>>>>>>>> --
>>>>>>>> Aaron J. Mackey, Ph.D.
>>>>>>>> Project Manager, ApiDB Bioinformatics Resource Center
>>>>>>>> Penn Genomics Institute, University of Pennsylvania
>>>>>>>> email:  am...@pc...
>>>>>>>> office: 215-898-1205
>>>>>>>> fax:    215-746-6697
>>>>>>>> postal: Penn Genomics Institute
>>>>>>>>         Goddard Labs 212
>>>>>>>>         415 S. University Avenue
>>>>>>>>         Philadelphia, PA  19104-6017
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Gusdev-gusdev mailing list
>>>>>>>> Gus...@li...
>>>>>>>> https://lists.sourceforge.net/lists/listinfo/gusdev-gusdev

--
Aaron J. Mackey, Ph.D.
Project Manager, ApiDB Bioinformatics Resource Center
Penn Genomics Institute, University of Pennsylvania
email:  am...@pc...
office: 215-898-1205
fax:    215-746-6697
postal: Penn Genomics Institute
         Goddard Labs 212
         415 S. University Avenue
         Philadelphia, PA  19104-6017