[GUSDEV] NALocation and NAFeature proposal

gus folks-

We are encountering a type of data that we haven't had to deal with yet,
and I think the best way to handle it is a change to the schema.

The change is:
   add:       na_sequence_id to NALocation
   remove:  na_sequence_id from NAFeature and all its subclasses.

In other words, the location of a feature specifies what sequence it 
belongs to, rather than the feature specifying that directly itself.  
This enables a feature to exist on more than one sequence.
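
As a rough illustration of the proposed layout, here is a minimal sketch using Python's sqlite3 module.  The tables are heavily simplified stand-ins: only na_sequence_id, NAFeature and NALocation come from the proposal, and the other column names, ids and coordinates are assumptions made up for the example.

```python
import sqlite3

# In-memory database; a simplified stand-in, not the real GUS tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NAFeature (
    na_feature_id INTEGER PRIMARY KEY,
    name          TEXT
    -- under the proposal there is no na_sequence_id column here
);
CREATE TABLE NALocation (
    na_location_id INTEGER PRIMARY KEY,
    na_feature_id  INTEGER NOT NULL REFERENCES NAFeature(na_feature_id),
    na_sequence_id INTEGER NOT NULL,  -- the proposed new column
    start_min      INTEGER,           -- assumed coordinate columns
    end_max        INTEGER,
    is_reversed    INTEGER
);
""")

# One feature, two locations: one on a scaffold (hypothetical id 100)
# and one on the chromosome (hypothetical id 900).
conn.execute("INSERT INTO NAFeature VALUES (1, 'gene_A')")
conn.executemany(
    "INSERT INTO NALocation VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 100, 201, 500, 0),
        (2, 1, 900, 10201, 10500, 0),
    ],
)

# The same feature is now attached to two sequences via its locations.
sequences_for_feature = [
    row[0]
    for row in conn.execute(
        "SELECT na_sequence_id FROM NALocation "
        "WHERE na_feature_id = 1 ORDER BY na_sequence_id"
    )
]
print(sequences_for_feature)
```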

The data we have is scaffolds and a genetic map.  We use the map to 
order and orient the scaffolds.  We also submit the scaffolds to our 
analysis pipeline, which produces features on the scaffolds.

We store the scaffolds as SequencePieces, and the chromosome as a 
VirtualSequence.

We would like our presentation layer, e.g. GBrowse, to be able to display 
the features on the chromosome as well as on the scaffolds, with 
correctly transformed locations.  This means that we have to project the 
SequencePiece features onto the VirtualSequence.

Aaron, I, and others have considered many alternative ways of doing this 
projection.  It is now clear to me that the most elegant and 
practical approach is to allow NAFeatures to have NALocations on more 
than one Sequence.  Given that schema, we can add a final analysis step 
to our pipeline that easily does the projection by creating a new set of 
NALocations that attach the NAFeatures from the SequencePieces to the 
VirtualSequence.
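
To make the projection step concrete, here is a minimal sketch of the coordinate transform, not actual pipeline code.  It assumes 1-based inclusive coordinates and that each scaffold's placement on the chromosome is known as an offset, a length and an orientation.  The Location and Placement classes and the project function are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    feature_id: int
    sequence_id: int
    start: int          # 1-based, inclusive (assumed convention)
    end: int            # 1-based, inclusive
    is_reversed: bool


@dataclass(frozen=True)
class Placement:
    piece_id: int       # the SequencePiece (scaffold)
    virtual_id: int     # the VirtualSequence (chromosome)
    offset: int         # chromosome bases that precede the piece
    length: int         # length of the piece
    is_reversed: bool   # True if the piece is placed reverse-complemented


def project(loc: Location, placement: Placement) -> Location:
    """Return a new Location for the same feature on the virtual sequence."""
    if loc.sequence_id != placement.piece_id:
        raise ValueError("location is not on the placed piece")
    if placement.is_reversed:
        # Coordinates are mirrored within the piece, so start and end swap.
        start = placement.offset + placement.length - loc.end + 1
        end = placement.offset + placement.length - loc.start + 1
    else:
        start = placement.offset + loc.start
        end = placement.offset + loc.end
    # The strand flips only when the piece itself is flipped.
    strand = loc.is_reversed != placement.is_reversed
    return Location(loc.feature_id, placement.virtual_id, start, end, strand)


on_scaffold = Location(feature_id=1, sequence_id=100, start=201, end=500,
                       is_reversed=False)
forward = Placement(piece_id=100, virtual_id=900, offset=5000, length=1000,
                    is_reversed=False)
flipped = Placement(piece_id=100, virtual_id=900, offset=5000, length=1000,
                    is_reversed=True)

print(project(on_scaffold, forward))
print(project(on_scaffold, flipped))
```

The feature row itself is never copied; only a new location is created, which is the point of moving the sequence reference onto the location.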

The downsides that I see to this approach are:
  1. a change to the schema
  2. in the case that a program wants to iterate across the features of 
a sequence without regard to their location, the query will have an 
additional join.   I think this is probably a rare case.
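
The extra join in item 2 would look roughly like the sketch below, again using sqlite3 with simplified stand-in tables.  The ids and names are made up, and features_on_sequence is a hypothetical helper.  Under the current schema the same lookup is a single-table query filtering NAFeature on its own na_sequence_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NAFeature (na_feature_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE NALocation (
    na_location_id INTEGER PRIMARY KEY,
    na_feature_id  INTEGER NOT NULL,
    na_sequence_id INTEGER NOT NULL
);
INSERT INTO NAFeature VALUES (1, 'gene_A'), (2, 'gene_B');
-- gene_A has two locations on sequence 100; both genes are also on 900.
INSERT INTO NALocation VALUES
    (1, 1, 100), (2, 1, 100), (3, 2, 200), (4, 1, 900), (5, 2, 900);
""")


def features_on_sequence(conn, na_sequence_id):
    """List the features of a sequence, ignoring where they are located."""
    rows = conn.execute(
        """
        SELECT DISTINCT f.na_feature_id, f.name
        FROM NAFeature f
        JOIN NALocation l ON l.na_feature_id = f.na_feature_id
        WHERE l.na_sequence_id = ?
        ORDER BY f.na_feature_id
        """,
        (na_sequence_id,),
    )
    return rows.fetchall()


print(features_on_sequence(conn, 100))
print(features_on_sequence(conn, 900))
```

The DISTINCT is there because a feature with several locations on the same sequence would otherwise be returned once per location.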

I would propose this as a feature enhancement to GUS Schema 3.6.

Encouragements?

Objections?

thanks,
steve