From: Steve F. <sfi...@pc...> - 2005-07-17 20:37:01
GUS folks,

We are encountering a type of data that we haven't had to deal with yet, and I think the best way to handle it is a change to the schema.

The change is:
  add: na_sequence_id to NALocation
  remove: na_sequence_id from NAFeature and all its subclasses

In other words, the location of a feature specifies what sequence it belongs to, rather than the feature specifying that directly itself. This enables a feature to exist on more than one sequence.

The data we have is scaffolds and a genetic map. We use the map to order and orient the scaffolds. We also submit the scaffolds to our analysis pipeline, which produces features on the scaffolds.

We store the scaffolds as SequencePieces, and the chromosome as a VirtualSequence. We would like our presentation layer, e.g. GBrowse, to be able to display the features on the chromosome as well as on the scaffolds, with correctly transformed locations. This means that we have to project the SequencePiece features onto the VirtualSequence.

We have considered many alternative ways of doing this projection (Aaron and I and others). It is now clear to me that the most elegant and practical approach is to allow NAFeatures to have NALocations on more than one sequence.

Given that schema, we can add a final analysis step to our pipeline that easily does the projection by creating a new set of NALocations that attach the NAFeatures from the SequencePieces to the VirtualSequence.

The downsides that I see to this approach are:
  1. It is a change to the schema.
  2. If a program wants to iterate across the features of a sequence without regard to their location, the query will need an additional join (NAFeature to NALocation). I think this is probably a rare case.

I would propose this as a feature enhancement to GUS Schema 3.6.

Encouragements? Objections?

thanks,
steve
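
For illustration only, here is a rough Python sketch of what the projection step described above might look like under the proposed schema. It is not GUS code. The Placement record, the project function, and the reduced NALocation fields are hypothetical names made up for this example, and it assumes 1-based inclusive coordinates and that the genetic map yields one offset and one orientation per scaffold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical record: where one scaffold sits on the chromosome."""
    piece_id: int      # na_sequence_id of the SequencePiece (scaffold)
    virtual_id: int    # na_sequence_id of the VirtualSequence (chromosome)
    offset: int        # chromosome coordinate of the scaffold's leftmost base
    length: int        # scaffold length in bases
    is_reversed: bool  # True if the scaffold is placed reverse-complemented


@dataclass(frozen=True)
class NALocation:
    """Simplified stand-in for a location row that carries na_sequence_id."""
    na_feature_id: int
    na_sequence_id: int
    start: int         # assumed 1-based, inclusive
    end: int           # assumed 1-based, inclusive
    is_reversed: bool


def project(loc: NALocation, placement: Placement) -> NALocation:
    """Build a new location for the same feature on the virtual sequence."""
    if loc.na_sequence_id != placement.piece_id:
        raise ValueError("location is not on the placed scaffold")
    if placement.is_reversed:
        # Scaffold base p maps to chromosome base offset + length - p,
        # so the start and end swap roles.
        start = placement.offset + placement.length - loc.end
        end = placement.offset + placement.length - loc.start
    else:
        # Scaffold base p maps to chromosome base offset + p - 1.
        start = placement.offset + loc.start - 1
        end = placement.offset + loc.end - 1
    # The feature's strand flips only when the scaffold itself is flipped.
    strand = loc.is_reversed != placement.is_reversed
    return NALocation(loc.na_feature_id, placement.virtual_id, start, end, strand)


# One feature on a 1000-base scaffold that occupies chromosome bases 5001..6000.
on_scaffold = NALocation(na_feature_id=7, na_sequence_id=1,
                         start=101, end=200, is_reversed=False)
forward = project(on_scaffold, Placement(1, 99, 5001, 1000, False))
flipped = project(on_scaffold, Placement(1, 99, 5001, 1000, True))
```

The original scaffold location is left untouched. The feature simply gains a second NALocation row whose na_sequence_id points at the VirtualSequence, which is the point of moving na_sequence_id onto NALocation.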