
#1 make it easy to handle discordant alignment hits

open
Alignment (2)
5
2006-06-30
2006-06-30
No

- When using UCSC MAF alignments, a given sequence may
actually map to multiple locations in the alignment.
It's important for users to be able to detect and
handle this case clearly and easily. Such cases have
generated a number of bug reports that all turned out
to be multiple (contradictory) mappings in the original
MAF data, NOT pygr bugs. To avoid further confusion,
we should give users an easy way to learn about this
issue and manage it appropriately.

Alex, Namshin and I brainstormed this and came up with
the following ideas:
1. The place to detect "multiple hits" is during
mapping to the LPO. If there are overlapping LPO
mappings, then we have multiple hits. Separating these
is not a completely trivial problem, but indel group-by
rules ought to handle this decently. If we record
which sequence mappings came from which LPO mappings,
then we can separate the final result mappings as well.

2. The NLMSASlice class already has split() and
regions() methods designed specifically to address the
need to provide general-purpose methods for subdividing
a set of results based on "group-by" rules. This
should obviously be applied to the "multiple hit"
problem; create an optional argument for the split()
method to subdivide the result set into individual hits
(based on LPO overlap). split() returns a list of
slice objects, just as you'd expect (which might just
be [self] if no splitting occurs).

3. The NLMSA top-level object should have optional
arguments for controlling how multiple hits are
handled. The default should be to raise a KeyError
(with a clear explanatory error string) when multiple
hits are detected. This will force users to realize
that this issue exists in the MAF data and is not
Pygr's fault. It should also tell people how to change
the default behavior.

4. The NLMSASlice constructor will have to save, for
each result, a mapping to the LPO hit it came from.
This is needed to enable LPO-based hit-separation
later during split().

5. NLMSASlice.__getitem__() should allow slicing on
both target sequence intervals (as it does currently,
by returning a Seq2SeqEdge object), and on source
sequence intervals. All that's needed is to generate
a new slice object containing just the subset
of interval mappings that overlap the specified source
sequence interval. This is trivial, but enables
"subqueries", which would be powerful. This could save
users a lot of work.
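
The ideas above can be sketched with a toy model. This is not pygr
code: LPOMapping, assign_hits, ToySlice, the multi_hit option, the
by_hit argument and subquery() are all hypothetical names, and the
greedy grouping is a naive stand-in for the indel group-by rules
mentioned in idea 1. It also assumes that "overlapping LPO mappings"
means two source-to-LPO mappings that cover the same source positions.

```python
from collections import namedtuple

# Hypothetical stand-in for one source -> LPO interval mapping (not a pygr class).
LPOMapping = namedtuple("LPOMapping", "src_start src_stop lpo_start")


def overlaps(a, b):
    """True if two mappings share at least one source position (half-open)."""
    return a.src_start < b.src_stop and b.src_start < a.src_stop


def assign_hits(mappings):
    """Naive greedy grouping (ideas 1 and 4): a mapping joins the first hit
    in which it overlaps no other member; otherwise it starts a new hit."""
    hits = []                         # each hit is a list of mappings
    hit_of = [None] * len(mappings)   # hit index for each input mapping
    order = sorted(range(len(mappings)), key=lambda i: mappings[i].lpo_start)
    for idx in order:
        m = mappings[idx]
        for h, members in enumerate(hits):
            if not any(overlaps(m, other) for other in members):
                break                 # m fits into existing hit h
        else:
            h = len(hits)             # no compatible hit: open a new one
            hits.append([])
        hits[h].append(m)
        hit_of[idx] = h
    return hits, hit_of


class ToySlice:
    """Toy model of a result slice; multi_hit is a hypothetical option."""

    def __init__(self, mappings, multi_hit="raise"):
        self.mappings = list(mappings)
        self.hits, self.hit_of = assign_hits(self.mappings)
        # Idea 3: fail loudly by default, and say how to change that.
        if multi_hit == "raise" and len(self.hits) > 1:
            raise KeyError(
                "source interval maps to %d separate alignment hits; "
                "pass multi_hit='keep' and call split(by_hit=True) "
                "to handle them individually" % len(self.hits))

    def split(self, by_hit=False):
        """Idea 2: one slice per hit, or [self] if no splitting occurs."""
        if not by_hit or len(self.hits) <= 1:
            return [self]
        return [ToySlice(members, multi_hit="keep") for members in self.hits]

    def subquery(self, start, stop):
        """Idea 5: keep only mappings overlapping source interval [start, stop)."""
        kept = [m for m in self.mappings
                if m.src_start < stop and start < m.src_stop]
        return ToySlice(kept, multi_hit="keep")
```

Given three mappings where the third covers source positions already
covered by the first two, ToySlice(ms) raises KeyError, while
ToySlice(ms, multi_hit="keep").split(by_hit=True) returns two slices,
one per hit.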
