Hi everyone...
I came across a recent development that could be useful if we decide to
go with database storage of graphics primitives: an algorithm and data
structure called the Nested Containment List (NCList).
abstract:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btl647v1
The goal is to optimize 1-D range queries for databases storing
intervals - exactly the kind of query we do for all primitives
overlapping some tile (provided we drop the y-coord part of the query,
which is useless anyway - tiles should always be the height of the
entire track, which reduces it to a 1-D problem).
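To make the idea concrete, here is a rough Python sketch of the nested
containment idea (my own toy in-memory version, NOT their module's API):
intervals are sorted so that a containing interval always precedes the
intervals it contains, contained intervals are pushed into the
container's sublist, and a query binary-searches each list level.

    from bisect import bisect_right

    def build_nclist(intervals):
        # Sort by start ascending, end descending, so any containing
        # interval comes before the intervals it contains.
        ivs = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
        top, stack = [], []
        for start, end in ivs:
            node = [start, end, []]   # [start, end, sublist]
            # Pop open intervals that do not contain this one.
            while stack and stack[-1][1] < end:
                stack.pop()
            (stack[-1][2] if stack else top).append(node)
            stack.append(node)
        return top

    def query(nclist, qstart, qend, hits=None):
        # Collect all intervals overlapping [qstart, qend).
        if hits is None:
            hits = []
        # Within one list level no interval contains another, so both
        # starts and ends are sorted; binary search on ends finds the
        # first candidate, then scan forward while starts are < qend.
        ends = [node[1] for node in nclist]
        i = bisect_right(ends, qstart)
        while i < len(nclist) and nclist[i][0] < qend:
            hits.append((nclist[i][0], nclist[i][1]))
            query(nclist[i][2], qstart, qend, hits)  # descend sublist
            i += 1
        return hits

A tile query would just pass the tile's x-range as qstart/qend:

    nc = build_nclist([(0, 100), (10, 20), (15, 30), (50, 60), (200, 300)])
    print(query(nc, 25, 55))   # -> [(0, 100), (15, 30), (50, 60)]

The real NCList keeps these lists on disk and binary-searches them
there, which is where the RAM savings come from - this sketch keeps
everything in memory just to show the structure.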
The authors claim that they can improve query and database
construction time by 5 to 500 times over MySQL multi-column indexing,
binning, and PostgreSQL R-trees. Their benchmarks look pretty damn
good, and I wonder how it would perform on our datasets.
Database update performance seems to be an unexplored issue, so this
should only be applied to static datasets - which is pretty much the
case for us: if our feature set ever changes, we'd need to redo the
layout and rebuild the primitives database completely anyway, and
layout/database building is the quick step compared to querying
(assuming BioPerl can handle layout for really large chromosomes).
They have a C implementation and a Python module, so getting this to
work for us might not be instantaneous. But if their claims are
correct, this could outperform even in-memory storage on query time.
Also, memory use for in-memory storage may become a problem for really
large chromosomes, whereas the NCList implementation seems designed to
be RAM-efficient and disk-intensive while still staying fast... I
really have no idea how well a Perl program behaves once it runs out of
RAM and starts swapping to disk, but the NCList tests seem to show that
their disk-intensive approach is still pretty fast.
Of course, only benchmarks on our data will tell for sure.
Oh yeah, the results in the paper are for the low-RAM, disk-intensive
Python implementation. There is also a RAM-intensive version that would
probably be faster but would require machines with more RAM... and
their C implementation should of course be faster still.
Their algorithm would actually be applicable to any feature database
beyond what we're doing, though, so take note.
Andrew