From: Andrew U. <and...@gm...> - 2007-02-14 22:01:32
|
Hi everyone... I came across a recent development that is useful if we decide to go with database storage of graphics primitives - an algorithm+data structure called Nested Containment List (NCList). abstract: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btl647v1 The goal is to optimize 1-D range queries for databases storing intervals - exactly like our queries for all primitives overlapping some tile (that is, if we remove the y-coord query, which is useless anyway - tiles should always be the height of the entire track, reducing it to a 1-D problem). The authors claim that they can improve query and database construction time by 5 to 500 times over MySQL multi-column indexing, binning, and PostgreSQL R-trees. Their benchmarks looks pretty damn good, and I wonder how it would perform on our datasets. The database update performance seems to be an unexplored issue, so this should only be applied to static datasets, which is pretty much the case with us - if our feature set ever changes, we need to re-do the layout and rebuild the primitives database completely, and layout/database building is the quick step compared to querying anyway (that is if BioPerl can handle layout for really large chromosomes). They have a C implementation and a Python module implementation, so getting this to work for us might not be instantaneous. But if their claims are correct, this could outperform in-memory storage in time. Also, memory use for in-memory storage may become a problem for really large chromosomes, but the NCList implementation seems to be focused on being RAM-efficient and disk-intensive, yet still able to be fast... I really have no idea how well a Perl program will work when it runs out of RAM and starts swapping to disk, but the NCList tests seem to show that their disk-intensive approach is still pretty fast. Of course, only benchmarks on our data will tell for sure. Oh yeah, their results in the paper are shown for the low-RAM, disk-intensive implementation in Python. There is also a RAM-intensive version that would probably be faster, but require machines with more RAM... the C implementation they have will of course be even faster, also. Their algorithm actually would be applicable to any feature database outside of what we're doing though, so take note. Andrew |