Hi
We are working on an analysis of bioinformatics tools (related to k-mer counting), and KAnalyze is one of them. We have gone through the readme file, and it is very helpful. Since this is an analysis, we want to be very sure about the details. It would be great if you could help us validate the details below.
Data structure and sorting algorithm: Array/sorting, disk based, dual-pivot quicksort (Java’s Arrays.sort())
Approach: In-Memory
The limit of k-size : Arbitrary large k-mer lengths (any ideal length)
Supports online k-mer frequency retrieval : No
Supports compressed file processing : Yes
Thanks
Tarang
Yes, your details are correct with a few caveats.
Disk based:
By default, KAnalyze fills a memory buffer (k-mer "segment"), sorts and counts it, then dumps it to disk. Then the memory buffer is cleared and the next set of k-mers is counted. Instead of dumping the buffer to disk, it can move it to another place in memory (option --nodumpseg) and re-allocate the memory buffer. This mode is not recommended unless the machine has lots of memory to work with. In Java, this is also less efficient because new arrays are zeroed. KAnalyze uses Java NIO and sophisticated buffer transfer algorithms to make moving data as efficient as possible.
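To make the sort-and-count step on one segment concrete, here is a minimal Java sketch. It is an illustration only and not KAnalyze's actual code: the class and method names are hypothetical, and it assumes DNA k-mers of at most 31 bases packed two bits per base into a long.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of the KAnalyze API.
final class SegmentCountSketch {

    /** Packs a DNA k-mer (A/C/G/T only, k <= 31) into a long, two bits per base. */
    static long encode(String kmer) {
        if (kmer.length() > 31) {
            throw new IllegalArgumentException("This sketch supports k <= 31");
        }
        long bits = 0L;
        for (int i = 0; i < kmer.length(); i++) {
            int code = "ACGT".indexOf(kmer.charAt(i));
            if (code < 0) {
                throw new IllegalArgumentException("Not a DNA base: " + kmer.charAt(i));
            }
            bits = (bits << 2) | code;
        }
        return bits;
    }

    /**
     * Sorts the first n entries of the segment in place, then collapses each
     * run of equal k-mers into one {kmer, count} row. Rows come out in
     * ascending k-mer order, ready to be written out as a sorted segment.
     */
    static long[][] sortAndCount(long[] segment, int n) {
        Arrays.sort(segment, 0, n); // dual-pivot quicksort for primitive arrays
        long[][] rows = new long[n][];
        int rowCount = 0;
        int i = 0;
        while (i < n) {
            int j = i;
            while (j < n && segment[j] == segment[i]) {
                j++;
            }
            rows[rowCount++] = new long[] {segment[i], j - i};
            i = j;
        }
        return Arrays.copyOf(rows, rowCount);
    }
}
```

In this sketch the buffer is reused after each call: the caller refills the same array with the next batch of k-mers and calls sortAndCount again, which mirrors the fill, count, dump cycle described above.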
Online k-mer frequency:
I am not sure about online k-mer frequency retrieval. The first time I saw this term was in the "khmer" paper. For a tool to read k-mer counts as they are updated, it would have to use the k-mer counter as an API, and KAnalyze is an API with facilities to run pipelined components from the command-line (the "count" and "stream" pipelines). It wouldn't take much for a programmer to load the pipeline components into their own code and do whatever they wanted, such as pushing new data into it or checking counts. I leave it up to your judgement to determine what constitutes "online" or not.
Exact vs stochastic counters:
I think it is important to make the distinction between exact and stochastic k-mer counters. Exact counters, such as Jellyfish and KAnalyze, output exact counts. Stochastic k-mer counters, such as Khmer and BFCounter, give an approximation. They do remove singletons (in-solid k-mers), and that improves efficiency, but they are also subject to biases that may or may not affect an experiment. Both are quite useful, but I think the distinction should be very clear for those using the tools and for anyone publishing benchmarks comparing tools.
KAnalyze can filter low-count k-mers so that in-solid k-mers are removed from the final output.
There is a pre-count filter hook that can be extended by defining a custom class (in a JAR file), then telling KAnalyze to load the JAR and use the pre-count filter contained in it. The only reason KAnalyze does not come with this filter built in is that I have not seen any interest in it.
Other features:
In this response, I would like to selfishly take a moment to highlight some KAnalyze features that are, to my knowledge, unique.
The merge sort KAnalyze uses requires that the output be sorted, whereas most k-mer counters give output in a random order. This makes the output easier to search and munge.
The sort order can also be arbitrarily defined, which I took advantage of to make a data structure that can query k-mers from a disk file almost as fast as it could be done by loading the whole file into memory. This work was just published with the tool Kestrel (https://doi.org/10.1093/bioinformatics/btx753). The flexibility of the k-mer counter and the ability to use its components as an API made this robust k-mer-count variant caller possible.
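As a rough picture of why merging sorted segments yields sorted output, here is a small Java sketch of a k-way merge that sums the counts of identical k-mers. It is an assumption-laden illustration, not KAnalyze's implementation: the class name, the {kmer, count} row layout, and the use of a priority queue are all hypothetical choices, and the segments are held in memory rather than read from disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration; not part of the KAnalyze API.
final class SegmentMergeSketch {

    /**
     * Merges segments whose rows are {kmer, count} sorted by ascending k-mer.
     * Equal k-mers from different segments are combined by adding their counts.
     * The input segments are not modified.
     */
    static List<long[]> merge(List<long[][]> segments) {
        // Each heap entry is {segmentIndex, rowIndex}, ordered by the k-mer at that row.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> Long.compare(segments.get(a[0])[a[1]][0],
                                   segments.get(b[0])[b[1]][0]));
        for (int s = 0; s < segments.size(); s++) {
            if (segments.get(s).length > 0) {
                heap.add(new int[] {s, 0});
            }
        }

        List<long[]> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            long[] row = segments.get(head[0])[head[1]];
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && last[0] == row[0]) {
                last[1] += row[1]; // same k-mer seen in another segment
            } else {
                merged.add(new long[] {row[0], row[1]});
            }
            // Advance within the segment the row came from.
            if (head[1] + 1 < segments.get(head[0]).length) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }
}
```

Because every segment is already sorted, the smallest remaining k-mer is always at the head of one of the segments, so the merged rows come out in order without a final sort.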
Thank you for your interest and your diligence! Good luck with your work, and please let me know if I can further clarify anything for you.
If this work is to be published, feel free to cite and/or link my response in your manuscript or supplementary material.