Current Overview
(Note: this is a prototype, and full usage scenarios will be released after publication.)
The overall aim is to pseudo-map “many” read files, which may represent different species, timepoints, relationships (unrelated samples), compartments, or other related or unrelated factors, to multiple pre-defined templates and to “quickly” generate a table of hits for each reference relative to each input read file. This table can then be used as a starting point for further exploration or network/gene assembly.
This can be useful, for example, for searching for viral DNA within the vast quantities of sequence data generated globally across unrelated studies. If significant hits are identified, dataset-specific haplotypes can subsequently be reconstructed from kmer data. This would have direct relevance to research areas ranging from the monitoring of emerging viruses (of both medical and veterinary importance) to metagenomics. At a more bioinformatics service level, the tool could be used for rapidly quantifying contamination during sequencing, where the references would be a set of contaminants of concern.
The term pseudo-mapping implies that, rather than generating exact mappings to templates, a more efficient (in terms of memory and time) kmer characterization is performed and abundance scores are generated. Benchmarking should be performed against tools like kallisto and traditional mappers to quantify accuracy, speed and memory requirements.
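To make the idea of kmer-based pseudo-mapping more concrete, the following is a minimal Python sketch, not the tool's actual implementation. It assumes that an abundance score is simply the number of read kmers also present in a template, and it ignores reverse complements, sequencing errors and any normalization. All names (kmers, build_index, score_reads, hit_table, virus_A, sample_1, and so on) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping kmer of length k from seq (uppercased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(templates, k):
    """Map each kmer to the set of template names that contain it."""
    index = defaultdict(set)
    for name, seq in templates.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def score_reads(reads, index, k):
    """Count, per template, how many read kmers occur in that template."""
    hits = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            # .get avoids creating empty entries in the defaultdict
            for name in index.get(kmer, ()):
                hits[name] += 1
    return hits


def hit_table(read_sets, templates, k):
    """Build {read_file: {template: hit_count}} over all inputs."""
    index = build_index(templates, k)  # built once, reused for every read file
    table = {}
    for sample, reads in read_sets.items():
        hits = score_reads(reads, index, k)
        table[sample] = {name: hits.get(name, 0) for name in templates}
    return table


# Toy data standing in for reference templates and read files (assumed values).
templates = {"virus_A": "ACGTACGTGG", "virus_B": "TTTTGGGGCC"}
read_sets = {
    "sample_1": ["ACGTACG", "GGGGCC"],
    "sample_2": ["CCCCCCC"],
}

table = hit_table(read_sets, templates, k=5)
for sample, row in table.items():
    print(sample, row)
```

With this toy input, sample_1 scores 3 against virus_A and 2 against virus_B, while sample_2 scores 0 against both. Because the template index is built once and each read is only looked up kmer by kmer, no alignment is computed, which is where the expected savings in time and memory over exact mapping come from.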