Eric Boyden - 2024-06-12

I was looking for a Java-based global short-read aligner, and this software performs excellently, with lots of neat features that many other aligners lack. In descending order of priority, here are some ideas for improvement:
* Update SAM spec to v1.6 (currently 1.3/1.4), primarily to produce an updated @HD line indicating query-grouped output (GO:query). Many downstream tools (e.g. some fgbio tools) rely on this and won't work unless it is explicitly specified, and re-sorting just to add this annotation is time-consuming. (Also, according to the SAM spec, =/X CIGAR strings were added with v1.3, not v1.4: https://samtools.github.io/hts-specs/SAMv1.pdf. This option probably shouldn't be tied to the SAM spec version at all and should be renamed.)
* killbadpairs and requirecorrectstrand should prevent paired reads from mapping to different chromosomes/contigs; currently such reads are passed through since they're technically not on "opposite" strands.
* Add functionality to move FASTQ read comments (everything after the first whitespace) into a SAM tag, possibly as an alternative/add-on to trimreaddescriptions. This feature is offered by several other aligners, including BWA, Bowtie2, minimap2, and SNAP.
* Add support for uBAM input, so that realigning BAMs doesn't require reverting them back to FASTQs (which may also be complicated to do without losing uBAM metadata, although adding the ability to move FASTQ comments to SAM tags will help).
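To make the FASTQ comment request concrete, here is a rough Python sketch of the behavior I have in mind. The function names are made up for illustration, and the choice of the CO tag (the SAM "free-text comments" tag) is my assumption; an aligner could just as well pass through comments that are already formatted as SAM tags.

```python
def split_fastq_header(header_line):
    """Split a FASTQ header line into (read_name, comment).

    The comment is everything after the first run of whitespace.
    """
    text = header_line.rstrip("\r\n")
    if text.startswith("@"):
        text = text[1:]  # drop the FASTQ record marker
    parts = text.split(None, 1)  # split once, at the first whitespace run
    name = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return name, comment


def comment_to_sam_tag(comment, tag="CO"):
    """Format a comment as a SAM optional field of type Z.

    Returns None when there is no comment to carry over.
    The default tag name is an assumption for illustration.
    """
    if not comment:
        return None
    # Tabs delimit SAM fields, so they cannot appear inside a Z value.
    safe = comment.replace("\t", " ")
    return f"{tag}:Z:{safe}"


name, comment = split_fastq_header("@read1 1:N:0:ACGT\n")
print(name, comment_to_sam_tag(comment))
```

With something like this, the read name stays clean (as with trimreaddescriptions) while the comment survives in the alignment record.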
Also FWIW, the default minid=0.76 seems to be a bit too sensitive for human PE150 data, and allows a fairly high rate of nonspecific alignments of junk reads in NTCs, at least with our data. Raising this value to 0.85 cleaned everything up, with output similar to that of other common aligners (BWA, Bowtie2). Unsurprisingly, this error rate is also more consistent with Bowtie2's default error tolerance in global mode.
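For a rough sense of what these thresholds mean on a 150 bp read, here is a back-of-the-envelope Python calculation. It assumes identity is simply matching bases divided by read length and ignores indels, which is a simplification and not necessarily how the aligner computes identity internally; `max_mismatches` is a made-up helper name.

```python
import math


def max_mismatches(read_length, min_identity):
    """Approximate mismatch budget for a read at a given minimum identity.

    Assumes identity = matching bases / read length (no indels).
    The small epsilon guards against floating-point rounding just
    below a whole number.
    """
    return math.floor(read_length * (1.0 - min_identity) + 1e-9)


for minid in (0.76, 0.85):
    print(minid, max_mismatches(150, minid))
```

Under that simplification, minid=0.76 tolerates up to 36 mismatches per 150 bp read, while minid=0.85 tolerates up to 22.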
Thanks!