From: Robert D. <rm...@sa...> - 2024-04-15 17:19:09
Samtools (and HTSlib and BCFtools) version 1.20 is now available from
GitHub and SourceForge.

https://github.com/samtools/htslib/releases/tag/1.20
https://github.com/samtools/samtools/releases/tag/1.20
https://github.com/samtools/bcftools/releases/tag/1.20
https://sourceforge.net/projects/samtools/

The main changes are listed below:

------------------------------------------------------------------------------
htslib - changes v1.20
------------------------------------------------------------------------------

Updates
-------

* When working on named files, bgzip now sets the modified and access
  times of the output files it makes to match those of the corresponding
  input.
  (PR #1727, feature request #1718. Requested by Gert Hulselmans)

* It's now possible to use a -o option to specify the output file name
  in bgzip.
  (PR #1747, feature request #1726. Requested by Gert Hulselmans)

* Improved faidx error messages. (PR #1743, thanks to Nick Moore)

* Faster reading of SAM array (type "B") tags. These often turn up in
  ONT and PacBio data. (PR #1741)

* Improved validity checking of base modification tags. (PR #1749)

* mpileup overlap removal now works where one read has a deletion.
  (PR #1751, fixes samtools/samtools#1992. Reported by Long Tian)

* The S3 plugin can now find buckets via S3 access point aliases.
  (PR #1756, thanks to Matt Pawelczyk; fixes samtools/samtools#1984.
  Reported by Albert Li)

* Added a --threads option (and -@ short option) to tabix.
  (PR #1755, feature request #1735. Requested by Dan Bolser)

* tabix can now index Graph Alignment Format (GAF) files.
  (See https://github.com/lh3/gfatools/blob/master/doc/rGFA.md)
  (PR #1763, thanks to Adam Novak)

Bug fixes
---------

* Security fix: Prevent possible heap overflow in cram_encode_aux() on
  bad RG:Z tags. (PR #1737)

* Security fix: Prevent attempts to call a NULL pointer if certain URL
  schemes are used in CRAM @SQ UR: tags.
  (PR #1757)

* Security fix: Fixed a bug where following certain AWS S3 redirects
  could downgrade the connection from TLS (i.e. https://) to unencrypted
  http://. This could happen when using path-based URLs and
  AWS_DEFAULT_REGION was set to a region other than the one where the
  data was stored.
  (PR #1762, fixes #1760. Reported by andaca)

* Fixed arithmetic overflow when loading very long references for CRAM.
  (PR #1738, fixes #1738. Reported by Shane McCarthy)

* Fixed faidx and CRAM reference look-ups on compressed fasta where the
  .fai index file was present, but the .gzi index of compressed offsets
  was not.
  (PR #1745, fixes #1744. Reported by Theodore Li)

* Fixed BCF indexing on-the-fly bug which produced invalid indexes when
  using multiple compression threads.
  (PR #1742, fixes #1740. Reported by graphenn)

* Ensure that pileup destructors are called by bam_plp_destroy(), to
  prevent memory leaks. (PR #1749, PR #1754)

* Ensure on-the-fly index timestamps are always older than the data
  file. Previously the files could be closed out of order, leading to
  warnings being printed when using the index.
  (PR #1753, fixes #1732. Reported by Gert Hulselmans)

* To prevent data corruption when reading (strictly invalid) VCF files
  with duplicated FORMAT tags, all but the first copy of the data
  associated with the tag are now dropped with a warning.
  (PR #1752, PR #1761, fixes #1733. Reported by anthakki)

* Fixed a bug introduced in release 1.19 (PR #1689) which broke variant
  record data if it tried to remove an over-long tag.
  (PR #1752, PR #1761)

* Changed error to warning when complaining about use of the CG tag in
  SAM or CRAM files. (PR #1758, fixes samtools/samtools#2002)

------------------------------------------------------------------------------
samtools - changes v1.20
------------------------------------------------------------------------------

* Added a `--max-depth` option to `bedcov`, for more control over the
  depth limit used when calculating the pileup.
  Previously this limit was set at 64000; now it is set to over
  2 billion, so effectively all bases will be counted.
  (PR #1970, fixes #1950. Reported by ellisjj)

* Added `mpileup --output-extra RLEN` to display the unclipped read
  length.
  (PR #1971, feature request #1959. Requested by Feng Tian)

* Improved checking of symbolic flag names (e.g. UNMAP) passed to
  samtools. (PR #1981, fixes #1977. Reported by Ilya Shlyakhter)

* The `samtools consensus --min-depth` option now works for the Bayesian
  mode as well as the simple one.
  (PR #1989, feature request #1982. Requested by Gautier Richard)

* It's now possible to use the `samtools fastq` `-d tag:val` option
  multiple times, allowing matches on more than one tag/value. It also
  gets a `-D` option which allows the values to be listed in a file.
  (PR #1993, feature request #1958. Requested by Tristan Lefebure)

* Added `samtools fixmate` `-M` option to sanity check base modification
  (`ML`, `MM`, `MN`) tags, and where necessary adjust modification data
  on hard-clipped records. (PR #1990)

* Made `mpileup` run faster. (PR #1995)

* `samtools import` now adds a `@PG` header to the files it makes. As
  with other sub-commands, this can be disabled by using `--no-PG`.
  (PR #2008. Requested by Steven Leonard)

* The `samtools split` `-d` option to split by tag value now works on
  tags with integer values.
  (PR #2005, feature request #1956. Requested by Alex Leonard)

* Adjusted `samtools sort -n` (by name) so that primary reads are always
  sorted before secondary / supplementary.
  (PR #2012, feature request #2010. Requested by Stijn van Dongen)

* Added `samtools bedcov` `-H` option to print column headers in the
  output. (PR #2025. Thanks to Dr. K. D. Murray)

Documentation:

* Added a note that BAQ is applied before filtering and overlap removal
  during mpileup processing.
  (PR #1988, fixes #1985. Reported by Joseph Galasso)

* Added 3.1 to the list of supported CRAM versions in the samtools
  manual page. (PR #2009.
  Thanks to Andrew Thrasher)

* Made assorted improvements to ampliconclip, flagstat and markdup
  manual pages. (PR #2014)

Bug Fixes:

* Security fix: Fixed double free that could occur if bed file indexing
  failed due to running out of memory. This bug first appeared in
  version 1.19.1. (PR #2026)

* Corrected error message printed when faidx fails to load the fai
  index. (PR #1987. Thanks to Nick Moore)

* Fixed bug introduced in release 1.4 that caused incorrect reference
  bases to be printed by `samtools mpileup -a -f ref.fa` in the
  zero-depth regions at the end of each reference.
  (PR #2019, fixes #2018. Reported by Joe Georgeson)

* Fixed a samtools view usage crash on MinGW when given invalid options.
  (PR #2030, fixes #2029. Reported by Divon Lan)

Non user-visible changes and build improvements:

* Added tests to ensure that CRAM compression is working properly.
  (PR #1969, part of fix for #1968. Reported by Clockris)

------------------------------------------------------------------------------
bcftools - changes v1.20
------------------------------------------------------------------------------

Changes affecting the whole of bcftools, or multiple commands:

* Add short option -W for --write-index. The option now accepts an
  optional parameter which selects between the TBI and CSI index
  formats.

Changes affecting specific commands:

* bcftools consensus

    - Add new --regions-overlap option which takes into account
      overlapping deletions that start outside the fasta file target
      region.

* bcftools isec

    - Add new option `-l, --file-list` to read the list of file names
      from a file

* bcftools merge

    - Add new option `--force-single` to support single-file edge case
      (#2100)

* bcftools mpileup

    - Add new option --indels-cns for an alternative indel calling
      model, which should increase the speed on long read data (thanks
      to using edlib) and the precision (thanks to a number of
      heuristics).

* bcftools norm

    - Change the order of atomization and multiallelic splitting (when
      both -a,-m are given) from "atomize first, then split" to "split
      first, then atomize". This usually results in a simpler VCF
      representation. The previous behaviour can be achieved by
      explicitly streaming the output of the --atomize command into the
      --multiallelics splitting command.

    - Fix Type=String multiallelic splitting for Number=A,R,G tags with
      an incorrect number of values.

    - Merging into multiallelic sites with `bcftools norm -m +indels`
      did not work. This is now fixed and the merging is now more strict
      about variant types, for example complex events, such as AC>TGA,
      are not considered as indels anymore (#2084)

* bcftools reheader

    - Allow reading the input file from a stream with --fai (#2088)

* bcftools +setGT

    - Support for custom genotypes based on the allele with higher
      depth, such as `--new-gt c:0/X` custom genotypes (#2065)

* bcftools +split-vep

    - When only one of the tags is present, automatically choose
      INFO/BCSQ (the default tag name produced by `bcftools csq`) or
      INFO/CSQ (produced by VEP). When both tags are present, use the
      default INFO/CSQ.

    - Transcript selection by MANE, PICK, and user-defined transcripts,
      for example
        --select CANONICAL=YES
        --select MANE_SELECT!=""
        --select PolyPhen~probably_damaging

    - Select all matching transcripts via --select, not just one

    - Change automatic type parsing of VEP fields cDNA_position,
      CDS_position, and Protein_position from Integer to String, as it
      can be of the form "8586-8599/9231". The type Integer can still be
      enforced with
      `-c cDNA_position:int,CDS_position:int,Protein_position:int`.

    - Recognize `-c field:str`, not just `-c field:string`, as
      advertised in the usage page

    - Fix a bug which made filtering expressions containing missing
      values crash (#2098)

* bcftools stats

    - When GT is missing but AD is present, the program determines the
      alternate allele from AD.
      However, if the AD tag has an incorrect number of values, the
      program would exit with an error printing "Requested allele
      outside valid range". This is now fixed by taking into account
      the actual number of ALT alleles.

* bcftools +tag2tag

    - Support for conversion from tags using localized alleles (e.g.
      LPL, LAD) to the family of standard tags (PL, AD)

* bcftools +trio-dnm2

    - Extend --strictly-novel to exclude cases where the non-Mendelian
      allele is the reference allele. The change is motivated by the
      observation that this class of variants is enriched for errors
      (especially for indels), and better corresponds with the option
      name.

--
The Wellcome Sanger Institute is operated by Genome Research Limited, a
charity registered in England with number 1021457 and a company
registered in England with number 2742969, whose registered office is
Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA.