From: Robert D. <rm...@sa...> - 2025-12-16 15:57:21
Samtools (and HTSlib and BCFtools) version 1.23 is now available from
GitHub and SourceForge.

https://github.com/samtools/htslib/releases/tag/1.23
https://github.com/samtools/samtools/releases/tag/1.23
https://github.com/samtools/bcftools/releases/tag/1.23
https://sourceforge.net/projects/samtools/

The main changes are listed below:

------------------------------------------------------------------------------
htslib - changes v1.23
------------------------------------------------------------------------------

Updates
-------

* HTSlib 1.22 changed the VCF reader so that it stored GT prefixed phasing
  information, but only for files specifying `fileformat=VCFv4.4` or higher.
  This caused problems when merging files with different versions, so the
  VCF reader will now store prefixed phasing information irrespective of the
  VCF version listed in the file headers.

  For files up to VCFv4.3, the first phasing bit will be set if all other
  alleles are phased, and cleared otherwise (following the rules for VCFv4.4
  onwards where no explicit phasing symbol is present). This will also
  happen when reading BCF.

  When accessing GT data, it is no longer safe to assume that the phasing is
  set to zero even if the file reports a version earlier than VCFv4.4.
  Interfaces such as `bcf_gt_allele()` should always be used to access GT
  allele data.

  For compatibility, prefixed phasing will be stripped when writing VCF
  files with version 4.3 or earlier.
  (PR #1938, fixes #1932)

* Add support for VCFv4.4 / VCFv4.5 "Number=" fields.
  (PR #1874)

* Consolidate and simplify SAM header parsing. This considerably speeds up
  parsing files with many SQ lines.
  (PR #1947. PR #1953 fixes oss-fuzz issues 444492071, 444492076, 444547724,
  444490034, PR #1977)

* Switch from strtol to hts_str2uint in mod parsing for speed increase.
  (PR #1957. Thanks to Chris Wright)

* Add UMI support to FASTQ input and output. See samtools/samtools#2270.
  (PR #1960, fixes samtools/samtools#2259.
  Requested by Poshi)

* Removed direct access to htsFile struct members in some sample functions.
  (PR #1963, fixes #1961. Reported by John Marshall)

* Improved operation of filters that work with header data. Filter
  expressions set as an `HTS_OPT_FILTER` on a BAM or CRAM iterator failed to
  return records matching on `rname`, `mrname`, `rnext` or `library`.
  (PR #1959)

* Add Type to the INFO/FORMAT sanity check. This produces a warning on
  incorrect Type usage.
  (PR #1967, fixes #1937 and samtools/bcftools#2431. Reported by
  Jukka Matilainen)

* S3 reading code now reads in `chunks` to limit the amount of data read
  (and therefore egress costs) from the object store when doing a range
  request. Also this combines the reading, writing and authorisation code
  into a single file.
  (PR #1958, fixes #1670. Reported by Stephan Drukewitz)

Build Changes
-------------

* Change optimisation for -fsanitize=address,undefined test build to counter
  slow build and high compiler memory use.
  (PR #1924)

* Fix compilation failure on MacOS X 10.9 (and likely other very old
  platforms).
  (PR #1945, fixes #1941. Reported by Ryan Carsten Schmidt)

* Fix htslib.map update due to recent change in nm behaviour.
  (PR #1975, fixes #1971. Reported by John Marshall)

* The htscodecs submodule is updated to v1.6.5. This includes a fix to the
  rANS encoder when running on x86-64 hardware with some SIMD features
  disabled.
  (Fixes samtools/samtools#2256. Reported by Ran Fan)

Bug fixes
---------

* Fix segfault on an empty valid MM tag.
  (PR #1939, fixes #1936. Reported by John Marshall)

* Fix bam_next_basemod + HTS_MOD_REPORT_UNCHECKED flag.
  (PR #1946, fixes #1943)

* For the VCF rlen calculation, only use SVLEN for DEL, DUP and CNV symbolic
  alleles. A bug is also fixed on big-endian platforms where INFO and FORMAT
  values were being accessed incorrectly.
  (PR #1942, fixes #1940)

* Correct TLEN assignment in CRAM decode. Also improve decoder when dealing
  with multiple secondary alignments.
  See also samtools/hts-specs#842.
  (PR #1951, fixes #1948. Reported by Matt Sexton)

* Make tabix skip comments (-c) wherever they occur, not just at the start
  of the file.
  (PR #1952, fixes #1950. Reported by Victor Negîrneac)

* Update htscodecs for better AVX2 / AVX512 runtime detection.
  (PR #1954, fixes samtools/samtools#2256. Reported by Ran Fan)

* Fix embed_ref=2 on SEQ * and MD:Z tag. The combination of no sequence and
  MD:Z with embed_ref=2 caused the slice extents to be miscalculated,
  causing invalid CRAM output to be written.
  (PR #1964, fixes samtools/samtools#2277. Reported by fo40225)

* Try to ensure CSI indexes are built with valid parameters. Adjusts the
  min_shift and n_lvls to cover the size of the genome. This may override
  the user setting of min_shift (with warning) if needed.
  (PR #1968, fixes #1966. Reported by Marc Sturm)

* Fix bug where multi-threaded CRAM iterators could drop long alignments
  starting significantly before, but overlapping, the region of interest.
  (PR #1973, fixes samtools/samtools#2285. Reported by Nick Owens)

Documentation updates
---------------------

* Added support information and samtools email for security issues.
  (PR #1956)

* Fix spelling in function name in sam.h.
  (PR #1972. Thanks to Jack Turpitt)

------------------------------------------------------------------------------
samtools - changes v1.23
------------------------------------------------------------------------------

New work and changes:

* New reference stats in `samtools stats`. First line in RFS section gives
  the total sequence count, count of regions, average GC, min, max, average
  and total counts. Second line onwards gives regions, lengths, GC and
  unknown base count.
  (PR #2224, implements #2139. Requested by Filipe G. Vieira)

* New, faster Python version of seq_cache_populate to create and update
  REF_CACHE.
  (PR #2231. Thanks to Ruben Vorderman)

* Add a minimum depth (`--min-depth`) option to `samtools coverage`.
  (PR #2235, implements #1563.
  Requested by Charles Foster)

* Add an option (`--exclude-no-read-group`) to exclude reads that have no
  read group from `samtools view` when the `-r` (or `-R`) options are used.
  (PR #2271, fixes #2265. Reported by Matt Sexton)

* Add UMI support to `samtools fastq` and `samtools import`. See
  samtools/htslib#1960.
  (PR #2270, fixes #2259 and #2262. Requested by Poshi)

* Optionally trim soft clips from reads in `samtools fastq` output.
  (PR #2233, fixes #1275. Requested by Torsten Seemann)

* If a SAM file is sorted by tag, `samtools split` will output data
  sequentially to avoid having multiple files open simultaneously.
  (PR #2281, fixes #2276. Requested by Clint Valentine)

Documentation:

* In the command help output, add a link to the global options in the
  samtools.1 page on the [HTSlib](https://www.htslib.org/) site.
  (PR #2258, addresses #2236. Reported by Chris Saunders)

* Add a support section to README.md. This mentions the GitHub issue tracker
  and an email address for security issues.
  (PR #2267)

Bug fixes:

* Prevent `samtools coverage` from printing a coverage table on failure.
  (PR #2247, fixes #2242. Reported by Georges Kanaan)

* Remove deprecated line style commands from plot-bamstats.
  (PR #2251, fixes #2243. Reported by Suhas Srinivasan)

* Add missing sam_global_args_free calls to address (harmless) memory leaks.
  (PR #2274)

* Fix `samtools consensus` crash when used with threads and iterators.
  See also samtools/htslib#1959.
  (PR #2269)

Non user-visible changes and build improvements:

* Ignore and testclean test/stat/*.fa.fai
  (PR #2241. Thanks to John Marshall)

* Remove use of C variables starting in _ from bam_consensus.c.
  (PR #2250, fixes #2248. Reported by Ghanji125)

* Add Replace RG check exit and add some comments to bam_addrprg.c.
  (PR #2254.
  Thanks to Martin Pollard)

------------------------------------------------------------------------------
bcftools - changes v1.23
------------------------------------------------------------------------------

Changes affecting the whole of bcftools, or multiple commands:

* The `-i/-e` filtering expressions and `-f` formatting in `query`

    - Add a new function `smpl_COUNT()/sCOUNT()` which returns the number of
      elements (#2423)

Changes affecting specific commands:

* bcftools annotate

    - Make dynamic variables read from a tab-delimited annotation file
      (#2151) also work for regions. For example, while the first command
      below was functional, the second was not (#2441)

        bcftools annotate -a ann.tsv.gz -c CHROM,POS,-,SCORE,~STR \
            -i'TAG={STR}' -k in.vcf

        bcftools annotate -a ann.tsv.gz -c CHROM,BEG,END,SCORE,~STR \
            -i'TAG={STR}' -k in.vcf

* bcftools consensus

    - Fix a bug which prevented reading fasta files containing empty lines
      in their entirety (#2424)

    - Fix a bug which caused `--absent` to miss some absent positions

* bcftools csq

    - Add support for complex substitutions, such as AC>TAA

* bcftools +fill-tags

    - Fix header formatting error for INFO/F_MISSING which must be Number=1
      (#2442)

    - Make `-t 'F_MISSING'` work with `-S groups.txt` (#2447)

* bcftools gtcheck

    - The program is now able to process gVCF blocks. Monoallelic sites are
      now excluded only when the site is monoallelic in both the query and
      the genotype file. The new option `--keep-refs` forces monoallelic
      sites to always be included.

    - Fix an error in parsing -i/-e command line options where the `qry:`
      and `gt:` prefix was not stripped (#2432)

* bcftools mpileup

    - Make `-d, --max-depth 0` set the depth to unlimited (#2435)

* bcftools norm

    - Make the -i/-e filtering option work for all options, such as line
      merging and duplication removal (#2415)

* bcftools query

    - Numerical functions, such as SUM(INFO/DP), would previously return the
      value 0 when executed on missing values. This was incorrect; a missing
      value is now printed.

* bcftools reheader

    - Add options `--samples-list` and `--samples-file` to allow renaming
      samples from a list of samples on the command line, rather than from
      a file of sample names (#2383)

* bcftools +split-vep

    - Fix the option `-A, --all-fields`, which was not working properly and
      could lead to a segfault (#2473)

----------------------------------------------------------------------
The Wellcome Sanger Institute is operated by Genome Research Limited, a
charity registered in England with number 1021457 and a company registered
in England with number 2742969, whose registered office is Wellcome Sanger
Institute, Wellcome Genome Campus, Hinxton, CB10 1SA.
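
As an illustration of the GT phasing change described under the HTSlib updates, the following is a minimal Python sketch, not HTSlib code. It assumes the integer encoding used by the `bcf_gt_*` macros in HTSlib's `vcf.h` (allele index plus one, shifted left by one, with the low bit as the phasing flag). The function names, including `infer_first_phase`, are hypothetical and exist only for this example. It shows why comparing raw GT values against an "unphased" constant is no longer safe, while decoding through a `bcf_gt_allele()`-style accessor still gives the same allele.

```python
# Sketch only: mirrors the arithmetic of the bcf_gt_* macros as an assumption,
# it does not call HTSlib.

def gt_unphased(allele):
    """Encode an allele index with the phasing bit cleared."""
    return (allele + 1) << 1


def gt_phased(allele):
    """Encode an allele index with the phasing bit set."""
    return ((allele + 1) << 1) | 1


def gt_allele(value):
    """Decode the allele index, ignoring the phasing bit."""
    return (value >> 1) - 1


def gt_is_phased(value):
    """Return True if the phasing bit is set."""
    return bool(value & 1)


def infer_first_phase(values):
    """Hypothetical helper: set the first phasing bit if all other alleles
    are phased, and clear it otherwise (the rule quoted in the announcement).
    With a single allele there are no other alleles, so all() is True here;
    that case is an assumption of this sketch."""
    rest = values[1:]
    first_phased = all(gt_is_phased(v) for v in rest)
    head = (values[0] & ~1) | int(first_phased)
    return [head] + rest


# "0|1" as stored without prefixed phasing: first value has bit 0 clear.
old_style = [gt_unphased(0), gt_phased(1)]
new_style = infer_first_phase(old_style)

# The raw first value changes, but the decoded alleles do not.
print(old_style, new_style, [gt_allele(v) for v in new_style])
```

Code that tests a raw value against `gt_unphased(0)` would stop matching the first allele of `0|1`, whereas code that decodes with the accessor is unaffected.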