From: Robert D. <rm...@sa...> - 2023-12-12 16:23:04
Samtools (and HTSlib and BCFtools) version 1.19 is now available from
GitHub and SourceForge.

https://github.com/samtools/htslib/releases/tag/1.19
https://github.com/samtools/samtools/releases/tag/1.19
https://github.com/samtools/bcftools/releases/tag/1.19
https://sourceforge.net/projects/samtools/

The main changes are listed below:

------------------------------------------------------------------------------
htslib - changes v1.19
------------------------------------------------------------------------------

Updates
-------

* A temporary work-around has been put in the VCF parser so that it is less
  likely to fail on rows with a large number of ALT alleles, where Number=G
  tags like PL can expand beyond the 2Gb limit enforced by HTSlib. For now,
  where this happens the offending tag will be dropped so the data can be
  processed, albeit without the likelihood data.

  In future work, the library will instead convert such tags into their
  local alternatives (see https://github.com/samtools/hts-specs/pull/434).

* New program. Adds annot-tsv which annotates regions in a destination file
  with text from overlapping regions in a source file. (PR#1619)

* Change bam_parse_cigar() so that it can modify existing BAM records. This
  makes it more useful as a public API. Previously it could only handle
  partially formed BAM records.
  (PR#1651, fixes #1650. Reported by Oleksii Nikolaienko)

* Add "uncompressed" to hts_format_description() where appropriate. This
  adds an "uncompressed" description to uncompressed files that would
  normally be compressed, such as BAM and BCF.
  (PR#1656, in relation to samtools#1884. Thanks to John Marshall)

* Speed up the VCF parser and writer. (PR#1644 and PR#1663)

* Add an hclen (hard clip length) SAM filter function.
  (PR#1660, with reference to samtools#813)

* Avoid really closing stdin/stdout in hclose()/hts_close()/et al. See
  discussion in PR for details. (PR#1665. Thanks to John Marshall)

* Add support to handle multiple files in bgzip.
  (PR#1658, fixes #1642. Requested by bw2)

* Enable auto-vectorisation in CRAM 3.1 codecs. Speeds decoding on some
  sequencing platform data. (PR#1669)

* Speed up removal of lines in large headers.
  (PR#1662, fixes #1460. Reported by Anže Starič)

* Apply seqtk PR to improve kseq.h parsing performance. Port of Fabian
  Klötzl's (kloetzl) lh3/seqtk#123 and attractivechaos/klib#173 to HTSlib.
  (PR#1674. Thanks to John Marshall)

Build changes
-------------

* Updated htscodecs submodule to 1.6.0. (PR#1685, PR#1717, PR#1719)

* Apply the packed attribute to uint*_u types for Clang to prevent
  -fsanitize=alignment failures. (PR#1667. Thanks to Fangrui Song)

* Fuzz testing improvements. (PR#1664)

* Add C++ casts for external headers in klist.h and kseq.h.
  (PR#1683. See also PR#1674 and PR#1682)

* Add test case compiling the public headers as C++.
  (PR#1682. Thanks to John Marshall)

* Enable optimisation level -O3 for SAM QUAL+33 formatting. (PR#1679)

* Make compiler flag detection work with zig cc. (PR#1687)

* Fix unused value warnings when built with NDEBUG. (PR#1688)

* Remove some disused Makefile variables, fix typos and a warning. Improve
  bam_parse_basemod() documentation. (PR#1705, Thanks to John Marshall)

Bug fixes
---------

* Fail bgzf_useek() when offset is above block limits. (PR#1668)

* Fix multi-threaded on-the-fly indexing problems.
  (PR#1672, fixes samtools#1861 and bcftools#1985. Reported by Mark Ebbert
  and lacek)

* Fix hfile_libcurl small seek bug.
  (PR#1676, fixes samtools#1918. Also may fix #1037, #1625 and
  samtools#1622. Reported by Alex Reynolds, Mark Walker, Arthur Gilly and
  skatragadda-nygc. Thanks to John Marshall)

* Fix a minor memory leak in malformed CRAM EXTERNAL blocks. [fuzz]
  (PR#1671)

* Fix a cram decode hang from block_resize().
  (PR#1680. Reported by Sebastian Deorowicz)

* Cram fuzzing improvements. Fixes a number of cram errors.
  (PR#1701, fixes #1691, #1692, #1693, #1696, #1697, #1698, #1699 and #1700.
  Thanks to Octavio Galland for finding and reporting all these)

* Fix crypt4gh redirection.
  (PR#1675, fixes grbot/crypt4gh-tutorial#2. Reported by hth4)

* Fix PG header linking when records make a loop.
  (PR#1702, fixes #1694. Reported by Octavio Galland)

* Prevent issues with no-stored-sequence records in CRAM files, by ensuring
  they are accounted for properly in block size calculations, and by
  limiting the maximum query length in the CIGAR data. Originally seen as
  an overflow by OSS-Fuzz / UBSAN, it turned out this could lead to
  excessive time and memory use by HTSlib, and could result in it writing
  out unreadable CRAM files. (PR#1710)

* Fix some illegal shifts and integer overflows found by OSS-Fuzz / UBSAN.
  (PR#1707, PR#1712, PR#1713)

------------------------------------------------------------------------------
samtools - changes v1.19
------------------------------------------------------------------------------

New work and changes:

* Samtools coverage: add a new --plot-depth option to draw depth (of
  coverage) rather than the percentage of bases covered.
  (PR #1910. Thanks to Pierre Lindenbaum)

* Samtools merge / sort: add a lexicographical name-sort option via the -N
  option. The "natural" alpha-numeric sort is still available via -n.
  (PR #1900, fixes #1500. Reported by Steve Huang)

* Samtools view: add -N ^NAME_FILE and -R ^RG_FILE options. The standard -N
  and -R options only output reads matching a specified list of names or
  read-groups. With a caret (^) prefix these may be negated to only output
  reads not matching the specified files.
  (PR #1896, fixes #1895. Suggested by Feng Tian)

* Cope with Htslib's change to no longer close stdout on hts_close.
  Htslib companion PR is samtools/htslib#1665.
  (PR #1909. Thanks to John Marshall)

* Plot-bamstats: add a new plot of the read lengths ("RL") from samtools
  stats output.
  (PR #1922, fixes #1824.
  Thanks to @erboone, suggested by Martin Pollard)

* Samtools split: support splitting files based on the contents of
  auxiliary tags. Also adds a -M option to limit the number of files split
  can make, to avoid accidental resource over-allocation, and fixes some
  issues with --write-index.
  (PR #1222, PR #1933, fixes #1758. Thanks to Valeriu Ohan, suggested by
  Scott Norton)

Bug Fixes:

* Samtools stats: empty barcode tags are now treated as having no barcode.
  (PR #1929, fixes #1926. Reported by Jukka Matilainen)

* Samtools cat: add support for non-seekable streams. The file format
  detection blocked pipes from working before, but now files may be
  non-seekable such as stdin or a pipe.
  (PR #1930, fixes #1731. Reported by Julian Hess)

* Samtools mpileup -aa (absolutely all positions) now produces an output
  even when given an empty input file. (PR #1939. Reported by Chang Y)

* Samtools markdup: speed up optical duplicate tagging on regions with very
  deep data. (PR #1952)

Documentation:

* Samtools mpileup: add more usage examples to the man page.
  (PR #1913, fixes #1801)

* Samtools fastq: explicitly document the order that filters apply.
  (PR #1907)

* Samtools merge: fix example output to use an uppercase RG PL field.
  (PR #1917. Thanks to John Marshall. Reported by Michael Macias)

* Add hclen SAM filter documentation.
  (PR #1902. See also samtools/htslib#1660)

* Samtools consensus: remove the -5 option from documentation. This option
  was renamed before the consensus subcommand was merged, but accidentally
  left in the man page. (PR #1924)

------------------------------------------------------------------------------
bcftools - changes v1.19
------------------------------------------------------------------------------

Changes affecting the whole of bcftools, or multiple commands:

* Filtering expressions can be given a file with a list of strings to match;
  this was previously possible only for the ID column. For example

    ID=@file             .. selects lines with ID present in the file
    INFO/TAG=@file.txt   .. selects lines where TAG has a string value
                            listed in the file
    INFO/TAG!=@file.txt  .. TAG must not have a string value listed in the
                            file

* Allow the REF and ALT columns to be queried directly, for example
  -e 'REF="N"'

Changes affecting specific commands:

* bcftools annotate

    - Fix `bcftools annotate --mark-sites`, VCF sites overlapping regions in
      a BED file were not annotated (#1989)

    - Add flexibility to FILTER column transfers and allow transfers within
      the same file, across files, and in combination. For examples see
      http://samtools.github.io/bcftools/howtos/annotate.html#transfer_filter_to_info

* bcftools call

    - Output MIN_DP rather than MinDP in gVCF mode

    - New `-*, --keep-unseen-allele` option to output the unobserved allele
      <*>, intended for gVCF.

* bcftools head

    - New `-s, --samples` option to include the #CHROM header line with
      samples.

* bcftools gtcheck

    - Add output options `-o, --output` and `-O, --output-type`

    - Add filtering options `-i, --include` and `-e, --exclude`

    - Rename the short option `-e, --error-probability` from lower case to
      upper case `-E, --error-probability`

    - Changes to the output format, replace the DC section with DCv2:

        - adds a new column for the number of matching genotypes

        - The --error-probability is newly interpreted as the probability of
          erroneous allele rather than genotype. In other words, the
          calculation of the discordance score now considers the probability
          of genotyping error to be different for HOM and HET genotypes,
          i.e. P(0/1|dsg=0) > P(1/1|dsg=0).

        - fixes in HWE score calculation plus output average HWE score
          rather than absolute HWE score

        - better description of fields

* bcftools merge

    - Add `-m` modifiers to suppress the output of the unseen allele <*> or
      <NON_REF> at variant sites (e.g. `-m both,*`) or all sites (e.g.
      `-m both,**`)

* bcftools mpileup

    - Output MIN_DP rather than MinDP in gVCF mode

* bcftools norm

    - Add the number of joined lines to the summary output, for example

        Lines total/split/joined/realigned/skipped: 6/0/3/0/0

    - Allow combining -m and -a with --old-rec-tag (#2020)

    - Symbolic <DEL> alleles caused norm to expand REF to the full length of
      the deletion. This was not intended and problematic for long
      deletions, the REF allele should list one base only (#2029)

* bcftools query

    - Add new `-N, --disable-automatic-newline` option to restore the
      pre-1.18 query formatting behavior, in which a missing newline was
      not added

    - Make the automatic addition of the newline character more predictable
      and, when missing, always put it at the end of the expression. In
      version 1.18 it could be added at the end of the expression (for
      per-site expressions) or inside the square brackets (for per-sample
      expressions). The new behavior is:

        - if the formatting expression contains a newline character, do
          nothing

        - if there is no newline character and -N,
          --disable-automatic-newline is given, do nothing

        - if there is no newline character and -N is not given, insert
          newline at the end of the expression

      See #1969 for details

    - Add new `-F, --print-filtered` option to output a default string for
      samples that would otherwise be filtered by `-i/-e` expressions.

    - Include sample name in the output header with `-H` whenever it makes
      sense (#1992)

* bcftools +split-vep

    - Fix on the fly filtering involving numeric subfields, e.g.
      `-i 'MAX_AF<0.001'` (#2039)

    - Interpret default column type names (--columns-types) as entire
      strings, rather than substrings to avoid unexpected spurious matches
      (i.e. internally add ^ and $ to all field names)

* bcftools +trio-dnm2

    - Do not flag paternal genotyping errors as de novo mutations.
      Specifically, when father's chrX genotype is 0/1 and mother's 0/0, 0/1
      in the child will not be marked as DNM.
* bcftools view

    - Add new `-A, --trim-unseen-allele` option to remove the unseen allele
      <*> or <NON_REF> at variant sites (`-A`) or all sites (`-AA`)

--
The Wellcome Sanger Institute is operated by Genome Research Limited, a
charity registered in England with number 1021457 and a company registered
in England with number 2742969, whose registered office is Wellcome Sanger
Institute, Wellcome Genome Campus, Hinxton, CB10 1SA.
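
Editorial illustration of the three bcftools query newline rules listed in the bcftools section above. This is a minimal Python sketch of the decision only, not bcftools code: the function name and its arguments are hypothetical, and it assumes the newline shows up in the format string either as the two-character escape \n (as typed on a command line) or as a literal newline character.

```python
def finalize_query_format(fmt: str, disable_automatic_newline: bool = False) -> str:
    """Apply the newline rule described for bcftools query in 1.19.

    Hypothetical helper for illustration only; not part of bcftools.
    """
    # Assumption: a newline may appear as the escape sequence backslash + 'n'
    # typed on the command line, or as a real newline character.
    has_newline = "\\n" in fmt or "\n" in fmt

    # Rule 1: the expression already contains a newline, so do nothing.
    if has_newline:
        return fmt

    # Rule 2: no newline, but -N / --disable-automatic-newline was given.
    if disable_automatic_newline:
        return fmt

    # Rule 3: no newline and -N not given, so append one at the very end
    # of the expression (never inside the square brackets).
    return fmt + "\\n"


if __name__ == "__main__":
    examples = [
        ("%CHROM\\t%POS\\n", False),  # already has a newline
        ("[%SAMPLE=%GT ]", False),    # per-sample expression, no newline
        ("%CHROM\\t%POS", True),      # no newline, -N given
    ]
    for fmt, flag in examples:
        print(repr(finalize_query_format(fmt, flag)))
```

The point of the third rule is that the newline always lands after the closing bracket of a per-sample expression, which is the difference from the 1.18 behavior described above.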