From: Robert D. <rm...@sa...> - 2023-02-21 14:39:24
Samtools (and HTSlib and BCFtools) version 1.17 is now available from
GitHub and SourceForge.

https://github.com/samtools/htslib/releases/tag/1.17
https://github.com/samtools/samtools/releases/tag/1.17
https://github.com/samtools/bcftools/releases/tag/1.17
https://sourceforge.net/projects/samtools/

The main changes are listed below:

------------------------------------------------------------------------------
htslib - changes v1.17
------------------------------------------------------------------------------

* A new API for iterating through a BAM record's aux field.
  (PR#1354, addresses #1319. Thanks to John Marshall)

* Text mode for bgzip. Allows bgzip to compress lines of text with block
  breaks at newlines.
  (PR#1493, thanks to Mike Lin for the initial version PR#1369)

* Make tabix support CSI indices with large positions. Unlike SAM and VCF
  files, BED files do not set a maximum reference length which hindered CSI
  support. This change sets an arbitrary large size of 100G to enable it to
  work.
  (PR#1506)

* Add a fai_line_length function. Exposes the internal line-wrap length.
  (PR#1516)

* Check for invalid barcode tags in fastq output.
  (PR#1518, fixes samtools#1728. Reported by Poshi)

* Warn if reference found in a CRAM file is not contained in the specified
  reference file.
  (PR#1517 and PR#1521, adds diagnostics for #1515. Reported by Wei WeiDeng)

* Add a faidx_seq_len64 function that can return sequence lengths longer
  than INT_MAX. At the same time limit faidx_seq_len to INT_MAX output.
  Also add a fai_adjust_region to ensure given ranges do not go beyond the
  end of the requested sequence.
  (PR#1519)

* Add a bcf_strerror function to give text descriptions of BCF errors.
  (PR#1510)

* Add CRAM SQ/M5 header checking when specifying a fasta file. This is to
  prevent creating a CRAM that cannot be decoded again.
  (PR#1522. In response to samtools#1748 though not a direct fix)

* Improve support for very long input lines (> 2Gbyte).
  This is mostly useful for tabix which does not do much interpretation of
  its input.
  (PR#1542, a partial fix for #1539)

* Speed up load_ref_portion. This function has been sped up by about 7x,
  which speeds up low-depth CRAM decoding by about 10%.
  (PR#1551)

* Expand CRAM API to cope with new samtools cram-size command.
  (PR#1546)

* Merges neighbouring I and D ops into one op within pileup. This means
  4M1D1D1D3M is reported as 4M3D3M. Fixing this in sam.c means not only is
  samtools mpileup now looking better, but any tool using the mpileup API
  will be getting consistent results.
  (PR#1552, fixes the last remaining part of samtools#139)

* Update the API documentation for bgzf_mt as it referred to a previous
  iteration.
  (PR#1556, fixes #1553. Reported by Raghavendra Padmanabhan)

Build changes
-------------

* Use POSIX grep in testing as egrep and fgrep are considered obsolete.
  (PR#1509, thanks to David Seifert)

* Switch to building libdeflate with cmake for Cirrus CI.
  (PR#1511)

* Ensure strings in config_vars.h are escaped correctly.
  (PR#1530, fixes #1527. Reported by Lucas Czech)

* Easier modification of shared library permissions during install.
  (PR#1532, fixes #1525. Reported by StephDC)

* Fix build on ancient compilers. Added -std=gnu90 to build tests so older
  C compilers will still be happy.
  (PR#1524, fixes #1523. Reported by Martin Jakt)

* Switch MacOS CI tests to an ARM-based image.
  (PR#1536)

* Cut down the number of embed_ref=2 tests that get run.
  (PR#1537)

* Add symbol versions to libhts.so. This is to aid package developers.
  (PR#1560 addresses #1505, thanks to John Marshall. Reported by
  Stefan Bruens)

* htscodecs now updated to v1.4.0.
  (PR#1563)

* Cleaned up misleading system error reports in test_bgzf.
  (PR#1565)

Bug fixes
---------

* VCF. Fix n-squared complexity in sample line with many adjacent tabs
  [fuzz].
  (PR#1503)

* Improved bcftools detection and reporting of bgzf decode errors.
  (PR#1504, thanks to Lilian Janin.
  PR#1529 thanks to Bergur Ragnarsson, fixes #1528. PR#1554)

* Prevent crash when the only FASTA entry has no sequence [fuzz].
  (PR#1507)

* Fixed typo in sam.h documentation.
  (PR#1512, thanks to kojix2)

* Fix buffer read-overrun in bam_plp_insertion_mod.
  (PR#1520)

* Fix hash keys being left behind by bcf_hdr_remove.
  (PR#1535, fixes #1533. Reported by Giulio Genovese in #842)

* Make bcf_hdr_idinfo_exists more robust by checking id value exists.
  (PR#1544, fixes #1538. Reported by Giulio Genovese)

* CRAM improvements. Fixed crash with multi-threaded CRAM. Fixed a bug in
  the codec parameter learning for CRAM 3.1 name tokeniser. Fixed CRAM
  compression container substitution matrix generation.
  (PR#1558, PR#1559 and PR#1562)

------------------------------------------------------------------------------
samtools - changes v1.17
------------------------------------------------------------------------------

New work and changes:

* New samtools reset subcommand. Removes alignment information. Alignment
  location, CIGAR, mate mapping and flags are updated. If the alignment was
  in reverse direction, sequence and its quality values are reversed and
  complemented and the reverse flag is reset. Supplementary and secondary
  alignment data are discarded.
  (PR#1767, implements #1682. Requested by dkj)

* New samtools cram-size subcommand. It writes out metrics about a CRAM
  file reporting aggregate sizes per block "Content ID" fields, the
  data-series contained within them, and the compression methods used.
  (PR#1777)

* Added a --sanitize option to fixmate and view. This performs some sanity
  checks on the state of SAM record fields, fixing up common mistakes made
  by aligners.
  (PR#1698)

* Permit 1 thread with samtools view. All other subcommands already allow
  this and it does provide a modest speed increase.
  (PR#1755, fixes #1743. Reported by Goran Vinterhalter)

* Add CRAM_OPT_REQUIRED_FIELDS option for view -c.
  This is a big speed up for CRAM (maybe 5-fold), but it depends on which
  filtering options are being used.
  (PR#1776, fixes #1775. Reported by Chang Y)

* New filtering options in samtools depth. The new --excl-flags option is a
  synonym for -G, with --incl-flags and --require-flags added to match view
  logic.
  (PR#1718, fixes #1702. Reported by Dario Beraldi)

* Speed up calmd's slow handling of non-position-sorted data by adding
  caching. This uses more memory but is only activated when needed.
  (PR#1723, fixes #1595. Reported by lxwgcool)

* Improve samtools consensus for platforms with instrument specific
  profiles, considerably helping for data with very different indel error
  models and providing base quality recalibration tables. On PacBio HiFi,
  ONT and Ultima Genomics consensus qualities are also redistributed within
  homopolymers and the likelihood of nearby indel errors is raised.
  (PR#1721, PR#1733)

* Consensus --mark-ins option. This permits the consensus output to include
  a markup indicating the next base is an insertion. This is necessary as
  we need a way of outputting both consensus and also how that consensus
  marries up with the reference coordinates.
  (PR#1746)

* Make faidx/fqidx output line length default to the input line length.
  (PR#1738, fixes #1734. Reported by John Marshall)

* Speed up optical duplicate checking where data has a lot of duplicates
  compared to non-duplicates.
  (PR#1779, fixes #1771. Reported by Poshi)

* For collate use TMPDIR environment variable, when looking for a temporary
  folder.
  (PR#1782, based on PR#1178 and fixes #1172. Reported by Martin Pollard)

Bug Fixes:

* Fix stats breakage on long deletions when given a reference.
  (PR#1712, fixes #1707. Reported by John Didion)

* In ampliconclip, stop hard clipping from wrongly removing entire reads.
  (PR#1722, fixes #1717.
  Reported by Kevin Xu)

* Fix bug in ampliconstats where references mentioned in the input file
  headers but not in the bed file would cause it to complain that the SAM
  headers were inconsistent.
  (PR#1727, fixes #1650. Reported by jPontix)

* Fixed SEGV in samtools collate when no filename given.
  (PR#1724)

* Changed the default UMI barcode regex in markdup. The old regex was too
  restrictive. This version will at least allow the default read name UMI
  as given in the Illumina example documentation.
  (PR#1737, fixes #1730. Reported by yloemie)

* Fix samtools consensus buffer overrun with MD:Z handling.
  (PR#1745, fixes #1744. Reported by trilisser)

* Fix a buffer read-overflow in mpileup and tview on sequences with
  seq "*".
  (PR#1747)

* Fix view -X command line parsing that was broken in 1.15.
  (PR#1772, fixes #1720. Reported by Francisco Rodríguez-Algarra and
  Miguel Machado)

* Stop samtools view -d from reporting meaningless system errors when tag
  validation fails.
  (PR#1796)

Documentation:

* Add a description of the samtools tview display layout to the man page.
  Documents . vs , and upper vs lowercase. Adds a -s sample example, and
  documents the -w option.
  (PR#1765, fixes #1759. Reported by Lucas Ferreira da Silva)

* Clarify intention of samtools fasta/q in man page and soft vs hard
  clipping.
  (PR#1794, fixes #1792. Reported by Ryan Lorig-Roach)

* Minor fix to wording of mpileup --rf usage and man page.
  (PR#1795, fixes #1791. Reported by Luka Pavageau)

Non user-visible changes and build improvements:

* Use POSIX grep in testing as egrep and fgrep are considered obsolete.
  (PR#1726, thanks to David Seifert)

* Switch MacOS CI tests to an ARM-based image.
  (PR#1770)

------------------------------------------------------------------------------
bcftools - changes v1.17
------------------------------------------------------------------------------

Changes affecting the whole of bcftools, or multiple commands:

* The -i/-e filtering expressions

    - Error checks were added to prevent incorrect use of vector
      arithmetic. For example, when evaluating the sum of two vectors A and
      B, the resulting vector could contain nonsense values when the input
      vectors were not of the same length. The fix introduces the following
      logic:

        - evaluate to C_i = A_i + B_i when length(A)==length(B) and set
          length(C)=length(A)
        - evaluate to C_i = A_i + B_0 when length(B)=1 and set
          length(C)=length(A)
        - evaluate to C_i = A_0 + B_i when length(A)=1 and set
          length(C)=length(B)
        - throw an error when length(A)!=length(B) AND length(A)!=1 AND
          length(B)!=1

    - Arrays in Number=R tags can now be subscripted by alleles found in
      FORMAT/GT. For example,

        FORMAT/AD[GT] > 10 ..
            require support of more than 10 reads for each allele
        FORMAT/AD[0:GT] > 10 ..
            same as above, but in the first sample
        sSUM(FORMAT/AD[GT]) > 20 ..
            require total sample depth bigger than 20

* The commands `consensus -H` and `+split-vep -H`

    - Drop unnecessary leading space in the first header column and newly
      print `#[1]columnName` instead of the previous `# [1]columnName`
      (#1856)

Changes affecting specific commands:

* bcftools +allele-length

    - Fix overflow for indels longer than 512bp and aggregate alleles equal
      or larger than that in the same bin (#1837)

* bcftools annotate

    - Support sample reordering of annotation file (#1785)

    - Restore lost functionality of the --pair-logic option (#1808)

* bcftools call

    - Fix a bug where too many alleles passed to `-C alleles` via `-T`
      caused memory corruption (#1790)

    - Fix a bug where indels constrained with `-C alleles -T` would
      sometimes be missed (#1706)

* bcftools consensus

    - BREAKING CHANGE: the option `-I, --iupac-codes` newly outputs IUPAC
      codes based on FORMAT/GT of all samples. The `-s, --samples` and
      `-S, --samples-file` options can be used to subset samples. In order
      to ignore samples and consider only the REF and ALT columns (the
      original behavior prior to 1.17), run with `-s -` (#1828)

* bcftools convert

    - Make variantkey conversion work for sites without an ALT allele
      (#1806)

* bcftools csq

    - Fix a bug where a MNV with multiple consequences (e.g. missense +
      stop_gained) would report only the less severe one (#1810)

    - GFF file parsing was made slightly more flexible, newly ids can be
      just 'XXX' rather than, for example, 'gene:XXX'

    - New gff2gff perl script to fix GFF formatting differences

* bcftools +fill-tags

    - More of the available annotations are now added by the `-t all`
      option

* bcftools +fixref

    - New INFO/FIXREF annotation

    - New -m swap mode

* bcftools +mendelian

    - The +mendelian plugin has been deprecated and replaced with
      +mendelian2. The function of the plugin is the same but the command
      line options and the output format have changed, and for this reason
      it was introduced as a new plugin.

* bcftools mpileup

    - Most of the annotations generated by mpileup are now optional via the
      `-a, --annotate` option, and several new (mostly experimental)
      annotations were added.

    - New option `--indels-2.0` for an EXPERIMENTAL indel calling model.
      This model aims to address some known deficiencies of the current
      indel calling algorithm; specifically, it uses a diploid reference
      consensus sequence. Note that in the current version it has the
      potential to increase sensitivity but at the cost of decreased
      specificity.

    - Make the FS annotation (Fisher exact test strand bias) functional and
      remove it from the default annotations

* bcftools norm

    - New --multi-overlaps option allows overlapping alleles to be set
      either to the ref allele (the current default) or to a missing allele
      (#1764 and #1802)

    - Fixed a bug in `-m -` which did not split missing FORMAT values
      correctly and could lead to empty FORMAT fields such as `::` instead
      of the correct `:.:` (#1818)

    - The `--atomize` option previously would not split complex indels such
      as C>GGG. Newly these will be split into two records C>G and C>CGG
      (#1832)

* bcftools query

    - Fix a rare bug where the printing of SAMPLE field with `query` was
      incorrectly suppressed when the `-e` option contained a sample
      expression while the formatting query did not. See #1783 for details.

* bcftools +setGT

    - Add new `--new-gt X` option (#1800)

    - Add new `--target-gt r:FLOAT` option to randomly select a proportion
      of genotypes (#1850)

    - Fix a bug where `-t ./x` mode was advertised as selecting both phased
      and unphased half-missing genotypes, but was in fact selecting only
      unphased genotypes (#1844)

* bcftools +split-vep

    - New options `-g, --gene-list` and `--gene-list-fields` which allow
      prioritizing consequences from a list of genes, or restricting output
      to the listed genes

    - New `-H, --print-header` option to print the header with `-f`

    - Work around a bug in the LOFTEE VEP plugin used to annotate gnomAD
      VCFs.
      There the LoF_info subfield contains commas which, in general, makes
      it impossible to parse the VEP subfields. The +split-vep plugin can
      now work with such files, replacing the offending commas with slash
      (/) characters. See also
      https://github.com/Ensembl/ensembl-vep/issues/1351

    - Newly the `-c, --columns` option can be omitted when a subfield is
      used in an `-i/-e` filtering expression. Note that `-c` may still
      have to be given when it is not possible to infer the type of the
      subfield. Note that this is an experimental feature.

* bcftools stats

    - The per-sample stats (PSC) would not be computed when `-i/-e`
      filtering options and the `-s -` option were given but the expression
      did not include sample columns (#1835)

* bcftools +tag2tag

    - Revamp of the plugin to allow a wider range of tag conversions,
      specifically all combinations from FORMAT/GL,PL,GP to
      FORMAT/GL,PL,GP,GT

* bcftools +trio-dnm2

    - New `-n, --strictly-novel` option to downplay alleles which violate
      Mendelian inheritance but are not novel

    - Allow the `--pn` and `--pns` options to be set separately for SNVs
      and indels and make the indel settings more strict by default

    - Output missing FORMAT/VAF values in non-trio samples, rather than
      random nonsense values

* bcftools +variant-distance

    - New option `-d, --direction` to choose the directionality: forward,
      reverse, nearest (the default) or both (#1829)

-- 
The Wellcome Sanger Institute is operated by Genome Research Limited, a
charity registered in England with number 1021457 and a company registered
in England with number 2742969, whose registered office is 215 Euston Road,
London, NW1 2BE.
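
For readers who want to see the new -i/-e vector-length rules from the
bcftools section spelled out, here is a minimal Python sketch. It is an
illustration only and not the bcftools implementation: the function name
add_vectors, the use of Python lists for the vectors A and B, and the
choice of ValueError for the error case are all assumptions made for this
example.

```python
def add_vectors(a, b):
    """Element-wise sum following the four length rules in the notes.

    Hypothetical helper for illustration; not part of bcftools.
    """
    if len(a) == len(b):
        # C_i = A_i + B_i, length(C) = length(A)
        return [x + y for x, y in zip(a, b)]
    if len(b) == 1:
        # C_i = A_i + B_0, length(C) = length(A)
        return [x + b[0] for x in a]
    if len(a) == 1:
        # C_i = A_0 + B_i, length(C) = length(B)
        return [a[0] + y for y in b]
    # Lengths differ and neither vector has length 1: refuse to guess.
    raise ValueError(
        f"incompatible vector lengths: {len(a)} and {len(b)}"
    )


if __name__ == "__main__":
    print(add_vectors([1, 2, 3], [10, 20, 30]))
    print(add_vectors([1, 2, 3], [10]))
    print(add_vectors([5], [1, 2]))
```

The point of the last branch is that mismatched lengths are reported as an
error rather than silently producing values that mean nothing.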