[Samtools-devel] Release 1.17

Samtools (and HTSlib and BCFtools) version 1.17 is now available from
GitHub and SourceForge.

https://github.com/samtools/htslib/releases/tag/1.17
https://github.com/samtools/samtools/releases/tag/1.17
https://github.com/samtools/bcftools/releases/tag/1.17 
https://sourceforge.net/projects/samtools/

The main changes are listed below:

------------------------------------------------------------------------------
htslib - changes v1.17
------------------------------------------------------------------------------

* A new API for iterating through a BAM record's aux field. (PR#1354,
   addresses #1319.  Thanks to John Marshall)

* Text mode for bgzip. Allows bgzip to compress lines of text with block
   breaks at newlines. (PR#1493, thanks to Mike Lin for the initial version
   PR#1369)

* Make tabix support CSI indices with large positions.  Unlike SAM and VCF
   files, BED files do not set a maximum reference length which hindered CSI
   support.  This change sets an arbitrary large size of 100G to enable it to
   work. (PR#1506)

* Add a fai_line_length function.  Exposes the internal line-wrap length.
   (PR#1516)

* Check for invalid barcode tags in fastq output. (PR#1518, fixes
   samtools#1728.  Reported by Poshi)

* Warn if reference found in a CRAM file is not contained in the specified
   reference file. (PR#1517 and PR#1521, adds diagnostics for #1515. Reported
   by Wei WeiDeng)

* Add a faidx_seq_len64 function that can return sequence lengths longer than
   INT_MAX.  At the same time limit faidx_seq_len to INT_MAX output.  Also add
   a fai_adjust_region to ensure given ranges do not go beyond the end of the
   requested sequence. (PR#1519)

* Add a bcf_strerror function to give text descriptions of BCF errors.
   (PR#1510)

* Add CRAM SQ/M5 header checking when specifying a fasta file.  This is
   to prevent creating a CRAM that cannot be decoded again. (PR#1522.  In
   response to samtools#1748 though not a direct fix)

* Improve support for very long input lines (> 2Gbyte).  This is mostly
   useful for tabix which does not do much interpretation of its input.
   (PR#1542, a partial fix for #1539)

* Speed up load_ref_portion.  This function has been sped up by about 7x,
   which speeds up low-depth CRAM decoding by about 10%. (PR#1551)

* Expand CRAM API to cope with new samtools cram_size command. (PR#1546)

* Merges neighbouring I and D ops into one op within pileup. This means
   4M1D1D1D3M is reported as 4M3D3M.  Fixing this in sam.c means not only is
   samtools mpileup now looking better, but any tool using the mpileup API
   will be getting consistent results. (PR#1552, fixes the last remaining
   part of samtools#139)

* Update the API documentation for bgzf_mt as it referred to a previous
   iteration. (PR#1556, fixes #1553.  Reported by Raghavendra Padmanabhan)
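
As an illustration of the pileup change above (neighbouring I and D ops
merged into one op), here is a minimal Python sketch that works on a CIGAR
string.  It is not the htslib implementation in sam.c; the function name
merge_adjacent_indels is made up for this example, and it only shows the
intended effect on the 4M1D1D1D3M case quoted in the notes.

```python
import re

def merge_adjacent_indels(cigar: str) -> str:
    """Merge runs of adjacent I ops, and runs of adjacent D ops, into one op.

    Hypothetical helper for illustration only; htslib does this internally
    on BAM records within the pileup code, not on CIGAR strings.
    """
    # Split the CIGAR string into (length, op) pairs.
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    merged = []
    for length, op in ops:
        # Only I and D are merged, and only with an identical neighbouring op.
        if merged and op in "ID" and merged[-1][1] == op:
            merged[-1] = (merged[-1][0] + length, op)
        else:
            merged.append((length, op))
    return "".join(f"{n}{op}" for n, op in merged)

print(merge_adjacent_indels("4M1D1D1D3M"))  # 4M3D3M
```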

Build changes
-------------

* Use POSIX grep in testing as egrep and fgrep are considered obsolete.
   (PR#1509, thanks to David Seifert)

* Switch to building libdeflate with cmake for Cirrus CI. (PR#1511)

* Ensure strings in config_vars.h are escaped correctly. (PR#1530, fixes
   #1527. Reported by Lucas Czech)

* Easier modification of shared library permissions during install. (PR#1532,
   fixes #1525. Reported by StephDC)

* Fix build on ancient compilers.  Added -std=gnu90 to build tests so older C
   compilers will still be happy. (PR#1524, fixes #1523.  Reported by
   Martin Jakt)

* Switch MacOS CI tests to an ARM-based image. (PR#1536)

* Cut down the number of embed_ref=2 tests that get run. (PR#1537)

* Add symbol versions to libhts.so.  This is to aid package developers.
   (PR#1560 addresses #1505, thanks to John Marshall. Reported by
   Stefan Bruens)

* htscodecs now updated to v1.4.0. (PR#1563)

* Cleaned up misleading system error reports in test_bgzf. (PR#1565)

Bug fixes
---------

* VCF. Fix n-squared complexity in sample line with many adjacent tabs
   [fuzz]. (PR#1503)

* Improved bcftools detection and reporting of bgzf decode errors. (PR#1504,
   thanks to Lilian Janin. PR#1529 thanks to Bergur Ragnarsson, fixes #1528.
   PR#1554)

* Prevent crash when the only FASTA entry has no sequence [fuzz]. (PR#1507)

* Fixed typo in sam.h documentation. (PR#1512, thanks to kojix2)

* Fix buffer read-overrun in bam_plp_insertion_mod. (PR#1520)

* Fix hash keys being left behind by bcf_hdr_remove. (PR#1535, fixes #1533.
   Reported by Giulio Genovese in #842)

* Make bcf_hdr_idinfo_exists more robust by checking id value exists.
   (PR#1544, fixes #1538.  Reported by Giulio Genovese)

* CRAM improvements. Fixed crash with multi-threaded CRAM.  Fixed a bug
   in the codec parameter learning for CRAM 3.1 name tokeniser. Fixed CRAM
   compression container substitution matrix generation. (PR#1558, PR#1559
   and PR#1562)

------------------------------------------------------------------------------
samtools - changes v1.17
------------------------------------------------------------------------------

New work and changes:

* New samtools reset subcommand.  Removes alignment information.  Alignment
   location, CIGAR, mate mapping and flags are updated. If the alignment was
   in reverse direction, sequence and its quality values are reversed and
   complemented and the reverse flag is reset.  Supplementary and secondary
   alignment data are discarded. (PR#1767, implements #1682. Requested by dkj)

* New samtools cram-size subcommand.  It writes out metrics about a CRAM file
   reporting aggregate sizes per block "Content ID" fields, the data-series
   contained within them, and the compression methods used. (PR#1777)

* Added a --sanitize option to fixmate and view.  This performs some sanity
   checks on the state of SAM record fields, fixing up common mistakes made by
   aligners. (PR#1698)

* Permit 1 thread with samtools view.  All other subcommands already allow
   this and it does provide a modest speed increase. (PR#1755, fixes #1743.
   Reported by Goran Vinterhalter)

* Add CRAM_OPT_REQUIRED_FIELDS option for view -c.  This is a big speed up
   for CRAM (maybe 5-fold), but it depends on which filtering options are
   being used. (PR#1776, fixes #1775. Reported by Chang Y)

* New filtering options in samtools depth.  The new --excl-flags option is a
   synonym for -G, with --incl-flags and --require-flags added to match view
   logic. (PR#1718, fixes #1702. Reported by Dario Beraldi)

* Speed up calmd's slow handling of non-position-sorted data by adding
   caching. This uses more memory but is only activated when needed.
   (PR#1723, fixes #1595. Reported by lxwgcool)

* Improve samtools consensus for platforms with instrument specific
   profiles, considerably helping for data with very different indel error
   models and providing base quality recalibration tables. On PacBio HiFi,
   ONT and Ultima Genomics, consensus qualities are also redistributed
   within homopolymers and the likelihood of nearby indel errors is raised.
   (PR#1721, PR#1733)

* Consensus --mark-ins option.  This permits the consensus output to include a
   markup indicating the next base is an insertion. This is necessary as we
   need a way of outputting both consensus and also how that consensus marries
   up with the reference coordinates. (PR#1746)

* Make faidx/fqidx output line length default to the input line length.
   (PR#1738, fixes #1734. Reported by John Marshall)

* Speed up optical duplicate checking where data has a lot of duplicates
   compared to non-duplicates. (PR#1779, fixes #1771. Reported by Poshi)

* For collate, use the TMPDIR environment variable when looking for a
   temporary folder. (PR#1782, based on PR#1178 and fixes #1172.  Reported by
   Martin Pollard)
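
To make the reverse-strand handling described for samtools reset above more
concrete, here is a small Python sketch of that one step: the sequence is
reverse-complemented, the quality string is reversed, and the reverse flag
bit (0x10) is cleared.  This is an assumption-laden illustration, not the
samtools code; the function name restore_original_orientation is
hypothetical, only plain A/C/G/T/N bases are handled, and the other changes
reset makes (location, CIGAR, mate fields, remaining flags) are left out.

```python
BAM_FREVERSE = 0x10  # SAM flag bit: read is stored reverse-complemented

# Complement table for plain bases only; other IUPAC codes pass through as-is.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def restore_original_orientation(flag: int, seq: str, qual: str):
    """Return (flag, seq, qual) with a reverse-strand record flipped back.

    Hypothetical helper for illustration; records that are not flagged as
    reverse-strand are returned unchanged.
    """
    if not flag & BAM_FREVERSE:
        return flag, seq, qual
    new_seq = seq.translate(_COMPLEMENT)[::-1]   # reverse-complement bases
    new_qual = qual[::-1]                        # qualities are only reversed
    return flag & ~BAM_FREVERSE, new_seq, new_qual

print(restore_original_orientation(0x10, "AACG", "IIH#"))
```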

Bug Fixes:

* Fix stats breakage on long deletions when given a reference. (PR#1712,
   fixes #1707. Reported by John Didion)

* In ampliconclip, stop hard clipping from wrongly removing entire reads.
   (PR#1722, fixes #1717. Reported by Kevin Xu)

* Fix bug in ampliconstats where references mentioned in the input file
   headers but not in the bed file would cause it to complain that the SAM
   headers were inconsistent. (PR#1727, fixes #1650. Reported by jPontix)

* Fixed SEGV in samtools collate when no filename given. (PR#1724)

* Changed the default UMI barcode regex in markdup.  The old regex was too
   restrictive.  This version will at least allow the default read name UMI
   as given in the Illumina example documentation. (PR#1737, fixes #1730.
   Reported by yloemie)

* Fix samtools consensus buffer overrun with MD:Z handling. (PR#1745, fixes
   #1744. Reported by trilisser)

* Fix a buffer read-overflow in mpileup and tview on sequences with seq "*".
   (PR#1747)

* Fix view -X command line parsing that was broken in 1.15. (PR#1772, fixes
   #1720.  Reported by Francisco Rodríguez-Algarra and Miguel Machado)

* Stop samtools view -d from reporting meaningless system errors when tag
   validation fails. (PR#1796)

Documentation:

* Add a description of the samtools tview display layout to the man page.
   Documents . vs , and upper vs lowercase. Adds a -s sample example, and
   documents the -w option. (PR#1765, fixes #1759. Reported by
   Lucas Ferreira da Silva)

* Clarify intention of samtools fasta/q in man page and soft vs hard
   clipping. (PR#1794, fixes #1792. Reported by Ryan Lorig-Roach)

* Minor fix to wording of mpileup --rf usage and man page. (PR#1795, fixes
   #1791. Reported by Luka Pavageau)

Non user-visible changes and build improvements:

* Use POSIX grep in testing as egrep and fgrep are considered obsolete.
   (PR#1726, thanks to David Seifert)

* Switch MacOS CI tests to an ARM-based image. (PR#1770)

------------------------------------------------------------------------------
bcftools - changes v1.17
------------------------------------------------------------------------------

Changes affecting the whole of bcftools, or multiple commands:

* The -i/-e filtering expressions

     - Error checks were added to prevent incorrect use of vector arithmetics.
       For example, when evaluating the sum of two vectors A and B, the
       resulting vector could contain nonsense values when the input vectors
       were not of the same length. The fix introduces the following logic:

         - evaluate to C_i = A_i + B_i when length(A)==length(B) and set
           length(C)=length(A)

         - evaluate to C_i = A_i + B_0 when length(B)=1 and set
           length(C)=length(A)

         - evaluate to C_i = A_0 + B_i when length(A)=1 and set
           length(C)=length(B)

         - throw an error when length(A)!=length(B) AND length(A)!=1 AND
           length(B)!=1

     - Arrays in Number=R tags can now be subscripted by alleles found in
       FORMAT/GT. For example,

  FORMAT/AD[GT] > 10        .. require support of more than 10 reads for
                               each allele
  FORMAT/AD[0:GT] > 10      .. same as above, but in the first sample
  sSUM(FORMAT/AD[GT]) > 20  .. require total sample depth bigger than 20

* The commands `consensus -H` and `+split-vep -H`

     - Drop unnecessary leading space in the first header column and newly
       print `#[1]columnName` instead of the previous `# [1]columnName`
       (#1856)
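
The vector-arithmetic rules listed above for the -i/-e filtering expressions
can be sketched in a few lines of Python.  This is only a model of the four
cases given in these notes, not the bcftools filter code; add_vectors is a
hypothetical name and plain Python lists stand in for the tag vectors.

```python
def add_vectors(a, b):
    """Element-wise sum following the four length rules from the notes.

    Hypothetical illustration; bcftools implements this in C inside its
    filtering expression evaluator.
    """
    if len(a) == len(b):
        # C_i = A_i + B_i, length(C) = length(A)
        return [x + y for x, y in zip(a, b)]
    if len(b) == 1:
        # C_i = A_i + B_0, length(C) = length(A)
        return [x + b[0] for x in a]
    if len(a) == 1:
        # C_i = A_0 + B_i, length(C) = length(B)
        return [a[0] + y for y in b]
    # Lengths differ and neither vector has length 1: refuse to guess.
    raise ValueError(f"incompatible vector lengths: {len(a)} and {len(b)}")

print(add_vectors([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
print(add_vectors([1, 2, 3], [10]))          # [11, 12, 13]
print(add_vectors([1], [10, 20]))            # [11, 21]
```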

Changes affecting specific commands:

* bcftools +allele-length

     - Fix overflow for indels longer than 512bp and aggregate alleles equal
       or larger than that in the same bin (#1837)

* bcftools annotate

     - Support sample reordering of annotation file (#1785)

     - Restore lost functionality of the --pair-logic option (#1808)

* bcftools call

     - Fix a bug where too many alleles passed to `-C alleles` via `-T` caused
       memory corruption (#1790)

     - Fix a bug where indels constrained with `-C alleles -T` would sometimes
       be missed (#1706)

* bcftools consensus

     - BREAKING CHANGE: the option `-I, --iupac-codes` newly outputs IUPAC
       codes based on FORMAT/GT of all samples. The `-s, --samples` and `-S,
       --samples-file` options can be used to subset samples. In order to
       ignore samples and consider only the REF and ALT columns (the original
       behavior prior to 1.17), run with `-s -` (#1828)

* bcftools convert

     - Make variantkey conversion work for sites without an ALT allele (#1806)

* bcftools csq

     - Fix a bug where a MNV with multiple consequences (e.g. missense +
       stop_gained) would report only the less severe one (#1810)

     - GFF file parsing was made slightly more flexible: IDs can now be just
       'XXX' rather than, for example, 'gene:XXX'

     - New gff2gff perl script to fix GFF formatting differences

* bcftools +fill-tags

     - More of the available annotations are now added by the `-t all` option

* bcftools +fixref

     - New INFO/FIXREF annotation

     - New -m swap mode

* bcftools +mendelian

     - The +mendelian plugin has been deprecated and replaced with
       +mendelian2. The function of the plugin is the same but the command
       line options and the output format have changed, and for this reason
       it was introduced as a new plugin.

* bcftools mpileup

     - Most of the annotations generated by mpileup are now optional via the
       `-a, --annotate` option, and several new (mostly experimental)
       annotations were added.

     - New option `--indels-2.0` for an EXPERIMENTAL indel calling model.
       This model aims to address some known deficiencies of the current
       indel calling algorithm, specifically, it uses diploid reference
       consensus sequence. Note that in the current version it has the
       potential to increase sensitivity but at the cost of decreased
       specificity.

     - Make the FS annotation (Fisher exact test strand bias) functional and
       remove it from the default annotations

* bcftools norm

     - New --multi-overlaps option allows overlapping alleles to be set
       either to the ref allele (the current default) or to a missing allele
       (#1764 and #1802)

     - Fixed a bug in `-m -` which did not split missing FORMAT values
       correctly and could lead to empty FORMAT fields such as `::` instead
       of the correct `:.:` (#1818)

     - The `--atomize` option previously would not split complex indels such
       as C>GGG. Newly these will be split into two records C>G and C>CGG
       (#1832)

* bcftools query

     - Fix a rare bug where the printing of SAMPLE field with `query` was
       incorrectly suppressed when the `-e` option contained a sample
       expression while the formatting query did not. See #1783 for details.

* bcftools +setGT

     - Add new `--new-gt X` option (#1800)

     - Add new `--target-gt r:FLOAT` option to randomly select a proportion of
       genotypes (#1850)

     - Fix a bug where `-t ./x` mode was advertised as selecting both phased
       and unphased half-missing genotypes, but was in fact selecting only
       unphased genotypes (#1844)

* bcftools +split-vep

     - New options `-g, --gene-list` and `--gene-list-fields` which make it
       possible to prioritize consequences from a list of genes, or restrict
       output to the listed genes

     - New `-H, --print-header` option to print the header with `-f`

     - Work around a bug in the LOFTEE VEP plugin used to annotate gnomAD
       VCFs. There the LoF_info subfield contains commas which, in
       general, makes it impossible to parse the VEP subfields. The
       +split-vep plugin can now work with such files, replacing the
       offending commas with slash (/) characters. See also
       https://github.com/Ensembl/ensembl-vep/issues/1351

     - Newly the `-c, --columns` option can be omitted when a subfield is used
       in `-i/-e` filtering expression. Note that `-c` may still have to be
       given when it is not possible to infer the type of the subfield. Note
       that this is an experimental feature.

* bcftools stats

     - The per-sample stats (PSC) would not be computed when `-i/-e` filtering
       options and the `-s -` option were given but the expression did not
       include sample columns (#1835)

* bcftools +tag2tag

     - Revamp of the plugin to allow wider range of tag conversions,
       specifically all combinations from FORMAT/GL,PL,GP to
       FORMAT/GL,PL,GP,GT

* bcftools +trio-dnm2

     - New `-n, --strictly-novel` option to downplay alleles which violate
       Mendelian inheritance but are not novel

     - Allow the `--pn` and `--pns` options to be set separately for SNVs and
       indels and make the indel settings more strict by default

     - Output missing FORMAT/VAF values in non-trio samples, rather than
       random nonsense values

* bcftools +variant-distance

     - New option `-d, --direction` to choose the directionality: forward,
       reverse, nearest (the default) or both (#1829)

-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.