From: Robert D. <rm...@sa...> - 2021-03-17 16:29:43
Samtools (and HTSlib and BCFtools) version 1.12 is now available from GitHub and SourceForge.

https://sourceforge.net/projects/samtools/
https://github.com/samtools/htslib/releases/tag/1.12
https://github.com/samtools/samtools/releases/tag/1.12
https://github.com/samtools/bcftools/releases/tag/1.12

The main changes are listed below:

------------------------------------------------------------------------------
htslib - changes v1.12
------------------------------------------------------------------------------

Features and Updates
--------------------

* Added experimental CRAM 3.1 and 4.0 support. (#929)

  These should not be used for long term data storage as the specification still needs to be ratified by GA4GH and may be subject to changes in format. (This is highly likely for 4.0). However it may be tested using:

      test/test_view -t ref.fa -C -o version=3.1 in.bam -p out31.cram

  For smaller but slower files, try varying the compression profile with an additional "-o small". Profile choices are fast, normal, small and archive, and can be applied to all CRAM versions.

* Added a general filtering syntax for alignment records in SAM/BAM/CRAM readers. (#1181, #1203)

  An example to find chromosome spanning read-pairs with high mapping quality:

      'mqual >= 30 && mrname != rname'

  To find significant sized deletions:

      'cigar =~ "[0-9]{2}D"' or 'rlen - qlen > 10'

  To report duplicates that aren't part of a "proper pair":

      'flag.dup && !flag.proper_pair'

  More details are in the samtools.1 man page under "FILTER EXPRESSIONS".

* The knet networking code has been removed. It only supported the http and ftp protocols, and a better and safer alternative using libcurl has been available since release 1.3. If you need access to ftp:// and http:// URLs, HTSlib should be built with libcurl support. (#1200)

* The old htslib/knetfile.h interfaces have been marked as deprecated. Any code still using them should be updated to use hFILE instead.
  (#1200)

* Added an introspection API for checking some of the capabilities provided by HTSlib. (#1170) Thanks also to John Marshall for contributions. (#1222)

  - `hfile_list_schemes`: returns the number of schemes found
  - `hfile_list_plugins`: returns the number of plugins found
  - `hfile_has_plugin`: checks if a specific plugin is available
  - `hts_features`: returns a bit mask with all available features
  - `hts_test_feature`: test if a feature is available
  - `hts_feature_string`: return a string summary of enabled features

* Made performance improvements to `probaln_glocal` method, which speeds up mpileup BAQ calculations. (#1188)

  - Caching of reused loop variables and removal of loop invariants
  - Code reordering to remove instruction latency.
  - Other refactoring and tidyups.

* Added a public method for constructing a BAM record from the component pieces. Thanks to Anders Kaplan. (#1159, #1164)

* Added two public methods, `sam_parse_cigar` and `bam_parse_cigar`, as part of a small CIGAR API (#1169, #1182). Thanks to Daniel Cameron for input. (#1147)

* HTSlib, and the included htsfile program, will now recognise the old RAZF compressed file format. Note that while the format is detected, HTSlib is unable to read it. It is recommended that RAZF files are uncompressed with `gunzip` before using them with HTSlib. Thanks to John Marshall (#1244); and Matthew J. Oldach who reported problems with uncompressing some RAZF files (samtools/samtools#1387).

* The S3 plugin now has options to force the address style. It will recognise the addressing_style and host_bucket entries in the respective aws .credentials and s3cmd .s3cfg files. There is also a new HTS_S3_ADDRESS_STYLE environment variable. Details are in the htslib-s3-plugin.7 man file (#1249).

Build changes
-------------

These are compiler, configuration and makefile based changes.

* Added new Makefile targets for the applications that embed HTSlib and want to run its test suite or clean its generated artefacts.
  (#1230, #1238)

* The CRAM codecs are now obtained via the htscodecs submodule, hence when cloning it is now best to use "git clone --recursive". In an existing clone, you may use "git submodule update --init" to obtain the htscodecs submodule checkout.

* Updated CI test configuration to recurse HTSlib submodules. (#1359)

* Added Cirrus-CI integration as a replacement for Travis, which was phased out. (#1175; #1212)

* Updated the Windows image used by Appveyor to 'Visual Studio 2019'. (#1172; fixed #1166)

* Fixed a buglet in configure.ac, exposed by the release 2.70 of autoconf. Thanks to John Marshall. (#1198)

* Fixed plugin linking on macOS, to prevent symbol conflict when linking with a static HTSlib. Thanks to John Marshall. (#1184)

* Fixed a clang++9 error in `cram_io.h`. Thanks to Pjotr Prins. (#1190)

* Introduced $(ALL_CPPFLAGS) to allow for more flexibility in setting the compiler flags. Thanks to John Marshall. (#1187)

* Added 'fall through' comments to prevent warnings issued by Clang on intentional fall through case statements, when building with the `-Wextra` flag. Thanks to John Marshall. (#1163)

* Non-configure builds now define _XOPEN_SOURCE=600 to allow them to work when the `gcc -std=c99` option is used. Thanks to John Marshall. (#1246)

Bug fixes
---------

* Fixed VCF `#CHROM` header parsing to only separate columns at tab characters. Thanks to Sam Morris for reporting the issue. (#1237; fixed samtools/bcftools#1408)

* Fixed a crash reported in `bcf_sr_sort_set`, which expects REF to be present. (#1204; fixed samtools/bcftools#1361)

* Fixed a bcf synced reader bug when filtering with a region list, and the first record for a chromosome had the same position as the last record for the previous chromosome. (#1254; fixed samtools/bcftools#1441)

* Fixed a bug in the overlapping logic of mpileup, dealing with iterating over CIGAR segments. Thanks to `@wulj2` for the analysis.
  (#1202; fixed #1196)

* Fixed a tabix bug that prevented setting the correct number of lines to be skipped in a region file. Thanks to Jim Robinson for reporting it. (#1189; fixed #1186)

* Made `bam_itr_next` an alias for `sam_itr_next`, to prevent it from crashing when working with htsFile pointers. Thanks to Torbjörn Klatt for reporting it. (#1180; fixed #1179)

* Fixed once per outgoing multi-threaded block `bgzf_idx_flush` assertion, to accommodate situations when a single record could span multiple blocks. Thanks to `@lacek`. (#1168; fixed samtools/samtools#1328)

* Fixed assumption of pthread_t being a non-structure, as permitted by POSIX. Thanks also to John Marshall and Anders Kaplan. (#1167, #1153, #1153)

* Fixed the minimum offset of a BAI index bin, to account for unmapped reads. Thanks to John Marshall for spotting the issue. (#1158; fixed #1142)

* Fixed the CRLF handling in `sam_parse_worker` method. Thanks to Anders Kaplan. (#1149; fixed #1148)

* Included unistd.h and errno.h directly in HTSlib files, as opposed to including them indirectly, via third party code. Thanks to Andrew Patterson (#1143) and John Marshall (#1145).

------------------------------------------------------------------------------
samtools - changes v1.12
------------------------------------------------------------------------------

* The legacy samtools API (libbam.a, bam.h, sam.h, etc) has not been actively maintained since 2015. It is deprecated and will be removed entirely in a future SAMtools release. We recommend coding against the HTSlib API directly.

* I/O errors and record parsing errors during the reading of SAM/BAM/CRAM files are now always detected. Thanks to John Marshall (#1379; fixed #101)

* New make targets have been added: check-all, test-all, distclean-all, mostlyclean-all, testclean-all, which allow SAMtools installations to call corresponding Makefile targets from embedded HTSlib installations.
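
As a rough illustration of the deletion filter quoted in the HTSlib notes ('cigar =~ "[0-9]{2}D"'), the Python sketch below applies the same regular expression to a few CIGAR strings. It is not HTSlib code. It assumes that `=~` performs an unanchored regular-expression search, and the CIGAR strings and the helper name `has_large_deletion` are made up for the example.

```python
import re

# Pattern copied from the filter example: two digits directly before a D op.
DELETION_2_DIGITS = re.compile(r"[0-9]{2}D")


def has_large_deletion(cigar: str) -> bool:
    """Return True if some deletion length in the CIGAR has at least two digits."""
    return DELETION_2_DIGITS.search(cigar) is not None


# Hypothetical CIGAR strings, invented for illustration only.
examples = ["50M12D38M", "50M2D48M", "30M150D20M", "100M"]
matches = [c for c in examples if has_large_deletion(c)]
print(matches)
```

Only the strings with a deletion of 10 bases or more are selected. A single-digit deletion such as `2D` is skipped because the pattern needs two digits immediately before the `D`.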

* samtools --version now displays a summary of the compilation details and available features, including flags, used libraries and enabled plugins from HTSlib. As an alias, `samtools version` can also be used. (#1371)

* samtools stats now displays the number of supplementary reads in the SN section. Also, supplementary reads are no longer considered when splitting read pairs by orientation (inward, outward, other). (#1363)

* samtools stats now counts only the filtered alignments that overlap target regions, if any are specified. (#1363)

* samtools view now accepts option -N, which takes a file containing read names of interest. This allows the output of only the reads with names contained in the given file. Thanks to Daniel Cameron. (#1324)

* samtools view -d option now works without a tag associated value, which allows it to output all the reads with the given tag. (#1339; fixed #1317)

* samtools view -d and -D options now accept integer and single character values associated with tags, not just strings. Thanks to `@dariome` and Keiran Raine for the suggestions. (#1357, #1392)

* samtools view now works with the filtering expressions introduced by HTSlib. The filtering expression is passed to the program using the specific option -e or the global long option --input-fmt-option. E.g.

      samtools view -e 'qname =~ "#49$" && mrefid != refid && refid != -1 && mrefid != -1' align.bam

  looks for records with query-name ending in `#49` that have their mate aligned in a different chromosome. More details can be found in the FILTER EXPRESSIONS section of the main man page. (#1346)

* samtools markdup now benefits from an increase in performance in the situation when a single read has tens or hundreds of thousands of duplicates. Thanks to `@denriquez` for reporting the issue. (#1345; fixed #1325)

* The documentation for samtools ampliconstats has been added to the samtools man page.
  (#1351)

* A new FASTA/FASTQ sanitizer script (`fasta-sanitize.pl`) was added, which corrects the invalid characters in the reference names. (#1314) Thanks to John Marshall for the installation fix. (#1353)

* The CI scripts have been updated to recurse the HTSlib submodules when cloning HTSlib, to accommodate for the CRAM codecs, which now reside in the htscodecs submodule. (#1359)

* The CI integrations now include Cirrus-CI rather than Travis. (#1335; #1365)

* Updated the Windows image used by Appveyor to 'Visual Studio 2019'. (#1333; fixed #1332)

* Fixed a bug in samtools cat, which prevented the command from running in multi-threaded mode. Thanks to Alex Leonard for reporting the issue. (#1337; fixed #1336)

* A couple of invalid CIGAR strings have been corrected in the test data. (#1343)

* The documentation for `samtools depth -s` has been improved. Thanks to `@wulj2`. (#1355)

* Fixed a `samtools merge` segmentation fault when it failed to merge header `@PG` records. Thanks to John Marshall. (#1394; reported by Kemin Zhou in #1393)

* Ampliconclip and ampliconstats now guard against the BED file containing more than one reference (chromosome) and fail when found. Adding proper support for multiple references will appear later. (#1398)

------------------------------------------------------------------------------
bcftools - changes v1.12
------------------------------------------------------------------------------

Changes affecting the whole of bcftools, or multiple commands:

* The output file type is determined from the output file name suffix, where available, so the -O/--output-type option is often no longer necessary.

* Make F_MISSING in filtering expressions work for sites with multiple ALT alleles (#1343)

* Fix N_PASS and F_PASS to behave according to expectation when reverse logic is used (#1397).
  This fix has the side effect of `query` (or programs like `+trio-stats`) behaving differently with these expressions, operating now in site-oriented rather than sample-oriented mode. For example, the new behavior could be:

      bcftools query -f'[%POS %SAMPLE %GT\n]' -i'N_PASS(GT="alt")==1'
      11 A 0/0
      11 B 0/0
      11 C 1/1

  while previously the same expression would return:

      11 C 1/1

  The original mode can be mimicked by splitting the filtering into two steps:

      bcftools view -i'N_PASS(GT="alt")==1' | \
        bcftools query -f'[%POS %SAMPLE %GT\n]' -i'GT="alt"'

Changes affecting specific commands:

* bcftools annotate:

  - New `--rename-annots` option to help fix broken VCFs (#1335)

  - New -C option allows a long list of options to be read from a file, to prevent very long command lines.

  - New `append-missing` logic allows annotations to be added for each ALT allele in the same order as they appear in the VCF. Note that this is not bullet proof. In order for this to work:

    - the annotation file must have one line per ALT allele
    - fields must contain a single value as multiple values are appended as they are and would break the correspondence between the alleles and values

* bcftools concat:

  - Do not phase genotypes by mistake if they are not already phased with `-l` (#1346)

* bcftools consensus:

  - New `--mask-with`, `--mark-del`, `--mark-ins`, `--mark-snv` options (#1382, #1381, #1170)

  - Symbolic <DEL> should have only one REF base. If there are multiple, take POS+1 as the first deleted base.

  - Make consensus work when the first base of the reference genome is deleted. In this situation the VCF record has POS=1 and the first REF base cannot precede the event. (#1330)

* bcftools +contrast:

  - The NOVELGT annotation was previously not added when requested.

* bcftools convert:

  - Make the --hapsample and --hapsample2vcf options consistent with each other and with the documentation.
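
The difference between the site-oriented and sample-oriented behaviour of N_PASS(GT="alt")==1, described under the changes affecting the whole of bcftools, can be modelled with a small Python sketch. This is a toy model and not how bcftools is implemented. The function names are hypothetical, and `is_alt` is a simplified stand-in for GT="alt" that treats any allele other than 0 or missing as alternate.

```python
from typing import List, Tuple

Record = Tuple[int, str, str]  # (POS, SAMPLE, GT)


def is_alt(gt: str) -> bool:
    """Simplified stand-in for GT="alt": any allele other than 0 or missing."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


def site_oriented(site: List[Record], n: int) -> List[Record]:
    """New behaviour: print every sample at the site when exactly n samples pass."""
    n_pass = sum(is_alt(gt) for _, _, gt in site)
    return list(site) if n_pass == n else []


def sample_oriented(site: List[Record], n: int) -> List[Record]:
    """Earlier behaviour, or the two-step workaround: printed samples must also pass."""
    return [rec for rec in site_oriented(site, n) if is_alt(rec[2])]


# The three samples at position 11 from the example in the release notes.
site = [(11, "A", "0/0"), (11, "B", "0/0"), (11, "C", "1/1")]

new_rows = site_oriented(site, 1)
old_rows = sample_oriented(site, 1)
for pos, sample, gt in new_rows:
    print("new:", pos, sample, gt)
for pos, sample, gt in old_rows:
    print("old:", pos, sample, gt)
```

In this model the site-level test decides whether the site is kept at all, and the second, per-sample test corresponds to the extra -i'GT="alt"' step in the two-step command.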

* bcftools call:

  - Revamp of `call -G`, previously sample grouping by population was not truly independent and could still be influenced by the presence of other sample groups.

  - Optional addition of INFO/PV4 annotation with `call -a INFO/PV4`

  - Remove generation of useless HOB and ICB annotation; use `+fill-tags -- -t HWE,ExcHet` instead

  - The `call -f` option was renamed to `-a` to (1) make it consistent with `mpileup` and (2) to indicate that it includes both INFO and FORMAT annotations, not just FORMAT as previously

  - Any sensible Number=R,Type=Integer annotation can be used with -G, such as AD or QS

  - Don't trim QUAL; although usefulness of this change is questionable for true probabilistic interpretation (such high precision is unrealistic), using QUAL as a score rather than probability is helpful and permits more fine-grained filtering

  - Fix a suspected bug in `call -F` in the worst case, for certain improve readability

  - `call -C trio` is temporarily disabled

* bcftools csq:

  - Fix a bug which caused incorrect FORMAT/BCSQ formatting at sites with too many per-sample consequences

  - Fix a bug which incorrectly handled the --ncsq parameter and could clash with reserved BCF values, consequently producing truncated or even incorrect output of the %TBCSQ formatting expression in `bcftools query`. To account for the reserved values, the new default value is --ncsq 15 (#1428)

* bcftools +fill-tags:

  - MAF definition revised for multiallelic sites, the second most common allele is considered to be the minor allele (#1313)

  - New FORMAT/VAF, VAF1 annotations to set the fraction of alternate reads provided FORMAT/AD is present

* bcftools gtcheck:

  - support matching of a single sample against all other samples in the file with `-s qry:sample -s gt:-`.
    This was previously not possible, either full cross-check mode had to be run or a list of pairs/samples had to be created explicitly

* bcftools merge:

  - Make `merge -R` behavior consistent with other commands and pull in overlapping records with POS outside of the regions (#1374)

  - Bug fix (#1353)

* bcftools mpileup:

  - Add new optional tag `mpileup -a FORMAT/QS`

* bcftools norm:

  - New `-a, --atomize` functionality to decompose complex variants, for example MNVs into consecutive SNVs

  - New option `--old-rec-tag` to indicate the original variant

* bcftools query:

  - Incorrect fields were printed in the per-sample output when subset of samples was requested via -s/-S and the order of samples in the header was different from the requested -s/-S order (#1435)

* bcftools +prune:

  - New options --random-seed and --nsites-per-win-mode (#1050)

* bcftools +split-vep:

  - Transcript selection now works also on the raw CSQ/BCSQ annotation.

  - Bug fix, samples were dropped on VCF input and VCF/BCF output (#1349)

* bcftools stats:

  - Changes to QUAL and ts/tv plotting stats: avoid capping QUAL to predefined bins, use an open-range logarithmic binning instead

  - plot dual ts/tv stats: per quality bin and cumulative as if threshold applied on the whole dataset

* bcftools +trio-dnm2:

  - Major revamp of +trio-dnm plugin, which is now deprecated and replaced by +trio-dnm2. The original trio-dnm calling model used genotype likelihoods (PLs) as the input for calling. However, that is flawed because PLs make assumptions which are unsuitable for de novo calling: PL(RR) can become bigger than PL(RA) even when the ALT allele is present in the parents. Note that this is true also for other programs such as DeNovoGear which rely on the same samtools calculation.

    The new recommended workflow is:

        bcftools mpileup -a AD,QS -f ref.fa -Ou \
            proband.bam father.bam mother.bam | \
            bcftools call -mv -Ou | \
            bcftools +trio-dnm -p proband,father,mother -Oz -o output.vcf.gz

    This new version also implements the DeNovoGear model. The original behavior of trio-dnm is no longer supported.

    For more details see http://samtools.github.io/bcftools/trio-dnm.pdf

--
The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.