#1 Support BAM files with sequence names missing 'chr' prefix

2.0
open
nobody
2018-03-08
2018-03-08
No

The problem

Using a BAM file that is aligned against, for example, GRCh37 instead of hg19 means that its sequence names do not contain the chr prefix, i.e. they are 1, 2, 3, 4, 5, ... (instead of chr1, chr2, chr3, chr4, chr5, ...). This causes SignatureGenerator#createComparatorFromSAMHeader to create an ill-defined comparator, because it does not check whether contig.getSequenceName() returns a string with a chr prefix. As a result, only the coverage of variants on chromosome 1 is calculated correctly; the others are returned with TOTAL:0 coverage.
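
The following is a minimal, self-contained sketch of one way such a naming mismatch can break a header-derived comparator. It is not the SignatureGenerator code: the class ContigOrderDemo, its methods, and the rank-by-header-position strategy are assumptions made for illustration only. It ranks contigs by their position in the header and shows that names with a chr prefix cannot be ordered when the header lists 1, 2, 3, ...

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not taken from SignatureGenerator.
public class ContigOrderDemo {

    // Rank each contig name by its position in the header's sequence list.
    static Map<String, Integer> rankByHeaderOrder(List<String> headerNames) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < headerNames.size(); i++) {
            rank.put(headerNames.get(i), i);
        }
        return rank;
    }

    // Comparator that orders contig names by header rank.
    // Names that are not in the header all get rank -1 (assumption for this
    // sketch), so they compare as equal to each other.
    static Comparator<String> comparatorFor(List<String> headerNames) {
        Map<String, Integer> rank = rankByHeaderOrder(headerNames);
        return Comparator.comparingInt(name -> rank.getOrDefault(name, -1));
    }

    public static void main(String[] args) {
        // Header of a BAM aligned against a reference without the chr prefix.
        Comparator<String> c = comparatorFor(Arrays.asList("1", "2", "3"));

        // Names that match the header are ordered as expected (negative result).
        System.out.println(c.compare("1", "2"));

        // Prefixed names are unknown to the comparator and compare as equal.
        System.out.println(c.compare("chr1", "chr2"));
    }
}
```

In this sketch, any lookup that uses chr1, chr2, ... against a chr-less header loses its ordering, which is the kind of inconsistency that normalising the names on both sides avoids.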

How to reproduce

See the test case in the attached patch (details below).

Or use a BAM file with chr-less sequence names and run SignatureGenerator against a file that contains SNPs from multiple chromosomes; the coverage will only be nonzero for variants on chromosome 1.

Attached Patch

This patch moves the existing code that checks for the prefix from
SignatureGenerator#match to a separate function, SignatureGenerator#ensureChrPrefix(String),
which is also tested in SignatureGeneratorTest.
The rest of the code already assumes that the chr prefix may be missing from the sequence names in the SAM records.
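
For readers without the attachment at hand, a helper of this kind could look roughly like the sketch below. Only the method name ensureChrPrefix(String) comes from the patch description; the body, the null/empty handling, and the wrapper class ChrPrefixSketch are assumptions, and the patched method may treat special contigs (for example mitochondrial or unplaced sequences) differently.

```java
// Hypothetical sketch; the body of the real ensureChrPrefix may differ.
public class ChrPrefixSketch {

    // Return the sequence name with a leading "chr", adding it only if missing.
    // Null and empty names are returned unchanged (assumption for this sketch).
    static String ensureChrPrefix(String sequenceName) {
        if (sequenceName == null || sequenceName.isEmpty()
                || sequenceName.startsWith("chr")) {
            return sequenceName;
        }
        return "chr" + sequenceName;
    }

    public static void main(String[] args) {
        System.out.println(ensureChrPrefix("1"));    // name without the prefix
        System.out.println(ensureChrPrefix("chr1")); // name that already has it
    }
}
```

Applying the same normalisation both when the comparator is built from the header and when positions are matched keeps the two sides consistent, regardless of which naming scheme the BAM file uses.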

To make sure the new case is covered, the patch also extends the helpers in
SignatureGeneratorTest to write a BAM file with reference sequence names
1, 2, 3, 4, 5 (instead of chr1, chr2, chr3, chr4, chr5)
and extends SignatureGeneratorTest#testCreateComparatorFromSAMHeader
to test against this case.

1 Attachments