|
From: James B. <jk...@sa...> - 2017-04-13 11:51:24
|
On Mon, Apr 10, 2017 at 10:38:00AM +0000, Keiran Raine wrote:
> I think a main thing to consider is that often mixups on this level are generally by inexperienced users. Data sharing is common to collaborators (with appropriate agreements). These can have little expertise working with the actual raw data, often a PhD student thrown in at the deep end.
Actually thinking about this more, CRAM already has a second level of
protection. So before making any changes I really need to understand
what the real set of events was that triggered this issue.
My test:
# From the htslib/test dir, create a cram file using ce#2.sam and ce.fa
$ samtools view -C -t ce.fa ce#2.sam -o /tmp/ce2.cram
$ samtools view -t ce.fa /tmp/ce2.cram | wc -l
2
# Delete second line of ce.fa to produce a new /tmp/ce.fa
$ sed -n '1p;3,$p' ce.fa > /tmp/ce.fa
$ samtools faidx /tmp/ce.fa
$ samtools view -t /tmp/ce.fa /tmp/ce2.cram | wc -l
WARNING: Header @SQ length mismatch for ref CHROMOSOME_I, 1009800 vs 1009750
ERROR: md5sum reference mismatch for ref 0 pos 2..102
CRAM: 0d356c840daa0c943666a8e151a29182
Ref : 2486f9f32b4fb924b20fa7549984987d
Failure to decode slice
[main_samview] truncated file.
0
What's happening here is that there are multiple MD5 sums. The main
SAM header has the per @SQ value, but each *slice* in cram also has a
per slice md5sum of the reference it was aligned against. That case
spots this difference.
So I need to know what particular scenario crept through these checks.
James
--
James Bonfield (jk...@sa...) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
|