From: Thomas W. B. <tb...@um...> - 2016-01-21 19:08:39
Rob -

Thanks very much. Yes, I mistyped. The error message DOES say
"Slice ends beyond reference end", not "before". I was puzzled
that this was less than halfway through chromosome 13.

I believe that yes, there were correct results using that cache
directory before the error started happening. But it's hard to go
back and verify that belief. The errors were all on one machine,
and the same job ran fine on other machines at the same time. At
that time, we were simply using the default value for REF_CACHE,
so, a separate cache directory for each username x machine (local
homes).

A couple of days after the error had happened, and before rebuilding
the cache, I logged in as user topmed to look at the suspect copy of
the file. The file date was about a month before the error had
started; the md5sum was exactly what it should have been; I copied
the suspect copy of the file to my own directory to run 'diff'
against a different, good copy of the file -- no difference.

Your explanation of a truncated copy in the machine's temporary
buffer cache makes a lot of sense. I was checking the cache directory
several days after we had shifted the production jobs to another
machine, so it's quite likely that any buffer cache copy was long
gone and I was again reading directly from the disk.

And, to answer James' question, turns out I was working from the
samtools version 1.1 manpage. The explanation under ENVIRONMENT
VARIABLES is much improved in the version 1.2 manpage.

- thanks - tom blackwell -

On Thu, 21 Jan 2016, Robert Davies wrote:

> On Wed, 20 Jan 2016, Thomas W. Blackwell wrote:
>
>> Terry has asked me to chime in here. Yes, the pathname is
>> mistyped. It should be:
>>
>> /home/topmed/.cache/hts-ref/28/3f/8d7892baa81b510a015719ca7b0b
>>
>> This is chromosome 13 from the hs37d5.fa 1000 Genomes reference
>> sequence.
>>
>> We were using:
>>   samtools view -b -@ 3 -o newfile.bam oldfile.cram
>> to convert from .cram back to .bam. With and without adding "-T
>> ref.fa" to the command line. This is samtools version "1.2 Using
>> htslib 1.2". I don't know the kernel or OS details.
>>
>> The samtools error message is:
>>
>> "Slice ends before reference end.
>> ERROR: md5sum reference mismatch for ref 13 pos 46545406..46565374
>> CRAM: 5254f5764a5f27def1e306b7f047e027
>> Ref: ee40c1a58c458fe6ceada3818d41fcea
>> Failure to decode slice
>> [main_samview] truncated file."
>
> Was that "Slice ends beyond reference end."?
>
> You get this message when the bit of reference referred to in a cram file
> goes beyond the end of the reference itself. It gets the length of the
> reference from the size of the cached file, and that should be the second
> number printed out in the "ERROR: md5sum reference mismatch" line. This is
> why the ending location is always the same.
>
> 28/3f/8d7892baa81b510a015719ca7b0b is 115169878 bytes long, but the message
> above says 46565374, which suggests that your cache file was truncated
> somehow.
>
>> Once the problem starts happening, it will continue for hundreds of
>> successive .cram files. Always exactly the same ending location for
>> the bad slice, but the starting locations vary by +-20 kb, and both
>> checksums differ from file to file.
>>
>> Of course the confusing thing is that when I try to reproduce the
>> error, running as myself -- hence using MY reference cache file
>> /home/tblackw/.cache/..., even on the same machine -- there is no
>> error for each of the hundreds of .cram files ! And, once we FOUND
>> where the cached reference files were and rebuilt topmed's cache,
>> again no errors. [I only FOUND the cache thanks to a newly added
>> paragraph at the end of Rob Davies's script seq_cache_populate.pl.
>> It would be very helpful to state the default cache location at the
>> end of the samtools man page.]
>>
>> Before we had re-built the cache, I ran 'md5sum' on both the bad copy
>> and my good copy of the chromosome 13 file, and both md5sums matched
>> the file name (and the file names were identical). Then I ran 'diff'
>> on the good and bad copies -- again, diff exited normally.
>
> This is very suspicious.
>
> Was the broken cache file always bad, or did you get some good runs followed
> by a series of bad ones (i.e. did it break at some point after the file had
> been created)?
>
> Were all the failures run on a single machine, or did you get failures on
> more than one host?
>
> Did you run your md5sum and diff on the same machine that samtools crashed
> on, or a different one?
>
>> We don't understand how the cache can possibly get corrupted -- but,
>> simply rebuilding it removes the error.
>
> If it happens again, try running md5sum on different machines (including the
> one where samtools failed) to see if all the checksums are the same.
>
> If they're different, on one of the machines that gives a *good* md5, rename
> the file and then copy it back to the original name, i.e.:
>
>   mv 8d7892baa81b510a015719ca7b0b 8d7892baa81b510a015719ca7b0b.bak
>   cp 8d7892baa81b510a015719ca7b0b.bak 8d7892baa81b510a015719ca7b0b
>
> Then try md5sum and samtools on the machine that gave a *bad* md5 to see if
> that cures the problem. If it does then that machine has probably managed to
> get a bad copy of the file in its buffer cache. That shouldn't happen, so
> you'd then want to probe that machine to find out why.
>
> Rob Davies rmd...@sa...
> The Sanger Institute http://www.sanger.ac.uk/
> Hinxton, Cambs., Tel. +44 (1223) 834244
> CB10 1SA, U.K. Fax. +44 (1223) 494919
>
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a
> charity registered in England with number 1021457 and a company registered in
> England with number 2742969, whose registered office is 215 Euston Road,
> London, NW1 2BE.
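
The check suggested above (run md5sum on each machine and compare, and compare the file length against the expected 115169878 bytes) could be scripted roughly as follows. This is only a sketch, not part of samtools or htslib: the function names are made up for illustration, and the aa/bb/rest directory split is inferred from the cache path quoted in this thread. It hashes and counts whatever bytes the local machine actually returns for the file, so it has to be run on the machine under suspicion to say anything about that machine.

```python
import hashlib
import os


def md5_and_size(path, chunk_size=1 << 20):
    """Return (hex md5, byte count) of the bytes this machine reads from path."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def cache_relpath(md5_hex):
    """Split an md5 as aa/bb/rest, the layout seen in the path quoted above."""
    return os.path.join(md5_hex[:2], md5_hex[2:4], md5_hex[4:])


def check_cache_file(path, expected_md5, expected_size=None):
    """Return a list of problems; an empty list means nothing looked wrong."""
    problems = []
    actual_md5, actual_size = md5_and_size(path)
    if actual_md5 != expected_md5:
        problems.append(f"md5 mismatch: got {actual_md5}, expected {expected_md5}")
    if expected_size is not None and actual_size != expected_size:
        problems.append(f"size mismatch: got {actual_size}, expected {expected_size}")
    return problems


if __name__ == "__main__":
    # Values taken from this thread: chromosome 13 of hs37d5.fa.
    md5 = "283f8d7892baa81b510a015719ca7b0b"
    # Assumes the cache lives under ~/.cache/hts-ref, as in the path above.
    path = os.path.join(os.path.expanduser("~/.cache/hts-ref"), cache_relpath(md5))
    if os.path.exists(path):
        for line in check_cache_file(path, md5, 115169878) or ["no problems found"]:
            print(line)
    else:
        print("not cached:", path)
```

Running the same script on a known-good machine and on the failing one, and comparing the output, is the scripted equivalent of the md5sum comparison described above. A size mismatch with a matching on-disk file would point toward a truncated in-memory copy rather than a damaged file.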