From: Peter C. <p.j...@go...> - 2014-01-29 14:31:46
On Wed, Jan 29, 2014 at 1:13 PM, Nicolas Delhomme <nic...@um...> wrote:
> Hej!
>
> I'm working with Picea abies (Norway Spruce), which has a very large genome: ~20Gb. To keep things manageable for my analyses, I've had to combine some of the assembly scaffolds together (there are about 10 million of them in the first version of the genome assembly). By doing so, I ended up creating a few "artificial scaffolds" that combine a large portion of the small scaffolds of the assembly. I've named them Combined_XX (with XX being digits), e.g.:
>
> @SQ SN:MA_438123 LN:5000
> @SQ SN:MA_435237 LN:5000
> @SQ SN:MA_425275 LN:5000
> @SQ SN:MA_425220 LN:5000
> @SQ SN:MA_425195 LN:5000
> @SQ SN:MA_404953 LN:5000
> @SQ SN:MA_404599 LN:5000
> @SQ SN:MA_392375 LN:5000
> @SQ SN:MA_362962 LN:5000
> @SQ SN:MA_361986 LN:5000
> @SQ SN:MA_348485 LN:5000
> @SQ SN:MA_338196 LN:5000
> @SQ SN:MA_302379 LN:5000
> @SQ SN:MA_284766 LN:5000
> @SQ SN:MA_266241 LN:5000
> @SQ SN:MA_256988 LN:5000
> @SQ SN:MA_253284 LN:5000
> @SQ SN:MA_249818 LN:5000
> @SQ SN:MA_222315 LN:5000
> @SQ SN:chloroplast LN:124084
> @SQ SN:Combined_01 LN:1633613231
> @SQ SN:Combined_02 LN:924807789
> ...
> @SQ SN:Combined_20 LN:61500603
>
> When trying to index the resulting BAM files, I got the following error:
>
> [bam_index_core] the alignment is not sorted (HWI-ST588:186:C266AACXX:2:2315:3029:75947): 462867-th chr > 462848-th chr
> [bam_index_build2] fail to index the BAM file.
>
> although the BAM file is sorted. After some unsuccessful googling and more trial-and-error checks, I realised that my problem is related to the scaffold Combined_01. Filtering its reads out of my BAM file solved the error. Hence I suppose that the length of that scaffold exceeds the maximum size for which samtools index can store its information.
>
> In my current situation, I'll just create my "artificial scaffolds" differently to avoid that issue, but Picea abies has 12 chromosomes, so if we ever manage to assemble one fully, its size (all chromosomes in P. abies, and in conifers in general, are evenly sized) would be similar to the size of my Combined_01 scaffold, i.e. 1.6Gb. It would be great if samtools supported such long chromosomes :-)
>
> Let me know if you need any data from me to reproduce this issue,
>
> Best regards,
>
> Nicolas Delhomme

The BAI index is limited to chromosomes or references of 512 Mbp, i.e. 2^29 - 1 = 536870911 base pairs. Ideally samtools would give a clearer error here when attempting to build the index.

The CSI indexing scheme in the new HTSlib will fix this limitation, which is going to be particularly useful for plant genomes.

Peter
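
As a rough illustration of the limit described above, here is a minimal Python sketch that scans SAM header lines and reports any @SQ entry too long for a BAI index. The function name `references_too_long_for_bai` and the constant `BAI_MAX_LEN` are hypothetical names chosen for this example; they are not part of samtools or HTSlib. The header lines are a subset of those quoted in the original message.

```python
# Largest reference length a BAI index can handle, per the reply above.
BAI_MAX_LEN = 2**29 - 1  # 536870911


def references_too_long_for_bai(header_lines, limit=BAI_MAX_LEN):
    """Return (name, length) for every @SQ line whose LN exceeds the limit."""
    too_long = []
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        # Header fields are TAG:VALUE pairs. Splitting on any whitespace lets
        # the sketch accept tab-separated files and space-flattened text.
        fields = dict(f.split(":", 1) for f in line.split()[1:] if ":" in f)
        if "SN" in fields and "LN" in fields and int(fields["LN"]) > limit:
            too_long.append((fields["SN"], int(fields["LN"])))
    return too_long


# A few of the @SQ lines from the original message.
header = [
    "@SQ\tSN:MA_438123\tLN:5000",
    "@SQ\tSN:chloroplast\tLN:124084",
    "@SQ\tSN:Combined_01\tLN:1633613231",
    "@SQ\tSN:Combined_02\tLN:924807789",
    "@SQ\tSN:Combined_20\tLN:61500603",
]

for name, length in references_too_long_for_bai(header):
    print(f"{name}\t{length}\texceeds the BAI limit of {BAI_MAX_LEN}")
```

On these lines the check flags Combined_01 and Combined_02, since both are longer than 536870911 bp, while Combined_20 fits. In samtools releases built on HTSlib, `samtools index -c` requests a CSI index instead of BAI, which is the route to take for references this long.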