#9 Consider using fasta index (.fai) if present

2.0
closed
None
2012-11-30
2012-11-24
No

I see that there is debugging code that calls samtools faidx to get reference sequence information, but it slows Atlas2-Indel significantly because a new process has to be started and communicated with on every call.

An alternative approach would be to use the .fai file, if present, natively in Ruby for random access to the reference sequence. Although loading the chromosomes one at a time is not a big deal when processing the whole genome in one run, there is potential for significant improvement when chromosome-level BAM files are being processed, when "re-genotyping", or in other situations where a restricted subset of sites is being processed. (There's no point in reading chromosomes 1-17 into memory if you're only interested in chromosome 18.)
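To make the idea concrete, here is a minimal sketch of .fai-based random access in plain Ruby. It is not the code from the attached patch: the class name FastaIndex and the methods available? and fetch are hypothetical, chosen only for illustration. It assumes the standard five-column samtools faidx layout (name, length, byte offset, bases per line, bytes per line) and an index file located at the FASTA path plus ".fai".

```ruby
# Hypothetical helper for illustration; not the code from the attached patch.
class FastaIndex
  # One record per sequence, taken from the first five .fai columns.
  Entry = Struct.new(:length, :offset, :line_bases, :line_width)

  # True if an index file sits next to the FASTA file (assumed naming: <fasta>.fai).
  def self.available?(fasta_path)
    File.exist?("#{fasta_path}.fai")
  end

  def initialize(fasta_path)
    @fasta_path = fasta_path
    @entries = {}
    File.foreach("#{fasta_path}.fai") do |line|
      name, length, offset, line_bases, line_width = line.chomp.split("\t")
      @entries[name] = Entry.new(length.to_i, offset.to_i,
                                 line_bases.to_i, line_width.to_i)
    end
  end

  # Returns bases start..stop (1-based, inclusive) of the named sequence.
  # stop is clamped to the sequence length; an empty range returns "".
  def fetch(name, start, stop)
    entry = @entries.fetch(name)
    stop = [stop, entry.length].min
    return "" if start < 1 || start > stop

    first = byte_offset(entry, start - 1)
    last  = byte_offset(entry, stop - 1)
    File.open(@fasta_path, "rb") do |f|
      f.seek(first)
      # The byte range may span line breaks, so strip them after reading.
      f.read(last - first + 1).delete("\r\n")
    end
  end

  private

  # Byte position in the FASTA file of the base at 0-based position pos0.
  def byte_offset(entry, pos0)
    entry.offset + (pos0 / entry.line_bases) * entry.line_width +
      (pos0 % entry.line_bases)
  end
end
```

With something like this, a caller could check FastaIndex.available?(path) and fall back to loading whole chromosomes when no index exists. Each fetch costs one seek and one short read, which is where the "lot of seeks" trade-off comes from.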

With this in mind, the attached patch (against the Atlas-Indel-Remix branch) uses the .fai file if one is available. This does lead to a lot of seeks, but with OS caching the runtime is comparable to the existing method (with less disk I/O), and in some limited testing it is a good bit faster when a single-chromosome BAM file is provided (although, of course, it doesn't help for chromosome 1).

The title of this ticket includes 'consider' because more extensive testing may reveal this to be too slow in the general case. One option would be to use the FASTA index only when a limited number of sites (i.e. a BED file) is specified on the command line.

I could also provide a patch for the current trunk, but it looks like new development is in the remix branch.

1 Attachments

Discussion

  • Atlas2Team

    Atlas2Team - 2012-11-26
    • status: open --> accepted
    • assigned_to: Danny Challis
     
  • Danny Challis

    Danny Challis - 2012-11-26

    Thanks Bradford for your input and especially for the additional code. We are evaluating these changes and will discuss possible impact and advantages.

     
  • Danny Challis

    Danny Challis - 2012-11-30

    After evaluating the changes, it looks like under normal running circumstances they increase execution time by <1%. Given the improvement in other use cases, I think it is well worth the cost. I have rolled your changes into the latest Atlas-Indel-Remix commit. Thanks!

     
  • Danny Challis

    Danny Challis - 2012-11-30
    • status: accepted --> closed
     

