Does the order of FASTA files matter for indexing?

2009-04-16
2013-06-05
  • Matt Settles
    2009-04-16

    I'm using the UCSC version of the rat genome, which has each chromosome in a separate file: chromosomes 1-20 (e.g. chr01.fa), a 'random' sequence for each chromosome (e.g. chr01_random.fa), and the X, MT and UN sequences, for a total of 43 files. When I use cat to combine the files for a Bowtie index, chr10.fa comes first, chr10_random.fa next, and so forth. The random chromosomes and UN likely contain both unique and duplicated sequence and should be left in, but should not be given preference over a match to the actual chromosome. So, to the question: does the order of the FASTA input files matter when building the genome index, in cases where multiple matches occur during alignment?

    Matt Settles
    Bioinformatician
    Washington State University

     
    • Ben Langmead
      2009-04-16

      Hi Matt,

      When Bowtie encounters a family of alignments that are equally good, it randomly chooses one (or more, depending on -k) to report.  The order in which the sequences were specified initially is not factored in.

      Note that Bowtie cannot enforce these types of preferences: "The random chromosomes and UN likely contain both unique and duplicated sequence, and should be left in, but not given preference over a match to the actual chromosome."

      To enforce a preference like this, you will generally need to build multiple indexes (in your case, one for the genome and one for the UN/random sequences), query them separately, and combine the results appropriately.

      Ben
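
The two-index approach described above could be sketched roughly as follows. This is only an illustration, not something from the thread: the helper names (`split_fasta_files`, `load_hits`, `combine_hits`) are hypothetical, the file name `chrUn.fa` is an assumed name for the UN sequence, and the alignment output is assumed to be tab-separated with the read name in the first column. The merge policy shown (a hit on an actual chromosome always wins, and random/UN hits are used only for reads that have no such hit) is one possible reading of "combine results appropriately".

```python
from pathlib import Path


def split_fasta_files(paths):
    """Partition FASTA paths into (primary, secondary) groups.

    Assumption: random sequences end in '_random' and the UN sequence is
    stored as 'chrUn.fa'; adjust the test to match the real file names.
    """
    primary, secondary = [], []
    for p in sorted(paths):
        stem = Path(p).stem.lower()
        if stem.endswith("_random") or stem == "chrun":
            secondary.append(p)
        else:
            primary.append(p)
    return primary, secondary


def load_hits(lines):
    """Map read name -> first alignment line seen for that read.

    Assumption: one alignment per line, tab-separated, read name first.
    """
    hits = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        read_name = line.split("\t", 1)[0]
        hits.setdefault(read_name, line)
    return hits


def combine_hits(primary_hits, secondary_hits):
    """Prefer primary-index hits; keep secondary hits only as a fallback."""
    combined = dict(secondary_hits)
    combined.update(primary_hits)  # primary overwrites secondary per read
    return combined
```

Each group returned by `split_fasta_files` would be concatenated and indexed separately, the reads aligned against both indexes, and the two outputs passed through `load_hits` and then `combine_hits`. A stricter policy, such as comparing mismatch counts before preferring the chromosome hit, would need a different merge step.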