#99 Can't index more than 10,000 fasta files

Status: closed
Owner: nobody
Priority: 5
Updated: 2013-01-08
Created: 2010-08-27
Private: No

bowtie-build can't handle a comma-separated list of reference FASTA files when the list contains around 10,000 files or more. Handling this is desirable because crossbow requires bowtie-build to be called with a comma-separated list of the chromosome FASTA files, e.g. "bowtie-build chr0.fa,chr1.fa,...,chr23.fa index". When the reference sequences are contigs instead of chromosomes, however, the list can contain thousands of files.

Increasing a limit in tokenize.h appears to fix the issue (though I'm not sure whether a limit as large as INT_MAX could break something else):

####################################################
--- tokenize.h.orig 2010-08-09 15:26:15.828839535 -0500
+++ tokenize.h 2010-08-27 12:27:50.552778333 -0500
@@ -8,6 +8,7 @@
 #ifndef TOKENIZE_H_
 #define TOKENIZE_H_
 
+#include <climits>
 #include <string>
 #include <sstream>
 #include <vector>
@@ -21,7 +22,7 @@
 static inline void tokenize(const string& s,
                             const string& delims,
                             vector<string>& ss,
-                            size_t max = 9999)
+                            size_t max = INT_MAX)
 {
     string::size_type lastPos = s.find_first_not_of(delims, 0);
     string::size_type pos = s.find_first_of(delims, lastPos);
####################################################
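
To illustrate why a default cap of 9999 matters, here is a minimal sketch of a capped tokenizer. The patch only shows the signature and the first two statements of tokenize(), so the loop body below is an assumption: it supposes that once max - 1 tokens have been collected, the unsplit remainder is folded into one final token. The name tokenizeSketch is hypothetical and is not part of bowtie.

```cpp
#include <climits>
#include <string>
#include <vector>

// Hypothetical stand-in for tokenize(); the loop body is assumed, not
// taken from the bowtie sources.
inline void tokenizeSketch(const std::string& s,
                           const std::string& delims,
                           std::vector<std::string>& out,
                           std::size_t max = 9999)
{
    std::string::size_type lastPos = s.find_first_not_of(delims, 0);
    std::string::size_type pos = s.find_first_of(delims, lastPos);
    while (pos != std::string::npos || lastPos != std::string::npos) {
        // Copy the current token (substr clamps the length at end of string).
        out.push_back(s.substr(lastPos, pos - lastPos));
        lastPos = s.find_first_not_of(delims, pos);
        pos = s.find_first_of(delims, lastPos);
        // Assumed cap: stop splitting so the next token takes the remainder.
        if (out.size() == max - 1) {
            pos = std::string::npos;
        }
    }
}
```

Under that assumption, a list longer than the cap would end with one token that still contains commas, which would then be treated as a single (nonexistent) file name. Passing INT_MAX as the cap, as in the patch, effectively disables it.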

In addition, a user must set "ulimit -n" so that the process can hold at least as many open files as there are chromosomes/contigs; otherwise bowtie-build fails without a useful error message. One possible enhancement would be for bowtie-build to use getrlimit()/setrlimit() to raise RLIMIT_NOFILE if necessary, and to print a descriptive error message when that is not possible.
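
A rough sketch of what such a check might look like on a POSIX system follows. The helper name ensureOpenFileLimit and the wording of the messages are hypothetical; only getrlimit(), setrlimit(), and RLIMIT_NOFILE are real POSIX interfaces. It raises the soft limit only as far as the hard limit allows and reports why it failed otherwise.

```cpp
#include <sys/resource.h>
#include <cstdio>

// Hypothetical helper: returns true if the process may hold at least
// `needed` open file descriptors after the call.
inline bool ensureOpenFileLimit(rlim_t needed)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        std::perror("getrlimit(RLIMIT_NOFILE)");
        return false;
    }
    // The current soft limit is already sufficient.
    if (lim.rlim_cur == RLIM_INFINITY || lim.rlim_cur >= needed) {
        return true;
    }
    // An unprivileged process cannot raise the soft limit past the hard limit.
    if (lim.rlim_max != RLIM_INFINITY && lim.rlim_max < needed) {
        std::fprintf(stderr,
                     "Error: %llu open files are needed, but the hard limit "
                     "is %llu; consider concatenating the FASTA files.\n",
                     (unsigned long long)needed,
                     (unsigned long long)lim.rlim_max);
        return false;
    }
    lim.rlim_cur = needed;
    if (setrlimit(RLIMIT_NOFILE, &lim) != 0) {
        std::perror("setrlimit(RLIMIT_NOFILE)");
        return false;
    }
    return true;
}
```

A caller would presumably pass the number of input files plus some headroom for stdin, stdout, stderr, and the index files being written. Some platforms impose an additional ceiling below the hard limit, in which case setrlimit() fails and the helper returns false.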

Discussion

  • Ben Langmead

    Ben Langmead - 2010-08-27

    Hi Nathan,

    Does the workaround of first concatenating all your .fa files together into one file seem good to you? If so, do you think an error message that suggests that workaround would be sufficient here? The filehandle limit makes me reluctant to do the fix you suggest.

    Ben

     
  • Nathan Weeks

    Nathan Weeks - 2010-08-27

    Ben,

    If it works the same as specifying the files individually, that's fine by me. A descriptive error message would be helpful; it would also be beneficial to mention in the crossbow documentation that a single FASTA file in which the reference sequences appear in the correct order is an alternative to the comma-separated-list method.

     
  • Ben Langmead

    Ben Langmead - 2013-01-08

    This was fixed in version 2.0.5.

    Thanks as always for the detailed report.

    Best,
    Ben

     
  • Ben Langmead

    Ben Langmead - 2013-01-08
    • status: open --> closed
     
