Not working for large *.cidx files

Created: 2011-07-12 by Boulund
Updated: 2014-03-09
  • Boulund

    Boulund - 2011-07-12

    Hi!

    I've been trying to index some rather large FASTA files, such as the GenBank environmental and GenBank nucleotide databases. During my tests I've discovered that it is indeed possible to index the GenBank nucleotide FASTA file, which is 75GB in size and contains 87763542 entries.
    What does not work is indexing the GenBank environmental FASTA file, which is 23GB in size and contains  entries. What happens is that cdbfasta stops the indexing operation when the index file (env_nt.pfa.cidx_tmp) reaches 4 GB (4294961152 bytes) in size. I hope this can be resolved easily. Maybe I can fix it in the source code myself if you point me in the right direction?

    I'm on 64bit Red Hat Enterprise Linux 6.1, Xeon X3470 and 16GB of RAM.

    /Boulund

     
  • Boulund

    Boulund - 2011-07-12

    It seems that the number of entries in env_nt did not make it into the post (the exact number of entries is large; in fact, so large that my computer won't cooperate to count them right now…). I also noticed that I forgot to add information about the *.cidx size for the GenBank nucleotide database: nt.pfa.cidx is 4242573126 bytes (not far from what seems to be the limit).

    I've also encountered the same errors when trying to index two other files from the CAMERA metagenomic database that are also rather large (>100GB).

     
  • Boulund

    Boulund - 2011-07-13

    I thought I'd try to make everything clearer by consolidating all the information into one final post; sorry for repeating myself.

    I've been trying to index some rather large FASTA files, like for example GenBank environmental or GenBank nucleotide databases. During my tests I've discovered that it is indeed possible to index the GenBank nucleotide FASTA file, which is 75GB in size and contains 87763542 entries. Indexing this file gives a file, nt.pfa.cidx, that is 4242573126 bytes in size.
    What does not work is indexing the GenBank environmental FASTA file (env_nt.pfa), which is only 23GB in size and contains XXXXX entries. What happens is that cdbfasta stops the indexing operation when the index file (env_nt.pfa.cidx_tmp) reaches ~4 GB (4294961152 bytes) in size.

    Summary of some files I've tried to index:

    Size                      Filename                                  Entries
     75G ( 79679500086 bytes) nt.pfa                                   87763542
     23G ( 24597965376 bytes) env_nt.pfa                              110633562
    108G (115131334460 bytes) CAM_PROJ_AntarcticaAquatic.read.fa.pfa  387757590
     75G ( 80281084264 bytes) CAM_PROJ_GOS.read.fa.pfa                 76035108
    

    Note that nt.pfa is the only file that could be indexed properly.

    When trying to index them using standard cdbfasta settings, indexing stops with the following error message:
    "Error: cdbhash was unable to write into file"
    I get the following files left in the directory:

    Size       Filename
    4242573126 nt.pfa.cidx   <--- this indexing worked and produced no error message
    4294961152 env_nt.pfa.cidx_tmp
    4294961152 CAM_PROJ_AntarcticaAquatic.read.fa.pfa.cidx_tmp
    4294961152 CAM_PROJ_GOS.read.fa.pfa.cidx_tmp
    

    Note that the index file size for nt.pfa is slightly less than for the three other files.

    The sources for all files are available via public FTP, should you need them for testing. My files have been translated into protein FASTA format in all 6 reading frames, so they are approximately 3 times larger and contain 6 times as many sequences as the original files.

    I'm on 64bit Red Hat Enterprise Linux 6.1, Xeon X3470 and 16GB of RAM. The program works well for all of my other FASTA files (which are obviously smaller and do not contain as many entries).

     
  • pausan

    pausan - 2013-09-26

    Hi,
    I am having the exact same problem, trying to index a 10G file with 70M entries. I get the "Error: cdbhash was unable to write into file" message when the file reaches 4294961152 bytes and cdbfasta stops.
    I was wondering if you had solved the issue.
    Thanks in advance,
    pausan

     
  • J Alves

    J Alves - 2014-03-09

    I am also having the exact same problem, although I am starting from a FASTQ file. When the temporary file reaches 4,294,961,152 bytes, cdbfasta stops with the error reported above. My system can clearly create files larger than that, since I have files that are more than 5 GB. Any pointers on how to solve this? Source code changes, compiler options, or...? Thanks.

     
