[maq-help] illumina runs


Hi!

I am wondering why an alignment with maq is taking me so long. I am trying to align Illumina data (7 lanes, each about 1 GB in size) against the human reference genome (one large FASTA file of about 2.5 GB). Maq's website says that aligning 1-2 million reads against the human reference genome with 1 GB of RAM takes about 10 CPU hours. I wrote a Python script to count the reads in one of the lanes, and it counted 7 million reads. However, lane 1 (1.1 GB) has now been running for about 7 days (quad-core machine, 32-bit, 6 GB RAM).

I did as maq's website advised: I downloaded the 24 individual FASTA files for the human reference genome from NCBI and used Python to put them all into one FASTA file (resulting size 2.5 GB). I then converted the reference genome to BFA format with "maq fasta2bfa ref.fasta ref.bfa", converted the Illumina reads to BFQ with "maq fastq2bfq reads.fastq reads-1.bfq", and finally aligned the reads to the reference with "maq match reads-1.map ref.bfa reads-1.bfq". I am not sure whether I am actually using the right flags.

Furthermore, I was wondering whether it would affect maq if I wrote a Python script to randomly pick, say, 1 or 2 million reads from the Illumina data (7 million reads) and aligned those against the human reference genome. I did try a run on 20,000 reads from lane 1 (using maq 0.6.8, 64-bit), and the results of "maq mapcheck ref.bfa reads-1.map >mapcheck.txt" are below:

Number of reference sequences: 24
Length of reference sequences exlcuding gaps: 2832337645
Length of gaps in the reference sequences: 166621004
Length of non-gap regions covered by reads: 100
Length of 24bp unique regions of the reference: 0
Reference nucleotide composition: A: 29.54%, C: 20.44%, G: 20.45%, T: 29.58%
Reads nucleotide composition: A: 100.00%, C: 0.00%, G: 0.00%, T: 0.00%
Average depth across all non-gap regions: 0.000
Average depth across 24bp unique regions: nan
A C G T : AC AG AT CA CG CT GA GC GT TA TC TG : 0? : 0?
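
One way to cross-check the read composition reported above, independently of maq, would be to tally the bases directly from the FASTQ file. The following is only a minimal sketch: it assumes plain four-line FASTQ records (header, sequence, "+", qualities), and the function name base_composition is hypothetical, not part of maq.

```python
from collections import Counter
import io


def base_composition(handle):
    """Return the percentage of A, C, G and T over all reads in a FASTQ handle.

    Assumes four-line FASTQ records; the sequence is the second line of each.
    Bases other than A/C/G/T (for example N) are ignored in the percentages.
    """
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 1:  # sequence line of each record
            counts.update(line.strip().upper())
    total = sum(counts[b] for b in "ACGT")
    if total == 0:
        return {b: 0.0 for b in "ACGT"}
    return {b: 100.0 * counts[b] / total for b in "ACGT"}


# Tiny in-memory example standing in for a real reads file.
example = io.StringIO("@r1\nAACG\n+\nIIII\n@r2\nTTTT\n+\nIIII\n")
composition = base_composition(example)
print(composition)
```

On a real file the handle would come from open("reads.fastq"). If this tally shows a normal mix of bases while mapcheck reports 100% A, the FASTQ file itself is probably fine and the problem is more likely in a later step.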

I am not sure whether "Reads nucleotide composition: A: 100.00%, C: 0.00%, G: 0.00%, T: 0.00%" is a bug, maq being unable to process this data, or me using the wrong option. Any ideas and suggestions would be most welcome.
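
The random subsampling mentioned earlier could be sketched roughly as follows. This is a minimal sketch, not necessarily the script used for the 20,000-read test: it assumes plain four-line FASTQ records, uses reservoir sampling so that the whole lane never has to be held in memory, and the names read_fastq and subsample_fastq are hypothetical.

```python
import io
import random


def read_fastq(handle):
    """Yield each four-line FASTQ record as a single string."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield header + seq + plus + qual


def subsample_fastq(handle, k, seed=0):
    """Pick k records uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for n, record in enumerate(read_fastq(handle)):
        if n < k:
            reservoir.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                reservoir[j] = record  # replace with probability k / (n + 1)
    return reservoir


# Tiny in-memory example: 10 reads, keep 3.
records = ["@r%d\nACGT\n+\nIIII\n" % i for i in range(10)]
sample = subsample_fastq(io.StringIO("".join(records)), 3)
print("".join(sample), end="")
```

The selected records could then be written to a new FASTQ file and converted with "maq fastq2bfq" as before. Note that the records come out in reservoir order rather than in their original file order.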

Many thanks,

Girish B