[maq-help] illumina runs
From: Girish B <bg...@he...> - 2008-08-10 23:12:18
Hi!

I am wondering why it is taking me so long to do an alignment using maq. I am trying to align Illumina data (7 lanes, each about 1 GB in size) against the human reference genome (one huge FASTA file of about 2.5 GB). maq's website says that for 1-2 million reads and with 1 GB of RAM, it would take about 10 CPU hours to align against the human reference genome. I wrote a Python script to count the number of reads in one of the lanes and it counted 7 million reads. Moreover, lane 1 (1.1 GB) has been running for about 7 days now (quad-core machine, 32 bits, 6 GB RAM).

I did as maq's website advised, i.e. I downloaded the 24 individual FASTA files for the human reference genome from NCBI and used Python to put them all into one FASTA file (resulting size 2.5 GB). I then converted the reference genome to bfa format using "maq fasta2bfa ref.fasta ref.bfa", converted the Illumina reads to bfq using "maq fastq2bfq reads.fastq reads-1.bfq", and finally aligned the reads to the reference using "maq match reads-1.map ref.bfa reads-1.bfq". I am not sure whether I am actually using the right flags.

Furthermore, I was wondering whether it would affect maq if I wrote a Python script that randomly picks, let's say, 1 or 2 million reads from the Illumina data (7 million reads) and aligned only those against the human reference genome.
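For reference, the random-subset idea could be sketched roughly as below. This is only a minimal sketch, not anything taken from maq itself: it assumes the reads are in plain FASTQ with exactly four lines per record, and the names read_fastq_records and sample_fastq are hypothetical helpers made up for illustration. It uses reservoir sampling, so the file is read once and only the chosen records are held in memory.

```python
import random
from itertools import islice


def read_fastq_records(handle):
    """Yield FASTQ records as tuples of 4 lines.

    Assumption: strict 4-line records (header, sequence, '+', qualities).
    """
    while True:
        record = tuple(islice(handle, 4))
        if not record:
            return
        if (len(record) < 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError("not a well-formed 4-line FASTQ record")
        yield record


def sample_fastq(in_path, out_path, k, seed=0):
    """Write a uniform random sample of up to k records to out_path.

    Returns (total records seen, records written).
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    reservoir = []
    total = 0
    with open(in_path) as fh:
        for total, record in enumerate(read_fastq_records(fh), start=1):
            if len(reservoir) < k:
                # Fill the reservoir with the first k records.
                reservoir.append(record)
            else:
                # Keep the new record with probability k / total.
                j = rng.randrange(total)
                if j < k:
                    reservoir[j] = record
    with open(out_path, "w") as out:
        for record in reservoir:
            out.writelines(record)
    return total, len(reservoir)
```

With a hypothetical call such as sample_fastq("reads.fastq", "reads-sample.fastq", 1000000), the resulting file would then go through the same "maq fastq2bfq" step as the full lane. Note that the sampled records are not written in their original order.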
I did try a run on 20,000 reads from lane 1 (using maq 0.6.8, 64 bits), and the results of "maq mapcheck ref.bfa reads-1.map >mapcheck.txt" are below:

    Number of reference sequences: 24
    Length of reference sequences exlcuding gaps: 2832337645
    Length of gaps in the reference sequences: 166621004
    Length of non-gap regions covered by reads: 100
    Length of 24bp unique regions of the reference: 0
    Reference nucleotide composition: A: 29.54%, C: 20.44%, G: 20.45%, T: 29.58%
    Reads nucleotide composition: A: 100.00%, C: 0.00%, G: 0.00%, T: 0.00%
    Average depth across all non-gap regions: 0.000
    Average depth across 24bp unique regions: nan

    A C G T : AC AG AT CA CG CT GA GC GT TA TC TG : 0? : 0?

I am not sure whether "Reads nucleotide composition: A: 100.00%, C: 0.00%, G: 0.00%, T: 0.00%" is a bug, maq being unable to process this data, or simply me using the wrong option.

Any ideas and suggestions would be most welcome.

Many thanks,
Girish B