seqbarracuda-meth

Authors:


Barracuda, nvidia libraries and toolkit

These instructions describe how to align bisulfite sequencing (methylation) data with barracuda on an instance that has both CPU and GPU resources, using Nvidia GPUs and the Nvidia CUDA toolkit.

Please refer to the website to download barracuda:

http://seqbarracuda.sourceforge.net

To install the Nvidia CUDA dependencies on an Ubuntu 14.04 system:



sudo apt-get install nvidia-cuda-dev nvidia-cuda-toolkit
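
After installing, it can help to confirm that the CUDA tools are visible on the PATH. The sketch below assumes that `nvcc` comes with the toolkit package and that `nvidia-smi` comes with the Nvidia driver; `check_tool` is a hypothetical helper name, not part of barracuda or CUDA.

```shell
# Report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

check_tool nvcc        # CUDA compiler (assumed to ship with the toolkit)
check_tool nvidia-smi  # driver utility (assumed to ship with the driver)
```

If either tool is reported as missing, the barracuda build or the GPU alignment step is likely to fail later on.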

To install the latest version of the Nvidia CUDA toolkit, please refer to the website:

https://developer.nvidia.com/cuda-downloads

Example data

Simulated WGBS example datasets of increasing size are listed below. Depending on the instance configuration, alignment should take on the order of seconds, minutes, or several hours as the datasets grow.

Example data can be downloaded from the following links:

mkdir ~/tmwg-example-files
cd ~/tmwg-example-files
wget -c --no-check-certificate https://s3-eu-west-1.amazonaws.com/cegx-test-001/tmwg-example-files/SIM01_S1_L001_R1_001.fastq.gz
wget -c --no-check-certificate https://s3-eu-west-1.amazonaws.com/cegx-test-001/tmwg-example-files/SIM01_S1_L001_R2_001.fastq.gz

wget -c --no-check-certificate https://s3-eu-west-1.amazonaws.com/cegx-test-001/tmwg-example-files/SIM02_S1_L001_R1_001.fastq.gz
wget -c --no-check-certificate https://s3-eu-west-1.amazonaws.com/cegx-test-001/tmwg-example-files/SIM02_S1_L001_R2_001.fastq.gz

wget -c --no-check-certificate https://s3-eu-west-1.amazonaws.com/cegx-test-001/tmwg-example-files/SIM03_S1_L001_R1_001.fastq.gz
wget -c --no-check-certificate https://s3-eu-west-1.amazonaws.com/cegx-test-001/tmwg-example-files/SIM03_S1_L001_R2_001.fastq.gz

# The SIM04-WGBS dataset is large: it requires at least 60 GB of disk to download
# and at least 500 GB of disk to analyse

wget -c --no-check-certificate https://s3-eu-west-1.amazonaws.com/cegx-test-001/tmwg-example-files/SIM04-WGBS_S1_L001_R1_001.fastq.gz
wget -c --no-check-certificate https://s3-eu-west-1.amazonaws.com/cegx-test-001/tmwg-example-files/SIM04-WGBS_S1_L001_R2_001.fastq.gz

# The SIM05-WGBS dataset is large: it requires at least 300 GB of disk to download
# and at least 2 TB of disk to analyse

wget -c --no-check-certificate https://dl.dnanex.us/F/D/ky2g3FPV3JxgBpKfGB5f902K1JyQ1X4j3GK4gzQx/SIM05-WGBS-combined_S1_L001_R1_001.fastq.gz
wget -c --no-check-certificate https://dl.dnanex.us/F/D/54Vf7FYG575V45Yg1Yx7204y91KP8gZ361p0QpyG/SIM05-WGBS-combined_S1_L001_R2_001.fastq.gz
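
Because the larger datasets need tens to hundreds of gigabytes, it may be worth checking free disk space before starting a download. This is a minimal sketch using `df`; `has_free_gb` is a hypothetical helper, and it assumes the mount point name contains no spaces.

```shell
# Succeed if the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
    # -P gives one line per filesystem, -k reports sizes in 1024-byte blocks.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# 60 GB is the download size given for SIM04-WGBS.
if has_free_gb . 60; then
    echo "enough free space for the SIM04-WGBS download"
else
    echo "less than 60 GB free in the current directory"
fi
```

The same check with 500 or 2048 as the second argument covers the analysis requirements for SIM04-WGBS and SIM05-WGBS.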

Example indexed reference files

An example indexed reference of the C-to-T (c2t) converted human genome is available below. The download may take between 5 and 30 minutes:

mkdir ~/genome_refs
cd ~/genome_refs
wget -c --no-check-certificate https://s3-eu-west-1.amazonaws.com/cegx-test-001/tmwg-example-files/hs38DH_bwameth.tar.gz

Running seqbarracuda-meth

To prepare the index files for barracuda:

cd ~/genome_refs
tar xzf hs38DH_bwameth.tar.gz

This produces a folder with the reference files needed for alignment, including the c2t file:

ls -l ~/genome_refs/hs38DH_bwameth/hs38DH.fa.bwameth.c2t
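
To check that the index was unpacked completely, the companion index files can be listed as well. The sketch below assumes the archive follows the usual BWA index layout (`.amb`, `.ann`, `.bwt`, `.pac`, `.sa` next to the c2t file); that list is an assumption, not something stated on this page, and `missing_index_files` is a hypothetical helper.

```shell
# Print each expected index file that is missing for index prefix $1.
# The extension list is an assumption based on the usual BWA index layout.
missing_index_files() {
    for ext in amb ann bwt pac sa; do
        [ -e "$1.$ext" ] || echo "$1.$ext"
    done
}

# Prints nothing when every assumed index file is present.
missing_index_files ~/genome_refs/hs38DH_bwameth/hs38DH.fa.bwameth.c2t
```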

To prepare barracuda, download barracuda_version.tar.gz from the website, then decompress and compile it:

cd ~/
tar xzf barracuda_version.tar.gz
cd ~/barracuda/
make

If you get the warning "ldconfig: command not found", make sure ldconfig is in your PATH, delete everything in barracuda/linux/release, and then run "make" again to rebuild linux/release/barracuda.
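
One way to put ldconfig on the PATH is sketched below. It assumes ldconfig lives in `/sbin` or `/usr/sbin`, which is common on Ubuntu but not guaranteed; `add_to_path` is a hypothetical helper.

```shell
# Append directory $1 to PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

add_to_path /sbin
add_to_path /usr/sbin
export PATH

# Show where ldconfig was found, or say that it is still missing.
command -v ldconfig || echo "ldconfig still not found"
```

Once ldconfig is found, clear barracuda/linux/release and rerun make as described above.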

Alignments against a human reference require an instance whose GPU has at least 6 GB of GPU memory.
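
The memory of each installed GPU can be listed with `nvidia-smi` before starting. The sketch below filters its CSV output for cards above a threshold; `filter_gpus` is a hypothetical helper, and the value 6000 (in MiB) is an assumed approximation of 6 GB, since the total reported by the driver may differ from the nominal capacity.

```shell
# Read "index, name, memory.total" CSV rows on stdin and print the
# GPUs whose total memory (MiB) is at least $1.
filter_gpus() {
    awk -F', ' -v min="$1" \
        '$3 + 0 >= min + 0 { print $1 ": " $2 " (" $3 " MiB)" }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,memory.total \
               --format=csv,noheader,nounits | filter_gpus 6000
fi
```

The index printed at the start of each line is the device number that can be passed to the `-C` switch.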

To run barracuda on the R1 and R2 input files, use the following command-line:

cd ~/tmwg-example-files

~/barracuda/linux/release/barracuda aln ~/genome_refs/hs38DH_bwameth/hs38DH.fa.bwameth.c2t ./SIM01_S1_L001_R1_001.fastq.gz > ./SIM01_S1_L001_R1_001.sai

~/barracuda/linux/release/barracuda aln -C 1 ~/genome_refs/hs38DH_bwameth/hs38DH.fa.bwameth.c2t ./SIM01_S1_L001_R2_001.fastq.gz > ./SIM01_S1_L001_R2_001.sai

The '-C 1' switch means that barracuda will try to run on GPU "1" rather than the default GPU. On instances with two GPUs, the two commands above can run in parallel.
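
On a two-GPU instance, the parallel run can be sketched with shell background jobs. `align_pair` is a hypothetical wrapper; it assumes the `_R1_001`/`_R2_001` file naming used in the example data and that the default GPU and GPU 1 are different devices.

```shell
# Align R1 on the default GPU and R2 on GPU 1 at the same time.
# $1 = barracuda binary, $2 = reference prefix, $3 = sample prefix
align_pair() {
    "$1" aln "$2" "${3}_R1_001.fastq.gz" > "${3}_R1_001.sai" &
    pid_r1=$!
    "$1" aln -C 1 "$2" "${3}_R2_001.fastq.gz" > "${3}_R2_001.sai" &
    pid_r2=$!

    # Wait for both jobs and report failure if either one failed.
    status=0
    wait "$pid_r1" || status=1
    wait "$pid_r2" || status=1
    return "$status"
}

# Example (not run here):
#   cd ~/tmwg-example-files
#   align_pair ~/barracuda/linux/release/barracuda \
#       ~/genome_refs/hs38DH_bwameth/hs38DH.fa.bwameth.c2t SIM01_S1_L001
```

Waiting for both jobs matters because the next step needs both .sai files to be complete.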

Each of the two steps above will produce output similar to the following:

Barracuda, Version 0.7.0r107
[aln] 17bp reads: max_diff = 2
[aln] 38bp reads: max_diff = 3
[aln] 64bp reads: max_diff = 4
[aln] 93bp reads: max_diff = 5
[aln] 124bp reads: max_diff = 6
[aln] 157bp reads: max_diff = 7
[aln] 190bp reads: max_diff = 8
[aln] 225bp reads: max_diff = 9
[aln_core] Running 0.7.0 beta $Revision: 1.112 $ CUDA mode.
[aln_core] Using specified CUDA device 1 Tesla K40c, memory available 11419 MB.
[aln_core] Loading BWTs, please wait..
[aln_core] Finished loading reference sequence assembly, 2040 MB in 3.86s (528.73 MB/s).
[aln_core] Memory available for performing alignment: 5282 MB.
[aln_core] Sweet! Running with an enlarged buffer.
[aln_core] Now aligning sequence reads to reference assembly, please wait..
[aln_core] 111847 reads processed.
[aln_core] 223694 reads processed.
[...]
[aln_core] Finished!
[aln_core] Total no. of sequences: 10000000, size in base pair: 1500000000 bp, average length 150.00 bp/sequence.
[aln_core] Alignment Speed: 21326.04 sequences/sec or 3198906.25 bp/sec.
[aln_core] Total program time: 468.91s.
[main] Version: 0.7.0r107
[main] CMD: /home/avilella/barracuda/linux/release/barracuda aln /home/avilella/genome_refs/hs38DH_bwameth/hs38DH.fa.bwameth.c2t ./SIM01_S1_L001_R1_001.fastq.gz
[main] Real time: 469.655 sec; CPU: 463.403 sec
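
The reported alignment speed gives a rough way to estimate run times for the larger datasets. In the log above, 10000000 reads at 21326.04 sequences/sec corresponds to roughly 469 seconds, in line with the reported total time. The sketch below extracts that figure from a saved log and extrapolates. It assumes the log was captured to a file (for example by redirecting stderr, which is an assumption about where barracuda writes it) and that speed stays constant across datasets. Both helper names are hypothetical.

```shell
# Print the sequences/sec figure from a saved barracuda aln log file ($1).
alignment_speed() {
    sed -n 's|^\[aln_core] Alignment Speed: \([0-9.]*\) sequences/sec.*|\1|p' "$1"
}

# Estimate the run time in whole seconds for $1 reads at $2 sequences/sec.
estimate_seconds() {
    awk -v reads="$1" -v speed="$2" 'BEGIN { printf "%.0f\n", reads / speed }'
}

# Example (not run here), assuming the log was saved as aln.log:
#   estimate_seconds 50000000 "$(alignment_speed aln.log)"
```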

The next step combines the two .sai files into a single .sam.gz file. Run the sampe command with the R1 and R2 .sai files followed by the R1 and R2 FASTQ files:

~/barracuda/linux/release/barracuda sampe ~/genome_refs/hs38DH_bwameth/hs38DH.fa.bwameth.c2t ./SIM01_S1_L001_R1_001.sai ./SIM01_S1_L001_R2_001.sai ./SIM01_S1_L001_R1_001.fastq.gz ./SIM01_S1_L001_R2_001.fastq.gz | gzip -c > SIM01_S1_L001.sam.gz

The step above will produce output similar to the following:

Barracuda, Version 0.7.0r107
[sampe_core] Loading BWTs, please wait..Done!  
[sampe_core] Time used: 41.55s
[sampe_core] Running with 4 threads
[sampe_core] Processing 524288 read pairs at a time
[sampe_core] Converting SA ./SIM01_S1_L001_R1_001.sai ./SIM01_S1_L001_R2_001.sai to linear sequence coordinates, please wait... 
[...]

This produces a gzip-compressed SAM file (SIM01_S1_L001.sam.gz), which can be converted into BAM format using samtools.
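
A minimal sketch of the BAM conversion, assuming samtools is installed. The `-S` flag tells older samtools 0.1.x releases that the input is SAM and is accepted for compatibility by newer releases. `sam_gz_to_bam` is a hypothetical helper name.

```shell
# Convert a gzip-compressed SAM file ($1) into a BAM file ($2).
sam_gz_to_bam() {
    # Decompress to stdout and let samtools read SAM from stdin ("-").
    gzip -dc "$1" | samtools view -bS -o "$2" -
}

# Example (not run here):
#   sam_gz_to_bam SIM01_S1_L001.sam.gz SIM01_S1_L001.bam
```

The resulting BAM is in the same read order as the SAM file, so it would still need sorting before any tool that expects coordinate-sorted input.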

NOTES:

The alignment steps require loading the BWT indexes; the following message will appear:
```
Loading BWTs, please wait...
```
If the alignment process stops at this point, the call may have run out of GPU memory.

Using barracuda aln/sampe as above on the simulated datasets should map approximately 96% of reads. The alignment rate can be adjusted with a non-default setting for -n.

Tested Nvidia cards: Tesla K20, K40 and K80. For their specifications, see:
https://en.wikipedia.org/wiki/Nvidia_Tesla#Specifications_and_configurations
