This guide walks through a very small example. The reference genome is the Zymomonas mobilis subsp. mobilis ZM4 chromosome, stored in NC_006526.fna. For the read mapping step we simply simulate 100k reads of length 100 with Mason [2] and write them to zymamo_100k_100L.fastq (the simulation is only for the sake of this tutorial). An overview of the results of this guide is given HERE
The first step will create the artificial reference genome. The options can be seen with:
arden-create
Moreover, a list of examples can be printed with:
arden-create -e 1
The following command line produces an artificial reference with a substitution on every 21st nucleotide (-d 21). ORFs are not protected from mutations (-o 0), and the name of the sequence is set to zymamo (-n). It is recommended to set this name to a string without special characters. Otherwise the FASTA header is used, which might not look "nice".
arden-create /data/test/ /data/test/NC_006526.fna -d 21 -o 0 -n zymamo
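To make the effect of -d 21 concrete, the following is a minimal sketch of "one substitution on every 21st nucleotide" applied to a toy sequence. It is not ARDEN's implementation: the function name `mutate_every_nth` and the fixed substitution table are assumptions made for illustration, and the rules arden-create actually uses to pick the replacement base may differ.

```python
# Hypothetical substitution table; arden-create may choose replacements differently.
SUBSTITUTE = {"A": "C", "C": "G", "G": "T", "T": "A"}


def mutate_every_nth(sequence, distance):
    """Return a copy of `sequence` with every `distance`-th base substituted."""
    bases = list(sequence)
    # 1-based positions distance, 2*distance, ... are indices distance-1, ...
    for index in range(distance - 1, len(bases), distance):
        original = bases[index].upper()
        # Bases without an entry in the table (e.g. N) are left untouched.
        bases[index] = SUBSTITUTE.get(original, bases[index])
    return "".join(bases)


reference = "ACGT" * 12  # 48 bases of toy sequence
artificial = mutate_every_nth(reference, 21)

# 1-based positions at which the two sequences differ
changed = [i + 1 for i, (r, a) in enumerate(zip(reference, artificial)) if r != a]
print(changed)  # [21, 42]
```

The artificial sequence keeps the length of the reference, so read coordinates stay comparable between the two.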
arden-create will generate three different types of output, named according to the convention AR_abc_distance_referencename, where a = ORF option, b = reverse unbalanced mutations option, and c = random option.
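When many artificial references are generated, it can help to recover the settings from the file name. The sketch below splits a name of the form AR_abc_distance_referencename into its parts. The helper `parse_ar_name` and the dictionary keys are hypothetical and not part of ARDEN; the sketch also assumes a single file extension and a reference name without dots.

```python
import os


def parse_ar_name(filename):
    """Split an AR_abc_distance_referencename file name into its fields."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    parts = stem.split("_", 3)  # maxsplit=3 keeps underscores in the reference name
    if len(parts) != 4 or parts[0] != "AR" or len(parts[1]) != 3:
        raise ValueError("not an AR_abc_distance_referencename name: " + filename)
    _, flags, distance, name = parts
    return {
        "orf": flags[0],                 # a
        "reverse_unbalanced": flags[1],  # b
        "random": flags[2],              # c
        "distance": int(distance),
        "reference_name": name,
    }


info = parse_ar_name("/data/test/AR_111_21_zymamo.fasta")
print(info["distance"], info["reference_name"])  # 21 zymamo
```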
For this step, map the simulated reads to both the reference and the created artificial
reference. Any read mapper that supports the SAM format can be used. We chose Bowtie2 and RazerS3.
To sort the reads, the following combination of SAMtools [1] commands can be
used.
samtools view -F 4 -bS input.sam | samtools sort -no - - | samtools view -h - > output.sam
To sort several files in parallel, GNU parallel [3] might be an option. The following command converts all *.sam files in the current directory using all available cores.
ls *.sam | parallel "samtools view -F 4 -bS {} | samtools sort -no - - | samtools view -h - > {.}_sorted.sam"
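For readers who want to see what this pipeline does to the records, here is a small pure-Python sketch of the same two steps: dropping unmapped reads (FLAG bit 0x4, as selected by -F 4) and sorting the rest by read name. The function `mapped_sorted_by_name` and the toy records are made up for illustration. Sorting uses plain string order, which is a simplification, since samtools applies its own read-name comparison. For real data the SAMtools commands above remain the better choice.

```python
def mapped_sorted_by_name(sam_lines):
    """Drop unmapped reads (FLAG bit 0x4) and sort the rest by read name."""
    header, records = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):  # header lines are passed through unchanged
            header.append(line)
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:  # read is unmapped, same test as `samtools view -F 4`
            continue
        records.append(line)
    # Simplification: plain string order on the QNAME column.
    records.sort(key=lambda record: record.split("\t")[0])
    return header + records


toy = [
    "@SQ\tSN:chr\tLN:100",
    "read_b\t0\tchr\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read_c\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "read_a\t16\tchr\t30\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
]
result = mapped_sorted_by_name(toy)
print([line.split("\t")[0] for line in result])  # ['@SQ', 'read_a', 'read_b']
```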
For this step the sorted .SAM files and the input read file are needed. To
provide an easy-to-handle function call (especially when multiple mappers are
used), these parameters need to be defined in an extra file. We will now create
this file, ControlFile.ini. The following table explains the character encoding:
| Character | Variable |
|---|---|
| $ | Reference fasta |
| # | Artificial fasta |
| & | Fastq file |
| @ | Mapper ID |
| ref | alignment REF |
| art | alignment ART |
| + | End |
For our example the ControlFile.ini should look like this:
$:/data/test/NC_006526.fna
#:/data/test/AR_111_21_zymamo.fasta
&:/data/test/zymamo.fastq
@Bow2
ref:/data/test/Bow2_REF_sorted.sam
art:/data/test/Bow2_ART_sorted.sam
+
@RazerS3
ref:/data/test/Raz3_REF_sorted.sam
art:/data/test/Raz3_ART_sorted.sam
+
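A quick way to check a control file before starting the analysis is to parse it according to the character encoding from the table. The following is only a sketch under that reading of the format. `parse_control_file` and the resulting dictionary layout are hypothetical and are not the parser used by arden-analyze.

```python
def parse_control_file(text):
    """Parse ControlFile.ini content into global paths and per-mapper alignments."""
    config = {"reference": None, "artificial": None, "fastq": None, "mappers": {}}
    global_keys = {"$": "reference", "#": "artificial", "&": "fastq"}
    current = None  # ID of the mapper block currently being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "+":  # end of a mapper block
            current = None
        elif line.startswith("@"):  # start of a mapper block
            current = line[1:]
            config["mappers"][current] = {}
        elif line[0] in global_keys and line[1:2] == ":":
            config[global_keys[line[0]]] = line[2:]
        elif line.startswith(("ref:", "art:")) and current is not None:
            key, path = line.split(":", 1)
            config["mappers"][current][key] = path
        else:
            raise ValueError("unexpected line: " + raw)
    return config


example = """\
$:/data/test/NC_006526.fna
#:/data/test/AR_111_21_zymamo.fasta
&:/data/test/zymamo.fastq
@Bow2
ref:/data/test/Bow2_REF_sorted.sam
art:/data/test/Bow2_ART_sorted.sam
+
@RazerS3
ref:/data/test/Raz3_REF_sorted.sam
art:/data/test/Raz3_ART_sorted.sam
+
"""
config = parse_control_file(example)
print(sorted(config["mappers"]))  # ['Bow2', 'RazerS3']
```

A line that matches none of the encodings raises a ValueError, which catches typos such as a ref: entry outside of a mapper block.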
After defining the .ini file, the program call to compare the alignments can
be done with:
arden-analyze /data/test/ControlFile.ini /data/test/results/
The program call will generate four types of output:
To filter alignments according to the features RQS (-r), gaps (-g) and mismatches (-m) just run:
arden-filter input.sam output.sam -r value1 -m value2 -g value3
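To get a feeling for the gap and mismatch features, the sketch below derives both from a SAM record's CIGAR string and NM tag. It rests on several assumptions that are not confirmed by this guide: gaps are counted as inserted plus deleted bases, mismatches as NM minus those bases, and an alignment passes if both values are at or below the thresholds. The RQS feature is left out because its definition is not given here. The names `gaps_and_mismatches` and `passes` are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def gaps_and_mismatches(cigar, nm):
    """Return (gap_bases, mismatches) for one alignment."""
    gap_bases = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "ID")
    # NM is the edit distance: mismatches plus inserted and deleted bases.
    return gap_bases, nm - gap_bases


def passes(cigar, nm, max_gaps, max_mismatches):
    """Assumed filter rule: keep alignments at or below both thresholds."""
    gaps, mismatches = gaps_and_mismatches(cigar, nm)
    return gaps <= max_gaps and mismatches <= max_mismatches


print(gaps_and_mismatches("50M2I48M", 3))  # (2, 1)
print(passes("50M2I48M", 3, max_gaps=1, max_mismatches=5))  # False
```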
[1] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078-9. doi:10.1093/bioinformatics/btp352
[2] Holtgrewe, M. (2010). Mason - a read simulator for second generation sequencing data. Technical Report TR-B-10-06, Institut für Mathematik und Informatik, Freie Universität Berlin.
[3] O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.