ARDEN Wiki

Specificity Control for Read Alignments Using an Artificial Reference

Brought to you by: gieses

Getting Started Guide

Introduction
Creating an artificial reference genome
- Generated Output
Mapping step
Sorting step
Analyzing both mappings
- Generated Output
Filtering (optional)
References

Introduction

For this guide we will walk through a very small example. The reference genome is the Zymomonas mobilis subsp. mobilis ZM4 chromosome, stored in NC_006526.fna. For the read mapping step we simulate 100k reads of length 100 with Mason [2] and write them to zymamo_100k_100L.fastq (the simulation is only for the sake of this tutorial). An overview of the results of this guide is given HERE

Creating an artificial reference genome

The first step creates the artificial reference genome. The available options are listed with:

arden-create

Moreover, a list of examples can be printed with:

arden-create -e 1

The following command line produces an artificial reference with a substitution on every 21st nucleotide (-d 21). ORFs are not protected from mutations (-o 0), and the name of the sequence is zymamo (-n). It is recommended to set this name to a string without special characters. Otherwise the FASTA header is used, which might not look "nice".

arden-create /data/test/ /data/test/NC_006526.fna -d 21 -o 0 -n zymamo

Generated Output

arden-create generates three types of output, named according to the convention AR_abc_distance_referencename, where a = ORF option, b = reverse unbalanced mutations option, c = random option.

Artificial Reference: A nucleotide sequence generated from the input genome.
Distances.txt: A file which summarizes the differences between the artificial reference and the input genome in terms of nucleotide / amino acid distribution.
CompleteLog.txt: A log file which traces the nucleotide / amino acid changes between artificial reference and input genome.
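
When several artificial references are generated with different settings, it can help to recover the settings from the file name. The following is a minimal sketch that splits a name of the form AR_abc_distance_referencename into its parts. It assumes that each of a, b and c is a single digit and that the file may carry an extension such as .fasta; parse_ar_name and ArtificialRefName are hypothetical helper names, not part of ARDEN.

```python
import os
import re
from typing import NamedTuple


class ArtificialRefName(NamedTuple):
    orf: str                 # a = ORF option
    reverse_unbalanced: str  # b = reverse unbalanced mutations option
    random: str              # c = random option
    distance: int
    reference_name: str


# Assumption: a, b and c are one digit each; an optional extension is ignored.
_AR_PATTERN = re.compile(r"^AR_(\d)(\d)(\d)_(\d+)_(.+?)(?:\.[A-Za-z0-9]+)?$")


def parse_ar_name(path: str) -> ArtificialRefName:
    """Split an AR_abc_distance_referencename file name into its fields."""
    match = _AR_PATTERN.match(os.path.basename(path))
    if match is None:
        raise ValueError(f"not an AR_abc_distance_referencename file: {path!r}")
    a, b, c, distance, name = match.groups()
    return ArtificialRefName(a, b, c, int(distance), name)


print(parse_ar_name("/data/test/AR_111_21_zymamo.fasta"))
```

A reference name that itself contains a dot would be cut at its last dot by this pattern, which is one more reason to keep the name free of special characters.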

Mapping step

In this step, map the simulated reads to both the original reference and the newly created artificial reference. Any read mapper that supports the SAM format can be used; we chose Bowtie2 and RazerS3.
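
The guide does not list the mapper calls, so here is a hedged sketch of how the two Bowtie2 runs could be assembled. It only builds and prints the command lines (bowtie2-build to index a FASTA file, then bowtie2 with -x, -U and -S for unpaired reads and SAM output). The index prefixes and the output names Bow2_REF.sam and Bow2_ART.sam are assumptions chosen for this tutorial, and bowtie2_commands and run_all are hypothetical helpers.

```python
import shlex
import subprocess
from typing import List


def bowtie2_commands(fasta: str, reads: str, index_prefix: str,
                     out_sam: str) -> List[List[str]]:
    """Return the index-building and mapping commands for one reference."""
    return [
        ["bowtie2-build", fasta, index_prefix],
        ["bowtie2", "-x", index_prefix, "-U", reads, "-S", out_sam],
    ]


def run_all(commands: List[List[str]]) -> None:
    """Execute the commands in order; requires Bowtie2 on the PATH."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


reads = "/data/test/zymamo.fastq"
jobs = (
    # Original reference (index prefix and output name are assumptions).
    bowtie2_commands("/data/test/NC_006526.fna", reads,
                     "/data/test/REF_index", "/data/test/Bow2_REF.sam")
    # Artificial reference created in the previous step.
    + bowtie2_commands("/data/test/AR_111_21_zymamo.fasta", reads,
                       "/data/test/ART_index", "/data/test/Bow2_ART.sam")
)

for cmd in jobs:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Calling run_all(jobs) would run the four commands; printing them first makes it easy to check the paths before any mapping starts.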

Sorting step

To sort the reads, the following combination of SAMtools [1] commands can be used.

samtools view -F 4 -bS input.sam | samtools sort -no - - | samtools view -h - > output.sam

To sort several files in parallel, GNU parallel [3] can be used. The following command converts all *.sam files in the current directory using all available cores.

ls *.sam | parallel "samtools view -F 4 -bS {} | samtools sort -no - - | samtools view -h - > {.}_sorted.sam"
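
In the commands above, -F 4 drops every record whose FLAG has bit 0x4 set (read unmapped) and -n sorts the remaining records by read name. The sketch below mirrors those two operations in plain Python to make the logic explicit. It is meant for illustration on small files only: it uses ordinary string comparison, which is an assumption and can differ from the name ordering SAMtools applies, so it is not a replacement for the pipeline. mapped_sorted_by_name is a hypothetical helper name.

```python
from typing import Iterable, List


def mapped_sorted_by_name(sam_lines: Iterable[str]) -> List[str]:
    """Keep header lines, drop unmapped reads, sort records by read name."""
    header, records = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # SAM header lines (@HD, @SQ, ...) are passed through unchanged.
            header.append(line)
            continue
        flag = int(line.split("\t")[1])
        if flag & 0x4:
            # Bit 0x4 marks an unmapped read; this is what -F 4 filters out.
            continue
        records.append(line)
    # Plain string order on QNAME (assumption, see the note above).
    records.sort(key=lambda rec: rec.split("\t", 1)[0])
    return header + records


example = [
    "@HD\tVN:1.0\n",
    "read2\t0\tchr\t5\t30\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "read1\t16\tchr\t9\t30\t4M\t*\t0\t0\tACGT\tIIII\n",
]
for row in mapped_sorted_by_name(example):
    print(row)
```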

Analyzing both mappings

For this step the sorted .sam files and the input read file are needed. To keep the function call easy to handle (especially when multiple mappers are used), these parameters need to be defined in an extra file. We will now create this file, ControlFile.ini. The following table explains the character encoding:

Character	Variable
$	Reference fasta
#	Artificial fasta
&	Fastq file
@	Mapper ID
ref	alignment REF
art	alignment ART
+	End

For our example the ControlFile.ini should look like this:

$:/data/test/NC_006526.fna
#:/data/test/AR_111_21_zymamo.fasta
&:/data/test/zymamo.fastq
@Bow2
ref:/data/test/Bow2_REF_sorted.sam
art:/data/test/Bow2_ART_sorted.sam
+
@RazerS3
ref:/data/test/Raz3_REF_sorted.sam
art:/data/test/Raz3_ART_sorted.sam
+

After defining the .ini file, the program call to compare the alignments is:

arden-analyze /data/test/ControlFile.ini /data/test/results/
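
When many mappers are compared, writing the control file by hand becomes error-prone. The following is a minimal sketch that assembles its content from the character encoding in the table above. It assumes that every entry goes on its own line and that each mapper block is closed by a single +; build_control_file is a hypothetical helper and not part of ARDEN.

```python
from typing import Dict, Tuple


def build_control_file(reference: str, artificial: str, fastq: str,
                       mappers: Dict[str, Tuple[str, str]]) -> str:
    """Return the control file text for the given inputs.

    mappers maps a mapper ID to a pair of sorted SAM files:
    (alignment against the reference, alignment against the artificial one).
    """
    lines = [f"$:{reference}", f"#:{artificial}", f"&:{fastq}"]
    for mapper_id, (ref_sam, art_sam) in mappers.items():
        # One block per mapper: ID, both alignments, then the end marker.
        lines += [f"@{mapper_id}", f"ref:{ref_sam}", f"art:{art_sam}", "+"]
    return "\n".join(lines) + "\n"


control = build_control_file(
    "/data/test/NC_006526.fna",
    "/data/test/AR_111_21_zymamo.fasta",
    "/data/test/zymamo.fastq",
    {
        "Bow2": ("/data/test/Bow2_REF_sorted.sam",
                 "/data/test/Bow2_ART_sorted.sam"),
        "RazerS3": ("/data/test/Raz3_REF_sorted.sam",
                    "/data/test/Raz3_ART_sorted.sam"),
    },
)
print(control, end="")
```

The returned string can be written to /data/test/ControlFile.ini with an ordinary file write before calling arden-analyze.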

Generated Output

The program call will generate four types of output:

.esam files: These files contain the alignment information extracted by ARDEN in a simplified "sam" format (see here for details)
results.txt: This text file contains the results table with an overview of the numerical results (sensitivity, specificity, AUC) for all read mappers
ROC plots: Plots of the corresponding ROC curve for each mapper.
ROC tables: Tables for each mapper, where for each subclass the sensitivity, specificity and M are displayed. This table is useful to determine an adequate quality threshold.

Filtering (optional)

To filter alignments according to the features RQS (-r), gaps (-g) and mismatches (-m), run:

arden-filter input.sam output.sam -r value1 -m value2 -g value3

References

[1] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078-9. doi:10.1093/bioinformatics/btp352

[2] Holtgrewe, M. (2010). Mason - a read simulator for second generation sequencing data. Technical Report TR-B-10-06, Institut für Mathematik und Informatik, Freie Universität Berlin.

[3] Tange, O. (2011). GNU Parallel - The Command-Line Power Tool. ;login: The USENIX Magazine, February 2011, 42-47.