Arpeggio User's Guide
Introduction
This page explains how to install Arpeggio and use the different options included. It also provides a brief guide on how to install the additional software required to run Arpeggio, and a step-by-step tutorial to get your first Arpeggio profiles.
The guide assumes a linux64 machine and presents commands and configuration files to analyze a dataset under the folder /raid1. Of course, the locations can be changed to match the user's configuration, and the additional external software may also be available for other operating systems and architectures.
Arpeggio has been written in Java: the present page assumes that your system is correctly configured to execute Java applications.
Installation
Arpeggio
Arpeggio can be installed by copying the latest distribution of Arpeggio to a folder of choice (e.g. /raid1/software/Arpeggio/). The latest distribution can be downloaded from Sourceforge at:
Download Arpeggio
Additional required software
To facilitate the analyses, we recommend installing two additional open-source tools: Bowtie and the SRA toolkit.
Bowtie is used to align sequenced reads to the corresponding reference genome, and the SRA toolkit is used to download publicly available experiments from the Sequence Read Archive (SRA) repository. For completeness, we provide an illustrative installation procedure for both tools. Detailed instructions are available from the respective authors.
...on a Mac
Brew
In a Mac environment, both Bowtie and the SRAtoolkit can be conveniently installed using the package manager Brew (for more information, visit the Brew homepage). If Brew is already installed on your system, you can go directly to the instructions for installing Bowtie and the SRAtoolkit on a Mac computer.
Brew can be installed by executing the following command in a terminal window:
ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"
Bowtie and SRAtoolkit
With Brew, Bowtie and the SRAtoolkit can be installed by executing the following commands in a terminal window:
brew tap homebrew/science
brew install bowtie
brew install sratoolkit
We also recommend installing the download manager wget to facilitate the download of reference genomes. Wget is available on Brew and can be installed by executing the following command in a terminal window:
brew install wget
To map sequenced reads to a reference genome, the genome must first be obtained. Reference genomes can be obtained using an FTP client at ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/ or, alternatively, by executing the following commands in a terminal window (the human and mouse genomes are shown in the example):
mkdir /raid1/software/bowtie_genomes/
cd /raid1/software/bowtie_genomes/
wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip
unzip hg19.ebwt.zip
wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm9.ebwt.zip
unzip mm9.ebwt.zip
rm *.zip
...in Linux
Bowtie
In a Linux environment, Bowtie can be installed to the folder of choice (e.g. /raid1/software/) using the following commands in a terminal window:
cd /raid1/software
wget http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.0.0/bowtie-1.0.0-linux-x86_64.zip/download
mv download bowtie-1.0.0-linux-x86_64.zip
unzip bowtie-1.0.0-linux-x86_64.zip
rm bowtie-1.0.0-linux-x86_64.zip
To map sequenced reads to a reference genome, it is necessary to "build" the genome index first. The human and mouse reference genomes can be built using the following commands in a terminal window:
cd /raid1/software/bowtie-1.0.0
scripts/make_mm9.sh && rm *.fa
scripts/make_hg19.sh && rm *.fa
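After the build (or the download of a pre-built genome), it can be useful to confirm that the index is complete before starting a long analysis. The sketch below assumes the standard Bowtie 1 layout, in which an index with prefix hg19 consists of six files ending in .1.ebwt, .2.ebwt, .3.ebwt, .4.ebwt, .rev.1.ebwt and .rev.2.ebwt. The function name check_bowtie_index is a hypothetical helper and is not part of Bowtie or Arpeggio.

```shell
# Hypothetical helper: check that all six files of a Bowtie 1 index exist.
# Assumption: the index uses the small-index ".ebwt" file suffix.
# Usage: check_bowtie_index /path/to/index_prefix
check_bowtie_index() {
    prefix=$1
    missing=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        # -s is true only for files that exist and are not empty
        if [ ! -s "${prefix}.${suffix}.ebwt" ]; then
            echo "missing or empty: ${prefix}.${suffix}.ebwt" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For the layout used in this guide, the check could be run as check_bowtie_index /raid1/software/bowtie-1.0.0/hg19 and repeated for mm9.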
SRA toolkit
In a Linux environment, the SRA toolkit can be installed to the folder of choice (e.g. /raid1/software/) using the following commands in a terminal window:
cd /raid1/software
wget http://ftp-private.ncbi.nlm.nih.gov/sra/sdk/2.2.2a/sratoolkit.2.2.2a-centos_linux64.tar.gz
tar xvf sratoolkit.2.2.2a-centos_linux64.tar.gz
rm sratoolkit.2.2.2a-centos_linux64.tar.gz
mv sratoolkit.2.2.2a-centos_linux64/ sratoolkit.2.2.2a
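Once the tools are installed, a quick check that each executable can actually be found may save time later. The following is a minimal sketch; check_tool is a hypothetical helper name, and the paths in the usage note simply mirror the example locations used in this guide.

```shell
# Hypothetical helper: report whether an executable is available,
# given either as a full path or as a command name found on PATH.
check_tool() {
    tool=$1
    if { [ -f "$tool" ] && [ -x "$tool" ]; } || command -v "$tool" >/dev/null 2>&1; then
        echo "found: $tool"
    else
        echo "NOT found: $tool" >&2
        return 1
    fi
}
```

Typical calls would be check_tool java, check_tool /raid1/software/bowtie-1.0.0/bowtie and check_tool /raid1/software/sratoolkit.2.2.2a/bin/fastq-dump.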
Quick Start
Warning: NGS data takes up a lot of disk space. We recommend ensuring that enough disk space is available before starting any analysis.
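One possible way to check the free space from a terminal is sketched below. It relies on the POSIX output format of df -P, where the fourth column of the second line is the available space. The helper names available_kb and has_free_space are hypothetical, and the threshold is left to the user because the space needed depends on the dataset.

```shell
# Hypothetical helper: print the free space, in kilobytes, on the
# filesystem that holds the given directory.
available_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed only if at least needed_kb kilobytes are free.
has_free_space() {
    dir=$1
    needed_kb=$2
    [ "$(available_kb "$dir")" -ge "$needed_kb" ]
}
```

For example, has_free_space /raid1 104857600 || echo "less than 100 GB free on /raid1" prints a message when fewer than 100 GB are free; the 100 GB figure is only an illustration, not a requirement of Arpeggio.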
Arpeggio requires a configuration file named properties.txt and a data spreadsheet named data.csv. The software is also designed to keep the data in separate working folders, so that the data stays organized at every stage of the processing pipeline. The following three sections describe the working folder structure, the property file, and the data file.
Working folders structure
We suggest creating working folders where data will be stored, processed and evaluated in order to compute Arpeggio profiles. We will assume that Arpeggio has been installed in the folder /raid1/software/. In the following lists of commands, replace this folder with your folder of choice.
cd /raid1/software
mkdir -p /raid1/sra/
mkdir -p /raid1/sra/input_data
mkdir -p /raid1/sra/data/sra
mkdir -p /raid1/sra/data/fastq
mkdir -p /raid1/sra/data/sam
mkdir -p /raid1/sra/data/tmp
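For users who keep their data somewhere other than /raid1/sra, the same layout can be created under any base folder. The sketch below wraps the commands above in a function; make_arpeggio_dirs is a hypothetical helper name, and the base folder is passed as an argument.

```shell
# Hypothetical helper: create the working folder layout described above
# under an arbitrary base folder (for example /raid1/sra).
make_arpeggio_dirs() {
    base=$1
    mkdir -p "$base/input_data" || return 1
    for sub in sra fastq sam tmp; do
        mkdir -p "$base/data/$sub" || return 1
    done
}
```

Calling make_arpeggio_dirs /raid1/sra reproduces the folders created by the commands listed above.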
Configuring the Property file
For Arpeggio to run correctly, a property file called properties.txt is needed in the same folder as the data file data.csv. We suggest using the subfolder input_data in the working folder structure to store these two files.
The property file can be edited with a text editor, such as Emacs, and contains configuration parameters that facilitate and guide the execution of different parts of the pipeline. The available options that can be set by the user are:
- files.xxx, where xxx can be sra, fastq, sam, or tmp, specifies the folder in which files of the corresponding type are stored
- bowtie specifies the location of the bowtie executable
- bowtie.opt specifies the options to be passed to bowtie
- bowtie.genome.xxx, where xxx is a genome name such as hg19 or mm9, indicates the location of the corresponding reference genome
- fastq-dump specifies the location of the fastq-dump executable from the SRA toolkit
- arpeggio.window_size specifies the window size used for computing the autocorrelation
- arpeggio.remove_duplicate_reads sets whether duplicate reads are removed when computing autocorrelations
An illustrative property file can be downloaded from the Sourceforge repository (properties.txt for Linux, properties.txt for Mac). Alternatively, it can be obtained using the download manager wget:
for Linux
wget http://sourceforge.net/projects/arpeggio/files/samples/properties.txt -O properties.txt
for Mac
wget http://sourceforge.net/projects/arpeggio/files/samples/properties_mac.txt -O properties.txt
A typical properties.txt file reads as follows:
files.sra=/raid1/sra/data/sra
files.fastq=/raid1/sra/data/fastq
files.sam=/raid1/sra/data/sam
files.tmp=/raid1/sra/data/tmp
bowtie=/raid1/software/bowtie-1.0.0/bowtie
bowtie.opt=-n2 -k1 -m1 --best --strata --chunkmbs 512
bowtie.genome.mm9=/raid1/software/bowtie-1.0.0/mm9
bowtie.genome.hg19=/raid1/software/bowtie-1.0.0/hg19
fastq-dump=/raid1/software/sratoolkit.2.2.2a/bin/fastq-dump
arpeggio.window_size=8192
arpeggio.remove_duplicate_reads=true
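Because the Troubleshooting section notes that white space at the end of a file path makes Arpeggio fail with an unhelpful message, it may be worth checking the property file before a run. The following is a minimal sketch under that assumption: it flags lines ending in white space (including the carriage returns left by some Windows editors) and verifies that each files.xxx entry points to an existing folder. The function name check_properties is hypothetical and is not part of Arpeggio.

```shell
# Hypothetical helper: sanity-check a properties.txt file.
# Assumption: entries have the form key=value, one per line.
check_properties() {
    props=$1
    status=0
    # List (on stderr) every line that ends in a space, tab or carriage return.
    if grep -n '[[:space:]]$' "$props" >&2; then
        echo "trailing whitespace on the line(s) listed above" >&2
        status=1
    fi
    # Every files.xxx value should be an existing folder.
    while IFS= read -r dir; do
        [ -n "$dir" ] || continue
        if [ ! -d "$dir" ]; then
            echo "folder not found: '$dir'" >&2
            status=1
        fi
    done <<EOF
$(sed -n 's/^files\.[a-z]*=//p' "$props")
EOF
    return "$status"
}
```

With the suggested layout, the check would be run as check_properties /raid1/sra/input_data/properties.txt; it returns a non-zero status if any problem is found.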
Configuring the Data file
The data file data.csv, which can be edited with spreadsheet software such as Excel, is a table containing relevant information about each sequencing experiment. We suggest using the subfolder input_data in the working folder structure to store this file. While this table can be expanded arbitrarily to include any additional information needed by the user, the following columns, identified by their column headers, are necessary for Arpeggio to run correctly.
- experiment.name: unique experiment identifier
- genome: reference genome for mapping (e.g. hg19, mm9, etc.)
- exp_type: to be set to ChIP-seq
- layout: use single if the experiment uses single-end reads; use 5'-3'-3'-5', 3'-5'-5'-3', fr, rf, or ff for paired-end reads
- Run: the SRA identifier for the experiment (e.g. SRR054876 or ERR011988). It can be left empty if the experiment is not present on the SRA (see the note below)
- DNA_shearing: DNA shearing technique (e.g. Sonication, MNase, etc.). This information is used to find the best-matching controls for each experiment
- Protein: The protein investigated by the experiment. Note: IgG and DNAInput are used to identify control experiments
sample data file
Note: If "Run" is not available (e.g. unpublished data), a sequence file is needed in one of these formats: experiment_name.sra, experiment_name.fastq or experiment_name.sam. It should be copied to the corresponding /raid1/sra/data/file_type folder, where file_type is either sra, fastq, or sam.
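Since Arpeggio depends on the column headers listed above, a quick header check can catch a renamed or missing column before a run. The sketch below assumes that data.csv is comma-separated with the column names on its first line, and that the names contain no commas. The function name check_data_header is hypothetical.

```shell
# Hypothetical helper: verify that data.csv contains the required columns.
# Assumption: comma-separated file, column names on the first line.
check_data_header() {
    csv=$1
    # Drop carriage returns (CRLF line endings) and double quotes.
    header=$(head -n 1 "$csv" | tr -d '\r"')
    status=0
    for col in experiment.name genome exp_type layout Run DNA_shearing Protein; do
        case ",$header," in
            *",$col,"*) ;;
            *) echo "missing column: $col" >&2; status=1 ;;
        esac
    done
    return "$status"
}
```

Additional user-defined columns do not affect the check; only the seven required names are looked up.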
Running Arpeggio: Examples
Once the data file has been created and the property file has been correctly configured, running Arpeggio will build all the autocorrelation profiles. If SRA identifiers for the experiments have been specified, Arpeggio will download the corresponding data from the SRA, map it and build all autocorrelation profiles.
Note: downloading data from the SRA may take a long time depending on the speed of your internet connection.
The following commands can be executed with the illustrative configuration files provided on the Sourceforge repository.
First, download the data from the SRA, map it and build all autocorrelation profiles:
java -Xmx4g -server -jar /raid1/software/Arpeggio/Arpeggio.jar arpeggioInput/input_data bda >> arpeggiolog.txt
Once the autocorrelation profiles are calculated, the controls can be matched:
java -Xmx4g -server -jar /raid1/software/Arpeggio/Arpeggio.jar arpeggioInput/input_data ctr >> arpeggiolog.txt
Note: the Arpeggio profiles and matching controls in the paper were calculated using R code, so there may be slight numerical differences.
Finally, the Arpeggio profiles can be computed:
java -Xmx4g -server -jar /raid1/software/Arpeggio/Arpeggio.jar arpeggioInput/input_data arp >> arpeggiolog.txt
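The three stages above can also be chained in a small wrapper that stops at the first stage reporting an error. This is only a sketch: the variable ARPEGGIO_RUNNER and the function run_arpeggio_stages are hypothetical names, and the wrapper assumes that Arpeggio exits with a non-zero status when a stage fails, which this guide does not confirm. Because some failures are reported as silent in the Troubleshooting section, the log file should still be inspected after each run.

```shell
# Command used to launch Arpeggio; override it to match your installation.
ARPEGGIO_RUNNER=${ARPEGGIO_RUNNER:-"java -Xmx4g -server -jar /raid1/software/Arpeggio/Arpeggio.jar"}

# Hypothetical helper: run the bda, ctr and arp stages in order,
# appending all output to a log file.
run_arpeggio_stages() {
    input_dir=$1
    log_file=$2
    for stage in bda ctr arp; do
        echo "== stage $stage ==" >> "$log_file"
        # ARPEGGIO_RUNNER is left unquoted on purpose so that it splits
        # into the command and its options.
        if ! $ARPEGGIO_RUNNER "$input_dir" "$stage" >> "$log_file" 2>&1; then
            echo "stage $stage exited with an error; see $log_file" >&2
            return 1
        fi
    done
}
```

With the example paths used above, the call would be run_arpeggio_stages arpeggioInput/input_data arpeggiolog.txt.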
Running Arpeggio: your own samples
- Place your sample fastq file into the directory specified for fastq files in the properties.txt file. If you already have an alignment file (sam format), place it in the directory specified for sam files instead.
- Add an entry for your sample to the data.csv file filling out the necessary fields.
- Run Arpeggio as previously described.
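The first step can be scripted by reading the fastq folder from properties.txt instead of typing it by hand. The sketch below assumes key=value lines as in the example property file; prop_get and add_fastq_sample are hypothetical helper names. The fastq file is expected to be named after the experiment (experiment_name.fastq), as described in the note on the data file.

```shell
# Hypothetical helper: print the value of a key from a properties file,
# i.e. everything after "key=" on the first line that starts with it.
prop_get() {
    awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }' "$2"
}

# Hypothetical helper: copy a fastq file into the folder named by files.fastq.
add_fastq_sample() {
    fastq=$1
    props=$2
    dest=$(prop_get files.fastq "$props")
    if [ ! -d "$dest" ]; then
        echo "fastq folder not found: '$dest'" >&2
        return 1
    fi
    cp "$fastq" "$dest/"
}
```

A call such as add_fastq_sample my_experiment.fastq /raid1/sra/input_data/properties.txt (file name chosen for illustration) copies the file to the configured folder; the matching entry still has to be added to data.csv.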
Arpeggio Output
The output of Arpeggio is in the input_data folder and is as follows:
- Arpeggio.prof.csv
  - Contains the Arpeggio profiles for the specified window size. Each column represents a single sample. Column headers are the sample names.
- Arpeggio.Star.prof.csv
  - Contains the Arpeggio fragment length distributions. Each column represents a single sample. Column headers are the sample names.
- BDAOS.prof.csv
  - Contains the positive to negative strand cross correlations for the specified window size. Each column represents a single sample. Column headers are the sample names.
- BDASS.prof.csv
  - Contains the sample autocorrelations for the specified window size. Each column represents a single sample. Column headers are the sample names.
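A quick way to confirm that a run produced all four files is sketched below. The function name check_outputs is hypothetical; it reports the number of comma-separated columns in the first line of each file, which may or may not include a row-label column depending on how Arpeggio writes the files.

```shell
# Hypothetical helper: check that the four output files exist in the
# input_data folder and report how many columns each header line has.
check_outputs() {
    dir=$1
    status=0
    for f in Arpeggio.prof.csv Arpeggio.Star.prof.csv BDAOS.prof.csv BDASS.prof.csv; do
        if [ -s "$dir/$f" ]; then
            cols=$(head -n 1 "$dir/$f" | awk -F, '{ print NF }')
            echo "$f: $cols column(s)"
        else
            echo "$f: missing or empty" >&2
            status=1
        fi
    done
    return "$status"
}
```

With the suggested layout, the check would be run as check_outputs /raid1/sra/input_data.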
To create the nucleosome plots (figure 1 from the paper), run the following code in R after setting your working directory to the input_data directory with setwd(). Note that if you provided your own alignment rather than mapping with Arpeggio, you will need to fill in the BDA.flattened.reads field in data.csv with the number of unique mapped reads in the sample.
Arpeggio = read.table("Arpeggio.prof.csv",
header=TRUE, sep=",",row.names=NULL)
BDAannotation = read.table("data.csv",
header=TRUE, sep=",",row.names=NULL)
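# Smooth x with a running mean over a sliding window of roughly n points.
# Column i of y holds a shifted copy of x; averaging each row while
# ignoring the NA padding gives the mean over the window.
# Fractional indices (n/2 for odd n) are truncated by R.
runmean=function(x,n){
  d1=length(x)
  d2=n
  y = matrix(NA,length(x),d2)
  for(i in 1:d2){
    j = max(1,d2/2 - i + 1,na.rm=TRUE):min(d1,d1 + d2/2 - i,na.rm=TRUE)
    k = max(1,i - d2/2 + 1,na.rm=TRUE):min(d1,d1 - d2/2 + i,na.rm=TRUE)
    y[k,i] = as.double(x[j])
  }
  apply(y,1,mean,na.rm=TRUE)
}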
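# Reorder and smooth one profile for plotting.
# The argument is named Arpeg so that the body uses the value passed in
# rather than a global variable of the same name.
makePretty = function(Arpeg){
  # swap the two halves of the vector
  Arpeg = Arpeg[
    c((length(Arpeg)/2+1):length(Arpeg),
    1:(length(Arpeg)/2))]
  #remove large spike: replace the three central points by values
  #extrapolated or interpolated from their neighbours
  Arpeg[length(Arpeg)/2] = (Arpeg[length(Arpeg)/2-1]*2 -
    Arpeg[length(Arpeg)/2-2])
  Arpeg[length(Arpeg)/2+2] = (Arpeg[length(Arpeg)/2+3]*2 -
    Arpeg[length(Arpeg)/2+4])
  Arpeg[length(Arpeg)/2+1] = (Arpeg[length(Arpeg)/2] +
    Arpeg[length(Arpeg)/2+2])/2
  # smooth with a 31-point running mean
  Arpeg = runmean(Arpeg,31)
  return(Arpeg)
}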
Arpeg=Arpeggio[,"H3K36me3_CD4T_NULL_1_C"]
readsExp=BDAannotation[
BDAannotation[,"experiment.name"]=="H3K36me3_CD4T_NULL_1_C",
"BDA.flattened.reads"]
readsCtrl=BDAannotation[
BDAannotation[,"experiment.name"]=="DNAInput_c2c12_MT_4",
"BDA.flattened.reads"]
ratio = readsExp^2/readsCtrl^2
Arpeg = Arpeg/ratio
Arpeg = makePretty(Arpeg)
par(mgp = c(1.4, 0.5, 0))
plot(
(-length(Arpeg)/4):(length(Arpeg)*1/4),
Arpeg[(length(Arpeg)/4):(length(Arpeg)*3/4)],
type="l",
ylab = bquote(.("IP signal") ~ (frac(Reads~ChIP,Total~Reads~ChIP))^2 ~ .("/") ~ (frac(Reads~Control,Total~Reads~Control))^2 ),
xlab = "Offset(bp)", col="blue", ylim=c(0, 0.002))
legend("topright", c("H3K36me3","H3K27me3", "H3K27Ac", "pRB"), cex=1.5,fill=c("blue","green","black","red")
)
Arpeg=Arpeggio[,"H3K27me3_c2c12_MT_1"]
readsExp=BDAannotation[
BDAannotation[,"experiment.name"]=="H3K27me3_c2c12_MT_1",
"BDA.flattened.reads"]
readsCtrl=BDAannotation[
BDAannotation[,"experiment.name"]=="DNAInput_c2c12_MT_5",
"BDA.flattened.reads"]
ratio = readsExp^2/readsCtrl^2
Arpeg = Arpeg/ratio
Arpeg = makePretty(Arpeg)
lines(
(-length(Arpeg)/4):(length(Arpeg)*1/4),
Arpeg[(length(Arpeg)/4):(length(Arpeg)*3/4)],
col="green"
)
Arpeg=Arpeggio[,"pRb_GroFib_NULL_2"]
readsExp=BDAannotation[
BDAannotation[,"experiment.name"]=="pRb_GroFib_NULL_2",
"BDA.flattened.reads"]
readsCtrl=BDAannotation[
BDAannotation[,"experiment.name"]=="IgG_VCaP_NULL_6",
"BDA.flattened.reads"]
ratio = readsExp^2/readsCtrl^2
Arpeg = Arpeg/ratio
Arpeg = makePretty(Arpeg)
lines(
(-length(Arpeg)/4):(length(Arpeg)*1/4),
Arpeg[(length(Arpeg)/4):(length(Arpeg)*3/4)],
col="red"
)
Arpeg=Arpeggio[,"H3K27Ac_HESC_NULL_2"]
readsExp=BDAannotation[
BDAannotation[,"experiment.name"]=="H3K27Ac_HESC_NULL_2",
"BDA.flattened.reads"]
readsCtrl=BDAannotation[
BDAannotation[,"experiment.name"]=="IgG_MCF7_NULL_1",
"BDA.flattened.reads"]
ratio = readsExp^2/readsCtrl^2
Arpeg = Arpeg/ratio
Arpeg = makePretty(Arpeg)
lines(
(-length(Arpeg)/4):(length(Arpeg)*1/4),
Arpeg[(length(Arpeg)/4):(length(Arpeg)*3/4)],
col="black"
)
References
TBA
Likely Asked Questions
Bug Reporting
What is a bug?
Taking forever to complete a command can be a bug, but you must make certain that it was really Arpeggio's fault. Some commands simply take a long time. If the input was such that you know it should have been processed quickly, report a bug. If you don't know whether the command should take a long time, find out by looking in the manual or by asking for assistance.
It is very useful to try and find simple examples that produce apparently the same bug, and somewhat useful to find simple examples that might be expected to produce the bug but actually do not. If you want to debug the problem and find exactly what caused it, that is wonderful. You should still report the facts as well as any explanations or solutions. Please include an example that reproduces the problem, preferably the simplest one you have found.
Troubleshooting and Known Bugs
- If Arpeggio cannot find a file, or a file name is incorrect, it simply reports "Can't run" rather than indicating that the file could not be found.
- Arpeggio sometimes catches errors that are then not reported if the log variable is not filled out.
- If Arpeggio fails to create an output column, it fills it with NaNs, which then sets "shouldRun" to false the next time you run it. Workaround: delete the output files between runs.
- If Arpeggio cannot delete the directory (input_data), or otherwise fails after it deletes the files but before it writes the new ones, the user-made data.csv and properties.txt will be wiped out. Workaround: back up the data.csv and properties.txt files and restore them if they get deleted.
- Arpeggio will fail if there is white space at the end of a file path. For instance, if you have files.sam=/media/raid1/sra/data/sam and there is a space after sam, it will try to locate a directory named "sam ". The error currently reported does not reflect that the directory cannot be found.
- If an alignment file is provided (sam was tested) and is not in the correct format, Arpeggio fails silently without informing the user that the file format is incorrect.
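Two of the workarounds above can be scripted so that they are not forgotten between runs. The sketch below is one possible approach: backup_config copies data.csv and properties.txt to a separate folder, and clear_outputs removes the output files from a previous run. Both function names are hypothetical, and the list of output files is assumed to be the four files described in the Arpeggio Output section.

```shell
# Hypothetical helper: keep a copy of the two user-made files outside
# the input_data folder before running Arpeggio.
backup_config() {
    input_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    cp -p "$input_dir/data.csv" "$input_dir/properties.txt" "$backup_dir/"
}

# Hypothetical helper: delete the output files of a previous run.
# Assumption: the outputs are the four files listed in "Arpeggio Output".
clear_outputs() {
    input_dir=$1
    rm -f "$input_dir/Arpeggio.prof.csv" "$input_dir/Arpeggio.Star.prof.csv" \
          "$input_dir/BDAOS.prof.csv" "$input_dir/BDASS.prof.csv"
}
```

A run could then be preceded by backup_config /raid1/sra/input_data /raid1/sra/backup and clear_outputs /raid1/sra/input_data, where the backup folder name is only an example.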