Arpeggio Wiki

Harmonic analysis of ChIP-seq experiments

Brought to you by: fabio-x, francescostrino, kstant0725

Authors:

Arpeggio User's Guide

Introduction
Installation

Arpeggio
Additional required software

...on a Mac
...in Linux

Quick start

Working folders structure
Configuration property file (properties.txt)
Data spreadsheet (data.csv)

Running Arpeggio: Examples
Arpeggio: your own Samples
Arpeggio Output
References
Likely Asked Questions
Bug Reporting
Troubleshooting and Known Bugs

Introduction

This page explains how to install Arpeggio and how to use its options. It also gives a brief guide to installing the additional software required to run Arpeggio, and a step-by-step tutorial for computing your first Arpeggio profiles. The guide assumes a 64-bit Linux machine and presents commands and configuration files for analyzing a dataset under the folder /raid1. The locations can be changed to match your own configuration, and the additional external software may also be available for other operating systems and architectures. Arpeggio is written in Java; this page assumes that your system is correctly configured to run Java applications.

Installation

Arpeggio

Arpeggio can be installed by copying the latest distribution of Arpeggio to a folder of choice (e.g. /raid1/software/Arpeggio/). The latest distribution can be downloaded from Sourceforge at:
Download Arpeggio

Additional required software

To facilitate the analyses, we recommend installing two additional open-source tools: Bowtie and the SRA toolkit.
Bowtie is used to align sequenced reads to the corresponding reference genome, and the SRA toolkit is used to download publicly available experiments from the Sequence Read Archive (SRA, formerly the Short Read Archive) repository. For completeness, we provide an illustrative installation procedure for both tools. Detailed instructions are available from the respective authors.

...on a Mac

Brew

In a Mac environment, both Bowtie and the SRA toolkit can be conveniently installed using the package manager Brew (for more information, visit the Brew homepage). If Brew is already installed on your system, you can go directly to the instructions for installing Bowtie and the SRA toolkit on a Mac computer.

Brew can be installed by executing the following command in a terminal window:

ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

Bowtie and SRAtoolkit

With Brew, Bowtie and the SRA toolkit can be installed by executing the following commands in a terminal window:

brew tap homebrew/science
brew install bowtie
brew install sratoolkit

We also recommend installing the download manager wget to make it easier to download reference genomes. Wget is available on Brew and can be installed by executing the following command in a terminal window:

brew install wget

In order to map sequenced reads to the reference genome, it is necessary to obtain the genomes. Reference genomes can be obtained using an FTP client at ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/ or, alternatively, by executing the following commands in a terminal window (human and mouse genomes are shown in the example):

mkdir /raid1/software/bowtie_genomes/
cd /raid1/software/bowtie_genomes/

wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/hg19.ebwt.zip
unzip hg19.ebwt.zip

wget ftp://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/mm9.ebwt.zip
unzip mm9.ebwt.zip

rm *.zip

...in Linux

Bowtie

In a Linux environment, Bowtie can be installed to the folder of choice (e.g. /raid1/software/) using the following commands in a terminal window:

cd /raid1/software
wget http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.0.0/bowtie-1.0.0-linux-x86_64.zip/download
mv download bowtie-1.0.0-linux-x86_64.zip
unzip bowtie-1.0.0-linux-x86_64.zip
rm bowtie-1.0.0-linux-x86_64.zip

In order to map sequenced reads to the reference genome, it is necessary to "build" the genomes. The human and mouse reference genomes can be built using the following commands in a terminal window:

cd /raid1/software/bowtie-1.0.0
scripts/make_mm9.sh && rm *.fa
scripts/make_hg19.sh && rm *.fa

SRA toolkit

In a Linux environment, the SRA toolkit can be installed to the folder of choice (e.g. /raid1/software/) using the following commands in a terminal window:

cd /raid1/software
wget http://ftp-private.ncbi.nlm.nih.gov/sra/sdk/2.2.2a/sratoolkit.2.2.2a-centos_linux64.tar.gz
tar xvf sratoolkit.2.2.2a-centos_linux64.tar.gz
rm sratoolkit.2.2.2a-centos_linux64.tar.gz
mv sratoolkit.2.2.2a-centos_linux64/ sratoolkit.2.2.2a

Quick Start

Warning: NGS data is space consuming; we recommend ensuring that enough disk space is available before starting any analysis. Arpeggio requires a configuration file named properties.txt and a data spreadsheet named data.csv. The software also handles the data in separate working folders, so that the data stays organized at every stage of the processing pipeline. The following three sections describe:

How to create a working folders structure
How to create a configuration property file
How to create and organize a data spreadsheet

Working folders structure

We suggest creating working folders where data will be stored, processed and evaluated in order to compute Arpeggio profiles. We will assume that Arpeggio has been installed in the folder /raid1/software/; in the following lists of commands, replace the folder with your folder of choice.

cd /raid1/software
mkdir -p /raid1/sra/
mkdir -p /raid1/sra/input_data
mkdir -p /raid1/sra/data/sra
mkdir -p /raid1/sra/data/fastq
mkdir -p /raid1/sra/data/sam
mkdir -p /raid1/sra/data/tmp
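
The same layout can be created under any base directory, and the free disk space checked at the same time. The following is only a sketch: the function names make_working_folders and free_space_gb are illustrative and are not part of Arpeggio.

```shell
#!/bin/sh
# Sketch: create the Arpeggio working folders under a configurable base
# directory and report the free space there. Function names are illustrative.

# make_working_folders BASE: create BASE/input_data and BASE/data/{sra,fastq,sam,tmp}.
make_working_folders() {
    base="$1"
    for sub in input_data data/sra data/fastq data/sam data/tmp; do
        mkdir -p "$base/$sub" || return 1
    done
}

# free_space_gb DIR: print the free space, in whole gigabytes, of the
# file system that holds DIR.
free_space_gb() {
    # "df -Pk" prints a header and then one line per file system;
    # column 4 is the available space in 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Example with the location used in this guide (uncomment to run):
# make_working_folders /raid1/sra && free_space_gb /raid1/sra
```

How much free space is enough depends on the number and size of the experiments, so no threshold is suggested here.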

Configuring the Property file

For Arpeggio to run correctly, a property file called properties.txt is needed in the same folder as the data file data.csv. We suggest using the subfolder input_data in the working folder structure for storing these two files.
The property file can be edited with a text editor, such as Emacs, and contains configuration parameters that guide the execution of the different parts of the pipeline. The options that can be set by the user are:

files.xxx, where xxx can be sra, fastq, sam, or tmp, specifies the folder in which files of the corresponding type are located
bowtie specifies the location of the bowtie executable
bowtie.opt specifies the options to be passed to bowtie
bowtie.genome.xxx, where xxx is a genome name (for instance hg19 or mm9), specifies the location of the corresponding reference genome
fastq-dump specifies the location of the fastq-dump executable of the SRA toolkit (this key appears in the sample file below)
arpeggio.window_size specifies the window size used for computing the autocorrelation
arpeggio.remove_duplicate_reads sets whether duplicate reads should be removed when computing autocorrelations

An illustrative property file can be downloaded from the Sourceforge repository (properties.txt for Linux, properties_mac.txt for Mac). Alternatively, it can be obtained using the download manager wget:

for Linux
wget http://sourceforge.net/projects/arpeggio/files/samples/properties.txt -O properties.txt

for Mac
wget http://sourceforge.net/projects/arpeggio/files/samples/properties_mac.txt -O properties.txt

A typical properties.txt file reads as follows:

  files.sra=/raid1/sra/data/sra
  files.fastq=/raid1/sra/data/fastq
  files.sam=/raid1/sra/data/sam
  files.tmp=/raid1/sra/data/tmp
  bowtie=/raid1/software/bowtie-1.0.0/bowtie
  bowtie.opt=-n2 -k1 -m1 --best --strata --chunkmbs 512
  bowtie.genome.mm9=/raid1/software/bowtie-1.0.0/mm9
  bowtie.genome.hg19=/raid1/software/bowtie-1.0.0/hg19
  fastq-dump=/raid1/software/sratoolkit.2.2.2a/bin/fastq-dump
  arpeggio.window_size=8192
  arpeggio.remove_duplicate_reads=true
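
Before running Arpeggio, it can be useful to confirm that the folders named in the property file actually exist. The sketch below assumes the plain key=value layout shown above. The helper names get_prop and check_file_folders are hypothetical and not part of Arpeggio.

```shell
#!/bin/sh
# Sketch: read values from a key=value property file and check that the
# files.sra, files.fastq, files.sam and files.tmp folders exist.
# Helper names are hypothetical; Arpeggio does not provide them.

# get_prop KEY FILE: print the value of KEY (first match only).
get_prop() {
    # Escape the dots in KEY so that they match literally in the pattern.
    key=$(printf '%s' "$1" | sed 's/\./\\./g')
    # Allow leading blanks, strip "KEY=", and print the rest of the line.
    sed -n "s/^[[:space:]]*${key}=//p" "$2" | head -n 1
}

# check_file_folders FILE: report unset keys and missing folders.
# Returns 0 only if all four folders exist.
check_file_folders() {
    props="$1"
    status=0
    for kind in sra fastq sam tmp; do
        dir=$(get_prop "files.$kind" "$props")
        if [ -z "$dir" ]; then
            echo "files.$kind is not set"
            status=1
        elif [ ! -d "$dir" ]; then
            echo "files.$kind points to a missing folder: '$dir'"
            status=1
        fi
    done
    return $status
}

# Example (uncomment to run):
# check_file_folders /raid1/sra/input_data/properties.txt
```

The value is printed exactly as written in the file, so a path with a stray trailing space is reported as a missing folder, with the quotes in the message making the space visible.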

Configuring the Data file

The data file data.csv, which can be edited with spreadsheet software such as Excel, is a table containing relevant information about each sequencing experiment. We suggest using the subfolder input_data in the working folder structure for storing this file. While this table can be expanded arbitrarily to include any additional information needed by the user, the following columns, identified by their column headers, are necessary for Arpeggio to run correctly.

experiment.name: unique experiment identifier
genome: reference genome for mapping (e.g. hg19 or mm9)
exp_type: to be set to ChIP-seq
layout: use single if the experiment uses single-end reads; use 5'-3'-3'-5', 3'-5'-5'-3', fr, rf, or ff for paired-end reads
Run: the SRA identifier for the experiment (e.g. SRR054876 or ERR011988). It can be left empty if the experiment is not present on the SRA, see below
DNA_shearing: DNA shearing technique (e.g. Sonication or MNase). This information is used to find the best-matching control for each experiment
Protein: the protein investigated by the experiment. Note: IgG and DNAInput are used to identify control experiments

sample data file

Note: If "Run" is not available (e.g. unpublished data), a sequence file is needed in one of these formats: experiment_name.sra, experiment_name.fastq or experiment_name.sam. It should be copied to the corresponding /raid1/sra/data/file_type folder, where file_type is either sra, fastq, or sam.
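
A quick way to catch a missing required column is to inspect the header line of data.csv. The following sketch assumes a comma-separated file whose first line holds the column headers, with no commas inside the header names. The function name missing_columns is illustrative.

```shell
#!/bin/sh
# Sketch: list the required columns that are absent from the header line of
# a comma-separated data file. Assumes the header is on the first line.

# missing_columns FILE: print one missing column name per line
# (prints nothing if all required columns are present).
missing_columns() {
    # Drop carriage returns and double quotes that spreadsheet software may add.
    header=$(head -n 1 "$1" | tr -d '\r"')
    for col in experiment.name genome exp_type layout Run DNA_shearing Protein; do
        case ",$header," in
            *",$col,"*) ;;          # column present
            *) echo "$col" ;;       # column missing
        esac
    done
}

# Example (uncomment to run):
# missing_columns /raid1/sra/input_data/data.csv
```

An empty result only means that the headers are present; it does not validate the values in each row.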

Running Arpeggio: Examples

Once the data file has been created and the property file has been correctly configured, running Arpeggio will build all the autocorrelation profiles. If SRA identifiers for the experiments have been specified, Arpeggio will download the corresponding data from the SRA, map it and build all autocorrelation profiles. Note: downloading data from the SRA may take a long time depending on the speed of your internet connection.

The following series of commands can be executed using the illustrative configuration files provided on the Sourceforge repository. First, download the data from the SRA, map it and build all autocorrelation profiles:

java -Xmx4g -server -jar /raid1/software/Arpeggio/Arpeggio.jar arpeggioInput/input_data bda >> arpeggiolog.txt

Once the autocorrelation profiles are calculated, the controls can be matched:

java -Xmx4g -server -jar /raid1/software/Arpeggio/Arpeggio.jar arpeggioInput/input_data ctr >> arpeggiolog.txt

Note: the Arpeggio profiles and matching controls in the paper were calculated using R code, so there may be slight numerical differences.
Finally, the Arpeggio profiles can be computed:

java -Xmx4g -server -jar /raid1/software/Arpeggio/Arpeggio.jar arpeggioInput/input_data arp >> arpeggiolog.txt
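
The three stages above (bda, ctr, arp) can also be chained in a small wrapper that stops at the first failing stage. This is only a sketch. It assumes that Arpeggio exits with a non-zero status when a stage fails, which this guide does not confirm. The function name run_arpeggio_stages and the variable ARPEGGIO_LAUNCHER are hypothetical. Unlike the commands above, the wrapper also appends standard error to the log.

```shell
#!/bin/sh
# Sketch: run the bda, ctr and arp stages in order, appending all output to
# a log file and stopping at the first stage that returns a non-zero status.
# ASSUMPTION: Arpeggio signals failure through its exit status.

# run_arpeggio_stages INPUT_DIR [LOG_FILE]
run_arpeggio_stages() {
    input_dir="$1"
    log="${2:-arpeggiolog.txt}"
    # ARPEGGIO_LAUNCHER is a hypothetical override; the default is the
    # command line used in this guide.
    launcher="${ARPEGGIO_LAUNCHER:-java -Xmx4g -server -jar /raid1/software/Arpeggio/Arpeggio.jar}"
    for stage in bda ctr arp; do
        echo "== stage $stage started: $(date) ==" >> "$log"
        # $launcher is intentionally unquoted so that it splits into words.
        $launcher "$input_dir" "$stage" >> "$log" 2>&1 || {
            echo "stage $stage failed; see $log" >&2
            return 1
        }
    done
}

# Example (uncomment to run):
# run_arpeggio_stages arpeggioInput/input_data arpeggiolog.txt
```

Because some failures are reported only in the log (see Troubleshooting and Known Bugs), it is still worth reading the log after a run that appears to succeed.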

Running Arpeggio: your own samples

Place your sample fastq file in the directory specified for fastq files in the properties.txt file. If you already have an alignment file (sam format), place it in the directory specified for sam files instead.
Add an entry for your sample to the data.csv file, filling out the necessary fields.
Run Arpeggio as previously described.
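
The first step can be sketched as a copy that renames the file to experiment_name.file_type, the naming described in the data file section. The function name stage_sample and its arguments are illustrative. The destination folder is whatever the matching files.xxx entry in properties.txt points to.

```shell
#!/bin/sh
# Sketch: copy a sequence file into the folder for its type, naming it
# after the experiment (experiment_name.sra, .fastq or .sam).

# stage_sample SOURCE_FILE EXPERIMENT_NAME DEST_DIR
stage_sample() {
    src="$1"
    name="$2"
    dest_dir="$3"
    # Pick the extension from the source file name.
    case "$src" in
        *.sra)   ext=sra ;;
        *.fastq) ext=fastq ;;
        *.sam)   ext=sam ;;
        *) echo "unsupported file type: $src" >&2; return 1 ;;
    esac
    [ -d "$dest_dir" ] || { echo "missing folder: $dest_dir" >&2; return 1; }
    cp "$src" "$dest_dir/$name.$ext"
}

# Example (uncomment to run; my_experiment must match experiment.name in data.csv):
# stage_sample reads.fastq my_experiment /raid1/sra/data/fastq
```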

Arpeggio Output

The output of Arpeggio is in the input_data folder and is as follows:

Arpeggio.prof.csv - Contains the Arpeggio profiles for the specified window size. Each column represents a single sample. Column headers are the sample names.
Arpeggio.Star.prof.csv - Contains the Arpeggio fragment length distributions. Each column represents a single sample. Column headers are the sample names.
BDAOS.prof.csv - Contains the positive to negative strand cross correlations for the specified window size. Each column represents a single sample. Column headers are the sample names.
BDASS.prof.csv - Contains the sample autocorrelations for the specified window size. Each column represents a single sample. Column headers are the sample names.
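
Since the column headers of these files are the sample names, the samples present in an output file can be listed from its first line. The sketch below assumes a comma-separated file with the headers on the first line. The function name list_samples is illustrative.

```shell
#!/bin/sh
# Sketch: print the column headers (sample names) of a profile file,
# one per line. Assumes comma-separated values with a header line.

# list_samples FILE
list_samples() {
    head -n 1 "$1" |
        tr -d '\r"' |       # drop carriage returns and quoting
        tr ',' '\n' |       # one header per line
        sed '/^$/d'         # skip empty header cells
}

# Example (uncomment to run from the input_data folder):
# list_samples Arpeggio.prof.csv
```

Comparing this list with the experiment.name column of data.csv is a simple way to spot samples for which no profile was produced.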

To create the nucleosome plots (figure 1 from the paper), run the following code in R after setting your working directory to the input_data directory with setwd(). Note that if you provided your own alignment rather than mapping with Arpeggio, you will need to fill out the BDA.flattened.reads field in data.csv with the number of unique mapped reads in the sample.

# Load the Arpeggio profiles and the annotated data spreadsheet
Arpeggio = read.table("Arpeggio.prof.csv",
        header=TRUE, sep=",", row.names=NULL)
BDAannotation = read.table("data.csv",
        header=TRUE, sep=",", row.names=NULL)

# Running mean of x over a window of n points
runmean = function(x,n){
  d1 = length(x)
  d2 = n
  y = matrix(NA,length(x),d2)
  for(i in 1:d2){
    j = max(1,d2/2 - i + 1,na.rm=TRUE):min(d1,d1 + d2/2 - i,na.rm=TRUE)
    k = max(1,i - d2/2 + 1,na.rm=TRUE):min(d1,d1 - d2/2 + i,na.rm=TRUE)
    y[k,i] = as.double(x[j])
  }
  apply(y,1,mean,na.rm=TRUE)
}

# Swap the two halves of the profile, replace the large spike with values
# derived from its neighbours, and smooth with a running mean
makePretty = function(Arpeg){
  Arpeg = Arpeg[
    c((length(Arpeg)/2+1):length(Arpeg),
    1:(length(Arpeg)/2))]
  # remove large spike
  Arpeg[length(Arpeg)/2] = (Arpeg[length(Arpeg)/2-1]*2 -
        Arpeg[length(Arpeg)/2-2])
  Arpeg[length(Arpeg)/2+2] = (Arpeg[length(Arpeg)/2+3]*2 -
        Arpeg[length(Arpeg)/2+4])
  Arpeg[length(Arpeg)/2+1] = (Arpeg[length(Arpeg)/2] +
        Arpeg[length(Arpeg)/2+2])/2
  Arpeg = runmean(Arpeg,31)
  return(Arpeg)
}

# Profile of one experiment, divided by the squared ratio of its read count
# to the read count of its control, then passed through makePretty
normalizedProfile = function(expName, ctrlName){
  readsExp = BDAannotation[
    BDAannotation[,"experiment.name"]==expName,
    "BDA.flattened.reads"]
  readsCtrl = BDAannotation[
    BDAannotation[,"experiment.name"]==ctrlName,
    "BDA.flattened.reads"]
  ratio = readsExp^2/readsCtrl^2
  makePretty(Arpeggio[,expName]/ratio)
}

# First profile: set up the plot
Arpeg = normalizedProfile("H3K36me3_CD4T_NULL_1_C", "DNAInput_c2c12_MT_4")
offsets = (-length(Arpeg)/4):(length(Arpeg)*1/4)
central = (length(Arpeg)/4):(length(Arpeg)*3/4)
par(mgp = c(1.4, 0.5, 0))
plot(
    offsets,
    Arpeg[central],
    type="l",
    ylab = bquote(.("IP signal") ~ (frac(Reads~ChIP,Total~Reads~ChIP))^2 ~ .("/") ~ (frac(Reads~Control,Total~Reads~Control))^2 ),
    xlab = "Offset(bp)", col="blue", ylim=c(0, 0.002))
legend("topright", c("H3K36me3","H3K27me3", "H3K27Ac", "pRB"),
    cex=1.5,fill=c("blue","green","black","red") )

# Remaining profiles: add one line per experiment
Arpeg = normalizedProfile("H3K27me3_c2c12_MT_1", "DNAInput_c2c12_MT_5")
lines(offsets, Arpeg[central], col="green")

Arpeg = normalizedProfile("pRb_GroFib_NULL_2", "IgG_VCaP_NULL_6")
lines(offsets, Arpeg[central], col="red")

Arpeg = normalizedProfile("H3K27Ac_HESC_NULL_2", "IgG_MCF7_NULL_1")
lines(offsets, Arpeg[central], col="black")

References

TBA

Likely Asked Questions

Reporting bugs

Bug Reporting

What is a bug?: Taking forever to complete a command can be a bug, but you must make certain that it was really Arpeggio's fault. Some commands simply take a long time. If the input was such that you know it should have been processed quickly, report a bug. If you don't know whether the command should take a long time, find out by looking in the manual or by asking for assistance.
It is very useful to try and find simple examples that produce apparently the same bug, and somewhat useful to find simple examples that might be expected to produce the bug but actually do not. If you want to debug the problem and find exactly what caused it, that is wonderful. You should still report the facts as well as any explanations or solutions. Please include an example that reproduces the problem, preferably the simplest one you have found.

Troubleshooting and Known Bugs

If Arpeggio cannot find a file, or the file name is incorrect, it simply reports "Can't run" rather than indicating that it could not find the file.
Sometimes Arpeggio catches errors, but they are not reported if the log variable is not filled out.
If Arpeggio fails to create an output column, it fills it with NaNs, which then sets "shouldRun" to false the next time you run it. Workaround: delete the output files between runs.
If Arpeggio cannot delete the directory (input_data), or otherwise fails after it deletes the files but before it writes the new ones, the user-made data.csv and properties.txt will be wiped out. Workaround: back up the data.csv and properties.txt files and restore them if they get deleted.
Arpeggio will fail if there is white space at the end of a file path. For instance, if you have files.sam=/media/raid1/sra/data/sam and there is a space after sam, it will try to locate a directory "sam ". The error currently reported does not reflect that the directory cannot be found.
If an alignment file is provided (sam was tested) and is not in the correct format, Arpeggio fails silently without informing the user that the file format is incorrect.
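
Two of the problems above can be guarded against before each run: trailing white space in properties.txt, and the loss of data.csv and properties.txt. The following is a sketch of such a pre-flight step. The function names trailing_whitespace_lines and backup_inputs are hypothetical and not part of Arpeggio.

```shell
#!/bin/sh
# Sketch: pre-flight helpers for two known issues. Names are hypothetical.

# trailing_whitespace_lines FILE: print "line-number:line" for every line
# that ends in white space. Returns non-zero if there is no such line.
# Note: in a file with Windows (CRLF) line endings every line is reported,
# because the carriage return counts as white space.
trailing_whitespace_lines() {
    grep -n '[[:space:]]$' "$1"
}

# backup_inputs INPUT_DIR: copy data.csv and properties.txt to a
# time-stamped folder next to INPUT_DIR and print the backup location.
backup_inputs() {
    input_dir="${1%/}"      # drop one trailing slash, if any
    backup_dir="$input_dir.backup.$(date +%Y%m%d%H%M%S)"
    mkdir -p "$backup_dir" &&
        cp "$input_dir/data.csv" "$input_dir/properties.txt" "$backup_dir/" &&
        echo "$backup_dir"
}

# Example (uncomment to run):
# trailing_whitespace_lines /raid1/sra/input_data/properties.txt
# backup_inputs /raid1/sra/input_data
```

The backup is written outside input_data on purpose, since the issue described above concerns the contents of that directory being deleted.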