Read Me
************************************************************
************************************************************
*** ***
*** Poppi, the Protein Occupancy Profiling PIpeline ***
*** ***
************************************************************
************************************************************
This document describes the usage of our Unix pipeline to analyse and compare protein occupancy profiles.
The original protein occupancy profiling experiment is described in Baltz and Munschauer et al. "The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts", Mol Cell. 2012 Jun 8;46(5):674-90
If you use this pipeline for your study, please cite Schueler and Munschauer et al. "Differential Protein occupancy Profiling of the mRNA Transcriptome", submitted manuscript
*************************
* Software requirements *
*************************
The following software needs to be installed and be part of your path:
- Perl (http://www.perl.org/)
- bedtools (http://samtools.sourceforge.net/samtools.shtml)
- samtools (http://samtools.sourceforge.net/samtools.shtml)
- TopHat and Bowtie (http://bowtie-bio.sourceforge.net/index.shtml , http://bowtie-bio.sourceforge.net/index.shtml) for mapping using Tophat
- STAR (https://code.google.com/p/rna-star/) for mapping using STAR
- convert (http://www.imagemagick.org/)
- twoBitToFa (part of the UCSC Genome Browser Kent tools, http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/product/README.building.source)
- R
**************************************************
* Before running the pipeline for the first time *
**************************************************
Before you run poppi for the first time, there are a few steps to take:
1) Run ./install_poppi.sh
2) Add The poppi installation and the "bin" subdirectory to your path variable
export PATH=$PATH:/home/<YOUR_USERNAME>/ProtOccProPipeline:/home/<YOUR_USERNAME>/<YOUR_POPPI_PATH>/bin
3) Poppi needs to download genome and gene-model files for your reference species (e.g. "hg18") from UCSC Genome Browser. An internet connection is required. To install a reference, open a console in the poppi installation directory and type:
./install_poppi_genome <DESTINATION_FOLDER> hg18
ATTENTION: if you want to install the genomes into a directory different from the poppi pipeline files, you have to adjust the respective line in install_poppi_genomes manually.
**************************************************************
* Running poppi on protein occupancy profiling experiment(s) *
**************************************************************
Two use poppi, you need reads representing the result of one or more protein occupancy profiling experiments. The standard input format is fastq, which will mapped by our pipeline. If you have mapped your reads already, you can also use these mapped reads in BAM format as input to the pipeline.
To tell poppi what's to do, you need to set up a parameter file, which contains one line per parameter in the format KEY<TAB>VALUE or KEY=VALUE .
An example parameter file ("example.txt") can be found in the poppi installation directory. The most important parameters are defined in the following. To get explanations for all adjustable parameters refer to the example file itself.
Important parameters (NAME = EXAMPLE)
GENOME = hg18 The reference to be used. This name can refere to any reference you have installed (see above)
MAPPER = "tophat" The mapper to be used. So far poppi supports TopHat and STAR. Only in case you don't provide BAM files
OUTPUT_DIR = any/directory The pipeline output directory
For each experiment you want to analyze in one poppi run, you have to add the following lines
SAMPLE1{ This name (which cannot be left out) is introduced to increase readability but will be ignored in the output
FILE_NAME = "any_given_name_1" Sample name (MUST be file name conform, e.g. no whitespaces)
NAME = "any given name 1" Sample name in human readable form
INPUT_FILE = /path/to/file/with/any_given_name_1.fastq Absolute path to the input file. So far, the input format MUST be either FASTQ or BAM!
}
The number of experiments to analyse in one poppi run is not limited.
After having defined your parameter file save it (e.g. under "my_project.txt") and use the "poppi" script to generate Makefiles. To do so type the following on the console:
poppi my_project.txt
This will generate a set of Makefiles in OUTPUT_DIR. Change to that directory and type "make" (or "make -j <NR OF THREADS>" for parallel processing) to start the Makefile processing.
*******************************************
* Understanding the output of a poppi run *
*******************************************
After the pipeline has finished, you will find a new file "index.html" in OUTPUT_DIR. Open this file to get the statistics of your protein occupancy profiling run.
In addition, the directory contains the following subdirectories and files:
reads - Contains all input reads as well as unique input reads
OUTPUT_DIR/mapping - results from the read mapping
OUTPUT_DIR/mapping/FILE_NAME_mapping_all.bam - BAM file containing all mapping positions of reads
OUTPUT_DIR/mapping/FILE_NAME_mapping.bam - BAM file containing filtered mapping positions of reads (e.g. only uniquely mapping reads, see "example.txt", will be the same as "_all" if no filtering was done)
OUTPUT_DIR/coverage - Contains files representing read coverage
OUTPUT_DIR/coverage/coverage_genome_<FILE_NAME>_minus.bed - Read coverage on minus strand in bedgraph format (can be uploaded to UCSC Genome Browser)
OUTPUT_DIR/coverage/coverage_genome_<FILE_NAME>_plus.bed - Read coverage read depth on plus strand in bedgraph format (can be uploaded to UCSC Genome Browser)
OUTPUT_DIR/coverage/coverage_<FILE_NAME>_cds - Per-position read coverage over coding sequences of refGene transcripts (format described below)
OUTPUT_DIR/coverage/coverage_<FILE_NAME>_transcript - Per-position read coverage over full refGene transcripts (format described below)
OUTPUT_DIR/coverage/coverage_<FILE_NAME>_utr3 - Per-position read coverage over 3'UTRs of refGene transcripts (format described below)
OUTPUT_DIR/coverage/coverage_<FILE_NAME>_utr5 - Per-position read coverage over 5'UTRs of refGene transcripts (format described below)
OUTPUT_DIR/coverage/TC_read_coverage... - Same files as described above yet based only on reads with a T-C transition (THIS FEATURE IS STILL UNDER DEVELOPMENT AND SHOULD ONLY BE USED WITH CARE)
OUTPUT_DIR/coverage/split_by_chromosome - contains coverage files splitted by chromosome for a faster upload to the UCSC Genome Browser
Files that define the per-transcript read depth contain annotations in the following format:
1st column: refgene id
2nd column: Position in transcript (starting from 1) - the position is not strand-specific but always starts at the most 5' genomic coordinate
3rd column: Relative position in transcript (between 0.00 and 1.00) - the relative position is strand-specific, meaning 0 referes to the start of the compartments/transcript and 0 refers to its end
4th column: Number of covering reads
OUTPUT_DIR/TC_conversion - Contains files representing positional T-C conversion events
allTC_conversions_genome_FILE_NAME.bed - Per-position T-C transitions events from reads mapped to that position (BED format, unfiltered). The score field (5. column) is used to represent to transition count.
TC_conversions_genome_FILE_NAME.bed - Same as "allTC_conversions_genome_FILE_NAME.bed" but filtered for positions with a minimal transition count (e.g. "2", see "example.txt")
TC_conversions_genome_<FILE_NAME>_plus.bed - Subset of "TC_conversions_genome_FILE_NAME.bed" based only on transitions assigned to the plus strand (bedgraph format, can be uploaded to UCSC Genome Browser)
TC_conversions_genome_<FILE_NAME>_minus.bed - Subset of "TC_conversions_genome_FILE_NAME.bed" based only on transitions assigned to the minus strand (bedgraph format, can be uploaded to UCSC Genome Browser)
TC_conversions_<FILE_NAME>_cds/transcript/utr3/utr5 - In line with the respective coverage files, these files represent transitions in the respective transcripts and transcript compartments. The format is based on stardard BED (first 6 columns). The 7. column represents the absolute position in the transcript/compartment while the 8. position represents the relative position.
OUTPUT_DIR/plots - Contains all produced plots
OUTPUT_DIR/statistics - Contains all statistic files used to generate the plots
OUTPUT_DIR/html - Contains the HTML files for index.html
OUTPUT_DIR/parameters.txt - A file that contains all the parameters of the poppi run (can also be found in index.html)
***************************************************
* Comparing two replicates based on read coverage *
***************************************************
Two compare the read coverage of two or more protein occupancy profiles over all transcripts, which have been processed by the poppi pipeline, we have implemented the function "compare_replicates.RScript".
To use that function, you will have to prepare one file (e.g. "comparison.txt") that contains one line per experiment to compare like this:
<EXPERIMENT_NAME_1> /OUTPUT_DIR/coverage/coverage_<FILE_NAME_1>_transcript
<EXPERIMENT_NAME_2> /OUTPUT_DIR/coverage/coverage_<FILE_NAME_2>_transcript
... ...
Instead of comparing the read coverage of complete transcripts, it is also possible to use individual compartments like cds, 3'UTR, etc. by inserting the respective files.
ATTENTION: It is not possible to compare different compartments against each other (e.g. cds against 3'UTR)!
Then type the following on the console:
compare_replicates.RScript comparison.txt output.pdf
This will generate the file "output.pdf", which contains density plots of Spearman correlation coefficients for each pairwise comparison.
The function "compare_replicates.RScript" has multiple adjustable parameters (like the size of the used window). To learn more type "compare_replicates.RScript --help"
************************************************************************
* Call differentially occupied position based on T-C conversion counts *
************************************************************************
To define positions that show significant differences in occupancy, we have developed a method based on T-C transition counts (published in Schueler and Munschauer et al.).
To run our differential poppi pipeline, you have to first process both experiments with the poppi pipeline. Then set up a parameter file in the same way as for the the poppi pipeline (see above).
The file "example_differential.txt" contains a detailed description of all adjustable parameters. In the following, only the most important parameters are described:
GENOME = hg18 The reference to be used. This name can refere to any reference you have installed (see above) but MUST be the same as used in the original poppi run
OUTPUT_DIR = any/directory The differential pipeline output directory
EXP1NAME = any_name_1 Name of the first experiment (does not have to be the same as used in the original poppi run)
EXP2NAME = any_name_2 Name of the second experiment (does not have to be the same as used in the original poppi run)
Differential occupancy profiling requires at least two replicates protein occupancing profiles per condition. To tell the pipeline were the replicates are located add the following for each replicate (see "example_differential.txt" for a more detailed example)
EXPERIMENT1_REPLICATE1{ This name (which cannot be left out) is introduced to increase readability but will be ignored in the output
EXPERIMENT = 1 Must be either "1" or "2" to assign this replicate to experiment 1 or 2 (multiple samples can be assigned "1" and "2", respectively)
INPUT_POPPI_DIR = /path/to/poppi/output/directory original poppi output directory (OUTPUT_DIR)
EXPERIMENT_NAME = "any_given_name_1" Name of the sample from the poppi run (MUST be file name conform, e.g. no whitespaces); MUST be the same name as the "FILE_NAME" from the original poppi run
}
Our differential pipeline also enables the filtering of differential positions based on mRNA-seq data. This must be provided as mapped reads in BAM format (support for automatic read mapping will be added in the near future). To add this functionality set
EXPRESSION_FILTER == true
In addition, it is important to define what thresholds are used for differential gene/transcript/exon expression. To do so use
EXPRESSION_FILTER_MIN_LOG2_FC = 0 minimal absolute foldchange for a gene/transcript/exon to be considered differentially expressed (log2-scale). The default (0) means no filtering based on foldchange.
EXPRESSION_FILTER_MIN_FDR = 0.01 false discovery rate threshold for a gene/transcript/exon to be considered differentially expressed. The default is FDR < 0.01.
Again, at least two replicates are needed to make use of this feature. To tell the pipeline were the replicates are located add the following for each replicate (see "example_differential.txt" for a more detailed example)
EXPRESSION1_REPLICATE1{ This name (which cannot be left out) is introduced to increase readability but will be ignored in the output
EXPERIMENT = 1 Assignment to experiment 1 and 2; MUST be consistent with occupancy profiling (will lead to wrong result otherwise)
EXPRESSION_FILE = /path/to/expression/file BAM file containing mRNA-seq aligned reads
}
After having defined your parameter file save it (e.g. under "my_differential_project.txt") and use the "poppi" script to generate Makefiles. To do so type the following on the console:
differential_poppi my_differential_project.txt
This will generate a set of Makefiles in OUTPUT_DIR. Change to that directory and type "make" to start the Makefile processing.
********************************************************
* Understanding the output of a differential poppi run *
********************************************************
Running the differential pipeline will produce the file "Differential_positions_EXP1NAME_vs_EXP2NAME.bed", which contains all tested T-C transition positions.
The format of this file is as follows:
1st to 6th column Description of positions in BED format; the score field contains the log2(foldchange experiment 1/experiment 2) of the comparison (after per-transcript normalization)
7th column T-C transition counts of that position for all replicates of experiment 1, separated by comma
8th column T-C transition counts of that position for all replicates of experiment 2, separated by comma
9th column Genes harboring that position
10th column log2(foldchange experiment 1/experiment 2) of the comparison (after per-transcript normalization)
11th column Unadjusted P-value of the significance test
12th column FDR of the p-value adjusted for each gene (EXPERIMENTAL)
13th column FDR of the p-value adjusted over all tested positions (ATTENTION: using this FDR will likely result in a very low number of differential positions. Use the method described in our paper instead define a good significance threshold)
To filter this file for significant positions, you can use awk. To filter only positions with a p-value < 0.01 type
awk '($11<0.01)' Differential_positions_EXP1NAME_vs_EXP2NAME.bed
If additional mRNA-seq data has been specified, the file "Differential_positions_EXP1NAME_vs_EXP2NAME_plusDifferentialGeneExpression.bed" will also have been generated.
The format of this file is as follows:
1st to 13th column Same as for Differential_positions_EXP1NAME_vs_EXP2NAME.bed
14th column TRUE/FALSE; is (at least one) of the associated genes differentially expressed in respect to the defined thresholds (see above)
15th column TRUE/FALSE; is (at least one) of the associated transcripts differentially expressed in respect to the defined thresholds (see above)
16th column TRUE/FALSE; is (at least one) of the associated exons differentially expressed in respect to the defined thresholds (see above)
To filter positions that have a p-avlue < 0.01 and are not located in differential exons use
awk '($11<0.01)&&($16=="FALSE")' Differential_positions_EXP1NAME_vs_EXP2NAME.bed
In addition, the following files are also produced:
Transition_counts_EXP1NAME_vs_EXP2NAME.bed - T-C transition counts for all positions over all replicates (similar to "Differential_positions_EXP1NAME_vs_EXP2NAME.bed" but also containing untested positions)
Transition_counts_EXP1NAME_vs_EXP2NAME_perGene.bed - Same as "Transition_counts_HEK293_vs_MCF7.bed" but adding overlapping gene information
Transition_counts_EXP1NAME_vs_EXP2NAME_perGene_normalized.bed - If an initial normalization has been defined (see "example_differential.txt") this file is the result of using this normalization on "Transition_counts_EXP1NAME_vs_EXP2NAME_perGene.bed" (otherwise this is a link to the former file)
gene_models.bed - The gene models used for the mRNA-seq filtering (only if EXPRESSION_FILTER == true)
parameters.txt - Parameters used for the differential poppi run