APAtrap Wiki

Identification of APA sites from RNA-seq data

Brought to you by: yec

APAtrap User Manual

1. What is APAtrap?
2. Installation of APAtrap
3. How to run APAtrap?
   3.1 identifyDistal3UTR
   3.2 predictAPA
   3.3 deAPA
4. Demo of processing the Test_Data.zip
5. Inputs of APAtrap
6. Outputs of APAtrap

1. What is APAtrap?

APAtrap refines annotated 3'UTRs, identifies novel 3'UTRs and 3'UTR extensions, and then uses the resolution of RNA-seq data to identify all potential APA (alternative polyadenylation) sites and to detect genes with differential APA site usage between conditions.
APAtrap is open-source software distributed under the license terms described at http://creativecommons.org/licenses/by-nc-sa/3.0/.
If you have any questions or comments, please contact Dr. Congting Ye (yec@xmu.edu.cn).

Ye C, Long Y, Ji G, Li QQ, Wu X (2018) APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data. Bioinformatics 34(11): 1841–1849.

2. Installation of APAtrap

[1]. Download and unzip our package - 'APAtrap_Linux.zip' (or 'APAtrap_Windows.zip', 'APAtrap_MacOS.zip')

After you download and unzip the APAtrap package, you will see a folder named APAtrap containing 2 standalone executables ('identifyDistal3UTR' and 'predictAPA'), packaged from Perl programs, and 1 R package ('deAPA_1.0.tar.gz').

[2]. Install R package 'deAPA_1.0.tar.gz'

Make sure that R is installed. Start R and change its working directory to the folder where 'deAPA_1.0.tar.gz' is located (e.g., './APAtrap'). Then type the following command at the R prompt:

> install.packages("deAPA_1.0.tar.gz",repos = NULL, type = "source")

3. How to run APAtrap?

Running APAtrap takes 3 steps: (1) run identifyDistal3UTR to refine annotated 3'UTRs and identify novel 3'UTRs or 3'UTR extensions; (2) run predictAPA to infer all potential APA sites and estimate their usage; (3) run the R function deAPA to detect genes with significant changes in APA site usage between conditions.

3.1 identifyDistal3UTR

Refine annotated 3'UTRs and identify novel 3'UTRs or 3'UTR extensions.

Print usage:

identifyDistal3UTR -h

Necessary parameters:

-i  short-read mapping results in bedgraph/wig format; accepts a single file or multiple files.
-m  gene model file in bed format.
-o  output file that stores the extended 3'UTR information in bed format.

Optional parameters:

-w  window size used to scan the mapping result, default is 100.
-e  pre-extension size of each 3'UTR, default is 10000.
-c  minimum coverage at the end of the distal 3'UTR relative to the coverage of the whole transcript.
-p  minimum percentage of valid nucleotides in a scanning-window.
-s  gene symbol file.

Example:

1) For genomes with long 3'UTRs,

identifyDistal3UTR -i Sample1.bedgraph Sample2.bedgraph -m hg19.genemodel.bed -o novel.utr.bed

2) For genomes with short 3'UTRs,

identifyDistal3UTR -i Sample1.bedgraph Sample2.bedgraph -m rice.genemodel.bed -o novel.utr.bed -w 50 -e 5000

3.2 predictAPA

Infer all potential APA sites and estimate their corresponding usages.

Print usage:

predictAPA -h

Necessary parameters:

-i  short-read mapping results in bedgraph/wig format; accepts a single file or multiple files.
-g  number of groups (treatments/conditions) of the input files, e.g. -g 2.
-n  number of files (biological replicates) in each group (treatment/condition), e.g. -n 1 1.
-u  3'UTR annotation file in bed format.
-o  output file that stores the predicted APA sites and their usage.

Optional parameters:

-d  minimum degree of coverage variation between two adjacent APA sites, >0 and <1, default is 0.2.
-c  minimum average coverage required for each 3'UTR, >=10, default is 20.
-a  minimum distance between the predicted APA sites, >=20, default is 100.
-w  window size used to scan the profile, >=20, default is 50.

Example:

1) For genomes with long 3'UTRs,

predictAPA -i Sample1.bedgraph Sample2.bedgraph -g 2 -n 1 1 -u hg19.utr.bed -o output.txt

2) For genomes with short 3'UTRs,

predictAPA -i Sample1.bedgraph Sample2.bedgraph -g 2 -n 1 1 -u rice.utr.bed -o output.txt -a 50
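
Reading the parameter descriptions above, the list given to -n should contain one value per group, and those values should add up to the number of files given to -i. The following is a minimal sketch of a pre-flight check based on that reading; the function name check_apa_groups is hypothetical and is not part of APAtrap.

```shell
# check_apa_groups NUM_GROUPS "N1 N2 ..." FILE...
# Hypothetical helper, not shipped with APAtrap.
# Assumption: -n needs one value per group, and the values must sum
# to the number of input files passed to -i.
check_apa_groups() {
    n_groups=$1
    counts=$2
    shift 2
    n_files=$#
    n_counts=0
    total=0
    for c in $counts; do
        n_counts=$((n_counts + 1))
        total=$((total + c))
    done
    if [ "$n_counts" -ne "$n_groups" ]; then
        echo "expected $n_groups values for -n, got $n_counts" >&2
        return 1
    fi
    if [ "$total" -ne "$n_files" ]; then
        echo "-n values sum to $total, but $n_files input files were given" >&2
        return 1
    fi
    return 0
}
```

For the examples above, `check_apa_groups 2 "1 1" Sample1.bedgraph Sample2.bedgraph` succeeds, so the matching predictAPA call can be chained after it with `&&`.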

3.3 deAPA

Detect genes having significant changes in APA site usage between conditions.

Usage

deAPA(input_file, output_file, group1, group2, least_qualified_num_in_group1, least_qualified_num_in_group2, coverage_cutoff)

Arguments

input_file  
        The result generated by 'predictAPA'.

output_file 
        Name of output file.

group1  
        The first group of samples to be compared, default is 1.

group2  
        The second group of samples to be compared, default is 2.

least_qualified_num_in_group1   
        Minimum number of qualified replicates in sample group1, default is 1.

least_qualified_num_in_group2   
        Minimum number of qualified replicates in sample group2, default is 1.

coverage_cutoff 
        Minimum coverage depth required for each sample, default is 20.

4. Demo of processing the Test_Data.zip

Steps of using APAtrap to process the Test_Data.zip:

1st step: type the following command in the Command Prompt of Linux or Windows,

$ ./identifyDistal3UTR -i Sample1.bedgraph Sample2.bedgraph -m hg19.genemodel.bed -o test.utr.bed

2nd step: type the following command in the Command Prompt of Linux or Windows,

$ ./predictAPA -i Sample1.bedgraph Sample2.bedgraph -g 2 -n 1 1 -u test.utr.bed -o test.APA.txt

3rd step: type the following command in the R Command Prompt,

> library(deAPA)
> deAPA('test.APA.txt', 'test.APA.stat.txt', 1, 2, 1, 1, 20)
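
The three steps can also be chained in one shell function so that a failure in an earlier step stops the run. This is only a sketch: the function name run_apatrap_demo is hypothetical, the file names are those of the demo above, and it assumes that Rscript (the command-line front end that ships with R) is on the PATH and that the deAPA package is already installed.

```shell
# run_apatrap_demo [APATRAP_DIR]
# Hypothetical wrapper around the three demo steps; APATRAP_DIR is the
# folder holding the two executables (defaults to the current folder).
run_apatrap_demo() {
    apatrap_dir=${1:-.}

    # Step 1: refine annotated 3'UTRs and identify novel 3'UTRs/extensions.
    "$apatrap_dir/identifyDistal3UTR" -i Sample1.bedgraph Sample2.bedgraph \
        -m hg19.genemodel.bed -o test.utr.bed || return 1

    # Step 2: infer APA sites and estimate their usage.
    "$apatrap_dir/predictAPA" -i Sample1.bedgraph Sample2.bedgraph \
        -g 2 -n 1 1 -u test.utr.bed -o test.APA.txt || return 1

    # Step 3: run the same R commands as above, non-interactively.
    Rscript -e "library(deAPA); deAPA('test.APA.txt', 'test.APA.stat.txt', 1, 2, 1, 1, 20)"
}
```

Called from the folder that contains the test data, for example as `run_apatrap_demo ./APAtrap`, it should leave test.utr.bed, test.APA.txt and test.APA.stat.txt behind when all three steps succeed.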

5. Inputs of APAtrap

The main inputs of APAtrap are the "short reads mapping result" in bedgraph/wig format and the "gene model file" in 12-column bed format.

[1] Generate bedgraph format file

(1) use FASTX-Toolkit, Trimmomatic, or a similar tool to trim reads and filter out low-quality reads. Example:

$ java -jar $TRIMMOMATIC PE -phred33 SRR_1.fastq SRR_2.fastq SRR_1.paired.fastq SRR_1.unpaired.fastq SRR_2.paired.fastq SRR_2.unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

(2) use HISAT2, BWA, or a similar aligner to align the short reads to the reference genome. Example:

$ hisat2 -x reference_genome_index -1 SRR_1.paired.fastq -2 SRR_2.paired.fastq -S SRR.sam

(3) convert the file format (tools: SAMtools and bedtools)

$ samtools view -bS SRR.sam > SRR.bam

$ samtools sort SRR.bam -o SRR.sorted.bam

$ genomeCoverageBed -bg -ibam SRR.sorted.bam -g reference.genome.size.txt -split > SRR.bedgraph
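
Before passing a bedgraph file to APAtrap, it can be worth checking that it looks as expected. The sketch below assumes the tab-separated, 4-column layout that bedtools writes (chromosome, start, end, coverage) and reports the first record that does not fit; the function name check_bedgraph is hypothetical and not part of APAtrap.

```shell
# check_bedgraph FILE
# Hypothetical sanity check, not shipped with APAtrap.
# Assumes tab-separated records: chromosome, start, end, coverage.
# Returns non-zero and reports the first malformed record on stderr.
check_bedgraph() {
    awk -F'\t' '
        # Skip optional header/comment lines.
        /^(track|browser|#)/ { next }
        # A record needs 4 fields, integer coordinates, and start < end.
        NF != 4 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
            printf "line %d is not a valid bedgraph record: %s\n", NR, $0
            exit 1
        }
    ' "$1" >&2
}
```

For example, `check_bedgraph SRR.bedgraph && echo "SRR.bedgraph looks fine"` prints the message only when every record passes.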

[2] Obtain gene model file of reference genome

Gene model files for most animal species can be retrieved from UCSC, but they are not available there for plants. Currently, we provide gene model files for the plants Arabidopsis thaliana and Oryza sativa. We can help generate the gene model file for other plant species if the corresponding genome annotation file is provided. Users can also generate their own gene model files as follows:

(1) download UCSC tools

$ wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
$ wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed

(2) modify Linux file permission

$ chmod 755 gtfToGenePred genePredToBed

(3) convert gtf file into bed12

$ gtfToGenePred Homo_sapiens.GRCh38.86.gtf test.genePred
$ genePredToBed test.genePred hg.genemodel.bed
$ rm test.genePred
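
One pitfall with self-generated gene models is chromosome naming: Ensembl GTF files such as the one above name chromosomes '1', '2', ..., whereas alignments against a UCSC-style reference use 'chr1', 'chr2', .... APAtrap presumably matches coverage to gene models by chromosome name, so the two inputs should agree. The sketch below lists the names that a bedgraph file and a gene model file have in common; the function name shared_chroms is hypothetical.

```shell
# shared_chroms BEDGRAPH GENEMODEL_BED
# Hypothetical helper, not shipped with APAtrap.
# Prints each chromosome/scaffold name that occurs in both files
# (first column of each), in the order found in the gene model.
shared_chroms() {
    awk -F'\t' '
        # First file: remember every chromosome name in the bedgraph.
        NR == FNR {
            if ($0 !~ /^(track|browser|#)/) seen[$1] = 1
            next
        }
        # Second file: print each shared name once.
        ($1 in seen) && !printed[$1]++ { print $1 }
    ' "$1" "$2"
}
```

If `shared_chroms SRR.bedgraph hg.genemodel.bed` prints nothing, the two files use different naming schemes and one of them needs to be renamed before running identifyDistal3UTR.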

6. Outputs of APAtrap

[1] Output of identifyDistal3UTR is a 6-column bed format file:

Column	Explanation
1	Name of the chromosome/scaffold
2	Starting position
3	Ending position
4	Label including Gene ID, gene symbol, chromosome name, and strand
5	Score
6	Strand
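
As a quick look at this file, the length of each reported 3'UTR region can be derived from columns 2 and 3. The sketch below assumes the usual bed convention (0-based start, end not included), so the length is simply end minus start; whether identifyDistal3UTR follows that convention exactly is an assumption, and the function name utr_lengths is hypothetical.

```shell
# utr_lengths UTR_BED
# Hypothetical helper, not shipped with APAtrap.
# Prints label, strand and region length (end - start) for each record,
# assuming tab-separated 6-column bed with a 0-based, half-open interval.
utr_lengths() {
    awk -F'\t' 'BEGIN { OFS = "\t" } NF >= 6 { print $4, $6, $3 - $2 }' "$1"
}
```

For example, `utr_lengths novel.utr.bed | sort -k3,3nr | head` shows the ten longest regions.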

[2] Output of predictAPA is a tab-separated text table:

Column Name	Explanation
Gene	Gene ID
Mean_Squared_Error	Mean squared error of fitting
Predicted_APA	Coordinates of proximal APA sites inferred by APAtrap (separated by commas)
Loci	Range of the 3'UTR, of which the terminal site represents the most distal poly(A) site
Group_m_n_Separate_Exp	Expression level of each APA site (from the most proximal site to the most distal site, separated by commas); m and n denote the m-th sample group and the n-th replicate
Group_m_n_Total_Exp	Total expression level of sample group m, replicate n
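
Because Predicted_APA lists only the proximal sites and the end of Loci is the most distal site, the total number of poly(A) sites per gene is the number of Predicted_APA entries plus one. The sketch below counts them; it assumes the file starts with a header row carrying the column names listed above, and the function name count_apa_sites is hypothetical.

```shell
# count_apa_sites PREDICTAPA_OUTPUT
# Hypothetical helper, not shipped with APAtrap.
# Assumes a tab-separated table whose first row holds the column names.
# Prints gene ID and total number of poly(A) sites
# (proximal sites in Predicted_APA plus the distal site from Loci).
count_apa_sites() {
    awk -F'\t' '
        BEGIN { OFS = "\t" }
        NR == 1 {
            # Locate columns by name rather than by position.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Gene" in col) || !("Predicted_APA" in col)) {
                print "missing Gene or Predicted_APA column" > "/dev/stderr"
                exit 1
            }
            next
        }
        {
            n_proximal = split($(col["Predicted_APA"]), sites, ",")
            print $(col["Gene"]), n_proximal + 1
        }
    ' "$1"
}
```

Running `count_apa_sites output.txt` gives one line per gene, which can be piped into `sort` or `awk` for further summaries.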

[3] Output of deAPA contains 4 additional columns compared with the output of predictAPA:

Column Name	Explanation
p.value	p value
perc_diff	PD index, percentage difference of APA site usage between the two compared groups, ∈[0,1]
r	Pearson product-moment correlation coefficient, ∈[-1,1]; a positive value indicates that group2 uses more distal poly(A) sites (longer 3' UTRs) than group1, and a negative value indicates that group2 uses more proximal poly(A) sites (shorter 3' UTRs) than group1
p.adjust	Adjusted p value
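
A typical follow-up is to keep only genes with a small adjusted p value and a sufficiently large PD index, and to read the direction of the shift from the sign of r. The sketch below does that; the cutoffs 0.05 and 0.2 are illustrative defaults chosen here rather than values prescribed by APAtrap, the file is assumed to start with a header row carrying the column names above, and the function name filter_deapa is hypothetical.

```shell
# filter_deapa DEAPA_OUTPUT [MAX_PADJ] [MIN_PERC_DIFF]
# Hypothetical helper, not shipped with APAtrap.
# Keeps genes with p.adjust < MAX_PADJ and perc_diff >= MIN_PERC_DIFF
# (illustrative defaults: 0.05 and 0.2) and labels the direction from r.
filter_deapa() {
    awk -F'\t' -v max_padj="${2:-0.05}" -v min_pd="${3:-0.2}" '
        BEGIN { OFS = "\t" }
        NR == 1 {
            # Locate columns by name rather than by position.
            for (i = 1; i <= NF; i++) col[$i] = i
            print "Gene", "perc_diff", "r", "p.adjust", "direction"
            next
        }
        # Rows with a missing adjusted p value (NA) are skipped.
        $(col["p.adjust"]) != "NA" &&
        $(col["p.adjust"]) + 0 < max_padj + 0 &&
        $(col["perc_diff"]) + 0 >= min_pd + 0 {
            r = $(col["r"]) + 0
            # Positive r: group2 favours distal sites (longer 3 prime UTR).
            dir = (r > 0) ? "longer_in_group2" : ((r < 0) ? "shorter_in_group2" : "no_shift")
            print $(col["Gene"]), $(col["perc_diff"]), $(col["r"]), $(col["p.adjust"]), dir
        }
    ' "$1"
}
```

For the demo output, `filter_deapa test.APA.stat.txt` uses the default cutoffs, while `filter_deapa test.APA.stat.txt 0.01 0.3` applies stricter ones.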

Project Members:

Congting Ye (admin)
