Congting Ye

APAtrap User Manual

Table of Contents

  1. What is APAtrap?
  2. Installation of APAtrap
  3. How to run APAtrap?
    3.1 identifyDistal3UTR
    3.2 predictAPA
    3.3 deAPA
  4. Demo of processing the Test_Data.zip
  5. Inputs of APAtrap
  6. Outputs of APAtrap


1. What is APAtrap?


APAtrap is a tool that refines annotated 3' UTRs, identifies novel 3' UTRs and 3' UTR extensions, identifies potential APA (alternative polyadenylation) sites, and detects genes with differential APA site usage between conditions by leveraging the resolution of RNA-seq data.
APAtrap is open-source software distributed under the terms of the Creative Commons license described at http://creativecommons.org/licenses/by-nc-sa/3.0/.
If you have any questions or comments, please contact Dr. Congting Ye (yec@xmu.edu.cn).

Ye C, Long Y, Ji G, Li QQ, Wu X (2018) APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data. Bioinformatics 34(11): 1841–1849.



2. Installation of APAtrap


[1]. Download and unzip our package - 'APAtrap_Linux.zip' (or 'APAtrap_Windows.zip', 'APAtrap_MacOS.zip')

After you download and unzip the APAtrap package, you will see a folder named APAtrap. It contains two standalone executables ('identifyDistal3UTR' and 'predictAPA'), packaged from Perl programs, and one R package ('deAPA_1.0.tar.gz').

[2]. Install R package 'deAPA_1.0.tar.gz'

Make sure that R is installed. Start R and change the working directory to the folder where 'deAPA_1.0.tar.gz' is located (e.g., './APAtrap'), then type the following command at the R prompt:

> install.packages("deAPA_1.0.tar.gz",repos = NULL, type = "source") 



3. How to run APAtrap?


Running APAtrap takes three steps: (1) run identifyDistal3UTR to refine annotated 3'UTRs and identify novel 3'UTRs or 3'UTR extensions; (2) run predictAPA to infer all potential APA sites and estimate their usage; (3) run the R function deAPA to detect genes with significant changes in APA site usage between conditions.

3.1 identifyDistal3UTR

Refine annotated 3'UTRs and identify novel 3'UTRs or 3'UTR extensions.

identifyDistal3UTR -h  
Necessary parameters:
-i  short reads mapping result in bedgraph/wig format, can accept single file or multiple files.
-m  gene model file in bed format.
-o  output file storing the information of the extended 3'UTRs in bed format.
Optional parameters:
-w  window size used to scan the mapping result, default is 100.
-e  pre-extension size of each 3'UTR, default is 10000.
-c  minimum coverage at the end of the distal 3'UTR relative to the coverage of the whole transcript.
-p  minimum percentage of valid nucleotides in a scanning-window.
-s  gene symbol file.
Example:

1) For genomes with long 3'UTRs,

identifyDistal3UTR -i Sample1.bedgraph Sample2.bedgraph -m hg19.genemodel.bed -o novel.utr.bed

2) For genomes with short 3'UTRs,

identifyDistal3UTR -i Sample1.bedgraph Sample2.bedgraph -m rice.genemodel.bed -o novel.utr.bed -w 50 -e 5000


3.2 predictAPA

Infer all potential APA sites and estimate their corresponding usages.

predictAPA -h  
Necessary parameters:
-i  short reads mapping result in bedgraph/wig format, can accept single file or multiple files.
-g  number of groups (treatments/conditions) of the input files, e.g. -g 2.
-n  number of files (biological replicates) in each group (treatment/condition), one value per group, e.g. -n 1 1.
-u  3'UTR annotation file in bed format.
-o  information of the predicted APA sites and their usage.
Optional parameters:
-d  minimum degree of coverage variation between two adjacent APA sites, >0 and <1, default is 0.2.
-c  minimum average coverage required for each 3'UTR, >=10, default is 20.
-a  minimum distance between the predicted APA sites, >=20, default is 100.
-w  window size used to scan the profile, >=20, default is 50.
Example:

1) For genomes with long 3'UTRs,

predictAPA -i Sample1.bedgraph Sample2.bedgraph -g 2 -n 1 1 -u hg19.utr.bed -o output.txt

2) For genomes with short 3'UTRs,

predictAPA -i Sample1.bedgraph Sample2.bedgraph -g 2 -n 1 1 -u rice.utr.bed -o output.txt -a 50
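
The relationship between -i, -g and -n can be made explicit with a small helper. The sketch below is not part of APAtrap: the function name build_predictapa_command and the default executable path are hypothetical, and it assumes that the input files are listed group by group, in the same order as the counts passed to -n. It only assembles the argument list and does not run anything.

```python
from typing import List, Sequence


def build_predictapa_command(
    groups: Sequence[Sequence[str]],
    utr_bed: str,
    output: str,
    executable: str = "./predictAPA",  # hypothetical location of the executable
) -> List[str]:
    """Assemble a predictAPA argument list from bedgraph files grouped by condition."""
    if not groups or any(len(group) == 0 for group in groups):
        raise ValueError("every group needs at least one bedgraph file")
    # Flatten the files group by group (assumed to match the order of the -n counts).
    files = [path for group in groups for path in group]
    cmd = [executable, "-i", *files]
    cmd += ["-g", str(len(groups))]                        # number of groups
    cmd += ["-n", *[str(len(group)) for group in groups]]  # replicates per group
    cmd += ["-u", utr_bed, "-o", output]
    return cmd


if __name__ == "__main__":
    demo = build_predictapa_command(
        [["Sample1.bedgraph"], ["Sample2.bedgraph"]], "hg19.utr.bed", "output.txt"
    )
    print(" ".join(demo))
```

With two conditions of one replicate each, this reproduces the arguments of example 1) above. With, say, three replicates in the first condition and two in the second, the same helper would emit -g 2 -n 3 2.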


3.3 deAPA

Detect genes having significant changes in APA site usage between conditions.

Usage
deAPA(input_file, output_file, group1, group2, least_qualified_num_in_group1, least_qualified_num_in_group2, coverage_cutoff)
Arguments
input_file  
        The result generated by 'predictAPA'.

output_file 
        Name of output file.

group1  
        The first group of samples to be compared, default is 1.

group2  
        The second group of samples to be compared, default is 2.

least_qualified_num_in_group1   
        Minimum number of qualified replicates in sample group1, default is 1.

least_qualified_num_in_group2   
        Minimum number of qualified replicates in sample group2, default is 1.

coverage_cutoff 
        Minimum coverage depth required for each sample, default is 20.



4. Demo of processing the Test_Data.zip


Steps of using APAtrap to process the Test_Data.zip:

  • 1st step: type the following command in the Command Prompt of Linux or Windows,
$ ./identifyDistal3UTR -i Sample1.bedgraph Sample2.bedgraph -m hg19.genemodel.bed -o test.utr.bed
  • 2nd step: type the following command in the Command Prompt of Linux or Windows,
$ ./predictAPA -i Sample1.bedgraph Sample2.bedgraph -g 2 -n 1 1 -u test.utr.bed -o test.APA.txt
  • 3rd step: type the following command in the R Command Prompt,
> library(deAPA)
> deAPA('test.APA.txt', 'test.APA.stat.txt', 1, 2, 1, 1, 20)



5. Inputs of APAtrap


The main inputs of APAtrap are the "short reads mapping result" in bedgraph/wig format and the "gene model file" in 12-column bed format.

[1] Generate bedgraph format file

(1) Use FASTX-Toolkit, Trimmomatic, or a similar tool to trim reads and filter out low-quality reads. Example:

$ java -jar $TRIMMOMATIC PE -phred33 SRR_1.fastq SRR_2.fastq SRR_1.paired.fastq SRR_1.unpaired.fastq SRR_2.paired.fastq SRR_2.unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

(2) Use HISAT2, BWA, or a similar aligner to align the short reads to the reference genome. Example:

$ hisat2 -x reference_genome_index -1 SRR_1.paired.fastq -2 SRR_2.paired.fastq -S SRR.sam

(3) convert file format (Tool: Samtools and bedtools)

$ samtools view -bS SRR.sam > SRR.bam

$ samtools sort SRR.bam -o SRR.sorted.bam

$ genomeCoverageBed -bg -ibam SRR.sorted.bam -g reference.genome.size.txt -split > SRR.bedgraph
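
The file reference.genome.size.txt passed to -g is a bedtools genome file: one line per chromosome, with the chromosome name and its length separated by a tab. The sketch below shows one way to derive it from the reference FASTA. It is not part of APAtrap, the function names fasta_lengths and write_genome_size_file are hypothetical, and it assumes that the chromosome name is the FASTA header up to the first whitespace, which should match the names in the alignment.

```python
from typing import Dict, Iterable, Optional


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence name: length} for FASTA-formatted lines, in file order."""
    lengths: Dict[str, int] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the header text up to the first whitespace.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def write_genome_size_file(fasta_path: str, out_path: str) -> None:
    """Write a two-column, tab-separated 'name<TAB>length' file."""
    with open(fasta_path) as fasta:
        lengths = fasta_lengths(fasta)
    with open(out_path, "w") as out:
        for name, length in lengths.items():
            out.write(f"{name}\t{length}\n")
```

A call such as write_genome_size_file("reference_genome.fa", "reference.genome.size.txt") would produce the file, where reference_genome.fa is a placeholder for the FASTA used to build the aligner index.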


[2] Obtain gene model file of reference genome

Gene model files for most animal species can be retrieved from UCSC, but they are not available there for plants. We currently provide gene model files for the plants Arabidopsis thaliana and Oryza sativa, and we can help generate a gene model file for other plant species if the corresponding genome annotation file is provided. Users can also generate their own gene model files as follows:

(1) download UCSC tools

$ wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
$ wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed

(2) modify Linux file permission

$ chmod 755 gtfToGenePred genePredToBed

(3) convert gtf file into bed12

$ gtfToGenePred Homo_sapiens.GRCh38.86.gtf test.genePred
$ genePredToBed test.genePred hg.genemodel.bed
$ rm test.genePred
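
Before passing the converted file to identifyDistal3UTR, it can be worth checking that every record really has the 12 tab-separated BED fields (chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts). The following is a minimal sketch of such a check, not an APAtrap feature; the function name is_bed12_record is hypothetical, and the checks are a basic sanity test rather than a full format validator.

```python
def is_bed12_record(line: str) -> bool:
    """Return True if a line looks like a well-formed 12-column BED record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        return False
    try:
        start, end = int(fields[1]), int(fields[2])
        block_count = int(fields[9])
    except ValueError:
        return False
    # blockSizes and blockStarts are comma-separated and may end with a comma.
    sizes = [s for s in fields[10].split(",") if s]
    starts = [s for s in fields[11].split(",") if s]
    return start < end and len(sizes) == block_count and len(starts) == block_count


if __name__ == "__main__":
    record = "chr1\t1000\t5000\tNM_1\t0\t+\t1200\t4800\t0\t2\t500,600,\t0,3400,"
    print(is_bed12_record(record))
```

Applied line by line to hg.genemodel.bed, a check like this would flag truncated records or a file that was accidentally saved in a shorter BED layout.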



6. Outputs of APAtrap


[1] Output of identifyDistal3UTR is a 6-column bed format file:

Column  Explanation
1       Name of the chromosome/scaffold
2       Starting position
3       Ending position
4       Label including Gene ID, Gene symbol, Chromosome name, and Strand
5       Score
6       Strand


[2] Output of predictAPA is a tab-separated text table:

Column Name             Explanation
Gene                    Gene ID
Mean_Squared_Error      Mean squared error of fitting
Predicted_APA           Coordinates of proximal APA sites inferred by APAtrap (separated by commas)
Loci                    Range of the 3'UTR, of which the terminal site represents the most distal poly(A) site
Group_m_n_Separate_Exp  Expression level of each APA site (from the most proximal site to the distal site, separated by commas); m and n indicate the mth sample and the nth replicate
Group_m_n_Total_Exp     Total expression level of sample m, replicate n
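
As a worked illustration of the Group_m_n_Separate_Exp column, the sketch below turns one comma-separated value into per-site fractions by dividing each expression level by their sum. This is an illustrative calculation only: the function name relative_usage is hypothetical, the values are assumed to be plain numbers, and whether APAtrap itself defines usage in exactly this way is not stated here.

```python
from typing import List


def relative_usage(separate_exp: str) -> List[float]:
    """Convert a comma-separated expression string into per-site fractions.

    Sites are kept in the order of the input (most proximal first).
    """
    levels = [float(x) for x in separate_exp.split(",") if x.strip()]
    total = sum(levels)
    if total == 0:
        # No expression at any site: report zero usage rather than dividing by zero.
        return [0.0 for _ in levels]
    return [level / total for level in levels]


if __name__ == "__main__":
    print(relative_usage("30,10"))
```

For a 3'UTR with a proximal site at expression level 30 and a distal site at 10, the fractions are 0.75 and 0.25.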


[3] Output of deAPA contains 4 additional columns compared with the output of predictAPA:

Column Name  Explanation
p.value      p value
perc_diff    PD index, percentage difference of APA site usage between the two compared groups, ∈[0,1]
r            Pearson product-moment correlation coefficient, ∈[-1,1]; a positive value means that group2 uses more distal poly(A) sites (or longer 3' UTRs) than group1, and a negative value means that group2 uses more proximal poly(A) sites (or shorter 3' UTRs) than group1
p.adjust     Adjusted p value
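
A typical follow-up is to keep only genes that pass significance and effect-size cutoffs and to read the direction of change from the sign of r. The sketch below is one possible way to do this and is not part of APAtrap: the function name significant_genes and the cutoffs (0.05 for p.adjust, 0.2 for perc_diff) are illustrative assumptions, and the table is assumed to start with a header row using the column names listed above.

```python
import csv
import io
from typing import Dict, Iterable, List


def significant_genes(
    rows: Iterable[Dict[str, str]],
    max_padj: float = 0.05,       # illustrative cutoff, not an APAtrap default
    min_perc_diff: float = 0.2,   # illustrative cutoff, not an APAtrap default
) -> List[Dict[str, str]]:
    """Select genes passing both cutoffs and label the direction in group2."""
    hits: List[Dict[str, str]] = []
    for row in rows:
        try:
            gene = row["Gene"]
            padj = float(row["p.adjust"])
            perc_diff = float(row["perc_diff"])
            r = float(row["r"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values such as NA
        if padj <= max_padj and perc_diff >= min_perc_diff:
            if r > 0:
                direction = "lengthening"   # group2 favors distal sites
            elif r < 0:
                direction = "shortening"    # group2 favors proximal sites
            else:
                direction = "undetermined"
            hits.append({"Gene": gene, "direction": direction})
    return hits


if __name__ == "__main__":
    # Made-up rows, only to show the expected layout.
    demo = "Gene\tperc_diff\tr\tp.adjust\nGeneA\t0.5\t0.8\t0.01\nGeneB\t0.1\t-0.2\t0.6\n"
    reader = csv.DictReader(io.StringIO(demo), delimiter="\t")
    print(significant_genes(reader))
```

To apply it to a real result, open the deAPA output file (for example 'test.APA.stat.txt' from the demo) and pass csv.DictReader(handle, delimiter="\t") to the function.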


Project Members: