1. What is APAtrap?
APAtrap is a tool that refines annotated 3' UTRs and identifies novel 3' UTRs and 3' UTR extensions. By leveraging the resolution of RNA-seq data, it aims to identify all potential APA (alternative polyadenylation) sites and to detect genes with differential APA site usage between conditions.
APAtrap is open-source software distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license (http://creativecommons.org/licenses/by-nc-sa/3.0/).
If you have any questions or comments, please contact Dr. Congting Ye (yec@xmu.edu.cn).
2. Installation of APAtrap
After you download and unzip the APAtrap package, you will see a folder named APAtrap containing 2 standalone executables ('identifyDistal3UTR' and 'predictAPA'), packaged from Perl programs, and 1 R package ('deAPA_1.0.tar.gz').
Make sure that R is installed. Start R, change its working directory to the folder where 'deAPA_1.0.tar.gz' is located (e.g., './APAtrap'), and type the following command at the R prompt:
> install.packages("deAPA_1.0.tar.gz",repos = NULL, type = "source")
3. How to run APAtrap?
Running APAtrap takes 3 steps: (1) run identifyDistal3UTR to refine annotated 3'UTRs and identify novel 3'UTRs or 3'UTR extensions; (2) run predictAPA to infer all potential APA sites and estimate their usage; (3) run the R function deAPA to detect genes with significant changes in APA site usage between conditions.
Refine annotated 3'UTRs and identify novel 3'UTRs or 3'UTR extensions.
identifyDistal3UTR -h
-i short reads mapping result in bedgraph/wig format, can accept single file or multiple files.
-m gene model file in bed format.
-o output file that stores the extended 3'UTRs in bed format.
-w window size used to scan the mapping result, default is 100.
-e pre-extension size of each 3'UTR, default is 10000.
-c minimum coverage at the end of the distal 3'UTR relative to the coverage of the whole transcript.
-p minimum percentage of valid nucleotides in a scanning-window.
-s gene symbol file.
1) For genomes with long 3'UTRs:
identifyDistal3UTR -i Sample1.bedgraph Sample2.bedgraph -m hg19.genemodel.bed -o novel.utr.bed
2) For genomes with short 3'UTRs:
identifyDistal3UTR -i Sample1.bedgraph Sample2.bedgraph -m rice.genemodel.bed -o novel.utr.bed -w 50 -e 5000
Infer all potential APA sites and estimate their corresponding usages.
predictAPA -h
-i short reads mapping result in bedgraph/wig format, can accept single file or multiple files.
-g number of groups (treatments/conditions) of the input files, e.g. -g 2.
-n number of files (biological replicates) in each group (treatment/condition), e.g. -n 1 1.
-u 3'UTR annotation file in bed format.
-o information of the predicted APA sites and their usage.
-d minimum degree of coverage variation between two adjacent APA sites, >0 and <1, default is 0.2.
-c minimum average coverage required for each 3'UTR, >=10, default is 20.
-a minimum distance between the predicted APA sites, >=20, default is 100.
-w window size used to scan the profile, >=20, default is 50.
1) For genomes with long 3'UTRs:
predictAPA -i Sample1.bedgraph Sample2.bedgraph -g 2 -n 1 1 -u hg19.utr.bed -o output.txt
2) For genomes with short 3'UTRs:
predictAPA -i Sample1.bedgraph Sample2.bedgraph -g 2 -n 1 1 -u rice.utr.bed -o output.txt -a 50
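When each condition has biological replicates, the values given to -g and -n have to agree with the number of files given to -i. The sketch below only assembles and prints such a command line; it does not run predictAPA. The helper name build_predictapa_cmd and the file names are hypothetical, and listing the input files group by group (all replicates of group 1 first) is an assumption about how predictAPA pairs files with groups.

```shell
# Assemble a predictAPA command line from one argument per group.
# Each group is given as a quoted, space-separated list of replicate files.
# ASSUMPTION: files are passed to -i group by group, in the same order as -n.
build_predictapa_cmd() {
    utr=$1      # 3'UTR annotation file (-u)
    out=$2      # output file (-o)
    shift 2
    files=""
    counts=""
    ngroups=0
    for group in "$@"; do
        # count the replicate files in this group
        n=$(( $(printf '%s\n' "$group" | wc -w) ))
        files="$files $group"
        counts="$counts $n"
        ngroups=$((ngroups + 1))
    done
    echo "predictAPA -i${files} -g ${ngroups} -n${counts} -u ${utr} -o ${out}"
}

# Two groups with two replicates each (hypothetical file names)
build_predictapa_cmd test.utr.bed test.APA.txt \
    "ctrl_1.bedgraph ctrl_2.bedgraph" \
    "treat_1.bedgraph treat_2.bedgraph"
```

Printing the command first makes it easy to check that the numbers after -n sum to the number of input files before starting a long run.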
Detect genes having significant changes in APA site usage between conditions.
deAPA(input_file, output_file, group1, group2, least_qualified_num_in_group1, least_qualified_num_in_group2, coverage_cutoff)
input_file
The result generated by 'predictAPA'.
output_file
Name of output file.
group1
The first group of samples to be compared, default is 1.
group2
The second group of samples to be compared, default is 2.
least_qualified_num_in_group1
Minimum number of qualified replicates in sample group1, default is 1.
least_qualified_num_in_group2
Minimum number of qualified replicates in sample group2, default is 1.
coverage_cutoff
Minimum coverage depth required for each sample, default is 20.
4. Demo of processing the Test_Data.zip
Steps of using APAtrap to process the Test_Data.zip:
$ ./identifyDistal3UTR -i Sample1.bedgraph Sample2.bedgraph -m hg19.genemodel.bed -o test.utr.bed
$ ./predictAPA -i Sample1.bedgraph Sample2.bedgraph -g 2 -n 1 1 -u test.utr.bed -o test.APA.txt
> library(deAPA)
> deAPA('test.APA.txt', 'test.APA.stat.txt', 1, 2, 1, 1, 20)
5. Inputs of APAtrap
The main inputs of APAtrap are the "short reads mapping result" in bedgraph/wig format and the "gene model file" in 12-column bed format.
A bedgraph file can be generated from raw reads as follows:
(1) use FASTX-Toolkit, Trimmomatic, or a similar tool to trim and filter out low-quality reads. Example:
$ java -jar $TRIMMOMATIC PE -phred33 SRR_1.fastq SRR_2.fastq SRR_1.paired.fastq SRR_1.unpaired.fastq SRR_2.paired.fastq SRR_2.unpaired.fastq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
(2) use HISAT2, BWA, or a similar aligner to align the short reads to the reference genome. Example:
$ hisat2 -x reference_genome_index -1 SRR_1.paired.fastq -2 SRR_2.paired.fastq -S SRR.sam
(3) convert file format (Tool: Samtools and bedtools)
$ samtools view -bS SRR.sam > SRR.bam
$ samtools sort SRR.bam -o SRR.sorted.bam
$ genomeCoverageBed -bg -ibam SRR.sorted.bam -g reference.genome.size.txt -split > SRR.bedgraph
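The file passed to -g above is a two-column, tab-separated list of chromosome names and lengths. One common way to obtain it, sketched here, is to keep the first two columns of a FASTA index created by `samtools faidx`. The helper name make_genome_sizes and the file names in the usage comment are hypothetical.

```shell
# Write a bedtools-style genome file (chromosome<TAB>length) from a FASTA index.
# Assumes the index (.fai) already exists, e.g. created with: samtools faidx reference.fa
make_genome_sizes() {
    fai=$1   # FASTA index produced by samtools faidx
    out=$2   # genome size file to write
    cut -f1,2 "$fai" > "$out"
}

# Usage (hypothetical file names):
#   samtools faidx reference.fa
#   make_genome_sizes reference.fa.fai reference.genome.size.txt
```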
Gene model files for most animal species can be retrieved from UCSC, but they are not available there for plants. Currently, we provide gene model files for the plants Arabidopsis thaliana and Oryza sativa. We can help generate the gene model file for other plant species if the corresponding genome annotation file is provided. Users can also generate their own gene model files as follows:
(1) download UCSC tools
$ wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
$ wget -c http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
(2) modify Linux file permission
$ chmod 755 gtfToGenePred genePredToBed
(3) convert gtf file into bed12
$ gtfToGenePred Homo_sapiens.GRCh38.86.gtf test.genePhred
$ genePredToBed test.genePhred hg.genemodel.bed
$ rm test.genePhred
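Because APAtrap expects the gene model in 12-column bed format, a quick column count on the converted file can catch a failed conversion early. This is a minimal sketch using a hypothetical helper, check_bed12; it only counts tab-separated columns and does not validate their contents.

```shell
# Report lines of a bed file that do not have exactly 12 tab-separated columns.
# Returns a non-zero exit status if any such line is found.
check_bed12() {
    awk -F'\t' '
        NF != 12 { printf "line %d: %d columns\n", NR, NF; bad = 1 }
        END      { exit bad + 0 }
    ' "$1"
}

# Usage (file name taken from the conversion step above):
#   check_bed12 hg.genemodel.bed && echo "12 columns on every line"
```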
6. Outputs of APAtrap
Output of identifyDistal3UTR (bed format):

Column | Explanation |
---|---|
1 | Name of the chromosome/scaffold |
2 | Starting position |
3 | Ending position |
4 | Label including info of Gene ID, Gene symbol, Chromosome Name, Strand |
5 | Score |
6 | Strand |

Output of predictAPA:

Column Name | Explanation |
---|---|
Gene | Gene ID |
Mean_Squared_Error | Mean squared error of fitting |
Predicted_APA | Coordinates of proximal APA sites inferred by APAtrap (separated by commas) |
Loci | Range of the 3'UTR, of which the terminal site represents the most distal poly(A) site |
Group_m_n_Separate_Exp | Expression level of each APA site (from the most proximal site to the most distal site, separated by commas). m and n indicate the mth sample and nth replicate |
Group_m_n_Total_Exp | Total expression level of sample m, replicate n |
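
The comma-separated values in a Group_m_n_Separate_Exp field can be turned into relative site usage for a quick look at one gene. The sketch below assumes that relative usage is each site's expression divided by the sum over all sites of that 3'UTR; this is an illustration, not necessarily how APAtrap or deAPA computes usage internally. The helper name apa_site_fractions is hypothetical.

```shell
# Convert one Separate_Exp value (e.g. "30,10,60") into per-site fractions.
# ASSUMPTION: usage of a site = its expression / sum of all site expressions.
apa_site_fractions() {
    printf '%s\n' "$1" | awk -F',' '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i
        for (i = 1; i <= NF; i++) {
            frac = (total > 0) ? $i / total : 0   # avoid division by zero
            printf "%s%.2f", (i > 1 ? "," : ""), frac
        }
        printf "\n"
    }'
}

# Sites are ordered from the most proximal to the most distal
apa_site_fractions "30,10,60"
```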

Output of deAPA:

Column Name | Explanation |
---|---|
p.value | p value |
perc_diff | PD index, percentage difference of APA site usage between the two compared groups, ∈[0,1] |
r | Pearson product-moment correlation coefficient, ∈[-1,1]. A positive value indicates that group2 uses more distal poly(A) sites (or longer 3' UTRs) than group1; a negative value indicates that group2 uses more proximal poly(A) sites (or shorter 3' UTRs) than group1 |
p.adjust | Adjusted p value |
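
To shortlist genes from the deAPA result, the statistics above can be filtered with awk. The sketch below assumes that the output file is tab-delimited and that its first line holds the column names; the helper name filter_deapa and the default cutoffs (p.adjust < 0.05, perc_diff >= 0.2) are illustrative choices, not values prescribed by APAtrap.

```shell
# Keep rows whose adjusted p value and PD index pass user-chosen cutoffs.
# ASSUMPTIONS: tab-delimited file, header line with the column names
# p.adjust and perc_diff; default cutoffs are only illustrative.
filter_deapa() {
    file=$1
    max_padj=${2:-0.05}
    min_pd=${3:-0.2}
    awk -F'\t' -v max_padj="$max_padj" -v min_pd="$min_pd" '
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i   # map column name to index
            if (!("p.adjust" in col) || !("perc_diff" in col)) {
                print "missing p.adjust or perc_diff column" > "/dev/stderr"
                exit 2
            }
            print                                   # keep the header
            next
        }
        $(col["p.adjust"]) != "NA" &&
        $(col["p.adjust"]) + 0 < max_padj + 0 &&
        $(col["perc_diff"]) + 0 >= min_pd + 0
    ' "$file"
}

# Usage (file name from the demo above):
#   filter_deapa test.APA.stat.txt 0.05 0.2 > test.APA.filtered.txt
```

Looking columns up by name rather than by position keeps the filter working even if the statistics are preceded by a varying number of per-sample columns.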