1. What is QuantifyPoly(A)?
QuantifyPoly(A) is a complete and seamless R package for accurately quantifying poly(A) site usages under diverse biological conditions. It employs a weighted density peak clustering algorithm to address the innate microheterogeneity of cleavage and polyadenylation.
If you have any question or comment, please contact with Dr. Congting Ye(yec@xmu.edu.cn). You can also report a bug as a Ticket request, or start a topic session in the Discussion webpage of the QuantifyPoly(A) website.
2. Installation of QuantifyPoly(A)?
Install software: SAMtools, bedtools, and R.
Install R packages: tools, bedr, stringr, outliers, dplyr, tidyr, matrixStats, pbmcapply, FactoMineR, factoextra, ggalt, ggplot2, uwot, GenomicRanges, GenomicFeatures, rtracklayer, Rsamtools, DESeq2, ggbio. If R >=3.5.0. ,
> if (!require("tools")) install.packages("tools") > if (!require("bedr")) install.packages("bedr") > if (!require("stringr")) install.packages("stringr") > if (!require("outliers")) install.packages("outliers") > if (!require("dplyr")) install.packages("dplyr") > if (!require("tidyr")) install.packages("tidyr") > if (!require("matrixStats")) install.packages("matrixStats") > if (!require("pbmcapply")) install.packages("pbmcapply") > if (!require("FactoMineR")) install.packages("FactoMineR") > if (!require("factoextra")) install.packages("factoextra") > if (!require("ggalt")) install.packages("ggalt") > if (!require("ggplot2")) install.packages("ggplot2") > if (!require("uwot")) install.packages("uwot") > if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") > if (!require("GenomicRanges")) BiocManager::install("GenomicRanges") > if (!require("GenomicFeatures")) BiocManager::install("GenomicFeatures") > if (!require("rtracklayer")) BiocManager::install("rtracklayer") > if (!require("Rsamtools")) BiocManager::install("Rsamtools") > if (!require("DESeq2")) BiocManager::install("DESeq2") > if (!require("ggbio")) BiocManager::install("ggbio")
If R < 3.5.0. ,
> if (!require("tools")) install.packages("tools") > if (!require("bedr")) install.packages("bedr") > if (!require("stringr")) install.packages("stringr") > if (!require("outliers")) install.packages("outliers") > if (!require("dplyr")) install.packages("dplyr") > if (!require("tidyr")) install.packages("tidyr") > if (!require("matrixStats")) install.packages("matrixStats") > if (!require("pbmcapply")) install.packages("pbmcapply") > if (!require("FactoMineR")) install.packages("FactoMineR") > if (!require("factoextra")) install.packages("factoextra") > if (!require("ggalt")) install.packages("ggalt") > if (!require("ggplot2")) install.packages("ggplot2") > if (!require("uwot")) install.packages("uwot") > source("https://bioconductor.org/biocLite.R") > if (!require("GenomicRanges")) BiocInstaller::biocLite("GenomicRanges") > if (!require("GenomicFeatures")) BiocInstaller::biocLite("GenomicFeatures") > if (!require("rtracklayer")) BiocInstaller::biocLite("rtracklayer") > if (!require("Rsamtools")) BiocInstaller::biocLite("Rsamtools") > if (!require("DESeq2")) BiocInstaller::biocLite("DESeq2") > if (!require("ggbio")) BiocInstaller::biocLite("ggbio")
Go to the webpage of the QuantifyPoly(A), click the 'Files' tab to download the latest version of QuantifyPoly(A) and related test dataset.
You should make sure the R environment is installed. After opening the R, change the R's Working Path to the path where 'QuantifyPolyA_0.X.0.tar.gz' is located (e.g., './Downloads'). Type the following command in the Command Window of R:
> install.packages("QuantifyPolyA_0.X.0.tar.gz",repos = NULL, type = "source")
3. How to run QuantifyPoly(A)?
There are 6 major steps of running QuantifyPoly(A): (a) input preparing; (b) internal priming artifact removing; (c) poly(A) site clustering; (d) feature annotation; (e) APA dynamics detection; (f) APA visualization (see Figure 1).
Figure 1. The schematic diagram of QuantifyPoly(A).
Load the raw poly(A) site information extracted from 3' end sequencing data.
1) Load files located in a directory,
> QpolyA <- Load.PolyA(dir='./Test')
2) Or load specified files,
> QpolyA <- Load.PolyA(files=c('Control1.bed','Control2.bed','Treatment1.bed','Treatment2.bed'))
files - A character vector specifying the names of files containing poly(A) site information, each file contains 4 columns 'seqnames', 'strand', 'coord', 'score'. dir - A string specifying the directory of files containing poly(A) site information.
Remove poly(A) sites which potentially result from internal priming events.
> QpolyA <- Remove.IP(QpolyA,fasta='Genome.fasta')
QpolyA - A QuantifyPolyA object containing all raw poly(A) sites. fasta - A string specifying the location and name of the corresponding genome file in FASTA format.
Cluster poly(A) sites into clusters (PACs) using a weighted density peak clustering algorithm.
> QpolyA <- Cluster.PolyA(QpolyA, max.gapwidth = 24, mc.cores = 4)
QpolyA - A QuantifyPolyA object containing clean poly(A) sites. max.gapwidth - A cutoff limiting the max distance between two adjacent poly(A) sites in a poly(A) site cluster (PAC), default value is 24. mc.cores - An integer indicating the number of processors/cores to perform the clustering, default value is 4.
Annotate the PACs based on existed genome annotations, and generate the PAC table of all experimental samples.
> QpolyA <- Annotate.PolyA(QpolyA,gff='Genome.annotation.gtf',seq.levels = NA)
QpolyA - A QuantifyPolyA object containing clean poly(A) sites and PACs. gff - A genome annotation file in GFF or GTF format, GTF format is recommended. seq.levels - A character vector of the names of Chromosomes/Contigs. This parameter is useful when the Chromosomes/Contigs names in genome file (used for short reads mapping) and genome annotation file are not consistent.
Identify APA usage patterns of genes showing significantly difference between two samples, treatments, or conditions. The dynamics detection could be performed on different categories of poly(A), e.g., canonical APA (PACs located at 3’ UTR), non-canonical APA (PACs located at intragenic regions exclude 3’ UTR), split APA (PACs recovered through the secondary clustering - weighted density peak clustering).
Quantify APA dynamics among split PACs identified by the weighted density peak clustering.
> colData <- data.frame(condition=c(rep('Control',2),rep('Treatment',2)),type='single-end') > split.APA <- Quantify.SplitAPA(QpolyA,colData,contrast=c("condition","Control","Treatment"))
QpolyA - A QuantifyPolyA object generated after poly(A) site clustering and annotation. colData - A data.frame object containing the information of experimental design. contrast - A three element vector specifying the two samples to be compared, e.g. contrast=c("condition","Control","Treatment").
Quantify APA dynamics among canonical PACs of a specific gene, canonical PACs represent PACs located at 3' UTR.
> colData <- data.frame(condition=c(rep('Control',2),rep('Treatment',2)),type='single-end') > canonical.APA <- Quantify.CanonicalAPA(QpolyA,colData,contrast=c("condition","Control","Treatment"))
QpolyA - A QuantifyPolyA object generated after poly(A) site clustering and annotation. colData - A data.frame object containing the information of experimental design. contrast - A three element vector specifying the two samples to be compared, e.g. contrast=c("condition","Control","Treatment").
Quantify the APA dynamics between two categories of PAC (canonical and non-canonical PACs in a gene). Non-canonical PACs represent PACs located at intragenic regions exclude 3’ UTR.
> colData <- data.frame(condition=c(rep('Control',2),rep('Treatment',2)),type='single-end') > cnc.APA <- Quantify.CNCAPA(QpolyA,colData,contrast=c("condition","Control","Treatment"))
QpolyA - A QuantifyPolyA object generated after poly(A) site clustering and annotation. colData - A data.frame object containing the information of experimental design. contrast - A three element vector specifying the two samples to be compared, e.g. contrast=c("condition","Control","Treatment").
Quantify APA dynamics among all PACs of a specific gene, each PAC is treated equally.
> colData <- data.frame(condition=c(rep('Control',2),rep('Treatment',2)),type='single-end') > gene.APA <- Quantify.GeneAPA(QpolyA,colData,contrast=c("condition","Control","Treatment"))
QpolyA - A QuantifyPolyA object generated after poly(A) site clustering and annotation. colData - A data.frame object containing the information of experimental design. contrast - A three element vector specifying the two samples to be compared, e.g. contrast=c("condition","Control","Treatment").
Visualize poly(A) profiles of a specific gene across multiple samples.
> library(rtracklayer) > anno <- import('Genome.annotation.gtf') > gene_id <- 'XXXXXXXX' > dp <- Visualize.PolyA(QpolyA,anno,gene_id) > library(ggsci) > dp + scale_color_aaas() + scale_fill_aaas()
QpolyA - A QuantifyPolyA object containing clean poly(A) sites and PAC table. anno - A Granges object of gene model info. It could be generated using the function 'import' of R package 'rtracklayer', e.g. gtf = import('./Mus_musculus.GRCm38.84.gtf') gene.id - A string indicating the gene to be visualized, e.g. gene.id = 'ENSMUSG00000073490'. sample.names - A character vector specifying the samples to be visualized. If not defined, all samples will display. coord.lim - Two numeric values, specifying the left and right limit of X-axis.
4. Demos of running QuantifyPoly(A)
Steps of using QuantifyPoly(A) to process test data from Arabidopsis:
1) download the test data Arabidopsis_FY.zip from QuantifyPoly(A) website, the genome release TAIR10_chr_all.fas from The Arabidopsis Information Resource (TAIR), and the corresponding genome annotation Arabidopsis_thaliana.TAIR10.45.gtf.gz from EnsemblPlants.
2) move the downloaded files 'Arabidopsis_FY.zip', 'TAIR10_chr_all.fas', and 'Arabidopsis_thaliana.TAIR10.45.gtf.gz' to the Working Path and unzip 'Arabidopsis_PAT.zip' and 'Arabidopsis_thaliana.TAIR10.45.gtf.gz'.
> library(QuantifyPolyA) > QpolyA <- Load.PolyA(dir = './Arabidopsis_FY')
> QpolyA <- Remove.IP(QpolyA,'TAIR10_chr_all.fas')
> QpolyA <- Cluster.PolyA(QpolyA, max.gapwidth = 24, mc.cores = 16)
> QpolyA <- Annotate.PolyA(QpolyA,'Arabidopsis_thaliana.TAIR10.45.gtf',seq.levels = as.character(unique(QpolyA@polyA$seqnames)))
> QpolyA <- Filter.PolyA(QpolyA, min_count = 10, min_sample = 1)
> colData <- data.frame(condition=c(rep('fy3',3), rep('fy3oxt6',3), rep('oxt6',3), rep('wt',3)),type='single-end') > rownames(colData) <- QpolyA@sample_names > # Quantify APA dynamics among split PACs. > split.APA <- Quantify.SplitAPA(QpolyA,colData,contrast=c("condition","wt","oxt6")) > # Quantify APA dynamics among canonical PACs. > canonical.APA <- Quantify.CanonicalAPA(QpolyA,colData,contrast=c("condition","wt","oxt6")) > # Quantify the APA dynamics between canonical and non-canonical PACs. > cnc.APA <- Quantify.CNCAPA(QpolyA,colData,contrast=c("condition","wt","oxt6")) > # Quantify the APA dynamics across a whole gene. > gene.APA <- Quantify.GeneAPA(QpolyA,colData,contrast=c("condition","wt","oxt6"))
> # Immport annotaions. > library(rtracklayer) > anno <- import('Arabidopsis_thaliana.TAIR10.45.gtf') > # Get poly(A) profiles of specific genes. > gene_id <- 'AT1G01140' > dp <- Visualize.PolyA(QpolyA,anno,gene_id,sample.names = c('wt','fy3oxt6')) > # Customize the plot. > library(ggsci) > dp + scale_color_aaas() + scale_fill_aaas()
Figure 2. Illustration of a gene CIPK9 showing dynamic APA profiles between wild type (wt) and double mutants (fy3oxt6) samples.
> # Run basic deseq2 analysis. > res <- DESeq2.PolyA(QpolyA,colData) > # Get result of a specific comparison. > library(DESeq2) > compare_wt_oxt6 <- results(res$DESeq2.Result, contrast=c("condition","wt","oxt6")) > compare_wt_oxt6.df <- as.data.frame(compare_wt_oxt6) > # Summarize some basic tallies. > summary(compare_wt_oxt6) > sum(compare_wt_oxt6$padj < 0.05, na.rm=TRUE) > # Show the PCA plot. > res$PCA.Plot
Figure 3. The PCA plot based on the PAC matrix.
Steps of using QuantifyPoly(A) to process test data from human:
1) download the test data Human_MAQC.zip from QuantifyPoly(A) website, the genome release Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz and the corresponding genome annotation Homo_sapiens.GRCh38.101.gtf.gz from Ensembl.
2) move the downloaded files 'Human_MAQC.zip', 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz', and 'Homo_sapiens.GRCh38.101.gtf.gz' to the Working Path and unzip them.
> library(QuantifyPolyA) > QpolyA <- Load.PolyA(dir = './Human_MAQC')
> QpolyA <- Remove.IP(QpolyA,'Homo_sapiens.GRCh38.dna.primary_assembly.fa')
> QpolyA <- Cluster.PolyA(QpolyA, max.gapwidth = 24, mc.cores = 16)
> QpolyA <- Annotate.PolyA(QpolyA,'Homo_sapiens.GRCh38.101.gtf')
> QpolyA = Filter.PolyA(QpolyA, min_count = 10, min_sample = 1)
> colData = data.frame(condition=c(rep('Brain',2), rep('UHR',2)),type='single-end') > rownames(colData) = QpolyA@sample_names > # Quantify APA dynamics among split PACs. > split.APA = Quantify.SplitAPA(QpolyA,colData,contrast=c("condition","Brain","UHR")) > # Quantify APA dynamics among canonical PACs. > canonical.APA = Quantify.CanonicalAPA(QpolyA,colData,contrast=c("condition","Brain","UHR")) > # Quantify the APA dynamics between canonical and non-canonical PACs. > cnc.APA = Quantify.CNCAPA(QpolyA,colData,contrast=c("condition","Brain","UHR")) > # Quantify the APA dynamics across a whole gene. > gene.APA = Quantify.GeneAPA(QpolyA,colData,contrast=c("condition","Brain","UHR"))
> # Immport annotaions. > library(rtracklayer) > anno = import('Homo_sapiens.GRCh38.101.gtf') > # Get poly(A) profiles of specific genes. > gene_id = 'ENSG00000164091' > dp = Visualize.PolyA(QpolyA,anno,gene_id,coord.lim=c(52254310,52255210)) > # Customize the plot. > library(ggsci) > dp + scale_color_aaas() + scale_fill_aaas()
Figure 4. Illustration of a gene PCNX1 showing dynamic APA profiles between brain and UHR samples.
> # Run basic deseq2 analysis. > res = DESeq2.PolyA(QpolyA,colData) > # Get result of a specific comparison. > library(DESeq2) > compare_Brain_UHR <- results(res$DESeq2.Result, contrast=c("condition","Brain","UHR")) > compare_Brain_UHR.df = as.data.frame(compare_Brain_UHR) > # Summarize some basic tallies. > summary(compare_Brain_UHR) > sum(compare_Brain_UHR$padj < 0.05, na.rm=TRUE) > # Show the PCA plot. > res$PCA.Plot
Figure 5. The PCA plot based on the PAC matrix.
5. Inputs of QuantifyPoly(A)
The main inputs of QuantifyPoly(A) are raw poly(A) data (one file per sample) of genomic coordinates of poly(A) sites, and corresponding genome sequence and genome annotation files.
Use commonly used tools, e.g. STAR, and HISAT2 etc., and pipelines from PlantAPAdb or PolyASite to align 3' end sequencing reads, and extract genomic coordinates of poly(A) sites.
Tab-separated values files containing 4 columns as follows,
Columns | Explanation |
---|---|
seqnames | Chromosome or contig names. |
strand | '+' or '-', indicates the strand the ploy(A) site located. |
coord | Coordinates of poly(A) sites. |
score | Number of reads support each poly(A) site. |
6. Outputs of QuantifyPoly(A)
A QuantifyPolyA object with 5 slots:
Slots | Explanation |
---|---|
sample_names | A character vector storing names of samples. Sample names are extracted from the names of input files. |
pre.polyA | A list storing un-clustered poly(A) sites. |
polyA | A data.frame storing PACs (Poloy(A) Site Clusters) generated from the weighted density peak clustering. |
simple.clusters | A data.frame storing PACs generated from the k-nt iteratively clustering. |
split.clusters | A data.frame storing only the split PACs. |
A tibble object containing APA dynamics metrics of all split PACs or genes.
Columns | Explanation |
---|---|
split_label or gene_id | A label indicates the group of split PACs, or a gene ID. |
pd | A percentage difference metric measures the dynamics induced by biological condition changes. |
r | An averaged Pearson correlation coefficient indicates the lengthening/shortening status. |
p.value | Statistical test p value. |
A ggplot2 object, users can use relevant functions of ggplot2 package to customize the output.
Examples of how to customize the plot could be found at section 4. Demos of running QuantifyPoly(A).