QuantifyPoly(A) Wiki

Quantification of poly(A) sites from 3' end sequencing data

Brought to you by: yec

User Manual

Authors:

Attachments

Figure1.png (127092 bytes)

Figure2.png (56202 bytes)

Figure3.png (21586 bytes)

Figure4.png (30683 bytes)

Figure5.png (15542 bytes)

QuantifyPoly(A) User Manual

What is QuantifyPoly(A)?
Installation of QuantifyPoly(A)
How to run QuantifyPoly(A)?
3.1 Load raw poly(A) site data
3.2 Remove internal priming artifacts
3.3 Cluster poly(A) sites
3.4 Annotate and quantify poly(A) site clusters
3.5 Quantify APA dynamics
3.6 Visualize poly(A) profiles
Demo of running QuantifyPoly(A)
4.1 Application of QuantifyPoly(A) in an Arabidopsis dataset
4.2 Application of QuantifyPoly(A) in a human dataset
Inputs of QuantifyPoly(A)
Outputs of QuantifyPoly(A)

1. What is QuantifyPoly(A)?

QuantifyPoly(A) is a complete and seamless R package for accurately quantifying poly(A) site usages under diverse biological conditions. It employs a weighted density peak clustering algorithm to address the innate microheterogeneity of cleavage and polyadenylation.

If you have any question or comment, please contact with Dr. Congting Ye(yec@xmu.edu.cn). You can also report a bug as a Ticket request, or start a topic session in the Discussion webpage of the QuantifyPoly(A) website.

Ye C, Zhao D, Ye W, Wu X, Ji G, Li Q. Q., Lin J (2021) QuantifyPoly(A): reshaping alternative polyadenylation landscapes of eukaryotes with weighted density peak clustering. Briefings in Bioinformatics doi: 10.1093/bib/bbab268

↑Back To Top

2. Installation of QuantifyPoly(A)?

[1]. Install dependencies of QuantifyPoly(A)

Install software: SAMtools, bedtools, and R.
Install R packages: tools, bedr, stringr, outliers, dplyr, tidyr, matrixStats, pbmcapply, FactoMineR, factoextra, ggalt, ggplot2, uwot, GenomicRanges, GenomicFeatures, rtracklayer, Rsamtools, DESeq2, ggbio. If R >=3.5.0. ,

> if (!require("tools")) install.packages("tools")
> if (!require("bedr")) install.packages("bedr")
> if (!require("stringr")) install.packages("stringr")
> if (!require("outliers")) install.packages("outliers")
> if (!require("dplyr")) install.packages("dplyr")
> if (!require("tidyr")) install.packages("tidyr")
> if (!require("matrixStats")) install.packages("matrixStats")
> if (!require("pbmcapply")) install.packages("pbmcapply")
> if (!require("FactoMineR")) install.packages("FactoMineR")
> if (!require("factoextra")) install.packages("factoextra")
> if (!require("ggalt")) install.packages("ggalt")
> if (!require("ggplot2")) install.packages("ggplot2")
> if (!require("uwot")) install.packages("uwot")


> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> if (!require("GenomicRanges")) BiocManager::install("GenomicRanges")
> if (!require("GenomicFeatures")) BiocManager::install("GenomicFeatures")
> if (!require("rtracklayer")) BiocManager::install("rtracklayer")
> if (!require("Rsamtools")) BiocManager::install("Rsamtools")
> if (!require("DESeq2")) BiocManager::install("DESeq2")
> if (!require("ggbio")) BiocManager::install("ggbio")

If R < 3.5.0. ,

> if (!require("tools")) install.packages("tools")
> if (!require("bedr")) install.packages("bedr")
> if (!require("stringr")) install.packages("stringr")
> if (!require("outliers")) install.packages("outliers")
> if (!require("dplyr")) install.packages("dplyr")
> if (!require("tidyr")) install.packages("tidyr")
> if (!require("matrixStats")) install.packages("matrixStats")
> if (!require("pbmcapply")) install.packages("pbmcapply")
> if (!require("FactoMineR")) install.packages("FactoMineR")
> if (!require("factoextra")) install.packages("factoextra")
> if (!require("ggalt")) install.packages("ggalt")
> if (!require("ggplot2")) install.packages("ggplot2")
> if (!require("uwot")) install.packages("uwot")


> source("https://bioconductor.org/biocLite.R")
> if (!require("GenomicRanges")) BiocInstaller::biocLite("GenomicRanges")
> if (!require("GenomicFeatures")) BiocInstaller::biocLite("GenomicFeatures")
> if (!require("rtracklayer")) BiocInstaller::biocLite("rtracklayer")
> if (!require("Rsamtools")) BiocInstaller::biocLite("Rsamtools")
> if (!require("DESeq2")) BiocInstaller::biocLite("DESeq2")
> if (!require("ggbio")) BiocInstaller::biocLite("ggbio")

[2]. Download our package - 'QuantifyPolyA_0.X.0.tar.gz'

Go to the webpage of the QuantifyPoly(A), click the 'Files' tab to download the latest version of QuantifyPoly(A) and related test dataset.

[3]. Install R package 'QuantifyPolyA_0.X.0.tar.gz'

You should make sure the R environment is installed. After opening the R, change the R's Working Path to the path where 'QuantifyPolyA_0.X.0.tar.gz' is located (e.g., './Downloads'). Type the following command in the Command Window of R:

> install.packages("QuantifyPolyA_0.X.0.tar.gz",repos = NULL, type = "source")

↑Back To Top

3. How to run QuantifyPoly(A)?

There are 6 major steps of running QuantifyPoly(A): (a) input preparing; (b) internal priming artifact removing; (c) poly(A) site clustering; (d) feature annotation; (e) APA dynamics detection; (f) APA visualization (see Figure 1).

Figure 1. The schematic diagram of QuantifyPoly(A).

3.1 Load raw poly(A) site data (a)

Load the raw poly(A) site information extracted from 3' end sequencing data.

Usage

1) Load files located in a directory,

> QpolyA <- Load.PolyA(dir='./Test')

2) Or load specified files,

> QpolyA <- Load.PolyA(files=c('Control1.bed','Control2.bed','Treatment1.bed','Treatment2.bed'))

Arguments

files - A character vector specifying the names of files containing poly(A) site information, each file contains 4 columns 'seqnames', 'strand', 'coord', 'score'.
dir - A string specifying the directory of files containing poly(A) site information.

↑Back To Top

3.2 Remove internal priming artifacts (b)

Remove poly(A) sites which potentially result from internal priming events.

Usage

> QpolyA <- Remove.IP(QpolyA,fasta='Genome.fasta')

Arguments

QpolyA - A QuantifyPolyA object containing all raw poly(A) sites.
fasta - A string specifying the location and name of the corresponding genome file in FASTA format.

↑Back To Top

3.3 Cluster poly(A) sites (c)

Cluster poly(A) sites into clusters (PACs) using a weighted density peak clustering algorithm.

Usage

> QpolyA <- Cluster.PolyA(QpolyA, max.gapwidth = 24, mc.cores = 4)

Arguments

QpolyA - A QuantifyPolyA object containing clean poly(A) sites.
max.gapwidth - A cutoff limiting the max distance between two adjacent poly(A) sites in a poly(A) site cluster (PAC), default value is 24.
mc.cores - An integer indicating the number of processors/cores to perform the clustering, default value is 4.

↑Back To Top

3.4 Annotate and quantify poly(A) site clusters (d)

Annotate the PACs based on existed genome annotations, and generate the PAC table of all experimental samples.

Usage

> QpolyA <- Annotate.PolyA(QpolyA,gff='Genome.annotation.gtf',seq.levels = NA)

Arguments

QpolyA - A QuantifyPolyA object containing clean poly(A) sites and PACs.
gff - A genome annotation file in GFF or GTF format, GTF format is recommended.
seq.levels - A character vector of the names of Chromosomes/Contigs. This parameter is useful when the Chromosomes/Contigs names in genome file (used for short reads mapping) and genome annotation file are not consistent.

↑Back To Top

3.5 Quantify APA dynamics (e)

Identify APA usage patterns of genes showing significantly difference between two samples, treatments, or conditions. The dynamics detection could be performed on different categories of poly(A), e.g., canonical APA (PACs located at 3’ UTR), non-canonical APA (PACs located at intragenic regions exclude 3’ UTR), split APA (PACs recovered through the secondary clustering - weighted density peak clustering).

3.5.1 Quantify APA dynamics among split PACs

Quantify APA dynamics among split PACs identified by the weighted density peak clustering.

Usage

> colData <- data.frame(condition=c(rep('Control',2),rep('Treatment',2)),type='single-end')
> split.APA <- Quantify.SplitAPA(QpolyA,colData,contrast=c("condition","Control","Treatment"))

Parameters:

QpolyA - A QuantifyPolyA object generated after poly(A) site clustering and annotation.
colData - A data.frame object containing the information of experimental design.
contrast - A three element vector specifying the two samples to be compared, e.g. contrast=c("condition","Control","Treatment").

3.5.2 Quantify APA dynamics among canonical PACs

Quantify APA dynamics among canonical PACs of a specific gene, canonical PACs represent PACs located at 3' UTR.

Usage

> colData <- data.frame(condition=c(rep('Control',2),rep('Treatment',2)),type='single-end')
> canonical.APA <- Quantify.CanonicalAPA(QpolyA,colData,contrast=c("condition","Control","Treatment"))

Parameters:

QpolyA - A QuantifyPolyA object generated after poly(A) site clustering and annotation.
colData - A data.frame object containing the information of experimental design.
contrast - A three element vector specifying the two samples to be compared, e.g. contrast=c("condition","Control","Treatment").

3.5.3 Quantify the APA dynamics between canonical and non-canonical PACs

Quantify the APA dynamics between two categories of PAC (canonical and non-canonical PACs in a gene). Non-canonical PACs represent PACs located at intragenic regions exclude 3’ UTR.

Usage

> colData <- data.frame(condition=c(rep('Control',2),rep('Treatment',2)),type='single-end')
> cnc.APA <- Quantify.CNCAPA(QpolyA,colData,contrast=c("condition","Control","Treatment"))

Parameters:

QpolyA - A QuantifyPolyA object generated after poly(A) site clustering and annotation.
colData - A data.frame object containing the information of experimental design.
contrast - A three element vector specifying the two samples to be compared, e.g. contrast=c("condition","Control","Treatment").

3.5.4 Quantify the APA dynamics across a whole gene

Quantify APA dynamics among all PACs of a specific gene, each PAC is treated equally.

Usage

> colData <- data.frame(condition=c(rep('Control',2),rep('Treatment',2)),type='single-end')
> gene.APA <- Quantify.GeneAPA(QpolyA,colData,contrast=c("condition","Control","Treatment"))

Parameters:

QpolyA - A QuantifyPolyA object generated after poly(A) site clustering and annotation.
colData - A data.frame object containing the information of experimental design.
contrast - A three element vector specifying the two samples to be compared, e.g. contrast=c("condition","Control","Treatment").

↑Back To Top

3.6 Visualize poly(A) profiles (f)

Visualize poly(A) profiles of a specific gene across multiple samples.

Usage

> library(rtracklayer)
> anno <- import('Genome.annotation.gtf')
> gene_id <- 'XXXXXXXX'
> dp <- Visualize.PolyA(QpolyA,anno,gene_id)
> library(ggsci)
> dp + scale_color_aaas() + scale_fill_aaas()

Parameters:

QpolyA - A QuantifyPolyA object containing clean poly(A) sites and PAC table.
anno - A Granges object of gene model info. It could be generated using the function 'import' of R package 'rtracklayer', e.g. gtf = import('./Mus_musculus.GRCm38.84.gtf')
gene.id - A string indicating the gene to be visualized, e.g. gene.id = 'ENSMUSG00000073490'.
sample.names - A character vector specifying the samples to be visualized. If not defined, all samples will display.
coord.lim - Two numeric values, specifying the left and right limit of X-axis.

↑Back To Top

4. Demos of running QuantifyPoly(A)

4.1 Application of QuantifyPoly(A) in an Arabidopsis dataset

Steps of using QuantifyPoly(A) to process test data from Arabidopsis:

Preparation of necessary data

1) download the test data Arabidopsis_FY.zip from QuantifyPoly(A) website, the genome release TAIR10_chr_all.fas from The Arabidopsis Information Resource (TAIR), and the corresponding genome annotation Arabidopsis_thaliana.TAIR10.45.gtf.gz from EnsemblPlants.
2) move the downloaded files 'Arabidopsis_FY.zip', 'TAIR10_chr_all.fas', and 'Arabidopsis_thaliana.TAIR10.45.gtf.gz' to the Working Path and unzip 'Arabidopsis_PAT.zip' and 'Arabidopsis_thaliana.TAIR10.45.gtf.gz'.

**1. load raw poly(A) site data **

> library(QuantifyPolyA)
> QpolyA <- Load.PolyA(dir = './Arabidopsis_FY')

**2. remove internal priming artifacts **

> QpolyA <- Remove.IP(QpolyA,'TAIR10_chr_all.fas')

**3. cluster poly(A) sites **

> QpolyA <- Cluster.PolyA(QpolyA, max.gapwidth = 24, mc.cores = 16)

**4. annotate and quantify poly(A) site clusters **

> QpolyA <- Annotate.PolyA(QpolyA,'Arabidopsis_thaliana.TAIR10.45.gtf',seq.levels = as.character(unique(QpolyA@polyA$seqnames)))

**5. filter out PACs with low read count support **

> QpolyA <- Filter.PolyA(QpolyA, min_count = 10, min_sample = 1)

**6. quantify APA dynamics **

> colData <- data.frame(condition=c(rep('fy3',3),
                                 rep('fy3oxt6',3),
                                 rep('oxt6',3),
                                 rep('wt',3)),type='single-end')
> rownames(colData) <- QpolyA@sample_names

> # Quantify APA dynamics among split PACs.
> split.APA <- Quantify.SplitAPA(QpolyA,colData,contrast=c("condition","wt","oxt6"))

> # Quantify APA dynamics among canonical PACs.
> canonical.APA <- Quantify.CanonicalAPA(QpolyA,colData,contrast=c("condition","wt","oxt6"))

> # Quantify the APA dynamics between canonical and non-canonical PACs.
> cnc.APA <- Quantify.CNCAPA(QpolyA,colData,contrast=c("condition","wt","oxt6"))

> # Quantify the APA dynamics across a whole gene.
> gene.APA <- Quantify.GeneAPA(QpolyA,colData,contrast=c("condition","wt","oxt6"))

**7. visualize poly(A) profiles **

> # Immport annotaions.
> library(rtracklayer)
> anno <- import('Arabidopsis_thaliana.TAIR10.45.gtf')

> # Get poly(A) profiles of specific genes.
> gene_id <- 'AT1G01140'
> dp <- Visualize.PolyA(QpolyA,anno,gene_id,sample.names = c('wt','fy3oxt6'))

> # Customize the plot.
> library(ggsci)
> dp + scale_color_aaas() + scale_fill_aaas()

Figure 2. Illustration of a gene CIPK9 showing dynamic APA profiles between wild type (wt) and double mutants (fy3oxt6) samples.

**8. perform DESeq2 analysis **

> # Run basic deseq2 analysis.
> res <- DESeq2.PolyA(QpolyA,colData)

> # Get result of a specific comparison.
> library(DESeq2)
> compare_wt_oxt6 <- results(res$DESeq2.Result, contrast=c("condition","wt","oxt6"))
> compare_wt_oxt6.df <-  as.data.frame(compare_wt_oxt6)

> # Summarize some basic tallies.
> summary(compare_wt_oxt6) 
> sum(compare_wt_oxt6$padj < 0.05, na.rm=TRUE)

> # Show the PCA plot.
> res$PCA.Plot

Figure 3. The PCA plot based on the PAC matrix.

↑Back To Top

4.2 Application of QuantifyPoly(A) in a human dataset

Steps of using QuantifyPoly(A) to process test data from human:

Preparation of necessary data

1) download the test data Human_MAQC.zip from QuantifyPoly(A) website, the genome release Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz and the corresponding genome annotation Homo_sapiens.GRCh38.101.gtf.gz from Ensembl.
2) move the downloaded files 'Human_MAQC.zip', 'Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz', and 'Homo_sapiens.GRCh38.101.gtf.gz' to the Working Path and unzip them.

**1. load raw poly(A) site data **

> library(QuantifyPolyA)
> QpolyA <- Load.PolyA(dir = './Human_MAQC')

**2. remove internal priming artifacts **

> QpolyA <- Remove.IP(QpolyA,'Homo_sapiens.GRCh38.dna.primary_assembly.fa')

**3. cluster poly(A) sites **

> QpolyA <- Cluster.PolyA(QpolyA, max.gapwidth = 24, mc.cores = 16)

**4. annotate and quantify poly(A) site clusters **

> QpolyA <- Annotate.PolyA(QpolyA,'Homo_sapiens.GRCh38.101.gtf')

**5. filter out PACs with low read count support **

> QpolyA = Filter.PolyA(QpolyA, min_count = 10, min_sample = 1)

**6. quantify APA dynamics **

> colData = data.frame(condition=c(rep('Brain',2),
                                 rep('UHR',2)),type='single-end')
> rownames(colData) = QpolyA@sample_names

> # Quantify APA dynamics among split PACs.
> split.APA = Quantify.SplitAPA(QpolyA,colData,contrast=c("condition","Brain","UHR"))

> # Quantify APA dynamics among canonical PACs.
> canonical.APA = Quantify.CanonicalAPA(QpolyA,colData,contrast=c("condition","Brain","UHR"))

> # Quantify the APA dynamics between canonical and non-canonical PACs.
> cnc.APA = Quantify.CNCAPA(QpolyA,colData,contrast=c("condition","Brain","UHR"))

> # Quantify the APA dynamics across a whole gene.
> gene.APA = Quantify.GeneAPA(QpolyA,colData,contrast=c("condition","Brain","UHR"))

**7. visualize poly(A) profiles **

> # Immport annotaions.
> library(rtracklayer)
> anno = import('Homo_sapiens.GRCh38.101.gtf')

> # Get poly(A) profiles of specific genes.
> gene_id = 'ENSG00000164091'
> dp = Visualize.PolyA(QpolyA,anno,gene_id,coord.lim=c(52254310,52255210))

> # Customize the plot.
> library(ggsci)
> dp + scale_color_aaas() + scale_fill_aaas()

Figure 4. Illustration of a gene PCNX1 showing dynamic APA profiles between brain and UHR samples.

**8. perform DESeq2 analysis **

> # Run basic deseq2 analysis.
> res = DESeq2.PolyA(QpolyA,colData)

> # Get result of a specific comparison.
> library(DESeq2)
> compare_Brain_UHR <- results(res$DESeq2.Result, contrast=c("condition","Brain","UHR"))
> compare_Brain_UHR.df =  as.data.frame(compare_Brain_UHR)

> # Summarize some basic tallies.
> summary(compare_Brain_UHR) 
> sum(compare_Brain_UHR$padj < 0.05, na.rm=TRUE)

> # Show the PCA plot.
> res$PCA.Plot

Figure 5. The PCA plot based on the PAC matrix.

↑Back To Top

5. Inputs of QuantifyPoly(A)

The main inputs of QuantifyPoly(A) are raw poly(A) data (one file per sample) of genomic coordinates of poly(A) sites, and corresponding genome sequence and genome annotation files.

[1] Generate short read mapping results and raw poly(A) site data

Use commonly used tools, e.g. STAR, and HISAT2 etc., and pipelines from PlantAPAdb or PolyASite to align 3' end sequencing reads, and extract genomic coordinates of poly(A) sites.

[2] Format of raw poly(A) site data

Tab-separated values files containing 4 columns as follows,

Columns	Explanation
seqnames	Chromosome or contig names.
strand	'+' or '-', indicates the strand the ploy(A) site located.
coord	Coordinates of poly(A) sites.
score	Number of reads support each poly(A) site.

↑Back To Top

6. Outputs of QuantifyPoly(A)

[1] Output of preprocessing (a-d)

A QuantifyPolyA object with 5 slots:

Slots	Explanation
sample_names	A character vector storing names of samples. Sample names are extracted from the names of input files.
pre.polyA	A list storing un-clustered poly(A) sites.
polyA	A data.frame storing PACs (Poloy(A) Site Clusters) generated from the weighted density peak clustering.
simple.clusters	A data.frame storing PACs generated from the k-nt iteratively clustering.
split.clusters	A data.frame storing only the split PACs.

[2] Output of APA dynamics detection (e)

A tibble object containing APA dynamics metrics of all split PACs or genes.

Columns	Explanation
split_label or gene_id	A label indicates the group of split PACs, or a gene ID.
pd	A percentage difference metric measures the dynamics induced by biological condition changes.
r	An averaged Pearson correlation coefficient indicates the lengthening/shortening status.
p.value	Statistical test p value.

[3] Output of APA visualization (f)

A ggplot2 object, users can use relevant functions of ggplot2 package to customize the output.
Examples of how to customize the plot could be found at section 4. Demos of running QuantifyPoly(A).

↑Back To Top

QuantifyPoly(A) Wiki

Quantification of poly(A) sites from 3' end sequencing data

User Manual

QuantifyPoly(A) User Manual

Table of Contents

[1]. Install dependencies of QuantifyPoly(A)

[2]. Download our package - 'QuantifyPolyA_0.X.0.tar.gz'

[3]. Install R package 'QuantifyPolyA_0.X.0.tar.gz'

3.1 Load raw poly(A) site data (a)

Usage

Arguments

3.2 Remove internal priming artifacts (b)

Usage

Arguments

3.3 Cluster poly(A) sites (c)

Usage

Arguments

3.4 Annotate and quantify poly(A) site clusters (d)

Usage

Arguments

3.5 Quantify APA dynamics (e)

3.5.1 Quantify APA dynamics among split PACs

Usage

Parameters:

3.5.2 Quantify APA dynamics among canonical PACs

Usage

Parameters:

3.5.3 Quantify the APA dynamics between canonical and non-canonical PACs

Usage

Parameters:

3.5.4 Quantify the APA dynamics across a whole gene

Usage

Parameters:

3.6 Visualize poly(A) profiles (f)

Usage

Parameters:

4.1 Application of QuantifyPoly(A) in an Arabidopsis dataset

4.2 Application of QuantifyPoly(A) in a human dataset

[1] Generate short read mapping results and raw poly(A) site data

[2] Format of raw poly(A) site data

[1] Output of preprocessing (a-d)

[2] Output of APA dynamics detection (e)

[3] Output of APA visualization (f)