1. What is scDAPA?
scDAPA is a tool capable of identifying and visualizing dynamic alternative polyadenylation (APA) from scRNA-seq data. If you have any question or comment, please contact with Dr. Congting Ye(yec@xmu.edu.cn). You can also report a bug as a Ticket request, or start a topic session in the Discussion webpage of this website.
2. Installation of scDAPA?
Install softwares: GNU Awk, SAMtools, bedtools, and R.
Install R packages: tools, stringr, rtracklayer, ggplot2, ggbio. If R >=3.5.0. ,
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> if (!require("rtracklayer")) BiocManager::install("rtracklayer")
> if (!require("ggplot2")) install.packages("ggplot2")
> if (!require("ggbio")) BiocManager::install("ggbio")
> if (!require("stringr")) install.packages("stringr")
> if (!require("tools")) install.packages("tools")
If R < 3.5.0. ,
> source("https://bioconductor.org/biocLite.R")
> if (!require("rtracklayer")) BiocInstaller::biocLite("rtracklayer")
> if (!require("ggplot2")) install.packages("ggplot2")
> if (!require("ggbio")) BiocInstaller::biocLite("ggbio")
> if (!require("stringr")) install.packages("stringr")
> if (!require("tools")) install.packages("tools")
After you download and unzip our scDAPA package, you will see a folder named scDAPA, where 3 Shell scripts and and 1 R package ('scDAPAminer_1.1.tar.gz') are located.
$ chmod +x extractReads.sh extractGenes.sh annotate3Ends.sh
You shoule make sure the R environment is installed. After opening the R, change the R's Working Path to the path where 'scDAPAminer_1.1.tar.gz' is located (e.g., './scDAPA'). Type the following command in the Command Window of R:
> install.packages("scDAPAminer_1.1.tar.gz",repos = NULL, type = "source")
3. How to run scDAPA?
There are 3 major steps of running scDAPA: (1) extraction and annotation of 3′ ends from scRNA-seq data; (2) detection of dynamic APA ; and (3) visualization of dynamic APA. The step (1) can only run on Linux, the step (2) and (3) can run on Linux, Windows or Mac OS.
Extract valid mapping records from a bam/sam file.
extractReads.sh -h
-r|-read a file of short reads mapping result in bam/sam format.
-c|-cluster a csv format file of cell clustering result, the first column is cell barcode, and the second column is cluster label.
-o|-output a output directory, if not set, files of extracted reads will be stored in current working path.
$ ./extractReads.sh -r pbmc_10k_v3_possorted_genome_bam.bam -c ./analysis/clustering/kmeans_10_clusters/clusters.csv -o ./result
Extract gene annotation from a gff/gtf file.
extractGenes.sh -h
-i|-input a genome annotation file in gff/gtf format.
-o|-output a file to store the gene annotations.
$ ./extractGenes.sh -i ./Homo_sapiens.GRCh38.86.gtf -o hg38.gene.gff
Annotate valid mapping reads from a/multiple sam file(s).
annotate3Ends.sh -h
-f|-file a single sam file generated by extractReads.sh.
-d|-dir a directory containing the sam files generated by extractReads.sh.
-g|-gene a file of gene annotation extracted by extractGenes.sh.
1) Annotate a single file,
$./annotate3Ends.sh -f celltype_a.sam -g hg38.gene.gff
2) Annotate multiple files in a directory,
$./annotate3Ends.sh -d ./result -g hg38.gene.gff
Identify genes with dynamic APA usage using scRNA-seq data.
1) type='f2f', compare two cell groups stored in two different files,
scDAPAdetect(file1='cell_A.anno',file2='cell_B.anno',type='f2f',output_dir='./',bin_size=100,count_cutoff=20)
2) type='d2d', compare two same cell groups stored in two different directories,
scDAPAdetect(dir1='./control',dir2='./treatment',type='d2d',output_dir='./stat',bin_size=100,count_cutoff=20)
3) type='d', compare every two cell groups stored in one directory,
scDAPAdetect(dir='./anno_result',type='d',output_dir='./stat',bin_size=100,count_cutoff=20)
file1 - string, an input file generated by 'annotate3Ends.sh', set when type='f2f'.
file2 - string, an input file generated by 'annotate3Ends.sh', set when type='f2f'.
dir1 - string, a directory of files generated by 'annotate3Ends.sh', set when type='d2d'.
dir2 - string, a directory of files generated by 'annotate3Ends.sh', set when type='d2d'.
dir - string, a directory of files generated by 'annotate3Ends.sh', set when type='d'.
type - string, indicating which type of input(s) is/are used.
output_dir - string, a directory to store the output(s), default is the current working path.
bin_size - number, size of bin/window used to quantify the APA usage, default is 100.
count_cutoff - number, minimum number of 3' ends required for each gene per cell type, default is 20.
View 3' ends distributions of gene from scRNA-seq data.
scDAPAview(files=c('MG0.anno','sMG3.anno'),alt_names=c('MG0','sMG3'),gtf=import('Mus_musculus.GRCm38.84.gtf'),gene_id='ENSMUSG00000073490')
files - string vector, a string vector of names of files generated by 'annotate3Ends.sh'.
alt_names - string vector, a string vector of alternative names to be shown, instead of corresponding file names.
dir - string, a directory of files generated by 'annotate3Ends.sh'.
gtf - string, a Granges object of gene model info. It could be generated using the function 'import' of R package 'rtracklayer', e.g. gtf = import('Mus_musculus.GRCm38.84.gtf').
gene_id - string, indicates which gene to be visualized. E.g. gene_id = 'ENSMUSG00000073490'.
adjust - A multiplicate bandwidth adjustment of the 3' ends density plot. This makes it possible to adjust the bandwidth while still using the a bandwidth estimator. For example, adjust = 1/2 means use half of the default bandwidth. Default, adjust = 1/5.
heights - A numeric vector of length=2 to indicate the ratio of each track (isoform track and 3' ends track). Default, heights = c(0.5,0.5).
legend.position - The position of legends ("none", "left", "right", "bottom", "top", or two-element numeric vector). Default, legend.position = c(0.8,0.8).
coord.lim - Two numeric values, specifying the left and right limit of X-axis, e.g. coord.lim = c(1000,2000).
4. Demo of running scDAPA
Steps of using scDAPA to process test data from 10x Genomics:
1) download public dataset Genome-aligned BAM and Clustering analysis from 10x Genomics.
2) download corresponding genome annotation file Homo_sapiens.GRCh38.86.gtf.gz (ftp.ensembl.org) from Ensembl.
3) move the downloaded files 'pbmc_10k_v3_possorted_genome_bam.bam', 'pbmc_10k_v3_analysis.tar.gz' and 'Homo_sapiens.GRCh38.86.gtf.gz' to the folder 'scDAPA' and unzip 'pbmc_10k_v3_analysis.tar.gz' and 'Homo_sapiens.GRCh38.86.gtf.gz'.
1) extract valid mapping records (~2 hrs),
$ ./extractReads.sh -r pbmc_10k_v3_possorted_genome_bam.bam -c ./analysis/clustering/kmeans_10_clusters/clusters.csv -o ./result
2) extract gene annotation,
$ ./extractGenes.sh -i ./Homo_sapiens.GRCh38.86.gtf -o hg38.gene.gff
3) annotate 3′ ends (~30 mins),
$ ./annotate3Ends.sh -d ./result -g hg38.gene.gff
> library(scDAPAminer)
> # creat a folder named 'stat'
> # 1. only compare two specific cell groups
> scDAPAdetect(file1='./result/1.anno',file2='./result/2.anno',type='f2f',output_dir='./stat')
>
> # 2. compare every two cell groups stored in the ./result directory
> scDAPAdetect(dir='./result',type='d',output_dir='./stat',bin_size=100,count_cutoff=20)
> gtf = import('./Homo_sapiens.GRCh38.86.gtf')
> dp = scDAPAview(files=c('./result/1.anno','./result/2.anno'),alt_names=c('cell_A','cell_B'),gtf=gtf,gene_id='ENSG00000160062',legend.position = c(0.2,0.8))
>
> # customize colour theme
> library(ggsci)
> dp + scale_colour_aaas()
>
> # customize legend title
> dp + labs(colour = "Cell type")
>
> # customize legend position
> dp + theme(legend.position = c(0.6, 0.9))
>
> # customize simultaneuouly
> dp + scale_colour_aaas() + labs(colour = "Cell type") + theme(legend.position = c(0.6, 0.9))

5. Inputs of scDAPA
The main inputs of scDAPA including "short reads mapping result" in BAM/SAM format and "cell classification result" in 2 columns comma-separated values (CSV) file.
Use the commonly used pipelines and tools, e.g. Cell Ranger, Seurat, STAR, and SC3 etc., to align reads, perform clustering.
A comma-separated values (CSV) file, the first column is cell barcodes, and the second column is cluster labels.
| Column Name | Explanation |
|---|---|
| Barcode | cell barcodes, e.g., AAACCCAAGCGCCCAT-1 |
| Cluster | cell type labels, e.g., 1 or MG0 |
6. Outputs of scDAPA
For "+" strand gene, 3′ ends is stored in the column 'end of read'; For "-" strand gene, 3′ ends is stored in the column 'start of read'.
| Column Name | Explanation |
|---|---|
| seqname | The name of the sequence |
| source | The program that generated this feature |
| feature | The name of this type of feature |
| start | The starting position of the feature in the sequence |
| end | The ending position of the feature |
| score | A score between 0 and 1000 |
| strand | Valid entries include "+", "-", or "." |
| frame | If the feature is not a coding exon, the value should be "." |
| gene | Gene ID and name |
| start of read | The starting positions of reads annoted to this gene, separated by comma |
| end of read | The ending positions of reads annoted to this gene, separated by comma |
Users can use the index SDD (e.g., >=0.2)and p.adjust (e.g., <0.05) to select out candidate genes with APA dynamics.
| Column Name | Explanation |
|---|---|
| chr | Name of the chromosome/scaffold |
| gene | Gene ID and name |
| meanlen1 | Mean length of 3′ ends to gene's start site in cell group 1 |
| meanlen2 | Mean length of 3′ ends to gene's start site in cell group 2 |
| SDD | Site distribution difference SDD∈[0,1] |
| p.value | Statistical test p values |
| p.adjust | Adjusted p values |
Users can use relevant functions of ggplot2 package to customize the output.
Examples of how to customize the plot could be found at section 4. Demo of running scDAPA.