scDAPA User Manual

Table of Contents

  1. What is scDAPA?
  2. Installation of scDAPA
  3. How to run scDAPA?
    3.1 Extraction and annotation of 3′ ends
    3.2 Detection of dynamic APA
    3.3 Visualization of dynamic APA
  4. Demo of running scDAPA
  5. Inputs of scDAPA
  6. Outputs of scDAPA


1. What is scDAPA?


scDAPA is a tool for identifying and visualizing dynamic alternative polyadenylation (APA) from scRNA-seq data. If you have any questions or comments, please contact Dr. Congting Ye (yec@xmu.edu.cn). You can also report a bug as a Ticket request, or start a topic in the Discussion page of this website.

Ye C, Zhou Q, Wu X, Yu C, Ji G, Saban D. R., Li Q. Q. (2020) scDAPA: detection and visualization of dynamic alternative polyadenylation from single cell RNA-seq data. Bioinformatics 36(4): 1262–1264.



2. Installation of scDAPA


[1]. Install dependencies of scDAPA

Install software: GNU Awk, SAMtools, bedtools, and R.
Install R packages: tools, stringr, rtracklayer, ggplot2, ggbio. If R >= 3.5.0,

> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> if (!require("rtracklayer")) BiocManager::install("rtracklayer")
> if (!require("ggplot2")) install.packages("ggplot2")
> if (!require("ggbio")) BiocManager::install("ggbio")
> if (!require("stringr")) install.packages("stringr")
> if (!require("tools")) install.packages("tools")

If R < 3.5.0,

> source("https://bioconductor.org/biocLite.R")
> if (!require("rtracklayer")) BiocInstaller::biocLite("rtracklayer")
> if (!require("ggplot2")) install.packages("ggplot2")
> if (!require("ggbio"))  BiocInstaller::biocLite("ggbio")
> if (!require("stringr")) install.packages("stringr")
> if (!require("tools")) install.packages("tools")
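
Before continuing, it can help to confirm that the command-line dependencies are reachable from your shell. The following is a minimal sketch and not part of scDAPA: `check_deps` is a hypothetical helper, and the command names `awk`, `samtools`, `bedtools`, and `Rscript` are assumptions about how these tools are installed on your system.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given commands are missing from PATH.
# Prints one "missing: NAME" line per absent command and returns 1 if any is missing.
check_deps() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            status=1
        fi
    done
    return "$status"
}

# Assumed command names for the dependencies listed above.
check_deps awk samtools bedtools Rscript || echo "Install the missing tools before running scDAPA."
```

Note that `awk` may resolve to an implementation other than GNU Awk on some systems; `awk --version` shows which one is in use.
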
[2]. Download and unzip our package - 'scDAPA.20210901.zip'

After you download and unzip the scDAPA package, you will see a folder named scDAPA, which contains 3 shell scripts and 1 R package ('scDAPAminer_1.1.tar.gz').

[3]. Add executable permissions
$ chmod +x extractReads.sh extractGenes.sh annotate3Ends.sh
[4]. Install R package 'scDAPAminer_1.1.tar.gz'

Make sure R is installed. After starting R, change R's working directory to the folder where 'scDAPAminer_1.1.tar.gz' is located (e.g., './scDAPA'), then type the following command in the R console:

> install.packages("scDAPAminer_1.1.tar.gz",repos = NULL, type = "source")



3. How to run scDAPA?


Running scDAPA involves 3 major steps: (1) extraction and annotation of 3′ ends from scRNA-seq data; (2) detection of dynamic APA; and (3) visualization of dynamic APA. Step (1) runs only on Linux; steps (2) and (3) run on Linux, Windows, or macOS.

3.1 Extraction and annotation of 3′ ends

3.1.1 extractReads.sh

Extract valid mapping records from a bam/sam file.

extractReads.sh -h
Parameters:
-r|-read    a file of short reads mapping result in bam/sam format.
-c|-cluster a csv format file of cell clustering result, the first column is cell barcode, and the second column is cluster label.
-o|-output  an output directory; if not set, files of extracted reads will be stored in the current working path.
Example:
$ ./extractReads.sh -r pbmc_10k_v3_possorted_genome_bam.bam -c ./analysis/clustering/kmeans_10_clusters/clusters.csv -o ./result
3.1.2 extractGenes.sh

Extract gene annotation from a gff/gtf file.

extractGenes.sh -h
Parameters:
-i|-input   a genome annotation file in gff/gtf format.
-o|-output  a file to store the gene annotations.
Example:
$ ./extractGenes.sh -i ./Homo_sapiens.GRCh38.86.gtf -o hg38.gene.gff
3.1.3 annotate3Ends.sh

Annotate valid mapping reads from one or more sam files.

annotate3Ends.sh -h
Parameters:
-f|-file    a single sam file generated by extractReads.sh.
-d|-dir     a directory containing the sam files generated by extractReads.sh.
-g|-gene    a file of gene annotation extracted by extractGenes.sh.
Example:

1) Annotate a single file,

$ ./annotate3Ends.sh -f celltype_a.sam -g hg38.gene.gff

2) Annotate multiple files in a directory,

$ ./annotate3Ends.sh -d ./result -g hg38.gene.gff


3.2 scDAPAdetect

Identify genes with dynamic APA usage using scRNA-seq data.

Usage

1) type='f2f', compare two cell groups stored in two different files,

scDAPAdetect(file1='cell_A.anno',file2='cell_B.anno',type='f2f',output_dir='./',bin_size=100,count_cutoff=20)

2) type='d2d', compare the same cell groups stored in two different directories (e.g., control vs. treatment),

scDAPAdetect(dir1='./control',dir2='./treatment',type='d2d',output_dir='./stat',bin_size=100,count_cutoff=20)

3) type='d', compare every two cell groups stored in one directory,

scDAPAdetect(dir='./anno_result',type='d',output_dir='./stat',bin_size=100,count_cutoff=20)
Arguments
file1         -  string, an input file generated by 'annotate3Ends.sh', set when type='f2f'. 
file2         -  string, an input file generated by 'annotate3Ends.sh', set when type='f2f'.
dir1          -  string, a directory of files generated by 'annotate3Ends.sh', set when type='d2d'. 
dir2          -  string, a directory of files generated by 'annotate3Ends.sh', set when type='d2d'.
dir           -  string, a directory of files generated by 'annotate3Ends.sh', set when type='d'.
type          -  string, indicating which type of input is used: 'f2f', 'd2d', or 'd'.
output_dir    -  string, a directory to store the output(s), default is the current working path.
bin_size      -  number, size of bin/window used to quantify the APA usage, default is 100.
count_cutoff  -  number, minimum number of 3' ends required for each gene per cell type, default is 20.
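
To get a feel for what count_cutoff means for your data, a rough pre-check can list the genes in one .anno file that have at least a given number of annotated reads. This is only a sketch under stated assumptions: it relies on the 11-column .anno layout described in Section 6 (gene in column 9, one comma-separated entry per read in column 11), `genes_passing_cutoff` is a hypothetical helper, and the filtering performed inside scDAPAdetect may differ.

```shell
#!/bin/sh
# Hypothetical helper: print "gene<TAB>count" for genes with at least CUTOFF
# annotated reads in an .anno file (assumed layout: see Section 6).
# Usage: genes_passing_cutoff FILE.anno [CUTOFF]   (CUTOFF defaults to 20)
genes_passing_cutoff() {
    anno=$1
    cutoff=${2:-20}
    awk -F '\t' -v cutoff="$cutoff" '
        # Only consider data rows: 11 columns and a numeric start coordinate.
        NF >= 11 && $4 ~ /^[0-9]+$/ {
            n = split($11, ends, ",")           # one entry per annotated read
            if (n > 0 && ends[n] == "") n--     # tolerate a trailing comma
            if (n >= cutoff) print $9 "\t" n
        }' "$anno"
}
```

For example, `genes_passing_cutoff ./result/1.anno 20` would list the genes in that file with 20 or more 3' ends.
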


3.3 scDAPAview

View the distribution of 3' ends of a gene from scRNA-seq data.

Usage
scDAPAview(files=c('MG0.anno','sMG3.anno'),alt_names=c('MG0','sMG3'),gtf=import('Mus_musculus.GRCm38.84.gtf'),gene_id='ENSMUSG00000073490')
Arguments
files        -  string vector, a string vector of names of files generated by 'annotate3Ends.sh'. 
alt_names    -  string vector, a string vector of alternative names to be shown, instead of corresponding file names.
dir          -  string, a directory of files generated by 'annotate3Ends.sh'.
gtf          -  GRanges object, gene model info. It can be generated using the function 'import' of the R package 'rtracklayer', e.g. gtf = import('Mus_musculus.GRCm38.84.gtf').
gene_id      -  string, indicates which gene to be visualized. E.g. gene_id = 'ENSMUSG00000073490'.
adjust       -  A multiplicative bandwidth adjustment of the 3' ends density plot. This makes it possible to adjust the bandwidth while still using the bandwidth estimator. For example, adjust = 1/2 means use half of the default bandwidth. Default, adjust = 1/5.
heights      -  A numeric vector of length=2 to indicate the ratio of each track (isoform track and 3' ends track). Default, heights = c(0.5,0.5).
legend.position - The position of legends ("none", "left", "right", "bottom", "top", or two-element numeric vector). Default, legend.position = c(0.8,0.8).
coord.lim    -  Two numeric values, specifying the left and right limit of X-axis, e.g. coord.lim = c(1000,2000).



4. Demo of running scDAPA


Steps of using scDAPA to process test data from 10x Genomics:

  • Preparation of necessary data

 1) download public dataset Genome-aligned BAM and Clustering analysis from 10x Genomics.
 2) download corresponding genome annotation file Homo_sapiens.GRCh38.86.gtf.gz from Ensembl.
 3) move the downloaded files 'pbmc_10k_v3_possorted_genome_bam.bam', 'pbmc_10k_v3_analysis.tar.gz' and 'Homo_sapiens.GRCh38.86.gtf.gz' to the folder 'scDAPA' and unzip 'pbmc_10k_v3_analysis.tar.gz' and 'Homo_sapiens.GRCh38.86.gtf.gz'.

  • 1st step: Extraction and annotation of 3′ ends from scRNA-seq data

 1) extract valid mapping records (~2 hrs),

$ ./extractReads.sh -r pbmc_10k_v3_possorted_genome_bam.bam -c ./analysis/clustering/kmeans_10_clusters/clusters.csv -o ./result

 2) extract gene annotation,

$ ./extractGenes.sh -i ./Homo_sapiens.GRCh38.86.gtf -o hg38.gene.gff

 3) annotate 3′ ends (~30 mins),

$ ./annotate3Ends.sh -d ./result -g hg38.gene.gff
  • 2nd step: Detection of dynamic APA
> library(scDAPAminer)
> # create a folder named 'stat' to hold the results
> dir.create('./stat', showWarnings = FALSE)
> # 1. only compare two specific cell groups
> scDAPAdetect(file1='./result/1.anno',file2='./result/2.anno',type='f2f',output_dir='./stat')
> 
> # 2. compare every two cell groups stored in the ./result directory
> scDAPAdetect(dir='./result',type='d',output_dir='./stat',bin_size=100,count_cutoff=20)
  • 3rd step: Visualization of dynamic APA
> library(rtracklayer)  # provides import()
> gtf = import('./Homo_sapiens.GRCh38.86.gtf')
> dp = scDAPAview(files=c('./result/1.anno','./result/2.anno'),alt_names=c('cell_A','cell_B'),gtf=gtf,gene_id='ENSG00000160062',legend.position = c(0.2,0.8))
> 
> # customize colour theme (requires the 'ggsci' package: install.packages("ggsci"))
> library(ggsci)
> dp + scale_colour_aaas()
> 
> # customize legend title
> dp + labs(colour = "Cell type")
> 
> # customize legend position
> dp + theme(legend.position = c(0.6, 0.9))
> 
> # customize all of the above simultaneously
> dp + scale_colour_aaas() + labs(colour = "Cell type") + theme(legend.position = c(0.6, 0.9))

Figure 1



5. Inputs of scDAPA


The main inputs of scDAPA are the short-read mapping result in BAM/SAM format and the cell classification result as a 2-column comma-separated values (CSV) file.

[1] Generate short reads mapping result and cell classification result

Use commonly used pipelines and tools, e.g. Cell Ranger, Seurat, STAR, and SC3, to align reads and perform clustering.


[2] Format of cell classification result

A comma-separated values (CSV) file, the first column is cell barcodes, and the second column is cluster labels.

Column Name    Explanation
Barcode        cell barcodes, e.g., AAACCCAAGCGCCCAT-1
Cluster        cell type labels, e.g., 1 or MG0
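
Since extractReads.sh depends on this 2-column layout, a quick sanity check of the CSV before a long run may save time. The sketch below is not part of scDAPA; `check_clusters_csv` is a hypothetical helper that only verifies that every line has exactly two non-empty comma-separated fields, and it treats a header line such as `Barcode,Cluster` like any other row.

```shell
#!/bin/sh
# Hypothetical helper: flag lines of a clustering CSV that do not consist of
# exactly two non-empty comma-separated fields (barcode, cluster label).
# Returns a non-zero status if any such line is found.
check_clusters_csv() {
    awk -F ',' '
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected barcode,cluster but got: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }' "$1"
}
```

For example, `check_clusters_csv ./analysis/clustering/kmeans_10_clusters/clusters.csv` prints nothing and returns 0 when every line has the expected shape.
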



6. Outputs of scDAPA


[1] Output of annotate3Ends.sh is an 11-column tab-separated values file (.anno):

For genes on the "+" strand, 3′ ends are stored in the column 'end of read'; for genes on the "-" strand, 3′ ends are stored in the column 'start of read'.

Column Name      Explanation
seqname          The name of the sequence
source           The program that generated this feature
feature          The name of this type of feature
start            The starting position of the feature in the sequence
end              The ending position of the feature
score            A score between 0 and 1000
strand           Valid entries include "+", "-", or "."
frame            If the feature is not a coding exon, the value should be "."
gene             Gene ID and name
start of read    The starting positions of reads annotated to this gene, separated by commas
end of read      The ending positions of reads annotated to this gene, separated by commas
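
The strand rule above can be sketched as a small awk filter that lists one 3′ end per line. It assumes the column order shown in the table (strand in column 7, gene in column 9, 'start of read' in column 10, 'end of read' in column 11); `three_prime_ends` is a hypothetical helper and not part of scDAPA.

```shell
#!/bin/sh
# Hypothetical helper: print "gene<TAB>strand<TAB>position" for every 3' end
# in an .anno file, picking the read coordinate according to the gene strand.
three_prime_ends() {
    awk -F '\t' '
        $7 == "+" { n = split($11, pos, ",") }   # "+" strand: end of read
        $7 == "-" { n = split($10, pos, ",") }   # "-" strand: start of read
        $7 == "+" || $7 == "-" {
            for (i = 1; i <= n; i++)
                if (pos[i] != "") print $9 "\t" $7 "\t" pos[i]
        }' "$1"
}
```

Rows whose strand is neither "+" nor "-" (including any header line) are skipped.
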


[2] Output of scDAPAdetect is a 7-column tab-separated values file (.stat):

Users can apply thresholds on SDD (e.g., >= 0.2) and p.adjust (e.g., < 0.05) to select candidate genes with APA dynamics.

Column Name    Explanation
chr            Name of the chromosome/scaffold
gene           Gene ID and name
meanlen1       Mean length of 3′ ends to gene's start site in cell group 1
meanlen2       Mean length of 3′ ends to gene's start site in cell group 2
SDD            Site distribution difference, SDD∈[0,1]
p.value        Statistical test p values
p.adjust       Adjusted p values
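
The selection by SDD and p.adjust described above can be sketched with awk. This is a minimal example under assumptions: it takes the column order from the table (SDD in column 5, p.adjust in column 7), skips any row where those two fields are not numeric (such as a header line or NA values), and uses a hypothetical helper name, `filter_dynamic_apa`.

```shell
#!/bin/sh
# Hypothetical helper: keep rows of a .stat file with SDD >= MIN_SDD and
# p.adjust < MAX_PADJ (assumed columns: 5 = SDD, 7 = p.adjust).
# Usage: filter_dynamic_apa FILE.stat [MIN_SDD] [MAX_PADJ]
filter_dynamic_apa() {
    stat=$1
    min_sdd=${2:-0.2}
    max_padj=${3:-0.05}
    awk -F '\t' -v min_sdd="$min_sdd" -v max_padj="$max_padj" '
        # Rows with non-numeric SDD or p.adjust (header, NA) are skipped.
        $5 ~ /^[0-9.eE+-]+$/ && $7 ~ /^[0-9.eE+-]+$/ &&
        $5 + 0 >= min_sdd + 0 && $7 + 0 < max_padj + 0' "$stat"
}
```

The thresholds default to the example values given above and can be overridden, e.g. `filter_dynamic_apa my_comparison.stat 0.3 0.01`, where `my_comparison.stat` is a placeholder file name.
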


[3] Output of scDAPAview is a ggplot2 object

Users can use relevant functions of ggplot2 package to customize the output.
Examples of how to customize the plot can be found in Section 4, Demo of running scDAPA.

