scDAPA Wiki

Detection and visualization of dynamic alternative polyadenylation

Brought to you by: yec

User Manual

scDAPA User Manual

1. What is scDAPA?
2. Installation of scDAPA
3. How to run scDAPA?
3.1 Extraction and annotation of 3′ ends
3.2 Detection of dynamic APA
3.3 Visualization of dynamic APA
4. Demo of running scDAPA
5. Inputs of scDAPA
6. Outputs of scDAPA

1. What is scDAPA?

scDAPA is a tool for identifying and visualizing dynamic alternative polyadenylation (APA) from scRNA-seq data. If you have any questions or comments, please contact Dr. Congting Ye (yec@xmu.edu.cn). You can also report a bug as a Ticket request, or start a topic on the Discussion page of this website.

Ye C, Zhou Q, Wu X, Yu C, Ji G, Saban D. R., Li Q. Q. (2020) scDAPA: detection and visualization of dynamic alternative polyadenylation from single cell RNA-seq data. Bioinformatics 36(4): 1262–1264.

2. Installation of scDAPA

[1]. Install dependencies of scDAPA

Install software: GNU Awk, SAMtools, bedtools, and R.
Install R packages: tools (ships with base R), stringr, rtracklayer, ggplot2, ggbio. For R >= 3.5.0:

> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> if (!require("rtracklayer")) BiocManager::install("rtracklayer")
> if (!require("ggplot2")) install.packages("ggplot2")
> if (!require("ggbio")) BiocManager::install("ggbio")
> if (!require("stringr")) install.packages("stringr")
> if (!require("tools")) install.packages("tools")

For R < 3.5.0:

> source("https://bioconductor.org/biocLite.R")
> if (!require("rtracklayer")) BiocInstaller::biocLite("rtracklayer")
> if (!require("ggplot2")) install.packages("ggplot2")
> if (!require("ggbio"))  BiocInstaller::biocLite("ggbio")
> if (!require("stringr")) install.packages("stringr")
> if (!require("tools")) install.packages("tools")
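Before continuing, it can help to confirm that the command-line dependencies are on the PATH. The following is a minimal shell sketch, not part of scDAPA: the function name `check_deps` is hypothetical, and the executable names (`gawk`, `samtools`, `bedtools`, `Rscript`) are assumed to be the usual ones for the tools listed above.

```shell
# check_deps CMD...: print the name of every command that cannot be found on PATH.
# Returns 0 if all commands are available, 1 otherwise.
# (Hypothetical helper for illustration; not shipped with scDAPA.)
check_deps() {
    deps_missing=0
    for dep in "$@"; do
        if ! command -v "$dep" >/dev/null 2>&1; then
            echo "missing: $dep"
            deps_missing=1
        fi
    done
    return "$deps_missing"
}

# Executable names are assumptions; adjust them if your system uses different ones.
check_deps gawk samtools bedtools Rscript || echo "Install the missing tools before running scDAPA."
```

The R packages are not covered by this check; the `require()` calls above already report whether each one can be loaded.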

[2]. Download and unzip our package - 'scDAPA.20210901.zip'

After you download and unzip the scDAPA package, you will see a folder named scDAPA, which contains 3 Shell scripts and 1 R package ('scDAPAminer_1.1.tar.gz').

[3]. Add executable permissions

$ chmod +x extractReads.sh extractGenes.sh annotate3Ends.sh

[4]. Install R package 'scDAPAminer_1.1.tar.gz'

Make sure that R is installed. After starting R, change the working directory to the path where 'scDAPAminer_1.1.tar.gz' is located (e.g., './scDAPA'), then type the following command in the R console:

> install.packages("scDAPAminer_1.1.tar.gz",repos = NULL, type = "source")

3. How to run scDAPA?

Running scDAPA involves 3 major steps: (1) extraction and annotation of 3′ ends from scRNA-seq data; (2) detection of dynamic APA; and (3) visualization of dynamic APA. Step (1) runs only on Linux; steps (2) and (3) run on Linux, Windows, or Mac OS.

3.1 Extraction and annotation of 3′ ends

3.1.1 extractReads.sh

Extract valid mapping records from a bam/sam file.

Print usage

extractReads.sh -h

Parameters:

-r|-read    a file of short-read mapping results in bam/sam format.
-c|-cluster a csv file of cell clustering results; the first column is the cell barcode and the second column is the cluster label.
-o|-output  an output directory; if not set, files of extracted reads are stored in the current working path.

Example:

$ ./extractReads.sh -r pbmc_10k_v3_possorted_genome_bam.bam -c ./analysis/clustering/kmeans_10_clusters/clusters.csv -o ./result

3.1.2 extractGenes.sh

Extract gene annotation from a gff/gtf file.

Print usage

extractGenes.sh -h

Parameters:

-i|-input   a genome annotation file in gff/gtf format.
-o|-output  a file to store the gene annotations.

Example:

$ ./extractGenes.sh -i ./Homo_sapiens.GRCh38.86.gtf -o hg38.gene.gff

3.1.3 annotate3Ends.sh

Annotate valid mapping reads from one or more sam files.

Print usage

annotate3Ends.sh -h

Parameters:

-f|-file    a single sam file generated by extractReads.sh.
-d|-dir     a directory containing the sam files generated by extractReads.sh.
-g|-gene    a file of gene annotation extracted by extractGenes.sh.

Example:

1) Annotate a single file,

$ ./annotate3Ends.sh -f celltype_a.sam -g hg38.gene.gff

2) Annotate multiple files in a directory,

$ ./annotate3Ends.sh -d ./result -g hg38.gene.gff

3.2 scDAPAdetect

Identify genes with dynamic APA usage using scRNA-seq data.

Usage

1) type='f2f', compare two cell groups stored in two different files,

scDAPAdetect(file1='cell_A.anno',file2='cell_B.anno',type='f2f',output_dir='./',bin_size=100,count_cutoff=20)

2) type='d2d', compare the same cell groups stored in two different directories,

scDAPAdetect(dir1='./control',dir2='./treatment',type='d2d',output_dir='./stat',bin_size=100,count_cutoff=20)

3) type='d', compare every two cell groups stored in one directory,

scDAPAdetect(dir='./anno_result',type='d',output_dir='./stat',bin_size=100,count_cutoff=20)

Arguments

file1         -  string, an input file generated by 'annotate3Ends.sh', set when type='f2f'. 
file2         -  string, an input file generated by 'annotate3Ends.sh', set when type='f2f'.
dir1          -  string, a directory of files generated by 'annotate3Ends.sh', set when type='d2d'. 
dir2          -  string, a directory of files generated by 'annotate3Ends.sh', set when type='d2d'.
dir           -  string, a directory of files generated by 'annotate3Ends.sh', set when type='d'.
type          -  string, indicating which type of input(s) is/are used. 
output_dir    -  string, a directory to store the output(s), default is the current working path.
bin_size      -  number, size of bin/window used to quantify the APA usage, default is 100.
count_cutoff  -  number, minimum number of 3' ends required for each gene per cell type, default is 20.
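
To make the roles of bin_size and count_cutoff more concrete, here is a small shell/awk sketch of fixed-width binning of 3′ end coordinates. It is only an illustration under stated assumptions and is not scDAPA's implementation: the function name `bin_counts` is hypothetical, bins are assumed to start at multiples of the bin size, and the cutoff is applied to a single list of positions.

```shell
# bin_counts POSITIONS BIN_SIZE COUNT_CUTOFF
# POSITIONS is a comma-separated list of 3' end coordinates (made-up values below).
# Prints "bin_start<TAB>count" per occupied bin, or "skipped" when there are
# fewer positions than COUNT_CUTOFF.
# (Hypothetical helper; the binning rule is an assumption, not scDAPA's code.)
bin_counts() {
    echo "$1" | awk -v size="$2" -v cutoff="$3" '{
        n = split($0, pos, ",")
        if (n < cutoff) { print "skipped"; exit }
        for (i = 1; i <= n; i++) count[int(pos[i] / size) * size]++
        for (b in count) print b "\t" count[b]
    }' | sort -n
}

# Five 3 prime ends, bins of 100 nt, at least 3 ends required.
bin_counts "4210,4250,4290,4420,4480" 100 3
```

A larger bin size merges nearby 3′ ends into the same window, while a larger cutoff drops genes with few supporting reads.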

3.3 scDAPAview

View the distribution of 3' ends of a gene from scRNA-seq data.

Usage

scDAPAview(files=c('MG0.anno','sMG3.anno'),alt_names=c('MG0','sMG3'),gtf=import('Mus_musculus.GRCm38.84.gtf'),gene_id='ENSMUSG00000073490')

Arguments

files        -  string vector, a string vector of names of files generated by 'annotate3Ends.sh'. 
alt_names    -  string vector, a string vector of alternative names to be shown, instead of corresponding file names.
dir          -  string, a directory of files generated by 'annotate3Ends.sh'.
gtf          -  GRanges object, gene model info. It can be generated using the function 'import' of the R package 'rtracklayer', e.g. gtf = import('Mus_musculus.GRCm38.84.gtf').
gene_id      -  string, indicates which gene to be visualized. E.g. gene_id = 'ENSMUSG00000073490'.
adjust       -  A multiplicative bandwidth adjustment of the 3' ends density plot. This makes it possible to adjust the bandwidth while still using the bandwidth estimator. For example, adjust = 1/2 means use half of the default bandwidth. Default, adjust = 1/5.
heights      -  A numeric vector of length=2 to indicate the ratio of each track (isoform track and 3' ends track). Default, heights = c(0.5,0.5).
legend.position - The position of legends ("none", "left", "right", "bottom", "top", or two-element numeric vector). Default, legend.position = c(0.8,0.8).
coord.lim    -  Two numeric values, specifying the left and right limit of X-axis, e.g. coord.lim = c(1000,2000).

4. Demo of running scDAPA

Steps of using scDAPA to process test data from 10x Genomics:

Preparation of necessary data

1) Download the public dataset files 'Genome-aligned BAM' and 'Clustering analysis' from 10x Genomics.
2) Download the corresponding genome annotation file Homo_sapiens.GRCh38.86.gtf.gz from Ensembl.
3) Move the downloaded files 'pbmc_10k_v3_possorted_genome_bam.bam', 'pbmc_10k_v3_analysis.tar.gz' and 'Homo_sapiens.GRCh38.86.gtf.gz' to the folder 'scDAPA', then unpack 'pbmc_10k_v3_analysis.tar.gz' and 'Homo_sapiens.GRCh38.86.gtf.gz'.

1st step: Extraction and annotation of 3′ ends from scRNA-seq data

1) extract valid mapping records (~2 hrs),

$ ./extractReads.sh -r pbmc_10k_v3_possorted_genome_bam.bam -c ./analysis/clustering/kmeans_10_clusters/clusters.csv -o ./result

2) extract gene annotation,

$ ./extractGenes.sh -i ./Homo_sapiens.GRCh38.86.gtf -o hg38.gene.gff

3) annotate 3′ ends (~30 mins),

$ ./annotate3Ends.sh -d ./result -g hg38.gene.gff

2nd step: Detection of dynamic APA

> library(scDAPAminer)
> # create a folder named 'stat'
> dir.create('./stat', showWarnings = FALSE)
> # 1. only compare two specific cell groups
> scDAPAdetect(file1='./result/1.anno',file2='./result/2.anno',type='f2f',output_dir='./stat')
> 
> # 2. compare every two cell groups stored in the ./result directory
> scDAPAdetect(dir='./result',type='d',output_dir='./stat',bin_size=100,count_cutoff=20)

3rd step: Visualization of dynamic APA

> library(rtracklayer)  # provides import()
> gtf = import('./Homo_sapiens.GRCh38.86.gtf')
> dp = scDAPAview(files=c('./result/1.anno','./result/2.anno'),alt_names=c('cell_A','cell_B'),gtf=gtf,gene_id='ENSG00000160062',legend.position = c(0.2,0.8))
> 
> # customize colour theme (requires the ggsci package)
> library(ggsci)
> dp + scale_colour_aaas()
> 
> # customize legend title
> dp + labs(colour = "Cell type")
> 
> # customize legend position
> dp + theme(legend.position = c(0.6, 0.9))
> 
> # customize simultaneously
> dp + scale_colour_aaas() + labs(colour = "Cell type") + theme(legend.position = c(0.6, 0.9))

5. Inputs of scDAPA

The main inputs of scDAPA are the short-read mapping result in BAM/SAM format and the cell classification result as a two-column comma-separated values (CSV) file.

[1] Generate short reads mapping result and cell classification result

Use common pipelines and tools, e.g. Cell Ranger, Seurat, STAR, and SC3, to align reads and perform clustering.

[2] Format of cell classification result

A comma-separated values (CSV) file; the first column holds cell barcodes and the second column holds cluster labels.

Column Name	Explanation
Barcode	cell barcodes, e.g., AAACCCAAGCGCCCAT-1
Cluster	cell type labels, e.g., 1 or MG0
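
A quick sanity check of the cluster file before running extractReads.sh can catch malformed records early. The sketch below is not part of scDAPA: the function names are hypothetical, the barcodes are made up, and the header line "Barcode,Cluster" is assumed to be optional (it is skipped when counting if present).

```shell
# Create a tiny example file (made-up barcodes, for illustration only).
tmp_dir=$(mktemp -d)
cat > "$tmp_dir/clusters.csv" <<'EOF'
Barcode,Cluster
AAACCCAAGCGCCCAT-1,1
AAACCCACAGAGTTGG-1,2
AAACCCACAGGTATGG-1,1
EOF

# check_clusters_csv FILE: report lines that do not have exactly two non-empty
# comma-separated fields; exit status is non-zero if any such line exists.
check_clusters_csv() {
    awk -F',' 'NF != 2 || $1 == "" || $2 == "" { printf "line %d: bad record\n", NR; bad = 1 }
               END { exit bad + 0 }' "$1"
}

# count_cells_per_cluster FILE: number of barcodes per cluster label,
# skipping a "Barcode,..." header line if one is present.
count_cells_per_cluster() {
    awk -F',' 'NR == 1 && $1 == "Barcode" { next }
               { n[$2]++ }
               END { for (k in n) print k "\t" n[k] }' "$1" | sort
}

check_clusters_csv "$tmp_dir/clusters.csv" && count_cells_per_cluster "$tmp_dir/clusters.csv"
```

For the example file this prints one line per cluster label with its cell count, which makes very small clusters easy to spot before the longer extraction step.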

6. Outputs of scDAPA

[1] Output of annotate3Ends.sh is an 11-column tab-separated values file (.anno):

For "+" strand genes, 3′ ends are stored in the column 'end of read'; for "-" strand genes, 3′ ends are stored in the column 'start of read'.

Column Name	Explanation
seqname	The name of the sequence
source	The program that generated this feature
feature	The name of this type of feature
start	The starting position of the feature in the sequence
end	The ending position of the feature
score	A score between 0 and 1000
strand	Valid entries include "+", "-", or "."
frame	If the feature is not a coding exon, the value should be "."
gene	Gene ID and name
start of read	The starting positions of reads annotated to this gene, separated by commas
end of read	The ending positions of reads annotated to this gene, separated by commas
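
The strand rule described above can be sketched with awk: take column 11 ('end of read') for "+" strand genes and column 10 ('start of read') for "-" strand genes. This is a minimal illustration, not part of scDAPA. The function name `three_prime_ends` is hypothetical, the two example records are made up, and the position lists are assumed to have no trailing comma.

```shell
# Build a two-record example .anno file (made-up values, 11 tab-separated columns).
tmp_anno=$(mktemp)
printf 'chr1\tsrc\tgene\t1000\t5000\t.\t+\t.\tGENE_A\t4100,4300\t4200,4400\n' > "$tmp_anno"
printf 'chr1\tsrc\tgene\t8000\t9000\t.\t-\t.\tGENE_B\t8010,8020,8050\t8110,8120,8150\n' >> "$tmp_anno"

# three_prime_ends FILE: for each record print gene, strand, number of 3' ends,
# and the comma-separated 3' end positions. Lines whose strand is neither
# "+" nor "-" (for example a header line, if any) are ignored.
# (Hypothetical helper for illustration.)
three_prime_ends() {
    awk -F'\t' -v OFS='\t' '
        $7 == "+" { ends = $11 }
        $7 == "-" { ends = $10 }
        $7 == "+" || $7 == "-" {
            n = split(ends, pos, ",")
            print $9, $7, n, ends
        }' "$1"
}

three_prime_ends "$tmp_anno"
```

The third output column gives the number of 3′ ends per gene, which is the quantity that the count_cutoff argument of scDAPAdetect refers to.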

[2] Output of scDAPAdetect is a 7-column tab-separated values file (.stat):

Users can filter on the SDD index (e.g., >= 0.2) and p.adjust (e.g., < 0.05) to select candidate genes with APA dynamics.

Column Name	Explanation
chr	Name of the chromosome/scaffold
gene	Gene ID and name
meanlen1	Mean length of 3′ ends to gene's start site in cell group 1
meanlen2	Mean length of 3′ ends to gene's start site in cell group 2
SDD	Site distribution difference SDD∈[0,1]
p.value	Statistical test p values
p.adjust	Adjusted p values
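
The suggested thresholds can be applied to a .stat file with a short awk filter. This is a sketch under assumptions rather than an scDAPA utility: the function name `filter_dapa` is hypothetical, the example rows are made up, and the file may or may not start with a header line (non-numeric SDD or p.adjust values, such as a header or NA, are skipped).

```shell
# Build a small example .stat file (made-up values, 7 tab-separated columns).
tmp_stat=$(mktemp)
printf 'chr\tgene\tmeanlen1\tmeanlen2\tSDD\tp.value\tp.adjust\n' > "$tmp_stat"
printf 'chr1\tGENE_A\t1500.2\t2100.7\t0.35\t1e-06\t2e-05\n' >> "$tmp_stat"
printf 'chr1\tGENE_B\t900.0\t910.5\t0.05\t0.40\t0.62\n' >> "$tmp_stat"
printf 'chr2\tGENE_C\t3000.1\t2500.9\t0.25\t0.03\t0.08\n' >> "$tmp_stat"

# filter_dapa FILE [MIN_SDD] [MAX_PADJ]
# Keep rows with SDD (column 5) >= MIN_SDD and p.adjust (column 7) < MAX_PADJ.
# Defaults follow the example thresholds in the text: 0.2 and 0.05.
# (Hypothetical helper for illustration.)
filter_dapa() {
    awk -F'\t' -v min_sdd="${2:-0.2}" -v max_padj="${3:-0.05}" '
        $5 ~ /^[0-9.eE+-]+$/ && $7 ~ /^[0-9.eE+-]+$/ &&
        $5 + 0 >= min_sdd + 0 && $7 + 0 < max_padj + 0' "$1"
}

filter_dapa "$tmp_stat"
```

With the default thresholds only the GENE_A row passes; calling `filter_dapa "$tmp_stat" 0.2 0.1` relaxes the p.adjust cutoff and also keeps GENE_C.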

[3] Output of scDAPAview is a ggplot2 object

Users can use the relevant functions of the ggplot2 package to customize the output.
Examples of how to customize the plot can be found in section 4, Demo of running scDAPA.

Project Members:

Congting Ye (admin)
