PhaseDel
PhaseDel is a Java-based variant caller for detecting somatic deletions in high-coverage (~30x) single-cell whole-genome sequencing (scWGS) data. It is specialized for discriminating genuine somatic focal deletions (several bases to kilobases in length) from the abundant SV-like artifacts that inevitably arise during single-cell whole-genome amplification. PhaseDel detects true deletions using the linkage information between deletion breakpoints and nearby germline heterozygous SNP sites. The important features of PhaseDel are:
- Accurate identification of somatic focal deletions at single-base-pair resolution in scWGS using phasing information
- Estimation of the genome-wide somatic deletion rate for a given cell at a controlled FDR level
- Characterization of underlying DSB repair mechanisms for identified deletion candidates
Get the most recent version of PhaseDel here: Download
Prerequisite software
Tested versions are given in parentheses where applicable; later versions are likely to work as well. These instructions are intended for running PhaseDel on human sequencing data aligned to GRCh37 (hg19).
**Java (version 1.8)**
- Java version 1.8 or higher
- HTSJDK library (included in the package)
**R packages (version 4.0.1)**
- rstan
- plyr
- tidyr
- dplyr
- ggplot2
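The prerequisites above can be checked before installation with a short preflight script. The following is only a sketch: the function names are hypothetical, it assumes `java` and `Rscript` are on the PATH, and it relies on `java -version` printing a quoted version string such as "1.8.0_292" or "11.0.2".

```shell
# Convert a Java version string to its major number: "1.8.0_292" -> 8, "11.0.2" -> 11.
java_major_version() {
    v=$1
    case "$v" in
        1.*) v=${v#1.} ;;   # legacy scheme: 1.8.x means Java 8
    esac
    echo "${v%%[!0-9]*}"    # keep only the leading digits
}

# Succeed only if the java on PATH reports major version 8 or later.
check_java() {
    # java -version writes to stderr; the version is the first quoted token.
    raw=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    major=$(java_major_version "$raw")
    [ -n "$major" ] && [ "$major" -ge 8 ]
}

# Succeed only if every R package listed above can be loaded.
check_r_packages() {
    Rscript -e 'pkgs <- c("rstan", "plyr", "tidyr", "dplyr", "ggplot2"); miss <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)]; if (length(miss)) { cat("Missing:", miss, "\n"); quit(status = 1) }'
}
```

After sourcing these definitions, `check_java && check_r_packages && echo "prerequisites OK"` reports whether both checks pass.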
Installation of PhaseDel
PhaseDel was developed with the 64-bit Java JDK 8. Running PhaseDel requires Java Runtime Environment (JRE) version 1.8.x or later. Download the most recent version of PhaseDel from the code page and extract the gzipped archive:
tar -zxvf PhaseDel-v1.x.y.tar.gz
The archive contains the following:
- PhaseDel.jar : Executable JAR file of PhaseDel
- README.md : Same as the wiki page
- lib/ : A folder containing other JAR libraries for PhaseDel
- data/ : A folder containing R codes and data files for phasing analysis and annotation
- demo/ : A folder containing scripts and data files for running the PhaseDel demo
Running PhaseDel
To run PhaseDel, run the executable JAR file as follows:
java -jar PhaseDel.jar
The -h or -? option prints the following usage message. If you see it, you are ready to run PhaseDel.
PhaseDel: Phasing-based somatic deletion caller from single-cell WGS data
Version: 1.0.0
Usage: java -jar PhaseDel.jar -m <analysis_module>
Analysis modules:
---------------------------------------------------------------------------------------------
GenerateHetSNPAndIndelList Generate heterozygous SNP list and ins./del. sets from the genotyped GATK GVCF file
MergeDel Merge deletion calls for initial candidate set (GATK and Delly calls)
LinkageAnalysis Linkage analysis to discriminate genuine deletion calls (both germline and somatic)
MakeDuplicateList (optional) Make a list of duplicated phased candidates between different individuals (possible germline/artifact list)
CallSomatic Call somatic deletion candidates from phased candidates and estimate genome-wide deletion rate
AnalyzeMechanism Analyze underlying mechanisms for selected somatic deletion candidates
Running the PhaseDel demo
The 'demo' folder includes the input data files and the commands for the five steps that produce the final annotated somatic deletion calls. Before running, download the following BAM files and place them in the demo BAM folder (demo/input/bam/). These are ~8.4 Gb BAMs for chromosome 17 from the public MDA data of single fibroblasts (Dong et al., Nat Methods 2017).
- Hunamp_bulk.chr17.bam
- Hunamp_bulk.chr17.bam.bai
- IL-11.chr17.bam
- IL-11.chr17.bam.bai
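Whether the four files named above are in place can be confirmed with a small helper. This is a sketch; the function name `missing_demo_bams` is hypothetical, and the directory is passed as an argument so it can be pointed at demo/input/bam/ from wherever the package was extracted.

```shell
# Print the name of every expected demo BAM/index file missing from the given directory.
missing_demo_bams() {
    dir=$1
    for f in Hunamp_bulk.chr17.bam Hunamp_bulk.chr17.bam.bai \
             IL-11.chr17.bam IL-11.chr17.bam.bai; do
        [ -f "$dir/$f" ] || echo "$f"
    done
}
```

Calling `missing_demo_bams demo/input/bam` prints nothing when all four files are present.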
Execute command.sh in each subdirectory under the demo folder to run all steps of PhaseDel sequentially. The commands below are the ones written in command.sh for each step. Replace /path/to/... with the correct paths for the demo data before running. Each step runs in less than a minute on a single core, except steps 3 and 4, which take about an hour and about 10 minutes, respectively.
step1_GenerateHetSNPAndIndelList (<1 min)
java -jar /path/to/PhaseDel.jar -m GenerateHetSNPAndIndelList \
-v ../input/gatk/dong_et_al.NatMethod.2017.HC_G.g.chr17.vcf.gz \
-b ../input/gatk/bulk.sampleID \
-o ./
step2_MergeDel (<1 min)
# bulk
java -jar /path/to/PhaseDel.jar -m MergeDel \
-v ../step1_GenerateHetSNPAndIndelList/hetSNP_bulk/Hunamp_bulk.bulk.het.snps.vcf \
-G ../step1_GenerateHetSNPAndIndelList/indel/Hunamp_bulk.del.call \
-D ../input/delly/Hunamp_bulk.delly_del.chr17.vcf.gz \
-d ./Hunamp_bulk.merged.del
# single cell
java -jar /path/to/PhaseDel.jar -m MergeDel \
-v ../step1_GenerateHetSNPAndIndelList/hetSNP_bulk/Hunamp_bulk.bulk.het.snps.vcf \
-G ../step1_GenerateHetSNPAndIndelList/indel/IL-11.del.call \
-D ../input/delly/IL-11.delly_del.chr17.vcf.gz \
-d ./IL-11.merged.del
step3_LinkageAnalysis (~1 hour)
REF=/path/to/hs37d5/genome.fa
# bulk
java -jar /path/to/PhaseDel.jar -m LinkageAnalysis \
-r $REF \
-b ../input/bam/Hunamp_bulk.chr17.bam \
-v ../step1_GenerateHetSNPAndIndelList/hetSNP_bulk/Hunamp_bulk.bulk.het.snps.vcf \
-d ../step2_MergeDel/Hunamp_bulk.merged.del \
-o ./
# single cell
java -jar /path/to/PhaseDel.jar -m LinkageAnalysis \
-r $REF \
-b ../input/bam/IL-11.chr17.bam \
-v ../step1_GenerateHetSNPAndIndelList/hetSNP_bulk/Hunamp_bulk.bulk.het.snps.vcf \
-d ../step2_MergeDel/IL-11.merged.del \
-o ./
step4_CallSomatic (~10 min)
REF=/path/to/hs37d5/genome.fa
RSCRIPT=/path/to/Rscript
java -jar /path/to/PhaseDel.jar -m CallSomatic \
-r $REF \
-v ../step1_GenerateHetSNPAndIndelList/hetSNP_bulk/Hunamp_bulk.bulk.het.snps.vcf \
-b ../input/bam/IL-11.chr17.bam \
-B ../input/bam/Hunamp_bulk.chr17.bam \
-p ../step3_LinkageAnalysis/IL-11.chr17.filtered.phased.del.list \
-P ../step3_LinkageAnalysis/Hunamp_bulk.chr17.filtered.phased.del.list \
-R $RSCRIPT \
-o ./
step5_AnalyzeMechanism (<1 min)
REF=/path/to/hs37d5/genome.fa
java -jar /path/to/PhaseDel.jar -m AnalyzeMechanism \
-r $REF \
-d ../step4_CallSomatic/IL-11.chr17.selected.somatic.del.candidates.cc.controlled.txt \
-i ../step1_GenerateHetSNPAndIndelList/indel/IL-11.ins.call \
-o ./
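The five steps above can also be launched from a single wrapper that visits the step directories in order. This is a sketch rather than part of the package: the function name `run_demo_steps` is hypothetical, and it assumes each command.sh can be run with `sh` from inside its own step directory (the relative paths in the commands above suggest this) and that /path/to/... has already been replaced.

```shell
# Run command.sh in each step directory of the demo, in order, stopping at the first failure.
run_demo_steps() {
    demo_dir=$1
    for step in "$demo_dir"/step[0-9]*/; do
        [ -f "$step/command.sh" ] || continue
        echo "Running ${step%/}"
        # A subshell keeps the caller's working directory unchanged.
        ( cd "$step" && sh ./command.sh ) || { echo "Failed in $step" >&2; return 1; }
    done
}
```

For example, `run_demo_steps /path/to/demo` runs step1 through step5 and stops with a non-zero status if any step fails.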
The final output files are IL-11.chr17.cc.estimation.FDR.added.txt (step 4, estimated rates) and IL-11.chr17.selected.somatic.del.candidates.cc.controlled.mec.annotated.txt (step 5, final deletion candidates). See the next section for more details on configuration options and input parameters.
Analysis modules and arguments
Analysis modules
PhaseDel has six modules that together complete the entire analysis. Use the -m option to select the analysis module. Run the modules in the order given to perform a complete analysis (MakeDuplicateList is optional). Two modules, GenerateHetSNPAndIndelList and MakeDuplicateList, are applied to a set of samples; the other modules (MergeDel, LinkageAnalysis, CallSomatic, and AnalyzeMechanism) are applied per cell.
Modules | Description |
---|---|
GenerateHetSNPAndIndelList | Generate heterozygous SNP list and insertion/deletion call sets from genotyped GATK GVCF file. Required for other modules. |
MergeDel | Merge GATK and Delly deletion calls to make an initial deletion candidate set. Merged output is used for linkage analysis. |
LinkageAnalysis | Linkage analysis to discriminate genuine deletion calls (both germline and somatic) from amplification artifacts. |
MakeDuplicateList | Make a list of duplicated phased candidates between different individuals. Duplicates will be considered to be possible germline variants/systematic artifacts and filtered out from the somatic candidate list. |
CallSomatic | Discriminate somatic deletions from the phased candidates and estimate genome-wide somatic deletion rate for a given cell. |
AnalyzeMechanism | Analyze underlying mechanisms for selected somatic deletion candidates. |
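To illustrate the per-cell modules, the sketch below prints (without running) the MergeDel and LinkageAnalysis commands for a list of cell IDs. The flags are the ones used in the demo commands; the function name, the JAR, REF and HETSNP variables, and the indel/, delly/, bam/, merged/ and linkage/ layout are hypothetical placeholders to adapt to your own data.

```shell
# Placeholder paths; override them before calling the function.
JAR=${JAR:-/path/to/PhaseDel.jar}
REF=${REF:-/path/to/hs37d5/genome.fa}
HETSNP=${HETSNP:-hetSNP_bulk/bulk.het.snps.vcf}

# Print the MergeDel and LinkageAnalysis command for each cell ID given as an argument.
print_per_cell_commands() {
    for cell in "$@"; do
        echo "java -jar $JAR -m MergeDel -v $HETSNP -G indel/$cell.del.call -D delly/$cell.delly_del.vcf.gz -d merged/$cell.merged.del"
        echo "java -jar $JAR -m LinkageAnalysis -r $REF -b bam/$cell.bam -v $HETSNP -d merged/$cell.merged.del -o linkage/"
    done
}
```

Printing the commands first makes it easy to review them, or to hand them to a job scheduler, before anything is executed.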
GenerateHetSNPAndIndelList
The GenerateHetSNPAndIndelList module takes a genotyped GVCF file that includes multiple samples (single cells and/or matched bulk). It generates two types of output files: (1) a list of common germline heterozygous SNPs and (2) separate insertion and deletion call sets for each sample. Because GATK detects both somatic and germline mutations in single-cell data, the module generates (1) only from matched bulk samples. To indicate which samples are matched bulk, the module accepts a file listing the SM tags of the bulk samples. Note that the variants in the input GVCF should be annotated with dbSNP IDs (in the ID column, e.g. rs75454623) so that common SNPs can be selected from the full call set.
Mandatory arguments:
Input | Option | Description |
---|---|---|
Genotyped GATK GVCF | -v, --gatk_vcf | dbSNP-annotated genotyped GVCF file that includes genotyping results for multiple samples (single cells and/or matched bulk) |
Output directory | -o, --output_dir | Path for output directory. This module will generate two subdirectories (hetSNP_bulk and indel) for outputs (1) and (2), respectively. |
Optional arguments:
Option | Default Value | Description |
---|---|---|
-b, --bulk_samples | | A file containing the SM tags of the bulk samples (one per line). SM tags should match the sample IDs in the GVCF file. |
Output files:
- hetSNP_bulk/ID.bulk.het.snps.vcf: Germline heterozygous common SNP list for a given bulk sample indicated by the provided SM tag (required for multiple follow-up modules)
- indel/ID.del.call: GATK deletion calls for each sample (required for MergeDel module)
- indel/ID.ins.call: GATK insertion calls for each sample (required for AnalyzeMechanism module)
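Because this module depends on dbSNP IDs in the ID column, it can help to confirm that the input GVCF is annotated before running it. The helper below is a sketch with a hypothetical name; it reads uncompressed VCF text from standard input and counts the records whose ID column starts with an rsID.

```shell
# Count VCF records (read from stdin) whose ID column (column 3) starts with a dbSNP rsID.
count_rsid_records() {
    awk -F '\t' '!/^#/ && $3 ~ /^rs[0-9]+/ { n++ } END { print n + 0 }'
}
```

For a bgzipped file, `gzip -dc input.g.vcf.gz | count_rsid_records` should print a number well above zero. The bulk sample file is plain text with one SM tag per line; for the demo, assuming the bulk SM tag is Hunamp_bulk as the output file names suggest, it could be written with `printf '%s\n' Hunamp_bulk > bulk.sampleID`.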
MergeDel
The MergeDel module takes two deletion call files for a given sample, one from GATK and one from Delly, and merges them into an integrated deletion list that serves as the initial deletion call set for linkage analysis. For GATK, provide the deletion calls generated by the GenerateHetSNPAndIndelList module (indel/ID.del.call). For Delly, provide a VCF file generated by Delly deletion calling. The next step, LinkageAnalysis, performs phasing analysis to discriminate genuine deletions among these merged calls.
Mandatory arguments:
Input | Option | Description |
---|---|---|
Germline het. SNP list | -v, --hetSNPList | Germline heterozygous SNP list for a given individual (generated from GenerateHetSNPAndIndelList module). |
GATK-derived deletion call | -G, --gatk_del | GATK-derived deletion calls for a given sample (generated by GenerateHetSNPAndIndelList module, indel/ID.del.call file) |
Delly deletion call | -D, --delly_del | Delly VCF file for deletion call for a given sample (.vcf) |
Output file for merged deletion candidates | -d, --mergedDel | Merged deletion candidate list (Path for output file) |
Output files:
- Merged deletion candidates (filename given by a user): Merged deletion candidate list (required for LinkageAnalysis module)
LinkageAnalysis
LinkageAnalysis module takes an initial deletion candidate list and a bulk het. SNP list for a given cell, and performs phasing analysis to discriminate genuine deletions from whole-genome amplification artifacts. The output is the filtered list of true deletion candidates including both somatic and germline deletions. The next step—CallSomatic—will select high-confidence somatic candidates from this output.
Mandatory arguments:
Input | Option | Description |
---|---|---|
Reference sequence | -r, --reference | FASTA formatted reference sequence file. The reference must be BWA indexed. |
BAM file | -b, --scBam | A single-cell or a bulk BAM file to be analyzed using phasing. The BAM file must be (coordinate) sorted and indexed. |
Germline het. SNP list | -v, --hetSNPList | Germline heterozygous SNP list for a given individual (generated from GenerateHetSNPAndIndelList module). |
Initial deletion candidates | -d, --mergedDel | Initial deletion candidate list for a given single-cell/bulk data (generated from MergeDel module). |
Output directory | -o, --output_dir | Path for output directory. |
Optional arguments:
Option | Default Value | Description |
---|---|---|
-s, --singleCellID | BAM_prefix | A single-cell/bulk ID for output filename (ID.output). The filename of the input BAM (filename.bam) will be used for this ID if not provided. |
-k, --keep_intermediate_files | FALSE | Keep intermediate files. |
-q, --lowBQThres | 15 | Base quality threshold at the Het.SNP site to be considered as a deletion-supporting read. |
-Q, --lowMQThres | 5 | Mapping quality threshold to be considered as a deletion-supporting read. |
--initialOffset | 5 | Initial offset to assign the same breakpoint between two breakpoints with the same direction. |
--clippedPos_offset | 3 | Offset to determine the clipped position of a given deletion breakpoint. |
--clippingSupCntThres | 2 | Minimum clipped read count to assign a deletion breakpoint. |
--minClippedLen | 5 | Minimum clipped length to be considered as a clipped read. |
--maxClippedLen | 50 | Maximum clipped length to construct a consensus clipped sequence. |
--delSupReadCntThres | 2 | Minimum supporting read count to analyze a given deletion candidate. |
--nmFracThres | 0.1 | Allowable mismatch fraction from a single read for deletion-supporting reads. |
--NFracThres | 0.3 | Allowable fraction of 'N'-base in the consensus clipped sequence. |
--homopolymerFrac | 0.7 | Threshold to be a homopolymeric region considering nearby reference sequences. |
--invertedFrac | 0.4 | Inverted read fraction between the deletion-supporting reads to be considered as an inverted event (likely to be chimeric artifact). |
Output files:
- ID.hetSNP.filtered.supplementaryRemoved.bam: A reduced BAM file that includes only phaseable reads around het. SNPs. It is written to the same directory as the input BAM file (required for CallSomatic module)
- ID.phased.del.list: Annotated all phaseable deletion breakpoints (required for MakeDuplicateList module)
- ID.filtered.phased.del.list: Selected genuine deletion candidates including both somatic and germline deletions (required for CallSomatic module)
MakeDuplicateList (optional)
MakeDuplicateList module takes a set of phaseable deletion breakpoints (ID.phased.del.list files generated from LinkageAnalysis module) from different individuals and selects deletion candidates observed in more than one individual with the exact same breakpoints. These duplicated candidates are likely to be germline deletions or systematic artifacts, thus will be filtered out from the final somatic deletion candidates during the next step—CallSomatic. If all single cell data are from the same individual, you don't need to run this module—duplicated candidates can be clonal mutations and thus should not be filtered out.
Mandatory arguments:
Input | Option | Description |
---|---|---|
A list of files for phaseable deletion breakpoints | -l, --phasedCandidateList | A tab-delimited file giving, on each line, a phaseable deletion list file and the individual it comes from. Example (header line, then one row per file): #phased_deletion_list individual_ID; 1459_PFC_01.phased.del.list 1459; 1459_PFC_02.phased.del.list 1459; 1278_PFC_01.phased.del.list 1278 |
Output file for duplicated deletion list | -u, --duplicateList | Output file for a list of duplicated phased candidates between different individuals |
Output files:
- Duplicated deletion candidates (filename given by a user): Possible germline/artifactual candidate list that should be filtered out for somatic deletion calling (used for CallSomatic module)
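The tab-delimited input for -l can be generated from the file names. The sketch below uses a hypothetical function name and assumes, as in the example IDs above, that the individual ID is the part of the file name before the first underscore; adjust this if your naming differs.

```shell
# Write the header and one "file<TAB>individual_ID" row per phased deletion list.
make_phased_candidate_list() {
    printf '#phased_deletion_list\tindividual_ID\n'
    for f in "$@"; do
        base=${f##*/}                            # strip any leading directory
        printf '%s\t%s\n' "$f" "${base%%_*}"     # ID = text before the first underscore
    done
}
```

For example, `make_phased_candidate_list linkage/*.phased.del.list > phased.candidates.tsv` builds the list. Note that this glob would also match the ID.filtered.phased.del.list files if they sit in the same directory, so filter those out first.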
CallSomatic
The CallSomatic module takes the selected deletion candidate list generated by the LinkageAnalysis module, together with the BAM files for a given cell and its matched bulk tissue, and identifies high-confidence somatic deletions. It also estimates the FDR for the cell based on the estimated level of amplification bias, and calculates the genome-wide somatic deletion rate using a two-component model. The output files include the estimated deletion rate, fitted model graphs for the cell, and the final list of high-confidence somatic deletion candidates. The next step, AnalyzeMechanism, annotates the predicted underlying mechanisms for the final deletion candidates.
Mandatory arguments:
Input | Option | Description |
---|---|---|
Reference sequence | -r, --reference | FASTA formatted reference sequence file. The reference must be BWA indexed. |
A single cell BAM file | -b, --scBam | A single cell WGS BAM file corresponding to the provided deletion candidates. The reduced BAM generated from LinkageAnalysis module should be located together within the same folder. |
A matched bulk BAM file | -B, --bulkBam | A matched bulk WGS BAM. The reduced BAM generated from LinkageAnalysis module should be located together within the same folder. |
Germline het. SNP list | -v, --hetSNPList | Germline heterozygous SNP list for a given individual (generated from GenerateHetSNPAndIndelList module). |
Phased deletion candidates from a single cell | -p, --phasedDelList_sc | Phased deletion list (ID.filtered.phased.del.list) from the given single cell (generated from LinkageAnalysis module). |
Phased deletion candidates from a matched bulk | -P, --phasedDelList_bulk | Phased deletion list (ID.filtered.phased.del.list) from the matched bulk (generated from LinkageAnalysis module). |
Rscript path | -R, --Rscript | An absolute path for the Rscript executable file (e.g. /usr/bin/Rscript). |
Output directory | -o, --output_dir | Path for output directory. |
Optional arguments:
Option | Default Value | Description |
---|---|---|
-s, --singleCellID | BAM_prefix | A single-cell/bulk ID for output filename (ID.output). The filename of the input BAM (filename.bam) will be used for this ID if not provided. |
-u, --duplicateList | | A list of duplicated phased candidates among different individuals (generated from MakeDuplicateList module). |
-k, --keep_intermediate_files | FALSE | Keep intermediate files. |
-Q, --lowMQThres | 5 | Mapping quality threshold to be considered as a deletion-supporting read. |
--bpDiff | 10 | Allowable base-pair difference to match deletions from single cell and bulk data. |
--clippedPos_offset | 3 | Offset to match the clipped position for a given deletion breakpoint. |
--nonVarSupCntThres | 1 | Required minimum non-deletion-supporting read count at the breakpoint in the matched bulk data (to be called as somatic). If there is no non-deletion-supporting read in the bulk data, this candidate will be considered as a germline deletion. |
--avgABratioThres | 0.1 | Maximum allowable avg.MAF for germline het. SNPs located within the deletion region in single cell data. |
--sizeRatioThres | 0.9 | Minimum size ratio to match the same deletion from the single cell and bulk data. |
--nonVarSupReadFracThres | 0.2 | Required fraction of non-deletion-supporting read at the breakpoint in the matched bulk data (to be called as somatic). For a given breakpoint, if there are non-deletion-supporting reads with the fraction less than this threshold in the bulk data, this candidate will be considered as a germline deletion. |
--delVAFThres | 0.7 | Minimum read-depth ratio between deleted and adjacent region in the bulk data (to be called as somatic). |
Output files:
- ID.cc.estimation.FDR.added.txt: Estimated genome-wide somatic deletion rate for a given cell
- ID.selected.somatic.del.candidates.cc.controlled.txt: A final candidate list for high-confidence somatic deletions (required for AnalyzeMechanism module)
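CallSomatic expects the reduced BAM from LinkageAnalysis to sit next to each input BAM, so a quick check before launching can save a failed run. The sketch below uses hypothetical function names and assumes the default ID, that is, the input BAM file name without its .bam extension (no -s option given).

```shell
# Expected path of the reduced BAM that LinkageAnalysis writes next to an input BAM.
reduced_bam_path() {
    echo "${1%.bam}.hetSNP.filtered.supplementaryRemoved.bam"
}

# Print every input BAM whose reduced BAM is not found in the same folder.
missing_reduced_bams() {
    for input_bam in "$@"; do
        [ -f "$(reduced_bam_path "$input_bam")" ] || echo "$input_bam"
    done
}
```

For the demo, `missing_reduced_bams ../input/bam/IL-11.chr17.bam ../input/bam/Hunamp_bulk.chr17.bam` should print nothing once step 3 has finished for both samples.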
AnalyzeMechanism
The AnalyzeMechanism module takes the final somatic deletion list generated by the CallSomatic module and predicts the underlying mechanism of formation for each deletion. The prediction follows the criteria of a previous cancer study (Yang et al.). The output is an annotated list of somatic deletions with their predicted underlying mechanisms.
Mandatory arguments:
Input | Option | Description |
---|---|---|
Reference sequence | -r, --reference | FASTA formatted reference sequence file. The reference must be BWA indexed. |
A list of selected somatic deletion candidates | -d, --somatic_deletions | A final list of selected high-confidence somatic deletions for a given cell (generated from CallSomatic module (ID.selected.somatic.del.candidates.cc.controlled.txt)) |
A list of insertion candidates | -i, --insertionList | An insertion candidate list for a given cell (generated from GenerateHetSNPAndIndelList module) |
Output directory | -o, --output_dir | Path for output directory. |
Optional arguments:
Option | Default Value | Description |
---|---|---|
-s, --singleCellID | BAM_prefix | A single-cell/bulk ID for output filename (ID.output). The filename of the input BAM (filename.bam) will be used for this ID if not provided. |
-k, --keep_intermediate_files | FALSE | Keep intermediate files. |
--MEI_gnomad | | gnomAD VCF for MEI sites (provided default under the data folder). |
--repeat_masker | | RepeatMasker annotation from UCSC table browser (provided default under the data folder). |
--homologySearch_offset | 100 | Offset to search sequence homology around the pair of deletion breakpoints. |
--homologySearchLen | 300 | Maximum length for sequence homology to search. |
--homologyConcordance | 0.9 | Required concordance to determine sequence homology between the breakpoints for a given deletion. |
--repeatOverlapThres | 0.8 | Mutually overlapped fraction with a known repeat to be annotated |
Output files:
- ID.selected.somatic.del.candidates.cc.controlled.mec.annotated.txt: An annotated list of somatic deletions with their predicted underlying mechanisms