PhaseDel

PhaseDel is a Java-based variant caller designed for detecting somatic deletions from high-coverage (~30x) single-cell whole-genome sequencing (scWGS) data. It is specialized for discriminating genuine somatic focal deletions (several bases to kilobases in length) from the abundant SV-like artifacts that inevitably arise during single-cell whole-genome amplification. PhaseDel detects true deletions using the linkage information between deletion breakpoints and nearby germline heterozygous SNP sites. The important features of PhaseDel are:

  • Accurate identification of somatic focal deletions at single-base-pair resolution in scWGS using phasing information
  • Estimation of the genome-wide somatic deletion rate for a given cell at a controlled FDR level
  • Characterization of underlying DSB repair mechanisms for identified deletion candidates

Get the most recent version of PhaseDel here: Download

Prerequisite software

The following list gives tested versions in parentheses where applicable; later versions are likely to work as well. These instructions are designed for running PhaseDel on human sequencing data aligned to GRCh37 (hg19).

**Java (version 1.8)**

  • Java version 1.8 or higher
  • HTSJDK library (included in the package)
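
The Java requirement can be checked before running anything. The following is a minimal sketch and not part of the PhaseDel package: the function names java_version_ok and installed_java_version are hypothetical, and the parsing assumes the usual version strings printed by `java -version` (legacy "1.8.0_292" style or modern "17.0.2" style).

```sh
# Succeed when a Java version string is 1.8 or later.
java_version_ok() (
    version=$1
    major=${version%%.*}          # text before the first dot
    major=${major%%[!0-9]*}       # drop suffixes such as "-ea"
    if [ "$major" = 1 ]; then
        # Legacy scheme 1.<minor>.x, where 1.8 means Java 8.
        minor=${version#1.}
        minor=${minor%%[!0-9]*}
        [ "${minor:-0}" -ge 8 ]
    else
        # Modern scheme: the first field is the feature release (9, 11, 17, ...).
        [ "${major:-0}" -ge 9 ]
    fi
)

# Print the version string reported by the java found on PATH.
installed_java_version() {
    java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }'
}
```

A typical call would be `java_version_ok "$(installed_java_version)" || echo "Java 1.8 or later is required"`.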

**R packages (version 4.0.1)**

  • rstan
  • plyr
  • tidyr
  • dplyr
  • ggplot2

Installation of PhaseDel

PhaseDel was developed using Java JDK 8 (64-bit). Running PhaseDel requires Java Runtime Environment (JRE) version 1.8.x or later. Download the most recent version of PhaseDel from the code page and extract the gzipped archive:

tar -zxvf PhaseDel-v1.x.y.tar.gz

The archive contains the following:

  1. PhaseDel.jar : Executable JAR file of PhaseDel
  2. README.md : Same as the wiki page
  3. lib/ : A folder containing other JAR libraries for PhaseDel
  4. data/ : A folder containing R code and data files for phasing analysis and annotation
  5. demo/ : A folder containing scripts and data files for running the PhaseDel demo

Running PhaseDel

To run PhaseDel, execute the JAR file as follows:

java -jar PhaseDel.jar

The -h or -? option prints the following usage message. If you see it, you are ready to run PhaseDel.


PhaseDel: Phasing-based somatic deletion caller from single-cell WGS data 
Version: 1.0.0

Usage: java -jar PhaseDel.jar -m <analysis_module>

Analysis modules:
---------------------------------------------------------------------------------------------
    GenerateHetSNPAndIndelList                  Generate heterozygous SNP list and ins./del. sets from the gnotyped GATK GVCF file
    MergeDel                                    Merge deletion calls for initial candidate set (GATK and Delly calls)
    LinkageAnalysis                             Linkage analysis to discriminate genuine deletion calls (both germline and somatic)
    MakeDuplicateList (optional)                Make a list of duplicated phased candidates between different individuals (possible germline/artifact list)
    CallSomatic                                 Call somatic deletion candidates from phased candidates and estimate genome-wide deletion rate
    AnalyzeMechanism                            Analyze underlying mechanisms for selected somatic deletion candidates


Running the PhaseDel demo

The 'demo' folder includes the input data files and commands for the five steps that produce the final output of annotated somatic deletion calls. Before running, download the following BAM files and place them in the demo BAM folder (demo/input/bam/). These are ~8.4 Gb BAMs for chromosome 17 from the public MDA data of single fibroblasts (Dong et al., Nat Methods 2017).

  • Hunamp_bulk.chr17.bam
  • Hunamp_bulk.chr17.bam.bai
  • IL-11.chr17.bam
  • IL-11.chr17.bam.bai
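
Whether the four demo files are in place can be checked with a few lines of shell. This is a minimal sketch rather than a script shipped with PhaseDel; the function name missing_demo_bams is hypothetical, and the default directory is the demo BAM folder named above.

```sh
# Print each expected demo BAM or index file that is absent from the directory.
missing_demo_bams() (
    bam_dir=${1:-demo/input/bam}
    for f in Hunamp_bulk.chr17.bam Hunamp_bulk.chr17.bam.bai \
             IL-11.chr17.bam IL-11.chr17.bam.bai; do
        [ -f "$bam_dir/$f" ] || printf '%s\n' "$f"
    done
)
```

Running `missing_demo_bams` from the package root prints nothing when all four files are present.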

Execute command.sh in each subdirectory under the demo folder to run all steps of PhaseDel sequentially. The following are the actual commands written in command.sh for each step. Replace /path/to/... with the correct paths on your system before running. Each step runs in less than a minute on a single core, except steps 3 and 4, which take about an hour and about 10 minutes, respectively.

step1_GenerateHetSNPAndIndelList (<1 min)
java -jar /path/to/PhaseDel.jar -m GenerateHetSNPAndIndelList \
    -v ../input/gatk/dong_et_al.NatMethod.2017.HC_G.g.chr17.vcf.gz \
    -b ../input/gatk/bulk.sampleID \
    -o ./

step2_MergeDel (<1 min)
# bulk
java -jar /path/to/PhaseDel.jar -m MergeDel \
    -v ../step1_GenerateHetSNPAndIndelList/hetSNP_bulk/Hunamp_bulk.bulk.het.snps.vcf \
    -G ../step1_GenerateHetSNPAndIndelList/indel/Hunamp_bulk.del.call \
    -D ../input/delly/Hunamp_bulk.delly_del.chr17.vcf.gz \
    -d ./Hunamp_bulk.merged.del

# single cell
java -jar /path/to/PhaseDel.jar -m MergeDel \
    -v ../step1_GenerateHetSNPAndIndelList/hetSNP_bulk/Hunamp_bulk.bulk.het.snps.vcf \
    -G ../step1_GenerateHetSNPAndIndelList/indel/IL-11.del.call \
    -D ../input/delly/IL-11.delly_del.chr17.vcf.gz \
    -d ./IL-11.merged.del

step3_LinkageAnalysis (~1 hour)
REF=/path/to/hs37d5/genome.fa

# bulk
java -jar /path/to/PhaseDel.jar -m LinkageAnalysis \
    -r $REF \
    -b ../input/bam/Hunamp_bulk.chr17.bam \
    -v ../step1_GenerateHetSNPAndIndelList/hetSNP_bulk/Hunamp_bulk.bulk.het.snps.vcf \
    -d ../step2_MergeDel/Hunamp_bulk.merged.del \
    -o ./

# single cell
java -jar /path/to/PhaseDel.jar -m LinkageAnalysis \
    -r $REF \
    -b ../input/bam/IL-11.chr17.bam \
    -v ../step1_GenerateHetSNPAndIndelList/hetSNP_bulk/Hunamp_bulk.bulk.het.snps.vcf \
    -d ../step2_MergeDel/IL-11.merged.del \
    -o ./

step4_CallSomatic (~10 min)
REF=/path/to/hs37d5/genome.fa
RSCRIPT=/path/to/Rscript

java -jar /path/to/PhaseDel.jar -m CallSomatic \
    -r $REF \
    -v ../step1_GenerateHetSNPAndIndelList/hetSNP_bulk/Hunamp_bulk.bulk.het.snps.vcf \
    -b ../input/bam/IL-11.chr17.bam \
    -B ../input/bam/Hunamp_bulk.chr17.bam \
    -p ../step3_LinkageAnalysis/IL-11.chr17.filtered.phased.del.list \
    -P ../step3_LinkageAnalysis/Hunamp_bulk.chr17.filtered.phased.del.list \
    -R $RSCRIPT \
    -o ./

step5_AnalyzeMechanism (<1 min)
REF=/path/to/hs37d5/genome.fa

java -jar /path/to/PhaseDel.jar -m AnalyzeMechanism \
    -r $REF \
    -d ../step4_CallSomatic/IL-11.chr17.selected.somatic.del.candidates.cc.controlled.txt \
    -i ../step1_GenerateHetSNPAndIndelList/indel/IL-11.ins.call \
    -o ./

The final output files are IL-11.chr17.cc.estimation.FDR.added.txt (step4, estimated rates) and IL-11.chr17.selected.somatic.del.candidates.cc.controlled.mec.annotated.txt (step5, final deletion candidates). See the next section for details on configuration options and input parameters.
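
The per-step command.sh scripts can also be driven from a small wrapper that stops at the first failing step. The following is a minimal sketch, not part of the PhaseDel package: the function name run_demo_steps is hypothetical, and it assumes that the step directories are named step1_... to step5_... (as in the relative paths above) and that command.sh runs under sh.

```sh
# Run command.sh in every step* subdirectory, in lexical order (step1 ... step5).
run_demo_steps() (
    demo_dir=${1:-demo}
    for step in "$demo_dir"/step*/; do
        # Skip anything without a command.sh (also covers an unmatched glob).
        [ -f "${step}command.sh" ] || continue
        echo "Running ${step}command.sh"
        # Each script uses paths relative to its own directory, so cd first.
        ( cd "$step" && sh command.sh ) || {
            echo "Step failed: $step" >&2
            exit 1
        }
    done
)
```

Calling `run_demo_steps /path/to/demo` after editing the /path/to/... placeholders in each command.sh runs the five steps in order.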

Analysis modules and arguments

Analysis modules

PhaseDel divides the analysis into six modules. Use the -m option to select a module. For a complete analysis, run the modules in the order listed (MakeDuplicateList is optional). Two modules, GenerateHetSNPAndIndelList and MakeDuplicateList, are applied to a set of samples; the other four, MergeDel, LinkageAnalysis, CallSomatic, and AnalyzeMechanism, are applied per cell.

| Module | Description |
| --- | --- |
| GenerateHetSNPAndIndelList | Generate heterozygous SNP list and insertion/deletion call sets from genotyped GATK GVCF file. Required for other modules. |
| MergeDel | Merge GATK and Delly deletion calls to make an initial deletion candidate set. Merged output is used for linkage analysis. |
| LinkageAnalysis | Linkage analysis to discriminate genuine deletion calls (both germline and somatic) from amplification artifacts. |
| MakeDuplicateList | Make a list of duplicated phased candidates between different individuals. Duplicates will be considered to be possible germline variants/systematic artifacts and filtered out from the somatic candidate list. |
| CallSomatic | Discriminate somatic deletions from the phased candidates and estimate genome-wide somatic deletion rate for a given cell. |
| AnalyzeMechanism | Analyze underlying mechanisms for selected somatic deletion candidates. |


GenerateHetSNPAndIndelList

The GenerateHetSNPAndIndelList module takes a genotyped GVCF file that includes multiple samples (single cells and/or matched bulk). It generates two types of output files: (1) a list of germline heterozygous common SNPs and (2) separate insertion and deletion call sets for each sample. Because GATK detects both somatic and germline mutations in single-cell data, the module generates (1) only from matched bulk samples. To indicate which samples are matched bulk, the module accepts a file listing the SM tags of the bulk samples. Note that the variants in the input GVCF must be annotated with dbSNP IDs (in the ID column, e.g. rs75454623) so that common SNPs can be selected from the full call set.

Mandatory arguments:

| Input | Option | Description |
| --- | --- | --- |
| Genotyped GATK GVCF | -v, --gatk_vcf | dbSNP-annotated genotyped GVCF file that includes genotyping results for multiple samples (single cells and/or matched bulk) |
| Output directory | -o, --output_dir | Path for output directory. This module will generate two subdirectories (hetSNP_bulk and indel) for outputs (1) and (2), respectively. |

Optional arguments:

| Option | Default Value | Description |
| --- | --- | --- |
| -b, --bulk_samples | | A file containing SM tag list for bulk samples (line-separated). SM tags should match the sample IDs in the GVCF file. |

Output files:
  1. hetSNP_bulk/ID.bulk.het.snps.vcf: Germline heterozygous common SNP list for a given bulk sample indicated by the provided SM tag (required for multiple follow-up modules)
  2. indel/ID.del.call: GATK deletion calls for each sample (required for MergeDel module)
  3. indel/ID.ins.call: GATK insertion calls for each sample (required for AnalyzeMechanism module)
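
Two inputs of this module are easy to get wrong: the line-separated SM tag file and the dbSNP annotation of the GVCF. The helpers below are a minimal sketch and not part of PhaseDel; both function names are hypothetical, and the annotation check simply assumes that an annotated record carries an rs identifier in the third (ID) column of the VCF.

```sh
# Write one SM tag per line to the given file (the format expected by -b).
write_bulk_sample_file() (
    out=$1
    shift
    printf '%s\n' "$@" > "$out"
)

# Read an uncompressed VCF on stdin and print
# "<records with an rs ID> <total records>".
count_dbsnp_annotated() {
    awk -F'\t' '!/^#/ { total++; if ($3 ~ /(^|;)rs[0-9]+/) annotated++ }
                END   { printf "%d %d\n", annotated, total }'
}
```

For a compressed file, `gzip -dc input.g.vcf.gz | count_dbsnp_annotated` gives the two counts; a first number of 0 suggests that the dbSNP annotation step was skipped.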

MergeDel

The MergeDel module takes two deletion call files for a given sample, one from GATK and one from Delly, and merges them into an integrated deletion list that serves as the initial deletion call set for linkage analysis. For GATK, provide the deletion calls generated by the GenerateHetSNPAndIndelList module (indel/ID.del.call). For Delly, provide a VCF file generated by Delly deletion calling. The next step, LinkageAnalysis, performs phasing analysis to discriminate genuine deletions among these merged calls.

Mandatory arguments:

| Input | Option | Description |
| --- | --- | --- |
| Germline het. SNP list | -v, --hetSNPList | Germline heterozygous SNP list for a given individual (generated from GenerateHetSNPAndIndelList module). |
| GATK-derived deletion call | -G, --gatk_del | GATK-derived deletion calls for a given sample (generated by GenerateHetSNPAndIndelList module, indel/ID.del.call file) |
| Delly deletion call | -D, --delly_del | Delly VCF file for deletion call for a given sample (.vcf) |
| Output file for merged deletion candidates | -d, --mergedDel | Merged deletion candidate list (Path for output file) |

Output files:
  1. Merged deletion candidates (filename given by a user): Merged deletion candidate list (required for LinkageAnalysis module)
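
Because MergeDel is applied per sample, a cohort needs one command per sample ID. The following dry-run helper is a minimal sketch: the function name print_merge_del_cmds is hypothetical, the Delly file naming (ID.delly_del.vcf.gz) is an assumption for illustration, and only the options documented above are used. It prints the commands instead of running them so they can be inspected first.

```sh
# Print (not run) one MergeDel command per sample ID.
# Arguments: <PhaseDel.jar> <het SNP VCF> <indel dir> <Delly dir> <ID>...
print_merge_del_cmds() (
    jar=$1
    het_snps=$2
    indel_dir=$3
    delly_dir=$4
    shift 4
    for id in "$@"; do
        printf 'java -jar %s -m MergeDel -v %s -G %s/%s.del.call -D %s/%s.delly_del.vcf.gz -d %s.merged.del\n' \
            "$jar" "$het_snps" "$indel_dir" "$id" "$delly_dir" "$id" "$id"
    done
)
```

Once the printed commands look right, piping the output to `sh` executes them.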

LinkageAnalysis

LinkageAnalysis module takes an initial deletion candidate list and a bulk het. SNP list for a given cell, and performs phasing analysis to discriminate genuine deletions from whole-genome amplification artifacts. The output is the filtered list of true deletion candidates including both somatic and germline deletions. The next step—CallSomatic—will select high-confidence somatic candidates from this output.

Mandatory arguments:

| Input | Option | Description |
| --- | --- | --- |
| Reference sequence | -r, --reference | FASTA formatted reference sequence file. The reference must be BWA indexed. |
| BAM file | -b, --scBam | A single-cell or a bulk BAM file to be analyzed using phasing. The BAM file must be (coordinate) sorted and indexed. |
| Germline het. SNP list | -v, --hetSNPList | Germline heterozygous SNP list for a given individual (generated from GenerateHetSNPAndIndelList module). |
| Initial deletion candidates | -d, --mergedDel | Initial deletion candidate list for a given single-cell/bulk data (generated from MergeDel module). |
| Output directory | -o, --output_dir | Path for output directory. |

Optional arguments:

| Option | Default Value | Description |
| --- | --- | --- |
| -s, --singleCellID | BAM_prefix | A single-cell/bulk ID for output filename (ID.output). The filename of the input BAM (filename.bam) will be used for this ID if not provided. |
| -k, --keep_intermediate_files | FALSE | Keep intermediate files. |
| -q, --lowBQThres | 15 | Base quality threshold at the Het.SNP site to be considered as a deletion-supporting read. |
| -Q, --lowMQThres | 5 | Mapping quality threshold to be considered as a deletion-supporting read. |
| --initialOffset | 5 | Initial offset to assign the same breakpoint between two breakpoints with the same direction. |
| --clippedPos_offset | 3 | Offset to determine the clipped position of a given deletion breakpoint. |
| --clippingSupCntThres | 2 | Minimum clipped read count to assign a deletion breakpoint. |
| --minClippedLen | 5 | Minimum clipped length to be considered as a clipped read. |
| --maxClippedLen | 50 | Maximum clipped length to construct a consensus clipped sequence. |
| --delSupReadCntThres | 2 | Minimum supporting read count to analyze a given deletion candidate. |
| --nmFracThres | 0.1 | Allowable mismatch fraction from a single read for deletion-supporting reads. |
| --NFracThres | 0.3 | Allowable fraction of 'N'-base in the consensus clipped sequence. |
| --homopolymerFrac | 0.7 | Threshold to be a homopolymeric region considering nearby reference sequences. |
| --invertedFrac | 0.4 | Inverted read fraction between the deletion-supporting reads to be considered as an inverted event (likely to be chimeric artifact). |

Output files:
  1. ID.hetSNP.filtered.supplementaryRemoved.bam: A reduced BAM file including phaseable reads around het. SNPs only. Will be located in the same directory with the input BAM file (required for CallSomatic module)
  2. ID.phased.del.list: All phaseable deletion breakpoints with annotations (required for MakeDuplicateList module)
  3. ID.filtered.phased.del.list: Selected genuine deletion candidates including both somatic and germline deletions (required for CallSomatic module)
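
The index requirements on the reference and BAM inputs can be verified before starting this roughly hour-long step. The check below is a minimal sketch and not part of PhaseDel: the function name check_linkage_inputs is hypothetical, and it assumes that "BWA indexed" means the five companion files written by `bwa index` (.amb, .ann, .bwt, .pac, .sa) sit next to the FASTA file. It does not verify that the BAM is coordinate-sorted.

```sh
# Print one line per missing companion file; print nothing when all exist.
# Arguments: <reference FASTA> <BAM file>
check_linkage_inputs() (
    ref=$1
    bam=$2
    for ext in amb ann bwt pac sa; do
        [ -f "$ref.$ext" ] || echo "missing BWA index file: $ref.$ext"
    done
    # Accept both common index names: sample.bam.bai and sample.bai.
    [ -f "$bam.bai" ] || [ -f "${bam%.bam}.bai" ] || echo "missing BAM index for: $bam"
)
```

For the demo this would be called as `check_linkage_inputs /path/to/hs37d5/genome.fa ../input/bam/IL-11.chr17.bam`.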

MakeDuplicateList (optional)

MakeDuplicateList module takes a set of phaseable deletion breakpoints (ID.phased.del.list files generated from LinkageAnalysis module) from different individuals and selects deletion candidates observed in more than one individual with the exact same breakpoints. These duplicated candidates are likely to be germline deletions or systematic artifacts, thus will be filtered out from the final somatic deletion candidates during the next step—CallSomatic. If all single cell data are from the same individual, you don't need to run this module—duplicated candidates can be clonal mutations and thus should not be filtered out.

Mandatory arguments:

| Input | Option | Description |
| --- | --- | --- |
| A list of files for phaseable deletion breakpoints | -l, --phasedCandidateList | A tab-delimited file that lists each phaseable deletion list and the individual it belongs to (example below) |
| Output file for duplicated deletion list | -u, --duplicateList | Output file for a list of duplicated phased candidates between different individuals |

Example of the -l input file (columns separated by a tab):

    #phased_deletion_list	individual_ID
    1459_PFC_01.phased.del.list	1459
    1459_PFC_02.phased.del.list	1459
    1278_PFC_01.phased.del.list	1278

Output files:
  1. Duplicated deletion candidates (filename given by a user): Possible germline/artifactual candidate list that should be filtered out for somatic deletion calling (used for CallSomatic module)
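
The tab-delimited input for -l can be generated from a directory of LinkageAnalysis outputs. The following is a minimal sketch, not part of PhaseDel: the function name make_phased_candidate_list is hypothetical, and it assumes that the individual ID is the part of the filename before the first underscore, as in the example names 1459_PFC_01 and 1278_PFC_01. Files named differently need their own ID rule.

```sh
# Print a tab-delimited "<phased list path> <individual ID>" table for a directory.
make_phased_candidate_list() (
    dir=${1:-.}
    printf '#phased_deletion_list\tindividual_ID\n'
    for path in "$dir"/*.phased.del.list; do
        [ -f "$path" ] || continue
        name=${path##*/}
        # ID.filtered.phased.del.list matches the glob too, but is not the input here.
        case $name in
            *.filtered.phased.del.list) continue ;;
        esac
        # Individual ID: text before the first underscore (assumption).
        printf '%s\t%s\n' "$path" "${name%%_*}"
    done
)
```

Redirecting the output to a file, for example `make_phased_candidate_list step3_out > phased.list.txt`, yields a file that can be passed to -l.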

CallSomatic

The CallSomatic module takes the selected deletion candidate list generated by the LinkageAnalysis module, together with the BAM files for a given cell and its matched bulk tissue, and identifies high-confidence somatic deletions. The module also estimates the FDR for the cell based on the estimated level of amplification bias, and calculates the genome-wide somatic deletion rate using a two-component model. The output files include the estimated deletion rate, fitted model graphs for the cell, and the final list of high-confidence somatic deletion candidates. The next step, AnalyzeMechanism, annotates the predicted underlying mechanisms for the final deletion candidates.

Mandatory arguments:

| Input | Option | Description |
| --- | --- | --- |
| Reference sequence | -r, --reference | FASTA formatted reference sequence file. The reference must be BWA indexed. |
| A single cell BAM file | -b, --scBam | A single cell WGS BAM file corresponding to the provided deletion candidates. The reduced BAM generated from LinkageAnalysis module should be located together within the same folder. |
| A matched bulk BAM file | -B, --bulkBam | A matched bulk WGS BAM. The reduced BAM generated from LinkageAnalysis module should be located together within the same folder. |
| Germline het. SNP list | -v, --hetSNPList | Germline heterozygous SNP list for a given individual (generated from GenerateHetSNPAndIndelList module). |
| Phased deletion candidates from a single cell | -p, --phasedDelList_sc | Phased deletion list (ID.filtered.phased.del.list) from the given single cell (generated from LinkageAnalysis module). |
| Phased deletion candidates from a matched bulk | -P, --phasedDelList_bulk | Phased deletion list (ID.filtered.phased.del.list) from the matched bulk (generated from LinkageAnalysis module). |
| Rscript path | -R, --Rscript | An absolute path for the Rscript executable file (e.g. /usr/bin/Rscript). |
| Output directory | -o, --output_dir | Path for output directory. |

Optional arguments:

| Option | Default Value | Description |
| --- | --- | --- |
| -s, --singleCellID | BAM_prefix | A single-cell/bulk ID for output filename (ID.output). The filename of the input BAM (filename.bam) will be used for this ID if not provided. |
| -u, --duplicateList | | A list of duplicated phased candidates among different individuals (generated from MakeDuplicateList module). |
| -k, --keep_intermediate_files | FALSE | Keep intermediate files. |
| -Q, --lowMQThres | 5 | Mapping quality threshold to be considered as a deletion-supporting read. |
| --bpDiff | 10 | Allowable base-pair difference to match deletions from single cell and bulk data. |
| --clippedPos_offset | 3 | Offset to match the clipped position for a given deletion breakpoint. |
| --nonVarSupCntThres | 1 | Required minimum non-deletion-supporting read count at the breakpoint in the matched bulk data (to be called as somatic). If there is no non-deletion-supporting read in the bulk data, this candidate will be considered as a germline deletion. |
| --avgABratioThres | 0.1 | Maximum allowable avg.MAF for germline het. SNPs located within the deletion region in single cell data. |
| --sizeRatioThres | 0.9 | Minimum size ratio to match the same deletion from the single cell and bulk data. |
| --nonVarSupReadFracThres | 0.2 | Required fraction of non-deletion-supporting read at the breakpoint in the matched bulk data (to be called as somatic). For a given breakpoint, if there are non-deletion-supporting reads with the fraction less than this threshold in the bulk data, this candidate will be considered as a germline deletion. |
| --delVAFThres | 0.7 | Minimum read-depth ratio between deleted and adjacent region in the bulk data (to be called as somatic). |

Output files:
  1. ID.cc.estimation.FDR.added.txt: Estimated genome-wide somatic deletion rate for a given cell
  2. ID.selected.somatic.del.candidates.cc.controlled.txt: A final candidate list for high-confidence somatic deletions (required for AnalyzeMechanism module)
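
CallSomatic expects the reduced BAM written by LinkageAnalysis to sit next to each input BAM. The helper below is a minimal sketch for locating that file: the function name reduced_bam_path is hypothetical, and it assumes the default ID (the BAM filename without .bam), that is, that -s was not used in the LinkageAnalysis run.

```sh
# Print the expected path of the reduced BAM for a given input BAM,
# following the ID.hetSNP.filtered.supplementaryRemoved.bam naming.
reduced_bam_path() {
    printf '%s.hetSNP.filtered.supplementaryRemoved.bam\n' "${1%.bam}"
}
```

A guard such as `[ -f "$(reduced_bam_path ../input/bam/IL-11.chr17.bam)" ] || echo "run LinkageAnalysis first"` can then precede the CallSomatic command.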

AnalyzeMechanism

The AnalyzeMechanism module takes the final somatic deletion list generated by the CallSomatic module and predicts the underlying mechanism of formation for each deletion. The prediction follows the criteria of a previous cancer study (Yang et al.). The output is an annotated list of somatic deletions with their predicted underlying mechanisms.

Mandatory arguments:

| Input | Option | Description |
| --- | --- | --- |
| Reference sequence | -r, --reference | FASTA formatted reference sequence file. The reference must be BWA indexed. |
| A list of selected somatic deletion candidates | -d, --somatic_deletions | A final list of selected high-confidence somatic deletions for a given cell (generated from CallSomatic module, ID.selected.somatic.del.candidates.cc.controlled.txt) |
| A list of insertion candidates | -i, --insertionList | An insertion candidate list for a given cell (generated from GenerateHetSNPAndIndelList module) |
| Output directory | -o, --output_dir | Path for output directory. |

Optional arguments:

| Option | Default Value | Description |
| --- | --- | --- |
| -s, --singleCellID | BAM_prefix | A single-cell/bulk ID for output filename (ID.output). The filename of the input BAM (filename.bam) will be used for this ID if not provided. |
| -k, --keep_intermediate_files | FALSE | Keep intermediate files. |
| --MEI_gnomad | | gnomAD VCF for MEI sites (provided default under the data folder). |
| --repeat_masker | | RepeatMasker annotation from UCSC table browser (provided default under the data folder). |
| --homologySearch_offset | 100 | Offset to search sequence homology around the pair of deletion breakpoints. |
| --homologySearchLen | 300 | Maximum length for sequence homology to search. |
| --homologyConcordance | 0.9 | Required concordance to determine sequence homology between the breakpoints for a given deletion. |
| --repeatOverlapThres | 0.8 | Mutually overlapped fraction with a known repeat to be annotated. |

Output files:
  1. ID.selected.somatic.del.candidates.cc.controlled.mec.annotated.txt: An annotated list of somatic deletions with their predicted underlying mechanisms
Source: README.md, updated 2022-02-22