SplitFusion Code

Detection of gene fusion based on split alignments

Brought to you by: baifeng

Tree [9abad6] master /

File	Date	Author	Commit
R	2019-02-19	zhangbaifeng	[c3603f] Add function description
exec	2019-02-18	zhangbaifeng	[a29589] update
inst	2019-02-18	zhangbaifeng	[a29589] update
man	2019-02-19	zhangbaifeng	[c3603f] Add function description
.Rbuildignore	2019-02-18	zhangbaifeng	[a29589] update
.gitattributes	2019-02-18	zhangbaifeng	[a29589] update
.gitignore	2019-02-18	zhangbaifeng	[a29589] update
DESCRIPTION	2019-02-19	zhangbaifeng	[c3603f] Add function description
NAMESPACE	2019-02-18	zhangbaifeng	[a29589] update
README.md	2019-02-19	baifeng	[9abad6] Update README.md
SplitFusion.Rproj	2019-02-18	zhangbaifeng	[a29589] update

Read Me

SplitFusion - a fast pipeline for detection of gene fusion based on fusion-supporting split alignment.

Gene fusion is a hallmark of cancer. Many gene fusions are effective therapeutic targets, such as BCR-ABL in chronic myeloid leukemia, EML4-ALK in lung cancer, and ROS1 fused to any of several partner genes in lung cancer. Accurate detection of gene fusions plays a pivotal role in precision medicine by matching the right drugs to the right patients.

Challenges in the diagnosis of gene fusions include poor sample quality, limited amounts of available clinical specimens, and complicated gene rearrangements. Anchored multiplex PCR (AMP) is a clinically proven technology designed, in part, for robust detection of gene fusions across clinical samples of different types and varied quality, including RNA extracted from FFPE samples.

SplitFusion is a companion data pipeline for AMP. It detects gene fusions based on split alignments, i.e. reads crossing fusion breakpoints, and can accurately infer whether the partners of a given fusion candidate are joined in-frame or out-of-frame. SplitFusion also outputs example breakpoint-supporting sequences in FASTA format, allowing for further investigation.

Reference publication

Zheng Z, et al. Anchored multiplex PCR for targeted next-generation sequencing. Nat Med. 2014

How does SplitFusion work?

The analysis consists of the following computational steps:

Retrieve all alignments that have supplementary (split) alignments, marked by the 'SA' tag in SAM format, from BAM files generated by BWA MEM.
Remove alignments with low mapping quality (default threshold: 20).
...
...

Lastly, output a summary table and the breakpoint-spanning reads.
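
The first two steps can be illustrated with a minimal Python sketch that works on plain SAM text lines. This is not SplitFusion's own code: the function names are hypothetical, the example records are made up, and treating the default of 20 as an inclusive lower bound (MAPQ >= 20 is kept) is an assumption.

```python
def has_sa_tag(fields):
    """Return True if any optional field (column 12 onward) is an SA:Z: tag."""
    return any(field.startswith("SA:Z:") for field in fields[11:])


def select_split_alignments(sam_lines, min_mapq=20):
    """Return names of reads that carry an SA tag and pass the MAPQ cutoff.

    Assumption: alignments with MAPQ >= min_mapq are kept.
    """
    kept = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        mapq = int(fields[4])  # column 5 of a SAM record is MAPQ
        if mapq >= min_mapq and has_sa_tag(fields):
            kept.append(fields[0])
    return kept


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.6",
    "read1\t0\tchr2\t100\t60\t40M36S\t*\t0\t0\tACGT\tIIII\tSA:Z:chr10,500,+,40S36M,60,0;",
    "read2\t0\tchr2\t200\t10\t40M36S\t*\t0\t0\tACGT\tIIII\tSA:Z:chr10,600,+,40S36M,10,0;",
    "read3\t0\tchr2\t300\t60\t76M\t*\t0\t0\tACGT\tIIII\tNM:i:0",
]
print(select_split_alignments(example))  # prints ['read1']
```

Here read2 is dropped for low mapping quality and read3 because it has no 'SA' tag.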

The dependency data (e.g. in 'data') should contain:

Filename	Content
Homo_sapiens_assembly19.fasta	The human genome reference sequence. Download it manually from UCSC or another official site.
panel-name.target.genes.txt	Contains a list of target gene names (e.g. ALK, ROS1). Example file name: ITFTNA.target.genes.txt
fusion.gene-exon.filter.txt	Contains recurrent breakpoints that have been identified as data accumulates but are not of interest.
fusion.gene-exon.txt	By default, SplitFusion only outputs fusions that are in-frame fusions of two different genes, or whose number of breakpoint-supporting reads exceeds a predefined threshold. This file contains known breakpoints that do not belong to either of these two kinds but are clinically relevant, e.g. "MET_exon13---MET_exon15", an exon-skipping event that forms an important therapeutic target. Many exon-skipping and alternative-splicing events are normal or of unknown clinical relevance and are thus not output by default.
fusion.partners.txt	Contains a list of known fusion partners of targets.
ENSEMBL.orientation.txt	Because the snpEff annotation lacks transcript orientation, this file provides it in two columns: orientation (+ or -) and transcript ID (ENST*).

The above files can be updated periodically and serve as a backend database that facilitates automatic filtering and output of fusion candidates.
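
The reporting rule described for fusion.gene-exon.txt and fusion.gene-exon.filter.txt can be sketched as a small decision function. This is only an illustration in Python, not the package's implementation: the function name, its parameters, the order in which the two lists are consulted, and the threshold value used in the example are all assumptions, and the GENE1 entries are made-up placeholders.

```python
def should_report(fusion, frame, num_reads, min_reads, known_relevant, known_filtered):
    """Decide whether a fusion candidate is output (hypothetical helper).

    fusion         -- name such as "KIF5B_exon15---RET_exon12"
    frame          -- "in-frame", "out-of-frame", or "NA"
    min_reads      -- predefined read threshold (the value is up to the user)
    known_relevant -- entries as in fusion.gene-exon.txt
    known_filtered -- entries as in fusion.gene-exon.filter.txt
    """
    left, right = fusion.split("---")
    gene5 = left.split("_", 1)[0]
    gene3 = right.split("_", 1)[0]

    if fusion in known_filtered:  # assumption: the filter list is checked first
        return False
    if fusion in known_relevant:  # clinically relevant known breakpoints
        return True
    if frame == "in-frame" and gene5 != gene3:  # in-frame, two different genes
        return True
    return num_reads > min_reads  # otherwise rely on read support alone


relevant = {"MET_exon13---MET_exon15"}
filtered = {"GENE1_exon2---GENE1_exon4"}  # made-up placeholder entry

print(should_report("KIF5B_exon15---RET_exon12", "in-frame", 7, 100, relevant, filtered))
print(should_report("MET_exon13---MET_exon15", "in-frame", 5, 100, relevant, filtered))
print(should_report("GENE1_exon2---GENE1_exon4", "in-frame", 500, 100, relevant, filtered))
```

The three calls print True, True, and False: an in-frame fusion of two different genes, a same-gene event rescued by the known-relevant list, and a well-supported candidate suppressed by the filter list.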

Installation

1. Installing requirements

1.1 Required tools:

The following tools are included in the 'data/database' directory of the SplitFusion package:

R
samtools
bedtools
bwa

The following must be installed separately:

java

###Installation
1. Go to http://java.com and click on the Download button

2. cd directory_path_name

3. Move the .tar.gz archive binary to the current directory.

4. tar zxvf jre-8u73-linux-i586.tar.gz

5. The Java files are installed in a directory called jre1.8.0_73 in the current directory.
In this example, it is installed in the /usr/java/jre1.8.0_73 directory.

R packages ("plyr", "data.table", "dplyr", "tidyr", "ggplot2"; "parallel" is also used but ships with base R, so it needs no installation)

> install.packages(c("plyr", "data.table", "dplyr", "tidyr", "ggplot2"))

snpEff

###Installing snpEff Database

wget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip

unzip snpEff_latest_core.zip

java -jar snpEff/snpEff.jar download hg19

1.2 Required files:

The following files are included in the 'data/' directory of the SplitFusion package:

panel-name.target.genes.txt
fusion.gene-exon.filter.txt
fusion.gene-exon.txt
fusion.partners.txt
ENSEMBL.orientation.txt

The following must be obtained separately:

- Homo_sapiens_assembly19.fasta # The human genome reference sequence; download it manually from UCSC or another official site.

2. Installing SplitFusion

git clone https://github.com/Zheng-NGS-Lab/SplitFusion.git

R CMD INSTALL SplitFusion

Run

1. Preparing input files (example):

1.1 sampleInfo (tab-separated): Sample information table with four columns: sample name (the prefix of the BAM file name), cancer type or project name (not used by the script, only for user labeling), panel name (the prefix of the panel-name.target.genes.txt file), and cpuBWA number.

AP7 Sample_ID Panel cpuBWA

example LungFusion ITFTNA 2

example LungFusion ITFTNA 2

...
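
As a rough illustration of the sampleInfo layout, the Python sketch below reads a tab-separated table with the header shown above and derives the target gene file name from the panel column. The function name is hypothetical, and the table is held in memory rather than read from a real path.

```python
import csv
import io


def read_sample_info(handle):
    """Parse a tab-separated sampleInfo table into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["cpuBWA"] = int(row["cpuBWA"])  # the cpuBWA column holds a number
        rows.append(row)
    return rows


# Same content as the example table, with explicit tab separators.
text = "AP7\tSample_ID\tPanel\tcpuBWA\nexample\tLungFusion\tITFTNA\t2\n"
samples = read_sample_info(io.StringIO(text))

for sample in samples:
    target_file = sample["Panel"] + ".target.genes.txt"
    print(sample["AP7"], target_file, sample["cpuBWA"])
# prints: example ITFTNA.target.genes.txt 2
```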

1.2 example.runInfo: Config file. Set the paths and parameters of the required tools in this file.

### Input file
SplitFusionPath="The installed library path of SplitFusion R Package/SplitFusion" ### .libPaths() command in R environment

sampleInfo="$SplitFusionPath/data/example_data/sampleInfo"

runInfo="$SplitFusionPath/data/example_data/example.runInfo"

...

2. Run SplitFusion

> library(SplitFusion)

> runSplitFusion(runInfo= "/path/example.runInfo", output= "/path/result/", sample.id="example") ### Run ?runSplitFusion for details on how to use this function.

Output

An example brief output table:

AP7	GeneExon5'---GeneExon3'	num_unique_reads	frame	Gene_Exon_cDNA_5'_3'
A01-P701	KIF5B_exon15---RET_exon12	7	in-frame	KIF5B exon15 c.1723 .NM_004521.---RET exon12 c.2138 .NM_020630.
A02-P702	EML4_intronic---ALK_exon20	9	NA	EML4 intronic c.NA .NM_001145076.---ALK exon20 c.3171 .NM_004304.
A02-P702	EML4_intronic---ALK_exon20	10	NA	EML4 intronic c.NA .NM_001145076.---ALK exon20 c.3173 .NM_004304.
A02-P702	EML4_exon4---ALK_exon20	64	in-frame	EML4 exon4 c.468 .NM_001145076.---ALK exon20 c.3171 .NM_004304.
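
A table like this is easy to post-process. The Python sketch below, which is not part of SplitFusion, sums the unique reads per sample and fusion across breakpoints, using the first four columns of the example rows above. The function name is hypothetical, and whether summing across breakpoints is meaningful for a given analysis is left to the user.

```python
import csv
import io
from collections import defaultdict


def total_reads_per_fusion(handle):
    """Sum num_unique_reads for each (sample, fusion) pair."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["AP7"], row["GeneExon5'---GeneExon3'"])
        totals[key] += int(row["num_unique_reads"])
    return dict(totals)


# First four columns of the example output table.
table = (
    "AP7\tGeneExon5'---GeneExon3'\tnum_unique_reads\tframe\n"
    "A01-P701\tKIF5B_exon15---RET_exon12\t7\tin-frame\n"
    "A02-P702\tEML4_intronic---ALK_exon20\t9\tNA\n"
    "A02-P702\tEML4_intronic---ALK_exon20\t10\tNA\n"
    "A02-P702\tEML4_exon4---ALK_exon20\t64\tin-frame\n"
)

totals = total_reads_per_fusion(io.StringIO(table))
for (sample, fusion), reads in sorted(totals.items()):
    print(sample, fusion, reads)
```

For sample A02-P702, the two EML4_intronic---ALK_exon20 rows combine to 19 reads.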

An example output fastq file for the KIF5B_exon15---RET_exon12 fusion of sample A01-P701 is:

CL100059760L2C005R002_288074
TTCCCACTTTGGATCCTCCTTTACATCATTATTTCCCACAGCAATTCCTATTTCTGCAAGGTCTTTTAGTAAAGATGC
CL100059760L2C008R017_227619
TTCCGAGGGAATTCCCACTTTGGATCCTCCTTTACATCATTATTTCCCACAGCAATTCCTATTTCTGCAAGGTCTTTT
CL100059760L2C005R026_187567
TTGACTGGAGTTCAGACGTGTGCTCTTCCGAAAGCCCTCCCCGGTGCGCATGTTGGCAGGCTCAGACAAGGCCCTGG
CL100059760L2C003R075_538109
TAGGAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGAA
CL100059760L2C006R047_14778
ATAGGAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGA
CL100059760L2C012R012_483791
GAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGAAAAA
CL100059760L2C013R006_484169
AATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGAAAAAC
CL100059760L2C017R020_507274
ATAGGAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGA
CL100059760L2C017R091_315407
GAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAGCTTGGTTCTTGGAAAAA
CL100059760L2C003R015_252083
GGGAATTCCCACTTTGGATCCTCCTATGTTGGAATTCCCTCGGAAGAACTTGGTTCTTGGAAAAACTCTAAGATCGGA
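
The reads above appear in both orientations, which a reverse-complement check makes visible. The Python sketch below is only an illustration: the helper names are hypothetical, and the 16-base motif is a subsequence picked from the listing for demonstration, not a claim about where the breakpoint lies.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orientation(seq, motif):
    """Return '+' if seq contains motif, '-' if it contains the
    reverse complement of motif, and None otherwise."""
    if motif in seq:
        return "+"
    if reverse_complement(motif) in seq:
        return "-"
    return None


# Two reads copied from the listing above.
reads = {
    "CL100059760L2C005R002_288074":
        "TTCCCACTTTGGATCCTCCTTTACATCATTATTTCCCACAGCAATTCCTATTTCTGCAAGGTCTTTTAGTAAAGATGC",
    "CL100059760L2C003R075_538109":
        "TAGGAATTGCTGTGGGAAATAATGATGTAAAGGAGGATCCAAAGTGGGAATTCCCTCGGAAGAACTTGGTTCTTGGAA",
}

motif = "GGATCCAAAGTGGGAA"  # illustrative subsequence, chosen by hand
for name, seq in reads.items():
    print(name, orientation(seq, motif))
```

The first read contains the reverse complement of the motif and the second contains the motif itself, so the script reports "-" and "+" respectively.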

Visualization

A visualization of the example output fastq for the EML4_intron6---ALK_exon20 fusion: