ISVASE - Browse Files at SourceForge.net

The interactive file manager requires Javascript. Please enable it or use sftp or scp.
You may still browse the files here.
Name	Modified	Size
ISVASE1.1.tar.gz	2016-11-23	31.3 MB
README.txt	2016-11-23	28.6 kB
fish.pl	2016-06-19	5.8 kB
Totals: 3 Items		31.3 MB
NAME
====

ISVASE - Identification of Sequence Variant Associated with Splicing Event 

VERSION
=======

Version 1.1

History of modification:
* After calculating the raw P values for SVASE, the Benjamini-Hochberg procedure is used to control the false discovery rate (FDR).
* Identification SVASEs for known splicing events in default
* Support single-end read data
* Fix the bug for figure generation
* Recover SVASEs with less than 50& alternative allele reads


COPYRIGHT
=========

Copyright 2016 - Wanfei Liu & Qiang Lin - Beijing Institute of Genomics, Chinese Academy of Sciences.

This tool is a free package; you can redistribute it and/or modify it freely.

INTRODUCTION
============

This program can identify the sequence variant (DNA snp or RNA editing) associating with splicing event based on RNA-seq data. The recent widely used RNA-seq has became a routine method for transcriptome study. Based on the experiment design, RNA-seq can be used for transcriptome profile, differential expression analysis, splicing variantion, single nucleotide variation, fusion gene detection, long non-coding RNA analysis, RNA editing identification and others due to its excellent characteristics such as high-throughput, single base resolution, wide detection ability, low cost, high accuracy and new transcript identification. Splicing variantion is not only increased the transcriptome diversity, but also offered the possibility to cause human diseases. Some studies shown that single nucleotide variation produced by DNA mutation or RNA editing can affect the splicing process which leads to spurious splicing events. Although spurious splicing can result in disease, there is little bioinformatics tools available for identifying DNA or RNA sequence variant associated with splicing event in genome or transcriptome level. We developed ISVASE, a tool for identifying sequence variant associated with splicing event based on RNA-seq data. 

ISVASE includes six steps: (i) read alignment result (sorted/unsorted SAM file or standard input, eg. samtools view *.bam | perl ISVASE.pl [options] or gzip -d -c *.sam.gz | perl ISVASE.pl [options]) and identify splicing events; (ii) identify sequence variant using reads mapped to splicing event; (iii) compare sequence variant identified by all junction reads and target junction reads and identify the probability of association between sequence variant and splicing event in two parts, the significance of truly splicing event and the significantce of splicing event caused by sequence variant (fisher excet test, pvalue<=0.05); (iv) identify sequence variant type (DNA mutation or RNA editing) using known DNA mutation and RNA editing records (VCF for DNA mutation (dpsnp) and DARNED/RADAR records for RNA editing) (optinal); (v) filter the false splicing event based on the following criteria: 1 remove the splicing event caused by junction shift, 2 remove the false splicing event according to splicing signal, splicing score based on splicing motif sequence, and the reads number supported splicing event; (vi) print result in three levels, sequence variant level, splicing event level (junction) and reads level (SAM & BAM file).

The only necessary input files for ISVASE are genome sequence file (FASTA format) and sequence alignment file in BAM or SAM format. Sequence alignment file can be produced by popular read mapping software such as GSNAP and TopHat which can support split-read mapping. Genome sequence file can be downloaded from NCBI, UCSC and ENSEMBL. User also can provide gene annotation file (GTF, GFF, GFF3 and BED format), DNA mutation profile (VCF format, dbsnp or DNA variants from DNA-seq data) and RNA editing profile (DARNED, RADAR or RNA editing sites). 

In default, ISVASE only identify sequence variants in new splicing events for saving time. If you want to identify sequence variants in all splicing events, please use "-k yes" parameter.

ISVASE is available at https://sourceforge.net/projects/isvase/.

PREREQUISITES
=============

Before running, the following software or program are required:
-	Perl Environment (perl v5.18.4 or later (it can be lower))
-	Perl module: Getopt::Long; Text::NSP
-	R Environment (tested on R 3.1.2 or later (it can be lower))
-	samtools (tested on samtools Version: 1.2)

ISVASE have been tested on Linux.

INSTALLATION
============

This package doesn't need install. You can run it using absolute path after decompressed it.

COMMAND LINE
============
	
Run ISVASE.pl without any parameter, it will print below usage on the screen.

        Author: Wanfei Liu
        Email: <liuwf@big.ac.cn>
        Date: May 4, 2016
        Version: 1.0 version

        Introduction
        This program can identify the sequence variant (DNA snp or RNA editing) associating with splicing event based on RNA-seq data. The recent widely used RNA-seq has became a routine method for transcriptome study. Based on the experiment design, RNA-seq can be used for transcriptome profile, differential expression analysis, splicing variantion, single nucleotide variation, fusion gene detection, long non-coding RNA analysis, RNA editing identification and others due to its excellent characteristics such as high-throughput, single base resolution, wide detection ability, low cost, high accuracy and new transcript identification. Splicing variantion is not only increased the transcriptome diversity, but also offered the possibility to cause human diseases. Some studies shown that single nucleotide variation produced by DNA mutation or RNA editing can affect the splicing process which leads to spurious splicing events. Although spurious splicing can result in disease, there is little bioinformatics tools available for identifying DNA or RNA sequence variant associated with splicing event in genome or transcriptome level. We developed ISVASE, a tool for identifying sequence variant associated with splicing event based on RNA-seq data. 
        ISVASE includes six steps: (i) read alignment result (sorted/unsorted SAM file or standard input, eg. samtools view *.bam | perl ISVASE.pl [options] or gzip -d -c *.sam.gz | perl ISVASE.pl [options]) and identify splicing events; (ii) identify sequence variant using reads mapped to splicing event; (iii) compare sequence variant identified by all junction reads and target junction reads and identify the probability of association between sequence variant and splicing event in two parts, the significance of truly splicing event and the significantce of splicing event caused by sequence variant (fisher excet test, pvalue<=0.05); (iv) identify sequence variant type (DNA mutation or RNA editing) using known DNA mutation and RNA editing records (VCF for DNA mutation (dpsnp) and DARNED/RADAR records for RNA editing) (optinal); (v) filter the false splicing event based on the following criteria: 1 remove the splicing event caused by junction shift, 2 remove the false splicing event according to splicing signal, splicing score based on splicing motif sequence, and the reads number supported splicing event; (vi) print result in three levels, sequence variant level, splicing event level (junction) and reads level (SAM & BAM file).

        Usage: perl perl/ISVASE.pl -g genome_sequence(fasta) -dv database_of_dna_variant(VCF) -rv database_of_rna_editing(RADAR/DARNED) -s sam_file(SAM) -an annotation_file(GTF, GFF or BED) -k known_junction_association_identifcation(yes or no) -q minimum_quality -l minimum_read_length -d minimum_read_depth -c minimum_allele_depth -a minimum_anchor_size -p minimum_alt_proportion -f fisher_exect_test(pvalue) -o prefix_of_out_file.

           -g: the absolute path of genome sequence (FASTA format, required).
          -dv: the absolute path of known DNA variant file (VCF format, like dbsnp database, used for assign sequence variant type (DNA mutation), optional).
          -rv: the absolute path of known RNA editing file (RADAR/DARNED RNA editing records, for example, Human_AG_all_hg19_v2.txt in RADAR for human or hg19.txt in DARNED, used for assign sequence variant type (RNA editing), the files can be downloaded in http://rnaedit.com/download/ and http://darned.ucc.ie/download/ respectively, optional).
           -s: the absolute path of SAM file (SAM format, sequence alignment result for RNA-seq, it is mututally exclusive with standard input, required for SAM file or standard input).
          -an: the absolute path of gene annotation file (GTF, GFF or BED format, the known splicing events, required).
           -k: known junction association identification (yes or no, default value is no).
           -q: quality threshold for reads (default value is 30).
           -l: read length threshold (default value is 30).
           -d: read depth threshold for RNA variant (default value is 3).
           -c: allele depth threshold for RNA variant (default value is 3).
           -a: read anchor size threshold for splicing identification (default value is 8).
           -p: allele proportion threshold for RNA variant (default value is 0.1).
           -f: the cutoff pvalue for fisher exect test (default value is 0.05).
           -fs: the flanking sequence size for sequence variant (default value is 10).
           -o: the prefix of output file (required).

        Note: For false splicing event filter, we use different splice score cutoff for different splicing signal (any value for GTAG|CTAC, 0 for GCAG|CTGC and ATAC|GTAT, 6 for other splicing signal). All files should be uncompressed files except for SAM/BAM file (see Introduction). If you want run ISVASE multiple times with different parameters, please use "-o" option to assign different prefix name for output files. To filter false positive sequence variants, we use 3 as the default minimum reads depth and 3 as the minimum alt allele reads coverage, because most of false positive variants come from the low depth and low alt coverage records according to our previous study. 

INPUTFILES
==========

ISVASE need two necessary input files:
-	sequence alignment file (SAM or BAM format, the mapping result of GSNAP or TopHat).
-	Reference genome in Fasta format.

Note: User also can provide gene annotation file (GTF, GFF, GFF3 and BED format), DNA mutation profile (VCF format, dbsnp or DNA variants from DNA-seq data) and RNA editing profile (DARNED, RADAR or RNA editing sites).

OUTPUTFILES
===========

ISVASE mainly produces output files in simple tab-delimited table files.

1.	*.var (main result)
The identified sequence variants associated with splicing events.
Main fields are:
-	CHROM: chromosome name;
-	Position: nucleotide position in chromosome (1-based);
-	Ref: reference allele;
-	Alt: alternative allele;
-	RefAll: total reads supported reference allele;
-	AltAll: total reads supported alternative allele;
-	RefTarget: target junction (splicing event) reads supported reference allele;
-	AltTarget: target junction reads supported alternative allele;
-	Type: the source type of sequence variant;
-	Pvalue(Variant): p-value for sequence variant using the Fisher¡¯s exact test;
-	Pvalue(TargetJun): p-value for splicing event using the Fisher¡¯s exact test;
-	Pvalue(Association): p-value for association between sequence variant adn splicing event using the Fisher¡¯s exact test;
-	Distance2Jun: distance between sequence variant and junction breakpoint (junction start or end);
-	JunStart: junction start position;
-	JunEnd: junction end position;
-	JunType: junction type (Known, SemiNovel and Novel);

2.	*.jun (main result)
The splicing events associated with sequence variants.
Main fields are:
-	CHROM: chromosome name;
-	JunStart: junction start position;
-	JunEnd: junction end position;
-	ReadNumber: reads number of junction;
-	SpliceSignal: splice signal of junction;
-	VarScore: splice signal score using alternative allele;
-	RefScore: splice signal score using reference allele;
-	JunType: junction type (Known, SemiNovel and Novel);
-	VariantNumber: sequence variant number in junction;
-	VariantInfo: sequence variant information, formated like "Position:  Ref->Alt:RefAll,AltAll,RefTarget,AltTarget:Type:Pvalue(Variant),Pvalue(TargetJun),Pvalue(Association):Distance2Jun". If multiple sequence variant provided, seperated them by ";";

3.	*.alljunread & *.junread
The junction reads for total junctions (*.alljunread) and junctions with associated sequence variants (*.junread). This file use standard SAM format for sequence alignment file (https://samtools.github.io/hts-specs/SAMv1.pdf). The junction location was annotated for each alignment using an additional tag "SJ". For example, "SJ:Z:chr9:101802838-101806850" means this read supported the junction in chromosome 9 with location 101802838-101806850.

4.	*.alljunread.sorted.bam & *.junread.sorted.bam
The sequence alignment result for junction reads (sorted by coordinate) in BAM format. It can be easily viewed using popular software like IGV. 

5.	*.eses
The flanking sequence for ref and alt allele (upstream flanking sequence, ref/alt, and downstream flanking sequence) in FASTA format (http://zhanglab.ccmb.med.umich.edu/FASTA/). These sequence can be conveniently used to predict ESE motifs using tools like ESEfinder and Human Splicing Finder. 

Examples:
#>refAccession	CHROM	Position	SpliceSignal	Strand	Distance2Jun	Location(Exon5' or Exon3')
>ref1	chr9	76192862	GTAG	+	6	Exon3'
CCATTGAGCAAGCTCAGGTCA
>alt1	chr9	76192862	GTAG	+	6	Exon3'
CCATTGAGCAGGCTCAGGTCA

6.	*.stat
The statistics result file.

For example: 
#Splicing signal type
CTAC	4
GTAG	4

#Junction type
SemiNovel	5
Novel	3

#Variant type
Unknown	2
DNAsnp	7

#Variant change type
A->G	4
C->A	1
T->C	1
A->T	2
G->C	1

#Reads information
Total reads are 7260395.
The filtered reads are as following:
ReadQuality	303875
PairedUnmap	0
SingleUnmap	129419
PCR_duplication	0
MappingQuality	0
Multiple	1699140
Mismatch	0
ReadLength	222
AnchorSize	202198

#Junction information
Total junctions are 7136.
The filtered junctions are as following:
LowCoverage	907
NoVariant	379
KnownJunction	5842
JunctionShift	0
JunctionSignal	0

7.	*.R *.Rout and *.jpeg
*.R is the R script for drawing the characteristics of splicing events and sequence variants. *.Rout is the logfile for *.R. *.jpeg is the figure produced by *.R, which includes 8 subfigures. Users can easily modify *.R according to the original picture for producing more beautiful figure (such as using xlim=c(a,b) to limit the x-axis range).

Note: In default, the error messages or running messages are printed in screen. You can redirect the screen output into a file.

TEST & EXAMPLE
==============

ISVASE has been tested in two data, one is the sequence alignment result of RNA-seq data (only chromosome 9 in human) from PVAAS software website (http://pvaas.sourceforge.net/) while the other is RNA-seq data of a human glioblastoma cell line U87MG (SRR388226 and SRR388227 are control samples and SRR388228 and SRR388229 are ADAR knockdown samples, GEO accession no. GSE28040). For second data, the raw data was filtered by Trimmomatic and  mapped by GSNAP. All data were run in a Linux system.

When you get the ISVASE package, decompress the directory, then you can test the program by following commands (just selected one command for each step).

1. Data pre-processing
1.1	Filter
java -jar trimmomatic-0.33.jar PE -phred33 SRR388226_1.fastq SRR388226_2.fastq SRR388226_1.paired.fastq.gz SRR388226_1.unpaired.fastq.gz SRR388226_2.paired.fastq.gz SRR388226_2.unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:25 TRAILING:25 SLIDINGWINDOW:4:20 MINLEN:40 &

1.2	Mapping
gsnap -d hg19_gsnap --gunzip -a off -N 1 -A sam -t 8 --force-xs-dir --split-output=SRR388226 SRR388226_1.paired.fastq.gz SRR388226_2.paired.fastq.gz

2. SVASE identification
*Test one
nohup samtools view chr9.sorted.bam | perl ISVASE.pl -g hg19.fa -k yes -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o chr9_4 > chr9_4.nohup &
nohup samtools view chr9.sorted.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o chr9_5 > chr9_5.nohup &
*Test two
nohup samtools view SRR388226_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388226_gsnap5 > SRR388226_gsnap5.nohup &
nohup samtools view SRR388227_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388227_gsnap5 > SRR388227_gsnap5.nohup &
nohup samtools view SRR388229_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388229_gsnap5 > SRR388229_gsnap5.nohup &
nohup samtools view SRR388228_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388228_gsnap5 > SRR388228_gsnap5.nohup &
nohup samtools view SRR388228_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388228_gsnap4 -k yes > SRR388228_gsnap4.nohup &
nohup samtools view SRR388229_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388229_gsnap4 -k yes > SRR388229_gsnap4.nohup &
nohup samtools view SRR388227_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388227_gsnap4 -k yes > SRR388227_gsnap4.nohup &
nohup samtools view SRR388226_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388226_gsnap4 -k yes > SRR388226_gsnap4.nohup &


3. ISVASE result
3.1	Figures
	Figure 1 - Schematic diagram of the ISVASE software. (A) Identify splicing variants in RNA-seq data. All splicing variants can be divided into four types according to relationship between target splicing variant (red colour) and other splicing variants (from left to right): (i) unique splicing variant; (ii) splicing variants with same junction start; (iii) splicing variants with same junction end; and (iv) splicing variants with same junction start or end. (B) Identify sequence variants for each splicing variant and all related splicing variants. To handle all splicing variant types, we identify sequence variants for two parts of splicing separately. In the left part, for junctions with orange, yellow and red colour, the all related splicing variants should be three (all these junctions); however, for junctions with green and blue colour, the total junction is one (itself). Similarly, in the right part, junctions with red, green and blue colour have three all related splicing variants while junctions with orange and yellow colour only has one related junction (itself). (C) Identify associations. This step includes three significant judgements for sequence variants, junction existence and association between sequence variants and junctions, respectively. The example shown two junctions with same junction end. For junction one (top), two sequence variants are identified (left G(ref)->C(alt) and right G(ref)->A(alt)). In sequence variant significant judgement, left is filtered (p value=1) while right passes the test (p value=0.0476). In junction significant judgement and association judgement, p value of top junction is 0.0128 (significant) and 0.0070 (significant) respectively. Dashed lines represent gaps in the alignment.
	Figure 2 - The characteristics of SVASEs identified in new splicing events of test one data.
	Figure 3 - The characteristics of SVASEs identified in all splicing events of test one data.
	Figure 4 - The characteristics of SVASEs identified in new splicing events of SRR388226 data.
	Figure 5 - The characteristics of SVASEs identified in all splicing events of SRR388226 data.

3.2	Tables (only use new splicing events)
Table 1 - chr9_5.var
#CHROM	Position	Ref	Alt	RefAll	AltAll	RefTarget	AltTarget	Type	Pvalue(Variant)	Pvalue(TargetJun)	Pvalue(Association)	Distance2Jun	JunStart	JunEnd	JunType
chr9	76192862	A	G	0	87	0	16	DNAsnp	3.32734195504201e-09	1.42963260116567e-05	3.32734195504201e-09	6	76192868	76193731	SemiNovel
chr9	110210721	A	G	0	10	0	10	DNAsnp	1.08250882244691e-05	1.08250882244691e-05	1.08250882244691e-05	21	110210742	110217093	Novel
chr9	110217097	A	G	0	24	0	24	DNAsnp	6.20201122431976e-14	6.20201122431976e-14	6.20201122431976e-14	4	110210742	110217093	Novel
chr9	100823084	T	C	186	171	3	6	DNAsnp	0.0090497737556561	0.0037119189861946	0.0090497737556561	20	100808121	100823064	SemiNovel
chr9	67331056	A	T	0	12	0	6	Unknown	0.00216450216450216	0.0137299771167048	0.00216450216450216	14	67331070	67334095	SemiNovel
chr9	4676745	G	C	0	4	0	4	DNAsnp	0.0285714285714286	0.0285714285714286	0.0285714285714286	4	4676749	4677935	Novel
chr9	76192862	A	G	0	87	0	13	DNAsnp	1.92296598273177e-07	0.000150549857901352	1.92296598273177e-07	6	76192868	76218280	SemiNovel
chr9	35151594	A	T	0	4	0	4	DNAsnp	0.0285714285714286	0.0285714285714286	0.0285714285714286	6	35151600	35156093	Novel
chr9	96320999	C	A	128	30	0	30	Unknown	1.69112338921449e-17	4.0696995599202e-10	1.82357982937457e-35	1	96321000	96323394	SemiNovel

Table 2 - chr9_5.jun
#CHROM	JunStart	JunEnd	ReadNumber	SpliceSignal	VarScore	RefScore	JunType	VariantNumber	VariantInfo
chr9	76192868	76193731	16	GTAG	17.8440690631199	17.8440690631199	SemiNovel	1	76192862:A->G:0,87,0,16:DNAsnp:3.32734195504201e-09,1.42963260116567e-05,3.32734195504201e-09:6;
chr9	110210742	110217093	24	CTAC	18.066432555019	18.066432555019	Novel	2	110210721:A->G:0,10,0,10:DNAsnp:1.08250882244691e-05,1.08250882244691e-05,1.08250882244691e-05:21;110217097:A->G:0,24,0,24:DNAsnp:6.20201122431976e-14,6.20201122431976e-14,6.20201122431976e-14:4;
chr9	100808121	100823064	11	GTAG	19.9042290083212	19.9042290083212	SemiNovel	1	100823084:T->C:186,171,3,6:DNAsnp:0.0090497737556561,0.0037119189861946,0.0090497737556561:20;
chr9	67331070	67334095	8	CTAC	15.5543328221943	15.5543328221943	SemiNovel	1	67331056:A->T:0,12,0,6:Unknown:0.00216450216450216,0.0137299771167048,0.00216450216450216:14;
chr9	4676749	4677935	4	CTAC	17.6459754228348	17.6459754228348	Novel	1	4676745:G->C:0,4,0,4:DNAsnp:0.0285714285714286,0.0285714285714286,0.0285714285714286:4;
chr9	76192868	76218280	13	GTAG	7.88601403777841	7.88601403777841	SemiNovel	1	76192862:A->G:0,87,0,13:DNAsnp:1.92296598273177e-07,0.000150549857901352,1.92296598273177e-07:6;
chr9	35151600	35156093	4	CTAC	8.8748945801047	8.8748945801047	Novel	1	35151594:A->T:0,4,0,4:DNAsnp:0.0285714285714286,0.0285714285714286,0.0285714285714286:6;
chr9	96321000	96323394	30	GTAG	-9.71938299211128	-11.7552900078596	SemiNovel	1	96320999:C->A:128,30,0,30:Unknown:1.69112338921449e-17,4.0696995599202e-10,1.82357982937457e-35:1;

Table 3 - chr9_5.junread (partial)
UNC14-SN744:268:C10DDACXX:4:1305:17021:151643/2	147	chr9	4676712	56	38M1185N10M	=	4676641	-1304	AGCTGTAGCAGCCATCCTGCAACCATGATGTAACAAGCTTTATGCCAG	<FCFBB>GGGCGAGEHJHCFA+F>HEAEGJBACIGFDFB>?FFDD?B?	XF:Z:CTAC,	RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC	IH:i:1	HI:i:1	NM:i:1	XS:A:-	SJ:Z:chr9:4676749-4677935
UNC14-SN744:268:C10DDACXX:4:2304:8367:181136/1	99	chr9	4676713	56	37M1185N11M	=	4679402	2737	GCTGTAGCAGCCATCCTGCAACCATGATGTAACAAGCTTTATGCCAGG	@CCFDFFFHHHFHJIJJJJJJJIJGIIJDHIIGIIJJJJJHGIJJGG2	XF:Z:CTAC,	RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC	IH:i:1	HI:i:1	NM:i:1	XS:A:-	SJ:Z:chr9:4676749-4677935
UNC14-SN744:268:C10DDACXX:4:2307:16839:40704/2	163	chr9	4676734	56	16M1185N32M	=	4679022	2336	CCATGATGTAACAAGCTTTATGCCAGGCACTATTCAAAGCACTTTGTG	CCCFFFFFFHHHHJJJJJJJIJJJJJJJJJIJIIJJJGIIJJGIJJFH	XF:Z:CTAC,	RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC	IH:i:1	HI:i:1	NM:i:1	XS:A:-	SJ:Z:chr9:4676749-4677935
UNC14-SN744:268:C10DDACXX:4:1101:8572:139850/2	163	chr9	4676734	56	16M1185N32M	=	4679022	2336	CCATGATGTAACAAGCTTTATGCCAGGCACTATTCAAAGCACTTTGTG	CCCFFFFFGHFHHJJJJJJJJJJJJJGJJJIGIJJIJJJJJJJJJIHH	XF:Z:CTAC,	RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC	IH:i:1	HI:i:1	NM:i:1	XS:A:-	SJ:Z:chr9:4676749-4677935

Table 4 - chr9_5.stat
#Splicing signal type
CTAC	4
GTAG	4

#Junction type
SemiNovel	5
Novel	3

#Variant type
Unknown	2
DNAsnp	7

#Variant change type
A->G	4
C->A	1
T->C	1
A->T	2
G->C	1

#Reads information
Total reads are 7260395.
The filtered reads are as following:
ReadQuality	303875
PairedUnmap	0
SingleUnmap	129419
PCR_duplication	0
MappingQuality	0
Multiple	1699140
Mismatch	0
ReadLength	222
AnchorSize	202198

#Junction information
Total junctions are 7136.
The filtered junctions are as following:
LowCoverage	907
NoVariant	379
KnownJunction	5842
JunctionShift	0
JunctionSignal	0

Table 5 - chr9_5.eses (partial).
>ref1	chr9	76192862	GTAG	+	6	Exon3'
CCATTGAGCAAGCTCAGGTCA
>alt1	chr9	76192862	GTAG	+	6	Exon3'
CCATTGAGCAGGCTCAGGTCA
>ref2	chr9	110210721	CTAC	-	21	Exon5'
ACTAAAAATATCAAAAGTAGC
>alt2	chr9	110210721	CTAC	-	21	Exon5'
ACTAAAAATACCAAAAGTAGC

Table 6 - chr9_5.avinput (for annotation using ANNOVAR)
9	100823084	100823084	T	C
9	76192862	76192862	A	G
9	35151594	35151594	A	T
9	110210721	110210721	A	G
9	110217097	110217097	A	G
9	67331056	67331056	A	T
9	4676745	4676745	G	C
9	76192862	76192862	A	G
9	96320999	96320999	C	A

Table 7 - chr9_5.nohup (The log file for running ISVASE, partial).
Start time: Sat May 14 11:13:59 2016.
/lustre/liuwf/splicevariant/perl/ISVASE.pl -o chr9_5 -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -dv common_all_20160408.vcf -g /lustre/liuwf/database/ucsc/hg19/hg19.fa
Read alignment file and identify splicing events.
Processed 1000000 reads.
Processed 2000000 reads.
Processed 3000000 reads.
Processed 4000000 reads.
Processed 5000000 reads.
Processed 6000000 reads.
Processed 7000000 reads.
The splicing events have been identified: Sat May 14 11:18:23 2016.

Read gene annotation file.
The gene annotation file has been readed: Sat May 14 11:18:41 2016.

Identify the association between sequence variant and spurious splicing.
Process the chromosome chr9.
Read DNA mutation variant in chr9 (1504675): Sat May 14 11:20:36 2016.
Read RNA editing in chr9 (99381): Sat May 14 11:21:02 2016.
Read chromosome chr9 sequence: Sat May 14 11:21:27 2016.
Identifying the association between sequence variant and spurious splicing in chr9.
Knownsplice	chr9	94519841	94538023	CT	AC
Knownsplice	chr9	130145824	130147306	GT	AG
Knownsplice	chr9	88577095	88589882	GT	AG
Knownsplice	chr9	110074018	110081033	GT	AG
Knownsplice	chr9	136328677	136333043	GT	AG
......
Chromosome chr9 have been processed.
All chromosomes have been processed: Sat May 14 11:22:41 2016.

#Total reads are 7260395.
#The filtered reads are as following:
ReadQuality	303875
PairedUnmap	0
SingleUnmap	129419
PCR_duplication	0
MappingQuality	0
Multiple	1699140
Mismatch	0
ReadLength	222
AnchorSize	202198
#Total junctions are 7136.
#The filtered junctions are as following:
LowCoverage	907
NoVariant	379
KnownJunction	5842
JunctionShift	0
JunctionSignal	0
create sorted bam files and remove the temp files.
The sorted bam files are created: Sat May 14 11:23:00 2016.

Plot figures to illustrate the characteristics of sequence variants and junctions.
Figure plot are finished: Sat May 14 11:23:06 2016.

End time: Sat May 14 11:23:06 2016.

SUBPROGRAM
==========

We also provide a program named var2avinput.pl for convert *.var file to *.avinput file for annotation using ANNOVAR. 
The program extract_genename.pl can be used to extract gene list from ANNOVAR annotation result.
The program fish.pl can be used to comparison ISVASE results from different samples/data easily and quickly. 

It can get unique bait/baits (one or multiple items in column) in bait file at first (such as chromosome, position, strand, and others, or combination of mulitple items), then these bait/baits are used to search the corresponding column in fish file in two measures, existing in bait file (default) or not exiting in bait file ("-contrary" option).

CONTACT
=======

If you have any question or suggestion, please contact us.

Wanfei Liu & Qiang Lin
Email: <liuwf@big.ac.cn> & <linq@big.ac.cn>
Source: README.txt, updated 2016-11-23
ISVASE Files

identification of sequence variant associated with splicing event