NAME
====
ISVASE - Identification of Sequence Variant Associated with Splicing Event
VERSION
=======
Version 1.1
History of modification:
* After calculating the raw P values for SVASE, the Benjamini-Hochberg procedure is used to control the false discovery rate (FDR).
* Identification SVASEs for known splicing events in default
* Support single-end read data
* Fix the bug for figure generation
* Recover SVASEs with less than 50& alternative allele reads
COPYRIGHT
=========
Copyright 2016 - Wanfei Liu & Qiang Lin - Beijing Institute of Genomics, Chinese Academy of Sciences.
This tool is a free package; you can redistribute it and/or modify it freely.
INTRODUCTION
============
This program can identify the sequence variant (DNA snp or RNA editing) associating with splicing event based on RNA-seq data. The recent widely used RNA-seq has became a routine method for transcriptome study. Based on the experiment design, RNA-seq can be used for transcriptome profile, differential expression analysis, splicing variantion, single nucleotide variation, fusion gene detection, long non-coding RNA analysis, RNA editing identification and others due to its excellent characteristics such as high-throughput, single base resolution, wide detection ability, low cost, high accuracy and new transcript identification. Splicing variantion is not only increased the transcriptome diversity, but also offered the possibility to cause human diseases. Some studies shown that single nucleotide variation produced by DNA mutation or RNA editing can affect the splicing process which leads to spurious splicing events. Although spurious splicing can result in disease, there is little bioinformatics tools available for identifying DNA or RNA sequence variant associated with splicing event in genome or transcriptome level. We developed ISVASE, a tool for identifying sequence variant associated with splicing event based on RNA-seq data.
ISVASE includes six steps: (i) read alignment result (sorted/unsorted SAM file or standard input, eg. samtools view *.bam | perl ISVASE.pl [options] or gzip -d -c *.sam.gz | perl ISVASE.pl [options]) and identify splicing events; (ii) identify sequence variant using reads mapped to splicing event; (iii) compare sequence variant identified by all junction reads and target junction reads and identify the probability of association between sequence variant and splicing event in two parts, the significance of truly splicing event and the significantce of splicing event caused by sequence variant (fisher excet test, pvalue<=0.05); (iv) identify sequence variant type (DNA mutation or RNA editing) using known DNA mutation and RNA editing records (VCF for DNA mutation (dpsnp) and DARNED/RADAR records for RNA editing) (optinal); (v) filter the false splicing event based on the following criteria: 1 remove the splicing event caused by junction shift, 2 remove the false splicing event according to splicing signal, splicing score based on splicing motif sequence, and the reads number supported splicing event; (vi) print result in three levels, sequence variant level, splicing event level (junction) and reads level (SAM & BAM file).
The only necessary input files for ISVASE are genome sequence file (FASTA format) and sequence alignment file in BAM or SAM format. Sequence alignment file can be produced by popular read mapping software such as GSNAP and TopHat which can support split-read mapping. Genome sequence file can be downloaded from NCBI, UCSC and ENSEMBL. User also can provide gene annotation file (GTF, GFF, GFF3 and BED format), DNA mutation profile (VCF format, dbsnp or DNA variants from DNA-seq data) and RNA editing profile (DARNED, RADAR or RNA editing sites).
In default, ISVASE only identify sequence variants in new splicing events for saving time. If you want to identify sequence variants in all splicing events, please use "-k yes" parameter.
ISVASE is available at https://sourceforge.net/projects/isvase/.
PREREQUISITES
=============
Before running, the following software or program are required:
- Perl Environment (perl v5.18.4 or later (it can be lower))
- Perl module: Getopt::Long; Text::NSP
- R Environment (tested on R 3.1.2 or later (it can be lower))
- samtools (tested on samtools Version: 1.2)
ISVASE have been tested on Linux.
INSTALLATION
============
This package doesn't need install. You can run it using absolute path after decompressed it.
COMMAND LINE
============
Run ISVASE.pl without any parameter, it will print below usage on the screen.
Author: Wanfei Liu
Email: <liuwf@big.ac.cn>
Date: May 4, 2016
Version: 1.0 version
Introduction
This program can identify the sequence variant (DNA snp or RNA editing) associating with splicing event based on RNA-seq data. The recent widely used RNA-seq has became a routine method for transcriptome study. Based on the experiment design, RNA-seq can be used for transcriptome profile, differential expression analysis, splicing variantion, single nucleotide variation, fusion gene detection, long non-coding RNA analysis, RNA editing identification and others due to its excellent characteristics such as high-throughput, single base resolution, wide detection ability, low cost, high accuracy and new transcript identification. Splicing variantion is not only increased the transcriptome diversity, but also offered the possibility to cause human diseases. Some studies shown that single nucleotide variation produced by DNA mutation or RNA editing can affect the splicing process which leads to spurious splicing events. Although spurious splicing can result in disease, there is little bioinformatics tools available for identifying DNA or RNA sequence variant associated with splicing event in genome or transcriptome level. We developed ISVASE, a tool for identifying sequence variant associated with splicing event based on RNA-seq data.
ISVASE includes six steps: (i) read alignment result (sorted/unsorted SAM file or standard input, eg. samtools view *.bam | perl ISVASE.pl [options] or gzip -d -c *.sam.gz | perl ISVASE.pl [options]) and identify splicing events; (ii) identify sequence variant using reads mapped to splicing event; (iii) compare sequence variant identified by all junction reads and target junction reads and identify the probability of association between sequence variant and splicing event in two parts, the significance of truly splicing event and the significantce of splicing event caused by sequence variant (fisher excet test, pvalue<=0.05); (iv) identify sequence variant type (DNA mutation or RNA editing) using known DNA mutation and RNA editing records (VCF for DNA mutation (dpsnp) and DARNED/RADAR records for RNA editing) (optinal); (v) filter the false splicing event based on the following criteria: 1 remove the splicing event caused by junction shift, 2 remove the false splicing event according to splicing signal, splicing score based on splicing motif sequence, and the reads number supported splicing event; (vi) print result in three levels, sequence variant level, splicing event level (junction) and reads level (SAM & BAM file).
Usage: perl perl/ISVASE.pl -g genome_sequence(fasta) -dv database_of_dna_variant(VCF) -rv database_of_rna_editing(RADAR/DARNED) -s sam_file(SAM) -an annotation_file(GTF, GFF or BED) -k known_junction_association_identifcation(yes or no) -q minimum_quality -l minimum_read_length -d minimum_read_depth -c minimum_allele_depth -a minimum_anchor_size -p minimum_alt_proportion -f fisher_exect_test(pvalue) -o prefix_of_out_file.
-g: the absolute path of genome sequence (FASTA format, required).
-dv: the absolute path of known DNA variant file (VCF format, like dbsnp database, used for assign sequence variant type (DNA mutation), optional).
-rv: the absolute path of known RNA editing file (RADAR/DARNED RNA editing records, for example, Human_AG_all_hg19_v2.txt in RADAR for human or hg19.txt in DARNED, used for assign sequence variant type (RNA editing), the files can be downloaded in http://rnaedit.com/download/ and http://darned.ucc.ie/download/ respectively, optional).
-s: the absolute path of SAM file (SAM format, sequence alignment result for RNA-seq, it is mututally exclusive with standard input, required for SAM file or standard input).
-an: the absolute path of gene annotation file (GTF, GFF or BED format, the known splicing events, required).
-k: known junction association identification (yes or no, default value is no).
-q: quality threshold for reads (default value is 30).
-l: read length threshold (default value is 30).
-d: read depth threshold for RNA variant (default value is 3).
-c: allele depth threshold for RNA variant (default value is 3).
-a: read anchor size threshold for splicing identification (default value is 8).
-p: allele proportion threshold for RNA variant (default value is 0.1).
-f: the cutoff pvalue for fisher exect test (default value is 0.05).
-fs: the flanking sequence size for sequence variant (default value is 10).
-o: the prefix of output file (required).
Note: For false splicing event filter, we use different splice score cutoff for different splicing signal (any value for GTAG|CTAC, 0 for GCAG|CTGC and ATAC|GTAT, 6 for other splicing signal). All files should be uncompressed files except for SAM/BAM file (see Introduction). If you want run ISVASE multiple times with different parameters, please use "-o" option to assign different prefix name for output files. To filter false positive sequence variants, we use 3 as the default minimum reads depth and 3 as the minimum alt allele reads coverage, because most of false positive variants come from the low depth and low alt coverage records according to our previous study.
INPUTFILES
==========
ISVASE need two necessary input files:
- sequence alignment file (SAM or BAM format, the mapping result of GSNAP or TopHat).
- Reference genome in Fasta format.
Note: User also can provide gene annotation file (GTF, GFF, GFF3 and BED format), DNA mutation profile (VCF format, dbsnp or DNA variants from DNA-seq data) and RNA editing profile (DARNED, RADAR or RNA editing sites).
OUTPUTFILES
===========
ISVASE mainly produces output files in simple tab-delimited table files.
1. *.var (main result)
The identified sequence variants associated with splicing events.
Main fields are:
- CHROM: chromosome name;
- Position: nucleotide position in chromosome (1-based);
- Ref: reference allele;
- Alt: alternative allele;
- RefAll: total reads supported reference allele;
- AltAll: total reads supported alternative allele;
- RefTarget: target junction (splicing event) reads supported reference allele;
- AltTarget: target junction reads supported alternative allele;
- Type: the source type of sequence variant;
- Pvalue(Variant): p-value for sequence variant using the Fisher¡¯s exact test;
- Pvalue(TargetJun): p-value for splicing event using the Fisher¡¯s exact test;
- Pvalue(Association): p-value for association between sequence variant adn splicing event using the Fisher¡¯s exact test;
- Distance2Jun: distance between sequence variant and junction breakpoint (junction start or end);
- JunStart: junction start position;
- JunEnd: junction end position;
- JunType: junction type (Known, SemiNovel and Novel);
2. *.jun (main result)
The splicing events associated with sequence variants.
Main fields are:
- CHROM: chromosome name;
- JunStart: junction start position;
- JunEnd: junction end position;
- ReadNumber: reads number of junction;
- SpliceSignal: splice signal of junction;
- VarScore: splice signal score using alternative allele;
- RefScore: splice signal score using reference allele;
- JunType: junction type (Known, SemiNovel and Novel);
- VariantNumber: sequence variant number in junction;
- VariantInfo: sequence variant information, formated like "Position: Ref->Alt:RefAll,AltAll,RefTarget,AltTarget:Type:Pvalue(Variant),Pvalue(TargetJun),Pvalue(Association):Distance2Jun". If multiple sequence variant provided, seperated them by ";";
3. *.alljunread & *.junread
The junction reads for total junctions (*.alljunread) and junctions with associated sequence variants (*.junread). This file use standard SAM format for sequence alignment file (https://samtools.github.io/hts-specs/SAMv1.pdf). The junction location was annotated for each alignment using an additional tag "SJ". For example, "SJ:Z:chr9:101802838-101806850" means this read supported the junction in chromosome 9 with location 101802838-101806850.
4. *.alljunread.sorted.bam & *.junread.sorted.bam
The sequence alignment result for junction reads (sorted by coordinate) in BAM format. It can be easily viewed using popular software like IGV.
5. *.eses
The flanking sequence for ref and alt allele (upstream flanking sequence, ref/alt, and downstream flanking sequence) in FASTA format (http://zhanglab.ccmb.med.umich.edu/FASTA/). These sequence can be conveniently used to predict ESE motifs using tools like ESEfinder and Human Splicing Finder.
Examples:
#>refAccession CHROM Position SpliceSignal Strand Distance2Jun Location(Exon5' or Exon3')
>ref1 chr9 76192862 GTAG + 6 Exon3'
CCATTGAGCAAGCTCAGGTCA
>alt1 chr9 76192862 GTAG + 6 Exon3'
CCATTGAGCAGGCTCAGGTCA
6. *.stat
The statistics result file.
For example:
#Splicing signal type
CTAC 4
GTAG 4
#Junction type
SemiNovel 5
Novel 3
#Variant type
Unknown 2
DNAsnp 7
#Variant change type
A->G 4
C->A 1
T->C 1
A->T 2
G->C 1
#Reads information
Total reads are 7260395.
The filtered reads are as following:
ReadQuality 303875
PairedUnmap 0
SingleUnmap 129419
PCR_duplication 0
MappingQuality 0
Multiple 1699140
Mismatch 0
ReadLength 222
AnchorSize 202198
#Junction information
Total junctions are 7136.
The filtered junctions are as following:
LowCoverage 907
NoVariant 379
KnownJunction 5842
JunctionShift 0
JunctionSignal 0
7. *.R *.Rout and *.jpeg
*.R is the R script for drawing the characteristics of splicing events and sequence variants. *.Rout is the logfile for *.R. *.jpeg is the figure produced by *.R, which includes 8 subfigures. Users can easily modify *.R according to the original picture for producing more beautiful figure (such as using xlim=c(a,b) to limit the x-axis range).
Note: In default, the error messages or running messages are printed in screen. You can redirect the screen output into a file.
TEST & EXAMPLE
==============
ISVASE has been tested in two data, one is the sequence alignment result of RNA-seq data (only chromosome 9 in human) from PVAAS software website (http://pvaas.sourceforge.net/) while the other is RNA-seq data of a human glioblastoma cell line U87MG (SRR388226 and SRR388227 are control samples and SRR388228 and SRR388229 are ADAR knockdown samples, GEO accession no. GSE28040). For second data, the raw data was filtered by Trimmomatic and mapped by GSNAP. All data were run in a Linux system.
When you get the ISVASE package, decompress the directory, then you can test the program by following commands (just selected one command for each step).
1. Data pre-processing
1.1 Filter
java -jar trimmomatic-0.33.jar PE -phred33 SRR388226_1.fastq SRR388226_2.fastq SRR388226_1.paired.fastq.gz SRR388226_1.unpaired.fastq.gz SRR388226_2.paired.fastq.gz SRR388226_2.unpaired.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:25 TRAILING:25 SLIDINGWINDOW:4:20 MINLEN:40 &
1.2 Mapping
gsnap -d hg19_gsnap --gunzip -a off -N 1 -A sam -t 8 --force-xs-dir --split-output=SRR388226 SRR388226_1.paired.fastq.gz SRR388226_2.paired.fastq.gz
2. SVASE identification
*Test one
nohup samtools view chr9.sorted.bam | perl ISVASE.pl -g hg19.fa -k yes -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o chr9_4 > chr9_4.nohup &
nohup samtools view chr9.sorted.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o chr9_5 > chr9_5.nohup &
*Test two
nohup samtools view SRR388226_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388226_gsnap5 > SRR388226_gsnap5.nohup &
nohup samtools view SRR388227_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388227_gsnap5 > SRR388227_gsnap5.nohup &
nohup samtools view SRR388229_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388229_gsnap5 > SRR388229_gsnap5.nohup &
nohup samtools view SRR388228_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388228_gsnap5 > SRR388228_gsnap5.nohup &
nohup samtools view SRR388228_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388228_gsnap4 -k yes > SRR388228_gsnap4.nohup &
nohup samtools view SRR388229_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388229_gsnap4 -k yes > SRR388229_gsnap4.nohup &
nohup samtools view SRR388227_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388227_gsnap4 -k yes > SRR388227_gsnap4.nohup &
nohup samtools view SRR388226_gsnap.bam | perl ISVASE.pl -g hg19.fa -dv common_all_20160408.vcf -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -o SRR388226_gsnap4 -k yes > SRR388226_gsnap4.nohup &
3. ISVASE result
3.1 Figures
Figure 1 - Schematic diagram of the ISVASE software. (A) Identify splicing variants in RNA-seq data. All splicing variants can be divided into four types according to relationship between target splicing variant (red colour) and other splicing variants (from left to right): (i) unique splicing variant; (ii) splicing variants with same junction start; (iii) splicing variants with same junction end; and (iv) splicing variants with same junction start or end. (B) Identify sequence variants for each splicing variant and all related splicing variants. To handle all splicing variant types, we identify sequence variants for two parts of splicing separately. In the left part, for junctions with orange, yellow and red colour, the all related splicing variants should be three (all these junctions); however, for junctions with green and blue colour, the total junction is one (itself). Similarly, in the right part, junctions with red, green and blue colour have three all related splicing variants while junctions with orange and yellow colour only has one related junction (itself). (C) Identify associations. This step includes three significant judgements for sequence variants, junction existence and association between sequence variants and junctions, respectively. The example shown two junctions with same junction end. For junction one (top), two sequence variants are identified (left G(ref)->C(alt) and right G(ref)->A(alt)). In sequence variant significant judgement, left is filtered (p value=1) while right passes the test (p value=0.0476). In junction significant judgement and association judgement, p value of top junction is 0.0128 (significant) and 0.0070 (significant) respectively. Dashed lines represent gaps in the alignment.
Figure 2 - The characteristics of SVASEs identified in new splicing events of test one data.
Figure 3 - The characteristics of SVASEs identified in all splicing events of test one data.
Figure 4 - The characteristics of SVASEs identified in new splicing events of SRR388226 data.
Figure 5 - The characteristics of SVASEs identified in all splicing events of SRR388226 data.
3.2 Tables (only use new splicing events)
Table 1 - chr9_5.var
#CHROM Position Ref Alt RefAll AltAll RefTarget AltTarget Type Pvalue(Variant) Pvalue(TargetJun) Pvalue(Association) Distance2Jun JunStart JunEnd JunType
chr9 76192862 A G 0 87 0 16 DNAsnp 3.32734195504201e-09 1.42963260116567e-05 3.32734195504201e-09 6 76192868 76193731 SemiNovel
chr9 110210721 A G 0 10 0 10 DNAsnp 1.08250882244691e-05 1.08250882244691e-05 1.08250882244691e-05 21 110210742 110217093 Novel
chr9 110217097 A G 0 24 0 24 DNAsnp 6.20201122431976e-14 6.20201122431976e-14 6.20201122431976e-14 4 110210742 110217093 Novel
chr9 100823084 T C 186 171 3 6 DNAsnp 0.0090497737556561 0.0037119189861946 0.0090497737556561 20 100808121 100823064 SemiNovel
chr9 67331056 A T 0 12 0 6 Unknown 0.00216450216450216 0.0137299771167048 0.00216450216450216 14 67331070 67334095 SemiNovel
chr9 4676745 G C 0 4 0 4 DNAsnp 0.0285714285714286 0.0285714285714286 0.0285714285714286 4 4676749 4677935 Novel
chr9 76192862 A G 0 87 0 13 DNAsnp 1.92296598273177e-07 0.000150549857901352 1.92296598273177e-07 6 76192868 76218280 SemiNovel
chr9 35151594 A T 0 4 0 4 DNAsnp 0.0285714285714286 0.0285714285714286 0.0285714285714286 6 35151600 35156093 Novel
chr9 96320999 C A 128 30 0 30 Unknown 1.69112338921449e-17 4.0696995599202e-10 1.82357982937457e-35 1 96321000 96323394 SemiNovel
Table 2 - chr9_5.jun
#CHROM JunStart JunEnd ReadNumber SpliceSignal VarScore RefScore JunType VariantNumber VariantInfo
chr9 76192868 76193731 16 GTAG 17.8440690631199 17.8440690631199 SemiNovel 1 76192862:A->G:0,87,0,16:DNAsnp:3.32734195504201e-09,1.42963260116567e-05,3.32734195504201e-09:6;
chr9 110210742 110217093 24 CTAC 18.066432555019 18.066432555019 Novel 2 110210721:A->G:0,10,0,10:DNAsnp:1.08250882244691e-05,1.08250882244691e-05,1.08250882244691e-05:21;110217097:A->G:0,24,0,24:DNAsnp:6.20201122431976e-14,6.20201122431976e-14,6.20201122431976e-14:4;
chr9 100808121 100823064 11 GTAG 19.9042290083212 19.9042290083212 SemiNovel 1 100823084:T->C:186,171,3,6:DNAsnp:0.0090497737556561,0.0037119189861946,0.0090497737556561:20;
chr9 67331070 67334095 8 CTAC 15.5543328221943 15.5543328221943 SemiNovel 1 67331056:A->T:0,12,0,6:Unknown:0.00216450216450216,0.0137299771167048,0.00216450216450216:14;
chr9 4676749 4677935 4 CTAC 17.6459754228348 17.6459754228348 Novel 1 4676745:G->C:0,4,0,4:DNAsnp:0.0285714285714286,0.0285714285714286,0.0285714285714286:4;
chr9 76192868 76218280 13 GTAG 7.88601403777841 7.88601403777841 SemiNovel 1 76192862:A->G:0,87,0,13:DNAsnp:1.92296598273177e-07,0.000150549857901352,1.92296598273177e-07:6;
chr9 35151600 35156093 4 CTAC 8.8748945801047 8.8748945801047 Novel 1 35151594:A->T:0,4,0,4:DNAsnp:0.0285714285714286,0.0285714285714286,0.0285714285714286:6;
chr9 96321000 96323394 30 GTAG -9.71938299211128 -11.7552900078596 SemiNovel 1 96320999:C->A:128,30,0,30:Unknown:1.69112338921449e-17,4.0696995599202e-10,1.82357982937457e-35:1;
Table 3 - chr9_5.junread (partial)
UNC14-SN744:268:C10DDACXX:4:1305:17021:151643/2 147 chr9 4676712 56 38M1185N10M = 4676641 -1304 AGCTGTAGCAGCCATCCTGCAACCATGATGTAACAAGCTTTATGCCAG <FCFBB>GGGCGAGEHJHCFA+F>HEAEGJBACIGFDFB>?FFDD?B? XF:Z:CTAC, RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC IH:i:1 HI:i:1 NM:i:1 XS:A:- SJ:Z:chr9:4676749-4677935
UNC14-SN744:268:C10DDACXX:4:2304:8367:181136/1 99 chr9 4676713 56 37M1185N11M = 4679402 2737 GCTGTAGCAGCCATCCTGCAACCATGATGTAACAAGCTTTATGCCAGG @CCFDFFFHHHFHJIJJJJJJJIJGIIJDHIIGIIJJJJJHGIJJGG2 XF:Z:CTAC, RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC IH:i:1 HI:i:1 NM:i:1 XS:A:- SJ:Z:chr9:4676749-4677935
UNC14-SN744:268:C10DDACXX:4:2307:16839:40704/2 163 chr9 4676734 56 16M1185N32M = 4679022 2336 CCATGATGTAACAAGCTTTATGCCAGGCACTATTCAAAGCACTTTGTG CCCFFFFFFHHHHJJJJJJJIJJJJJJJJJIJIIJJJGIIJJGIJJFH XF:Z:CTAC, RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC IH:i:1 HI:i:1 NM:i:1 XS:A:- SJ:Z:chr9:4676749-4677935
UNC14-SN744:268:C10DDACXX:4:1101:8572:139850/2 163 chr9 4676734 56 16M1185N32M = 4679022 2336 CCATGATGTAACAAGCTTTATGCCAGGCACTATTCAAAGCACTTTGTG CCCFFFFFGHFHHJJJJJJJJJJJJJGJJJIGIJJIJJJJJJJJJIHH XF:Z:CTAC, RG:Z:120904_UNC14-SN744_0268_AC10DDACXX_4_GTGGCC IH:i:1 HI:i:1 NM:i:1 XS:A:- SJ:Z:chr9:4676749-4677935
Table 4 - chr9_5.stat
#Splicing signal type
CTAC 4
GTAG 4
#Junction type
SemiNovel 5
Novel 3
#Variant type
Unknown 2
DNAsnp 7
#Variant change type
A->G 4
C->A 1
T->C 1
A->T 2
G->C 1
#Reads information
Total reads are 7260395.
The filtered reads are as following:
ReadQuality 303875
PairedUnmap 0
SingleUnmap 129419
PCR_duplication 0
MappingQuality 0
Multiple 1699140
Mismatch 0
ReadLength 222
AnchorSize 202198
#Junction information
Total junctions are 7136.
The filtered junctions are as following:
LowCoverage 907
NoVariant 379
KnownJunction 5842
JunctionShift 0
JunctionSignal 0
Table 5 - chr9_5.eses (partial).
>ref1 chr9 76192862 GTAG + 6 Exon3'
CCATTGAGCAAGCTCAGGTCA
>alt1 chr9 76192862 GTAG + 6 Exon3'
CCATTGAGCAGGCTCAGGTCA
>ref2 chr9 110210721 CTAC - 21 Exon5'
ACTAAAAATATCAAAAGTAGC
>alt2 chr9 110210721 CTAC - 21 Exon5'
ACTAAAAATACCAAAAGTAGC
Table 6 - chr9_5.avinput (for annotation using ANNOVAR)
9 100823084 100823084 T C
9 76192862 76192862 A G
9 35151594 35151594 A T
9 110210721 110210721 A G
9 110217097 110217097 A G
9 67331056 67331056 A T
9 4676745 4676745 G C
9 76192862 76192862 A G
9 96320999 96320999 C A
Table 7 - chr9_5.nohup (The log file for running ISVASE, partial).
Start time: Sat May 14 11:13:59 2016.
/lustre/liuwf/splicevariant/perl/ISVASE.pl -o chr9_5 -rv Human_AG_all_hg19_v2.txt -an gencode.v24lift37.annotation.gff3 -dv common_all_20160408.vcf -g /lustre/liuwf/database/ucsc/hg19/hg19.fa
Read alignment file and identify splicing events.
Processed 1000000 reads.
Processed 2000000 reads.
Processed 3000000 reads.
Processed 4000000 reads.
Processed 5000000 reads.
Processed 6000000 reads.
Processed 7000000 reads.
The splicing events have been identified: Sat May 14 11:18:23 2016.
Read gene annotation file.
The gene annotation file has been readed: Sat May 14 11:18:41 2016.
Identify the association between sequence variant and spurious splicing.
Process the chromosome chr9.
Read DNA mutation variant in chr9 (1504675): Sat May 14 11:20:36 2016.
Read RNA editing in chr9 (99381): Sat May 14 11:21:02 2016.
Read chromosome chr9 sequence: Sat May 14 11:21:27 2016.
Identifying the association between sequence variant and spurious splicing in chr9.
Knownsplice chr9 94519841 94538023 CT AC
Knownsplice chr9 130145824 130147306 GT AG
Knownsplice chr9 88577095 88589882 GT AG
Knownsplice chr9 110074018 110081033 GT AG
Knownsplice chr9 136328677 136333043 GT AG
......
Chromosome chr9 have been processed.
All chromosomes have been processed: Sat May 14 11:22:41 2016.
#Total reads are 7260395.
#The filtered reads are as following:
ReadQuality 303875
PairedUnmap 0
SingleUnmap 129419
PCR_duplication 0
MappingQuality 0
Multiple 1699140
Mismatch 0
ReadLength 222
AnchorSize 202198
#Total junctions are 7136.
#The filtered junctions are as following:
LowCoverage 907
NoVariant 379
KnownJunction 5842
JunctionShift 0
JunctionSignal 0
create sorted bam files and remove the temp files.
The sorted bam files are created: Sat May 14 11:23:00 2016.
Plot figures to illustrate the characteristics of sequence variants and junctions.
Figure plot are finished: Sat May 14 11:23:06 2016.
End time: Sat May 14 11:23:06 2016.
SUBPROGRAM
==========
We also provide a program named var2avinput.pl for convert *.var file to *.avinput file for annotation using ANNOVAR.
The program extract_genename.pl can be used to extract gene list from ANNOVAR annotation result.
The program fish.pl can be used to comparison ISVASE results from different samples/data easily and quickly.
It can get unique bait/baits (one or multiple items in column) in bait file at first (such as chromosome, position, strand, and others, or combination of mulitple items), then these bait/baits are used to search the corresponding column in fish file in two measures, existing in bait file (default) or not exiting in bait file ("-contrary" option).
CONTACT
=======
If you have any question or suggestion, please contact us.
Wanfei Liu & Qiang Lin
Email: <liuwf@big.ac.cn> & <linq@big.ac.cn>