ExonFinder:
A pipeline for identifying novel cassette exons and novel retained introns with cross-species support.
Programming languages:
C/Bash shell script/AWK
Execution environment (recommended):
Platform: Linux x86-64 (kernel 2.6.32 or later; recommended distribution: Bio-Linux 6 or later versions)
Memory: Depending on BLAT and the target genome size (recommended: at least 4GB).
BLAT must be executable from any directory (i.e., it must be on the PATH).
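A minimal sketch of how the BLAT requirement above could be checked before running the pipeline. The helper name have_tool is hypothetical and not part of ExonFinder; the executable name "blat" is assumed to be the one used on your system.

```shell
#!/bin/sh
# Hypothetical helper (not part of ExonFinder):
# returns 0 if the named program can be found on the PATH, non-zero otherwise.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool blat; then
    echo "blat found at: $(command -v blat)"
else
    echo "blat not found on PATH; add its directory to PATH first" >&2
fi
```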
Main program name:
NE-Extractor.sh
Main program input:
A PSL (Pattern Space Layout) alignment file produced by BLAT*.
*****************************************************************************************************************************************************
Installation/Uninstallation:
Type "sh INSTALL.sh" or "./INSTALL.sh" to complete the installation.
Type "sh UNINSTALL.sh" or "./UNINSTALL.sh" to uninstall this tool.
Initialization (needed only once, as long as the target species and its annotation remain unchanged):
Init_ENSTs.sh [INPUT_ENSTS] [INPUT_ENSTS_CDS_PHASE]
Init_Genomes.sh [GENOME] [DIR_CHROMOSOMES]
[INPUT_ENSTS]:
Gene annotation (csv/tsv) of the target species downloaded from Ensembl BioMart.
Required 19 attributes (must follow the order below):
Ensembl Gene ID, Ensembl Transcript ID, Chromosome Name, Gene Start (bp), Gene End (bp),
Strand, Transcript Start (bp), Transcript End (bp), 5' UTR Start, 5' UTR End,
3' UTR Start, 3' UTR End, CDS Start, CDS End, Exon Chr Start (bp), Exon Chr End (bp),
Gene Biotype, Transcript Biotype, Status (transcript) (or Gene name).
[INPUT_ENSTS_CDS_PHASE]:
Another gene annotation (csv/tsv) of the target species downloaded from Ensembl BioMart.
Required 11 attributes (must follow the order below):
Ensembl Gene ID, Ensembl Transcript ID, Chromosome Name, Strand, Exon Chr Start (bp),
Exon Chr End (bp), Exon Rank in Transcript, phase, CDS Start, CDS End, Transcript Biotype.
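Because both annotation files must contain a fixed number of attributes (19 and 11), a quick column count can catch a wrong BioMart export early. The following is only a sketch: the function name check_columns and the file names in the usage comment are hypothetical, and the naive field split does not handle quoted separators inside csv fields.

```shell
#!/bin/sh
# Hypothetical helper (not part of ExonFinder).
# $1 = annotation file, $2 = expected number of columns,
# $3 = field separator (optional; defaults to a tab, pass "," for csv).
# Prints every line whose column count differs and returns non-zero if any is found.
check_columns() {
    sep='\t'
    if [ -n "$3" ]; then sep=$3; fi
    awk -F "$sep" -v n="$2" '
        NF != n { printf "line %d: %d columns (expected %d)\n", NR, NF, n; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Usage (placeholder file names):
#   check_columns target_ensts.tsv 19
#   check_columns target_ensts_cds_phase.csv 11 ,
```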
[GENOME]:
The whole genome (fasta) of the target species downloaded from the UCSC or Ensembl database.
[DIR_CHROMOSOMES]:
The user-specified directory for storing the target genome, which is split into separate
files according to chromosome name.
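To illustrate what a per-chromosome split of a genome fasta looks like, here is a minimal AWK-based sketch. It is not the code of Init_Genomes.sh; the function name split_genome and the "<name>.fa" output naming are assumptions made for illustration only, so use Init_Genomes.sh for the actual initialization.

```shell
#!/bin/sh
# Illustration only (not the code of Init_Genomes.sh).
# $1 = multi-sequence fasta file, $2 = output directory.
# Writes each sequence to <dir>/<name>.fa, where <name> is the header text
# up to the first whitespace (this naming scheme is an assumption).
split_genome() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        /^>/ {
            if (out != "") close(out)     # finish the previous chromosome file
            name = substr($1, 2)          # drop the leading ">"
            out = dir "/" name ".fa"
        }
        out != "" { print > out }         # header and sequence lines go to the current file
    ' "$1"
}

# Usage (placeholder names):
#   split_genome target_genome.fa target_chromosomes
```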
*****************************************************************************************************************************************************
The pipeline:
Step 1: Obtain candidates with cross-species EST support.
NE-Extractor.sh [NON-TARGET_BLAT_OUTPUT] [TARGET_ENSTS] [TARGET_ENSTS_CDS_PHASE] [TARGET_DIR_CHROMOSOMES] \
[NON-TARGET_EXPRESSED_SEQUENCES] [TARGET_CDNA_LIBRARY] [MAX_GAP_LEN] [NON-TARGET_TASK] 1
Step 2: Obtain candidates with target-species EST support.
NE-Extractor.sh [TARGET_BLAT_OUTPUT] [TARGET_ENSTS] [TARGET_ENSTS_CDS_PHASE] [TARGET_DIR_CHROMOSOMES] \
[TARGET_EXPRESSED_SEQUENCES] [TARGET_CDNA_LIBRARY] [MAX_GAP_LEN] [TARGET_TASK_NAME] 0
Step 3: Merge the candidates identified in Step 1 and 2.
Merge_Results.sh [NON-TARGET_TASK] [TARGET_TASK_NAME] [OPTION]
[OPTION]:
0: Only with cross-species support; 1: With target species support
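The three steps above could be chained in a small wrapper such as the sketch below. Every file name, both task labels, and the MAX_GAP_LEN value are placeholders chosen for illustration (the value 10 is not a recommendation), and the run helper is hypothetical. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Placeholders: replace with your own files and settings.
ENSTS=target_ensts.tsv
PHASE=target_ensts_cds_phase.tsv
CHR_DIR=target_chromosomes
CDNA=target_cdna.fa
MAX_GAP_LEN=10          # arbitrary example value

# Hypothetical helper: print a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

exonfinder_pipeline() {
    # Step 1: cross-species alignment, so the last argument is 1.
    run NE-Extractor.sh nontarget.psl "$ENSTS" "$PHASE" "$CHR_DIR" \
        nontarget_ests.fa "$CDNA" "$MAX_GAP_LEN" cross_task 1 || return 1
    # Step 2: target-species alignment, so the last argument is 0.
    run NE-Extractor.sh target.psl "$ENSTS" "$PHASE" "$CHR_DIR" \
        target_ests.fa "$CDNA" "$MAX_GAP_LEN" target_task 0 || return 1
    # Step 3: merge both task results; option 1 = with target-species support.
    run Merge_Results.sh cross_task target_task 1
}

exonfinder_pipeline
```

The same task labels given to NE-Extractor.sh in Steps 1 and 2 are passed to Merge_Results.sh in Step 3.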
*****************************************************************************************************************************************************
Usage of NE-Extractor.sh:
NE-Extractor.sh [BLAT_OUTPUT] [INPUT_ENSTS] [INPUT_ENSTS_CDS_PHASE] [DIR_CHROMOSOMES] \
[EXPRESSED_SEQUENCES] [CDNA_LIBRARY] [MAX_GAP_LEN] [TASK_NAME] [IS_INTER_SPECIES]
[BLAT_OUTPUT]:
An output file of BLAT (psl format).
[INPUT_ENSTS]:
As described in the initialization steps.
Note: Be sure to run "Init_ENSTs.sh [INPUT_ENSTS] [INPUT_ENSTS_CDS_PHASE]" first.
[INPUT_ENSTS_CDS_PHASE]:
As described in the initialization steps.
Note: Be sure to run "Init_ENSTs.sh [INPUT_ENSTS] [INPUT_ENSTS_CDS_PHASE]" first.
[DIR_CHROMOSOMES]:
As described in the initialization steps.
Note: Be sure to run "Init_Genomes.sh [GENOME] [DIR_CHROMOSOMES]" first.
[EXPRESSED_SEQUENCES]:
The expressed sequences (e.g., 454/EST reads in fasta format) that were used to obtain [BLAT_OUTPUT].
[CDNA_LIBRARY]:
The cDNA (fasta) of the target species downloaded from Ensembl databases.
[MAX_GAP_LEN]:
Maximum gap length between contiguous segments aligned by BLAT.
Two contiguous segments separated by a gap no longer than [MAX_GAP_LEN] are
treated as one segment.
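The merging rule can be sketched with AWK as follows. This is an illustration of the idea, not the code used inside NE-Extractor.sh. It assumes 1-based inclusive coordinates, input sorted by start position, and a gap length defined as (next start - previous end - 1); the function name merge_segments is hypothetical.

```shell
#!/bin/sh
# Illustration only (not the code of NE-Extractor.sh).
# stdin: one segment per line as "start<TAB>end", sorted by start.
# $1   : MAX_GAP_LEN
# Segments whose gap to the previous segment is <= MAX_GAP_LEN are merged.
merge_segments() {
    awk -v max_gap="$1" '
        NR == 1               { s = $1; e = $2; next }
        $1 - e - 1 <= max_gap { if ($2 > e) e = $2; next }   # extend current segment
                              { print s "\t" e; s = $1; e = $2 }
        END                   { if (NR > 0) print s "\t" e }
    '
}

# Example: the first two segments are 3 bases apart and are merged
# when MAX_GAP_LEN is 5; the third one is too far away.
printf '1\t100\n104\t200\n300\t400\n' | merge_segments 5
# prints:
# 1	200
# 300	400
```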
[TASK_NAME]:
A label for the current task. Spaces are not allowed.
[IS_INTER_SPECIES]:
Logical value.
If [BLAT_OUTPUT] is obtained from cross-species BLAT alignment, enter "1" here. Otherwise,
enter "0" here.
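A wrapper script might validate the last two arguments before calling NE-Extractor.sh. The sketch below encodes only the two rules stated above (no whitespace in [TASK_NAME]; [IS_INTER_SPECIES] is "0" or "1"). Both function names are hypothetical and not part of ExonFinder.

```shell
#!/bin/sh
# Hypothetical helpers (not part of ExonFinder).

# Succeeds if the task label is non-empty and contains no whitespace.
valid_task_name() {
    case "$1" in
        ''|*[[:space:]]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Succeeds only for the two accepted logical values.
valid_inter_species_flag() {
    case "$1" in
        0|1) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   valid_task_name "$TASK_NAME" || { echo "invalid task name" >&2; exit 1; }
#   valid_inter_species_flag "$IS_INTER_SPECIES" || { echo "flag must be 0 or 1" >&2; exit 1; }
```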
Output file:
[TASK_NAME]_identified_candidates.tsv
Columns in order:
chr, start (1-based), end (1-based), strand, transcript ID, novel exonic length, AS type (CASSETTE or RETAIN),
splicing site motif, splicing site motif type (canonical/noncanonical), genomic type (3'UTR/5'UTR/CDS),
coordinates of flanking exons, #supporting reads, supporting reads.
For "start", "end", and "splicing site motif":
Two or more numbers/motifs separated by semicolons indicate an event involving multiple cassette-on exons.
For "coordinates of flanking exons" (1-based):
strand "+":
upstream flanking exon 5'end, upstream flanking exon 3'end; downstream flanking exon 5'end, downstream flanking exon 3'end
strand "-":
downstream flanking exon 3'end, downstream flanking exon 5'end; upstream flanking exon 3'end, upstream flanking exon 5'end
For "splicing site motif" and "splicing site motif type":
canonical splicing site motifs: GT-AG, GC-AG, AT-AC
noncanonical splicing site motifs: AT-AA, AT-AG, AT-AT, GT-AT, and GT-GG.
(Please refer to "Lewandowska, D., Simpson, C. G., Clark, G. P., Jennings, N. S., Barciszewska-Pacak, M., Lin, C.-F.,
Makalowski, W., Brown, J. W.S., and Jarmolowski, A. (2004). Determinants of Plant U12-Dependent Intron Splicing Efficiency.
The Plant Cell, Vol. 16, 1340-1352.")
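Given the column order described above, the result file can be filtered with a short AWK command. The sketch below selects events by AS type (column 7) and a minimum number of supporting reads (column 12). It assumes the file is tab-separated, as the .tsv extension suggests; the function name filter_candidates and the file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of ExonFinder).
# $1 = [TASK_NAME]_identified_candidates.tsv
# $2 = AS type to keep (CASSETTE or RETAIN), compared with column 7
# $3 = minimum number of supporting reads, compared with column 12
filter_candidates() {
    awk -F '\t' -v type="$2" -v min="$3" \
        '$7 == type && $12 + 0 >= min + 0' "$1"
}

# Usage (placeholder file name):
#   filter_candidates mytask_identified_candidates.tsv CASSETTE 3
```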
*****************************************************************************************************************************************************
*Extension of ExonFinder using NGS short-read data:
Suggested computational processes of transcriptome assembly:
1) reference-based assembler
RNA-seq -> bowtie -> cufflinks -> gffread -> fasta
2) de novo assembler
RNA-seq -> Trinity -> fasta
Next, the output file (fasta) can be mapped with BLAT against the target reference genome.
fasta -> blat -> psl output -> ExonFinder
The fasta file derived in 1) or 2) can be used as [EXPRESSED_SEQUENCES].
The output .psl file can be used as [BLAT_OUTPUT] required by NE-Extractor.sh.
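The BLAT step usually takes the form "blat target_genome.fa transcripts.fa output.psl" (database, query, output; the file names here are placeholders). Before passing the result to NE-Extractor.sh, a quick sanity check of the .psl file can be useful. The sketch below relies on the PSL format having 21 tab-separated columns per alignment, with the first column holding a match count; the function name count_psl_alignments is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of ExonFinder).
# $1 = PSL file. Prints the number of rows that look like alignments:
# 21 tab-separated fields with a numeric first field. Header lines written
# by BLAT are skipped because their first field is not a number.
count_psl_alignments() {
    awk -F '\t' 'NF == 21 && $1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }' "$1"
}

# Usage (placeholder file name):
#   count_psl_alignments output.psl
```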
See the following websites for more details.
Trinity:
http://trinityrnaseq.sourceforge.net/
Bowtie:
http://bowtie-bio.sourceforge.net/index.shtml
cufflinks:
http://cufflinks.cbcb.umd.edu/
gffread utility:
http://cufflinks.cbcb.umd.edu/gff.html