ExonFinder - Browse Files at SourceForge.net

Name	Modified	Size
Additional file 1_NovelExon.xls	2014-10-31	1.2 MB
ReadMe.txt	2014-10-29	6.9 kB
exonfinder_v2.0.tar.gz	2014-10-02	22.5 MB
exonfinder_v1.0.tar.gz	2014-10-02	22.5 MB
Totals: 4 Items		46.3 MB
ExonFinder: 
	A pipeline for deriving novel cassette exons and novel retained introns with cross-species support. 

	Programming languages: 
		C/Bash shell script/AWK

	Execution environment (recommended): 
		Platform: Linux x86-64 (kernel 2.6.32 or later; recommended distribution: Bio-Linux 6 or later versions)	
		Memory: Depends on BLAT and the target genome size (recommended: at least 4 GB).
		BLAT must be executable from any directory (i.e., it must be on the PATH). 

	Main program name: 
		NE-Extractor.sh
	
	Main program input: 
		A psl (Pattern Space Layout) alignment file produced by BLAT*.


*****************************************************************************************************************************************************

Installation/Uninstallation: 

	Type "sh INSTALL.sh" or "./INSTALL.sh" to complete the installation.
	Type "sh UNINSTALL.sh" or "./UNINSTALL.sh" to uninstall this tool. 

Initialization (needed only once, as long as the target species and its annotation remain unchanged): 

	Init_ENSTs.sh [INPUT_ENSTS] [INPUT_ENSTS_CDS_PHASE] 
	Init_Genomes.sh [GENOME] [DIR_CHROMOSOMES]

	[INPUT_ENSTS]: 
		Gene annotation (csv/tsv) of the target species downloaded from Ensembl BioMart. 
		The following 19 attributes are required, in this order: 
			Ensembl Gene ID, Ensembl Transcript ID, Chromosome Name, Gene Start (bp), Gene End (bp), 
			Strand, Transcript Start (bp), Transcript End (bp), 5' UTR Start, 5' UTR End, 
			3' UTR Start, 3' UTR End, CDS Start, CDS End, Exon Chr Start (bp), Exon Chr End (bp), 
			Gene Biotype, Transcript Biotype, Status (transcript) (or Gene name). 

	[INPUT_ENSTS_CDS_PHASE]: 
		Another gene annotation (csv/tsv) of the target species downloaded from Ensembl BioMart. 
		The following 11 attributes are required, in this order: 
			Ensembl Gene ID, Ensembl Transcript ID, Chromosome Name, Strand, Exon Chr Start (bp), 
			Exon Chr End (bp), Exon Rank in Transcript, phase, CDS Start, CDS End, Transcript Biotype.
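
	Because both annotation files must carry a fixed number of attributes, a quick column
	count can catch a wrong export before Init_ENSTs.sh is run. The sketch below is not part
	of ExonFinder: check_columns is a hypothetical helper, the file names in the usage
	comment are placeholders, and it assumes a tab-separated export (a csv export would need
	a different field separator). It only counts columns; it does not verify their order.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with ExonFinder): report every row of a
# tab-separated file whose column count differs from the expected number.
# Exit status is 0 when all rows match and 1 otherwise.
check_columns() {
    local file=$1 expected=$2
    awk -F'\t' -v n="$expected" '
        NF != n { printf "line %d: %d columns (expected %d)\n", NR, NF, n; bad = 1 }
        END     { exit bad }
    ' "$file"
}

# Usage (placeholder file names):
#   check_columns ensts.tsv 19
#   check_columns ensts_cds_phase.tsv 11
```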

	[GENOME]: 
		The whole genome (fasta) of the target species downloaded from UCSC/Ensembl database. 

	[DIR_CHROMOSOMES]: 
		The directory specified by the user for storing the target genome that will be split into several 
		files according to the chromosome names. 
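
	To illustrate what "split according to the chromosome names" can look like, here is a
	minimal awk sketch. It is not the code used by Init_Genomes.sh; the function name
	split_fasta_by_record and the "<name>.fa" output naming are assumptions made for this
	example, and the layout that ExonFinder itself expects in [DIR_CHROMOSOMES] may differ.

```shell
#!/usr/bin/env bash
# Illustration only: write each record of a multi-FASTA file to its own file,
# named after the first word of the header line (assumed to be the chromosome
# name and to contain no "/" characters).
split_fasta_by_record() {
    local genome=$1 outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        /^>/ {
            if (out) close(out)            # finish the previous record
            name = substr($1, 2)           # header word without the ">"
            out = dir "/" name ".fa"
        }
        out { print > out }                # header and sequence lines
    ' "$genome"
}

# Usage (placeholder names):
#   split_fasta_by_record genome.fa chromosomes/
```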


*****************************************************************************************************************************************************

The pipeline: 

	Step 1: Obtain candidates with cross-species EST support.

		NE-Extractor.sh [NON-TARGET_BLAT_OUTPUT] [TARGET_ENSTS] [TARGET_ENSTS_CDS_PHASE] [TARGET_DIR_CHROMOSOMES] \
		[NON-TARGET_EXPRESSED_SEQUENCES] [TARGET_CDNA_LIBRARY] [MAX_GAP_LEN] [NON-TARGET_TASK_NAME] 1

	Step 2: Obtain candidates with target-species EST support.

		NE-Extractor.sh [TARGET_BLAT_OUTPUT] [TARGET_ENSTS] [TARGET_ENSTS_CDS_PHASE] [TARGET_DIR_CHROMOSOMES] \
		[TARGET_EXPRESSED_SEQUENCES] [TARGET_CDNA_LIBRARY] [MAX_GAP_LEN] [TARGET_TASK_NAME] 0

	Step 3: Merge the candidates identified in Steps 1 and 2.

		Merge_Results.sh [NON-TARGET_TASK_NAME] [TARGET_TASK_NAME] [OPTION]
		
		[OPTION]: 
			0: Candidates with cross-species support only; 1: Candidates with target-species support
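
	The three steps can be chained in one small wrapper. The sketch below only shows how the
	arguments line up; every file name, both task labels, and the MAX_GAP_LEN value are
	placeholders invented for this example, and the run helper is hypothetical. By default
	it prints the commands without executing them, so it can be inspected safely.

```shell
#!/usr/bin/env bash
# Print a command; execute it only when DRY_RUN=0 (default: print only).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# All names and values below are placeholders, not defaults of ExonFinder.
exonfinder_pipeline() {
    local ensts=target_ensts.tsv phase=target_ensts_cds_phase.tsv
    local chr_dir=target_chromosomes cdna=target_cdna.fa
    local max_gap=10   # arbitrary example value, not a recommendation

    # Step 1: cross-species alignments, so the last argument is 1.
    run NE-Extractor.sh nontarget.psl "$ensts" "$phase" "$chr_dir" \
        nontarget_ests.fa "$cdna" "$max_gap" cross_task 1

    # Step 2: target-species alignments, so the last argument is 0.
    run NE-Extractor.sh target.psl "$ensts" "$phase" "$chr_dir" \
        target_ests.fa "$cdna" "$max_gap" target_task 0

    # Step 3: merge both task labels (option 1 chosen here as an example).
    run Merge_Results.sh cross_task target_task 1
}

# Usage:
#   exonfinder_pipeline              # print the three commands
#   DRY_RUN=0 exonfinder_pipeline    # actually run them
```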


*****************************************************************************************************************************************************

Usage of NE-Extractor.sh: 
	NE-Extractor.sh [BLAT_OUTPUT] [INPUT_ENSTS] [INPUT_ENSTS_CDS_PHASE] [DIR_CHROMOSOMES] \
	[EXPRESSED_SEQUENCES] [CDNA_LIBRARY] [MAX_GAP_LEN] [TASK_NAME] [IS_INTER_SPECIES]

	[BLAT_OUTPUT]: 
		An output file of BLAT (psl format).

	[INPUT_ENSTS]: 
		As described in the initialization steps. 
		Note: Be sure to run "Init_ENSTs.sh [INPUT_ENSTS] [INPUT_ENSTS_CDS_PHASE]" first. 

	[INPUT_ENSTS_CDS_PHASE]: 
		As described in the initialization steps. 
		Note: Be sure to run "Init_ENSTs.sh [INPUT_ENSTS] [INPUT_ENSTS_CDS_PHASE]" first. 

	[DIR_CHROMOSOMES]: 
		As described in the initialization steps. 
		Note: Be sure to run "Init_Genomes.sh [GENOME] [DIR_CHROMOSOMES]" first.

	[EXPRESSED_SEQUENCES]: 
		The expressed sequences (e.g., 454/EST reads in fasta format) which were used to obtain [BLAT_OUTPUT].

	[CDNA_LIBRARY]: 
		The cDNA (fasta) of the target species downloaded from Ensembl databases. 

	[MAX_GAP_LEN]: 
		Maximum gap length between contiguous segments aligned by BLAT. 
		Two contiguous segments separated by a gap of no more than [MAX_GAP_LEN] are 
		regarded as one segment. 
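
	The merging rule can be pictured with a small awk sketch. This is an illustration, not
	the code inside NE-Extractor.sh. It assumes that segments arrive sorted by start, as
	tab-separated 1-based inclusive "start end" pairs, and that the gap is counted as the
	number of positions strictly between two segments; merge_segments is a hypothetical name.

```shell
#!/usr/bin/env bash
# Illustration only: merge sorted segments (start<TAB>end per line on stdin)
# whenever the gap to the previous segment is at most max_gap positions.
merge_segments() {
    local max_gap=$1
    awk -v g="$max_gap" '
        NR == 1         { s = $1; e = $2; next }          # first segment
        $1 - e - 1 <= g { if ($2 > e) e = $2; next }      # small gap: extend
                        { print s "\t" e; s = $1; e = $2 } # large gap: flush
        END             { if (NR) print s "\t" e }
    '
}

# Example: the gap between 100 and 104 is 3 positions (merged when max_gap=5),
# the gap between 200 and 260 is 59 positions (kept separate).
#   printf '1\t100\n104\t200\n260\t300\n' | merge_segments 5
```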

	[TASK_NAME]: 
		A label for the current task. Spaces are not allowed. 

	[IS_INTER_SPECIES]: 
		Logical value. 
		If [BLAT_OUTPUT] is obtained from cross-species BLAT alignment, enter "1" here. Otherwise, 
		enter "0" here. 

Output file: 

	[TASK_NAME]_identified_candidates.tsv

	Columns in order:
		chr, start (1-based), end (1-based), strand, transcript ID, novel exonic length, AS type (CASSETTE or RETAIN), 
		splicing site motif, splicing site motif type (canonical/noncanonical), genomic type (3'UTR/5'UTR/CDS), 
		coordinates of flanking exons, #supporting reads, supporting reads.
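
	Given the column order above, the result file can be filtered with awk. The sketch
	below assumes that the file is tab-separated with exactly the 13 columns listed (AS type
	in column 7, #supporting reads in column 12); filter_candidates and the file name in the
	usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: keep candidates of one AS type (column 7) that have at
# least min_reads supporting reads (column 12).
filter_candidates() {
    local file=$1 as_type=$2 min_reads=$3
    awk -F'\t' -v t="$as_type" -v m="$min_reads" \
        '$7 == t && $12 + 0 >= m + 0' "$file"
}

# Usage (placeholder file name):
#   filter_candidates mytask_identified_candidates.tsv CASSETTE 2
```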

	For "start", "end", and "splicing site motif": 
		Two or more numbers/motifs separated by semicolons indicate an event with multiple cassette-on exons. 

	For "coordinates of flanking exons" (1-based): 
		strand "+": 
			upstream flanking exon 5'end, upstream flanking exon 3'end; downstream flanking exon 5'end, downstream flanking exon 3'end
		strand "-": 
			downstream flanking exon 3'end, downstream flanking exon 5'end; upstream flanking exon 3'end, upstream flanking exon 5'end
	
	For "splicing site motif" and "splicing site motif type": 
		canonical splicing site motifs: GT-AG, GC-AG, and AT-AC.
		noncanonical splicing site motifs: AT-AA, AT-AG, AT-AT, GT-AT, and GT-GG. 
		(Please refer to "Lewandowska, D., Simpson, C. G., Clark, G. P., Jennings, N. S., Barciszewska-Pacak, M., Lin, C.-F., 
		Makalowski, W., Brown, J. W.S., and Jarmolowski, A. (2004). Determinants of Plant U12-Dependent Intron Splicing Efficiency. 
		The Plant Cell, Vol. 16, 1340-1352.")
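
	The two motif lists above map directly onto a lookup. The sketch below is a hypothetical
	helper, not ExonFinder's own classifier. The "unlisted" label for motifs that appear in
	neither list is an assumption of this example; how the pipeline treats such motifs is
	not stated here.

```shell
#!/usr/bin/env bash
# Illustration only: classify a donor-acceptor motif using the two lists
# given in this ReadMe. Input is expected in upper case, e.g. "GT-AG".
classify_motif() {
    case $1 in
        GT-AG|GC-AG|AT-AC)             echo canonical ;;
        AT-AA|AT-AG|AT-AT|GT-AT|GT-GG) echo noncanonical ;;
        *)                             echo unlisted ;;   # assumption, see above
    esac
}

# Usage:
#   classify_motif GT-AG
#   classify_motif AT-AA
```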


*****************************************************************************************************************************************************

*Extension of ExonFinder using NGS short-read data: 

	Suggested workflows for transcriptome assembly: 

	1) reference-based assembler
		RNA-seq -> bowtie -> cufflinks -> gffread -> fasta

	2) de novo assembler
		RNA-seq -> Trinity -> fasta

	Next, the output file (fasta) can be mapped to the target reference genome using BLAT. 
		fasta -> blat -> psl output -> ExonFinder
	
	The fasta file derived in 1) or 2) can be used as [EXPRESSED_SEQUENCES]. 
	The output .psl file can be used as [BLAT_OUTPUT] required by NE-Extractor.sh. 
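
	The mapping step can be sketched as a single BLAT call of the basic form
	"blat database query output.psl". The wrapper below is hypothetical, the file names in
	the usage comment are placeholders, and no BLAT options are set; suitable options depend
	on the data and are not specified in this ReadMe. By default the command is only printed.

```shell
#!/usr/bin/env bash
# Illustration only: map assembled transcripts (fasta) to a genome (fasta)
# with BLAT and write a psl file. The command is printed, and executed only
# when DRY_RUN=0.
map_with_blat() {
    local genome=$1 query=$2 psl=$3
    echo "+ blat $genome $query $psl"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        blat "$genome" "$query" "$psl"
    fi
}

# Usage (placeholder names): the fasta serves as [EXPRESSED_SEQUENCES] and
# the psl as [BLAT_OUTPUT] for NE-Extractor.sh.
#   DRY_RUN=0 map_with_blat target_genome.fa assembled_transcripts.fa transcripts.psl
```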


	See the following websites for more details. 

	Trinity: 
		http://trinityrnaseq.sourceforge.net/
	Bowtie: 
		http://bowtie-bio.sourceforge.net/index.shtml
	Cufflinks:
		http://cufflinks.cbcb.umd.edu/
	gffread utility: 
		http://cufflinks.cbcb.umd.edu/gff.html
Source: ReadMe.txt, updated 2014-10-29
ExonFinder Files

A pipeline to extract novel cassette exons/retained-introns
