PTESFinder Wiki

Post-Transcriptional Exon Shuffling (PTES) Identification Pipeline

Brought to you by: graslevy

Getting Started

The latest version of PTESFinder can be downloaded from here. To run PTESFinder, ensure that Bedtools, Samtools and Bowtie (versions 1 and 2) are installed on your system. Also ensure that your system can run Java programs (minimum version: 1.6).
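
The prerequisite check above can be scripted. The following is a minimal sketch, not part of PTESFinder: the helper name `missing_tools` is hypothetical, and the executable names (`bedtools`, `samtools`, `bowtie`, `bowtie2`, `java`) are assumptions about how the tools are installed on your system. It does not verify the Java version.

```shell
# Print the name of every listed tool that is not found on PATH.
# Returns 0 when nothing is missing, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Executable names are assumptions; adjust them to your installation.
if missing_tools bedtools samtools bowtie bowtie2 java >/dev/null; then
    echo "all prerequisites found"
else
    echo "some prerequisites are missing; call missing_tools to list them"
fi
```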

Input Files:

- RNASeq data in Illumina FASTQ format
- Transcriptome annotation in BED format (can be obtained from http://genome.ucsc.edu/cgi-bin/hgTables?command=start)
- Genomic reference in FASTA format
- Pre-built bowtie2 index of the genomic reference

**Optional:**

- Previously discovered PTES transcripts and their reference sequences in FASTA format. This aids discovery when analyzing datasets with low read depths or very short read lengths.

Output Files:

- **annotated-ptes.bed:**
    - chromosome
    - start
    - stop
    - structure: PTES identified (e.g. NM_002111.6.2 - donor exon '6', acceptor exon '2')
    - raw_count: number of reads supporting the identified structure after filtering
    - strand
    - mean_flanking_can_count: mean count of flanking canonical junctions (junctions 1-2 and 6-7 for structure NM_002111.6.2)
    - sum_flanking_can_count: total count of flanking canonical junctions
    - max_can_count: maximum canonical junction count
    - mean_can_count: mean count for canonical junctions identified for the locus
    - sum_can_count: total count of all canonical junctions identified for the locus
    - canonical_expression: string describing canonical junction counts (e.g. 1.2_3.0|2.3_3.0|3.4_2.0|4.5_3.0|, where '1.2' describes the junction exon 1-exon 2 and 3.0 is the count for that junction)
- PTESReads: reads supporting each identified structure
- flanking-canonical-counts.tsv(.bed): canonical junction counts
- Optional Files:
    - genomic-filtered.sam: reads excluded by the genomic filter
    - refseq-filtered.sam: reads excluded by the transcriptomic filter
    - pid.tsv: computed percent identity of the sequence segments (left and right) on either side of the PTES junction
    - putative_fused_transcripts_reads.tsv: ids of reads supporting putative fused transcripts
    - putative_sense_antisense_reads.tsv: ids of reads supporting putative sense-antisense transcripts
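
To illustrate the `structure` and `canonical_expression` formats described above, here is a small sketch that splits them into their parts. The function names are hypothetical and are not shipped with PTESFinder; the sketch assumes the structure name always ends in `.<donor>.<acceptor>` and that each `canonical_expression` entry has the form `<exon>.<exon>_<count>` terminated by `|`, as in the examples given.

```shell
# Split a structure name such as NM_002111.6.2 into
# "<transcript> <donor exon> <acceptor exon>".
parse_structure() {
    id=$1
    acceptor=${id##*.}       # text after the last dot
    rest=${id%.*}            # everything before the last dot
    donor=${rest##*.}        # text after the second-to-last dot
    transcript=${rest%.*}    # whatever remains is the transcript id
    echo "$transcript $donor $acceptor"
}

# Print one "<exonA>-<exonB><TAB><count>" line per junction entry.
list_canonical_junctions() {
    printf '%s\n' "$1" | tr '|' '\n' | awk -F_ 'NF == 2 {
        split($1, exon, "[.]")
        printf "%s-%s\t%s\n", exon[1], exon[2], $2
    }'
}

parse_structure NM_002111.6.2    # prints: NM_002111 6 2
list_canonical_junctions '1.2_3.0|2.3_3.0|3.4_2.0|4.5_3.0|'
```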

Running PTESFinder:

$ ./PTESFinder.sh <options>

Parameters:

Mandatory:

-r sequence reads in FASTQ format
-d working directory
-t transcriptome annotation in BED format
-g genomic reference in FASTA format
-b genomic reference bowtie index
-u uniqueness (same as bowtie -m/M value parameter)
-c PTESFinder directory
-s segment size -- should be an integer less than the read length, e.g. 65 for 76bp reads

Optional:

-p PID -- should be <= 1; ideal values between 0.60 and 0.95, default: 0.85
-j junction span -- should be an even integer; ideal values between 4 and 14, default: 8
-a anchor size -- should be between 15 and 20, default: 20
-P PTES references in FASTA format
-C canonical junction references in FASTA format
-G turn off the all-filters flag and run only the genomic and junctional filters
-T turn off the all-filters flag and run only the transcriptomic and junctional filters
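
The numeric constraints listed above can be checked before starting a run. This is a hedged sketch: `check_ptes_params` is a hypothetical helper, not part of PTESFinder, and it only tests the rules stated in the parameter list (segment size below read length, even junction span, PID at most 1). The read length is supplied by the caller.

```shell
# Usage: check_ptes_params READ_LENGTH SEGMENT_SIZE JUNCTION_SPAN PID
# Prints one message per violated rule to stderr; returns 0 if all pass.
check_ptes_params() {
    read_len=$1 seg=$2 jspan=$3 pid=$4
    bad=0
    if [ "$seg" -ge "$read_len" ]; then
        echo "segment size ($seg) must be less than read length ($read_len)" >&2
        bad=1
    fi
    if [ $((jspan % 2)) -ne 0 ]; then
        echo "junction span ($jspan) must be an even integer" >&2
        bad=1
    fi
    # PID is a fraction, so compare it in awk rather than with integer tests.
    if ! awk -v p="$pid" 'BEGIN { exit !(p <= 1) }'; then
        echo "PID ($pid) must be <= 1" >&2
        bad=1
    fi
    return "$bad"
}

# Values taken from the parameter descriptions: 76bp reads, -s 65, -j 8, -p 0.85
check_ptes_params 76 65 8 0.85 && echo "parameters look consistent"
```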

Example Commands:

Single sample run:

$ ./PTESFinder.sh -r SRR364679.fastq -d test -t ucsc-hg19-refGene.bed -g ucsc.hg19.fasta -b hg19 -s 65 -u 7 -c code/
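
For several samples, the single sample command can be wrapped in a loop. The sketch below is an illustration only: `run_one` and `run_ptes_batch` are hypothetical helpers, the option values are copied from the example above, and using one working directory per sample (named after the FASTQ file) is an assumption rather than a documented requirement. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# Build the PTESFinder command for one FASTQ file and print or run it.
run_one() {
    sample=$(basename "$1" .fastq)   # assumed: working directory named after the sample
    set -- ./PTESFinder.sh -r "$1" -d "$sample" \
        -t ucsc-hg19-refGene.bed -g ucsc.hg19.fasta -b hg19 -s 65 -u 7 -c code/
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$@"                    # dry run: show the command only
    else
        "$@"
    fi
}

# Apply run_one to every FASTQ file given on the command line.
run_ptes_batch() {
    for fq in "$@"; do
        run_one "$fq"
    done
}

run_ptes_batch SRR364679.fastq
```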

StepWise Run:

PTESFinder starts by generating the transcriptome reference of the organism under study from the annotation file and genomic reference supplied by the user. Bowtie indexes are then built for the transcriptome reference. To complete the initialization phase, a ‘coordinates’ file is generated to map the positions of exons and splice sites; this file is used in later phases to build new references for putative PTES models.

Run initialization phase using:

sh $PTESFinder_path/generate_transcriptome_reference.sh $PTESFinder_path $working_directory $transcript_bed $genomic_fasta

To generate anchors:

java -cp $PTESFinder_path/PTESDiscovery.jar bio.igm.utils.init.SplitReads $reads $working_directory/

Full length FASTQ reads are mapped to the references (genomic and transcriptomic) using:

sh $PTESFinder_path/mapGenome.sh $reads $working_directory $genomic_bowtie
sh $PTESFinder_path/mapRefseq.sh $reads $working_directory $bowtie_mrna

Discovery:

Anchor reads are mapped to the transcriptomic reference using:

sh $PTESFinder_path/mapreads.sh $working_directory $bowtie_mrna $bowtie_m_value

Anchor alignments (in SAM format) are processed further to detect shuffled coordinates providing preliminary evidence of PTES; to do this run:

sh $PTESFinder_path/detect_shuffled_coordinates.sh $working_directory $PTESFinder_path

After identifying anchor pairs with shuffled coordinates, coordinates are resolved to exons and used to describe the putative PTES model:


java -cp $PTESFinder_path/PTESDiscovery.jar:$PTESFinder_path/commons-lang3-3%2e2%2e1.jar bio.igm.utils.discovery.ResolvePTESExonCoordinates $working_directory/ $coords

$coords should be the path to the coordinates file generated during initialization.

java -cp $PTESFinder_path/PTESDiscovery.jar:$PTESFinder_path/commons-lang3-3%2e2%2e1.jar bio.igm.utils.discovery.ConstructReferenceSequences $working_directory/ $transcriptomeFASTA $segment_size

$transcriptomeFASTA should be the path to transcriptome reference generated during initialization
$segment_size should be an integer number – ideally 10bp less than read length. For instance, for 76bp reads, use 50 as segment size; for 100bp, use 65 etc.

java -cp $PTESFinder_path/PTESDiscovery.jar:$PTESFinder_path/commons-lang3-3%2e2%2e1.jar bio.igm.utils.init.ReduceConstructs $working_directory/ $segment_size

Evaluation:

To evaluate generated PTES models, bowtie indexes are built for the models before remapping full length reads to the new references. Run with:

sh $PTESFinder_path/build_ptes_reference.sh $working_directory
sh $PTESFinder_path/remap_reads_to_ptes_models.sh $working_directory $reads

Filtering:

To improve confidence in the identified structures, reads mapping to PTES models are filtered using criteria designed to systematically exclude all known false positive structures. The genomic filter excludes reads with better alignments to pseudogenes and segmentally duplicated regions; the transcriptomic filter excludes reads with better alignments to canonical transcripts as a result of tandem exon duplication or high sequence similarity; the junctional filter uses two parameters (junction span and segment percent identity) to improve confidence in the alignment around the PTES junction.

To run these filters and produce final list of identified structures, use:

java -cp $PTESFinder_path/PTESDiscovery.jar:$PTESFinder_path/commons-lang3-3%2e2%2e1.jar bio.igm.utils.filter.PipelineFilter $working_directory/ $jspan $pid $all_filters $genomic $transcriptomic

Set $all_filters to 1 and the others ($genomic and $transcriptomic) to 0 to run all three filters.
Setting $genomic to 1 and $transcriptomic to 0 runs only the genomic and junctional filters; setting $transcriptomic to 1 and $genomic to 0 runs only the transcriptomic and junctional filters.
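
The flag combinations described above can be captured in a small lookup. This is a sketch under stated assumptions: `filter_flags` is a hypothetical helper, and setting $all_filters to 0 in the genomic-only and transcriptomic-only modes is inferred from the descriptions of the -G and -T options ("turn off all filters flag"), not confirmed elsewhere.

```shell
# Print "<all_filters> <genomic> <transcriptomic>" for a named filter mode.
filter_flags() {
    case "$1" in
        all)            echo "1 0 0" ;;   # all three filters
        genomic)        echo "0 1 0" ;;   # genomic + junctional (all_filters=0 is assumed)
        transcriptomic) echo "0 0 1" ;;   # transcriptomic + junctional (all_filters=0 is assumed)
        *)
            echo "unknown filter mode: $1" >&2
            return 1
            ;;
    esac
}

# The three values can then be appended to the PipelineFilter command line,
# for example: flags=$(filter_flags genomic)
filter_flags all
```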

java -cp $PTESFinder_path/PTESDiscovery.jar:$PTESFinder_path/commons-lang3-3%2e2%2e1.jar bio.igm.utils.annotate.AnnotateStructures $working_directory/exons.bed $working_directory/ptescounts.tsv

The exons.bed file is generated in the initialization phase from the transcriptome annotation provided.
ptescounts.tsv is the list of structures generated after filtering but before annotation.

java -cp $PTESFinder_path/PTESDiscovery.jar:$PTESFinder_path/commons-lang3-3%2e2%2e1.jar bio.igm.utils.annotate.AnnotateStructures $working_directory/exons.bed $working_directory/flanking-canonical-counts.tsv

Canonical junctions are also subjected to filtering and are annotated with the command above.
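
When the stepwise commands are chained in a script, each phase depends on the output of the previous one, so it helps to stop at the first failure. The wrapper below is a generic sketch, not part of PTESFinder; `run_step` is a hypothetical helper that logs a label, runs the given command, and passes its exit status through.

```shell
# Usage: run_step LABEL COMMAND [ARGS...]
# Logs the label to stderr, runs the command, and returns its exit status.
run_step() {
    label=$1
    shift
    echo "[start] $label" >&2
    if "$@"; then
        echo "[done]  $label" >&2
    else
        rc=$?
        echo "[fail]  $label (exit status $rc)" >&2
        return "$rc"
    fi
}

# Example chaining, using the placeholder variables from the steps above:
# run_step "initialization" sh "$PTESFinder_path/generate_transcriptome_reference.sh" \
#     "$PTESFinder_path" "$working_directory" "$transcript_bed" "$genomic_fasta" &&
# run_step "genome mapping" sh "$PTESFinder_path/mapGenome.sh" \
#     "$reads" "$working_directory" "$genomic_bowtie"
```

Joining the calls with `&&` means a later step runs only if every earlier step returned 0.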
