EJCCMTools is a Java software package for compressing SAM or BAM files, subsequently profiling gene or transcript expression, and visualizing gene read coverage profiles. Our compression scheme stores any combination of exonic regions and splice junctions that is defined in the reference transcriptome and supported by paired-end read data. Here, exonic regions are the partitions of a gene’s coding sequence that arise where overlapping exons of different isoforms share a genomic region. For each aligned read in a BAM file, the mapped exonic regions and junctions are identified. These are represented as Element_IDs that simultaneously specify their precise genomic location. The combination of Element_IDs mapped by a read is represented by a Combination_ID.
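The partitioning of overlapping exons into exonic regions can be illustrated with a minimal sketch. It assumes exons are given as inclusive (start, end) pairs on one chromosome and strand; the function name `exonic_regions` and the boundary convention are hypothetical and may differ from what EJCCMTools does internally.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive start and end (assumed convention)


def exonic_regions(exons: List[Interval]) -> List[Interval]:
    """Split overlapping exons into disjoint regions at every exon boundary."""
    # Every exon start, and every position just past an exon end, is a cut point.
    cuts = sorted({s for s, _ in exons} | {e + 1 for _, e in exons})
    regions = []
    for left, right in zip(cuts, cuts[1:]):
        # Keep the segment [left, right - 1] only if at least one exon covers it.
        if any(s <= left and right - 1 <= e for s, e in exons):
            regions.append((left, right - 1))
    return regions


# Two isoforms whose exons overlap between positions 150 and 200.
print(exonic_regions([(100, 200), (150, 250)]))
```

In this toy case the shared stretch becomes its own region, flanked by one region unique to each exon; intronic gaps between exons produce no region.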
java -cp EJCCMTools.jar:<sam-current.version.jar> biokit/ngs/rnaseq/compression/SAMBAMCompressionControlCenter

-compress
    This mode enables fast compression of SAM/BAM files.
    REQUIRED:
        -s [FILE]    sorted SAM/BAM file to compress.
        -g [FILE]    sorted GTF file.
        -c [FILE]    output EJCCM file.
    OPTIONS:
        -t           generate additional files (sorted.ejccm.gz and .sorted.ejccm.gz.tbi) that can be queried with tabix.
        -a           process the mates of paired-end reads in ascending order of their start positions when generating the combination ID.
        -b           ignore junctions not found in reference genes.
        -d           ignore read mappings where mates map to different chromosomes.
        -f           ignore read mappings where mates map to different genes.
        -q           regard strand information.

-generateTabixIndex
    This mode generates a tabix index for EJCCM files.
    REQUIRED:
        -f [FILE]    EJCCM file.
java -cp EJCCMTools.jar:sam-1.55.jar biokit/ngs/rnaseq/compression/SAMBAMCompressionControlCenter -compress -s example.bam -g example.gtf -c example.ejccm -t
java -cp EJCCMTools.jar:sam-1.55.jar biokit/ngs/rnaseq/compression/SAMBAMCompressionControlCenter -generateTabixIndex -f example.ejccm
java -cp EJCCMTools.jar biokit/ngs/rnaseq/compression/EJCCMExpressionProfiler <mode: -runGeneExprProfiling|-runTranscriptExprProfiling|-runTranscriptExprProfilingSingleGene> [mode options]

-runGeneExprProfiling
    In this mode, RPKM values for all specified genes are computed.
    REQUIRED:
        -f [FILE]    EJCCM file.
        -g [FILE]    GTF file.
        -o [FILE]    Output file.

-runTranscriptExprProfiling
    In this mode, the expression levels for the transcripts of all specified genes are estimated with MMSeq.
    REQUIRED:
        -s [FILE]    sample name.
        -g [FILE]    gene-wise GTF file.
        -p [FILE]    path to MMSeq.
        -i [FILE]    input directory containing the EJCCM file to use for estimating transcript expression levels.
        -o [FILE]    output directory.

-runTranscriptExprProfilingSingleGene
    In this mode, the expression levels for the transcripts of the specified gene are estimated with MMSeq. Optionally, an HTML report can be generated that visualizes the average read coverage of exons and junctions. For this report, the two files SplicingReport.html and Evaluation.jpg are generated. To view the report, please make sure that the two files are located in the same directory.
    REQUIRED:
        -s [FILE]    sample name.
        -g [FILE]    gene ID of the gene of interest.
        -f [FILE]    GTF file.
        -p [FILE]    path to MMSeq.
        -i [FILE]    input directory containing the EJCCM file to use for estimating transcript expression levels.
        -o [FILE]    output directory.
    OPTIONS:
        -r           generate HTML and text file reports. This mode requires tabix-indexed EJCCM files. They can be generated using the biokit/ngs/utils/SAMBAMCompression class. For further information, please refer to the respective usage. Please add the Java option "-Djava.awt.headless=true" if you have X window issues:
                     java -Djava.awt.headless=true -cp biokit.jar ...
        -l [DOUBLE]  minimal isoform level. In the HTML report, only transcripts with a relative expression level higher than the minimal isoform level are shown. [ default: 0.0 ]
java -cp EJCCMTools.jar biokit/ngs/rnaseq/compression/EJCCMExpressionProfiler -runGeneExprProfiling -f example.ejccm -g example.gtf -o example.gene.expression
java -cp EJCCMTools.jar biokit/ngs/rnaseq/compression/EJCCMExpressionProfiler -runTranscriptExprProfiling -s example -p pathToMMSeq -g example.gtf -i testDir/ -o testDir/
java -Djava.awt.headless=true -cp EJCCMTools.jar biokit/ngs/rnaseq/compression/EJCCMExpressionProfiler -runTranscriptExprProfilingSingleGene -s example -g ENSG00000092841_12 -f example.gtf -p pathToMMSeq -i testDir/ -o testDir/ -r
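The -runGeneExprProfiling mode reports RPKM values. For readers unfamiliar with the unit, the standard definition (reads per kilobase of gene length per million mapped reads) can be sketched as follows. How EJCCMTools derives the gene length and the total read count is not stated here, so the function name `rpkm`, its parameters, and the numbers in the example are purely illustrative assumptions.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Standard RPKM: reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers: 500 reads on a 2,000 bp gene in a library of 10 million reads.
print(rpkm(500, 2000, 10_000_000))
```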
EJCCM files are specifically designed to store the results of our compression method. As described in the introduction, during compression the covered elements, i.e. exonic regions and junctions, are identified for each aligned read in the SAM/BAM file. The covered elements are represented by element IDs. Finally, the element IDs are concatenated, separated by semicolons, into combination IDs, which are stored in EJCCM files together with their read counts (refer to the paper).
The header consists of three lines starting with '#', characterizing what the compression was based on:
# Original SAM/BAM file: path to SAM/BAM file the compression was computed with
# Read type: paired-end|single-end
# Reference transcriptome: reference transcriptome used for read alignment in GTF format
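Reading the three header lines can be sketched as below. This is a minimal example that assumes each header line has the form "# key: value" as listed above; the function name `read_ejccm_header` and the file paths in the sample are hypothetical.

```python
from typing import Dict, Iterable


def read_ejccm_header(lines: Iterable[str]) -> Dict[str, str]:
    """Collect the leading '#' lines of an EJCCM file as key/value pairs."""
    header: Dict[str, str] = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first data line
        # Split at the first colon only, so values may themselves contain colons.
        key, _, value = line[1:].partition(":")
        header[key.strip()] = value.strip()
    return header


sample = [
    "# Original SAM/BAM file: /data/example.bam\n",    # hypothetical path
    "# Read type: paired-end\n",
    "# Reference transcriptome: /data/example.gtf\n",  # hypothetical path
    "chr1\t80\t387\tchr1:80,chr1:338;chr1:(64-80);chr1:(338-387)\t12\n",
]
print(read_ejccm_header(sample))
```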
An EJCCM file contains 5 tab-delimited columns:
1. sequence name : the reference name as written in the SAM/BAM file
2. start position combination : (artificial) leftmost position of any element listed in the combination ID
   EXAMPLE: chr1:80,chr1:338;chr1:(64-80);chr1:(338-387) --> leftmost position found in combination ID: 80.
3. stop position combination : (artificial) rightmost position of any element listed in the combination ID
   EXAMPLE: chr1:80,chr1:338;chr1:(64-80);chr1:(338-387) --> rightmost position found in combination ID: 387.
4. combination ID : list of elements covered by the original read mappings
5. observed read count : number of reads in the original SAM/BAM file that result in the combination ID specified in column 4
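The five columns can be read with a short parser such as the sketch below. The element syntax is inferred from the example combination ID only: "chr1:(64-80)" is taken to be an exonic region and "chr1:80,chr1:338" a junction. That interpretation, all type and function names, and the read count of 12 in the sample line are assumptions made for illustration.

```python
import re
from typing import List, NamedTuple, Union

# Assumed exonic-region form, inferred from the example: chr1:(64-80)
_REGION = re.compile(r"^(.+):\((\d+)-(\d+)\)$")


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


class Junction(NamedTuple):
    chrom_a: str
    pos_a: int
    chrom_b: str
    pos_b: int


class Record(NamedTuple):
    seqname: str
    start: int
    stop: int
    elements: List[Union[Region, Junction]]
    read_count: int


def parse_element(text: str) -> Union[Region, Junction]:
    """Turn one element ID into a Region or a Junction."""
    match = _REGION.match(text)
    if match:
        chrom, start, end = match.groups()
        return Region(chrom, int(start), int(end))
    # Assumed junction form: two chrom:position pairs joined by a comma.
    first, second = text.split(",")
    chrom_a, pos_a = first.rsplit(":", 1)
    chrom_b, pos_b = second.rsplit(":", 1)
    return Junction(chrom_a, int(pos_a), chrom_b, int(pos_b))


def parse_record(line: str) -> Record:
    """Parse one tab-delimited EJCCM data line into its five columns."""
    seqname, start, stop, combination_id, count = line.rstrip("\n").split("\t")
    elements = [parse_element(e) for e in combination_id.split(";")]
    return Record(seqname, int(start), int(stop), elements, int(count))


line = "chr1\t80\t387\tchr1:80,chr1:338;chr1:(64-80);chr1:(338-387)\t12\n"
print(parse_record(line))
```

The start and stop columns are taken from the file as written rather than recomputed from the elements, so the parser does not depend on how the tool derives them.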
The start and stop positions in columns 2 and 3 are written to the EJCCM file so that it can be
queried quickly with tabix.
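What such a region query returns can be illustrated without tabix. The sketch below is a plain linear scan over uncompressed lines, not the indexed lookup that tabix performs, and it assumes columns 2 and 3 are inclusive coordinates. The function name `query_ejccm` and the sample records are hypothetical.

```python
from typing import Iterable, Iterator, List


def query_ejccm(lines: Iterable[str], seqname: str, start: int, end: int) -> Iterator[List[str]]:
    """Yield the fields of every record whose [start, stop] span overlaps the query."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] != seqname:
            continue
        rec_start, rec_stop = int(fields[1]), int(fields[2])
        # Two inclusive intervals overlap if each one starts before the other ends.
        if rec_start <= end and rec_stop >= start:
            yield fields


records = [
    "# Read type: paired-end\n",
    "chr1\t80\t387\tchr1:80,chr1:338;chr1:(64-80);chr1:(338-387)\t12\n",
    "chr1\t500\t560\tchr1:(500-560)\t3\n",
    "chr2\t90\t120\tchr2:(90-120)\t7\n",
]
for hit in query_ejccm(records, "chr1", 300, 400):
    print(hit[3], hit[4])
```

With an index, the same overlap test is answered without reading the whole file, which is why the two position columns are stored explicitly.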