EJCCMTools Wiki

Generation of EJCCM files from SAM/BAM files

Brought to you by: birzele, luciapu

EJCCMTools Manual

Introduction

EJCCMTools is a Java software package that compresses SAM or BAM files, performs subsequent gene or transcript expression profiling, and visualizes gene read coverage profiles. Our compression scheme stores any combination of exonic regions and splice junctions that is defined in the reference transcriptome and supported by paired-end read data. Here, exonic regions are the partitions of a gene’s coding sequence that result when exons of different isoforms overlap and share a genomic region. For each aligned read in a BAM file, the mapped exonic regions and junctions are identified. These are represented as Element_IDs, which also specify their precise genomic location. The combination of Element_IDs mapped by a read is represented by a Combination_ID.
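
The notion of exonic regions can be illustrated with a small Python sketch. It is not the EJCCMTools implementation: it assumes 1-based, inclusive exon coordinates as in GTF files, and the function name exonic_regions is hypothetical. The idea is to cut the exons of all isoforms at every exon boundary, so that each resulting region is either fully contained in an exon or not part of it at all.

```python
def exonic_regions(exons):
    """Split overlapping exons into disjoint exonic regions.

    exons: iterable of (start, end) pairs, 1-based and inclusive
    (an assumption that mirrors GTF coordinates).
    Returns a sorted list of disjoint (start, end) regions.
    """
    # Work with half-open intervals [start, end + 1) to simplify the cuts.
    half_open = [(start, end + 1) for start, end in exons]
    cuts = sorted({pos for interval in half_open for pos in interval})

    regions = []
    for left, right in zip(cuts, cuts[1:]):
        # Keep a segment only if at least one exon covers it completely.
        if any(start <= left and right <= end for start, end in half_open):
            regions.append((left, right - 1))
    return regions


if __name__ == "__main__":
    # Two overlapping exons from different isoforms plus one separate exon.
    print(exonic_regions([(100, 200), (150, 250), (300, 400)]))
```

In this sketch the two overlapping exons are split into three regions (the part unique to the first exon, the shared part, and the part unique to the second exon), while the separate exon remains a single region.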

Usage EJCCMTools

Software requirements

SAMTools: The SAMTools Picard Library is needed for reading SAM/BAM files during compression.
MMSeq: MMSeq is required for performing expression profiling with EJCCMTools.

Compression Usage

java -cp EJCCMTools.jar:<sam-current.version.jar> biokit/ngs/rnaseq/compression/SAMBAMCompressionControlCenter

-compress
    This mode enables fast compression of SAM/BAM files.
    REQUIRED:
    -s [FILE]       sorted SAM/BAM file to compress.
    -g [FILE]       sorted GTF file.
    -c [FILE]       output EJCCM file.
    OPTIONS:
    -t              generate additional files  (sorted.ejccm.gz and .sorted.ejccm.gz.tbi) 
                    that can be queried with tabix.
    -a              process the mates of paired-end reads in
                    ascending order of their start positions
                    when generating the combination ID.
    -b              ignore junctions not found in reference
                    genes.
    -d              ignore read mappings where mates map to
                    different chromosomes.
    -f              ignore read mappings where mates map to
                    different genes.
    -q              take strand information into account.

-generateTabixIndex
    This mode generates a tabix index for EJCCM files.
    REQUIRED:
    -f [FILE]       EJCCM file.

Compression example

java -cp EJCCMTools.jar:sam-1.55.jar biokit/ngs/rnaseq/compression/SAMBAMCompressionControlCenter
-compress -s example.bam -g example.gtf -c example.ejccm -t

java -cp EJCCMTools.jar:sam-1.55.jar biokit/ngs/rnaseq/compression/SAMBAMCompressionControlCenter
-generateTabixIndex -f example.ejccm

Expression profiling usage

java -cp EJCCMTools.jar biokit/ngs/rnaseq/compression/EJCCMExpressionProfiler 
<mode: -runGeneExprProfiling|-runTranscriptExprProfiling|-runTranscriptExprProfilingSingleGene> 
[mode options]

-runGeneExprProfiling
    In this mode RPKM values for all specified genes are computed.
    REQUIRED:
    -f [FILE]       EJCCM file.
    -g [FILE]       GTF file.
    -o [FILE]       Output file.

-runTranscriptExprProfiling
    In this mode the expression levels for the transcripts of
    all specified genes are estimated with MMSeq.
    REQUIRED:
    -s [FILE]       sample name.
    -g [FILE]       gene-wise GTF file.
    -p [FILE]       path to MMSeq.
    -i [FILE]       input directory containing the EJCCM file to
                    use for estimating transcript expression levels.
    -o [FILE]       output directory.

-runTranscriptExprProfilingSingleGene
    In this mode the expression levels for the transcripts of
    the specified gene are estimated with MMSeq. Optionally, an HTML report
    can be generated that visualizes the average read coverage of exons and
    junctions. For this report, the two files SplicingReport.html and
    Evaluation.jpg are generated. To view the report, make sure that
    both files are located in the same directory.
    REQUIRED:
    -s [FILE]       sample name.
    -g [FILE]       gene ID of the gene of interest.
    -f [FILE]       GTF file.
    -p [FILE]       path to MMSeq.
    -i [FILE]       input directory containing the EJCCM file to
                    use for estimating transcript expression levels.
    -o [FILE]       output directory.
    OPTIONS:
    -r              generate HTML and text file reports. This option requires tabix-indexed
                    EJCCM files, which can be generated using the biokit/ngs/utils/SAMBAMCompression
                    class; for further information, refer to its usage. If you have X window
                    issues, add the Java option "-Djava.awt.headless=true":
                    java -Djava.awt.headless=true -cp biokit.jar ...
    -l [DOUBLE]     minimal isoform level. In the HTML report, only transcripts
                    with a relative expression level higher than the minimal
                    isoform level are shown. [ default: 0.0 ]
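
For reference, the RPKM values reported by -runGeneExprProfiling can be related to the standard definition (reads per kilobase of transcript per million mapped reads). The Python sketch below shows only this general formula; how EJCCMTools itself determines the gene length and the total number of mapped reads is not described here, so the parameter names are assumptions.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase per million mapped reads.

    read_count:         reads assigned to the gene
    gene_length_bp:     gene (exonic) length in base pairs (assumed input)
    total_mapped_reads: all mapped reads in the sample (assumed input)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total read count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Illustrative numbers only: 500 reads on a 2 kb gene, 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```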

Expression profiling examples

java -cp EJCCMTools.jar biokit/ngs/rnaseq/compression/EJCCMExpressionProfiler 
-runGeneExprProfiling -f example.ejccm -g example.gtf -o example.gene.expression

java -cp EJCCMTools.jar biokit/ngs/rnaseq/compression/EJCCMExpressionProfiler 
-runTranscriptExprProfiling -s example -p pathToMMSeq -g example.gtf -i testDir/ -o testDir/

java -Djava.awt.headless=true -cp EJCCMTools.jar
biokit/ngs/rnaseq/compression/EJCCMExpressionProfiler 
-runTranscriptExprProfilingSingleGene -s example -g ENSG00000092841_12 -f example.gtf 
-p pathToMMSeq -i testDir/ -o testDir/ -r

Description EJCCM files

EJCCM files are specifically designed to store the results of our compression method. As described in the introduction, during compression the covered elements, i.e. exonic regions and junctions, are identified for each aligned read in the SAM/BAM file. The covered elements are represented by Element_IDs. Finally, the Element_IDs are concatenated, separated by semicolons, into Combination_IDs, which are stored in EJCCM files together with their read counts (refer to the paper).
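
The counting step described above can be sketched in a few lines of Python. This is a simplified illustration, not the EJCCMTools code: it assumes the elements covered by each read have already been determined and are given as strings in the order in which they should appear in the combination ID. The function name count_combinations and the sample reads are hypothetical.

```python
from collections import Counter


def count_combinations(reads):
    """Count how often each combination of elements is observed.

    reads: iterable of element-ID lists, one list per aligned read
    (the order of elements within a list is taken as given).
    Returns a Counter mapping combination ID -> read count.
    """
    counts = Counter()
    for elements in reads:
        # Elements are joined with semicolons to form the combination ID.
        counts[";".join(elements)] += 1
    return counts


if __name__ == "__main__":
    reads = [
        ["chr1:80,chr1:338", "chr1:(64-80)", "chr1:(338-387)"],
        ["chr1:80,chr1:338", "chr1:(64-80)", "chr1:(338-387)"],
        ["chr1:(64-80)"],
    ]
    for combination_id, count in count_combinations(reads).items():
        print(combination_id, count, sep="\t")
```

Because identical combinations are stored once with a count, the size of the result depends on the number of distinct combinations rather than on the number of reads.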

The header consists of three lines starting with '#' that describe the input the compression was based on:
# Original SAM/BAM file: path to SAM/BAM file the compression was computed with
# Read type: paired-end|single-end
# Reference transcriptome: reference transcriptome used for read alignment in GTF format
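
A minimal Python sketch for reading these header lines is given below. It assumes that each header line has the form "# label: value" as listed above; the function name parse_header and the sample values are hypothetical.

```python
def parse_header(lines):
    """Collect the leading '#' lines of an EJCCM file into a dict.

    lines: iterable of text lines (for example an open file object).
    Reading stops at the first line that does not start with '#'.
    """
    header = {}
    for line in lines:
        if not line.startswith("#"):
            break
        # Split at the first colon only, so values may contain colons.
        label, _, value = line[1:].partition(":")
        header[label.strip()] = value.strip()
    return header


if __name__ == "__main__":
    example = [
        "# Original SAM/BAM file: /data/example.bam",
        "# Read type: paired-end",
        "# Reference transcriptome: example.gtf",
    ]
    print(parse_header(example))
```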

Columns

An EJCCM file contains 5 tab-delimited columns:

1. sequence name                : reference sequence name as written in the SAM/BAM file
2. start position combination   : (artificial) leftmost position of any element listed
                                  in the combination ID
                                  EXAMPLE: chr1:80,chr1:338;chr1:(64-80);chr1:(338-387)
                                  --> leftmost position found in combination ID: 80
3. stop position combination    : (artificial) rightmost position of any element listed
                                  in the combination ID
                                  EXAMPLE: chr1:80,chr1:338;chr1:(64-80);chr1:(338-387)
                                  --> rightmost position found in combination ID: 387
4. combination ID               : list of elements covered by the original read mappings
5. observed read count          : number of reads in the original SAM/BAM file that result in the
                                  combination ID specified in column 4

The start and stop positions in columns 2 and 3 are written to an EJCCM file so that it can be
queried quickly with tabix.
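
The column layout above can be read with a short Python sketch such as the following. It relies only on the five tab-delimited columns described in this section; the names EjccmRecord and parse_record are hypothetical, and the read count in the example line is illustrative.

```python
from typing import NamedTuple


class EjccmRecord(NamedTuple):
    seqname: str          # column 1: sequence name
    start: int            # column 2: start position of the combination
    stop: int             # column 3: stop position of the combination
    combination_id: str   # column 4: semicolon-separated element IDs
    read_count: int       # column 5: observed read count

    @property
    def elements(self):
        """Individual element IDs of the combination."""
        return self.combination_id.split(";")


def parse_record(line):
    """Parse one non-header EJCCM line into an EjccmRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, found {len(fields)}")
    seqname, start, stop, combination_id, read_count = fields
    return EjccmRecord(seqname, int(start), int(stop), combination_id, int(read_count))


if __name__ == "__main__":
    line = "chr1\t80\t387\tchr1:80,chr1:338;chr1:(64-80);chr1:(338-387)\t12\n"
    record = parse_record(line)
    print(record.seqname, record.start, record.stop, record.read_count)
    print(record.elements)
```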