
CLASS

Liliana Florea mourisl

CLASS - Constraint-based Local Assembly and Selection of Splice variants

Described in:

Current (dynamic-programming) version:
Song, L., Sabunciyan, S., and Florea, L. (2016). CLASS2: accurate and efficient splice variant annotation from RNA-seq reads. Nucleic Acids Res. 2016 Jun 2;44(10):e98. doi: 10.1093/nar/gkw158 . Free full text

Previous ('set cover') version:
Song, L. and Florea, L. (2013). CLASS: Constrained Transcript Assembly of RNA-seq Reads.
Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing - RECOMB-SEQ 2013. BMC Bioinformatics 14(Suppl. 5), S14. Free full text

Copyright (C) 2012-2016, and GNU GPL, by Li Song, Liliana Florea

Includes portions copyright from:

lp_solve - Copyright (C) 2005, and GNU LGPL, by Michel Berkelaar, Kjell Eikland, Peter Notebaert
SAMtools - Copyright (C) 2008-2009, Genome Research Ltd, Heng Li




What is CLASS?

CLASS is a program for assembling transcripts from RNA-seq reads aligned to a genome. CLASS produces a set of transcripts in three stages. Stage 1 uses linear programming to determine a set of exons for each gene. Stage 2 builds a splice graph representation of a gene by connecting the exons (vertices) via introns (edges) extracted from spliced read alignments. Stage 3 selects a subset of the candidate transcripts encoded in the graph that can explain all the reads, using either a parsimonious (SET_COVER) or a dynamic programming optimization approach. This stage takes into account constraints derived from mate pairs and spliced alignments and, optionally, knowledge about gene structure extracted from known annotation or alignments of cDNA sequences.

Usage

Usage: perl run_class.pl [options]
Options:
      -a alignment_file (REQUIRED): the path to the alignment file (in BAM format)
      -o output_file: the file storing the output of CLASS (default: ./alignment_file_wo_extension.gtf)
      -p number_of_threads: specify the number of worker threads (default:1)
      -F f: do not report the transcripts whose abundance level is lower than f*|most expressed transcript| in a gene
      -l label: add a prefix and a "_" to the ids in the GTF file (default: not used)
      -j junction: the path to the splice junction file
      -e evidence: the path to the evidence files
      --var_rd_len: extensive variable read lengths, i.e. reads after trimming (default: no)
      --set-cover: use set cover to build transcripts from splicing graph (default: no)
      --verbose: also output the procedure of CLASS (default: no)
      --wd temporary_file_directory: the directory storing the temporary files (default: ./class_tmp)
      --clean: whether to remove the temporary files in --wd (default: no)
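
The -F option above describes a per-gene relative cutoff. A minimal Python sketch of that rule, assuming the abundances of one gene's transcripts are available as plain numbers keyed by transcript id (the function name and data layout are hypothetical and are not taken from the CLASS source):

```python
def filter_by_relative_abundance(abundances, f):
    """Keep transcripts whose abundance is at least f * the gene's maximum.

    abundances: dict mapping transcript id -> abundance for ONE gene
    (hypothetical layout, for illustration only).
    """
    if not abundances:
        return {}
    cutoff = f * max(abundances.values())
    # Transcripts strictly below the cutoff are not reported.
    return {tid: a for tid, a in abundances.items() if a >= cutoff}


gene = {"t1": 100.0, "t2": 12.0, "t3": 3.0}
kept = filter_by_relative_abundance(gene, 0.05)
print(sorted(kept))  # ['t1', 't2']
```

With f = 0.05 the cutoff is 5.0, so only the transcript with abundance 3.0 is dropped.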

Alternatively, run the programs in succession, for instance:
         mkdir sample.d; cd sample.d
         ln -s $BAMFILEPATH/accepted_hits.bam sample.bam
         samtools depth sample.bam > sample.depth
         junc sample.bam -a > sample.splice
         class ./sample -l sample -p 8 > sample.class.gtf
where $BAMFILEPATH is the path to the BAM file and 'sample' is the prefix associated with the run.
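
When the manual steps above have to be repeated for many samples, it can help to generate the command lines programmatically. The following Python sketch only builds the command strings (a dry run) and does not execute anything. It uses the same programs and flags as the example above and leaves out the mkdir/cd step; the function name and the example path are hypothetical.

```python
import shlex


def class_pipeline_commands(bam_path, prefix, threads=1):
    """Return the shell commands of the manual pipeline as strings (dry run)."""
    q = shlex.quote
    bam = f"{prefix}.bam"
    return [
        f"ln -s {q(bam_path)} {q(bam)}",
        f"samtools depth {q(bam)} > {q(prefix + '.depth')}",
        f"junc {q(bam)} -a > {q(prefix + '.splice')}",
        f"class ./{q(prefix)} -l {q(prefix)} -p {int(threads)} > {q(prefix + '.class.gtf')}",
    ]


# Example path is made up; replace it with the location of your own BAM file.
for cmd in class_pipeline_commands("/data/run1/accepted_hits.bam", "sample", threads=8):
    print(cmd)
```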

IMPORTANT: Using external gene evidence generally improved the results when used within the 'set-cover' version. You can find an 'evidence' file, consisting of spliced alignments of human EST and RefSeq mRNA sequences, here. Alignments were produced with the software ESTmapper/sim4db. When using external gene evidence, it is recommended to use the -a argument when running junc.

Input/Output

The primary input to CLASS is a set of short read alignments in BAM format, sorted by chromosome and position, for instance one produced with the program Tophat2.

CLASS requires the XS field in the BAM file to determine the strand of each spliced alignment. If the aligner does not provide the XS field (the default behavior of STAR, for example), you can use the program "addXS" included in the package to add it to the BAM file. You can run "addXS" as:

  • ./addXS reference_genome.fa < sam_file

Typical usage with a BAM file:

  • samtools view -h in.bam | ./addXS reference_genome.fa | samtools view -bS - > out.bam
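
To decide whether "addXS" is needed, one can check whether the spliced alignments already carry the tag. The Python sketch below is an illustration only and is not part of the CLASS package. It assumes SAM text lines as input (for example from samtools view -h), treats an alignment as spliced when its CIGAR string contains an N operation, and looks for an XS:A: optional field.

```python
def spliced_without_xs(sam_lines):
    """Count spliced alignments (CIGAR contains 'N') that lack an XS:A tag."""
    missing = 0
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        cigar = fields[5]
        tags = fields[11:]  # optional fields start at column 12
        if "N" in cigar and not any(t.startswith("XS:A:") for t in tags):
            missing += 1
    return missing


# Made-up records for illustration.
demo = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t100\t50\t20M500N30M\t*\t0\t0\t*\t*\tNH:i:1",
    "r2\t0\tchr1\t100\t50\t20M500N30M\t*\t0\t0\t*\t*\tNH:i:1\tXS:A:+",
    "r3\t0\tchr1\t900\t50\t50M\t*\t0\t0\t*\t*\tNH:i:1",
]
print(spliced_without_xs(demo))  # 1
```

A non-zero count suggests that the file should be passed through "addXS" before running CLASS.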

Given an alignment input x.bam, CLASS produces two intermediate data files, x.depth and x.splice, in the temporary working directory.

  • The format of the x.depth file, generated by samtools, is:
    chrom_id position #_of_reads_on_the_position

  • The format of the x.splice file, generated by 'junc', is:
    chrom_id start_intron_position end_intron_position #_of_supporting_reads strand

    NOTE: When using the '-a' argument in junc, the value #_of_supporting_reads can be negative, indicating that this splice junction is invalid.
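
The x.splice layout and the negative-support convention described above can be illustrated with a short Python reader. This is a sketch under the assumption that the five columns are whitespace-separated; the names Junction, parse_splice_line and valid_junctions are hypothetical and not part of CLASS.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    chrom: str
    start: int    # start_intron_position
    end: int      # end_intron_position
    support: int  # #_of_supporting_reads (negative means invalid)
    strand: str


def parse_splice_line(line):
    """Parse one x.splice line (assumed whitespace-separated)."""
    chrom, start, end, support, strand = line.split()[:5]
    return Junction(chrom, int(start), int(end), int(support), strand)


def valid_junctions(lines):
    """Yield junctions whose support count is not negative."""
    for line in lines:
        if not line.strip():
            continue
        junction = parse_splice_line(line)
        if junction.support >= 0:
            yield junction


# Made-up lines for illustration.
demo = ["chr1 1000 2000 12 +", "chr1 3000 4500 -3 -"]
print([j.start for j in valid_junctions(demo)])  # [1000]
```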

Lastly, to produce a set of transcripts, the program 'class' takes as input a BAM/SAM file, the depth file generated by 'samtools' and the splice junctions file generated by 'junc'. The final output, consisting of predicted transcripts, is in standard GTF format.
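
Because the output is standard GTF, it can be summarized with a few lines of Python. The sketch below counts distinct transcript ids per gene id, assuming nine tab-separated columns and key "value"; attribute pairs in the last column. The example records, including the source column value and the ids, are made up for illustration and are not actual CLASS output.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTR = re.compile(r'(\S+) "([^"]*)"')


def transcripts_per_gene(gtf_lines):
    """Return {gene_id: number of distinct transcript_ids} for GTF lines."""
    genes = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(ATTR.findall(cols[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return {gene: len(txpts) for gene, txpts in genes.items()}


demo = [
    'chr1\tCLASS\texon\t100\t200\t.\t+\t.\tgene_id "chr1.1"; transcript_id "chr1.1.1";',
    'chr1\tCLASS\texon\t100\t250\t.\t+\t.\tgene_id "chr1.1"; transcript_id "chr1.1.2";',
    'chr1\tCLASS\texon\t900\t990\t.\t-\t.\tgene_id "chr1.2"; transcript_id "chr1.2.1";',
]
print(transcripts_per_gene(demo))  # {'chr1.1': 2, 'chr1.2': 1}
```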

NOTE: 'samtools' can only read a BAM file, while 'junc' and 'class' can read both SAM and BAM formats.

IMPORTANT: Please make sure to enable the appropriate access to the directory containing the BAM file. Also, since the depth file will be created in that directory and will have one row for each base, please ensure that sufficient space is available in the work directory.

Example

Use the ./Sample/sample.bam file as an example. Running % perl run_class.pl -a ./Sample/sample.bam will return the set of transcripts in the file ./sample.gtf .

Versions

  • 1/27/2013 CLASS v1.0.1 Initial release
  • 6/16/2013 CLASS v1.0.3
  • 8/20/2013 CLASS v1.0.4
  • 9/13/2013 CLASS v1.0.5
  • 10/28/2013 CLASS v1.0.6:
    Use static linked library of lpsolve55.
    Improve the precision of intron retention.
  • 5/9/2014 CLASS v2.0.0:
    Improve the sensitivity of transcriptome assembly
    Support multi-threads
  • 2/9/2015 CLASS v2.1.0:
    Use a new wrapper to run CLASS
    Change the transcript_id field in the GTF file to "chr_id.gene_id.txpt_id".
    Add samtools package into the release
  • 5/19/2015 CLASS v2.1.1:
    Fix a bug in the wrapper
  • 10/21/2015 CLASS v2.1.2:
    Fix a serious bug introduced in v2.0.0 which breaks the predicted transcripts into fragments
  • 1/28/2016 CLASS v2.1.3:
    Introduce the "-l" option to add a prefix to the ids in the GTF file
    Add the version and command line information to the output
  • 7/29/2016 CLASS v2.1.5:
    Ignore reads with different read length by default.
    Identify more alternative 3',5'-UTR.
    Fix a bug when read depth is 0 on a position.
  • 2/8/2017 CLASS v2.1.6:
    Add XS field if the aligner does not provide such information in BAM file.
    Handle non-canonical splice sites.
    Fix a bug for non-standard chromosome names (e.g. chr14_GL000009v2_random)
    Automatically decide whether to ignore reads with different read length by default.
  • 5/17/2017 CLASS v2.1.7
    Fix more bugs that resulted in no transcripts being reported after some chromosomes.

Terms of use

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received (LICENSE.txt) a copy of the GNU General Public License along with this program; if not, you can obtain one from http://www.gnu.org/licenses/gpl.txt or by writing to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Support

Contact us at: lsong10@jhu.edu, florea@jhu.edu

