Menu

Home

Devon Ryan

Welcome to Bison, "Bisulfite alignment on nodes of a cluster". Bison is intended for users with bisulfite-sequencing data (either whole-genome or RRBS) who also have access to a computer cluster. Bison accepts raw fastq input and will output an aligned BAM file including strand and methylation information.

Installation

The following are required for compiling and running bison:

  1. MPI must be installed and functional on all nodes upon which bison is to be run. Any MPI implementation should work, please submit a bug report if you find an MPI implementation that causes problems.
  2. SAMtools, preferably the most recent version. HTSlib may work as well, but is currently untested.
    • The samtools executables are actually not needed. Bison uses the samtools header files (*.h) and the libbam.a library. It's probably simplest to just copy these to a convenient location. The header files can be removed after compilation.
  3. Bowtie2. The executable must be in your shell's PATH.
  4. zcat and gzip must also be in your PATH, though this will normally be the case

The normal installation steps are as follows:

  1. Download the source distribution
  2. Unpack the files (e.g. tar zxf bison-0.1.0.tgz)
  3. Edit the Makefile, if needed. It's important that the linker is told where the MPI and SAMtools libraries are.
  4. Type "make"
  5. Type "make install" (the installation path can be changed in the Makefile)

Usage

There are three basic steps involved in using Bison. One must first index the genome against which reads will be aligned. This will usually only be done once and the resulting index files kept. Second, reads are aligned against the index to create a BAM file. Finally, methylation information can be extracted for downstream analysis.

Indexing

bison_index [OPTIONS] /path/to/directory/with/fasta/file(s)/

Aside from "-h", which displays a help message, all options are passed directly to bowtie2-build. This process requires only a single node! The bison indexer creates a "bisulfite_genome" directory in the same directory as the fasta file(s). Prior to invoking bowtie2-build, bison_index creates C->T and G->A converted multifasta files, which are the actual files indexed. These files are called "genome.fa" and stored in further subdirectories.

Aligning

Aligning normally requires 3-5 nodes (3 for directional and 5 for non-directional libraries, respectively). The command will typically be invoked with a scheduler (such as slurm)or with mpiexec. The command, then will typically be of the form: slurm -N 3 bison -p 12 -g indices/ -1 lane1_1.fq.gz -2 lane1_2.fq.gz.

A text file is created in addition to the BAM file holding the alignments. The text file will contain the same alignment metrics that are printed to the screen.

bison [OPTIONS] -g directory/ {-1 sample_1.fq.gz -2 sample_2.fq.gz | -U sample.fq.gz}

Mandatory inputs are:

-g    Directory containing the genome fasta files (this is identical to the directory
      given to bison_index)

-1    For paired-end reads, the file containing read #1, which is normally named
      something like foo_1.fq.gz. Reads needn't be gzipped, though they typically are
      to save space.

-2    As with -1, but the file containing the second read of each pair. **These files
      must be in sync!** If you trimmed the files individually, make sure that the Nth
      read in the first file is has its mate as the Nth read in the second file.

-U    For convenience, this denotes a fastq file from single-ended reads.
      Alternatively, -1 can be used without using -2.

-p    The number of threads that bowtie2 should use for alignment. Generally, however
      many cores are available per node.

-o    Output directory. By default, everything will be written to the directory holding
      the fastq files (or the file containing read #1, as appropriate). If you would
      prefer for the output BAM file and metrics txt file to be placed elsewhere,
      specify that here. **The directory must exist!**

--directional    The library is directional, rather than non-directional (the default).
      This will result in 3, rather than 5 nodes being used as only alignments to 2
      (rather than 4) strands are possible.

-upto The maximum number of reads to process. This is mostly useful for debuging and
      more quickly determining if a library is directional or not. 0 is the default,
      meaning all reads are used. N.B., the maximum value for this parameter is
      whatever an unsigned long is on your system.

--unmapped    Save unaligned reads to a file or files (as appropriate). This files will
      be placed in the same directory as the source fastq files, regardless of whether
      "-o" is used.

-h    Print a help message

-v    Print version information

Output is to a BAM file with the following additional auxiliary fields (for convenience, these are the same as those found in bismark:

  • XX, the mismatch string to the unconverted genome (differences due to methylation are ignored)
  • XM, the methylation call string. Capital letters indicate a methylated base, while lower case letters are unmethylated. Context is indicated as follows: Z, CpG context; X, CHG context; H, CHH context. For convenience, this is identical to that used by Bismark.
  • XR, The conversion of the aligned read, namely CT or GA for C->T or G->A conversions.
  • XG, The conversion of the genome to which the reads aligned (as with XR)

For convenience, the following table lists the possible combinations of XR and XG tags and the corresponding strand:

 XR:CTXR:GA
XG:CTOriginal top strandComplementary to the original top strand
XG:GAOriginal bottom strandComplementary to the original bottom strand

Extracting methylation values

As with indexing, this step requires only a single node. The output is in bedGraph format (chromosome start stop 1000*methlation% number of methylated reads number of unmethylated reads). Coordinates are 0-based. The output file will be placed in the same directory as the input.bam file. If the input file is named foo.bam, then the output will be named something of the form foo_CpG.bedGraph.

Do not coordinate-sort your BAM file prior to running the methylation extractor!

bison_methylation_extractor [OPTIONS] directory/ input.(sam|bam)

directory    Identical to that used by the aligner and indexer.

input.bam    The BAM (or SAM) file produced from bison. The extractor will also process
             SAM files produced by Bismark.

--MspI       Library was MspI digested.

--TaqI       Library was TaqI digested (this can be in addition to MspI digestion).

--only_CpG   Only process CpG sites (mutually exclusive with --only_CHH and --only_CHG)

--only_CHG   Only process CHG sites (mutually exclusive with --only_CpG and --only_CHH)

--only_CHH   Only process CHH sites (mutually exclusive with --only_CpG and --only_CHG)

-max-sites-size N    This option can increase or decrease memory requirements by
             changing the number of methylation calls stored in memory prior to sorting
             and merging. The default is 50,000.

 

Compatibility with Bismark

Bison is generally similar to bismark, however the indexes are incompatible, to bismark renaming contigs. Also, the two will not produce identical output, due to algorithmic differences. Running bison_methylation_extractor on the output of bismark will also produce different results, again due to algorithmic differences. In addition, bison outputs BAM files directly.

Other details

Bison needn't be run on multiple computers. You can also use a single computer for all compute nodes (e.g. mpiexec -n 5 bison ...).

A debugging mode exists by compiling with -DDEBUG. This results in an additional option, -taskid, that results in bison not sending aligned reads over a network to a master node but, instead, writing them to a file. The -taskid option (values 0-4) instructions bison which compute node to run as. This should only be used for debugging. Because the output file name are dependent purely on the taskid, it is unadvisable to simultaneously run
multiple instances with the same taskid.

Project Members: