Welcome to Bison, "Bisulfite alignment on nodes of a cluster". Bison is intended for users with bisulfite-sequencing data (either whole-genome or RRBS) who also have access to a computer cluster. Bison accepts raw fastq input and will output an aligned BAM file including strand and methylation information.
The following are required for compiling and running bison:
The normal installation steps are as follows:
There are three basic steps involved in using Bison. One must first index the genome against which reads will be aligned. This will usually only be done once and the resulting index files kept. Second, reads are aligned against the index to create a BAM file. Finally, methylation information can be extracted for downstream analysis.
bison_index [OPTIONS] /path/to/directory/with/fasta/file(s)/
Aside from "-h", which displays a help message, all options are passed directly to bowtie2-build. This process requires only a single node! The bison indexer creates a "bisulfite_genome" directory in the same directory as the fasta file(s). Prior to invoking bowtie2-build, bison_index creates C->T and G->A converted multifasta files, which are the actual files indexed. These files are called "genome.fa" and stored in further subdirectories.
Aligning normally requires 3-5 nodes (3 for directional and 5 for non-directional libraries, respectively). The command will typically be invoked with a scheduler (such as slurm)or with mpiexec. The command, then will typically be of the form: slurm -N 3 bison -p 12 -g indices/ -1 lane1_1.fq.gz -2 lane1_2.fq.gz.
A text file is created in addition to the BAM file holding the alignments. The text file will contain the same alignment metrics that are printed to the screen.
bison [OPTIONS] -g directory/ {-1 sample_1.fq.gz -2 sample_2.fq.gz | -U sample.fq.gz}
Mandatory inputs are:
-g Directory containing the genome fasta files (this is identical to the directory
given to bison_index)
-1 For paired-end reads, the file containing read #1, which is normally named
something like foo_1.fq.gz. Reads needn't be gzipped, though they typically are
to save space.
-2 As with -1, but the file containing the second read of each pair. **These files
must be in sync!** If you trimmed the files individually, make sure that the Nth
read in the first file is has its mate as the Nth read in the second file.
-U For convenience, this denotes a fastq file from single-ended reads.
Alternatively, -1 can be used without using -2.
-p The number of threads that bowtie2 should use for alignment. Generally, however
many cores are available per node.
-o Output directory. By default, everything will be written to the directory holding
the fastq files (or the file containing read #1, as appropriate). If you would
prefer for the output BAM file and metrics txt file to be placed elsewhere,
specify that here. **The directory must exist!**
--directional The library is directional, rather than non-directional (the default).
This will result in 3, rather than 5 nodes being used as only alignments to 2
(rather than 4) strands are possible.
-upto The maximum number of reads to process. This is mostly useful for debuging and
more quickly determining if a library is directional or not. 0 is the default,
meaning all reads are used. N.B., the maximum value for this parameter is
whatever an unsigned long is on your system.
--unmapped Save unaligned reads to a file or files (as appropriate). This files will
be placed in the same directory as the source fastq files, regardless of whether
"-o" is used.
-h Print a help message
-v Print version information
Output is to a BAM file with the following additional auxiliary fields (for convenience, these are the same as those found in bismark:
For convenience, the following table lists the possible combinations of XR and XG tags and the corresponding strand:
XR:CT | XR:GA | |
XG:CT | Original top strand | Complementary to the original top strand |
XG:GA | Original bottom strand | Complementary to the original bottom strand |
As with indexing, this step requires only a single node. The output is in bedGraph format (chromosome start stop 1000*methlation% number of methylated reads number of unmethylated reads). Coordinates are 0-based. The output file will be placed in the same directory as the input.bam file. If the input file is named foo.bam, then the output will be named something of the form foo_CpG.bedGraph.
Do not coordinate-sort your BAM file prior to running the methylation extractor!
bison_methylation_extractor [OPTIONS] directory/ input.(sam|bam)
directory Identical to that used by the aligner and indexer.
input.bam The BAM (or SAM) file produced from bison. The extractor will also process
SAM files produced by Bismark.
--MspI Library was MspI digested.
--TaqI Library was TaqI digested (this can be in addition to MspI digestion).
--only_CpG Only process CpG sites (mutually exclusive with --only_CHH and --only_CHG)
--only_CHG Only process CHG sites (mutually exclusive with --only_CpG and --only_CHH)
--only_CHH Only process CHH sites (mutually exclusive with --only_CpG and --only_CHG)
-max-sites-size N This option can increase or decrease memory requirements by
changing the number of methylation calls stored in memory prior to sorting
and merging. The default is 50,000.
Bison is generally similar to bismark, however the indexes are incompatible, to bismark renaming contigs. Also, the two will not produce identical output, due to algorithmic differences. Running bison_methylation_extractor on the output of bismark will also produce different results, again due to algorithmic differences. In addition, bison outputs BAM files directly.
Bison needn't be run on multiple computers. You can also use a single computer for all compute nodes (e.g. mpiexec -n 5 bison ...).
A debugging mode exists by compiling with -DDEBUG. This results in an additional option, -taskid, that results in bison not sending aligned reads over a network to a master node but, instead, writing them to a file. The -taskid option (values 0-4) instructions bison which compute node to run as. This should only be used for debugging. Because the output file name are dependent purely on the taskid, it is unadvisable to simultaneously run
multiple instances with the same taskid.