PWMScan
============================================================================
A tool to scan entire genomes with a position-specific weight matrix (PWM)
Giovanna Ambrosini EPFL SV/ISREC GR-BUCHER
Rel: 1.0.0 - 08-29-2016
- Initial Release
Rel: 1.1.0 - 09-27-2017
- Add C version of matrix_prob program (remove perl script matrix_prob.pl)
- Add pwm_scan bash wrapper script: Scan a genome with a PWM and a p-value
using either Bowtie or matrix_scan
- A few bug fixes in mba and mscan_bed2sga programs
Rel: 1.1.1 - 10-17-2017
- Optimize pwm_scan and pwm_bowtie_wrapper scripts (bowtie pipeline)
- Modify pwm_mscan_wrapper
- Modify README files in the examples section
Rel: 1.1.2 - 11-02-2017
- Add bash wrapper script for matrix conversion (pwm_convert)
- Add python scripts for matrix_scan parallel execution
- Modify bash wrapper scripts for PWMScan pipeline to
generate BEDdetail output format
- A few bug fixes in lpmconvert.pl and pwmconvert.pl
Rel: 1.1.3 - 12-05-2017
- New optimized version of matrix_scan: optimize for rapid score computation
and efficient drop-off strategy.
- Modify pwm_mscan_wrapper, pwm_bowtie_wrapper, and pwm_scan scripts in order
to optimize the entire scanning pipeline
- Bug fix in the program mba
- Modify README file
Rel: 1.1.4 - 12-15-2017
- Perl wrapper scripts: Use getopts to parse script options
Add forward scanning option
Add write to file option
- Modify README files in examples
Rel: 1.1.5 - 15-01-2018
- Main bash scripts: Simplify the definition of the path to the genome files
All hard-coded (absolute) paths have been eliminated
- Makefile : The path to binaries and scripts is defined by the Makefile
and automatically changed in all bash and python scripts
- New bash script: pwmlib_scan to scan a genome with PWM libraries
- New directories: genomedb (genome data files) and pwmlibs (PWM libraries)
- README and INSTALL: Updated with more detailed instructions on program
installation and paths to binaries and genome data files
Rel: 1.1.6 - 16-01-2018
- Main bash scripts: Add optional parameter to set the background base composition
- New program seq_extract_bcomp to either extract BED regions from a set of
FASTA-formatted sequences or compute the base composition of a set of DNA
sequences (in FASTA format).
Rel: 1.1.7 - 23-02-2018
- Fix a bug in the seq_extract_bcomp program
- Add perl script pwm2lpmconvert.pl to convert a position weight matrix (PWM)
to letter probability format.
- Reorganize the application examples to provide a detailed step-by-step list
of commands to be used to reproduce some case studies.
Rel: 1.1.8 - 07-09-2018
- New C program: pwm_scoring to score FASTA-formatted sequences with either
an integer PWM or a base probability matrix.
Rel: 1.1.9 - 19-09-2018
- New bash scripts: pwm_scan_ucsc and pwm_mscan_wrapper_ucsc to deal with
chromosome files downloaded from UCSC.
- New bash script: pwmlib_scan_seq to scan sequences with a PWM collection.
- 03-12-2018 Modify pwmlib_scan_seq script to deal with more general
sequence FASTA header
DESCRIPTION OF THE TOOL
----------------------------------------------------------------------------
The Position Weight Matrix (PWM) is the most commonly used model to describe
the DNA binding motif of a transcription factor. A PWM contains weights for
each base at each motif position. A PWM score can be computed for any base
sequence of the same length by simply summing up the corresponding weights
from the PWM.
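As a minimal illustration of the scoring rule described above (this is not code from the PWMScan package), the sketch below sums the weights of a small PWM over a sequence of the same length. The matrix values and the function name pwm_score are hypothetical.

```python
# Hypothetical 4-position PWM: one entry per motif position,
# each mapping a base to its integer weight (values invented for illustration).
PWM = [
    {"A": 10, "C": -5, "G": -5, "T": -8},
    {"A": -7, "C": 12, "G": -6, "T": -4},
    {"A": -3, "C": -9, "G": 11, "T": -6},
    {"A": -8, "C": -4, "G": -7, "T": 9},
]


def pwm_score(pwm, seq):
    """Sum the weight of the observed base at each motif position."""
    if len(seq) != len(pwm):
        raise ValueError("sequence and PWM must have the same length")
    return sum(row[base] for row, base in zip(pwm, seq.upper()))


print(pwm_score(PWM, "ACGT"))  # 10 + 12 + 11 + 9 = 42
print(pwm_score(PWM, "TCGA"))  # -8 + 12 + 11 - 8 = 7
```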
PWMScan is a software package with a Web interface.
PWMScan can use two alternative search engines:
- bowtie, a fast memory-efficient short read aligner using indexed genomes
- matrix_scan, a C program using a conventional search algorithm
Basically, two approaches are used to scan large DNA sequences such as genomes
with a PWM:
1) Use a fast string matching algorithm, Bowtie, to scan the genome as follows:
- given a PWM model and a cut-off value, generate all possible matches/tags
that represent the given PWM along with the corresponding scores;
- map the list of tags to a reference genome or a set of DNA sequences,
using a fast string-matching algorithm (e.g. Bowtie).
2) Use a conventional search algorithm, matrix_scan, that has been optimized for
rapid score computation with an early drop-off strategy.
matrix_scan first rescales the scoring matrix so that the maximum score at each
position is set to zero. When scanning the genome, for each position along the
sequence, it computes the sum of weights (scores) and drops out as soon as the
score is below the cut-off value.
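The rescaling and drop-off idea can be sketched in a few lines of Python. This is an illustrative reimplementation under simplifying assumptions (forward strand only, positions evaluated left to right), not the matrix_scan source; the matrix values and the names rescale and scan are hypothetical.

```python
# Hypothetical PWM: one row per motif position, columns in the order A, C, G, T.
PWM = [
    [10, -5, -5, -8],
    [-7, 12, -6, -4],
    [-3, -9, 11, -6],
    [-8, -4, -7, 9],
]
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def rescale(pwm):
    """Subtract each row's maximum so the best base at every position scores 0.

    Returns the rescaled matrix and the total offset that was removed.
    """
    offset = sum(max(row) for row in pwm)
    rescaled = [[w - max(row) for w in row] for row in pwm]
    return rescaled, offset


def scan(pwm, seq, cutoff):
    """Return (start, score) for every forward-strand window scoring >= cutoff."""
    rescaled, offset = rescale(pwm)
    threshold = cutoff - offset  # the cut-off expressed on the rescaled scale
    hits = []
    for start in range(len(seq) - len(pwm) + 1):
        score = 0
        for pos, row in enumerate(rescaled):
            col = INDEX.get(seq[start + pos])
            if col is None:        # ambiguous base such as N: skip this window
                break
            score += row[col]
            if score < threshold:  # drop off: remaining weights are all <= 0
                break
        else:
            hits.append((start, score + offset))  # report on the original scale
    return hits


print(scan(PWM, "TTACGTACGAAC", 30))  # [(2, 42)]
```

Because every rescaled weight is zero or negative, the running sum can only decrease, so a window can be abandoned as soon as it falls below the threshold.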
In order to speed up the scanning process, the matrix_scan program pre-computes
the PWM scores in both forward and reverse directions for all possible nucleotide
words of a given length. In addition, if the PWM is longer than the word
size, a core region within the PWM is defined such that it minimizes the sum
of weights for rapid drop-off. The lateral positions are ranked in decreasing
order of importance.
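The word pre-computation can be illustrated as follows. This sketch assumes the PWM is no longer than the word size and therefore leaves out the core-region selection and the ranking of lateral positions; the 2-bit encoding, the matrix values, and all function names are assumptions made for illustration, not details taken from matrix_scan.

```python
from itertools import product

# Hypothetical PWM: one row per motif position, columns in the order A, C, G, T.
PWM = [
    [10, -5, -5, -8],
    [-7, 12, -6, -4],
    [-3, -9, 11, -6],
    [-8, -4, -7, 9],
]
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def build_tables(pwm):
    """Pre-compute forward and reverse-strand scores for every word of length k."""
    k = len(pwm)
    fwd = [0] * (4 ** k)
    rev = [0] * (4 ** k)
    for letters in product("ACGT", repeat=k):
        code = 0
        for base in letters:
            code = (code << 2) | CODE[base]  # 2 bits per base
        revcomp = [COMPLEMENT[b] for b in reversed(letters)]
        fwd[code] = sum(pwm[i][CODE[b]] for i, b in enumerate(letters))
        rev[code] = sum(pwm[i][CODE[b]] for i, b in enumerate(revcomp))
    return fwd, rev


def scan_with_tables(pwm, seq, cutoff):
    """Scan both strands using one table lookup per strand and position."""
    k = len(pwm)
    fwd, rev = build_tables(pwm)
    mask = 4 ** k - 1
    hits, code, valid = [], 0, 0
    for i, base in enumerate(seq):
        if base not in CODE:  # ambiguous base: restart the rolling word
            code, valid = 0, 0
            continue
        code = ((code << 2) | CODE[base]) & mask  # slide the window by one base
        valid += 1
        if valid >= k:
            start = i - k + 1
            if fwd[code] >= cutoff:
                hits.append((start, "+", fwd[code]))
            if rev[code] >= cutoff:
                hits.append((start, "-", rev[code]))
    return hits


print(scan_with_tables(PWM, "TTACGTACGAAC", 30))
```

The table costs 4^k entries per strand, which is why the word length has to stay small and longer PWMs need the additional core-region strategy described above.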
The Bowtie-based approach is more efficient for short PWMs and very low p-values
(on the order of 10^-5 or less).
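The tag-generation step of the Bowtie-based approach can be sketched as a bounded enumeration of all words that reach the cut-off. This is an assumed illustration, not the PWMScan implementation; the matrix values and the name enumerate_tags are hypothetical. The number of candidate words grows as 4^L with the PWM length L, and the list of tags grows as the cut-off is relaxed, which is consistent with the approach being best suited to short PWMs and stringent cut-offs.

```python
# Hypothetical PWM: one row per motif position, columns in the order A, C, G, T.
PWM = [
    [10, -5, -5, -8],
    [-7, 12, -6, -4],
    [-3, -9, 11, -6],
    [-8, -4, -7, 9],
]


def enumerate_tags(pwm, cutoff):
    """Return every (word, score) pair whose PWM score is >= cutoff."""
    bases = "ACGT"
    k = len(pwm)
    # best_tail[i] = highest score still obtainable from positions i..k-1
    best_tail = [0] * (k + 1)
    for i in range(k - 1, -1, -1):
        best_tail[i] = best_tail[i + 1] + max(pwm[i])

    tags = []

    def extend(prefix, score, pos):
        if score + best_tail[pos] < cutoff:  # prune: cut-off is out of reach
            return
        if pos == k:
            tags.append((prefix, score))
            return
        for col, base in enumerate(bases):
            extend(prefix + base, score + pwm[pos][col], pos + 1)

    extend("", 0, 0)
    return tags


print(enumerate_tags(PWM, 28))  # [('ACAT', 28), ('ACGC', 29), ('ACGT', 42)]
```

The resulting words would then be written out and mapped to the genome by the aligner; that step depends on the Bowtie command line and is not shown here.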
The matrix_scan program can be executed in parallel by processing individual
chromosomes in parallel on multiple CPU-cores via a python script.
The Web interface automatically chooses the most suitable method.
We use integer log likelihoods or integer log-odds as the internal working
format of PWMs. The PWM has one column for each nucleotide in DNA sequences,
and it has one row for each position in the pattern. The scores at each
position are calculated as the sum of integer log likelihoods (log-odds).
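For readers starting from a base probability matrix, a conversion to integer log-odds might look like the sketch below. The log base (2), the scaling factor (10), the uniform background, and the function name are all assumptions chosen for illustration; the conversion scripts shipped with PWMScan may use different conventions.

```python
import math


def to_integer_log_odds(prob_matrix, background=(0.25, 0.25, 0.25, 0.25), scale=10):
    """Convert base probabilities (rows = positions, columns = A, C, G, T)
    to rounded, scaled log2-odds scores against the background composition."""
    pwm = []
    for row in prob_matrix:
        if any(p <= 0 for p in row):
            # Zero probabilities need a pseudocount before taking logarithms.
            raise ValueError("probabilities must be > 0; add a pseudocount first")
        pwm.append([round(scale * math.log2(p / bg)) for p, bg in zip(row, background)])
    return pwm


# Hypothetical two-position probability matrix.
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
print(to_integer_log_odds(probs))  # [[15, -13, -13, -13], [0, 0, 0, 0]]
```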
The match list is provided in BEDdetail format, with the following fields:
1- chromosome name (e.g. chr1, chrX, chrM)
2- starting position of the matching sequence
3- ending position of the matching sequence
4- matching nucleotide sequence
5- integer score of the matching sequence
6- strand (either '+' or '-')
7- PWM name
8- p-value of the match
BEDdetail is an extension of BED format that is used to enhance the track display page.
For PWMScan, we use BEDdetail format to include the name of the PWM as well as the p-value
associated with the motifs identified by PWMScan.
For a complete description of BED and BEDdetail format, please refer to:
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7
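A match list with the eight fields listed above can be read with a few lines of Python. The sketch assumes tab-separated fields and no header lines; the example line, the PWM name EXAMPLE_PWM, and the names Match and parse_match_line are invented for illustration.

```python
from collections import namedtuple

Match = namedtuple(
    "Match", "chrom start end sequence score strand pwm_name p_value"
)


def parse_match_line(line):
    """Parse one tab-separated match line into a Match record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 fields, got %d" % len(fields))
    chrom, start, end, sequence, score, strand, pwm_name, p_value = fields
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return Match(chrom, int(start), int(end), sequence, int(score),
                 strand, pwm_name, float(p_value))


# Hypothetical example line (values are not real PWMScan output).
example = "chr1\t1000\t1004\tACGT\t42\t+\tEXAMPLE_PWM\t1.5e-05\n"
match = parse_match_line(example)
print(match.chrom, match.start, match.end, match.score, match.p_value)
```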
WEB SITE
----------------------------------------------------------------------------
PWMScan has a web interface which is freely available at:
http://ccg.vital-it.ch/pwmtools/pwmscan.php
Key features of the Web interface are the following:
- Menu-driven access to genomes of more than 30 model organisms
- Access to large collections of PWMs from MEME and other databases
- Custom PWMs are supplied by copy&paste or file upload
- Support of various PWM formats: JASPAR, TRANSFAC, plain text, etc.
- Cut-off values defined as PWM match scores, match percentage, or p-values
- Output provided in various formats: BEDdetail, SGA, FPS, etc.
- Direct links to the UCSC genome browser for visualization of results
- Action buttons to transfer match list to downstream analysis tools
(ChIP-Seq and motif analysis tools)
The Web interface doesn't support upload of user-supplied FASTA sequence files.
PROGRAM INSTALLATION
----------------------------------------------------------------------------
For code compilation a suitable Makefile is provided.
- To create the binary files, please type:
make
- To install all the binaries and scripts in $(binDir)=$(PWD)/bin, please type:
make install
- To delete the compiled binary and object files from the current directory, please type:
make clean
- To delete the installed binaries and scripts in $(binDir), please type:
make cleanbin
NB - The $(binDir) variable is by default set to
binDir = $(PWD)/bin
binDir defines the path to all binaries and scripts used by PWMScan.
"make install" changes the bin_dir variable of the installed bash wrapper scripts
(pwm_scan, pwm_scan_ucsc, pwmlib_scan, pwmlib_scan_seq, pwm_mscan_wrapper,
pwm_mscan_wrapper_ucsc, pwm_bowtie_wrapper, and pwm_convert) as well as the
matrix_scan_parallel.py python script to point to binDir.
- To unzip the genome files in $(PWD)/genomedb (the genome root directory) for assembly hg19:
make install-genome
EXTERNAL SOFTWARE PACKAGE
============================================================================
For installing Bowtie, please refer to the Bowtie page:
- bowtie http://bowtie-bio.sourceforge.net/index.shtml
The Bowtie binaries are installed system-wide.
BASIC SOFTWARE REQUIREMENTS
============================================================================
The GNU C compiler collection, the UNIX bash shell (version >= 4), Python (version >= 2.7),
Perl (version >= 5), and the Perl modules Math::Round and Scalar::Util::Numeric (CPAN modules).