PWMScan
============================================================================
A tool to scan entire genomes with a position-specific weight matrix (PWM)

Giovanna Ambrosini EPFL SV/ISREC GR-BUCHER

Rel: 1.0.0 -  08-29-2016
- Initial Release

Rel: 1.1.0 -  09-27-2017
- Add C version of matrix_prob program (remove perl script matrix_prob.pl)
- Add pwm_scan bash wrapper script: Scan a genome with a PWM and a p-value 
  using either Bowtie or matrix_scan
- A few bug fixes in mba and mscan_bed2sga programs

Rel: 1.1.1 -  10-17-2017
-  Optimize pwm_scan and pwm_bowtie_wrapper scripts (bowtie pipeline)
-  Modify pwm_mscan_wrapper
-  Modify README files in the examples section 

Rel: 1.1.2 - 11-02-2017 
- Add bash wrapper script for matrix conversion (pwm_convert) 
- Add python scripts for matrix_scan parallel execution 
- Modify bash wrapper scripts for PWMScan pipeline to 
  generate BEDdetail output format 
- A few bug fixes in lpmconvert.pl and pwmconvert.pl

Rel: 1.1.3 -  12-05-2017
-  New optimized version of matrix_scan: optimize for rapid score computation
   and efficient drop-off strategy.
-  Modify pwm_mscan_wrapper, pwm_bowtie_wrapper, and pwm_scan scripts in order
   to optimize the entire scanning pipeline
-  Bug fix in the program mba
-  Modify README file

Rel: 1.1.4 -  12-15-2017
-  Perl wrapper scripts: Use getopts to parse script options
                         Add forward scanning option
                         Add write to file option
-  Modify README files in examples

Rel: 1.1.5 -  15-01-2018
-  Main bash scripts:  Simplify the definition of the path to the genome files
                       All hard-coded (absolute) paths have been eliminated 
-  Makefile         :  The path to binaries and scripts is defined by the Makefile
                       and automatically changed in all bash and python scripts
-  New bash script:    pwmlib_scan to scan a genome with PWM libraries
-  New directories:    genomedb (genome data files) and pwmlibs (PWM libraries)
-  README and INSTALL: Updated with more detailed instructions on program
                       installation and paths to binaries and genome data files

Rel: 1.1.6 -  16-01-2018
-  Main bash scripts:  Add optional parameter to set the background base composition 
-  New program seq_extract_bcomp to either extract BED regions from a set of 
   FASTA-formatted sequences or compute the base composition of a set of DNA 
   sequences (in FASTA format).

Rel: 1.1.7 -  23-02-2018
-  Fix a bug in the seq_extract_bcomp program
-  Add perl script pwm2lpmconvert.pl to convert a position weight matrix (PWM) 
   to letter probability format.
-  Reorganize the application examples to provide a detailed step-by-step list
   of commands to be used to reproduce some case studies.

Rel: 1.1.8 -  07-09-2018
-  New C program:     pwm_scoring to score FASTA-formatted sequences with either
                      an integer PWM or a base probability matrix.

Rel: 1.1.9 -  19-09-2018
-  New bash scripts:  pwm_scan_ucsc and pwm_mscan_wrapper_ucsc to deal with
                      chromosome files downloaded from UCSC.
-  New bash script:   pwmlib_scan_seq to scan sequences with a PWM collection.

- 03-12-2018          Modify pwmlib_scan_seq script to handle more general
                      FASTA sequence headers


DESCRIPTION OF THE TOOL
----------------------------------------------------------------------------

The Position Weight Matrix (PWM) is the most commonly used model to describe
the DNA binding motif of a transcription factor. A PWM contains weights for
each base at each motif position. A PWM score can be computed for any base
sequence of the same length by simply summing up the corresponding weights
from the PWM.
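
As an illustration of the scoring rule above, here is a minimal Python sketch.
The matrix values and the function name pwm_score are made up for this example
and are not part of PWMScan; the column order A, C, G, T is also an assumption.

```python
# Hypothetical integer PWM: one row per motif position,
# columns assumed to be ordered A, C, G, T.
BASES = "ACGT"
PWM = [
    [10, -5, -5, -8],
    [-6, 12, -7, -4],
    [-9, -3, 11, -2],
    [8, -4, -6, 3],
]

def pwm_score(pwm, seq):
    """Sum the weight of the observed base at each motif position."""
    if len(seq) != len(pwm):
        raise ValueError("sequence and PWM must have the same length")
    return sum(row[BASES.index(base)] for row, base in zip(pwm, seq.upper()))

print(pwm_score(PWM, "ACGA"))  # 10 + 12 + 11 + 8 = 41
print(pwm_score(PWM, "TTTT"))  # -8 - 4 - 2 + 3 = -11
```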

PWMScan is a software package with a Web interface.

PWMScan can use two alternative search engines:

 - bowtie, a fast memory-efficient short read aligner using indexed genomes
 - matrix_scan, a C program using a conventional search algorithm 

Basically, two approaches are used to scan large DNA sequences such as genomes
with a PWM:

 1) Use a fast string matching algorithm, Bowtie, to scan the genome as follows:

     - given a PWM model and a cut-off value, generate all possible matches/tags
       that represent the given PWM along with the corresponding scores;
     - map the list of tags to a reference genome or a set of DNA sequences,
       using a fast string-matching algorithm (e.g. Bowtie).

 2) Use a conventional search algorithm, matrix_scan, that has been optimized for
    rapid score computation and early drop-off.
    matrix_scan first rescales the scoring matrix so that the maximum score at each
    position is set to zero. When scanning the genome, for each position along the
    sequence, it sums the weights (scores) and drops out as soon as the partial
    score falls below the cut-off value.
    To speed up the scanning process, matrix_scan pre-computes the PWM scores in
    both forward and reverse directions for all possible nucleotide words of a
    given length. In addition, if the PWM is longer than the word size, a core
    region within the PWM is defined such that it minimizes the sum of weights
    for rapid drop-off. The lateral positions are ranked in decreasing order of
    importance.
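
The rescaling and drop-off idea of approach 2 can be sketched in Python as
follows. This is a simplified illustration, not the matrix_scan implementation:
it scans the forward strand only and omits the pre-computed word tables and the
core-region ordering. The matrix values, the function names, and the column
order A, C, G, T are assumptions made for the example.

```python
BASES = "ACGT"

# Hypothetical integer PWM (rows = positions, columns assumed A, C, G, T).
PWM = [
    [10, -5, -5, -8],
    [-6, 12, -7, -4],
    [-9, -3, 11, -2],
    [8, -4, -6, 3],
]

def rescale(pwm):
    """Shift each row so that its maximum is 0; return matrix and total shift."""
    shifted = [[w - max(row) for w in row] for row in pwm]
    offset = sum(max(row) for row in pwm)
    return shifted, offset

def scan_forward(pwm, seq, cutoff):
    """Yield (start, score) for forward-strand windows with score >= cutoff."""
    shifted, offset = rescale(pwm)
    adj_cutoff = cutoff - offset  # the cut-off expressed on the rescaled scale
    m = len(pwm)
    seq = seq.upper()
    for start in range(len(seq) - m + 1):
        score = 0
        for pos in range(m):
            col = BASES.find(seq[start + pos])
            if col < 0:  # skip windows containing N or other symbols
                break
            score += shifted[pos][col]  # every term is <= 0
            if score < adj_cutoff:  # drop off: cut-off can no longer be reached
                break
        else:
            yield start, score + offset  # report score on the original scale

print(list(scan_forward(PWM, "TTACGATT", 30)))  # [(2, 41)]
```

Because all rescaled weights are zero or negative, the running sum can only
decrease, so a window can be abandoned the moment it falls below the cut-off.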

The Bowtie-based approach is more efficient for short PWMs and very low p-values
(of the order of 10^-5 or less).
The matrix_scan program can be run in parallel via a python script that processes
individual chromosomes concurrently on multiple CPU cores.
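
One way to see why the Bowtie-based approach suits short PWMs and stringent
cut-offs is to enumerate the tags explicitly: the tag list grows quickly as the
PWM gets longer or the cut-off gets looser. The following Python sketch
generates all words that reach a given score cut-off, pruning branches that
cannot reach it. It is an illustration only; the matrix values, the function
name enumerate_tags, and the column order A, C, G, T are assumptions and do not
describe the actual PWMScan tag generator.

```python
BASES = "ACGT"

# Hypothetical integer PWM (rows = positions, columns assumed A, C, G, T).
PWM = [
    [10, -5, -5, -8],
    [-6, 12, -7, -4],
    [-9, -3, 11, -2],
    [8, -4, -6, 3],
]

def enumerate_tags(pwm, cutoff):
    """Return all (word, score) pairs of PWM length with score >= cutoff."""
    m = len(pwm)
    # best_rest[i] = highest score achievable from positions i..m-1
    best_rest = [0] * (m + 1)
    for i in range(m - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(pwm[i])
    tags = []

    def extend(prefix, score):
        i = len(prefix)
        if score + best_rest[i] < cutoff:  # prune: cut-off is out of reach
            return
        if i == m:
            tags.append((prefix, score))
            return
        for col, base in enumerate(BASES):
            extend(prefix + base, score + pwm[i][col])

    extend("", 0)
    return tags

print(enumerate_tags(PWM, 30))  # [('ACGA', 41), ('ACGT', 36)]
```

The resulting words would then be written out as tags and mapped to the genome
with the string-matching tool.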
 
The Web interface automatically chooses the most suitable method.

We use integer log likelihoods or integer log-odds as the internal working
format of PWMs. The PWM has one column for each of the four DNA nucleotides
and one row for each position in the pattern. The score of a sequence is
calculated as the sum, over all positions, of the integer log likelihoods
(log-odds).
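
For illustration, a base probability matrix could be turned into integer
log-odds along the following lines. The log base (2), the scaling factor (100),
the uniform background, and the function name are assumptions chosen for this
sketch; they are not taken from the PWMScan conversion scripts.

```python
import math

def to_integer_log_odds(prob_matrix, background=(0.25, 0.25, 0.25, 0.25), scale=100):
    """Convert per-position base probabilities to integer log-odds.

    All probabilities must be > 0 (add pseudo-counts beforehand if needed).
    """
    return [
        [int(round(scale * math.log2(p / bg))) for p, bg in zip(row, background)]
        for row in prob_matrix
    ]

# Hypothetical two-position probability matrix.
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
print(to_integer_log_odds(probs))  # [[149, -132, -132, -132], [0, 0, 0, 0]]
```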

The match list is provided in BEDdetail format, with the following fields:

 1- chromosome name (e.g. chr1, chrX, chrM)
 2- starting position of the matching sequence
 3- ending position of the matching sequence
 4- matching nucleotide sequence
 5- integer score of the matching sequence
 6- strand (either '+' or '-')
 7- PWM name
 8- p-value of the match

BEDdetail is an extension of BED format that is used to enhance the track display page.
For PWMScan, we use BEDdetail format to include the name of the PWM as well as the p-value
associated with the motifs identified by PWMScan.

For a complete description of BED and BEDdetail format, please refer to:

 https://genome.ucsc.edu/FAQ/FAQformat.html#format1
 https://genome.ucsc.edu/FAQ/FAQformat.html#format1.7
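
A match line with the fields listed above can be read in Python as sketched
below. The sample line is invented for the example, and the sketch assumes
tab-separated fields with a p-value that parses as a floating-point number.

```python
from typing import NamedTuple

class Match(NamedTuple):
    chrom: str
    start: int
    end: int
    sequence: str
    score: int
    strand: str
    pwm_name: str
    p_value: float

def parse_match(line):
    """Parse one tab-separated match line into a Match record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 fields, got %d" % len(fields))
    return Match(
        fields[0], int(fields[1]), int(fields[2]), fields[3],
        int(fields[4]), fields[5], fields[6], float(fields[7]),
    )

# Hypothetical match line (values made up for illustration).
line = "chr1\t100\t104\tACGA\t41\t+\texample_pwm\t3.9e-03\n"
match = parse_match(line)
print(match.chrom, match.score, match.strand, match.p_value)
```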


WEB SITE
----------------------------------------------------------------------------

PWMScan has a web interface which is freely available at:

   http://ccg.vital-it.ch/pwmtools/pwmscan.php

Key features of the Web interface are the following:

  - Menu-driven access to genomes of more than 30 model organisms
  - Access to large collections of PWMs from MEME and other databases
  - Custom PWMs are supplied by copy&paste or file upload
  - Support of various PWM formats: JASPAR, TRANSFAC, plain text, etc.
  - Cut-off values defined as PWM match scores, match percentage, or p-values
  - Output provided in various formats: BEDdetail, SGA, FPS, etc.
  - Direct links to the UCSC genome browser for visualization of results
  - Action buttons to transfer match list to downstream analysis tools
    (ChIP-Seq and motif analysis tools)
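
The relation between a p-value and a score cut-off mentioned in the list above
can be illustrated with a small Python sketch. It computes the exact score
distribution of an integer PWM under an independent background model by dynamic
programming, then picks the lowest cut-off whose tail probability does not
exceed the requested p-value. This is a generic textbook approach shown under
stated assumptions (uniform background, invented matrix values and function
names); it is not a description of how PWMScan computes p-values.

```python
from collections import defaultdict

# Hypothetical integer PWM (rows = positions, columns assumed A, C, G, T).
PWM = [
    [10, -5, -5, -8],
    [-6, 12, -7, -4],
    [-9, -3, 11, -2],
    [8, -4, -6, 3],
]

def score_distribution(pwm, background=(0.25, 0.25, 0.25, 0.25)):
    """Return {score: probability} for random words drawn from the background."""
    dist = {0: 1.0}
    for row in pwm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for weight, bg in zip(row, background):
                nxt[score + weight] += prob * bg
        dist = dict(nxt)
    return dist

def cutoff_for_p_value(pwm, p_value, background=(0.25, 0.25, 0.25, 0.25)):
    """Return (cutoff, tail) with tail = P(score >= cutoff) <= p_value.

    Returns (None, 0.0) if even the top score is more probable than p_value.
    """
    dist = score_distribution(pwm, background)
    tail = 0.0
    cutoff = None
    for score in sorted(dist, reverse=True):
        if tail + dist[score] > p_value:
            break
        tail += dist[score]
        cutoff = score
    return cutoff, tail

print(cutoff_for_p_value(PWM, 0.01))  # (36, 0.0078125)
```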

The Web interface doesn't support upload of user-supplied FASTA sequence files.


PROGRAM INSTALLATION
----------------------------------------------------------------------------

For code compilation a suitable Makefile is provided.

- To create the binary files, please type:

make

- To install all the binaries and scripts in $(binDir)=$(PWD)/bin, please type:

make install

- To delete the compiled binary and object files from the current directory, please type:

make clean

- To delete the installed binaries and scripts in $(binDir), please type:

make cleanbin

NB - The $(binDir) variable is by default set to
     binDir = $(PWD)/bin

     binDir defines the path to all binaries and scripts used by PWMScan.
     Running make install sets the bin_dir variable of the installed bash wrapper
     scripts (pwm_scan, pwm_scan_ucsc, pwmlib_scan, pwmlib_scan_seq, pwm_mscan_wrapper,
     pwm_mscan_wrapper_ucsc, pwm_bowtie_wrapper, and pwm_convert) as well as of the
     matrix_scan_parallel.py python script to point to binDir.

- To unzip the genome files in $(PWD)/genomedb (the genome root directory) for assembly hg19:

make install-genome


EXTERNAL SOFTWARE PACKAGE
----------------------------------------------------------------------------

For installing Bowtie, please refer to the Bowtie page:

  - bowtie    http://bowtie-bio.sourceforge.net/index.shtml

The Bowtie binaries are expected to be installed system-wide.


BASIC SOFTWARE REQUIREMENTS
----------------------------------------------------------------------------

PWMScan requires the GNU Compiler Collection (GCC), bash (version >= 4),
Python (version >= 2.7), Perl (version >= 5), and the Perl modules Math::Round
and Scalar::Util::Numeric (both available from CPAN).