Home

Wei Li

Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout (MAGeCK) is a computational tool for identifying important genes from genome-scale CRISPR-Cas9 knockout (GeCKO) screens.

MAGeCK is developed and maintained by Wei Li and Han Xu from Dr. Xiaole Shirley Liu's lab at the Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health.

MAGeCK is free, open-source software released under the BSD license.

This documentation covers installation, a tutorial, usage, the input and output file specifications, and the version history.

Installation

Download

The latest version of MAGeCK (0.4) can be downloaded here:


System requirement

MAGeCK runs on Mac and Linux systems. Because MAGeCK is written in Python and C, Python (version 2.7 or later) and a C compiler are required.

MAGeCK also recommends installing numpy, which is used to calculate the negative binomial p-value. If numpy is not found, MAGeCK uses the normal p-value instead. The results of the two methods may differ slightly.
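The fallback described above can be pictured with a small availability check. This is only an illustrative sketch, not MAGeCK's own code: the function name pvalue_method and its module_name parameter are hypothetical, and the returned strings simply echo the two methods named in the text.

```python
def pvalue_method(module_name="numpy"):
    """Report which p-value method the text says applies.

    Hypothetical helper: it only checks whether the module can be imported.
    """
    try:
        __import__(module_name)
    except ImportError:
        return "normal"             # numpy missing: normal p-value
    return "negative binomial"      # numpy found: negative binomial p-value


print(pvalue_method())
```

Running this before installation shows which of the two methods to expect on a given machine.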

Installation

Since version 0.3, MAGeCK uses the standard Python installation procedure (distutils) to compile and install the software.

To install, first download the source code, unpack it (the commands below assume version 0.4), and change into the directory:

tar xvzf mageck-0.4.tar.gz
cd mageck-0.4

After that, invoke python setup.py:

python setup.py install

And it is done! If you want to install MAGeCK into your own directory, use the following command instead:

python setup.py install --prefix=$HOME

where $HOME is the root directory you want to install into (usually your home directory).

Manual installation

Manual installation has been deprecated since version 0.3. Please refer to the installation instructions above.

After downloading the source code, follow the instructions below for manual installation.

Setting up the environment variables

If you used the default Python installation options (i.e., installed MAGeCK into the system directory), you are all set. If you installed MAGeCK into a custom directory, additional steps may be needed, including setting environment variables. First, add the path of the mageck program to your PATH variable. For example, if you used the --prefix=$HOME option during installation, set the PATH variable by typing:

export PATH=$PATH:$HOME/bin

You also need to add the path of the MAGeCK module to the PYTHONPATH variable. Again assuming the --prefix=$HOME option, this variable should be set as:

export PYTHONPATH=$HOME/lib/python2.7/site-packages:$PYTHONPATH

To save the path configuration (so you don't have to type it every time), place the above commands in your ~/.bashrc (for Linux) or ~/.bash_profile (for Mac).



Tutorial

Running MAGeCK is straightforward. The demo folder contains two mini examples that go through all the steps in MAGeCK. To run a demo, execute the sh script in that example's folder from the command line.

The first demo: starting from read count tables

There is only one command line in the demo:

mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial -n demo

The parameters are explained as follows.

  • mageck: The main portal of the MAGeCK program.
  • test: A sub-command that asks MAGeCK to rank sgRNAs and genes based on the provided read count table.
  • -k sample.txt: The read count table file. The file format is described in the input file specification.
  • -t HL60.final,KBM7.final: The treatment samples are HL60.final and KBM7.final (the samples with index 2 and 3, counting from 0) in sample.txt. See the input file specification for a detailed explanation.
  • -c HL60.initial,KBM7.initial: The control samples are HL60.initial and KBM7.initial (the samples with index 0 and 1, counting from 0) in sample.txt. See the input file specification for a detailed explanation.
  • -n demo: The prefix of the output files is demo, so the expected output files are demo.sgrna_summary.txt, demo.gene_summary.txt, etc.

An explanation of the output files can be found in the [output] page. For all available parameters, see the [usage] page.

You can also specify the treatment and control samples by sample index. For example,

mageck test -k sample.txt -t 2,3 -c 0,1 -n demo
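When scripting several comparisons, the two equivalent forms of the command (sample labels or sample indices) can be assembled programmatically. The sketch below only builds the argument list and does not execute anything; build_test_command is a hypothetical helper name, and it uses only the -k, -t, -c and -n options shown above.

```python
def build_test_command(count_table, treatment, control, prefix):
    """Assemble the argument list for `mageck test` (nothing is executed)."""
    return [
        "mageck", "test",
        "-k", count_table,
        "-t", ",".join(str(s) for s in treatment),  # labels or indices
        "-c", ",".join(str(s) for s in control),
        "-n", prefix,
    ]


by_label = build_test_command(
    "sample.txt",
    ["HL60.final", "KBM7.final"],
    ["HL60.initial", "KBM7.initial"],
    "demo",
)
by_index = build_test_command("sample.txt", [2, 3], [0, 1], "demo")
print(" ".join(by_label))
print(" ".join(by_index))
```

A list of this form can be handed to subprocess.call once MAGeCK is installed and on the PATH.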

The second demo: starting from raw fastq files

This demo shows a mini example of how to go through the whole pipeline starting from raw fastq files. In this example, we have fastq files from two conditions, and we would like to find which genes and sgRNAs differ significantly between the conditions. The command line used in the runmageck.sh script is:

mageck run --fastq test1.fastq,test2.fastq -l library.txt -n demo --sample-label L1,CTRL -t CTRL -c L1

The parameters are explained as follows.

  • mageck: The main portal of the MAGeCK program.
  • run: A sub-command that asks MAGeCK to go through sgRNA counting and sgRNA and gene ranking, starting from fastq files.
  • --fastq test1.fastq,test2.fastq: The fastq files, separated by commas.
  • -l library.txt: The sgRNA library information, including the sgRNA ID, the sequence, and the gene it targets. See the input file specification for a detailed explanation.
  • -n demo: The prefix of the output files is demo, so the expected output files are demo.summary.txt, demo.gene.high.txt, etc.
  • --sample-label L1,CTRL: The labels of the two samples are L1 (test1.fastq) and CTRL (test2.fastq).
  • -t CTRL: The treatment sample is CTRL. In other words, it is the sample in test2.fastq. See the input file specification for a detailed explanation.
  • -c L1: The control sample is L1, which is the sample in test1.fastq. See the input file specification for a detailed explanation.

You can also specify the treatment and control samples by index:

mageck run --fastq test1.fastq,test2.fastq -l library.txt -n demo --sample-label L1,CTRL -t 1 -c 0



Usage

The main portal of MAGeCK is the mageck.py script, which includes four sub-commands:

  • run: collect sgRNA read counts from fastq files, and perform sgRNA and gene ranking.
  • count: only collect sgRNA read counts from fastq files.
  • test: given a table of read counts, perform the sgRNA and gene ranking.
  • pathway: given a ranked gene list, test whether a pathway is enriched.

run

The run sub-command accepts the parameters of both the test and count sub-commands. See those two sub-commands for details.

test

This subcommand tests and ranks sgRNAs and genes based on the read count tables provided.

usage:

usage: mageck test [-h] -k COUNT_TABLE -t TREATMENT_ID [-c CONTROL_ID]
               [-n OUTPUT_PREFIX] [--norm-method {none,median,total}]
               [--normcounts-to-file]
               [--gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD]
               [--adjust-method {fdr,holm}] [--variance-from-all-samples]
               [--sort-criteria {neg,pos}] [--keep-tmp]

required arguments:

  • -k COUNT_TABLE, --count-table COUNT_TABLE: Provide a tab-separated count table instead of sam files. Each line in the table should include the sgRNA name (1st column), the target gene (2nd column), and the read counts in each sample. See the sgRNA read count file section for a detailed description.
  • -t TREATMENT_ID, --treatment-id TREATMENT_ID: Sample labels or sample indices (0 is the first sample) in the count table to use as treatment experiments, separated by commas (,). If sample labels are provided, they must match the labels in the first line of the count table; for example, "HL60.final,KBM7.final". For sample indices, "0,2" means the 1st and 3rd samples are treatment experiments. See the Sample index section for a detailed description.

optional arguments:

  • -h, --help: Show this help message and exit.
  • -c CONTROL_ID, --control-id CONTROL_ID: Sample labels or sample indices in the count table to use as control experiments, separated by commas (,). Default is all the samples not specified as treatment experiments. See the Sample index section for a detailed description.
  • -n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX: The prefix of the output file(s). Default sample1.
  • --norm-method {none,median,total}: Method for normalization. Default median.
  • --normcounts-to-file: Write normalized read counts to file ({output-prefix}.normalized.txt).
  • --gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD: FDR threshold for the gene test. Default 0.05.
  • --adjust-method {fdr,holm}: Method for p-value adjustment, either false discovery rate (fdr) or Holm's method (holm). Default fdr.
  • --variance-from-all-samples: Estimate the variance from all samples, instead of from only control samples. Use this option only if you believe there are relatively few essential sgRNAs or genes between control and treatment samples.
  • --sort-criteria {neg,pos}: Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection.
  • --keep-tmp: Keep intermediate files.
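To make the two choices of --adjust-method concrete, here is a minimal sketch of both adjustments. It assumes that fdr refers to the Benjamini-Hochberg procedure, which the text does not state explicitly, and it is not taken from the MAGeCK source; the function name adjust_pvalues is hypothetical.

```python
def adjust_pvalues(pvalues, method="fdr"):
    """Adjust p-values with Benjamini-Hochberg ("fdr") or Holm ("holm").

    Assumption: "fdr" means Benjamini-Hochberg; this is an illustration,
    not MAGeCK's implementation.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    adjusted = [0.0] * m
    if method == "fdr":
        # Walk from the largest p-value down, keeping the running minimum.
        running_min = 1.0
        for rank in range(m, 0, -1):
            i = order[rank - 1]
            running_min = min(running_min, pvalues[i] * m / rank)
            adjusted[i] = running_min
    elif method == "holm":
        # Walk from the smallest p-value up, keeping the running maximum.
        running_max = 0.0
        for rank in range(1, m + 1):
            i = order[rank - 1]
            candidate = min(1.0, pvalues[i] * (m - rank + 1))
            running_max = max(running_max, candidate)
            adjusted[i] = running_max
    else:
        raise ValueError("method must be 'fdr' or 'holm'")
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(adjust_pvalues(raw, "fdr"))
print(adjust_pvalues(raw, "holm"))
```

Holm's method controls the family-wise error rate and is therefore the more conservative of the two, so its adjusted values are never smaller than the Benjamini-Hochberg ones.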

count

This subcommand collects sgRNA read count information from fastq files. The output count tables can be used directly in the test subcommand.

usage:

usage: mageck count [-h] --fastq FASTQ [-l LIST_SEQ] [-n OUTPUT_PREFIX]
                [--sample-label SAMPLE_LABEL] [--trim-5 TRIM_5]
                [--sgrna-len SGRNA_LEN] [--count-n]

required arguments:

  • --fastq: The fastq files to be counted, separated by commas.

optional arguments:

  • -h, --help: Show this help message and exit.
  • -l LIST_SEQ, --list-seq LIST_SEQ: A file containing the list of sgRNA names, their sequences, and their target genes. See the sgRNA library file section for more details. If this file is not provided, mageck will count all possible sgRNAs in the fastq files.
  • -n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX: The prefix of the output file(s). Default sample1.
  • --sample-label SAMPLE_LABEL: Sample labels, separated by commas. The number of labels must equal the number of fastq files provided. Default "sample1,sample2,...".
  • --trim-5 TRIM_5: Number of bases to trim from the 5' end of the reads. Default 0.
  • --sgrna-len SGRNA_LEN: Length of the sgRNA. Default 20.
  • --count-n: Count sgRNAs with Ns. By default, sgRNAs containing Ns are discarded.
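The way the --trim-5, --sgrna-len, --count-n and -l options interact can be sketched with a toy counter over read sequences. This is an assumption-laden illustration rather than MAGeCK's counting code: count_sgrnas is a hypothetical function, reads are plain sequence strings instead of fastq records, and matching is by exact sequence.

```python
from collections import Counter


def count_sgrnas(reads, library=None, trim_5=0, sgrna_len=20, count_n=False):
    """Count sgRNAs in an iterable of read sequences (illustrative only).

    library maps sgRNA sequence -> sgRNA ID. When it is None, every
    candidate sequence is counted under its own sequence.
    """
    counts = Counter()
    for read in reads:
        candidate = read[trim_5:trim_5 + sgrna_len].upper()
        if len(candidate) < sgrna_len:
            continue  # read too short after trimming
        if "N" in candidate and not count_n:
            continue  # default behaviour: discard sgRNAs containing Ns
        if library is None:
            counts[candidate] += 1
        elif candidate in library:
            counts[library[candidate]] += 1
    return counts


library = {
    "TGTTCACAGTATAGTTTGCC": "s_10007",
    "ACATGTTGCTTCCCCTTGCA": "s_10027",
}
reads = [
    "TGTTCACAGTATAGTTTGCCGTTT",
    "TGTTCACAGTATAGTTTGCCAAAA",
    "ACATGTTGCTTCCCCTTGCAGTTT",
    "NGTTCACAGTATAGTTTGCCGTTT",  # contains an N, dropped by default
]
print(count_sgrnas(reads, library))
```

The two library sequences are taken from the library.txt example in the input file specification; the reads are made up for illustration.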

pathway

MAGeCK can also invoke RRA (robust rank aggregation) to test whether a pathway is enriched in a particular gene ranking.

usage:

usage: mageck pathway [-h] --gene-ranking GENE_RANKING --gmt-file GMT_FILE
                  [-n OUTPUT_PREFIX]

required arguments:

  • --gene-ranking GENE_RANKING: The gene ranking file generated by the gene test step.
  • --gmt-file GMT_FILE: The pathway file in GMT format. See the pathway file (gmt) section for more details on the GMT file format.

optional arguments:

  • -h, --help: Show this help message and exit.
  • --single-ranking: The provided file is a (single) gene ranking file, either positive or negative selection. Only one enrichment comparison will be performed.
  • -n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX: The prefix of the output file(s). Default sample1.
  • --sort-criteria {neg,pos}: Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection.
  • --keep-tmp: Keep intermediate files.



Input file specification

sgRNA read count file

The sgRNA read count file is passed to the -k parameter of the test or run sub-command.

The read count file should list the name of each sgRNA and the gene it targets, followed by the read counts in each sample. Items should be separated by tabs ('\t'). A header line is optional. For example:

sgRNA           gene    HL60.initial    KBM7.initial    HL60.final      KBM7.final
A1CF_m52595977  A1CF    213     274     883     175
A1CF_m52596017  A1CF    294     412     1554    1891
A1CF_m52596056  A1CF    421     368     566     759
A1CF_m52603842  A1CF    274     243     314     855
A1CF_m52603847  A1CF    0       50      145     266

The count sub-command outputs a read count file in this format.
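For downstream scripting, a count table in this layout can be read with a few lines of Python. The sketch below assumes the header line is present (the text says it is optional) and that fields are tab-separated; read_count_table is a hypothetical helper, not part of MAGeCK.

```python
def read_count_table(lines):
    """Parse count-table lines into (sample_labels, {sgRNA: (gene, counts)}).

    Assumes a header line and tab-separated fields.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    samples = header[2:]  # columns after sgRNA and gene are the samples
    table = {}
    for fields in body:
        sgrna, gene = fields[0], fields[1]
        table[sgrna] = (gene, [int(value) for value in fields[2:]])
    return samples, table


example = [
    "sgRNA\tgene\tHL60.initial\tKBM7.initial\tHL60.final\tKBM7.final",
    "A1CF_m52595977\tA1CF\t213\t274\t883\t175",
    "A1CF_m52603847\tA1CF\t0\t50\t145\t266",
]
samples, table = read_count_table(example)
print(samples)
print(table["A1CF_m52603847"])
```

The same function accepts an open file object, since it only iterates over lines.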

Sample index

In the -t/--treatment-id and -c/--control-id parameters, you can use either sample labels or sample indices to specify samples. If sample labels are used, they must match the sample labels in the first line of the count table. For example, "HL60.final,KBM7.final".

You can also use sample index to specify samples. The index of the sample is the order it appears in the sgRNA read count file, starting from 0. The index is used in the -t/--treatment-id, -c/--control-id parameters. In the example above, there are four samples, and the index of each sample is as follows:

sample        index
HL60.initial  0
KBM7.initial  1
HL60.final    2
KBM7.final    3
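The mapping between labels and indices can be sketched as a small resolver. This is a hypothetical illustration and not MAGeCK's parser: resolve_samples treats any all-digit item as an index (an assumption that would misread a purely numeric label), and default_controls mirrors the documented default that controls are all samples not listed as treatment.

```python
def resolve_samples(spec, labels):
    """Turn "2,3" or "HL60.final,KBM7.final" into a list of sample indices."""
    indices = []
    for item in spec.split(","):
        item = item.strip()
        if item.isdigit():  # assumption: digit-only items are indices
            index = int(item)
            if index >= len(labels):
                raise ValueError("sample index out of range: %s" % item)
        elif item in labels:
            index = labels.index(item)
        else:
            raise ValueError("unknown sample label: %s" % item)
        indices.append(index)
    return indices


def default_controls(treatment, labels):
    """All samples that are not treatment samples."""
    return [i for i in range(len(labels)) if i not in treatment]


labels = ["HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"]
treatment = resolve_samples("HL60.final,KBM7.final", labels)
print(treatment)
print(default_controls(treatment, labels))
```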

sgRNA library file

When starting from fastq files, MAGeCK needs to know the sequence of each sgRNA and the gene it targets. This information is provided in the sgRNA library file, which is specified with the -l/--list-seq option of the run or count sub-command.

There are three columns in the library file: the sgRNA ID, the sequence, and the gene it is targeting. One example of the library file is provided as library.txt in demo2:

s_10007 TGTTCACAGTATAGTTTGCC    CCNA1
s_10008 TTCTCCCTAATTGCTTGCTG    CCNA1
s_10027 ACATGTTGCTTCCCCTTGCA    CCNC
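A library file with these three columns can be loaded into a lookup table keyed by sequence. The sketch below is a hypothetical reader, not MAGeCK code; it splits on any whitespace and silently skips lines that do not have exactly three fields, both of which are assumptions.

```python
def read_library(lines):
    """Map sgRNA sequence -> (sgRNA ID, gene) from three-column lines."""
    library = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        sgrna_id, sequence, gene = fields
        library[sequence.upper()] = (sgrna_id, gene)
    return library


example = [
    "s_10007 TGTTCACAGTATAGTTTGCC    CCNA1",
    "s_10008 TTCTCCCTAATTGCTTGCTG    CCNA1",
    "s_10027 ACATGTTGCTTCCCCTTGCA    CCNC",
]
library = read_library(example)
print(library["ACATGTTGCTTCCCCTTGCA"])
```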

pathway file (gmt)

The GMT file format stores pathway information and is consistent with the GMT format used in Gene Set Enrichment Analysis (GSEA). Details of the GMT format can be found on the GSEA website.

You can also download pathway files directly from the GSEA MSigDB database. They can be used by MAGeCK without modification.
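In the GMT format, each line holds a gene set name, a description, and then the member genes, all separated by tabs. A minimal reader might look like the sketch below; read_gmt is a hypothetical helper, and the two gene sets in the example are invented placeholders rather than real MSigDB entries.

```python
def read_gmt(lines):
    """Parse GMT lines into {gene_set_name: (description, [genes])}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # need a name, a description and at least one gene
        name, description, genes = fields[0], fields[1], fields[2:]
        gene_sets[name] = (description, [g for g in genes if g])
    return gene_sets


example = [
    "DEMO_SET_A\tillustrative description\tRPL3\tRPL8\tRPS2",
    "DEMO_SET_B\tna\tCCNA1\tCCNC",
]
gene_sets = read_gmt(example)
print(gene_sets["DEMO_SET_A"])
```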

sgRNA/gene mapping file (deprecated after version 0.3)

The sgRNA/gene mapping file is passed to the --gene-test parameter of the test or run sub-command.

This file should list the names of the sgRNAs and their corresponding genes, separated by tabs ('\t'). For example:

A1CF_m52595977  A1CF
A1CF_m52596017  A1CF
A1CF_m52596056  A1CF
A1CF_m52603842  A1CF
A1CF_m52603847  A1CF
A1CF_p52595870  A1CF
A1CF_p52595881  A1CF
A1CF_p52596023  A1CF



Output file specification

The output of MAGeCK consists of the sgRNA summary, gene summary, pathway summary, and log files described in the sections below.

Other file formats are intermediate files, including:

  • .gene.high.txt: The gene ranking results (positively selected genes).
  • .gene.low.txt: The gene ranking results (negatively selected genes).

sgrna_summary.txt

An example of the sgRNA ranking results is as follows:

sgrna   Gene   control_count   treatment_count control_mean    treat_mean      control_var     adj_var score   p.low   p.high  p.twosided      FDR     high_in_treatment
INO80B_m74682554   INO80B        0.0/0.0 1220.1598778/1476.14096301      0.810860655738  1348.15042041   0.0     19.0767988005   308.478081895   1.0     1.11022302463e-16       2.22044604925e-16       1.57651669497e-14       True
NHS_p17705966   NHS   1.62172131148/3.90887850467     2327.09368635/1849.95115143     2.76529990807   2088.52241889   2.6155440132    68.2450168229   252.480744404   1.0     1.11022302463e-16       2.22044604925e-16       1.57651669497e-14       True

The contents of each column are as follows.

  • sgrna: sgRNA ID
  • Gene: The targeting gene
  • control_count: Normalized read counts in control samples
  • treatment_count: Normalized read counts in treatment samples
  • control_mean: Mean read counts in control samples
  • treat_mean: Mean read counts in treatment samples
  • control_var: The raw variance in control samples
  • adj_var: The adjusted variance in control samples
  • score: The score of this sgRNA
  • p.low: p-value (lower tail)
  • p.high: p-value (higher tail)
  • p.twosided: p-value (two sided)
  • FDR: False discovery rate
  • high_in_treatment: Whether the abundance is higher in treatment samples
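A row of the sgRNA summary can be turned into typed values for filtering or plotting. The sketch below is a hypothetical parser, not part of MAGeCK. It assumes tab-separated fields in the column order of the example, and it assumes the per-sample counts are joined by "/", which is inferred from the example rows rather than stated in the text.

```python
SGRNA_COLUMNS = [
    "sgrna", "Gene", "control_count", "treatment_count", "control_mean",
    "treat_mean", "control_var", "adj_var", "score", "p.low", "p.high",
    "p.twosided", "FDR", "high_in_treatment",
]
FLOAT_COLUMNS = SGRNA_COLUMNS[4:13]  # control_mean through FDR


def parse_sgrna_row(line):
    """Convert one tab-separated summary line into a dict of typed values."""
    row = dict(zip(SGRNA_COLUMNS, line.rstrip("\n").split("\t")))
    for key in ("control_count", "treatment_count"):
        # Assumption: one value per sample, separated by "/".
        row[key] = [float(value) for value in row[key].split("/")]
    for key in FLOAT_COLUMNS:
        row[key] = float(row[key])
    row["high_in_treatment"] = (row["high_in_treatment"] == "True")
    return row


line = "\t".join([
    "NHS_p17705966", "NHS", "1.62172131148/3.90887850467",
    "2327.09368635/1849.95115143", "2.76529990807", "2088.52241889",
    "2.6155440132", "68.2450168229", "252.480744404", "1.0",
    "1.11022302463e-16", "2.22044604925e-16", "1.57651669497e-14", "True",
])
row = parse_sgrna_row(line)
print(row["Gene"], row["control_count"], row["FDR"], row["high_in_treatment"])
```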

gene_summary.txt

An example of the gene summary file (.gene_summary.txt) is as follows:

id      num.neg p.neg   fdr.neg rank.neg        num.pos p.pos   fdr.pos rank.pos
RPL18A  6       6.9363e-11      0.009901        2       6       1.0     1.0     21813
CASP8AP2        6       6.9916e-07      0.029703        3       6       1.0     1.0     21812
NBPF24  3       6.9959e-07      0.029703        4       3       1.0     1.0     21811

The contents of each column are as follows.

  • id: Gene ID
  • num.neg: The number of targeting sgRNAs for each gene in negative selection
  • p.neg: The raw p-value of this gene in negative selection
  • fdr.neg: The false discovery rate of this gene in negative selection
  • rank.neg: The ranking of this gene in negative selection
  • num.pos: The number of targeting sgRNAs for each gene in positive selection (usually the same as num.neg)
  • p.pos: The raw p-value of this gene in positive selection
  • fdr.pos: The false discovery rate of this gene in positive selection
  • rank.pos: The ranking of this gene in positive selection

Genes are ranked by the p.neg field by default. To rank by p.pos instead, use the --sort-criteria option.
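An existing gene summary can also be re-ordered or filtered after the fact without rerunning MAGeCK. The sketch below is a hypothetical post-processing helper that sorts on the rank.neg or rank.pos column and filters on the matching FDR column; it assumes whitespace-separated fields with the header shown in the example.

```python
def read_gene_summary(lines):
    """Parse gene summary lines into a list of dicts with typed values."""
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    genes = []
    for fields in body:
        row = dict(zip(header, fields))
        for key in ("num.neg", "rank.neg", "num.pos", "rank.pos"):
            row[key] = int(row[key])
        for key in ("p.neg", "fdr.neg", "p.pos", "fdr.pos"):
            row[key] = float(row[key])
        genes.append(row)
    return genes


def rank_genes(genes, criteria="neg", fdr_threshold=1.0):
    """Gene IDs ordered by rank.<criteria>, keeping fdr.<criteria> <= threshold."""
    kept = [g for g in genes if g["fdr." + criteria] <= fdr_threshold]
    return [g["id"] for g in sorted(kept, key=lambda g: g["rank." + criteria])]


example = [
    "id num.neg p.neg fdr.neg rank.neg num.pos p.pos fdr.pos rank.pos",
    "RPL18A 6 6.9363e-11 0.009901 2 6 1.0 1.0 21813",
    "CASP8AP2 6 6.9916e-07 0.029703 3 6 1.0 1.0 21812",
    "NBPF24 3 6.9959e-07 0.029703 4 3 1.0 1.0 21811",
]
genes = read_gene_summary(example)
print(rank_genes(genes, "neg"))
print(rank_genes(genes, "pos"))
print(rank_genes(genes, "neg", fdr_threshold=0.01))
```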

pathway_summary.txt

The output of the pathway summary is similar to the gene summary. Here is an example:

id      num.neg p.neg   fdr.neg rank.neg        num.pos p.pos   fdr.pos rank.pos
KEGG_RIBOSOME   87      4.4271e-25      0.0011  1       87      0.99995 1.0     187
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM 32      0.025868        0.663079        2      32      0.99364 1.0     186

log

This file contains the logging information recorded during execution.

Intermediate file formats

gene_txt

An example of the gene ranking file (.gene.high.txt or .gene.low.txt) is as follows:

 group_id        #_items_in_group        lo_value        FDR
 RPL3    93      4.9169e-36      0.000080
 RPL8    67      1.8232e-24      0.000080
 RPS2    61      1.6928e-20      0.000080
 RPS18   40      1.0152e-18      0.000080

The contents of each column are as follows.

  • group_id: Gene ID
  • #_items_in_group: The number of targeting sgRNAs for each gene
  • lo_value: The raw p-value
  • FDR: The false discovery rate



Version history

0.4

2014.11.13 Version 0.4

  • Added the BSD license information.
  • Improved the logging system.
  • The control_id and treatment_id options can now be specified using sample labels.
  • Merged positive selection and negative selection genes and pathways into one file.
  • Added the --keep-tmp option to control whether intermediate files are kept after running.
  • Fixed one bug in FDR calculation.

0.3

2014.07.01 Version 0.3

  • The installation method is changed so users can now more easily install the software.
  • Added a new feature to detect enriched pathways (pathway command)
  • Changed the input format of the program:
    • The second column of the count table (generated by the count subcommand and used by the test subcommand) is now the gene name.
    • For the count subcommand, the sgRNA information is provided with the library file.

0.2

2014.04.17 Version 0.2

  • Updated the demo and wiki page

0.1

2014.04.04 Version 0.1

  • The source code released.