MAGeCK Wiki

Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout

Brought to you by: davidliwei

Home

Introduction

Note: Try MAGeCK without code on the Galaxy platform or Latch! 🧙

Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout (MAGeCK) is a computational tool for identifying important genes from genome-scale CRISPR-Cas9 knockout (GeCKO) screens. MAGeCK was developed by Wei Li and Han Xu in Dr. Xiaole Shirley Liu's lab at Dana-Farber Cancer Institute, and is actively maintained by the Wei Li lab at Children's National Medical Center. The MAGeCK algorithm is described in the following paper:

Li, et al. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biology 15:554 (2014)

Besides MAGeCK, we have also developed the following software and algorithms:

MAGeCK-VISPR, a comprehensive quality control, analysis and visualization workflow for CRISPR/Cas9 screens.
MAGeCKFlute, an integrative analysis pipeline for pooled CRISPR functional genetic screens.
scMAGeCK, a computational model to identify genes associated with multiple expression phenotypes from CRISPR screening coupled with single-cell RNA sequencing data.

MAGeCK and the associated software offer a range of functions to meet the analysis needs of different users, including:

Different installation and running options (source code, bioconda or Docker);
Simple treatment vs. control analysis (via MAGeCK RRA) and multiple sample comparison analysis (via MAGeCK MLE);
Different levels of visualization modules (MAGeCK R markdown, web-based MAGeCK-VISPR, and MAGeCKFlute R package);
Starting from either raw fastq files (via MAGeCK count), raw count table, normalized count table (via MAGeCK RRA or MLE), or even sgRNA ranks (via RRA);
Various normalization approaches including custom negative control guides/genes;
Paired sample analysis (RRA only);
Copy-number variation (CNV) correction with or without known CNV profiles of cells;
More complicated experimental designs, including time-series and drug-treatment CRISPR screens, using MAGeCK MLE;

and so on.

MAGeCK and MAGeCK-VISPR are free, open-source software released under the BSD license. We greatly appreciate the support from The Claudia Adams Barr Program in Innovative Basic Cancer Research and NHGRI (NIH) to develop MAGeCK and MAGeCK-VISPR.

We have been using MAGeCK/MAGeCK-VISPR in many screening projects, including the identification of functional lncRNAs with a validation rate close to 100% (Zhu, Li, et al. Nature Biotechnology 2016), resistance mechanisms to T cell killing (Pan et al. Science 2018), the function of RNA binding proteins in prostate cancer (Fei et al. PNAS 2017), etc.

Many independent studies have also used MAGeCK/MAGeCK-VISPR to analyze their CRISPR/Cas9 coding and non-coding screening data.

Any questions about MAGeCK or MAGeCK-VISPR? Check the FAQ below or join our MAGeCK Google group.

Refer to our Nature Protocols paper for running MAGeCK suites!

MAGeCK Manual

This documentation includes the following items:

Installation
Tutorial
More complicated tutorials
- Allow mismatches from read mapping
- Correct copy number variation effect
- Run MAGeCK on Docker
- Make full use of MLE design matrix for more complicated experimental designs (e.g., paired samples, time series)
- Include sgRNA efficiency estimation from MLE
Usage
Running MAGeCK as an expert
- Provide negative control guides
- Paired comparisons
Visualization
File formats (input)
File formats (output)
Commonly used libraries
Frequently Asked Questions
Version history

Installation


There are several ways to install MAGeCK.

Method 1: through conda/bioconda channel

To install MAGeCK through the bioconda channel, first download and install the Python 3 variant of the Miniconda Python distribution. Then, in the command line, type

conda install -c bioconda -c conda-forge mageck

That's it!

Optionally (but recommended), create an isolated software environment for mageck by executing

conda create -c bioconda -c conda-forge -n mageckenv mageck

in a terminal. The environment can be activated via

conda activate mageckenv

(conda versions older than 4.4 use source activate mageckenv instead). To update mageck, run

conda update mageck

from within the environment.

This environment can be deactivated via

conda deactivate

You can also install MAGeCK-VISPR (which includes MAGeCK) using conda, a commonly used package management software. The installation instructions can be found in the MAGeCK-VISPR manual.

Method 2: through Docker Image

You can also run MAGeCK via a Docker image, which is automatically built upon each commit to our Bitbucket source code repository.

To run MAGeCK through the Docker image, install Docker on your system and follow the instructions in the tutorial on running Docker images.

Method 3: through source code

You can also download the software and install it by yourself. See the detailed instructions below.

Download

The latest version of MAGeCK (0.5.9) can be downloaded here:

Or click the link here (in case the button points to a wrong file).

For earlier versions (< 0.5.4), the zip file is encrypted, but you can get the password easily by one of the following options:

Join our MAGeCK Google group and the password is shown on the top of the forum; or
Send an email to mageck.help@gmail.com with the subject "password", and you will get an automatic reply with the password within 1 minute.

You need to go to the Terminal to unzip and install the software. See the instructions for Installation below.

System requirement

MAGeCK can be run on either a Mac or a Linux system. Since MAGeCK is written in Python and C, Python (version 3) and a C compiler are needed.

Other dependencies include numpy and scipy.

Due to the end of the Python 2 life cycle, mageck 0.5.9 and higher versions are not designed to run on Python 2.

Optional dependencies

Since version 0.5.9.3, MAGeCK's visualization module generates an R Markdown file (.Rmd) for the count and test subcommands. This allows users to easily create an HTML report using RStudio. No additional dependencies are needed to run MAGeCK itself; however, generating the report requires a computer with RStudio and the rmarkdown package installed.

The --pdf-report option, which was the main visualization route before 0.5.9.3, requires two optional programs: R and pdflatex. MAGeCK relies on both to generate PDF reports when --pdf-report is used. If you cannot install them, you can still generate PDF reports by copying some MAGeCK output files to another computer on which R and pdflatex are properly installed. See Q and A for more information.

If you use the --pdf-report option, the R package xtable is required, while gplots and ggplot2 are optional. Use install.packages("xtable") and install.packages(c("gplots","ggplot2")) in R to install them.

You won't get any error messages if gplots is missing, but with gplots installed you will get a nicer clustering figure in the PDF report of the count command.

You can also run MAGeCK without the --pdf-report option and copy some files to another machine that has these R packages to generate the PDF report. See Q and A for more details.

What if I run into LaTeX issues?

You can still get some figures generated from MAGeCK, by adding the "--keep-tmp" option to keep intermediate files.

Installation

Since version 0.3, MAGeCK uses standard Python installation procedures (distutils) for compiling and installation of the software.

The installation procedure is extremely easy. First, download the source code, unzip it by using the following command (or just double-clicking it), and go into the directory in the command line:

tar xvzf mageck-0.5.4.tar.gz
cd mageck-0.5.4

After that, invoke python setup.py:

python setup.py install

And it is done! If you want MAGeCK to be installed in your own directory, use the following command instead:

python setup.py install --user

This is the easiest way to install mageck. An alternative approach is (you may have one additional step to set up the environment variables; see below)

python setup.py install --prefix=$HOME

where $HOME is the root directory in which you want to install MAGeCK (usually your home directory).

Manual installation

The manual installation is deprecated since version 0.3. Please refer to the installation instructions above.

After downloading the source code, follow the instructions below for manual installation.

Setting up the environment variables

In most systems you don't need to set up the environment variables. Just type "mageck" in the command line to see if the mageck program works.

If you get a "command not found" error, that indicates the environment variables are not properly set up. There are several additional steps to finish the installation. First you need to add the path of the mageck program to your PATH variable.

There are several different situations.

1. If you use the --prefix=$HOME option during installation

Set up the PATH variable by typing:

export PATH=$PATH:$HOME/bin

2. If you use the --user option during installation (and other situations)

You first need to determine where MAGeCK is installed. See this Q and A for additional steps to determine the correct bin directory.

If your bin directory is located in /Users/john/.local/bin, then type the following:

export PATH=$PATH:/Users/john/.local/bin

Setting up PYTHONPATH variable

You may also need to add the path of the MAGeCK module to the PYTHONPATH variable. Again, follow the steps above to determine the correct Python installation path (see the Q&A). The variable should be set as, for example (replace python3.9 with the Python 3 version you actually use),

export PYTHONPATH=/Users/john/.local/lib/python3.9/site-packages:$PYTHONPATH
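If you are unsure which directories apply on your machine, Python itself can report them. The following is a minimal sketch using only the standard library. It assumes that the interpreter running it is the same one used for "python setup.py install --user"; the helper name report_install_paths is hypothetical and not part of MAGeCK.

```python
import os
import shutil
import site


def report_install_paths(command="mageck"):
    """Report where a --user install places scripts and modules."""
    user_base = site.getuserbase()  # for example ~/.local on Linux
    return {
        # Full path of the command if it is already on PATH, otherwise None.
        "on_path": shutil.which(command),
        # Candidate directory to append to PATH.
        "user_bin": os.path.join(user_base, "bin"),
        # Candidate directory to add to PYTHONPATH.
        "user_site": site.getusersitepackages(),
    }


if __name__ == "__main__":
    for key, value in report_install_paths().items():
        print(f"{key}: {value}")
```

If on_path is None, the user_bin value is the first directory to try in the export PATH line, and user_site is the first to try in the export PYTHONPATH line.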

Save your PATH or PYTHONPATH variable

To save the path configuration (so you don't have to type it every time), place the above command in your ~/.bashrc (for Linux) or ~/.bash_profile (for Mac).

Experimental version

The experimental version of MAGeCK is available on Bitbucket. Note that the source code on Bitbucket is experimental and not fully tested, so it may not be stable or function well. It is strongly recommended to use the MAGeCK software downloaded from SourceForge or from bioconda.

Return to [Home]

Tutorial

Tutorial
Advanced tutorial

Running MAGeCK is extremely easy and convenient. The demo folder contains two mini examples to go through all steps in MAGeCK. Simply execute the sh script in the command line in each example to run the demos. To see how you can enable visualization functions of MAGeCK in both demos, see the visualization manual.

Some advanced tutorial topics can be found in the Advanced Tutorial page.

Also check out the following videos on YouTube to learn how to install and run MAGeCK:

Tutorial 1: Installation

Tutorial 2: Comparison between samples

The first tutorial: starting from read count tables

Check demo/demo1 folder in the source code for the first tutorial.

There is only one command line in the tutorial:

mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial  -n demo

The parameters are explained as follows.

Parameters	Meaning
mageck	The main portal of the MAGeCK program
test	A sub-command to ask MAGeCK to perform sgRNA and gene ranking based on provided read count tables
-k sample.txt	The provided read count table file. The format of the file is specified here.
-t HL60.final,KBM7.final	The treatment samples are HL60.final and KBM7.final (sample indices 2 and 3, counting from 0) in sample.txt. See input files for a detailed explanation.
-c HL60.initial,KBM7.initial	The control samples are HL60.initial and KBM7.initial (sample indices 0 and 1, counting from 0) in sample.txt. See input files for a detailed explanation.
-n demo	The prefix of the output files is demo, so you will expect the output files are: demo.sgrna_summary.txt, demo.gene_summary.txt, etc.

An explanation of the output files can be found in the [output] page. For all available parameters, see the [usage] page.

You can also specify the treatment and control samples by sample index (starting from 0). For example,

mageck test -k sample.txt -t 2,3 -c 0,1 -n demo
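The relation between sample labels and sample indices can be checked with a few lines of Python. This is only a sketch: the header line below is an assumption chosen to be consistent with the indices used in the command above, and labels_to_indices is a hypothetical helper, not a MAGeCK function.

```python
def labels_to_indices(header_line, labels):
    """Map sample labels to 0-based sample indices.

    The first two columns of a count table (sgRNA and gene) are not samples,
    so index 0 refers to the third column of the header.
    """
    samples = header_line.rstrip("\n").split("\t")[2:]
    return [samples.index(label) for label in labels]


# Assumed header line of the read count table.
header = "sgRNA\tGene\tHL60.initial\tKBM7.initial\tHL60.final\tKBM7.final"

treatment = labels_to_indices(header, ["HL60.final", "KBM7.final"])
control = labels_to_indices(header, ["HL60.initial", "KBM7.initial"])
print("-t", ",".join(map(str, treatment)), "-c", ",".join(map(str, control)))
```

A label that is missing from the header raises a ValueError, which is a quick way to catch typos before running mageck test.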

The second tutorial: starting from raw fastq files

Check the demo/demo2 folder in the source code for this tutorial.

This demo shows a mini example of how to go through the whole pipeline from raw fastq files. In this example, we have fastq files from two conditions, and we would like to find out which genes and sgRNAs differ significantly between the conditions. The command line used in the runmageck.sh script is:

mageck count -l library.txt -n demo --sample-label L1,CTRL  --fastq test1.fastq test2.fastq 
mageck test -k demo.count.txt -t L1 -c CTRL -n demo

The "test" command is the same as the first demo. The parameters of the "count" command are explained as follows.

Parameters	Meaning
mageck	The main portal of the MAGeCK program
count	A sub-command to ask MAGeCK to generate sgRNA read count table.
-l library.txt	The provided sgRNA information, including the sgRNA id, the sequence, and the gene it is targeting. See input files for a detailed explanation.
-n demo	The prefix of the output files.
--sample-label L1,CTRL	The labels of the two samples are L1 (test1.fastq) and CTRL (test2.fastq).
--fastq test1.fastq test2.fastq	The provided fastq files, separated by spaces. (Technical replicates of the same sample can also be indicated using a comma as the separator; for example, "sample1_replicate1.fastq,sample1_replicate2.fastq")
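When the count step is scripted, it can help to assemble the command from a mapping of sample labels to fastq files, so that labels and files cannot get out of order. The sketch below only builds the argument list and does not run MAGeCK; build_count_command is a hypothetical helper, and it uses only the options described in the table above.

```python
import shlex


def build_count_command(library, prefix, samples):
    """Build the argument list for `mageck count`.

    `samples` maps each sample label to a list of fastq files. Several files
    under one label are treated as technical replicates and joined by commas.
    """
    if any(not files for files in samples.values()):
        raise ValueError("every sample needs at least one fastq file")
    labels = list(samples)
    fastq_args = [",".join(files) for files in samples.values()]
    return [
        "mageck", "count",
        "-l", library,
        "-n", prefix,
        "--sample-label", ",".join(labels),
        "--fastq", *fastq_args,
    ]


cmd = build_count_command(
    "library.txt", "demo",
    {"L1": ["test1.fastq"], "CTRL": ["test2.fastq"]},
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once MAGeCK is installed.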

The third tutorial: going through a public CRISPR/Cas9 screening dataset

After the first two demos, you have a basic sense of how MAGeCK works. In this demo, let us go through a real dataset which is more complicated, and see how to handle some practical problems, like the trimming of the 5' end.

The dataset we use comes from the following paper: Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library. In this paper, the authors performed CRISPR/Cas9 screening in mouse embryonic stem cells (ESCs) and identified genes that are essential in these cells.

Step 1: download the fastq file

The fastq files of the screens are publicly available in the ENA archive. There are several replicates for each condition, but for simplicity, let us download only the following two fastq files and use them to test MAGeCK functions.

Accession	Sample	Download Link
ERR376998	one replicate of plasmid	ERR376998
ERR376999	one replicate of ESC	ERR376999

You can download these files, double click to unzip them (or use gunzip in the terminal), and place them into one separate folder:

gunzip ERR376998.fastq.gz
gunzip ERR376999.fastq.gz

Step 2: prepare the library file

The next step is to prepare the library file so MAGeCK will know which sgRNA targets which gene. If you are using one of the standard GeCKO libraries, you can just download the files from MAGeCK sourceforge. For non-standard libraries, you need to prepare the library file according to the library file format.

In this demo, you can generate the library file using Supplementary Data 2 (or Supplementary Table S7) from the paper, or download it directly from our collection of libraries (the file name is "yusa_library.csv.zip"). Double click to unzip it (or use "unzip" in the terminal).

(Optional) Step 3: determine the trimming length and sgRNA length

**Note: since version 0.5.6, MAGeCK can automatically determine the trimming length and sgRNA length in most cases. Therefore, you can skip this step unless MAGeCK fails to do so by itself.**

In many cases, your sequencing primer is not exactly in front of the first base of the guide RNA. This is indeed the case in this demo, where the first few bases of every read in the fastq file are identical. Make sure you know exactly how many bases to trim before running MAGeCK. You can ask the people who ran the experiment, or get this information by looking at the first few lines of the fastq files.

Here are the first few lines of ERR376998.fastq (only sequences are shown):

CTTGTGGAAAGGACGAAACACCGGTGAAGGTGCCGTTGTGTAGTTTTAGA
CTTGTGGAAAGGACGAAACACCGAGCAGCACAACAATATGGGTTTTAGAG
CTTGTGGAAAGGACGAAACACCGCTCTTGGGTTTGGATGTTTGTTTTAGA
CTTGTGGAAAGGACGAAACACCGTTTGGCGAGGGGAGCGCCGGTTTTAGA
......

You can see that the first 23 nucleotides are identical, so in this case you need to tell MAGeCK to trim the first 23 nucleotides to collect read counts (--trim-5 23). If the nucleotide length in front of sgRNA varies between different reads, use cutadapt to remove the adaptor sequences.
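The same check can be done programmatically. The sketch below computes the longest prefix shared by a set of reads; the helper names are hypothetical, and first_sequences assumes an uncompressed fastq file with the usual four-line records.

```python
import os


def first_sequences(fastq_path, limit=1000):
    """Collect up to `limit` read sequences from an uncompressed fastq file.

    The sequence is the second line of each four-line fastq record.
    """
    sequences = []
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:
                sequences.append(line.strip())
                if len(sequences) == limit:
                    break
    return sequences


def common_prefix_length(sequences):
    """Length of the longest prefix shared by all sequences."""
    return len(os.path.commonprefix(sequences))


# The four reads shown above.
reads = [
    "CTTGTGGAAAGGACGAAACACCGGTGAAGGTGCCGTTGTGTAGTTTTAGA",
    "CTTGTGGAAAGGACGAAACACCGAGCAGCACAACAATATGGGTTTTAGAG",
    "CTTGTGGAAAGGACGAAACACCGCTCTTGGGTTTGGATGTTTGTTTTAGA",
    "CTTGTGGAAAGGACGAAACACCGTTTGGCGAGGGGAGCGCCGGTTTTAGA",
]
print(common_prefix_length(reads))  # 23
```

On a real file, common_prefix_length(first_sequences("ERR376998.fastq")) gives the same kind of estimate. Treat the result as a hint rather than a definitive answer: a single read with a sequencing error in the constant region shortens the shared prefix, and a very small sample of reads may share additional bases by chance.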

The sgRNA length can be determined from the experimental design. It is usually 20 nucleotides, but in this demo, the sgRNA length is 19.

Step 4: run the MAGeCK count command

Now we have everything ready to generate count tables from MAGeCK. Place two fastq files and one library file into the same directory, and under that directory, run MAGeCK on terminal:

mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --fastq ERR376998.fastq  ERR376999.fastq

This command also tells MAGeCK to assign labels to each library ("plasmid" for ERR376998.fastq, and "ESC1" for ERR376999.fastq), and output the file with prefix "escneg". Note that MAGeCK will automatically determine the length of the sgRNAs from the library, so you don't have to specify it here.

If it is running successfully, you will see one file "escneg.count.txt" collecting all read counts. The top lines are as follows:

sgRNA   Gene    plasmid ESC1
chr19:5884430-5884453   SLC25A45        13      32
chr11:58831475-58831498 OLFR312 94      108
chr4:49282352-49282375  E130309F12RIK   85      128

If you use the --pdf-report option (see Visualization), it will generate a nice PDF report of the sample statistics of the fastq files. Click Here to see the PDF results.

If you want to manually use the --trim-5 option determined in step 3, the command becomes:

mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --trim-5 23 --fastq ERR376998.fastq  ERR376999.fastq

Step 5: compare samples using MAGeCK test subcommand

With the read count table, now you can compare ESC1 vs. plasmid condition to see which genes are negatively or positively selected:

mageck test -k escneg.count.txt -t ESC1 -c plasmid -n esccp

This command tells MAGeCK to compare ESC1 with plasmid in the read count table escneg.count.txt, and output results with prefix "esccp".

If successful, you should see a file "esccp.gene_summary.txt". The top lines are as follows:

id      num     neg|score  neg|p-value   neg|fdr neg|rank        neg|goodsgrna   pos|score  pos|p-value   pos|fdr pos|rank  pos|goodsgrna
GTF2B   5       2.0462e-10      2.5851e-07      0.000707        1       5       1.0     1.0     1.0     19150   0
RPS5    5       5.9353e-10      2.5851e-07      0.000707        2       5       1.0     1.0     1.0     19149   0
RPL19   4       2.695e-09       2.5851e-07      0.000707        3       4       1.0     1.0     1.0     19148   0
KIF18B  5       1.0136e-08      2.5851e-07      0.000707        4       5       1.0     1.0     1.0     19146   0
....

You can immediately see that two ribosomal genes, RPS5 and RPL19, are at the top of the negatively selected genes. If you rank the genes by "pos|rank" (the 11th column), you will see TRP53 (the mouse homolog of TP53) near the top of the positively selected genes:

sort -k 11,11n esccp.gene_summary.txt | less

id      num     neg|score  neg|p-value   neg|fdr neg|rank        neg|goodsgrna   pos|score  pos|p-value   pos|fdr pos|rank  pos|goodsgrna
ZFP945  5       1.0     1.0     0.999999        19150   0       9.6166e-07      5.4287e-06      0.05198 1  5
TRP53   5       0.95411 0.95409 0.999999        17901   0       1.0347e-06      5.4287e-06      0.05198 2  4
PDAP1   5       0.85937 0.86223 0.999999        15753   1       7.6412e-06      2.8178e-05      0.174505  3       2
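The same ranking can be done in Python, which keeps the header line in place and selects the column by name instead of by position. This is a minimal sketch: sort_by_column is a hypothetical helper, the rows are copied from the two tables above, and splitting on whitespace assumes that no field contains spaces.

```python
def sort_by_column(lines, column="pos|rank"):
    """Sort gene_summary rows by a numeric column; the header is returned separately."""
    header = lines[0].split()
    index = header.index(column)
    rows = [line.split() for line in lines[1:] if line.strip()]
    rows.sort(key=lambda fields: float(fields[index]))
    return header, rows


summary = """\
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna
GTF2B 5 2.0462e-10 2.5851e-07 0.000707 1 5 1.0 1.0 1.0 19150 0
RPS5 5 5.9353e-10 2.5851e-07 0.000707 2 5 1.0 1.0 1.0 19149 0
ZFP945 5 1.0 1.0 0.999999 19150 0 9.6166e-07 5.4287e-06 0.05198 1 5
TRP53 5 0.95411 0.95409 0.999999 17901 0 1.0347e-06 5.4287e-06 0.05198 2 4
""".splitlines()

header, rows = sort_by_column(summary)
for fields in rows:
    print(fields[0], fields[header.index("pos|rank")])
```

For the real file, pass open("esccp.gene_summary.txt").read().splitlines() as the lines argument.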

As with the count command, if you use the --pdf-report option, a nice PDF file will be generated. Here is the example of the PDF file generated in this demo.

Final remarks

By now you should be quite familiar with the basic functions of MAGeCK. MAGeCK also provides additional functions for you to further explore the data, for example, testing the enrichment of pathways, plotting the top-ranked genes or genes you are interested in, etc. If you have further questions, feel free to ask in our Google group. Enjoy your MAGeCK trip!

The fourth tutorial: using MAGeCK mle module

Since version 0.5, MAGeCK provides a new subcommand, mle, to calculate gene essentiality from CRISPR screens. Compared with the original algorithm in the "test" subcommand, MAGeCK-mle uses a measurement called the beta score to call gene essentiality: a positive beta score means a gene is positively selected, and a negative beta score means a gene is negatively selected. The beta score is analogous to the log fold change in differential expression analysis. Compared with the original RRA algorithm, it has the following advantages:

It has only one score for one gene, instead of two scores in RRA: one for positive selection, one for negative selection;
It allows a direct comparison across multiple conditions, or even experiments;
It is able to incorporate sgRNA efficiency information.

This demo will help you go through all the steps in running the mle module.

**The demo/demo3 folder provides an example for running MAGeCK MLE, plus an optional copy number correction module (see advanced tutorials section).**

Step 1: download the count table

For simplicity, let's assume you already know how to generate read count table from fastq files; if not, check the third demo above. We will use the read count table presented in T Wang et al. Science 2014.

Download the read count table here.

Step 2: prepare the design matrix file

The design matrix file indicates which sample is affected by which condition. It is generally a binary matrix in which each row is a sample (named in the first column) and each column is a condition (named in the first row). For the meanings of the design matrix, check the input file format page.

To create a design matrix file, copy the following content into a text editor, and save it as a plain txt file:

Samples        baseline        HL60        KBM7
HL60.initial   1               0           0
KBM7.initial   1               0           0
HL60.final     1               1           0
KBM7.final     1               0           1

Remember the following rules of a design matrix file:

The design matrix file must include a header line of condition labels;
The first column is the sample labels that must match sample labels in read count file;
The second column must be a "baseline" column that sets all values to "1";
The element in the design matrix is either "0" or "1";
You must have at least one sample of "initial state" (e.g., day 0 or plasmid) that has only one "1" in the corresponding row. That only "1" must be in the baseline column.

In the design matrix above, we have four samples, two corresponding to the initial states of two cell lines, and two corresponding to the final states of two cell lines. We design two conditions (HL60 and KBM7) that model the cell type-specific effects.
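The rules listed above can be checked automatically before running MAGeCK. The following sketch is not MAGeCK's own validation; check_design_matrix is a hypothetical helper that simply encodes those rules and reports every violation it finds.

```python
def check_design_matrix(text, count_samples=None):
    """Return a list of rule violations; an empty list means no problem was found."""
    table = [line.split() for line in text.strip().splitlines()]
    header, rows = table[0], table[1:]
    if len(header) < 2:
        return ["header must contain a sample column and a baseline column"]
    problems = []
    for row in rows:
        sample, values = row[0], row[1:]
        if len(row) != len(header):
            problems.append(f"{sample}: expected {len(header)} columns")
            continue
        if any(value not in ("0", "1") for value in values):
            problems.append(f"{sample}: entries must be 0 or 1")
        if values[0] != "1":
            problems.append(f"{sample}: baseline column must be 1")
        if count_samples is not None and sample not in count_samples:
            problems.append(f"{sample}: label not found in the count table")
    # An initial-state sample has a 1 in the baseline column and 0 elsewhere.
    baseline_only = ["1"] + ["0"] * (len(header) - 2)
    if not any(row[1:] == baseline_only for row in rows):
        problems.append("no initial-state sample (baseline only) found")
    return problems


design = """
Samples        baseline        HL60        KBM7
HL60.initial   1               0           0
KBM7.initial   1               0           0
HL60.final     1               1           0
KBM7.final     1               0           1
"""
print(check_design_matrix(design))  # []
```

Passing the sample labels from the header of the read count table as count_samples additionally checks that every label in the design matrix has a matching column in the count file.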

Step 3: run the module

Now we have the minimum requirements to run the MAGeCK mle module. Assuming you save the design matrix file as "designmat.txt", type the following command to run

mageck mle -k leukemia.new.csv -d designmat.txt -n beta_leukemia

If successful, MAGeCK mle will generate three files: the log file, the gene_summary file (including gene beta scores), and the sgrna_summary file (including sgRNA efficiency probability predictions). Here are a few lines of the gene_summary file:

Gene    sgRNA   HL60|beta       HL60|z  HL60|p-value    HL60|fdr        HL60|wald-p-value       HL60|wald-fdr   KBM7|beta       KBM7|z  KBM7|p-value    KBM7|fdr        KBM7|wald-p-value       KBM7|wald-fdr
RNF14   10      0.24927 0.72077 0.36256 0.75648 0.47105 0.9999  0.57276 1.6565  0.06468 0.32386 0.097625        0.73193
RNF10   10      0.10159 0.29373 0.92087 0.98235 0.76896 0.9999  0.11341 0.32794 0.90145 0.97365 0.74296 0.98421
RNF11   10      3.6354  10.513  0.0002811       0.021739        7.5197e-26      1.3376e-22      2.5928  7.4925  0.0014898       0.032024        6.7577e-14      1.33e-11

This file includes the beta scores in two conditions specified in the design matrix (HL60|beta and KBM7|beta), and the associated statistics. For more information, check the output format specification of gene_summary file.
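As an illustration of how the gene_summary file can be used downstream, the sketch below lists the genes that pass an FDR cutoff in one condition together with their beta scores. The helper significant_genes and the cutoff of 0.05 are assumptions made for this example and are not MAGeCK defaults; the rows are copied from the table above.

```python
def significant_genes(lines, condition, fdr_cutoff=0.05):
    """Return (gene, beta) pairs whose FDR in `condition` is below the cutoff."""
    header = lines[0].split()
    beta_col = header.index(f"{condition}|beta")
    fdr_col = header.index(f"{condition}|fdr")
    hits = []
    for line in lines[1:]:
        fields = line.split()
        if fields and float(fields[fdr_col]) < fdr_cutoff:
            hits.append((fields[0], float(fields[beta_col])))
    return hits


gene_summary = """\
Gene sgRNA HL60|beta HL60|z HL60|p-value HL60|fdr HL60|wald-p-value HL60|wald-fdr KBM7|beta KBM7|z KBM7|p-value KBM7|fdr KBM7|wald-p-value KBM7|wald-fdr
RNF14 10 0.24927 0.72077 0.36256 0.75648 0.47105 0.9999 0.57276 1.6565 0.06468 0.32386 0.097625 0.73193
RNF10 10 0.10159 0.29373 0.92087 0.98235 0.76896 0.9999 0.11341 0.32794 0.90145 0.97365 0.74296 0.98421
RNF11 10 3.6354 10.513 0.0002811 0.021739 7.5197e-26 1.3376e-22 2.5928 7.4925 0.0014898 0.032024 6.7577e-14 1.33e-11
""".splitlines()

for condition in ("HL60", "KBM7"):
    print(condition, significant_genes(gene_summary, condition))
```

In these three rows only RNF11 passes the cutoff, and its positive beta score in both conditions indicates positive selection.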

Advanced tutorial

The Advanced tutorial page provides more complicated examples for experienced users.

Return to [Home]

Usage

Usage
- test
- count
- pathway
- mle
- plot
- run (disabled since 0.5.4)
Internal programs
- RRA
- mageckGSEA

The main portal of MAGeCK is the mageck program, which includes a couple of different subprograms:

count: only collect sgRNA read counts from fastq files (bam files are also accepted after v0.5.5).
test: given a table of read counts, perform the sgRNA and gene ranking.
pathway: given a ranked gene list, test whether one pathway is enriched.
mle: perform maximum-likelihood estimation of gene essentiality scores.
run: collect sgRNA read counts from read mapping files (sam format), and perform sgRNA and gene ranking (disabled since 0.5.4).

There is also another subprogram plot that plots some figures of the genes you are interested in from the test results.

plot: Generating graphics for selected genes.

test

This subcommand tests and ranks sgRNAs and genes based on the read count tables provided.

usage:

  usage: mageck test [-h] -k COUNT_TABLE
                    (-t TREATMENT_ID | --day0-label DAY0_LABEL)
                    [-c CONTROL_ID]
                    [--paired] [--norm-method {none,median,total,control}]
                    [--gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD]
                    [--adjust-method {fdr,holm,pounds}]
                    [--variance-estimation-samples VARIANCE_ESTIMATION_SAMPLES]
                    [--sort-criteria {neg,pos}]
                    [--remove-zero {none,control,treatment,both,any}]
                    [--remove-zero-threshold REMOVE_ZERO_THRESHOLD]
                    [--pdf-report]
                    [--gene-lfc-method {median,alphamedian,mean,alphamean,secondbest}]
                    [-n OUTPUT_PREFIX] [--control-sgrna CONTROL_SGRNA]
                    [--normcounts-to-file] [--skip-gene SKIP_GENE]
                    [--keep-tmp]
                    [--additional-rra-parameters ADDITIONAL_RRA_PARAMETERS]
                    [--cnv-norm CNV_NORM] [--cell-line CELL_LINE]

required arguments:

Parameter	Explanation
-k COUNT_TABLE, --count-table COUNT_TABLE	Provide a tab-separated count table instead of sam files. Each line in the table should include sgRNA name (1st column), targeting gene (2nd column) and read counts in each sample. See input/#sgrna-read-count-file for a detailed description.
-t TREATMENT_ID, --treatment-id TREATMENT_ID	Sample label or sample index (0 as the first sample) in the count table as treatment experiments, separated by comma (,). If sample label is provided, the labels must match the labels in the first line of the count table; for example, "HL60.final,KBM7.final". For sample index, "0,2" means the 1st and 3rd samples are treatment experiments. See input/#sample-index for a detailed description.
--day0-label DAY0_LABEL	Specify the label for control sample (usually day 0 or plasmid). For every other sample label, the module will treat it as a treatment condition and compare with control sample.

optional general arguments:

Parameter	Explanation
-h, --help	show this help message and exit
-c CONTROL_ID, --control-id CONTROL_ID	Sample label or sample index in the count table as control experiments, separated by comma (,). Default is all the samples not specified in treatment experiments. See input/#sample-index for a detailed description.
--paired	Paired sample comparisons. In this mode, the number of samples in -t and -c must match and have an exact order in terms of samples. For example, "-t HL60.final,KBM7.final -c HL60.initial,KBM7.initial".
--norm-method {none,median,total,control}	Method for normalization, default median. If control is specified, the size factor will be estimated using control sgRNAs specified in --control-sgrna option.
--gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD	FDR threshold for gene test, default 0.25.
--adjust-method {fdr,holm,pounds}	Method for sgRNA-level p-value adjustment, including false discovery rate (fdr), Holm's method (holm), or Pounds's method (pounds).
--variance-estimation-samples VARIANCE_ESTIMATION_SAMPLES	Sample label or sample index for estimating variances, separated by comma (,). See -t/--treatment-id option for specifying samples.
--sort-criteria {neg,pos}	Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection.
--remove-zero {none,control,treatment,both,any}	Whether to remove zero-count sgRNAs in control and/or treatment experiments. Default: none (do not remove those zero-count sgRNAs).
--pdf-report	Generate pdf report of the analysis.
--gene-lfc-method {median,alphamedian,mean,alphamean,secondbest}	Method to calculate gene log fold changes (LFC) from sgRNA LFCs. Available methods include the median/mean of all sgRNAs (median/mean), or the median/mean sgRNAs that are ranked in front of the alpha cutoff in RRA (alphamedian/alphamean), or the sgRNA that has the second strongest LFC (secondbest). In the alphamedian/alphamean case, the number of sgRNAs correspond to the "goodsgrna" column in the output, and the gene LFC will be set to 0 if no sgRNA is in front of the alpha cutoff. Default median. (new since v0.5.5)
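Because --paired requires the treatment and control lists to have the same length and a matching order, it is worth checking this before launching a run. The sketch below builds the argument list for mageck test and enforces the length requirement; build_test_command is a hypothetical helper, and it does not run MAGeCK itself.

```python
def build_test_command(count_table, treatment, control, prefix, paired=False):
    """Build the argument list for `mageck test`.

    In paired mode, treatment[i] is compared with control[i], so both lists
    must have the same length and be given in matching order.
    """
    if paired and len(treatment) != len(control):
        raise ValueError("--paired needs equal numbers of treatment and control samples")
    cmd = [
        "mageck", "test",
        "-k", count_table,
        "-t", ",".join(treatment),
        "-c", ",".join(control),
        "-n", prefix,
    ]
    if paired:
        cmd.append("--paired")
    return cmd


paired_cmd = build_test_command(
    "sample.txt",
    ["HL60.final", "KBM7.final"],
    ["HL60.initial", "KBM7.initial"],
    "demo",
    paired=True,
)
print(" ".join(paired_cmd))
```

The check only covers the number of samples; whether the order of the two lists really pairs the right samples still has to be verified by the user.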

Optional arguments for input and output:

Parameter	Explanation
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX	The prefix of the output file(s). Default sample1.
--control-sgrna CONTROL_SGRNA	A list of control sgRNAs for normalization and for generating the null distribution of RRA. See the format specification.
--normcounts-to-file	Write normalized read counts to file ({output-prefix}.normalized.txt).
--keep-tmp	Keep intermediate files.
--skip-gene SKIP_GENE	Skip genes in the report. By default, "NA" or "na" will be skipped.
--additional-rra-parameters ADDITIONAL_RRA_PARAMETERS	Additional arguments to run RRA. They will be appended to the command line for calling RRA.

Optional arguments for CNV correction:

Parameter	Explanation
--cnv-norm CNV_NORM	A matrix of copy number variation data across cell lines to normalize CNV-biased sgRNA scores prior to gene ranking.
--cell-line CELL_LINE	The name of the cell line to be used for copy number variation normalization.

count

This subcommand collects sgRNA read count information from fastq files. The output count tables can be used directly in the test subcommand.

usage:

 usage: mageck count [-h] -l LIST_SEQ 
                (--fastq FASTQ [FASTQ ...] | -k COUNT_TABLE)
                [--norm-method {none,median,total,control}]
                [--control-sgrna CONTROL_SGRNA]
                [--sample-label SAMPLE_LABEL] [-n OUTPUT_PREFIX]
                [--unmapped-to-file] [--keep-tmp] [--test-run]
                [--trim-5 TRIM_5] [--sgrna-len SGRNA_LEN] [--count-n]
                [--reverse-complement] [--pdf-report]
                [--day0-label DAY0_LABEL] [--gmt-file GMT_FILE]

required arguments:

Parameter	Explanation
-l LIST_SEQ, --list-seq LIST_SEQ	A file containing list of sgRNA names, the sequences and target genes, either in .txt or in .csv format. See input/#sgrna-library-file for more details. If this file is not provided, mageck will count all possible sgRNAs in the fastq.
--fastq FASTQ	Sample fastq/fastq.gz files (or bam files after v0.5.5. See advanced tutorial), separated by space; use comma (,) to indicate technical replicates of the same sample. For example, "--fastq sample1_replicate1.fastq,sample1_replicate2.fastq sample2_replicate1.fastq,sample2_replicate2.fastq" indicates two samples with 2 technical replicates for each sample.
-k COUNT_TABLE, --count-table COUNT_TABLE	The read count table file. Only 1 file is accepted.
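
The grouping convention of the --fastq option (spaces separate samples, commas separate technical replicates) can be illustrated with a small Python sketch. The function name group_fastq_args is hypothetical and is not part of MAGeCK; it only mirrors the convention described above.

```python
def group_fastq_args(fastq_args):
    """Return one list of replicate files per sample."""
    return [arg.split(",") for arg in fastq_args]


# Values as they would arrive after shell word-splitting of the --fastq option.
fastq_args = [
    "sample1_replicate1.fastq,sample1_replicate2.fastq",
    "sample2_replicate1.fastq,sample2_replicate2.fastq",
]
samples = group_fastq_args(fastq_args)
for number, replicates in enumerate(samples, start=1):
    print(f"sample {number}: {len(replicates)} technical replicates")
```

With this layout, --sample-label needs exactly one label per inner list (two labels in this example).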

optional arguments for normalization:

Parameter	Explanation
--norm-method {none,median,total,control}	Method for normalization, including "none" (no normalization), "median" (median normalization, default), "total" (normalization by total read counts), "control" (normalization by control sgRNAs specified by the --control-sgrna option).
--control-sgrna CONTROL_SGRNA	A list of control sgRNAs for normalization and for generating the null distribution of RRA. See the format specification.
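
To make the difference between "total" and "median" normalization concrete, here is a minimal Python sketch that computes per-sample scaling factors. It assumes that "median" means a median-of-ratios scheme (each count divided by the geometric mean of that sgRNA across samples); this page does not spell out the exact formula, so treat the details as an assumption rather than MAGeCK's implementation. The sample names and counts are made up.

```python
import math
from statistics import median


def total_factors(counts):
    """Scale every sample so that its total equals the mean total."""
    totals = {s: sum(v) for s, v in counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {s: mean_total / t for s, t in totals.items()}


def median_ratio_factors(counts):
    """Median-of-ratios factors (assumed scheme, see the note above)."""
    samples = list(counts)
    ratios = {s: [] for s in samples}
    for row in zip(*(counts[s] for s in samples)):
        if min(row) <= 0:
            continue  # geometric mean is undefined with zero counts
        geo_mean = math.exp(sum(math.log(x) for x in row) / len(row))
        for s, x in zip(samples, row):
            ratios[s].append(x / geo_mean)
    return {s: 1.0 / median(r) for s, r in ratios.items()}


# Hypothetical counts: four sgRNAs in two samples.
counts = {"ctrl": [10, 20, 30, 0], "treat": [20, 40, 60, 5]}
total = total_factors(counts)
med = median_ratio_factors(counts)
print(total, med)
```

Multiplying each sample's counts by its factor gives the normalized counts. With "control" normalization, the same calculation would presumably be restricted to the sgRNAs listed in --control-sgrna.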

optional arguments for input and output:

Parameter	Explanation
--sample-label SAMPLE_LABEL	Sample labels, separated by comma (,). Must be equal to the number of samples provided (in --fastq option). Default "sample1,sample2,...".
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX	The prefix of the output file(s). Default sample1.
--unmapped-to-file	Save unmapped reads to file.
--keep-tmp	Keep intermediate files.
--test-run	Test running. If this option is on, MAGeCK will only process the first 1M records for each file.

optional arguments for processing fastq files:

Parameter	Explanation
--trim-5 TRIM_5	Length of trimming the 5' of the reads. Default 0
--sgrna-len SGRNA_LEN	Length of the sgRNA. Default 20. ATTENTION: after v 0.5.3, the program will automatically determine the sgRNA length from library file; so only use this if you turn on the --unmapped-to-file option.
--count-n	Count sgRNAs with Ns. By default, sgRNAs containing Ns will be discarded.
--reverse-complement	Reverse complement the sequences in library for read mapping.
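
The following Python sketch illustrates how --trim-5, --sgrna-len and --reverse-complement interact conceptually. It is not MAGeCK's read-mapping code: the helper names and the example read are hypothetical, and only exact matching at a fixed offset is shown.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_sgrna(read, trim_5=0, sgrna_len=20):
    """Cut the candidate sgRNA out of a read after trimming the 5' end."""
    return read[trim_5:trim_5 + sgrna_len]


# One library entry (taken from library.txt in demo2).
library = {"TGTTCACAGTATAGTTTGCC": "s_10007"}

# With reverse complementing turned on, the library sequences are flipped
# before they are compared with the reads.
lookup = {reverse_complement(seq): sgrna_id for seq, sgrna_id in library.items()}

# Hypothetical read: 2 extra bases, the flipped sgRNA, then trailing bases.
read = "NN" + reverse_complement("TGTTCACAGTATAGTTTGCC") + "GTTT"
hit = lookup.get(candidate_sgrna(read, trim_5=2, sgrna_len=20))
print(hit)
```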

Optional arguments for quality controls:

Parameter	Explanation
--pdf-report	Generate pdf report of the fastq files.
--day0-label DAY0_LABEL	Turn on the negative selection QC and specify the label for control sample (usually day 0 or plasmid). For every other sample label, the negative selection QC will compare it with day0 sample, and estimate the degree of negative selections in essential genes.
--gmt-file GMT_FILE	The pathway file used for QC, in GMT format. By default it will use the GMT file provided by MAGeCK.

pathway

MAGeCK can also invoke GSEA (default) or RRA to test if a pathway is enriched in one particular gene ranking.

usage:

usage: mageck pathway [-h] --gene-ranking GENE_RANKING --gmt-file GMT_FILE
                  [-n OUTPUT_PREFIX] [--method {gsea,rra}]
                  [--single-ranking] [--sort-criteria {neg,pos}]
                  [--keep-tmp] [--ranking-column RANKING_COLUMN]
                  [--ranking-column-2 RANKING_COLUMN_2]
                  [--pathway-alpha PATHWAY_ALPHA]
                  [--permutation PERMUTATION]

required arguments:

Parameter	Explanation
--gene-ranking GENE_RANKING	The gene ranking file generated by the gene test step.
--gmt-file GMT_FILE	The pathway file in GMT format. See input/#pathway-file-gmt for more details of the GMT file format.

optional arguments:

Parameter	Explanation
-h, --help	show this help message and exit
--single-ranking	The provided file is a (single) gene ranking file, either positive or negative selection. Only one enrichment comparison will be performed.
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX	The prefix of the output file(s). Default sample1.
--method {gsea,rra}	Method for testing pathway enrichment, including gsea (Gene Set Enrichment Analysis) or rra. Default gsea.
--sort-criteria {neg,pos}	Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection.
--keep-tmp	Keep intermediate files.
--ranking-column RANKING_COLUMN	Column number or label in gene summary file for gene ranking; can be either an integer of column number, or a string of column label. Default "2" (the 3rd column).
--ranking-column-2 RANKING_COLUMN_2	Column number or label in gene summary file for gene ranking; can be either an integer of column number, or a string of column label. This option is used to determine the column for positive selections and is disabled if --single-ranking is specified. Default "8" (the 9th column).
--pathway-alpha PATHWAY_ALPHA	The default alpha value for RRA pathway enrichment. Default 0.25.
--permutation PERMUTATION	The number of permutations for GSEA. Default 1000.
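
Because --ranking-column and --ranking-column-2 accept either a column number or a column label, a small Python sketch may help to show how such a value can be interpreted. The function resolve_column is hypothetical; the header is the gene summary header shown in the output file specification, and column numbers are assumed to start from 0 (so "2" is the 3rd column).

```python
def resolve_column(spec, header):
    """Turn a column number (as text) or a column label into a 0-based index."""
    try:
        index = int(spec)
    except ValueError:
        index = header.index(spec)  # raises ValueError for an unknown label
    if not 0 <= index < len(header):
        raise IndexError(f"column {spec!r} is outside the table")
    return index


header = ("id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna neg|lfc "
          "pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna pos|lfc").split()

print(header[resolve_column("2", header)])          # default of --ranking-column
print(header[resolve_column("8", header)])          # default of --ranking-column-2
print(resolve_column("neg|p-value", header))        # lookup by label
```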

mle

The mle subcommand performs maximum-likelihood analysis of gene essentialities, instead of the RRA analysis.

usage:

     usage: mageck.beta mle [-h] -k COUNT_TABLE
                   (-d DESIGN_MATRIX | --day0-label DAY0_LABEL)
                   [-n OUTPUT_PREFIX] [-i INCLUDE_SAMPLES]
                   [-b BETA_LABELS] [--control-sgrna CONTROL_SGRNA]
                   [--cnv-norm CNV_NORM] [--cnv-est CNV_EST] [--debug]
                   [--debug-gene DEBUG_GENE]
                   [--norm-method {none,median,total,control}]
                   [--genes-varmodeling GENES_VARMODELING]
                   [--permutation-round PERMUTATION_ROUND]
                   [--no-permutation-by-group]
                   [--max-sgrnapergene-permutation MAX_SGRNAPERGENE_PERMUTATION]
                   [--remove-outliers] [--threads THREADS]
                   [--adjust-method {fdr,holm,pounds}]
                   [--sgrna-efficiency SGRNA_EFFICIENCY]
                   [--sgrna-eff-name-column SGRNA_EFF_NAME_COLUMN]
                   [--sgrna-eff-score-column SGRNA_EFF_SCORE_COLUMN]
                   [--update-efficiency] [--bayes] [-p] [-w PPI_WEIGHTING]
                   [-e NEGATIVE_CONTROL]

required arguments:

Parameter	Explanation
-k COUNT_TABLE, --count-table COUNT_TABLE	Provide a tab-separated count table. Each line in the table should include sgRNA name (1st column), target gene (2nd column) and read counts in each sample. See input/#sgrna-read-count-file for a detailed description.
-d DESIGN_MATRIX, --design-matrix DESIGN_MATRIX	Provide a design matrix, either a file name or a quoted string of the design matrix. For example, "1,1;1,0". The row of the design matrix must match the order of the samples in the count table (if --include-samples is not specified), or the order of the samples by the --include-samples option.
--day0-label DAY0_LABEL	Specify the label for control sample (usually day 0 or plasmid). For every other sample label, the MLE module will treat it as a single condition and generate a corresponding design matrix.
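
As an illustration of what a design matrix derived from --day0-label could look like, here is a short Python sketch. It assumes the layout described in the design matrix rules (a baseline column of 1s, plus one column per non-control sample); the function name and the sample labels are hypothetical, and the matrix MAGeCK builds internally may differ in detail.

```python
def design_from_day0(sample_labels, day0_label):
    """Build a header and rows: baseline plus one condition per other sample."""
    if day0_label not in sample_labels:
        raise ValueError(f"unknown day0 label: {day0_label}")
    conditions = [s for s in sample_labels if s != day0_label]
    header = ["Samples", "baseline"] + conditions
    rows = []
    for sample in sample_labels:
        rows.append([sample, 1] + [1 if sample == c else 0 for c in conditions])
    return header, rows


header, rows = design_from_day0(["day0", "drugA", "drugB"], "day0")
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```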

optional arguments for input and output:

Parameter	Explanation
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX	The prefix of the output file(s). Default sample1.
-i INCLUDE_SAMPLES, --include-samples INCLUDE_SAMPLES	Specify the sample labels if the design matrix is not given by file in the --design-matrix option. Sample labels are separated by ",", and must match the labels in the count table.
-b BETA_LABELS, --beta-labels BETA_LABELS	Specify the labels of the variables (i.e., beta), if the design matrix is not given by file in the --design-matrix option. Labels should be separated by ",", and the number of labels must equal the number of columns of the design matrix, including baseline labels. Default value: "beta_0,beta_1,beta_2,...".
--control-sgrna CONTROL_SGRNA	A list of control sgRNAs. See the format specification.

Optional arguments for CNV correction:

Parameter	Explanation
--cnv-norm CNV_NORM	A matrix of copy number variation data across cell lines to normalize CNV-biased sgRNA scores prior to gene ranking.

optional arguments for MLE module:

Parameter	Explanation
--debug	Debug mode to output detailed information of the running.
--debug-gene DEBUG_GENE	Debug mode to only run one gene with specified ID.
--norm-method {none,median,total,control}	Method for normalization, including "none" (no normalization), "median" (median normalization, default), "total" (normalization by total read counts), "control" (normalization by control sgRNAs specified by the --control-sgrna option).
--genes-varmodeling GENES_VARMODELING	The number of genes for mean-variance modeling. Default 1000.
--permutation-round PERMUTATION_ROUND	The rounds for permutation (integer). The permutation time is (# genes) * x for x rounds of permutation. Suggested value: 10 (may take longer time). Default 2.
--no-permutation-by-group	By default, gene permutation is performed separately, by their number of sgRNAs. Turning this option on will perform permutation on all genes together. This makes the program faster, but the p value estimation is accurate only if the number of sgRNAs per gene is approximately the same.
--max-sgrnapergene-permutation MAX_SGRNAPERGENE_PERMUTATION	Only permute genes by group if the number of sgRNAs per gene is smaller than this number. This will save a lot of time if some regions are targeted by a large number of sgRNAs (usually hundreds). Must be an integer. Default 100.
--remove-outliers	Try to remove outliers. Turning this option on will slow the algorithm.
--threads THREADS	Using multiple threads to run the algorithm. Default using only 1 thread.
--adjust-method {fdr,holm,pounds}	Method for sgRNA-level p-value adjustment, including false discovery rate (fdr), Holm's method (holm), or Pounds's method (pounds).

optional arguments for the EM iteration:

Parameter	Explanation
--sgrna-efficiency SGRNA_EFFICIENCY	An optional file of sgRNA efficiency prediction. The efficiency prediction will be used as an initial guess of the probability an sgRNA is efficient. Must contain at least two columns, one containing sgRNA ID, the other containing sgRNA efficiency prediction.
--sgrna-eff-name-column SGRNA_EFF_NAME_COLUMN	The sgRNA ID column in sgRNA efficiency prediction file (specified by the --sgrna-efficiency option). Default is 0 (the first column).
--sgrna-eff-score-column SGRNA_EFF_SCORE_COLUMN	The sgRNA efficiency prediction column in sgRNA efficiency prediction file (specified by the --sgrna-efficiency option). Default is 1 (the second column).
--update-efficiency	Iteratively update sgRNA efficiency during EM iteration.

plot

The plot command generates graphics for selected genes. For interactive visualizations, use our new MAGeCK-VISPR algorithm.

usage:

usage: mageck plot [-h] -k COUNT_TABLE -g GENE_SUMMARY [--genes GENES]
                   [-s SAMPLES] [-n OUTPUT_PREFIX]
                   [--norm-method {none,median,total}] [--keep-tmp]

required arguments:

Parameter	Explanation
-k COUNT_TABLE, --count-table COUNT_TABLE	Provide a tab-separated count table.
-g GENE_SUMMARY, --gene-summary GENE_SUMMARY	The gene summary file generated by the test command.

optional arguments:

Parameter	Explanation
-h, --help	show this help message and exit
--genes GENES	A list of genes to be plotted, separated by comma. Default: none.
-s SAMPLES, --samples SAMPLES	A list of samples to be plotted, separated by comma. Default: using all samples in the count table.
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX	The prefix of the output file(s). Default sample1.
--norm-method {none,median,total}	Method for normalization, default median.
--keep-tmp	Keep intermediate files.

run (disabled since 0.5.4)

This subcommand allows you to generate comparison results directly from fastq files, with limited parameter settings available. The parameters for the run sub-command are included in the test and count sub-commands. See both sub-commands for more details. It is strongly suggested that users run the count and test commands separately, in order to gain finer control of the results.

Internal programs

These programs are used by MAGeCK internally, but can also be executed by users for other purposes.

RRA

RRA - Robust Rank Aggregation v 0.5.6.

Usage:

Parameter	Explanation
-i input_data file	Input file name. Format: "item id" "group id" "list id" "value" ["probability"] ["chosen"]
-o output_file	Output file name. Format: "group id" "number of items in the group" "lo-value" "false discovery rate"
-p maximum_percentile	RRA only considers the items with percentile smaller than this parameter. Default=0.1
--control control_sgrna_list	A list of control sgRNA names.
--permutation permutation_round	The number of rounds of permutation. Increase this value if the number of genes is small. Default 100.
--no-permutation-by-group	By default, gene permutation is performed separately, by their number of sgRNAs. Turning this option on will perform permutation on all genes together. This makes the program faster, but the p value estimation is accurate only if the number of sgRNAs per gene is approximately the same.
--skip-gene gene_name	Genes to skip from doing permutation. Specify it multiple times if you need to skip more than one gene.
--min-percentage-goodsgrna min_percentage	Filter genes that have too low a percentage of 'good sgrnas', i.e., sgrnas that fall below the -p threshold. Must be a number between 0-1. Default 0 (do not filter genes).
--min-number-goodsgrna min_number	Filter genes that have too few 'good sgrnas', i.e., sgrnas that fall below the -p threshold. Must be an integer. Default 0 (do not filter genes).

mageckGSEA

mageckGSEA is a fast implementation of Gene Set Enrichment Analysis (GSEA) using C++. It's used by MAGeCK for quality controls and pathway enrichment tests. Compared with the official GSEA program, its main advantages are ease of use and extremely fast running speed.

In the gsea/demo folder, an example is provided to run GSEA. The following command performs GSEA on the ranked gene list in demo1.txt, tested against the pathways defined in kegg.ribosome.gmt (both files are provided in the demo folder). The scores in the 2nd column are used to rank genes (-c 1), and 10000 permutations are used to compute the p value:

 mageckGSEA -r demo1.txt -g kegg.ribosome.gmt  -c 1 -p 10000

You can either provide genes with their scores, as in demo1.txt (genes with smaller scores are ranked first).

SYNRG   0.715581582
SREK1   0.992306809
SLC25A46        0.057411873
COL4A5  0.36387645
CCDC22  -0.463887932
MVD     0.020897922

mageckGSEA will first rank genes based on the provided scores, as long as you indicate which column to use (-c 1).

Or you can just provide gene rankings, as in demo2.txt.

C5orf64
TTC17
MRPS27
PIGY
GPAA1
KIF4A
EPS15
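
The two input styles above can be mimicked with a few lines of Python. This is only a sketch of the ranking step as described here (column numbers start from 0, smaller scores rank first, and column 0 means "keep the given order"); the function rank_genes is hypothetical and is not the mageckGSEA implementation.

```python
def rank_genes(lines, score_column=0):
    """Return gene names in ranked order."""
    rows = [line.split() for line in lines if line.strip()]
    if score_column > 0:
        # Smaller scores are ranked in the front.
        rows.sort(key=lambda fields: float(fields[score_column]))
    return [fields[0] for fields in rows]


demo1 = """SYNRG   0.715581582
SREK1   0.992306809
SLC25A46        0.057411873
COL4A5  0.36387645
CCDC22  -0.463887932
MVD     0.020897922""".splitlines()

demo2 = ["C5orf64", "TTC17", "MRPS27", "PIGY", "GPAA1", "KIF4A", "EPS15"]

by_score = rank_genes(demo1, score_column=1)  # like "-c 1"
as_given = rank_genes(demo2)                  # like the default "-c 0"
print(by_score)
print(as_given)
```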

The output is a tab-separated file to report the following statistics of GSEA:

Pathway Size    ES  p   p_permutation   FDR Ranking Hits    LFC
KEGG_RIBOSOME   88  0.3262  0.00240772  0.0043  0.0043  0   32  0

Item	Explanation
Pathway	The name of the pathway
Size	The size of the pathway, i.e., the number of genes
ES	Enrichment Score (ES) in GSEA
p	The p value of ES
p_permutation	The permutation p value of ES (usually more accurate than p)
FDR	False Discovery Rate of p_permutation
Ranking	The ranking of this pathway
Hits	The number of genes that are ranked before ES score. See "Leading Edge" analysis of GSEA
LFC	Log fold change (not implemented)

USAGE:

 mageckGSEA  -r rank_file -g gmt_file 
                           [-e] [-s]  [-c score_column] 
                           [-p perm_time]   [-n pathway_name] 
                           [-o output_file]  [--] [--version] [-h]

Parameter	Explanation
-e, --reverse_value	Reverse the order of the gene.
-s, --sort_byp	Sort the pathways by p value.
-c score_column, --score_column score_column	The column for gene scores. If you just want to use the ranking of the gene (located at the 1st column), use 0. Otherwise, specify which column should be used to rank the gene. The column number starts from 0. Default: 0.
-p perm_time, --perm_time perm_time	Permutations, default 1000.
-n pathway_name, --pathway_name pathway_name	Name of the pathway to be tested. If not found, will test all pathways.
-o output_file, --output_file output_file	The name of the output file. Use - to print to standard output.
-r rank_file, --rank_file rank_file	(required) Rank file. The first column of the rank file must be the gene name.
-g gmt_file, --gmt_file gmt_file	(required) The pathway annotation in GMT format.
--version	Displays version information and exits.
-h, --help	Displays usage information and exits.

Visualization Functions in MAGeCK

We developed a stand-alone visualization tool, VISPR, to visualize CRISPR screening results. See the paper and the VISPR project for more details.

The MAGeCKFlute package provides a convenient approach to visualize MAGeCK and MAGeCK-VISPR results using the R programming language.

Visualization functions are also available within the MAGeCK software. Since version 0.5, MAGeCK enables a couple of visualization functions. With these features on, MAGeCK helps users better interpret datasets and results, and generates figures and tables that can be directly used in presentations or papers. The in-house visualization module provides a simple solution for users with limited knowledge in R.

The Visualization function has additional software dependencies, but they are easy to install in many operating systems. See installation for more details.

The R markdown option (since 0.5.9.3)

Since 0.5.9.3, MAGeCK generates an R markdown file (.Rmd) for the count and test commands. Users can copy this file (along with all other files generated by MAGeCK) to a computer with RStudio installed, and generate an HTML-based report page.

To generate the report page, simply open the corresponding .Rmd file in RStudio, and press the "Run" --> "Run all" button. An HTML file will be generated correspondingly.

Users can also modify the "Parameters" section in .Rmd file to adjust the parameters used in the report.

An example of the generated html file (from test command) can be downloaded here.

An example of the generated html file (from count command) can be downloaded here.

Users need to install rmarkdown package as the dependency:

install.packages("rmarkdown")

The --pdf-report option (before 0.5.9.3)

The --pdf-report option will be gradually deprecated after version 0.5.9.3, due to its complicated dependencies on pdflatex.

MAGeCK will generate PDF files in both the count and test commands when the --pdf-report option is added. If successful, a <prefix>.countsummary.pdf (for the count command) or <prefix>_summary.pdf (for the test command) will be generated.

You can also try it in the two demos provided. In demo1, note that the command used in the run.sh is:

mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial  -n demo

Use the following command to generate PDF file:

mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial  -n demo --pdf-report

You can download the sample PDF file from demo1 here.

In demo2, the command used in the runmageck.sh is:

mageck run --fastq test1.fastq test2.fastq -l library.txt -n demo --sample-label L1,CTRL -t L1 -c CTRL

You can split the run command into separate count and test commands, with the --pdf-report option enabled. Alternatively, note that the Rnw and R files (used for PDF file production) exist after successfully running this demo:

demo.count.median_normalized.csv  demo.count.txt         demo.R                  library.txt   test2.fastq
demo_countsummary.R               demo.gene_summary.txt  demo.sgrna_summary.txt  runmageck.sh
demo_countsummary.Rnw             demo.log               demo_summary.Rnw        test1.fastq

Simply execute the two .R files and you can get the PDF files as well:

Rscript demo_countsummary.R
Rscript demo.R

You can download the count sample PDF file from demo2 here.

The plot command

After running test, MAGeCK can generate a couple of figures describing the genes you are interested in using the plot command. In demo1, for example, if you are interested in the ACTR8 gene, use the following command to generate the PDF reports describing the sgRNA read count change of ACTR8, and its RRA score relative to the RRA score distribution of all genes:

mageck plot -k sample.txt -g demo.gene_summary.txt --genes ACTR8

The PDF file generated using this command is here.

Input file specification

sgRNA read count file

The sgRNA read count file is provided via the -k parameter of the test or run sub-command.

The read count file should list the names of the sgRNA, the gene it is targeting, followed by the read counts in each sample. Each item should be separated by the tab ('\t'). A header line is optional. For example in the studies of T. Wang et al. Science 2014, there are 4 CRISPR screening samples, and they are labeled as: HL60.initial, KBM7.initial, HL60.final, KBM7.final. Here are a few lines of the read count file:

sgRNA           gene    HL60.initial    KBM7.initial    HL60.final      KBM7.final
A1CF_m52595977  A1CF    213             274            883                175
A1CF_m52596017  A1CF    294             412            1554              1891
A1CF_m52596056  A1CF    421             368            566                759
A1CF_m52603842  A1CF    274             243            314                855
A1CF_m52603847  A1CF    0               50             145                266

The count sub-command will output the read count file like this.
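
For readers who want to inspect a count table outside MAGeCK, the following Python sketch reads the tab-separated layout described above. It assumes a header line is present (the format allows it to be omitted); the function name read_count_table is hypothetical, and the two rows are taken from the example above.

```python
import csv
import io


def read_count_table(handle):
    """Return (sample labels, {sgRNA: (gene, [counts per sample])})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]
    counts = {}
    for row in reader:
        if not row:
            continue
        counts[row[0]] = (row[1], [float(x) for x in row[2:]])
    return samples, counts


# Two example rows, joined with real tab characters.
table = "\n".join("\t".join(fields) for fields in [
    ["sgRNA", "gene", "HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"],
    ["A1CF_m52595977", "A1CF", "213", "274", "883", "175"],
    ["A1CF_m52603847", "A1CF", "0", "50", "145", "266"],
])
samples, counts = read_count_table(io.StringIO(table))
print(samples)
print(counts["A1CF_m52603847"])
```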

Sample index

In the -t/--treatment-id, -c/--control-id parameters, you can use either sample label or sample index to specify samples. If sample labels are used, the labels must match the sample labels in the first line of the count table. For example, "HL60.final,KBM7.final".

You can also use sample index to specify samples. The index of the sample is the order it appears in the sgRNA read count file, starting from 0. The index is used in the -t/--treatment-id, -c/--control-id parameters. In the example above, there are four samples, and the index of each sample is as follows:

sample	index
HL60.initial	0
KBM7.initial	1
HL60.final	2
KBM7.final	3
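
The label/index convention can be sketched in Python as follows. The function resolve_samples is a hypothetical helper, not MAGeCK code; it assumes that a token is first tried as a label and otherwise interpreted as a 0-based index.

```python
def resolve_samples(spec, labels):
    """Map a comma-separated list of labels or indices to column indices."""
    indices = []
    for token in spec.split(","):
        if token in labels:
            indices.append(labels.index(token))
            continue
        index = int(token)  # raises ValueError if neither label nor number
        if not 0 <= index < len(labels):
            raise IndexError(f"sample index {index} is out of range")
        indices.append(index)
    return indices


labels = ["HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"]
treatment = resolve_samples("HL60.final,KBM7.final", labels)
control = resolve_samples("0,1", labels)
print(treatment, control)
```

Both ways of writing the parameters select the same columns of the count table, so "-t HL60.final,KBM7.final" and "-t 2,3" are equivalent in this example.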

design matrix file

The design matrix is a txt file indicating the effects of different conditions on different samples. In this file, each row is a sample, each column is a condition, and the value is 1 or 0, indicating whether the sample (in the row) is affected by the condition (in the column).

Here is a simple example of the design matrix from the studies in T. Wang et al. Science 2014. The CRISPR screens are done on two cell lines, HL60 and KBM7, and four samples are generated, two corresponding to the initial states of two cell lines, and two corresponding to the final states of two cell lines. If you want to model the effects of two cell lines, you can have the design matrix as follows:

Samples        baseline        HL60        KBM7
HL60.initial   1               0           0
KBM7.initial   1               0           0
HL60.final     1               1           0
KBM7.final     1               0           1

Here are some important rules of the design matrix:

The design matrix file must include a header line of condition labels;
The first column is the sample labels that must match labels in read count file (see the above example in sgRNA read count file);
The second column must be a "baseline" column that sets all values to "1";
The element in the design matrix is either "0" or "1".
You must have at least one sample of "initial state" (e.g., day 0 or plasmid) that has only one "1" in the corresponding row. That only "1" must be in the baseline column.

Note: different orders of the samples in the design matrix may change the results, because there are preprocessing steps to remove outliers. A good practice will be to always place initial samples (like day0 or plasmid) as the first rows in the design matrix.
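
The rules above can be checked before running MAGeCK with a short script. The following Python sketch is a hypothetical helper (check_design_matrix is not part of MAGeCK) that tests only the rules listed on this page and does not compare sample labels against a count table.

```python
def check_design_matrix(text):
    """Return a list of rule violations (empty if none were found)."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    if len(header) < 2:
        return ["header needs a sample column and a baseline column"]
    problems = []
    has_initial = False
    for fields in rows:
        sample, values = fields[0], fields[1:]
        if len(values) != len(header) - 1:
            problems.append(f"{sample}: expected {len(header) - 1} values")
            continue
        if any(v not in ("0", "1") for v in values):
            problems.append(f"{sample}: values must be 0 or 1")
        if values[0] != "1":
            problems.append(f"{sample}: baseline column must be 1")
        elif all(v == "0" for v in values[1:]):
            has_initial = True  # only the baseline column is set
    if not has_initial:
        problems.append("no initial-state sample (baseline only)")
    return problems


example = """Samples        baseline        HL60        KBM7
HL60.initial   1               0           0
KBM7.initial   1               0           0
HL60.final     1               1           0
KBM7.final     1               0           1"""
print(check_design_matrix(example))
```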

sgRNA library file

When starting from fastq files, MAGeCK needs to know the sgRNA sequence and its targeting gene. Such information is provided in the sgRNA library file, and can be specified by the -l/--list-seq option in run or count subcommand.

The sgRNA library file can be provided either in .txt format or in .csv format. There are three columns in the library file: the sgRNA ID, the sequence, and the gene it is targeting. One example of the library file is provided as library.txt in demo2:

s_10007 TGTTCACAGTATAGTTTGCC    CCNA1
s_10008 TTCTCCCTAATTGCTTGCTG    CCNA1
s_10027 ACATGTTGCTTCCCCTTGCA    CCNC

If provided in .csv format, the file will look like:

s_10007,TGTTCACAGTATAGTTTGCC,CCNA1
s_10008,TTCTCCCTAATTGCTTGCTG,CCNA1
s_10027,ACATGTTGCTTCCCCTTGCA,CCNC
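
Since the two library formats carry the same three columns, they can be read with one routine. The Python sketch below is a hypothetical reader (read_library is not a MAGeCK function) that picks the separator per line and assumes there is no header line.

```python
def read_library(lines):
    """Return {sgRNA ID: (sequence, gene)} from .txt or .csv style lines."""
    library = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(",") if "," in line else line.split()
        sgrna_id, sequence, gene = fields[:3]
        library[sgrna_id] = (sequence.upper(), gene)
    return library


txt_lines = [
    "s_10007 TGTTCACAGTATAGTTTGCC    CCNA1",
    "s_10008 TTCTCCCTAATTGCTTGCTG    CCNA1",
    "s_10027 ACATGTTGCTTCCCCTTGCA    CCNC",
]
csv_lines = [
    "s_10007,TGTTCACAGTATAGTTTGCC,CCNA1",
    "s_10008,TTCTCCCTAATTGCTTGCTG,CCNA1",
    "s_10027,ACATGTTGCTTCCCCTTGCA,CCNC",
]
from_txt = read_library(txt_lines)
from_csv = read_library(csv_lines)
print(from_txt == from_csv)
```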

negative control sgRNA list

When using the --control-sgrna option, users need to provide a plain text file containing only negative control sgRNA IDs (one per line). For example,

NonTargetingControlGuideForHuman_0001
NonTargetingControlGuideForHuman_0002
NonTargetingControlGuideForHuman_0003
NonTargetingControlGuideForHuman_0004

Some systems may read only 1 control sgRNA ID. Please look at this Q&A for solutions.

pathway file (gmt)

The GMT file format stores the pathway information and is consistent with the GMT file in Gene Set Enrichment Analysis (GSEA). The details of the GMT format can be found at GSEA website.

You can also download different pathway files directly from GSEA MSigDB database. They can be used directly by MAGeCK.
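
In the GMT format, each line holds one gene set: the set name, a description field, and then the member genes, all separated by tabs. A minimal Python reader could look like the sketch below; read_gmt is a hypothetical helper, and the two example lines are made up for illustration.

```python
def read_gmt(lines):
    """Return {pathway name: [genes]} from GMT-formatted lines."""
    pathways = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # need a name, a description and at least one gene
        pathways[fields[0]] = [gene for gene in fields[2:] if gene]
    return pathways


# Illustrative content only; real files come from the GSEA MSigDB database.
gmt_lines = [
    "KEGG_RIBOSOME\tna\tRPL18\tRPS3\n",
    "EXAMPLE_SET\tna\tESPL1\tCDK1\tKIF4A\n",
]
pathways = read_gmt(gmt_lines)
print({name: len(genes) for name, genes in pathways.items()})
```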

sgRNA/gene mapping file (deprecated after version 0.3)

The sgRNA/gene mapping file will be used in the --gene-test parameter in the test or run sub-command.

This file should list the names of the sgRNAs and their corresponding genes, separated by the tab ('\t'). For example:

A1CF_m52595977  A1CF
A1CF_m52596017  A1CF
A1CF_m52596056  A1CF
A1CF_m52603842  A1CF
A1CF_m52603847  A1CF
A1CF_p52595870  A1CF
A1CF_p52595881  A1CF
A1CF_p52596023  A1CF

Output file specification

The output of the MAGeCK consists of the following files:

countsummary.txt: Count summary and QC measurements.
sgrna_summary.txt: The sgRNA ranking results.
gene_summary.txt: The gene ranking results.
pathway_summary.txt: The pathway ranking results.
log: The logging information during the running.

The following files are the outputs of RRA. They are intermediate files and are deleted after MAGeCK running is complete. To see these files, use the --keep-tmp option in MAGeCK test subcommand.

.gene.high.txt: The gene ranking results (positively selected genes).
.gene.low.txt: The gene ranking results (negatively selected genes).

The following files are the inputs of RRA and will be deleted after MAGeCK is complete.

count_summary_txt

This file is generated by count command, and summarizes QC measurements of the fastq (or count table) files.

An example is as follows:

File    Label   Reads   Mapped  Percentage  TotalsgRNAs Zerocounts  GiniIndex   NegSelQC    NegSelQCPval    NegSelQCPvalPermutation NegSelQCPvalPermutationFDR  NegSelQCGene
S6_R1_001.fastq.gz  LNCaP_Day21 15567122    13033442    0.8372  92817   2204    0.1472  0.68965 1.6688e-31  0   0   86
S5_R1_001.fastq.gz  LNCaP_Day0  16659017    14497805    0.8703  92817   461 0.0996  0   1   1   1   0.0

The contents of each column are as follows. To help you evaluate the quality of the data, recommended values are shown in parentheses.

Column	Content
File	The fastq (or the count table) file used.
Label	The label of that fastq file assigned.
Reads	Total number of reads in the fastq file. (Recommended: 100~300 times the number of sgRNAs)
Mapped	Total number of reads that can be mapped to library
Percentage	Mapped percentage, calculated as Mapped/Reads (Recommended: at least 60%)
TotalsgRNAs	Total number of sgRNAs in the library
Zerocounts	Total number of missing sgRNAs (sgRNAs that have 0 counts) (Recommended: no more than 1%)
GiniIndex	The Gini Index of the read count distribution. A smaller value indicates a more even count distribution. (Recommended: around 0.1 for plasmid or initial state samples, and around 0.2-0.3 for negative selection samples)
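
The recommendations in this table can be turned into a simple automated check. The Python sketch below uses the two example rows above; qc_flags and gini are hypothetical helpers. The Gini function is the textbook definition applied to whatever values are passed in, and whether MAGeCK transforms the counts before computing its GiniIndex is not stated on this page, so the numbers may not match the GiniIndex column exactly.

```python
def gini(values):
    """Textbook Gini index: 0 means perfectly even, values near 1 uneven."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


def qc_flags(reads, mapped, total_sgrnas, zerocounts):
    """Compare one countsummary row with the recommended values."""
    flags = []
    if reads < 100 * total_sgrnas:
        flags.append("fewer than 100 reads per sgRNA")
    if mapped / reads < 0.6:
        flags.append("mapped percentage below 60%")
    if zerocounts / total_sgrnas > 0.01:
        flags.append("zero-count sgRNAs above 1%")
    return flags


day21 = qc_flags(reads=15567122, mapped=13033442, total_sgrnas=92817, zerocounts=2204)
day0 = qc_flags(reads=16659017, mapped=14497805, total_sgrnas=92817, zerocounts=461)
print(day21, day0, gini([0, 0, 0, 1]))
```

In this example only the Day21 sample is flagged, because 2204 of its 92817 sgRNAs (about 2.4%) have zero counts. That can be expected to some degree for a sample under negative selection, so a flag is a prompt to look closer rather than a failure.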

The following columns are used to evaluate the degree of negative selection in known essential genes. They are set only if you provide the --day0-label option. MAGeCK will run pathway analysis for each sample, and use several GSEA metrics to evaluate the quality of the samples.

Column	Content
NegSelQC	The Enrichment Score (ES) of GSEA
NegSelQCPval	The p value of the GSEA analysis (Recommended: smaller than 1e-10)
NegSelQCPvalPermutation	The permutation p value
NegSelQCPvalPermutationFDR	The FDR of the permutation p value
NegSelQCGene	The number of essential genes found in the library that are evaluated for GSEA analysis.

sgrna_summary_txt

An example of the sgRNA ranking results is as follows:

sgrna   Gene   control_count   treatment_count control_mean    treat_mean    LFC     control_var     adj_var score   p.low   p.high  p.twosided      FDR     high_in_treatment
INO80B_m74682554   INO80B        0.0/0.0 1220.1598778/1476.14096301      0.810860655738  1348.15042041   10.70    0.0     19.0767988005   308.478081895   1.0     1.11022302463e-16       2.22044604925e-16       1.57651669497e-14       True
NHS_p17705966   NHS   1.62172131148/3.90887850467     2327.09368635/1849.95115143     2.76529990807   2088.52241889    9.54   2.6155440132    68.2450168229   252.480744404   1.0     1.11022302463e-16       2.22044604925e-16       1.57651669497e-14       True

The contents of each column are as follows.

Column	Content
sgrna	sgRNA ID
Gene	The targeting gene
control_count	Normalized read counts in control samples
treatment_count	Normalized read counts in treatment samples
control_mean	Median read counts in control samples
treat_mean	Median read counts in treatment samples
LFC	The log2 fold change of sgRNA
control_var	The raw variance in control samples
adj_var	The adjusted variance in control samples
score	The score of this sgRNA
p.low	p-value (lower tail)
p.high	p-value (higher tail)
p.twosided	p-value (two sided)
FDR	false discovery rate
high_in_treatment	Whether the abundance is higher in treatment samples
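
The control_count and treatment_count fields pack one value per replicate, separated by "/". The Python sketch below splits such a field and recomputes the summary value for the NHS_p17705966 row of the example above; parse_replicates is a hypothetical helper. With two replicates the median equals the average of the two values.

```python
from statistics import median


def parse_replicates(field):
    """Split a 'a/b/...' count field into a list of floats."""
    return [float(value) for value in field.split("/")]


control = parse_replicates("1.62172131148/3.90887850467")
treatment = parse_replicates("2327.09368635/1849.95115143")

control_summary = median(control)
treatment_summary = median(treatment)
print(control_summary, treatment_summary)
```

The two results agree with the control_mean (2.76529990807) and treat_mean (2088.52241889) columns of that row.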

sgrna_summary_txt in mle subcommand

Note that this file will have different meaning in mle subcommand: it records the estimated efficiency probability of the guides in the MLE model, after the termination of iteration.

Note that by default, this value is 1 since --update-efficiency is turned off. The values will be between 0-1 if you turn this option on and/or if you explicitly set up the --sgrna-efficiency parameter.

gene_summary_txt

An example of the gene summary file is as follows:

id      num     neg|score  neg|p-value   neg|fdr neg|rank        neg|goodsgrna    neg|lfc   pos|score  pos|p-value   pos|fdr pos|rank  pos|goodsgrna    pos|lfc
ESPL1   12      6.4327e-10      7.558e-06       7.9e-05 1    11    -2.35      0.99725 0.99981 0.999992        615     0    -0.07
RPL18   12      6.4671e-10      7.558e-06       7.9e-05 2    11    -2.12      0.99799 0.99989 0.999992        620     0    -0.32
CDK1    12      2.6439e-09      7.558e-06       7.9e-05 3    12    -1.93      1.0     0.99999 0.999992        655     0    -0.12

The contents of each column are as follows.

Column	Content
id	Gene ID
num	The number of targeting sgRNAs for each gene
neg|score	The RRA lo value of this gene in negative selection
neg|p-value	The raw p-value (using permutation) of this gene in negative selection
neg|fdr	The false discovery rate of this gene in negative selection
neg|rank	The ranking of this gene in negative selection
neg|goodsgrna	The number of "good" sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in negative selection.
neg|lfc	The log2 fold change of this gene in negative selection. The way to calculate gene lfc is controlled by the --gene-lfc-method option
pos|score	The RRA lo value of this gene in positive selection
pos|p-value	The raw p-value (using permutation) of this gene in positive selection
pos|fdr	The false discovery rate of this gene in positive selection
pos|rank	The ranking of this gene in positive selection
pos|goodsgrna	The number of "good" sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in positive selection.
pos|lfc	The log fold change of this gene in positive selection

Genes are ranked by the p.neg field (by default). If you need a ranking by the p.pos, you can use the --sort-criteria option.
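
A common next step is to pull significant genes out of gene_summary.txt. The Python sketch below reads the tab-separated columns described above and keeps genes whose FDR is below a cutoff; select_hits is a hypothetical helper, the 0.05 cutoff is an arbitrary choice for illustration, and the rows are written in the column order of the header.

```python
import csv
import io


def select_hits(handle, direction="neg", fdr_cutoff=0.05):
    """Return gene IDs whose <direction>|fdr is below the cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["id"] for row in reader
            if float(row[f"{direction}|fdr"]) < fdr_cutoff]


header = ["id", "num", "neg|score", "neg|p-value", "neg|fdr", "neg|rank",
          "neg|goodsgrna", "neg|lfc", "pos|score", "pos|p-value", "pos|fdr",
          "pos|rank", "pos|goodsgrna", "pos|lfc"]
rows = [
    ["ESPL1", "12", "6.4327e-10", "7.558e-06", "7.9e-05", "1", "11", "-2.35",
     "0.99725", "0.99981", "0.999992", "615", "0", "-0.07"],
    ["CDK1", "12", "2.6439e-09", "7.558e-06", "7.9e-05", "3", "12", "-1.93",
     "1.0", "0.99999", "0.999992", "655", "0", "-0.12"],
]
text = "\n".join("\t".join(fields) for fields in [header] + rows)

neg_hits = select_hits(io.StringIO(text), "neg")
pos_hits = select_hits(io.StringIO(text), "pos")
print(neg_hits, pos_hits)
```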

gene_summary_txt in mle subcommand

The gene_summary.txt produced by the mle subcommand is similar to the gene_summary.txt format above, with a few additional columns. Here is an example of the gene_summary.txt generated by the mle subcommand:

Gene    sgRNA   HL60|beta       HL60|z  HL60|p-value    HL60|fdr        HL60|wald-p-value       HL60|wald-fdr   KBM7|beta       KBM7|z  KBM7|p-value    KBM7|fdr        KBM7|wald-p-value       KBM7|wald-fdr
RNF14   10      0.24927 0.72077 0.36256 0.75648 0.47105 0.9999  0.57276 1.6565  0.06468 0.32386 0.097625        0.73193
RNF10   10      0.10159 0.29373 0.92087 0.98235 0.76896 0.9999  0.11341 0.32794 0.90145 0.97365 0.74296 0.98421
RNF11   10      3.6354  10.513  0.0002811       0.021739        7.5197e-26      1.3376e-22      2.5928  7.4925  0.0014898       0.032024        6.7577e-14      1.33e-11

Column	Content
Gene	Gene ID
sgRNA	The number of targeting sgRNAs for each gene
HL60\|beta;KBM7\|beta	The beta scores of this gene in conditions "HL60" and "KBM7", respectively. The conditions are specified in the design matrix as an input of the mle subcommand.
HL60\|p-value	The raw p-value (using permutation) of this gene
HL60\|fdr	The false discovery rate of this gene
HL60\|z	The z-score associated with Wald test
HL60\|wald-p-value	The p value using Wald test
HL60\|wald-fdr	The false discovery rate of the Wald test
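Because the MLE output has one block of columns per condition, it is convenient to select hits per condition. The sketch below is one possible way to do this in Python, assuming a tab-separated file with the header shown above. The function name `mle_hits` and the 0.05 cutoff on the Wald FDR are illustrative assumptions.

```python
import csv
import io

# Inline sample mimicking the mle gene_summary.txt (assumed tab-separated).
HEADER = ["Gene", "sgRNA",
          "HL60|beta", "HL60|z", "HL60|p-value", "HL60|fdr",
          "HL60|wald-p-value", "HL60|wald-fdr",
          "KBM7|beta", "KBM7|z", "KBM7|p-value", "KBM7|fdr",
          "KBM7|wald-p-value", "KBM7|wald-fdr"]
ROWS = [
    ["RNF14", "10", "0.24927", "0.72077", "0.36256", "0.75648", "0.47105",
     "0.9999", "0.57276", "1.6565", "0.06468", "0.32386", "0.097625", "0.73193"],
    ["RNF10", "10", "0.10159", "0.29373", "0.92087", "0.98235", "0.76896",
     "0.9999", "0.11341", "0.32794", "0.90145", "0.97365", "0.74296", "0.98421"],
    ["RNF11", "10", "3.6354", "10.513", "0.0002811", "0.021739", "7.5197e-26",
     "1.3376e-22", "2.5928", "7.4925", "0.0014898", "0.032024", "6.7577e-14",
     "1.33e-11"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"


def mle_hits(handle, condition, fdr_cutoff=0.05):
    """Map gene ID -> beta score for genes passing the Wald FDR cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = {}
    for row in reader:
        if float(row[f"{condition}|wald-fdr"]) <= fdr_cutoff:
            hits[row["Gene"]] = float(row[f"{condition}|beta"])
    return hits


hl60_hits = mle_hits(io.StringIO(SAMPLE), "HL60")
kbm7_hits = mle_hits(io.StringIO(SAMPLE), "KBM7")
print(hl60_hits)
print(kbm7_hits)
```

The same function can be pointed at the permutation-based FDR by reading the `|fdr` column instead of `|wald-fdr`.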

pathway_summary_txt

The output of the pathway summary is similar to the gene summary. Here is an example:

id      num     neg|score  neg|p-value   neg|fdr neg|rank        neg|goodsgrna   pos|score  pos|p-value   pos|fdr pos|rank  pos|goodsgrna
KEGG_RIBOSOME   87      8.3272e-23      2.6473e-05      0.001238        1       50      0.051213        0.20927 0.841006        38      4
KEGG_SPLICEOSOME        125     3.7084e-08      2.6473e-05      0.001238        2       41      0.52219 0.80968 0.99902 149     13
KEGG_PROTEASOME 44      1.9586e-06      2.6473e-05      0.001238        3       18      0.52149 0.80905 0.99902 148     5

This table shows that the pathway KEGG_RIBOSOME has 87 genes, an RRA lo value of 8.3272e-23, a permutation p-value of 2.6473e-05 (negative selection), an FDR of 0.001238, and a rank of 1, and that 50 of its genes fall below the alpha cutoff. The genes in this pathway (i.e., ribosomal genes) are therefore strongly negatively selected, which is expected in negative selection CRISPR experiments.

log

This file includes the logging information during the execution. For count command, it will list some basic statistics of the dataset at the end, including the number of reads, the number of reads mapped to the library, the number of zero-count sgRNAs, etc.

Rnw and R

If the "--pdf-report" option is on for count or test command, MAGeCK may generate Rnw and R files that are used to create PDF files. MAGeCK calls the Sweave function in R to generate PDF files.

Intermediate file formats

These files will be automatically deleted after the completion of each command. To keep these files, use the "--keep-tmp" option during the execution.

gene_txt

An example of the gene ranking file (.gene.high.txt or .gene.low.txt) is as follows:

 group_id        #_items_in_group        lo_value        FDR
 RPL3    93      4.9169e-36      0.000080
 RPL8    67      1.8232e-24      0.000080
 RPS2    61      1.6928e-20      0.000080
 RPS18   40      1.0152e-18      0.000080

The contents of each column are as follows.

Column	Content
group_id	Gene ID
#_items_in_group	The number of targeting sgRNAs for each gene
lo_value	The raw p-value
FDR	The false discovery rate

RRA input

An example of the sgRNA ranking file (.plow.txt or .phigh.txt) is as follows. These files are the input of RRA.

sgrna   symbol  pool    p.low   prob    chosen
Drug_0009853    TOP2A   list    -31.3383375285032       1       1
Drug_0010808    RPS11   list    -29.865960506388134     1       1

The contents of each column are as follows.

Column	Content
sgrna	sgRNA ID
symbol	Gene ID
pool	Deprecated column. Set all the values in this column to a single value (e.g., "list")
p.low	The score used to sort sgRNA (increasing order)
prob	Reserved column. Set to 1
chosen	Reserved column. Set to 1
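If you want to feed your own sgRNA scores to RRA, the table above can be generated programmatically. The following Python sketch is an assumption-laden illustration: it assumes the file is tab-separated, fills the deprecated and reserved columns with the fixed values described above, and sorts sgRNAs by increasing score. The function name `format_rra_input` is hypothetical.

```python
def format_rra_input(scores):
    """Build RRA input text from (sgrna_id, gene_id, score) tuples.

    Rows are sorted by increasing score; the "pool", "prob" and "chosen"
    columns are filled with the fixed values "list", 1 and 1.
    """
    lines = ["\t".join(["sgrna", "symbol", "pool", "p.low", "prob", "chosen"])]
    for sgrna, gene, score in sorted(scores, key=lambda item: item[2]):
        lines.append("\t".join([sgrna, gene, "list", repr(score), "1", "1"]))
    return "\n".join(lines) + "\n"


# The two sgRNAs from the example above, deliberately given out of order.
scores = [
    ("Drug_0010808", "RPS11", -29.865960506388134),
    ("Drug_0009853", "TOP2A", -31.3383375285032),
]
rra_text = format_rra_input(scores)
print(rra_text, end="")
```

Write the returned text to a file (for example with `open(path, "w").write(rra_text)`) before passing it to RRA.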

Commonly used libraries

Download frequently used libraries

For your convenience, we provide a set of library files that are ready to be used in MAGeCK (in the -l/--list-seq option of the count command) in the libraries folder. You can also create your own library files, see sgrna-library-file for more details.

File	Explanation
broadgpp-brunello-library-corrected.txt.zip	Human Brunello genome-wide library developed by Broad Institute
Human_GeCKOv2_Library_A_3_mageck.csv.zip	Human GeCKO v2 half-library A (can be used in either 1- or 2-plasmid systems)
Human_GeCKOv2_Library_B_1_mageck.csv.zip	Human GeCKO v2 half-library B
Human_GeCKOv2_Library_combine.csv.zip	Human GeCKO v2 combined library of A and B
mouse_geckov2_library_a_2_mageck.csv.zip	Mouse GeCKO v2 half-library A (can be used in either 1- or 2-plasmid systems)
mouse_geckov2_library_b_1_mageck.csv.zip	Mouse GeCKO v2 half-library B
mouse_geckov2_library_combine.csv.zip	Mouse GeCKO v2 combined library of A and B
GeCKOv1.txt.zip	GeCKO v1 library file (from the GeCKO Science paper)
human_sam_library.csv.zip	Human Synergistic Activation Mediator (SAM) pooled library (CRISPRa library), generated by Feng Zhang laboratory.
yusa_library.csv.zip	Mouse knockout library generated by Kosuke Yusa laboratory.
tim_library.txt.zip	Human CRISPR knockout library of 7,000 genes (from T. Wang Science 2014).
tim_science2015_library.txt.zip	Human CRISPR pooled library of 18,166 genes (from T.Wang Science 2015).

Q and A

You can always ask questions on our Google group. Usually your questions are also others' questions, so please help us improve our algorithm by joining our Google group and asking questions there!

Installation problems

I encountered an error after installation: "ImportError: No module named mageck". What is the problem?

A: Probably you are installing MAGeCK to your own directory, which is not recognized by Python. The solution is to set up the PYTHONPATH environment: see install/#setting-up-the-environment-variables for more details.

Where is MAGeCK binary installed?

A: If you add the "--user" option during installation, the mageck executable is usually located in a local directory ($HOME/bin or $HOME/.local/bin). Without this option, mageck is installed in the system bin directory (/usr/bin or /usr/sbin).

There are two ways you can check the path of MAGeCK. You can either type

which mageck

to determine the path of the mageck executable. Or, at the end of the installation, you will see a few lines of the log like this:

copying build/scripts-2.7/mageck -> /Users/john/.local/bin
changing mode of /Users/john/.local/bin/mageck to 755
running install_data
copying bin/RRA -> /Users/john/.local/bin

That means your mageck is installed at /Users/john/.local/bin. On the other hand, if you see a message like this:

copying build/scripts-2.7/mageck -> /Users/john/Library/Python/2.7/bin
changing mode of /Users/john/Library/Python/2.7/bin/mageck to 755
running install_data
copying bin/RRA -> /Users/john/Library/Python/2.7/bin

That means your mageck is installed at /Users/john/Library/Python/2.7/bin.

Depending on your system, the path may look like one of the following:

/Users/john/.local/bin
/Users/john/Library/Python/2.7/bin
/Users/john/bin
/home/john/.pyenv/versions/2.7.13/bin

Where is MAGeCK python module installed?

A: You can use a similar approach to locate the MAGeCK python module, but look for a pattern like python2.7/site-packages. During installation, if you see a message like this:

copying bin/mageckGSEA -> /home/john/.pyenv/versions/2.7.13/bin
running install_egg_info
Removing /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages/mageck-0.5.6-py2.7.egg-info
Writing /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages/mageck-0.5.6-py2.7.egg-info

That means your MAGeCK python module is installed in /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages.
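The two lookups above (executable and python module) can also be done from Python itself. This is a small sketch using only the standard library; note that it reports what is visible to the interpreter running the script, which may differ from the interpreter your mageck command uses. The function name `locate_mageck` is hypothetical.

```python
import importlib.util
import shutil


def locate_mageck():
    """Return (executable_path, module_path); either may be None if not found."""
    # Same lookup as "which mageck": search the directories listed in PATH.
    executable = shutil.which("mageck")
    # Ask the import system where the "mageck" package would be loaded from.
    spec = importlib.util.find_spec("mageck")
    module_path = spec.origin if spec is not None else None
    return executable, module_path


exe_path, module_path = locate_mageck()
print("mageck executable:", exe_path)
print("mageck module:", module_path)
```

If the two paths point into different installations (for example a conda directory and $HOME/.local), that is a hint that the executable and the libraries come from different MAGeCK versions.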

I use conda to install the latest version of MAGeCK, but my system still calls an older version of MAGeCK. What is the problem?

A: This usually happens when you have both the conda version of MAGeCK and a previously installed version of MAGeCK. Even if your "mageck" command comes from conda, the libraries may still come from your previously installed MAGeCK. To solve this problem, manually update your previously installed MAGeCK to the latest version.

I don't want to run the conda MAGeCK version, but instead the version I installed by myself. How can I do that?

A: There are two different solutions to do this.

Solution 1: Uninstall the conda MAGeCK version using the following command:

conda uninstall mageck

You can always re-install MAGeCK later.

To avoid frequently uninstalling and reinstalling the software, consider using conda environments. For example, you can install the MAGeCK conda version in a dedicated environment, so that it is available only when that environment is activated.

Here is an example. First, create a python 3 environment named "mageckenv":

conda create -n mageckenv anaconda python=3

Then activate the environment using the following command:

source activate mageckenv

Now, install mageck under that environment

conda install -c bioconda mageck

You can use the MAGeCK conda version under the mageckenv environment now. To disable it, simply deactivate the environment:

source deactivate

Solution 2: The conda MAGeCK runs under python 3, while the MAGeCK from sourceforge and bitbucket runs under python 2. So the best way to run your own installed version instead of the conda version is to create a python 2 conda environment and run mageck under that environment.

To create a python 2 environment when you have miniconda3 (where MAGeCK-VISPR is hosted), type the following command:

conda create -n py2k anaconda python=2

After that, you can activate the environment by typing

source activate py2k

If you run mageck now, it will invoke the installed version. You can also deactivate your environment by typing:

source deactivate

You may also need to manually edit the PATH variable such that the system will run your local mageck first. To do this, first locate the directory of mageck from your own installation (see the question "where is MAGeCK binary installed?"). If it's in /Users/john/.local/bin, then edit the PATH variable as follows:

export PATH=/Users/john/.local/bin:$PATH

Then you should be able to run your own installed mageck, not the conda mageck. For more information, go to Setting up the environment variables.

Using MAGeCK

How to deal with biological replicates and technical replicates?

A: Usually you can pool the read counts for technical replicates of the same sample. To do this, use comma (,) to separate the fastq files of the technical replicates from the same sample in the --fastq option. For example, "--fastq sample1_replicate1.fastq,sample1_replicate2.fastq sample2_replicate1.fastq,sample2_replicate2.fastq" indicates two samples with 2 technical replicates for each sample.
For biological replicates, treat them as separate samples and use them together when doing the comparison; so MAGeCK can analyze the variance of these samples. For example in the test command, "-t sample1_bio_replicate1,sample1_bio_replicate2 -c sample2_bio_replicate1,sample2_bio_replicate2" compares 2 samples (with 2 biological replicates in each sample).
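When many samples are involved, it can be less error-prone to assemble these arguments in a script. The Python sketch below builds the argument lists for the count and test commands following the conventions above: technical replicates are joined by commas inside one --fastq entry, and biological replicates are listed as separate samples in -t and -c. The helper names and the file names (library.txt, demo) are hypothetical. The commands are only printed, not executed.

```python
import shlex


def build_count_command(library, prefix, samples):
    """samples maps a sample label to its list of technical-replicate fastq files."""
    labels = list(samples)
    # One --fastq entry per sample; technical replicates are comma-joined.
    fastq_args = [",".join(samples[label]) for label in labels]
    return ["mageck", "count", "-l", library, "-n", prefix,
            "--sample-label", ",".join(labels), "--fastq", *fastq_args]


def build_test_command(count_table, prefix, treatment, control):
    """treatment and control are lists of biological-replicate sample labels."""
    return ["mageck", "test", "-k", count_table,
            "-t", ",".join(treatment), "-c", ",".join(control), "-n", prefix]


count_cmd = build_count_command(
    "library.txt", "demo",
    {"sample1": ["sample1_replicate1.fastq", "sample1_replicate2.fastq"],
     "sample2": ["sample2_replicate1.fastq", "sample2_replicate2.fastq"]})
test_cmd = build_test_command(
    "demo.count.txt", "demo_test",
    ["sample1_bio_replicate1", "sample1_bio_replicate2"],
    ["sample2_bio_replicate1", "sample2_bio_replicate2"])
print(shlex.join(count_cmd))
print(shlex.join(test_cmd))
```

The lists can be passed to `subprocess.run(cmd, check=True)` once the file names match your data.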

The --trim-5 option can only trim a fixed length of nucleotides before sgRNA, but what if the trimming length is different in different reads?

A: Since version 0.5.6, MAGeCK can automatically determine the trimming length, even if the length differs between reads within the same fastq file. Alternatively, you can use cutadapt to trim adaptor sequences of variable length before running MAGeCK.

How do I get the simple statistics of the fastq files?

A: Since version 0.5, MAGeCK produces a "countsummary.txt" file containing all the statistics of the fastq files. If you use "--pdf-report" option, the statistics of fastq files are also in the PDF file from the test.

The statistics can also be found in the log file (for run and count command). Here is an example of the log file generated from count command (the last few lines):

INFO  @ Mon, 02 Feb 2015 08:12:15: Summary of file sample1_R1.fastq: 
INFO  @ Mon, 02 Feb 2015 08:12:15: reads        45631055 
INFO  @ Mon, 02 Feb 2015 08:12:15: mappedreads  34300176 
INFO  @ Mon, 02 Feb 2015 08:12:15: zerosgrnas   119315 
INFO  @ Mon, 02 Feb 2015 08:12:15: label        sample_1 
INFO  @ Mon, 02 Feb 2015 08:12:15: Summary of file sample2_R1.fastq: 
INFO  @ Mon, 02 Feb 2015 08:12:15: reads        36344414 
INFO  @ Mon, 02 Feb 2015 08:12:15: mappedreads  27042629 
INFO  @ Mon, 02 Feb 2015 08:12:15: zerosgrnas   119002 
INFO  @ Mon, 02 Feb 2015 08:12:15: label        sample_2

It provides the total number of reads, the number of mapped reads, the number of sgRNAs with 0 read counts, and the sample label of the fastq file.
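These summary lines are easy to extract with a short script. The following Python sketch parses log lines in the layout shown above and computes the fraction of mapped reads per file. It assumes each message follows the timestamp after a colon and a space, as in the example; the function name `parse_count_log` is hypothetical, and the layout may differ between MAGeCK versions.

```python
LOG_LINES = """\
INFO  @ Mon, 02 Feb 2015 08:12:15: Summary of file sample1_R1.fastq: 
INFO  @ Mon, 02 Feb 2015 08:12:15: reads        45631055 
INFO  @ Mon, 02 Feb 2015 08:12:15: mappedreads  34300176 
INFO  @ Mon, 02 Feb 2015 08:12:15: zerosgrnas   119315 
INFO  @ Mon, 02 Feb 2015 08:12:15: label        sample_1 
INFO  @ Mon, 02 Feb 2015 08:12:15: Summary of file sample2_R1.fastq: 
INFO  @ Mon, 02 Feb 2015 08:12:15: reads        36344414 
INFO  @ Mon, 02 Feb 2015 08:12:15: mappedreads  27042629 
INFO  @ Mon, 02 Feb 2015 08:12:15: zerosgrnas   119002 
INFO  @ Mon, 02 Feb 2015 08:12:15: label        sample_2
""".splitlines()


def parse_count_log(lines):
    """Collect the per-file statistics printed at the end of a count log."""
    samples = []
    for line in lines:
        if ": " not in line:
            continue
        # Everything after the first ": " is the message (the timestamp
        # itself contains colons but none followed by a space).
        message = line.split(": ", 1)[1].split()
        if message[:3] == ["Summary", "of", "file"] and len(message) == 4:
            samples.append({"file": message[3].rstrip(":")})
        elif samples and len(message) == 2:
            key, value = message
            if key in ("reads", "mappedreads", "zerosgrnas"):
                samples[-1][key] = int(value)
            elif key == "label":
                samples[-1][key] = value
    return samples


stats = parse_count_log(LOG_LINES)
for sample in stats:
    fraction = sample["mappedreads"] / sample["reads"]
    print(sample["label"], f"{fraction:.2%} mapped")
```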

How do I know the quality of my samples?

A: We published a paper (MAGeCK-VISPR) to describe some quality control (QC) terms to help you determine the quality of your samples.

For simple QC terms, you can just take a look at the sample statistics. Generally, in a good negative selection sample, (1) the mapped reads should be over 60 percent of the total number of reads, and (2) the number of zero-count sgRNAs should be small (<5%, and preferably <1%). One exception is positive selection experiments, where the number of zero-count sgRNAs may be much higher, but the percentage of mapped reads should still be reasonably high.
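The two rules of thumb for negative selection samples can be written as a small check. The Python sketch below is illustrative only: the function name `basic_qc` is hypothetical, the total number of sgRNAs in the library must be supplied by you, and the numbers in the example call are invented for demonstration rather than taken from a real screen.

```python
def basic_qc(reads, mapped, zero_sgrnas, total_sgrnas):
    """Apply the simple QC thresholds for a negative selection sample."""
    mapped_fraction = mapped / reads
    zero_fraction = zero_sgrnas / total_sgrnas
    return {
        "mapped_fraction": mapped_fraction,
        "zero_fraction": zero_fraction,
        "mapping_ok": mapped_fraction > 0.60,      # over 60 percent mapped
        "zero_count_ok": zero_fraction < 0.05,     # fewer than 5% zero-count
        "zero_count_preferred": zero_fraction < 0.01,  # preferably below 1%
    }


# Invented numbers, for illustration only.
report = basic_qc(reads=1_000_000, mapped=750_000,
                  zero_sgrnas=500, total_sgrnas=100_000)
for name, value in report.items():
    print(name, value)
```

For positive selection samples, only the mapping check is meaningful, since a high fraction of zero-count sgRNAs is expected there.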

You can also inspect the results by taking a look at the comparison results, see the related question below.

The program cannot read library file or control sgRNA file, but they look fine when I manually check these files. What happened?

A: One possible reason is that you saved your library file or control sgRNA file in txt or csv format using Microsoft software (such as Excel). The line break representation sometimes differs between Windows and Linux/Mac systems, which can prevent the program from reading these files.

One solution is to open your txt file in Microsoft Excel, copy all the contents (Ctrl+A, Ctrl+C), paste them into a plain text editor like Vim (Ctrl+V), and save the file in plain txt format.
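As an alternative to copying and pasting, the line breaks can be converted with a few lines of Python. This sketch assumes the only problem is Windows-style (CRLF) or old Mac-style (CR) line endings; the function names are hypothetical and the library rows in the example are made up.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so that it does not become two line breaks.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(source_path, target_path):
    """Read a file in binary mode and write a copy with LF line endings."""
    with open(source_path, "rb") as source:
        cleaned = normalize_newlines(source.read())
    with open(target_path, "wb") as target:
        target.write(cleaned)


# Made-up library rows with mixed line endings.
raw = b"s1\tACGT\tGENE1\r\ns2\tTTGA\tGENE2\r"
print(normalize_newlines(raw))
```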

The MLE module uses more CPU resources than expected, even if I specify the number of threads in --threads option. How to solve this problem?

A: The reason is that numpy and scipy use MKL and OpenBLAS. Both libraries use multiple CPUs to accelerate numeric calculations (e.g., matrix operations). To limit the number of CPUs to 1 per thread, set the OMP_NUM_THREADS environment variable on Linux systems. In other words, before running the mageck mle command, type the following command in the terminal:

export OMP_NUM_THREADS=1

This solution comes from the discussion here.
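If you launch mageck mle from a Python script rather than a terminal, the same variable can be set for the child process only. This is a minimal sketch; the helper name `single_threaded_env` is hypothetical, and the mageck command in the comment uses placeholder file names.

```python
import os


def single_threaded_env(base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    return env


env = single_threaded_env()
print(env["OMP_NUM_THREADS"])

# Possible usage (not executed here; file names are placeholders):
# import subprocess
# subprocess.run(["mageck", "mle", "-k", "count.txt", "-d", "designmat.txt",
#                 "-n", "demo_mle", "--threads", "4"], env=env, check=True)
```

Passing a copied environment leaves the settings of the calling process untouched.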

How to perform paired analysis?

A: Since version 0.5.9, MAGeCK RRA introduces paired comparison between treatments and controls (--paired option). This option allows MAGeCK to make full use of paired samples to boost the statistical power. It is especially useful if the data between two (or more) replicates is poorly correlated, and you want to find top hits that are consistent between paired samples.

Paired samples are usually biological replicates that have treatment and control conditions independently. For example, you have two replicates (r1, r2), and for each replicate you perform screens on treatment and control conditions separately. In the end you have four samples (treatment_r1, treatment_r2, control_r1, control_r2).

You can now run MAGeCK RRA to compare treatment and control conditions, but add an additional --paired parameter to tell MAGeCK that (treatment_r1, control_r1) and (treatment_r2, control_r2) are paired:

mageck test -k count.txt -t treatment_r1,treatment_r2 -c control_r1,control_r2 --paired

In the --paired mode, the number of samples in -t and -c must match, and the samples must be listed in exactly the same order.

MAGeCK handles paired samples by treating the sgRNAs in each pair as independent sgRNAs; this is equivalent to doubling the number of sgRNAs per gene (if you have two paired samples). The independence assumption does not always hold, especially if the correlation between replicates is high, in which case it may introduce false positives. Therefore, use the --paired option only if the correlation between paired samples is low and you want to find consistent signals between paired replicates.
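One way to judge whether replicates are poorly correlated is to compare per-sgRNA log2 fold changes between the pairs. The Python sketch below is a rough illustration, not MAGeCK's own procedure: it uses a pseudocount of 1, ignores normalization, and computes a plain Pearson correlation. The function names are hypothetical and the counts are invented for demonstration.

```python
import math


def log2_fold_changes(treatment, control, pseudocount=1.0):
    """Per-sgRNA log2 fold change between two count vectors of equal length."""
    return [math.log2((t + pseudocount) / (c + pseudocount))
            for t, c in zip(treatment, control)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Invented counts for four sgRNAs in two paired replicates.
lfc_r1 = log2_fold_changes([3, 7, 15, 1], [1, 1, 1, 1])
lfc_r2 = log2_fold_changes([5, 9, 20, 2], [2, 2, 2, 2])
print("replicate correlation:", round(pearson(lfc_r1, lfc_r2), 3))
```

What counts as "low" correlation is a judgment call that depends on the screen; no specific threshold is implied here.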

Interpreting results

How do I know if my experiments work well?

A: First of all, make sure your sample statistics look good (see the related question in "Counting sgRNAs from fastq files"). Next, take a look at the rankings of some well-known genes. In negative selection experiments, you should expect some ribosomal genes and well-known oncogenes to be at the top; for example, MYC, RAS, etc. In positive selection experiments, TP53 usually has a high ranking.

Besides visually inspecting top-ranked genes, a good validation is to run the pathway command to test MSigDB KEGG pathways (see MSigDB website). In negative selection experiments (usually some condition compared with the day 0 condition), you should expect to see a set of essential pathways ranking at the top, like ribosome, spliceosome, proteasome and cell cycle genes. If you see these pathways coming out, this is a good sign that your experiments are working. The smaller their RRA lo_value and p values, the better.

I see very few genes that are below a certain FDR cutoff (like 0.10). Why is that and what should I do?

A: There are a couple of reasons why top-ranked genes may have a high FDR. First, many CRISPR/Cas9 libraries are designed with few sgRNAs (<7) for each gene. Since some of them may have low cutting efficiency or off-target effects, there may not be enough statistical power to detect essential genes. Second, if there are too many comparisons (or genes), the multiple comparison adjustment may lead to a high FDR estimation. Also, because MAGeCK employs a fairly stringent statistical framework to evaluate statistical significance, its FDR estimation may be conservative.

There are several things you can do to increase the sensitivity. First, try to filter out genes that you think are not hits before running MAGeCK; for example, remove genes that have extremely low expression or genes that have very few targeting sgRNAs (<4). Second, if you have a list of negative control genes (genes that you think are not essential, like AAVS1), you can specify the corresponding sgRNA IDs using the --control-sgrna option (see below), thus allowing MAGeCK to have a better estimation of the null distribution. Third, if your replicates are paired samples, consider using the --paired option (see here).

What does the --control-sgrna CONTROL_SGRNA option do? How to use this option?

A: This option tells MAGeCK to use provided negative control sgRNAs to generate the null distribution when calculating the p values. If this option is not specified, MAGeCK generates the null distribution of RRA scores by assuming all of the genes in the library are non-essential. This approach is sometimes over-conservative, and you can improve this if you know some genes are not essential. By providing the corresponding sgRNA IDs in the --control-sgrna option, MAGeCK will have a better estimation of p values.

In addition, you can use the list of negative control sgRNAs to do the normalization. If --norm-method control option is specified, the median factor used for normalization will be calculated based on control sgRNAs only, rather than all the sgRNAs (by default).

New since 0.5.9.3: We include a new demo (demo5) in the MAGeCK source code to demonstrate the usage of control-sgrnas. Besides, we have an additional --control-gene option to specify the control genes instead of control sgRNAs.

To use this option, you need to prepare a text file specifying the IDs of control sgRNAs, one line for one sgRNA ID. Here is an example of the file:

NonTargetingControlGuideForHuman_0001
NonTargetingControlGuideForHuman_0002
NonTargetingControlGuideForHuman_0003
NonTargetingControlGuideForHuman_0004

There are several issues that you need to keep in mind:

You should have a sufficient number of negative control guides (>100 recommended) for accurate p value estimation and normalization.
It is known that for growth based screens, non-targeting controls may lead to high false positives (e.g., Morgens et al. 2017). Use non-targeting controls carefully.
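The control sgRNA file can be derived from the library file instead of being typed by hand. The Python sketch below assumes a library with the sgRNA ID in the first column (tab- or comma-separated) and control guides that share the ID prefix used in the example above. The function name, the non-control rows, and the sequences are hypothetical.

```python
def select_control_ids(library_lines,
                       prefix="NonTargetingControlGuideForHuman_"):
    """Return sgRNA IDs (first column) that start with the control prefix."""
    ids = []
    for line in library_lines:
        line = line.strip()
        if not line:
            continue
        # Accept either tab- or comma-separated library files.
        sgrna_id = line.replace(",", "\t").split("\t")[0]
        if sgrna_id.startswith(prefix):
            ids.append(sgrna_id)
    return ids


# Hypothetical library rows: sgRNA ID, sequence, gene.
LIBRARY = [
    "s_0001\tTGTTCACAGTATAGTTTGCC\tGENE_A",
    "NonTargetingControlGuideForHuman_0001\tACGGAGGCTAAGCGTCGCAA\tNonTargeting",
    "NonTargetingControlGuideForHuman_0002,CGCTTCCGCGGCCCGTTCAA,NonTargeting",
]
control_ids = select_control_ids(LIBRARY)
control_file_text = "\n".join(control_ids) + "\n"  # one sgRNA ID per line
print(control_file_text, end="")
if len(control_ids) <= 100:
    print("note: more than 100 control guides are recommended")
```

Save `control_file_text` to a text file and pass its path to the --control-sgrna option.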

Visualization

The test or count command is successful but I have some problems producing the PDF file. How can I generate the PDF file?

A: MAGeCK will generate .R and .Rnw file even if the "--pdf-report" option is not specified. You can copy these files to a new computer where both R and pdflatex are properly installed, and use the following command to generate PDF files:

Rscript *.R

Note that for the count command, the median-normalized read count file (.median_normalized.csv) should also be copied to the same directory. For the test command, the gene summary file (.gene_summary.txt) should also be copied to the same directory.

I run into issues of generating pdf files using latex.

A: You may get some error messages like this:

Error in texi2dvi("recount_countsummary.tex", pdf = TRUE) :  
Running 'texi2dvi' on 'recount_countsummary.tex' failed.

This may be due to the system compatibility issue of latex. You can still get some figures generated from MAGeCK, by adding the "--keep-tmp" option to keep intermediate files.

Version history
- 0.5.9
- 0.5.8
- 0.5.7
- 0.5.6
- 0.5.5
- 0.5.4
- 0.5.3
- 0.5.2
- 0.5.1
- 0.5
- 0.4.4
- 0.4.3
- 0.4.2
- 0.4.1
- 0.4
- 0.3
- 0.2
- 0.1

For the latest releases and version history, see our bitbucket repo.

0.5.9

2019.07.01 Version 0.5.9

Fix a bug of incorrect sample labels for count table input in mageck count module.
Revise plotting function.
Add paired comparisons to compare paired treatment vs control samples.
Add --variance-estimation-samples option to estimate variances from designated samples; remove --variance-from-all-samples option
Add a "-P" option for calculating RRA scores.
Fix an issue of RRA printing too much debugging information.
Fix an issue of over-estimating variances from a few guides with extremely high count. Right now, only consider guides whose average count falls below mean(m)+ 4 std(m), where m is the average count of all guides. Thanks Jake Freimer for pointing this out.
0.5.8a: Fix a bug that causes a math domain error when performing mageck test.

0.5.8

2019.01.04 Version 0.5.8

Report count table with empty gene IDs.
Fix a bug to improperly handle the return character in control sgRNA list file in RRA.
Fix a bug with improper count of sgRNAs with multiple trim-5 lengths in mageck count command.
Fix a bug to skip mapped sgRNAs in sam/bam files.
Improve alpha calculation and gene LFC calculation to exclude guides filtered by --remove-zero option.
0.5.7b: Add paired-end read support (Thanks to Wubing Zhang!): find sgRNAs in the 2nd pair if the 1st pair does not include sgRNAs.
Add warning messages if normalization appears to be improper.
Mean-var modeling is now based on control samples, unless --variance-from-all-samples is used.
0.5.7a: MLE permutation will be performed on negative control sgRNAs, if --control-sgrna option is provided.
Skip MLE calculation for genes whose number of sgRNA is greater than a threshold (defined in --max-sgrnapergene-permutation). This change will greatly increase the speed of MLE calculation.

0.5.7

2018.01.05 Version 0.5.7

Change the average calculation of sgRNA counts from mean to median to better tolerate outliers.
Add a beta function to estimate CNV profiles based solely on CRISPR screening data.
Fix a bug to calculate positive selection lfc values of genes from negative selection sgRNAs.
Improve the wald p value calculation method in MLE module (Thanks to Chen-Hao!).
Update the permutation p value calculation of MLE. Genes with the same sgRNAs are permuted together to calculate p values. It takes longer time but the p values are more accurate. Also change the default --permutation-round to 2 to save time.
Improve variance estimation in RRA by considering raw variance. If more samples are provided, the adjusted variance will also consider the raw variances, calculated directly from these samples. If more than 8 samples are used (control + treatment), the adjusted variance will be calculated directly from raw variances.
Improve --remove-zero option to allow users choose from none, control, treatment, both or any; set default to both; also allow users to choose the threshold of --remove-zero by adding --remove-zero-threshold option.
Improve the ranking of sgRNA score to avoid the bias of low read-count sgRNAs in samples with many low read-count sgRNAs. The results of RRA will also be improved based on the revised score.

0.5.6

2017.05.17 Version 0.5.6

Fix a bug of pathway enrichment analysis
Add a QC module for count analysis. Use the --day0-label option to enable the QC module and specify the control sample label (usually day 0 or plasmid). For all other sample labels, MAGeCK will invoke "test" command to get negative selection gene list, and invoke "pathway" command to check whether known genes (specified by --gmt-file option) are negatively selected.
By default, the permutation of RRA is now done by considering genes with the same number of sgRNAs. The speed of RRA is decreased (as permutation is performed separately) but the p value estimation is more accurate.
By default, MAGeCK count command will automatically determine the trimming length of the fastq file.
Add a parameter (--additional-rra-parameters) to allow users more controls of RRA calling.
RRA now allows filtering genes by their numbers and percentages of good sgRNAs. This will improve the FDR calculation.
Fix a bug of processing BAM file headers.
Suppress too many warnings in BAM file processing due to duplicated sgRNA sequences.
RRA will skip control genes if control sgRNA is specified.
Incorporate a beta version of CNV normalization.
Fix a bug of incorrect column number in pathway enrichment analysis.
Add a parameter (--max-sgrnapergene-permutation) in RRA to increase the speed of RRA when a region is targeted by many sgRNAs.
Add functions to correct the effects of copy number variation (CNV) in both RRA and MLE modules -- thanks to the great work from Alex Wu.

0.5.5

2016.12.02 Version 0.5.5

Allow read count table as input when running the count module. This will allow users to normalize counts and get statistics based on count tables.
Allow BAM and SAM files as input when running the count module. This will allow users to use read mapping algorithms (like Bowtie) to map reads and collect read counts
Try to match the longest sgRNA first when counting
Fix a bug to require IPython for mle
Suppress negative control gene output when --control-sgrna is specified
Fix a bug to cause unexpected exit when calculating sgRNA efficiencies
For test module, MAGeCK now reports the log fold change (LFC) of gene level using various calculation methods, including mean, median, mean/median for "good" sgRNAs, and "secondbest" that reports the LFC for the second best sgRNA.

0.5.4

2016.06.29 Version 0.5.4

Fix a bug in MLE design matrix file processing
Fix a bug in --unmapped-to-file option
Update RRA positive selection procedure, and a smaller alpha cutoff is set for stronger positive selection experiments. The results are now less affected by 0-count sgRNAs
Add an extra step in MLE to skip instances with too many sgRNAs and too many conditions (causes memory error)
Improve the argument parsing module
Initial incorporation of Bayes estimation (experimental)
Improve MLE beta score estimation to avoid biased beta distribution and unreasonably large beta scores for some genes.
Add multiprocessing options for MLE module.
Remove the 100,000 gene limit in RRA.
MAGeCK now doesn't require a password to unzip the source code.
Disable run subcommand and update demo.

0.5.3

2016.01.15 Version 0.5.3

Update functions for pdf-report options
For count command, now it supports gRNAs with various lengths. The program will automatically extract length information from library file, and the --sgrna-length option will not be working if the library file is specified
MAGeCK count now supports gzipped fastq files besides fastq files. You can provide the gzipped files (the file name must end with .gz) directly to MAGeCK

0.5.2

2015.08.09 Version 0.5.2

For variable --trim-5 option, we advise using cutadapt program as the pre-processing step of mageck count
Improve the --pdf-report plots and statistics of mageck count
A new (still experimental) maximum-likelihood (MLE) approach for gene calling is introduced

0.5.1

2015.06.23 Version 0.5.1

Add one real dataset workflow in the documentation
MAGeCK can now run on python 3 (still experimental)
FDR calculation method can be specified on both sgRNA and gene level
The column header of the output has changed to a more human readable form
Add several more statistics to read count summary

0.5

2015.04.26 Version 0.5

Add multiple visualization functions (still experimental).
Fix a bug in negative binomial calculation (thanks to Ido Tamir).

0.4.4

2015.03.19 Version 0.4.4

Improve the running speed of the RRA program.
For gene testing, MAGeCK now accepts multiple -t and -c pairs, allowing generating one summary table containing results of multiple comparisons.
Modify the format of gene_summary.txt; the duplicated "item" column is now removed for positive selection results. Also, two more columns are added to better help users identify true hits: "lo": the RRA lo values; "goodsgrna": the number of sgRNAs in this gene whose ranking is higher than the alpha threshold.
MAGeCK now allows users specifying a set of control sgRNAs to generate null distributions.
Fix two bugs in calculating the median factor during normalization (thanks to Bastiaan Evers).
Add the "-v/--version" command.

0.4.3

2015.02.12 Version 0.4.3

Fix a bug where the program exits unexpectedly for certain samples with many 0 read counts.
Fix a bug of pathway analysis where the RRA program stops early for certain gene belonging to too many pathways.

0.4.2

2015.02.04 Version 0.4.2

Create youtube tutorial videos for installation and sample comparisons.
Improve the median normalization method to handle cases with many zero-count sgRNAs.
The median normalized read count are provided in the count command.
Modify the count command line options to accept combining reads from technical replicates.
Provide simple statistics for processing fastq files.
Provide library file for Synergistic Activation Mediators (SAM), a CRISPR activation protocol developed in Feng Zhang laboratory (http://www.addgene.org/crispr/libraries/sam/).

0.4.1

2014.12.01 Version 0.4.1

Increase the default alpha cutoff from 0.05 to 0.25.
Provide some of the commonly used library files for the convenience of users.

0.4

2014.11.13 Version 0.4

Added the BSD license information.
Improved the logging system.
The control_id and treatment_id options now can be specified using sample strings.
Merge positive selection and negative selection genes and pathways into one file.
Add the --keep-tmp option to control intermediate files after running.
Fixed one bug in FDR calculation.

0.3

2014.07.01 Version 0.3

The installation method is changed so users can now more easily install the software.
Added a new feature to detect enriched pathways (pathway command)
Changed the input format of the program:
- The second column of the count table (generated by the count subcommand and used by the test subcommand) is now the gene name.
- For the count subcommand, the sgRNA information is provided with the library file.

0.2

2014.04.17 Version 0.2

Updated the demo and wiki page

0.1

2014.04.04 Version 0.1

The source code released.

Project Members:

Bicna Song
Wei Li (admin)

Discussion

Fiona Hartley - 2022-08-24

Hello,
I'm getting the following error message when running mageck test in --paired mode
"An error occurs while trying to compute p values. Quit.."
Up until then, the log file appears normal.

I can't figure out why this error is occurring. I'm trying to run analysis on x3 control and x3 treated samples. This error occurs if I try to run all the samples together (1+2+3), or if I try to run 1+3, leaving out sample 2. However, the program works fine if I run it on each sample individually, or if I run samples 1+2 or 2+3. The program also ran successfully on all samples when not using paired mode, so I'm confident I'm using the program correctly and that my input files are as they should be.

Could you please advise what would cause the program to error at the pvalue stage? Thank you :)
