Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout (MAGeCK) is a computational tool for identifying important genes from genome-scale CRISPR-Cas9 knockout (GeCKO) screens. MAGeCK was developed by Wei Li and Han Xu in Dr. Xiaole Shirley Liu's lab at the Dana-Farber Cancer Institute, and is actively maintained by the Wei Li lab at Children's National Medical Center. The MAGeCK algorithm is described in the following paper:
Besides MAGeCK, we also developed the following software and algorithms:
MAGeCK and associated software offer a range of functions to meet the analysis needs of different users, including:
and so on.
MAGeCK and MAGeCK-VISPR are free, open source software under the BSD license. We greatly appreciate the support from The Claudia Adams Barr Program in Innovative Basic Cancer Research and NHGRI (NIH) to develop MAGeCK and MAGeCK-VISPR.
We have used MAGeCK/MAGeCK-VISPR in many screening projects, including the identification of functional lncRNAs with a validation rate close to 100% (Zhu, Li, et al. Nature Biotechnology 2016), the resistance mechanism to T cell killing (Pan et al. Science 2018), the function of RNA binding protein in prostate cancer (Fei et al. PNAS 2017), etc.
Many independent studies also used MAGeCK/MAGeCK-VISPR to analyze their CRISPR/Cas9 coding and non-coding screening data, including:
and so on.
Any questions about MAGeCK or MAGeCK-VISPR? Check the FAQ below or join our MAGeCK Google group.
Refer to our Nature Protocols paper for running MAGeCK suites!
This documentation includes the following items:
There are several ways to install MAGeCK.
To install MAGeCK through the bioconda channel, first download and install the Python 3 variant of the Miniconda Python distribution. Then, in the command line, type
conda install -c bioconda -c conda-forge mageck
That's it!
An optional (but recommended) step is to create an isolated software environment for mageck by executing
conda create -c bioconda -c conda-forge -n mageckenv mageck
in a terminal. The environment can be activated via
source activate mageckenv
To update mageck, run
conda update mageck
from within the environment.
This environment can be deactivated via
source deactivate
You can install MAGeCK-VISPR (which includes MAGeCK) using conda, a commonly used package management software. The instructions to install MAGeCK-VISPR can be found in the MAGeCK-VISPR manual.
You can also run MAGeCK via a Docker image, which is automatically built upon each commit to our Bitbucket source code.
To run MAGeCK through the Docker image, install Docker on your own system, and follow the instructions in tutorials on running Docker images.
You can also download the software and install it by yourself. See the detailed instructions below.
The latest version of MAGeCK (0.5.9) can be downloaded here:
Or click the link here (in case the button points to a wrong file).
For earlier versions (< 0.5.4), the zip file is encrypted, but you can get the password easily by one of the following options:
You need to go to the Terminal to unzip and install the software. See the instructions for Installation below.
MAGeCK can be run on either Mac or Linux systems. Since MAGeCK is written in Python and C, Python (version 3) and a C compiler are needed.
Other dependencies include numpy and scipy.
Due to the end of the Python 2 life cycle, mageck 0.5.9 and higher versions are not designed to run on Python 2.
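A quick way to confirm these requirements before installing is a short check script. The sketch below is not part of MAGeCK: the helper name `check_mageck_requirements` is hypothetical, and it only tests that the interpreter is Python 3 and that numpy and scipy can be located. It does not check for a C compiler.

```python
import importlib.util
import sys


def check_mageck_requirements(modules=("numpy", "scipy")):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # mageck 0.5.9 and later are not designed to run on Python 2.
    if sys.version_info[0] < 3:
        problems.append("Python 3 is required, found %d.%d" % sys.version_info[:2])
    # find_spec returns None when a top-level module cannot be located.
    for name in modules:
        if importlib.util.find_spec(name) is None:
            problems.append("missing Python module: " + name)
    return problems


if __name__ == "__main__":
    issues = check_mageck_requirements()
    print("All requirements found." if not issues else "\n".join(issues))
```

Run it with the same Python interpreter you plan to use for MAGeCK, since each interpreter has its own set of installed modules.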
Since version 0.5.9.3, the MAGeCK visualization module generates an R Markdown file (.Rmd) for the count and test subcommands. This allows users to easily create an HTML report webpage using RStudio. No additional dependencies are needed for running MAGeCK itself. However, generating the report webpage requires a computer with RStudio and rmarkdown installed.
The --pdf-report option, which was the main visualization route before 0.5.9.3, needs two optional programs: R and pdflatex. MAGeCK relies on both to generate PDF reports when the --pdf-report option is used. If it is not possible to install them, you can still generate PDF reports by copying some MAGeCK output files to another computer on which R and pdflatex are properly installed. See Q and A for more information.
If you use the --pdf-report option, the R package xtable is required, and gplots and ggplot2 are optional. Use install.packages("xtable") and install.packages(c("gplots","ggplot2")) in R to install them.
You won't get any error messages if gplots is missing, but with gplots installed the PDF report of the count command contains a better-looking clustering figure.
You can run MAGeCK without --pdf-report option, and copy some files to another machine with these R packages to generate pdf report. See Q and A for more details.
You can still get some figures generated from MAGeCK, by adding the "--keep-tmp" option to keep intermediate files.
Since version 0.3, MAGeCK uses standard Python installation procedures (distutils) for compiling and installation of the software.
The installation procedure is extremely easy. First, download the source code, unzip it by using the following command (or just double-clicking it), and go into the directory in the command line:
tar xvzf mageck-0.5.4.tar.gz
cd mageck-0.5.4
After that, invoke python setup.py:
python setup.py install
And it is done! If you want MAGeCK to be installed in your own directory, use the following command instead:
python setup.py install --user
This is the easiest way to install mageck. An alternative approach is (you may have one additional step to set up the environment variables; see below)
python setup.py install --prefix=$HOME
where $HOME is the root directory in which you want to install MAGeCK (usually the user home).
The manual installation is deprecated since version 0.3. Please refer to the installation instructions above.
After downloading the source code, follow the instructions below for manual installation.
In most systems you don't need to set up the environment variables. Just type "mageck" in the command line to see if the mageck program works.
If you get a "command not found" error, that indicates the environment variables are not properly set up. There are several additional steps to finish the installation. First you need to add the path of the mageck program to your PATH variable.
There are several different situations.
Set up the PATH variable by typing:
export PATH=$PATH:$HOME/bin
You first need to determine where MAGeCK is installed. See this Q and A for additional steps to determine the correct bin directory.
If your bin directory is located in /Users/john/.local/bin, then type the following:
export PATH=$PATH:/Users/john/.local/bin
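You may also need to add the path of the MAGeCK module to the PYTHONPATH variable. Again, follow the steps above to determine the correct Python installation path (see the Q&A). The path below is only an example: the pythonX.Y directory must match the Python version you installed MAGeCK with (Python 3 for mageck 0.5.9 or higher). This variable should be set as, for example,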
export PYTHONPATH=/Users/john/.local/lib/python2.7/site-packages:$PYTHONPATH
To save the path configuration (so you don't have to type it every time), place the above command in your ~/.bashrc (for Linux) or ~/.bash_profile (for Mac).
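If you are unsure whether the "command not found" error comes from the PATH setting, a small script can look for the executable for you. This is a minimal sketch, not part of MAGeCK: `find_program` is a hypothetical helper, and the `~/.local/bin` directory is only an assumed location based on the example above.

```python
import os
import shutil


def find_program(name="mageck", extra_dirs=()):
    """Return the full path of an executable, or None if it is not found.

    extra_dirs: candidate bin directories searched in addition to PATH.
    """
    path_dirs = [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]
    search_path = os.pathsep.join(path_dirs + list(extra_dirs))
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    # Assumed location of user-level installs, as in the example above.
    user_bin = os.path.join(os.path.expanduser("~"), ".local", "bin")
    location = find_program("mageck", extra_dirs=[user_bin])
    if location is None:
        print("mageck was not found in PATH or in", user_bin)
    else:
        print("mageck found at", location)
        print("add this directory to PATH if needed:", os.path.dirname(location))
```

If the script finds mageck only in the extra directory, that directory is the one to append to PATH.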
The experimental version of MAGeCK is available at Bitbucket. Note that the source code on Bitbucket is experimental and not fully tested, so it may not be stable or function well. It is strongly recommended to use the MAGeCK software downloaded from sourceforge or from bioconda.
Running MAGeCK is extremely easy and convenient. The demo folder contains two mini examples to go through all steps in MAGeCK. Simply execute the sh script in the command line in each example to run the demos. To see how you can enable visualization functions of MAGeCK in both demos, see the visualization manual.
Some advanced tutorial topics can be found in the Advanced Tutorial page.
Also check out the following videos on YouTube to learn how to install and run MAGeCK:
Tutorial 2: Comparison between samples
Check demo/demo1 folder in the source code for the first tutorial.
There is only one command line in the tutorial:
mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial -n demo
The parameters are explained as follows.
Parameters | Meaning |
---|---|
mageck | The main portal of the MAGeCK program |
test | A sub-command to ask MAGeCK to perform sgRNA and gene ranking based on provided read count tables |
-k sample.txt | The provided read count table file. The format of the file is specified here. |
-t HL60.final,KBM7.final | The treatment samples are defined as HL60.final,KBM7.final (or the 2nd and 3rd sample, starting from 0) in sample.txt. See input files for a detailed explanation. |
-c HL60.initial,KBM7.initial | The control samples are defined as HL60.initial,KBM7.initial (or the 0th and 1st sample, starting from 0) in sample.txt. See input files for a detailed explanation. |
-n demo | The prefix of the output files is demo, so the expected output files are demo.sgrna_summary.txt, demo.gene_summary.txt, etc. |
An explanation of the output files can be found in the [output] page. For all available parameters, see the [usage] page.
You can also specify the treatment and control samples using sample index. For example,
mageck test -k sgrna_count.txt -t 2,3 -c 0,1 -n demo
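The relation between sample labels and sample indices can be illustrated with a short script. This is only a sketch of the idea and not MAGeCK's own parser: `resolve_samples` is a hypothetical helper, and the header below assumes the sample order described in the table above (controls first, treatments last).

```python
def resolve_samples(spec, header):
    """Translate a -t/-c style value into 0-based sample indices.

    spec:   comma-separated sample labels or 0-based indices, e.g. "2,3".
    header: the first line of the count table, split into fields.
    """
    # The first two columns hold the sgRNA name and the gene, not samples.
    sample_labels = header[2:]
    indices = []
    for item in spec.split(","):
        if item in sample_labels:
            indices.append(sample_labels.index(item))
        elif item.isdigit() and int(item) < len(sample_labels):
            indices.append(int(item))
        else:
            raise ValueError("unknown sample: " + item)
    return indices


header = ["sgRNA", "Gene", "HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"]
print(resolve_samples("HL60.final,KBM7.final", header))  # same samples as "2,3"
print(resolve_samples("0,1", header))                    # the two control samples
```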
Check demo/demo2 folder in the source code for this tutorial
This demo shows a mini example of how to go through the whole pipeline starting from raw fastq files. In this example, we have fastq files from two conditions, and we would like to find which genes and sgRNAs differ significantly between the conditions. The command lines used in the runmageck.sh script are:
mageck count -l library.txt -n demo --sample-label L1,CTRL --fastq test1.fastq test2.fastq
mageck test -k demo.count.txt -t L1 -c CTRL -n demo
The "test" command is the same as the first demo. The parameters of the "count" command are explained as follows.
Parameters | Meaning |
---|---|
mageck | The main portal of the MAGeCK program |
count | A sub-command to ask MAGeCK to generate sgRNA read count table. |
-l library.txt | The provided sgRNA information, including the sgRNA id, the sequence, and the gene it is targeting. See input files for a detailed explanation. |
-n demo | The prefix of the output files. |
--sample-label L1,CTRL | The labels of the two samples are L1 (test1.fastq) and CTRL (test2.fastq). |
--fastq test1.fastq test2.fastq | The provided fastq files, separated by spaces. (Technical replicates of the same sample can also be indicated using a comma as a separator; for example, "sample1_replicate1.fastq,sample1_replicate2.fastq") |
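When there are many samples, it can help to assemble the count command from a small table of samples instead of typing it by hand. The sketch below is an illustration only: `build_count_command` is a hypothetical helper, and it uses only the options explained in the table above.

```python
import shlex


def build_count_command(library, prefix, samples):
    """Assemble a "mageck count" command line as a list of arguments.

    samples: dict mapping each sample label to a list of fastq files;
             several files under one label are technical replicates.
    """
    labels = list(samples)
    # Technical replicates of one sample are joined with commas;
    # different samples are passed as separate arguments.
    fastq_args = [",".join(samples[label]) for label in labels]
    return (["mageck", "count", "-l", library, "-n", prefix,
             "--sample-label", ",".join(labels), "--fastq"] + fastq_args)


cmd = build_count_command("library.txt", "demo",
                          {"L1": ["test1.fastq"], "CTRL": ["test2.fastq"]})
# Print a shell-ready version of the command.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` once mageck is installed.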
After the first two demos, you have a basic sense of how MAGeCK works. In this demo, let us go through a real dataset which is more complicated, and see how to handle some practical problems, like the trimming of the 5' end.
The dataset we use comes from the following paper: Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library. In this paper, the authors performed CRISPR/Cas9 screening on mouse ESC cells and identified genes that are essential in these cells.
The fastq files of the screens are publicly available in the ENA archive. There are different replicates for one condition, but for simplicity, let us only download the following two fastq files and use them to test MAGeCK functions.
Accession | Sample | Download Link |
---|---|---|
ERR376998 | one replicate of plasmid | ERR376998 |
ERR376999 | one replicate of ESC | ERR376999 |
You can download these files, double click to unzip them (or use gunzip in the terminal), and place them into one separate folder:
gunzip ERR376998.fastq.gz
gunzip ERR376999.fastq.gz
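To peek at the first reads without opening a large file in an editor, a few lines of Python are enough. This is a minimal sketch under the assumption that each FASTQ record spans exactly four lines; `first_sequences` is a hypothetical helper, and it reads both plain and gzip-compressed files.

```python
import gzip
import itertools


def first_sequences(path, n=4):
    """Return the sequence lines of the first n reads in a FASTQ file.

    Works for plain text and for gzip-compressed (.gz) files.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Each record has four lines; the sequence is the second one.
        lines = itertools.islice(handle, 4 * n)
        return [line.strip() for i, line in enumerate(lines) if i % 4 == 1]
```

For example, `first_sequences("ERR376998.fastq")` would return the sequences of the first four reads of the plasmid sample.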
The next step is to prepare the library file so MAGeCK will know which sgRNA targets which gene. If you are using one of the standard GeCKO libraries, you can just download the files from MAGeCK sourceforge. For non-standard libraries, you need to prepare the library file according to the library file format.
In this demo, you can generate the library file using Supplementary Data 2 (or Supplementary Table S7) from the paper, or download it directly from our collection of libraries (the file name is "yusa_library.csv.zip"). Double click to unzip it (or use "unzip" in the terminal).
**Note: since version 0.5.6, MAGeCK is able to automatically determine the trimming length and sgRNA length in most cases. Therefore, you can skip this step unless MAGeCK fails to do so by itself.**
In many cases, your sequencing primer is not exactly in front of the first base of the guide RNA. This is indeed the case in this demo, where the first few bases of every read in the fastq file are identical. Make sure you know exactly how many bases to trim before running MAGeCK. You can ask the people who performed the experiment, or get this information by taking a look at the first few lines of the fastq files.
Here are the first few lines of ERR376998.fastq (only sequences are shown):
CTTGTGGAAAGGACGAAACACCGGTGAAGGTGCCGTTGTGTAGTTTTAGA
CTTGTGGAAAGGACGAAACACCGAGCAGCACAACAATATGGGTTTTAGAG
CTTGTGGAAAGGACGAAACACCGCTCTTGGGTTTGGATGTTTGTTTTAGA
CTTGTGGAAAGGACGAAACACCGTTTGGCGAGGGGAGCGCCGGTTTTAGA
......
You can see that the first 23 nucleotides are identical, so in this case you need to tell MAGeCK to trim the first 23 nucleotides to collect read counts (--trim-5 23). If the nucleotide length in front of sgRNA varies between different reads, use cutadapt to remove the adaptor sequences.
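The trimming length can also be estimated programmatically as the length of the prefix shared by the reads. The sketch below applies this idea to the four reads shown above; `shared_prefix_length` is a hypothetical helper and is not how MAGeCK itself determines the trimming length.

```python
import os

# The four reads from ERR376998.fastq shown above.
reads = [
    "CTTGTGGAAAGGACGAAACACCGGTGAAGGTGCCGTTGTGTAGTTTTAGA",
    "CTTGTGGAAAGGACGAAACACCGAGCAGCACAACAATATGGGTTTTAGAG",
    "CTTGTGGAAAGGACGAAACACCGCTCTTGGGTTTGGATGTTTGTTTTAGA",
    "CTTGTGGAAAGGACGAAACACCGTTTGGCGAGGGGAGCGCCGGTTTTAGA",
]


def shared_prefix_length(sequences):
    """Length of the longest prefix shared by every sequence."""
    # commonprefix compares character by character, so it works on any strings.
    return len(os.path.commonprefix(list(sequences)))


print(shared_prefix_length(reads))  # 23 for the four reads above
```

With only a handful of reads, the guides may share their first bases by chance, so it is safer to check a few hundred reads; adding reads can only shorten the shared prefix. A single read with a sequencing error in the constant region will also shorten it, so treat the result as a hint to verify by eye.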
The sgRNA length can be determined from the experimental design. It is usually 20 nucleotides, but in this demo the sgRNA length is 19.
Now we have everything ready to generate count tables from MAGeCK. Place two fastq files and one library file into the same directory, and under that directory, run MAGeCK on terminal:
mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --fastq ERR376998.fastq ERR376999.fastq
This command also tells MAGeCK to assign labels to each library ("plasmid" for ERR376998.fastq, and "ESC1" for ERR376999.fastq), and output the file with prefix "escneg". Note that MAGeCK will automatically determine the length of the sgRNAs from the library, so you don't have to specify it here.
If it is running successfully, you will see one file "escneg.count.txt" collecting all read counts. The top lines are as follows:
sgRNA Gene plasmid ESC1
chr19:5884430-5884453 SLC25A45 13 32
chr11:58831475-58831498 OLFR312 94 108
chr4:49282352-49282375 E130309F12RIK 85 128
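A quick sanity check on a new count table is to sum the reads of each sample. The following sketch does this for the three rows shown above, written as tab-separated text; `sample_totals` is a hypothetical helper, and it assumes the first two columns are the sgRNA name and the gene.

```python
import csv
import io

# Demo rows copied from the listing above, written as tab-separated text.
COUNT_TABLE = (
    "sgRNA\tGene\tplasmid\tESC1\n"
    "chr19:5884430-5884453\tSLC25A45\t13\t32\n"
    "chr11:58831475-58831498\tOLFR312\t94\t108\n"
    "chr4:49282352-49282375\tE130309F12RIK\t85\t128\n"
)


def sample_totals(handle):
    """Sum the read counts of every sample column in a count table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    totals = {label: 0.0 for label in header[2:]}
    for row in reader:
        for label, value in zip(header[2:], row[2:]):
            totals[label] += float(value)
    return totals


print(sample_totals(io.StringIO(COUNT_TABLE)))
```

For a real file, replace the `io.StringIO(...)` argument with `open("escneg.count.txt")`.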
If you use the --pdf-report option (see Visualization), it will generate a nice PDF report of the sample statistics of the fastq files. Click Here to see the PDF results.
If you want to manually use the --trim-5 option determined in step 3, the command becomes:
mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --trim-5 23 --fastq ERR376998.fastq ERR376999.fastq
With the read count table, now you can compare ESC1 vs. plasmid condition to see which genes are negatively or positively selected:
mageck test -k escneg.count.txt -t ESC1 -c plasmid -n esccp
This command tells MAGeCK to compare ESC1 with plasmid in the read count table escneg.count.txt, and output results with prefix "esccp".
If successful, you should see a file "esccp.gene_summary.txt". The top lines are as follows:
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna
GTF2B 5 2.0462e-10 2.5851e-07 0.000707 1 5 1.0 1.0 1.0 19150 0
RPS5 5 5.9353e-10 2.5851e-07 0.000707 2 5 1.0 1.0 1.0 19149 0
RPL19 4 2.695e-09 2.5851e-07 0.000707 3 4 1.0 1.0 1.0 19148 0
KIF18B 5 1.0136e-08 2.5851e-07 0.000707 4 5 1.0 1.0 1.0 19146 0
....
You can immediately see that two ribosomal genes, RPS5 and RPL19, are at the top of the negatively selected genes. If you rank the genes by "pos|rank" (11th column), you will see TRP53 (mouse homolog of TP53) near the top of the positively selected genes:
sort -k 11,11n esccp.gene_summary.txt | less
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna
ZFP945 5 1.0 1.0 0.999999 19150 0 9.6166e-07 5.4287e-06 0.05198 1 5
TRP53 5 0.95411 0.95409 0.999999 17901 0 1.0347e-06 5.4287e-06 0.05198 2 4
PDAP1 5 0.85937 0.86223 0.999999 15753 1 7.6412e-06 2.8178e-05 0.174505 3 2
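The same ranking can be done in Python, which avoids sorting the header line together with the data. This is a minimal sketch, assuming the gene summary file is tab-separated; `top_genes` is a hypothetical helper and the embedded rows are copied from the listings above.

```python
import csv
import io

# A few rows copied from the listings above; real files are read the same way.
ROWS = """\
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna
GTF2B 5 2.0462e-10 2.5851e-07 0.000707 1 5 1.0 1.0 1.0 19150 0
TRP53 5 0.95411 0.95409 0.999999 17901 0 1.0347e-06 5.4287e-06 0.05198 2 4
ZFP945 5 1.0 1.0 0.999999 19150 0 9.6166e-07 5.4287e-06 0.05198 1 5
"""
# Convert the whitespace-separated demo text to tab-separated text.
GENE_SUMMARY = "\n".join("\t".join(line.split()) for line in ROWS.splitlines())


def top_genes(handle, rank_column="pos|rank", n=10):
    """Return the ids of the n best-ranked genes for the given rank column."""
    reader = csv.DictReader(handle, delimiter="\t")
    ranked = sorted(reader, key=lambda row: int(row[rank_column]))
    return [row["id"] for row in ranked[:n]]


print(top_genes(io.StringIO(GENE_SUMMARY)))                          # positive selection
print(top_genes(io.StringIO(GENE_SUMMARY), rank_column="neg|rank"))  # negative selection
```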
As with the count command, if you use the --pdf-report option, a PDF file will be generated. Here is an example of the PDF file generated in this demo.
Right now you should be quite familiar with basic functions of MAGeCK. MAGeCK also provides additional functions for you to further explore the data, for example, test the enrichment of pathways, plot the top-ranked genes or genes you are interested in, etc. If you have further questions, feel free to ask in our google group. Enjoy your MAGeCK trip!
Since version 0.5, MAGeCK provides a new subcommand, mle, to calculate gene essentiality from CRISPR screens. Compared with the original algorithm in "test" subcommand, MAGeCK-mle uses a measurement called beta score to call gene essentialities: a positive beta score means a gene is positively selected, and a negative beta score means a gene is negatively selected. It is similar to the term log fold change in differential expression, and compared with the original RRA algorithm, this measurement has the following advantages:
This demo will help you go through all the steps in running the mle module.
**The demo/demo3 folder provides an example for running MAGeCK MLE, plus an optional copy number correction module (see advanced tutorials section).**
For simplicity, let's assume you already know how to generate read count table from fastq files; if not, check the third demo above. We will use the read count table presented in T Wang et al. Science 2014.
Download the read count table here.
The design matrix file indicates which sample is affected by which condition. It is generally a binary matrix indicating which sample (indicated by the first column) is affected by which condition (indicated by the first row). For the meanings of the design matrix, check the input file format page.
To create a design matrix file, copy the following content to a text editing software, and save it as a plain txt file:
Samples baseline HL60 KBM7
HL60.initial 1 0 0
KBM7.initial 1 0 0
HL60.final 1 1 0
KBM7.final 1 0 1
Remember the following rules of a design matrix file:
In the design matrix above, we have four samples, two corresponding to the initial states of two cell lines, and two corresponding to the final states of two cell lines. We design two conditions (HL60 and KBM7) that model the cell type-specific effects.
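For larger experiments it may be easier to generate the design matrix from a list of samples than to type it. The sketch below rebuilds the matrix above; `design_matrix_text` is a hypothetical helper, the tab separator is an assumption, and giving every sample a 1 in the baseline column simply follows the example above.

```python
def design_matrix_text(conditions, samples):
    """Build the text of a design matrix file.

    conditions: condition names; the first one is the baseline.
    samples:    list of (sample label, set of non-baseline conditions
                that affect this sample).
    """
    lines = ["\t".join(["Samples"] + list(conditions))]
    for label, active in samples:
        unknown = set(active) - set(conditions[1:])
        if unknown:
            raise ValueError("unknown condition(s): " + ", ".join(sorted(unknown)))
        # Baseline column is 1 for every sample, as in the example above.
        row = ["1"] + ["1" if name in active else "0" for name in conditions[1:]]
        lines.append("\t".join([label] + row))
    return "\n".join(lines) + "\n"


text = design_matrix_text(
    ["baseline", "HL60", "KBM7"],
    [("HL60.initial", set()), ("KBM7.initial", set()),
     ("HL60.final", {"HL60"}), ("KBM7.final", {"KBM7"})],
)
print(text, end="")
```

Writing `text` to a plain text file gives a design matrix file with the same content as the one shown above.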
Now we have the minimum requirements to run the MAGeCK mle module. Assuming you save the design matrix file as "designmat.txt", type the following command to run
mageck mle -k leukemia.new.csv -d designmat.txt -n beta_leukemia
If successful, MAGeCK mle will generate three files, the log file, the gene_summary file (including gene beta scores), and the sgrna_summary file (including sgRNA efficiency probability predictions). Here are a few lines of the gene_summary file:
Gene sgRNA HL60|beta HL60|z HL60|p-value HL60|fdr HL60|wald-p-value HL60|wald-fdr KBM7|beta KBM7|z KBM7|p-value KBM7|fdr KBM7|wald-p-value KBM7|wald-fdr
RNF14 10 0.24927 0.72077 0.36256 0.75648 0.47105 0.9999 0.57276 1.6565 0.06468 0.32386 0.097625 0.73193
RNF10 10 0.10159 0.29373 0.92087 0.98235 0.76896 0.9999 0.11341 0.32794 0.90145 0.97365 0.74296 0.98421
RNF11 10 3.6354 10.513 0.0002811 0.021739 7.5197e-26 1.3376e-22 2.5928 7.4925 0.0014898 0.032024 6.7577e-14 1.33e-11
This file includes the beta scores in two conditions specified in the design matrix (HL60|beta and KBM7|beta), and the associated statistics. For more information, check the output format specification of gene_summary file.
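As an illustration of how beta scores can be read, the sketch below labels each gene by the sign of its beta score in one condition. It is not part of MAGeCK: `classify_genes` is a hypothetical helper, the file is assumed to be tab-separated, and the FDR cutoff of 0.05 is an arbitrary value chosen for this example.

```python
import csv
import io

# Rows copied from the gene_summary listing above.
ROWS = """\
Gene sgRNA HL60|beta HL60|z HL60|p-value HL60|fdr HL60|wald-p-value HL60|wald-fdr KBM7|beta KBM7|z KBM7|p-value KBM7|fdr KBM7|wald-p-value KBM7|wald-fdr
RNF14 10 0.24927 0.72077 0.36256 0.75648 0.47105 0.9999 0.57276 1.6565 0.06468 0.32386 0.097625 0.73193
RNF10 10 0.10159 0.29373 0.92087 0.98235 0.76896 0.9999 0.11341 0.32794 0.90145 0.97365 0.74296 0.98421
RNF11 10 3.6354 10.513 0.0002811 0.021739 7.5197e-26 1.3376e-22 2.5928 7.4925 0.0014898 0.032024 6.7577e-14 1.33e-11
"""
# Convert the whitespace-separated demo text to tab-separated text.
GENE_SUMMARY = "\n".join("\t".join(line.split()) for line in ROWS.splitlines())


def classify_genes(handle, condition, fdr_cutoff=0.05):
    """Label genes by the sign of their beta score in one condition."""
    calls = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        beta = float(row[condition + "|beta"])
        fdr = float(row[condition + "|fdr"])
        if fdr > fdr_cutoff or beta == 0:
            calls[row["Gene"]] = "not significant"
        elif beta > 0:
            calls[row["Gene"]] = "positive selection"
        else:
            calls[row["Gene"]] = "negative selection"
    return calls


print(classify_genes(io.StringIO(GENE_SUMMARY), "HL60"))
```

With these three rows, only RNF11 passes the example cutoff, and its positive beta score marks it as positively selected in both conditions.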
The Advanced tutorial page provides more complicated examples for experienced users.
The main portal of MAGeCK is the mageck program, which includes a couple of different subprograms:
There is also another subprogram plot that plots some figures of the genes you are interested in from the test results.
This subcommand tests and ranks sgRNAs and genes based on the read count tables provided.
usage:
usage: mageck test [-h] -k COUNT_TABLE (-t TREATMENT_ID | --day0-label DAY0_LABEL) [-c CONTROL_ID] [--paired] [--norm-method {none,median,total,control}] [--gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD] [--adjust-method {fdr,holm,pounds}] [--variance-estimation-samples VARIANCE_ESTIMATION_SAMPLES] [--sort-criteria {neg,pos}] [--remove-zero {none,control,treatment,both,any}] [--remove-zero-threshold REMOVE_ZERO_THRESHOLD] [--pdf-report] [--gene-lfc-method {median,alphamedian,mean,alphamean,secondbest}] [-n OUTPUT_PREFIX] [--control-sgrna CONTROL_SGRNA] [--normcounts-to-file] [--skip-gene SKIP_GENE] [--keep-tmp] [--additional-rra-parameters ADDITIONAL_RRA_PARAMETERS] [--cnv-norm CNV_NORM] [--cell-line CELL_LINE]
required arguments:
Parameter | Explanation |
---|---|
-k COUNT_TABLE, --count-table COUNT_TABLE | Provide a tab-separated count table instead of sam files. Each line in the table should include sgRNA name (1st column), targeting gene (2nd column) and read counts in each sample. See input/#sgrna-read-count-file for a detailed description. |
-t TREATMENT_ID, --treatment-id TREATMENT_ID | Sample label or sample index (0 as the first sample) in the count table as treatment experiments, separated by comma (,). If sample label is provided, the labels must match the labels in the first line of the count table; for example, "HL60.final,KBM7.final". For sample index, "0,2" means the 1st and 3rd samples are treatment experiments. See input/#sample-index for a detailed description. |
--day0-label DAY0_LABEL | Specify the label for control sample (usually day 0 or plasmid). For every other sample label, the module will treat it as a treatment condition and compare with control sample. |
optional general arguments:
Parameter | Explanation |
---|---|
-h, --help | show this help message and exit |
-c CONTROL_ID, --control-id CONTROL_ID | Sample label or sample index in the count table as control experiments, separated by comma (,). Default is all the samples not specified in treatment experiments. See input/#sample-index for a detailed description. |
--paired | Paired sample comparisons. In this mode, the number of samples in -t and -c must match and have an exact order in terms of samples. For example, "-t HL60.final,KBM7.final -c HL60.initial,KBM7.initial". |
--norm-method {none,median,total,control} | Method for normalization, default median. If control is specified, the size factor will be estimated using control sgRNAs specified in --control-sgrna option. |
--gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD | FDR threshold for gene test, default 0.25. |
--adjust-method {fdr,holm,pounds} | Method for sgRNA-level p-value adjustment, including false discovery rate (fdr), Holm's method (holm), or Pounds's method (pounds). |
--variance-estimation-samples VARIANCE_ESTIMATION_SAMPLES | Sample label or sample index for estimating variances, separated by comma (,). See -t/--treatment-id option for specifying samples. |
--sort-criteria {neg,pos} | Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection. |
--remove-zero {none,control,treatment,both} | Whether to remove zero-count sgRNAs in control and/or treatment experiments. Default: none (do not remove those zero-count sgRNAs). |
--pdf-report | Generate pdf report of the analysis. |
--gene-lfc-method {median,alphamedian,mean,alphamean,secondbest} | Method to calculate gene log fold changes (LFC) from sgRNA LFCs. Available methods include the median/mean of all sgRNAs (median/mean), the median/mean of the sgRNAs that are ranked in front of the alpha cutoff in RRA (alphamedian/alphamean), or the sgRNA that has the second strongest LFC (secondbest). In the alphamedian/alphamean case, the number of sgRNAs corresponds to the "goodsgrna" column in the output, and the gene LFC will be set to 0 if no sgRNA is in front of the alpha cutoff. Default median. (new since v0.5.5) |
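The constraint of the --paired mode (equal numbers of treatment and control samples, in matching order) can be checked before MAGeCK is called. The sketch below is an illustration only; `build_test_command` is a hypothetical helper that uses just the -k, -t, -c, -n and --paired options described above.

```python
def build_test_command(count_table, treatment, control, prefix, paired=False):
    """Assemble a "mageck test" command line as a list of arguments.

    treatment, control: lists of sample labels (or indices given as strings).
    """
    if paired and len(treatment) != len(control):
        raise ValueError(
            "--paired needs the same number of treatment and control samples")
    cmd = ["mageck", "test", "-k", count_table,
           "-t", ",".join(treatment), "-c", ",".join(control), "-n", prefix]
    if paired:
        # In paired mode the i-th treatment is compared with the i-th control.
        cmd.append("--paired")
    return cmd


print(build_test_command("sample.txt",
                         ["HL60.final", "KBM7.final"],
                         ["HL60.initial", "KBM7.initial"],
                         "demo", paired=True))
```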
Optional arguments for input and output:
Parameter | Explanation |
---|---|
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
--control-sgrna CONTROL_SGRNA | A list of control sgRNAs for normalization and for generating the null distribution of RRA. See the format specification. |
--normcounts-to-file | Write normalized read counts to file ({output-prefix}.normalized.txt). |
--keep-tmp | Keep intermediate files. |
--skip-gene SKIP_GENE | Skip genes in the report. By default, "NA" or "na" will be skipped. |
--additional-rra-parameters ADDITIONAL_RRA_PARAMETERS | Additional arguments to run RRA. They will be appended to the command line for calling RRA. |
Optional arguments for CNV correction:
Parameter | Explanation |
---|---|
--cnv-norm CNV_NORM | A matrix of copy number variation data across cell lines to normalize CNV-biased sgRNA scores prior to gene ranking. |
--cell-line CELL_LINE | The name of the cell line to be used for copy number variation normalization. |
This subcommand collects sgRNA read count information from fastq files. The output count tables can be used directly in the test subcommand.
usage:
usage: mageck count [-h] -l LIST_SEQ (--fastq FASTQ [FASTQ ...] | -k COUNT_TABLE) [--norm-method {none,median,total,control}] [--control-sgrna CONTROL_SGRNA] [--sample-label SAMPLE_LABEL] [-n OUTPUT_PREFIX] [--unmapped-to-file] [--keep-tmp] [--test-run] [--trim-5 TRIM_5] [--sgrna-len SGRNA_LEN] [--count-n] [--reverse-complement] [--pdf-report] [--day0-label DAY0_LABEL] [--gmt-file GMT_FILE]
required arguments:
Parameter | Explanation |
---|---|
-l LIST_SEQ, --list-seq LIST_SEQ | A file containing list of sgRNA names, the sequences and target genes, either in .txt or in .csv format. See input/#sgrna-library-file for more details. If this file is not provided, mageck will count all possible sgRNAs in the fastq. |
--fastq FASTQ | Sample fastq/fastq.gz files (or bam files after v0.5.5. See advanced tutorial), separated by space; use comma (,) to indicate technical replicates of the same sample. For example, "--fastq sample1_replicate1.fastq,sample1_replicate2.fastq sample2_replicate1.fastq,sample2_replicate2.fastq" indicates two samples with 2 technical replicates for each sample. |
-k COUNT_TABLE, --count-table COUNT_TABLE | The read count table file. Only 1 file is accepted. |
optional arguments for normalization:
Parameter | Explanation |
---|---|
--norm-method {none,median,total,control} | Method for normalization, including "none" (no normalization), "median" (median normalization, default), "total" (normalization by total read counts), "control" (normalization by control sgRNAs specified by the --control-sgrna option). |
--control-sgrna CONTROL_SGRNA | A list of control sgRNAs for normalization and for generating the null distribution of RRA. See the format specification. |
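To give an intuition for what normalization does, the sketch below scales every sample to the same total read count. This is one common reading of "normalization by total read counts" and is shown only as an illustration; MAGeCK's exact scaling may differ, and `total_normalize` is a hypothetical helper. The counts are the three demo rows of the ESC example.

```python
def total_normalize(counts):
    """Scale every sample so that all samples share the same total read count.

    counts: dict mapping a sample label to a list of raw read counts.
    Samples with a total of zero are not handled in this sketch.
    """
    totals = {label: sum(values) for label, values in counts.items()}
    target = sum(totals.values()) / len(totals)  # mean total across samples
    return {label: [value * target / totals[label] for value in values]
            for label, values in counts.items()}


raw = {"plasmid": [13, 94, 85], "ESC1": [32, 108, 128]}
for label, values in total_normalize(raw).items():
    print(label, [round(value, 2) for value in values])
```

After scaling, both samples have the same total, so a difference in sequencing depth no longer looks like a difference in sgRNA abundance.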
optional arguments for input and output:
Parameter | Explanation |
---|---|
--sample-label SAMPLE_LABEL | Sample labels, separated by comma (,). Must be equal to the number of samples provided (in --fastq option). Default "sample1,sample2,...". |
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
--unmapped-to-file | Save unmapped reads to file. |
--keep-tmp | Keep intermediate files. |
--test-run | Test running. If this option is on, MAGeCK will only process the first 1M records for each file. |
optional arguments for processing fastq files:
Parameter | Explanation |
---|---|
--trim-5 TRIM_5 | Length of trimming the 5' of the reads. Default 0 |
--sgrna-len SGRNA_LEN | Length of the sgRNA. Default 20. ATTENTION: after v 0.5.3, the program will automatically determine the sgRNA length from library file; so only use this if you turn on the --unmapped-to-file option. |
--count-n | Count sgRNAs with Ns. By default, sgRNAs containing Ns will be discarded. |
--reverse-complement | Reverse complement the sequences in library for read mapping. |
Optional arguments for quality controls:
Parameter | Explanation |
---|---|
--pdf-report | Generate pdf report of the fastq files. |
--day0-label DAY0_LABEL | Turn on the negative selection QC and specify the label for control sample (usually day 0 or plasmid). For every other sample label, the negative selection QC will compare it with day0 sample, and estimate the degree of negative selections in essential genes. |
--gmt-file GMT_FILE | The pathway file used for QC, in GMT format. By default it will use the GMT file provided by MAGeCK. |
MAGeCK can also invoke GSEA (default) or RRA to test if a pathway is enriched in one particular gene ranking.
usage:
usage: mageck pathway [-h] --gene-ranking GENE_RANKING --gmt-file GMT_FILE [-n OUTPUT_PREFIX] [--method {gsea,rra}] [--single-ranking] [--sort-criteria {neg,pos}] [--keep-tmp] [--ranking-column RANKING_COLUMN] [--ranking-column-2 RANKING_COLUMN_2] [--pathway-alpha PATHWAY_ALPHA] [--permutation PERMUTATION]
required arguments:
Parameter | Explanation |
---|---|
--gene-ranking GENE_RANKING | The gene ranking file generated by the gene test step. |
--gmt-file GMT_FILE | The pathway file in GMT format. See input/#pathway-file-gmt for more details of the GMT file format. |
optional arguments:
Parameter | Explanation |
---|---|
-h, --help | show this help message and exit |
--single-ranking | The provided file is a (single) gene ranking file, either positive or negative selection. Only one enrichment comparison will be performed. |
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
--method {gsea,rra} | Method for testing pathway enrichment, including gsea (Gene Set Enrichment Analysis) or rra. Default gsea. |
--sort-criteria {neg,pos} | Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection. |
--keep-tmp | Keep intermediate files. |
--ranking-column RANKING_COLUMN | Column number or label in gene summary file for gene ranking; can be either an integer of column number, or a string of column label. Default "2" (the 3rd column). |
--ranking-column-2 RANKING_COLUMN_2 | Column number or label in gene summary file for gene ranking; can be either an integer of column number, or a string of column label. This option is used to determine the column for positive selections and is disabled if --single-ranking is specified. Default "8" (the 9th column). |
--pathway-alpha PATHWAY_ALPHA | The default alpha value for RRA pathway enrichment. Default 0.25. |
--permutation PERMUTATION | The number of permutations for gsea. Default 1000. |
The mle subcommand performs maximum-likelihood analysis of gene essentialities, instead of the RRA analysis.
usage:
usage: mageck.beta mle [-h] -k COUNT_TABLE (-d DESIGN_MATRIX | --day0-label DAY0_LABEL) [-n OUTPUT_PREFIX] [-i INCLUDE_SAMPLES] [-b BETA_LABELS] [--control-sgrna CONTROL_SGRNA] [--cnv-norm CNV_NORM] [--cnv-est CNV_EST] [--debug] [--debug-gene DEBUG_GENE] [--norm-method {none,median,total,control}] [--genes-varmodeling GENES_VARMODELING] [--permutation-round PERMUTATION_ROUND] [--no-permutation-by-group] [--max-sgrnapergene-permutation MAX_SGRNAPERGENE_PERMUTATION] [--remove-outliers] [--threads THREADS] [--adjust-method {fdr,holm,pounds}] [--sgrna-efficiency SGRNA_EFFICIENCY] [--sgrna-eff-name-column SGRNA_EFF_NAME_COLUMN] [--sgrna-eff-score-column SGRNA_EFF_SCORE_COLUMN] [--update-efficiency] [--bayes] [-p] [-w PPI_WEIGHTING] [-e NEGATIVE_CONTROL]
required arguments:
Parameter | Explanation |
---|---|
-k COUNT_TABLE, --count-table COUNT_TABLE | Provide a tab-separated count table. Each line in the table should include sgRNA name (1st column), target gene (2nd column) and read counts in each sample. See input/#sgrna-read-count-file for a detailed description. |
-d DESIGN_MATRIX, --design-matrix DESIGN_MATRIX | Provide a design matrix, either a file name or a quoted string of the design matrix. For example, "1,1;1,0". The rows of the design matrix must match the order of the samples in the count table (if --include-samples is not specified), or the order of the samples given by the --include-samples option. |
--day0-label DAY0_LABEL | Specify the label for the control sample (usually day 0 or plasmid). The MLE module will treat every other sample label as a single condition and generate a corresponding design matrix. |
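As an illustration of the --day0-label behaviour described above, the following Python sketch builds a design matrix in which every non-control sample gets its own condition column. The helper name `design_from_day0`, the sample labels, and the all-ones baseline column are assumptions made for illustration; MAGeCK's internal implementation may differ.

```python
def design_from_day0(sample_labels, day0_label):
    """Return (column_names, rows) for a one-condition-per-sample design.

    Hypothetical helper: assumes a baseline column of 1s plus one 0/1
    column for every sample other than the day-0 (control) sample.
    """
    if day0_label not in sample_labels:
        raise ValueError(f"{day0_label!r} is not a sample label")
    conditions = [s for s in sample_labels if s != day0_label]
    columns = ["baseline"] + conditions
    rows = []
    for sample in sample_labels:
        # Every sample shares the baseline; only its own condition is set to 1.
        row = [1] + [1 if sample == cond else 0 for cond in conditions]
        rows.append((sample, row))
    return columns, rows


# Illustrative labels only; they are not taken from a real count table.
columns, rows = design_from_day0(["day0", "drugA", "drugB"], "day0")
print("\t".join(["Samples"] + columns))
for sample, row in rows:
    print("\t".join([sample] + [str(v) for v in row]))
```

The control sample ends up with only the baseline set, so each condition column measures a change relative to day 0.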
optional arguments for input and output:
Parameter | Explanation |
---|---|
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
-i INCLUDE_SAMPLES, --include-samples INCLUDE_SAMPLES | Specify the sample labels if the design matrix is not given by file in the --design-matrix option. Sample labels are separated by ",", and must match the labels in the count table. |
-b BETA_LABELS, --beta-labels BETA_LABELS | Specify the labels of the variables (i.e., beta), if the design matrix is not given by file in the --design-matrix option. Labels should be separated by ",", and the number of labels must equal the number of columns of the design matrix, including baseline labels. Default value: "beta_0,beta_1,beta_2,...". |
--control-sgrna CONTROL_SGRNA | A list of control sgRNAs. See the format specification. |
Optional arguments for CNV correction:
Parameter | Explanation |
---|---|
--cnv-norm CNV_NORM | A matrix of copy number variation data across cell lines to normalize CNV-biased sgRNA scores prior to gene ranking. |
optional arguments for MLE module:
Parameter | Explanation |
---|---|
--debug | Debug mode to output detailed information of the running. |
--debug-gene DEBUG_GENE | Debug mode to only run one gene with specified ID. |
--norm-method {none,median,total,control} | Method for normalization, including "none" (no normalization), "median" (median normalization, default), "total" (normalization by total read counts), "control" (normalization by control sgRNAs specified by the --control-sgrna option). |
--genes-varmodeling GENES_VARMODELING | The number of genes for mean-variance modeling. Default 1000. |
--permutation-round PERMUTATION_ROUND | The rounds for permutation (integer). The permutation time is (# genes) * x for x rounds of permutation. Suggested value: 10 (may take longer time). Default 2. |
--no-permutation-by-group | By default, gene permutation is performed separately, by their number of sgRNAs. Turning this option on will perform permutation on all genes together. This makes the program faster, but the p value estimation is accurate only if the number of sgRNAs per gene is approximately the same. |
--max-sgrnapergene-permutation MAX_SGRNAPERGENE_PERMUTATION | Only permute genes by group if the number of sgRNAs per gene is smaller than this number. This will save a lot of time if some regions are targeted by a large number of sgRNAs (usually hundreds). Must be an integer. Default 100. |
--remove-outliers | Try to remove outliers. Turning this option on will slow the algorithm. |
--threads THREADS | Using multiple threads to run the algorithm. Default using only 1 thread. |
--adjust-method {fdr,holm,pounds} | Method for sgRNA-level p-value adjustment, including false discovery rate (fdr), Holm's method (holm), or Pounds's method (pounds). |
optional arguments for the EM iteration:
Parameter | Explanation |
---|---|
--sgrna-efficiency SGRNA_EFFICIENCY | An optional file of sgRNA efficiency prediction. The efficiency prediction will be used as an initial guess of the probability an sgRNA is efficient. Must contain at least two columns, one containing sgRNA ID, the other containing sgRNA efficiency prediction. |
--sgrna-eff-name-column SGRNA_EFF_NAME_COLUMN | The sgRNA ID column in sgRNA efficiency prediction file (specified by the --sgrna-efficiency option). Default is 0 (the first column). |
--sgrna-eff-score-column SGRNA_EFF_SCORE_COLUMN | The sgRNA efficiency prediction column in sgRNA efficiency prediction file (specified by the --sgrna-efficiency option). Default is 1 (the second column). |
--update-efficiency | Iteratively update sgRNA efficiency during EM iteration. |
The plot command generates graphics for selected genes. For interactive visualizations, use our new MAGeCK-VISPR algorithm.
usage:
usage: mageck plot [-h] -k COUNT_TABLE -g GENE_SUMMARY [--genes GENES] [-s SAMPLES] [-n OUTPUT_PREFIX] [--norm-method {none,median,total}] [--keep-tmp]
required arguments:
Parameter | Explanation |
---|---|
-k COUNT_TABLE, --count-table COUNT_TABLE | Provide a tab-separated count table. |
-g GENE_SUMMARY, --gene-summary GENE_SUMMARY | The gene summary file generated by the test command. |
optional arguments:
Parameter | Explanation |
---|---|
-h, --help | show this help message and exit |
--genes GENES | A list of genes to be plotted, separated by comma. Default: none. |
-s SAMPLES, --samples SAMPLES | A list of samples to be plotted, separated by comma. Default: using all samples in the count table. |
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
--norm-method {none,median,total} | Method for normalization, default median. |
--keep-tmp | Keep intermediate files. |
This subcommand allows you to generate comparison results directly from fastq files, with limited parameter settings available. The parameters for the run sub-command are those of the test and count sub-commands. See both sub-commands for more details. It is strongly suggested that users run the count and test commands separately, in order to gain finer control of the results.
These programs are used by MAGeCK internally, but can also be executed by users for other purposes.
RRA - Robust Rank Aggregation v 0.5.6.
Usage:
Parameter | Explanation |
---|---|
-i input_data file | Input file name. Format: "item id" "group id" "list id" "value" ["probability"] ["chosen"] |
-o output_file | Output file name. Format: "group id" "number of items in the group" "lo-value" "false discovery rate" |
-p maximum_percentile | RRA only considers items with a percentile smaller than this parameter. Default=0.1 |
--control control_sgrna_list | A list of control sgRNA names. |
--permutation permutation_round | The number of rounds of permutation. Increase this value if the number of genes is small. Default 100. |
--no-permutation-by-group | By default, gene permutation is performed separately, by their number of sgRNAs. Turning this option on will perform permutation on all genes together. This makes the program faster, but the p value estimation is accurate only if the number of sgRNAs per gene is approximately the same. |
--skip-gene gene_name | Genes to skip from doing permutation. Specify it multiple times if you need to skip more than one gene. |
--min-percentage-goodsgrna min_percentage | Filter genes that have too low a percentage of 'good sgrnas', i.e., sgrnas that fall below the -p threshold. Must be a number between 0-1. Default 0 (do not filter genes). |
--min-number-goodsgrna min_number | Filter genes that have too few 'good sgrnas', i.e., sgrnas that fall below the -p threshold. Must be an integer. Default 0 (do not filter genes). |
mageckGSEA is a fast implementation of Gene Set Enrichment Analysis (GSEA) in C++. MAGeCK uses it for quality control and pathway enrichment tests. Compared with the official GSEA program, its main advantages are ease of use and extremely fast running speed.
In the gsea/demo folder, an example is provided to run GSEA. Use the following command to perform GSEA on the ranked gene list in demo1.txt, tested on pathways defined in kegg.ribosome.gmt (both files are provided in the demo folder). The scores in the 2nd column are used to rank genes (-c 1), and 10000 permutations are run to obtain the p value:
mageckGSEA -r demo1.txt -g kegg.ribosome.gmt -c 1 -p 10000
You can either provide genes with their scores, as is in demo1.txt (genes with smaller scores are ranked in the front).
SYNRG 0.715581582
SREK1 0.992306809
SLC25A46 0.057411873
COL4A5 0.36387645
CCDC22 -0.463887932
MVD 0.020897922
mageckGSEA will first rank genes based on the provided scores, as long as you indicate which column to use (-c 1).
Or you can just provide gene rankings, as is in demo2.txt.
C5orf64
TTC17
MRPS27
PIGY
GPAA1
KIF4A
EPS15
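The score-based ranking described above (genes with smaller scores ranked first, using a 0-based score column) can be sketched in Python as follows. This is only an illustration of the ordering rule, not the C++ implementation in mageckGSEA; the function name `rank_genes` is hypothetical.

```python
def rank_genes(lines, score_column=1):
    """Return gene names ordered by the score in `score_column` (0-based).

    With score_column=0 the input order is kept, i.e. the file is treated
    as an already ranked gene list (as in demo2.txt).
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    if score_column == 0:
        return [fields[0] for fields in rows]
    # Smaller scores are ranked in the front.
    rows.sort(key=lambda fields: float(fields[score_column]))
    return [fields[0] for fields in rows]


demo1 = [
    "SYNRG 0.715581582",
    "SREK1 0.992306809",
    "SLC25A46 0.057411873",
    "COL4A5 0.36387645",
    "CCDC22 -0.463887932",
    "MVD 0.020897922",
]
ranked = rank_genes(demo1, score_column=1)
print(ranked)
```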
The output is a tab-separated file to report the following statistics of GSEA:
Pathway Size ES p p_permutation FDR Ranking Hits LFC
KEGG_RIBOSOME 88 0.3262 0.00240772 0.0043 0.0043 0 32 0
Item | Explanation |
---|---|
Pathway | The name of the pathway |
Size | The size of the pathway, i.e., the number of genes |
ES | Enrichment Score (ES) in GSEA |
p | The p value of ES |
p_permutation | The permutation p value of ES (usually more accurate than p) |
FDR | False Discovery Rate of p_permutation |
Ranking | The ranking of this pathway |
Hits | The number of genes that are ranked before ES score. See "Leading Edge" analysis of GSEA |
LFC | Log fold change (not implemented) |
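For downstream filtering, the tab-separated output described above can be loaded with Python's standard csv module. The sketch below is a minimal example under stated assumptions: the helper name `read_gsea_table` and the 0.05 FDR cutoff are illustrative choices, not part of MAGeCK.

```python
import csv
import io


def read_gsea_table(text):
    """Parse mageckGSEA-style tab-separated output into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for rec in reader:
        rows.append({
            "Pathway": rec["Pathway"],
            "Size": int(rec["Size"]),
            "ES": float(rec["ES"]),
            "p": float(rec["p"]),
            "p_permutation": float(rec["p_permutation"]),
            "FDR": float(rec["FDR"]),
            "Hits": int(rec["Hits"]),
        })
    return rows


example = (
    "Pathway\tSize\tES\tp\tp_permutation\tFDR\tRanking\tHits\tLFC\n"
    "KEGG_RIBOSOME\t88\t0.3262\t0.00240772\t0.0043\t0.0043\t0\t32\t0\n"
)
table = read_gsea_table(example)
# The 0.05 cutoff is an arbitrary illustrative threshold.
significant = [row["Pathway"] for row in table if row["FDR"] < 0.05]
print(significant)
```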
USAGE:
mageckGSEA -r rank_file -g gmt_file [-e] [-s] [-c score_column] [-p perm_time] [-n pathway_name] [-o output_file] [--] [--version] [-h]
Parameter | Explanation |
---|---|
-e, --reverse_value | Reverse the order of the gene. |
-s, --sort_byp | Sort the pathways by p value. |
-c score_column, --score_column score_column | The column for gene scores. If you just want to use the ranking of the gene (located at the 1st column), use 0. Otherwise, specify which column should be used to rank the gene. The column number starts from 0. Default: 0. |
-p perm_time, --perm_time perm_time | Permutations, default 1000. |
-n pathway_name, --pathway_name pathway_name | Name of the pathway to be tested. If not found, will test all pathways. |
-o output_file, --output_file output_file | The name of the output file. Use - to print to standard output. |
-r rank_file, --rank_file rank_file | (required) Rank file. The first column of the rank file must be the gene name. |
-g gmt_file, --gmt_file gmt_file | (required) The pathway annotation in GMT format. |
--version | Displays version information and exits. |
-h, --help | Displays usage information and exits. |
We developed a stand-alone visualization tool, VISPR, to visualize CRISPR screening results. See the paper and the VISPR project for more details.
The MAGeCKFlute package provides a convenient approach to visualize MAGeCK and MAGeCK-VISPR results using the R programming language.
Visualization functions are also available within the MAGeCK software. Since version 0.5, MAGeCK enables a couple of visualization functions. With these features on, MAGeCK helps users better interpret datasets and results, and generates figures and tables that can be directly used in presentations or papers. The in-house visualization module provides a simple solution for users with limited knowledge in R.
The Visualization function has additional software dependencies, but they are easy to install in many operating systems. See installation for more details.
Since 0.5.9.3, MAGeCK generates an R markdown file (.Rmd) for the count and test commands. Users can copy this file (along with all other files generated by MAGeCK) to a computer with RStudio installed, and generate an HTML-based report page.
To generate the report page, simply open the corresponding .Rmd file in RStudio, and press the "Run" --> "Run all" button. An HTML file will be generated correspondingly.
Users can also modify the "Parameters" section in .Rmd file to adjust the parameters used in the report.
An example of the generated html file (from test command) can be downloaded here.
An example of the generated html file (from count command) can be downloaded here.
Users need to install rmarkdown package as the dependency:
install.packages("rmarkdown")
The --pdf-report option will be gradually deprecated after version 0.5.9.3, due to its complicated dependencies on pdflatex.
MAGeCK generates PDF files in both the count and test commands when the --pdf-report option is added. If successful, a <prefix>.countsummary.pdf (for the count command) or <prefix>_summary.pdf (for the test command) will be generated.
You can also try it in the two demos provided. In demo1, note that the command used in the run.sh is:
mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial -n demo
Use the following command to generate PDF file:
mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial -n demo --pdf-report
You can download the sample PDF file from demo1 here.
In demo2, the command used in the runmageck.sh is:
mageck run --fastq test1.fastq test2.fastq -l library.txt -n demo --sample-label L1,CTRL -t L1 -c CTRL
You can split the run command into count and test commands, with the --pdf-report option enabled. Alternatively, note that the Rnw and R files (used to produce the PDF files) remain after this demo runs successfully:
demo.count.median_normalized.csv
demo_countsummary.R
demo_countsummary.Rnw
demo.count.txt
demo.gene_summary.txt
demo.log
demo.R
demo.sgrna_summary.txt
demo_summary.Rnw
library.txt
runmageck.sh
test1.fastq
test2.fastq
Simply execute the two .R files and you can get the PDF files as well:
Rscript demo_countsummary.R
Rscript demo.R
You can download the count sample PDF file from demo2 here.
After running test, MAGeCK can generate a couple of figures describing the genes you are interested in using the plot command. In demo1, for example, if you are interested in the ACTR8 gene, use the following command to generate the PDF reports describing the sgRNA read count change of ACTR8, and its RRA score relative to the RRA score distribution of all genes:
mageck plot -k sample.txt -g demo.gene_summary.txt --genes ACTR8
The PDF file generated using this command is here.
The sgRNA read count file will be used in -k parameter in the test or run sub-command.
The read count file should list the names of the sgRNA, the gene it is targeting, followed by the read counts in each sample. Each item should be separated by the tab ('\t'). A header line is optional. For example in the studies of T. Wang et al. Science 2014, there are 4 CRISPR screening samples, and they are labeled as: HL60.initial, KBM7.initial, HL60.final, KBM7.final. Here are a few lines of the read count file:
sgRNA gene HL60.initial KBM7.initial HL60.final KBM7.final
A1CF_m52595977 A1CF 213 274 883 175
A1CF_m52596017 A1CF 294 412 1554 1891
A1CF_m52596056 A1CF 421 368 566 759
A1CF_m52603842 A1CF 274 243 314 855
A1CF_m52603847 A1CF 0 50 145 266
The count sub-command will output the read count file like this.
In the -t/--treatment-id, -c/--control-id parameters, you can use either sample label or sample index to specify samples. If sample label is used, the labels must match the sample labels in the first line of the count table. For example, "HL60.final,KBM7.final".
You can also use sample index to specify samples. The index of the sample is the order it appears in the sgRNA read count file, starting from 0. The index is used in the -t/--treatment-id, -c/--control-id parameters. In the example above, there are four samples, and the index of each sample is as follows:
sample | index |
---|---|
HL60.initial | 0 |
KBM7.initial | 1 |
HL60.final | 2 |
KBM7.final | 3 |
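The label-or-index convention above can be illustrated with a short Python sketch that maps a -t/-c style argument to 0-based sample indices using the header line of the count table. The helper name `resolve_samples` is hypothetical, and checking labels before indices is an assumption of this sketch rather than documented MAGeCK behaviour.

```python
def resolve_samples(spec, header_line):
    """Map a comma-separated list of labels or indices to sample indices.

    The first two header fields (sgRNA and gene) are not samples, so
    sample index 0 corresponds to the third column of the count table.
    """
    labels = header_line.rstrip("\n").split("\t")[2:]
    indices = []
    for token in spec.split(","):
        token = token.strip()
        if token in labels:
            indices.append(labels.index(token))
        elif token.isdigit() and int(token) < len(labels):
            indices.append(int(token))
        else:
            raise ValueError(f"unknown sample: {token!r}")
    return indices


header = "sgRNA\tgene\tHL60.initial\tKBM7.initial\tHL60.final\tKBM7.final"
by_label = resolve_samples("HL60.final,KBM7.final", header)
by_index = resolve_samples("0,1", header)
print(by_label, by_index)
```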
The design matrix is a txt file indicating the effects of different conditions on different samples. In this file, each row is a sample, each column is a condition, and the value is 1 or 0, indicating whether the sample (in the row) is affected by the condition (in the column).
Here is a simple example of the design matrix from the studies in T. Wang et al. Science 2014. The CRISPR screens are done on two cell lines, HL60 and KBM7, and four samples are generated, two corresponding to the initial states of two cell lines, and two corresponding to the final states of two cell lines. If you want to model the effects of two cell lines, you can have the design matrix as follows:
Samples baseline HL60 KBM7
HL60.initial 1 0 0
KBM7.initial 1 0 0
HL60.final 1 1 0
KBM7.final 1 0 1
Here are some important rules of the design matrix:
Note: different orders of the samples in the design matrix may change the results, because there are preprocessing steps to remove outliers. A good practice is to always place initial samples (like day0 or plasmid) as the first rows in the design matrix.
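To make the row/column semantics of the design matrix concrete, here is a minimal Python sketch that parses the example above and lists the samples affected by each condition. It assumes whitespace-separated fields and a header line; the function names are hypothetical and the only validation performed is the 0/1 rule stated above.

```python
def parse_design_matrix(text):
    """Return (samples, conditions, matrix) from design-matrix text."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, body = lines[0], lines[1:]
    conditions = header[1:]
    samples, matrix = [], []
    for fields in body:
        values = [int(v) for v in fields[1:]]
        if len(values) != len(conditions):
            raise ValueError(f"wrong number of columns for {fields[0]}")
        if any(v not in (0, 1) for v in values):
            raise ValueError(f"values must be 0 or 1 for {fields[0]}")
        samples.append(fields[0])
        matrix.append(values)
    return samples, conditions, matrix


def samples_affected_by(condition, samples, conditions, matrix):
    """List the samples (rows) whose entry for `condition` is 1."""
    j = conditions.index(condition)
    return [s for s, row in zip(samples, matrix) if row[j] == 1]


design_text = """
Samples baseline HL60 KBM7
HL60.initial 1 0 0
KBM7.initial 1 0 0
HL60.final 1 1 0
KBM7.final 1 0 1
"""
samples, conditions, matrix = parse_design_matrix(design_text)
print(samples_affected_by("HL60", samples, conditions, matrix))
```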
When starting from fastq files, MAGeCK needs to know the sgRNA sequence and its targeting gene. Such information is provided in the sgRNA library file, and can be specified by the -l/--list-seq option in run or count subcommand.
The sgRNA library file can be provided either in .txt format or in .csv format. There are three columns in the library file: the sgRNA ID, the sequence, and the gene it is targeting. One example of the library file is provided as library.txt in demo2:
s_10007 TGTTCACAGTATAGTTTGCC CCNA1
s_10008 TTCTCCCTAATTGCTTGCTG CCNA1
s_10027 ACATGTTGCTTCCCCTTGCA CCNC
If provided in .csv format, the file will look like:
s_10007,TGTTCACAGTATAGTTTGCC,CCNA1
s_10008,TTCTCCCTAATTGCTTGCTG,CCNA1
s_10027,ACATGTTGCTTCCCCTTGCA,CCNC
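The two library layouts carry the same three columns, so a reader only needs to switch the delimiter. The Python sketch below is illustrative: the helper name `read_library` is hypothetical, and it assumes that the .txt variant is tab-separated.

```python
import csv
import io


def read_library(text, fmt):
    """Return {sequence: (sgrna_id, gene)} from library text.

    fmt is "csv" for comma-separated input; anything else is treated as
    tab-separated text (an assumption of this sketch).
    """
    delimiter = "," if fmt == "csv" else "\t"
    library = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 3:
            continue  # skip blank or incomplete lines
        sgrna_id, sequence, gene = row[0], row[1].upper(), row[2]
        library[sequence] = (sgrna_id, gene)
    return library


csv_text = (
    "s_10007,TGTTCACAGTATAGTTTGCC,CCNA1\n"
    "s_10008,TTCTCCCTAATTGCTTGCTG,CCNA1\n"
    "s_10027,ACATGTTGCTTCCCCTTGCA,CCNC\n"
)
txt_text = csv_text.replace(",", "\t")
lib_csv = read_library(csv_text, "csv")
lib_txt = read_library(txt_text, "txt")
print(lib_csv["TGTTCACAGTATAGTTTGCC"])
```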
When using the --control-sgrna option, users need to provide a plain text file containing only negative control sgRNA IDs (one per line). For example,
NonTargetingControlGuideForHuman_0001
NonTargetingControlGuideForHuman_0002
NonTargetingControlGuideForHuman_0003
NonTargetingControlGuideForHuman_0004
Some systems may read only 1 control sgRNA ID. Please look at this Q&A for solutions.
The GMT file format stores the pathway information and is consistent with the GMT file in Gene Set Enrichment Analysis (GSEA). The details of the GMT format can be found at GSEA website.
You can also download different pathway files directly from GSEA MSigDB database. They can be used directly by MAGeCK.
The sgRNA/gene mapping file will be used in the --gene-test parameter in the test or run sub-command.
This file should list the names of the sgRNAs and their corresponding genes, separated by the tab ('\t'). For example:
A1CF_m52595977 A1CF
A1CF_m52596017 A1CF
A1CF_m52596056 A1CF
A1CF_m52603842 A1CF
A1CF_m52603847 A1CF
A1CF_p52595870 A1CF
A1CF_p52595881 A1CF
A1CF_p52596023 A1CF
The output of the MAGeCK consists of the following files:
The following files are the outputs of RRA. They are intermediate files and are deleted after MAGeCK finishes running. To see these files, use the --keep-tmp option in the MAGeCK test subcommand.
The following files are the inputs of RRA and will be deleted after MAGeCK is complete.
This file is generated by count command, and summarizes QC measurements of the fastq (or count table) files.
An example is as follows:
File Label Reads Mapped Percentage TotalsgRNAs Zerocounts GiniIndex NegSelQC NegSelQCPval NegSelQCPvalPermutation NegSelQCPvalPermutationFDR NegSelQCGene
S6_R1_001.fastq.gz LNCaP_Day21 15567122 13033442 0.8372 92817 2204 0.1472 0.68965 1.6688e-31 0 0 86
S5_R1_001.fastq.gz LNCaP_Day0 16659017 14497805 0.8703 92817 461 0.0996 0 1 1 1 0.0
The contents of each column are as follows. To help you evaluate the quality of the data, recommended values are shown in parentheses.
Column | Content |
---|---|
File | The fastq (or the count table) file used. |
Label | The label of that fastq file assigned. |
Reads | Total number of reads in the fastq file. (Recommended: 100~300 times the number of sgRNAs) |
Mapped | Total number of reads that can be mapped to library |
Percentage | Mapped percentage, calculated as Mapped/Reads (Recommended: at least 60%) |
TotalsgRNAs | Total number of sgRNAs in the library |
Zerocounts | Total number of missing sgRNAs (sgRNAs that have 0 counts) (Recommended: no more than 1%) |
GiniIndex | The Gini Index of the read count distribution. A smaller value indicates a more even count distribution. (Recommended: around 0.1 for plasmid or initial state samples, and around 0.2-0.3 for negative selection samples) |
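The recommended values in this table can be turned into a simple automated check. The following Python sketch applies three of them (sequencing depth, mapping rate, zero-count fraction) to the two example rows above; the function name `qc_flags` and the flag wording are hypothetical, and the thresholds are taken directly from the recommendations in the table.

```python
def qc_flags(row):
    """Return a list of QC warnings for one count-summary row."""
    flags = []
    # Recommended depth: at least 100 times the number of sgRNAs.
    if row["Reads"] < 100 * row["TotalsgRNAs"]:
        flags.append("low sequencing depth")
    # Recommended mapping rate: at least 60%.
    if row["Mapped"] / row["Reads"] < 0.60:
        flags.append("low mapping rate")
    # Recommended: no more than 1% of sgRNAs with zero counts.
    if row["Zerocounts"] / row["TotalsgRNAs"] > 0.01:
        flags.append("too many zero-count sgRNAs")
    return flags


day21 = {"Label": "LNCaP_Day21", "Reads": 15567122, "Mapped": 13033442,
         "TotalsgRNAs": 92817, "Zerocounts": 2204}
day0 = {"Label": "LNCaP_Day0", "Reads": 16659017, "Mapped": 14497805,
        "TotalsgRNAs": 92817, "Zerocounts": 461}
for row in (day21, day0):
    print(row["Label"], qc_flags(row))
```

A flag is only a prompt to look closer; for instance, more missing sgRNAs may be acceptable in a late time point of a negative selection screen than in a plasmid sample.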
The following columns are used to evaluate the degree of negative selection in known essential genes. They are set only if you provide the --day0-label option. MAGeCK will run pathway analysis for each sample, and use several GSEA metrics to evaluate the quality of the samples.
Column | Content |
---|---|
NegSelQC | The Enrichment Score (ES) of GSEA |
NegSelQCPval | The p value of the GSEA analysis (Recommended: smaller than 1e-10) |
NegSelQCPvalPermutation | The permutation p value |
NegSelQCPvalPermutationFDR | The FDR of the permutation p value |
NegSelQCGene | The number of essential genes found in the library that are evaluated for GSEA analysis. |
An example of the sgRNA ranking results is as follows:
sgrna Gene control_count treatment_count control_mean treat_mean LFC control_var adj_var score p.low p.high p.twosided FDR high_in_treatment
INO80B_m74682554 INO80B 0.0/0.0 1220.1598778/1476.14096301 0.810860655738 1348.15042041 10.70 0.0 19.0767988005 308.478081895 1.0 1.11022302463e-16 2.22044604925e-16 1.57651669497e-14 True
NHS_p17705966 NHS 1.62172131148/3.90887850467 2327.09368635/1849.95115143 2.76529990807 2088.52241889 9.54 2.6155440132 68.2450168229 252.480744404 1.0 1.11022302463e-16 2.22044604925e-16 1.57651669497e-14 True
The contents of each column are as follows.
Column | Content |
---|---|
sgrna | sgRNA ID |
Gene | The targeting gene |
control_count | Normalized read counts in control samples |
treatment_count | Normalized read counts in treatment samples |
control_mean | Median read counts in control samples |
treat_mean | Median read counts in treatment samples |
LFC | The log2 fold change of sgRNA |
control_var | The raw variance in control samples |
adj_var | The adjusted variance in control samples |
score | The score of this sgRNA |
p.low | p-value (lower tail) |
p.high | p-value (higher tail) |
p.twosided | p-value (two sided) |
FDR | false discovery rate |
high_in_treatment | Whether the abundance is higher in treatment samples |
Note that this file will have different meaning in mle subcommand: it records the estimated efficiency probability of the guides in the MLE model, after the termination of iteration.
Note that by default, this value is 1 since --sgrna-efficiency is turned off. The values will be between 0-1 if you turn this option on and/or if you explicitly set up the --sgrna-efficiency parameter.
An example of the gene summary file is as follows:
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna neg|lfc pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna pos|lfc
ESPL1 12 6.4327e-10 7.558e-06 7.9e-05 1 -2.35 11 0.99725 0.99981 0.999992 615 0 -0.07
RPL18 12 6.4671e-10 7.558e-06 7.9e-05 2 -2.12 11 0.99799 0.99989 0.999992 620 0 -0.32
CDK1 12 2.6439e-09 7.558e-06 7.9e-05 3 -1.93 12 1.0 0.99999 0.999992 655 0 -0.12
The contents of each column are as follows.
Column | Content |
---|---|
id | Gene ID |
num | The number of targeting sgRNAs for each gene |
neg|score | The RRA lo value of this gene in negative selection |
neg|p-value | The raw p-value (using permutation) of this gene in negative selection |
neg|fdr | The false discovery rate of this gene in negative selection |
neg|rank | The ranking of this gene in negative selection |
neg|goodsgrna | The number of "good" sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in negative selection. |
neg|lfc | The log2 fold change of this gene in negative selection. The way to calculate gene lfc is controlled by the --gene-lfc-method option |
pos|score | The RRA lo value of this gene in positive selection |
pos|p-value | The raw p-value (using permutation) of this gene in positive selection |
pos|fdr | The false discovery rate of this gene in positive selection |
pos|rank | The ranking of this gene in positive selection |
pos|goodsgrna | The number of "good" sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in positive selection. |
pos|lfc | The log fold change of this gene in positive selection |
Genes are ranked by the p.neg field (by default). If you need a ranking by the p.pos, you can use the --sort-criteria option.
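As a usage illustration, the gene summary file can be filtered for negatively selected genes with a few lines of Python. This sketch assumes a tab-separated file with the header shown above; the helper name `negatively_selected` and the default 0.05 FDR cutoff are illustrative assumptions, and the example text is trimmed to the first six columns of the rows above.

```python
import csv
import io


def negatively_selected(text, fdr_cutoff=0.05):
    """Return gene IDs with neg|fdr below the cutoff, ordered by neg|rank."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    hits = [
        (int(rec["neg|rank"]), rec["id"])
        for rec in reader
        if float(rec["neg|fdr"]) < fdr_cutoff
    ]
    return [gene for _, gene in sorted(hits)]


# First six columns of the example gene summary rows.
summary_text = (
    "id\tnum\tneg|score\tneg|p-value\tneg|fdr\tneg|rank\n"
    "ESPL1\t12\t6.4327e-10\t7.558e-06\t7.9e-05\t1\n"
    "RPL18\t12\t6.4671e-10\t7.558e-06\t7.9e-05\t2\n"
    "CDK1\t12\t2.6439e-09\t7.558e-06\t7.9e-05\t3\n"
)
top_genes = negatively_selected(summary_text)
print(top_genes)
```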
The gene_summary.txt produced by the mle subcommand is similar to the gene_summary.txt format above, except for a few new columns. Here is an example of the gene_summary.txt generated from the mle subcommand:
Gene sgRNA HL60|beta HL60|z HL60|p-value HL60|fdr HL60|wald-p-value HL60|wald-fdr KBM7|beta KBM7|z KBM7|p-value KBM7|fdr KBM7|wald-p-value KBM7|wald-fdr
RNF14 10 0.24927 0.72077 0.36256 0.75648 0.47105 0.9999 0.57276 1.6565 0.06468 0.32386 0.097625 0.73193
RNF10 10 0.10159 0.29373 0.92087 0.98235 0.76896 0.9999 0.11341 0.32794 0.90145 0.97365 0.74296 0.98421
RNF11 10 3.6354 10.513 0.0002811 0.021739 7.5197e-26 1.3376e-22 2.5928 7.4925 0.0014898 0.032024 6.7577e-14 1.33e-11
Column | Content |
---|---|
Gene | Gene ID |
sgRNA | The number of targeting sgRNAs for each gene |
HL60|beta;KBM7|beta | The beta scores of this gene in conditions "HL60" and "KBM7", respectively. The conditions are specified in the design matrix as an input of the mle subcommand. |
HL60|p-value | The raw p-value (using permutation) of this gene |
HL60|fdr | The false discovery rate of this gene |
HL60|z | The z-score associated with Wald test |
HL60|wald-p-value | The p value using Wald test |
HL60|wald-fdr | The false discovery rate of the Wald test |
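Because the mle column names follow a condition|metric pattern, they can be split into a nested structure for easier access. The Python sketch below is illustrative only (the function name `split_condition_columns` is hypothetical) and uses the RNF11 row from the example above.

```python
def split_condition_columns(header_fields, value_fields):
    """Return (gene, {condition: {metric: value}}) for one gene row.

    The first two columns (Gene, sgRNA) are skipped; every remaining
    column name is split at the first "|" into condition and metric.
    """
    gene = value_fields[0]
    per_condition = {}
    for name, value in zip(header_fields[2:], value_fields[2:]):
        condition, metric = name.split("|", 1)
        per_condition.setdefault(condition, {})[metric] = float(value)
    return gene, per_condition


header = ("Gene sgRNA HL60|beta HL60|z HL60|p-value HL60|fdr "
          "HL60|wald-p-value HL60|wald-fdr KBM7|beta KBM7|z KBM7|p-value "
          "KBM7|fdr KBM7|wald-p-value KBM7|wald-fdr").split()
row = ("RNF11 10 3.6354 10.513 0.0002811 0.021739 7.5197e-26 1.3376e-22 "
       "2.5928 7.4925 0.0014898 0.032024 6.7577e-14 1.33e-11").split()
gene, scores = split_condition_columns(header, row)
print(gene, scores["HL60"]["beta"], scores["KBM7"]["beta"])
```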
The output of the pathway summary is similar to the gene summary. Here is an example:
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna
KEGG_RIBOSOME 87 8.3272e-23 2.6473e-05 0.001238 1 50 0.051213 0.20927 0.841006 38 4
KEGG_SPLICEOSOME 125 3.7084e-08 2.6473e-05 0.001238 2 41 0.52219 0.80968 0.99902 149 13
KEGG_PROTEASOME 44 1.9586e-06 2.6473e-05 0.001238 3 18 0.52149 0.80905 0.99902 148 5
This table shows that the pathway KEGG_RIBOSOME has 87 genes, an RRA lo value of 8.3272e-23, a permutation p value of 2.6473e-05 (negative selection), an FDR of 0.001238, and a ranking of 1, with 50 genes below the alpha cutoff. The genes in this pathway (i.e., ribosomal genes) are therefore strongly negatively selected, which is expected in negative selection CRISPR experiments.
This file includes the logging information during the execution. For count command, it will list some basic statistics of the dataset at the end, including the number of reads, the number of reads mapped to the library, the number of zero-count sgRNAs, etc.
If the "--pdf-report" option is on for count or test command, MAGeCK may generate Rnw and R files that are used to create PDF files. MAGeCK calls the Sweave function in R to generate PDF files.
These files will be automatically deleted after the completion of each command. To keep these files, use the "--keep-tmp" option during the execution.
An example of the gene ranking file (.gene.high.txt or .gene.low.txt) is as follows:
group_id #_items_in_group lo_value FDR
RPL3 93 4.9169e-36 0.000080
RPL8 67 1.8232e-24 0.000080
RPS2 61 1.6928e-20 0.000080
RPS18 40 1.0152e-18 0.000080
The contents of each column are as follows.
Column | Content |
---|---|
group_id | Gene ID |
#_items_in_group | The number of targeting sgRNAs for each gene |
lo_value | The raw p-value |
FDR | The false discovery rate |
An example of the sgRNA ranking file (.plow.txt or .phigh.txt) is as follows. These files are the input of RRA.
sgrna symbol pool p.low prob chosen
Drug_0009853 TOP2A list -31.3383375285032 1 1
Drug_0010808 RPS11 list -29.865960506388134 1 1
The contents of each column are as follows.
Column | Content |
---|---|
sgrna | sgRNA ID |
symbol | Gene ID |
pool | Deprecated column. Set all the values in this column as a single value (e.g., "list") |
p.low | The score used to sort sgRNA (increasing order) |
prob | Reserved column. Set to 1 |
chosen | Reserved column. Set to 1 |
Download frequently used libraries
For your convenience, we provide a set of library files that are ready to be used in MAGeCK (in the -l/--list-seq option of the count command) in the libraries folder. You can also create your own library files, see sgrna-library-file for more details.
File | Explanation |
---|---|
broadgpp-brunello-library-corrected.txt.zip | Human Brunello genome-wide library developed by Broad Institute |
Human_GeCKOv2_Library_A_3_mageck.csv.zip | Human GeCKO v2 half-library A (can be used in either 1- or 2-plasmid systems) |
Human_GeCKOv2_Library_B_1_mageck.csv.zip | Human GeCKO v2 half-library B |
Human_GeCKOv2_Library_combine.csv.zip | Human GeCKO v2 combined library of A and B |
mouse_geckov2_library_a_2_mageck.csv.zip | Mouse GeCKO v2 half-library A (can be used in either 1- or 2-plasmid systems) |
mouse_geckov2_library_b_1_mageck.csv.zip | Mouse GeCKO v2 half-library B |
mouse_geckov2_library_combine.csv.zip | Mouse GeCKO v2 combined library of A and B |
GeCKOv1.txt.zip | GeCKO v1 library file (from the GeCKO Science paper) |
human_sam_library.csv.zip | Human Synergistic Activation Mediator (SAM) pooled library (CRISPRa library), generated by Feng Zhang laboratory. |
yusa_library.csv.zip | Mouse knockout library generated by Kosuke Yusa laboratory. |
tim_library.txt.zip | Human CRISPR knockout library of 7,000 genes (from T. Wang Science 2014). |
tim_science2015_library.txt.zip | Human CRISPR pooled library of 18,166 genes (from T.Wang Science 2015). |
You can always ask questions on our Google group. Your questions are usually shared by others, so please help us improve our algorithm by joining our Google group and asking questions there!
A: Probably you are installing MAGeCK to your own directory, which is not recognized by Python. The solution is to set up the PYTHONPATH environment: see install/#setting-up-the-environment-variables for more details.
A: If you add the "--user" option during installation, the mageck executable is usually located in your local directory ($HOME/bin or $HOME/.local/bin). Without this option, mageck is installed in the system bin (/usr/bin or /usr/sbin).
There are two ways you can check the path of MAGeCK. You can either type
which mageck
to determine the path of the mageck executable. Or, at the end of the installation, you will see a few lines of the log like this:
copying build/scripts-2.7/mageck -> /Users/john/.local/bin
changing mode of /Users/john/.local/bin/mageck to 755
running install_data
copying bin/RRA -> /Users/john/.local/bin
That means your mageck is installed at /Users/john/.local/bin. On the other hand, if you see a message like this:
copying build/scripts-2.7/mageck -> /Users/john/Library/Python/2.7/bin
changing mode of /Users/john/Library/Python/2.7/bin/mageck to 755
running install_data
copying bin/RRA -> /Users/john/Library/Python/2.7/bin
That means your mageck is installed at /Users/john/Library/Python/2.7/bin.
Depending on your system, the path may look like one of the following:
A: You can use a similar approach to identify the MAGeCK python module, but look for a pattern like python2.7/site-packages. During installation, if you see a message like this:
copying bin/mageckGSEA -> /home/john/.pyenv/versions/2.7.13/bin
running install_egg_info
Removing /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages/mageck-0.5.6-py2.7.egg-info
Writing /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages/mageck-0.5.6-py2.7.egg-info
That means your MAGeCK python module is installed in /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages.
A: This usually happens when you have both conda version of MAGeCK and your previously installed version of MAGeCK. Even if your "mageck" command comes from conda, the libraries may still come from your previously installed MAGeCK. To solve this problem, you can manually install MAGeCK to the latest version.
A: There are two different solutions to do this.
Solution 1: Uninstall the conda MAGeCK version using the following command:
conda uninstall mageck
You can always re-install MAGeCK later.
To avoid frequently uninstalling and reinstalling the software, consider using conda environments. For example, you can install the MAGeCK conda version in a dedicated environment, so that it is available only when that environment is activated.
Here is an example. First, create a python 3 environment named "mageckenv":
conda create -n mageckenv anaconda python=3
Then activate the environment using the following command:
source activate mageckenv
Now, install mageck under that environment:
conda install -c bioconda mageck
You can use the MAGeCK conda version under the mageckenv environment now. To disable it, simply deactivate the environment:
source deactivate
Solution 2: The conda MAGeCK runs under Python 3, while the MAGeCK from sourceforge and bitbucket runs under Python 2. So the best way to run an installed version other than the conda version is to create a Python 2 conda environment and run mageck under that environment.
To create a Python 2 environment when you have miniconda3 (where MAGeCK-VISPR is hosted), type the following command:
conda create -n py2k anaconda python=2
After that, you can activate the environment by typing
source activate py2k
If you run mageck now, it will invoke the installed version. You can also deactivate your environment by typing:
source deactivate
You may also need to manually edit the PATH variable such that the system will run your local mageck first. To do this, first locate the directory of mageck from your own installation (see the question "where is MAGeCK binary installed?"). If it's in /Users/john/.local/bin, then edit the PATH variable as follows:
export PATH=/Users/john/.local/bin:$PATH
Then you should be able to run your own installed mageck, not the conda mageck. For more information, go to Setting up the environment variables.
A: Usually you can pool the read counts for technical replicates of the same sample. To do this, use a comma (,) to separate the fastq files of the technical replicates from the same sample in the --fastq option. For example, "--fastq sample1_replicate1.fastq,sample1_replicate2.fastq sample2_replicate1.fastq,sample2_replicate2.fastq" indicates two samples with 2 technical replicates for each sample.
For biological replicates, treat them as separate samples and use them together when doing the comparison, so that MAGeCK can analyze the variance of these samples. For example, in the test command, "-t sample1_bio_replicate1,sample1_bio_replicate2 -c sample2_bio_replicate1,sample2_bio_replicate2" compares 2 samples (with 2 biological replicates in each sample).
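The comma and space conventions are easy to mix up when the file lists are assembled by hand. As a small sketch (the helper names build_fastq_args and build_comparison_args are hypothetical, and only the --fastq, -t and -c options described above are used), the argument strings can be generated from a plain data structure:

```python
def build_fastq_args(samples):
    """Build the values for --fastq from {sample: [technical replicate files]}.

    Files of the same sample are joined with commas (their counts are pooled);
    different samples become separate, space-separated arguments.
    """
    return [",".join(files) for files in samples.values()]


def build_comparison_args(treatment, control):
    """Build '-t ... -c ...'; biological replicates stay separate samples."""
    return ["-t", ",".join(treatment), "-c", ",".join(control)]


samples = {
    "sample1": ["sample1_replicate1.fastq", "sample1_replicate2.fastq"],
    "sample2": ["sample2_replicate1.fastq", "sample2_replicate2.fastq"],
}
fastq_args = build_fastq_args(samples)
print("--fastq", " ".join(fastq_args))
print(" ".join(build_comparison_args(
    ["sample1_bio_replicate1", "sample1_bio_replicate2"],
    ["sample2_bio_replicate1", "sample2_bio_replicate2"],
)))
```

The two printed lines reproduce the option strings quoted in the answer above.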
A: Since version 0.5.6, MAGeCK can determine the trimming length automatically, even if the length differs between reads within the same fastq file. Alternatively, you can use cutadapt to trim adaptor sequences of variable length before running MAGeCK.
A: Since version 0.5, MAGeCK produces a "countsummary.txt" file containing all the statistics of the fastq files. If you use the "--pdf-report" option, the statistics of the fastq files are also included in the PDF file.
The statistics can also be found in the log file (for the run and count commands). Here is an example of the log file generated by the count command (the last few lines):
INFO @ Mon, 02 Feb 2015 08:12:15: Summary of file sample1_R1.fastq:
INFO @ Mon, 02 Feb 2015 08:12:15: reads 45631055
INFO @ Mon, 02 Feb 2015 08:12:15: mappedreads 34300176
INFO @ Mon, 02 Feb 2015 08:12:15: zerosgrnas 119315
INFO @ Mon, 02 Feb 2015 08:12:15: label sample_1
INFO @ Mon, 02 Feb 2015 08:12:15: Summary of file sample2_R1.fastq:
INFO @ Mon, 02 Feb 2015 08:12:15: reads 36344414
INFO @ Mon, 02 Feb 2015 08:12:15: mappedreads 27042629
INFO @ Mon, 02 Feb 2015 08:12:15: zerosgrnas 119002
INFO @ Mon, 02 Feb 2015 08:12:15: label sample_2
It provides the total number of reads, the number of mapped reads, the number of sgRNAs with 0 read counts, and the sample label of the fastq file.
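If you want these numbers in a script rather than reading them by eye, the summary lines can be parsed with a few lines of Python. This is only a sketch that assumes the log lines look exactly like the example above (an "INFO @ <timestamp>:" prefix followed by either "Summary of file <name>:" or a "key value" pair); the function name parse_count_log is hypothetical, and real logs may contain other messages that need extra filtering.

```python
import re

# Captures the message that follows the "INFO @ <date> hh:mm:ss:" prefix.
MESSAGE = re.compile(r"^INFO\s+@\s+.*?\d{2}:\d{2}:\d{2}:\s*(.*)$")
SUMMARY_PREFIX = "Summary of file "


def parse_count_log(lines):
    """Return {fastq_file: {"reads": ..., "mappedreads": ..., ...}}."""
    summaries = {}
    current = None
    for line in lines:
        match = MESSAGE.match(line.strip())
        if not match:
            continue
        message = match.group(1).strip()
        if message.startswith(SUMMARY_PREFIX):
            current = message[len(SUMMARY_PREFIX):].rstrip(":")
            summaries[current] = {}
        elif current is not None:
            parts = message.split()
            if len(parts) == 2:
                key, value = parts
                # Counts become integers; the sample label stays a string.
                summaries[current][key] = int(value) if value.isdigit() else value
    return summaries


example_log = [
    "INFO @ Mon, 02 Feb 2015 08:12:15: Summary of file sample1_R1.fastq:",
    "INFO @ Mon, 02 Feb 2015 08:12:15: reads 45631055",
    "INFO @ Mon, 02 Feb 2015 08:12:15: mappedreads 34300176",
    "INFO @ Mon, 02 Feb 2015 08:12:15: zerosgrnas 119315",
    "INFO @ Mon, 02 Feb 2015 08:12:15: label sample_1",
]
parsed = parse_count_log(example_log)
print(parsed)
```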
A: We published a paper (MAGeCK-VISPR) to describe some quality control (QC) terms to help you determine the quality of your samples.
For simple QC terms, you can just take a look at the sample statistics. Generally, in a good negative selection sample, (1) the mapped reads should be over 60 percent of the total number of reads, and (2) the number of zero-count sgRNAs should be small (<5%, and preferably <1%). One exception is positive selection experiments, where the number of zero-count sgRNAs may be much higher, but the percentage of mapped reads should still be reasonably high.
You can also inspect the results by taking a look at the comparison results, see the related question below.
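The two simple thresholds for negative selection samples (mapped reads over 60 percent, zero-count sgRNAs under 5%) can be checked with a short script. The sketch below is illustrative only: the function name check_sample_qc is hypothetical, the read numbers are taken from sample1 in the example log, and the sgRNA numbers are made up because the total library size is not part of that log excerpt.

```python
def check_sample_qc(reads, mapped_reads, zero_sgrnas, total_sgrnas,
                    min_mapped=0.60, max_zero=0.05):
    """Apply the simple QC rules for a negative selection sample."""
    mapped_fraction = mapped_reads / reads
    zero_fraction = zero_sgrnas / total_sgrnas
    return {
        "mapped_fraction": mapped_fraction,
        "zero_fraction": zero_fraction,
        "mapped_ok": mapped_fraction > min_mapped,
        "zero_ok": zero_fraction < max_zero,
    }


# reads / mapped_reads: sample1 from the example log.
# zero_sgrnas / total_sgrnas: hypothetical values for illustration.
qc = check_sample_qc(reads=45631055, mapped_reads=34300176,
                     zero_sgrnas=500, total_sgrnas=100000)
print(qc)
```

For positive selection experiments, the zero-count check is expected to fail more often, so only the mapped-read fraction is informative there.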
A: One possible reason is that you saved your library file or control sgRNA file in txt or csv format using Microsoft software (like Excel). The line break representation sometimes differs between Windows and Linux/Mac systems, which causes problems when the program reads these files.
One solution is to open your txt file in Microsoft Excel, copy all the contents (Ctrl+A, Ctrl+C), paste them into a plain text editor like Vim (Ctrl+V), and save the file in plain txt format.
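The line breaks can also be converted programmatically instead of by copy and paste. This is a minimal sketch with hypothetical helper names (normalize_line_endings, convert_file); it rewrites Windows (CRLF) and old Mac (CR) line breaks as Unix (LF) and changes nothing else in the file.

```python
def normalize_line_endings(text):
    """Convert CRLF and lone CR line breaks to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def convert_file(src_path, dst_path):
    """Write a copy of src_path to dst_path with Unix line breaks."""
    # newline="" turns off Python's own newline translation,
    # so the original line breaks are read and written unchanged.
    with open(src_path, "r", newline="") as src:
        text = src.read()
    with open(dst_path, "w", newline="") as dst:
        dst.write(normalize_line_endings(text))


print(repr(normalize_line_endings("s1,ACGT,GENE1\r\ns2,TTGA,GENE2\r\n")))
```

Write the result to a new file rather than overwriting the original, so the input is still available if something goes wrong.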
A: The reason is that numpy and scipy use MKL and OpenBLAS. Both libraries use multiple CPUs to accelerate numeric calculation (e.g., matrix operations). To limit the number of CPUs to 1 per thread, set the OMP_NUM_THREADS environment variable on Linux systems. In other words, before running the mageck mle command, type the following command in the terminal:
export OMP_NUM_THREADS=1
This solution comes from the discussion here.
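If mageck is launched from a Python script rather than from the terminal, the same variable can be set for the child process only. The sketch below uses hypothetical helper names (single_thread_env, run_single_threaded) and deliberately leaves the mageck arguments to the caller.

```python
import os
import subprocess


def single_thread_env(base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set to 1."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_single_threaded(command):
    """Run a command (a list of arguments) with OMP_NUM_THREADS=1."""
    # check=True raises CalledProcessError if the command fails.
    return subprocess.run(command, env=single_thread_env(), check=True)


print(single_thread_env({"PATH": "/usr/bin"}))
```

Call run_single_threaded with your usual command as a list, starting with "mageck" and "mle". The environment of the calling script itself is left untouched.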
A: Since version 0.5.9, MAGeCK RRA introduces paired comparison between treatments and controls (--paired option). This option allows MAGeCK to make full use of paired samples to boost the statistical power. It is especially useful if the data between two (or more) replicates is poorly correlated, and you want to find top hits that are consistent between paired samples.
Paired samples are usually biological replicates that have treatment and control conditions independently. For example, you have two replicates (r1, r2), and for each replicate you perform screens on treatment and control conditions separately. In the end you have four samples (treatment_r1, treatment_r2, control_r1, control_r2).
You can now run MAGeCK RRA to compare treatment and control conditions, but add an additional --paired parameter to tell MAGeCK that (treatment_r1, control_r1) and (treatment_r2, control_r2) are paired:
mageck test -k count.txt -t treatment_r1,treatment_r2 -c control_r1,control_r2 --paired
In --paired mode, the number of samples in -t and -c must match, and the samples must be listed in exactly the same order.
MAGeCK handles paired samples by treating the sgRNAs in paired samples as independent sgRNAs; this is equivalent to doubling the number of sgRNAs per gene (if you have two paired samples). The independence assumption does not always hold, especially if the correlation between replicates is high, in which case it may introduce false positives. Therefore, use the --paired option only if the correlation between paired samples is low and you want to find consistent signals between paired replicates.
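Because --paired depends on the order of the -t and -c lists, it can help to build the command from explicit (treatment, control) pairs so that the two lists cannot drift apart. This is a sketch with a hypothetical helper name (build_paired_test_command); it only assembles the command shown above and does not run it.

```python
def build_paired_test_command(count_table, pairs):
    """Build a 'mageck test --paired' command from (treatment, control) pairs."""
    if not pairs:
        raise ValueError("at least one (treatment, control) pair is required")
    # Both lists are derived from the same pairs, so they always have
    # the same length and the same order.
    treatments = [treatment for treatment, _ in pairs]
    controls = [control for _, control in pairs]
    return ["mageck", "test", "-k", count_table,
            "-t", ",".join(treatments),
            "-c", ",".join(controls),
            "--paired"]


command = build_paired_test_command(
    "count.txt",
    [("treatment_r1", "control_r1"), ("treatment_r2", "control_r2")],
)
print(" ".join(command))
```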
A: First of all, make sure your sample statistics look good (see the related question in "Counting sgRNAs from fastq files"). Next, take a look at the rankings of some well-known genes. In negative selection experiments, you can expect some ribosomal genes and well-known oncogenes to rank near the top; for example, MYC, RAS, etc. In positive selection experiments, TP53 usually has a high ranking.
Besides visually inspecting top-ranked genes, a good validation is to run the pathway command to test MSigDB KEGG pathways (see the MSigDB website). In negative selection experiments (usually some condition compared with the day 0 condition), you can expect to see a set of essential pathways ranking at the top, like ribosome, spliceosome, proteasome and cell cycle genes. If you see these pathways coming out, this is a good sign that your experiments are working. The smaller their RRA lo_value and p values, the better.
A: There are a couple of reasons why the top-ranked genes may have a high FDR. First, many CRISPR/Cas9 libraries contain few sgRNAs (<7) for each gene. Since some of them may have low cutting efficiency or off-target effects, there may not be enough statistical power to detect essential genes. Second, if there are too many comparisons (or genes), the multiple comparison adjustment may lead to a high FDR estimate. Also, because MAGeCK employs a fairly stringent statistical framework to evaluate statistical significance, its FDR estimate may be conservative.
There are a few things you can do to increase the sensitivity. First, try to filter out genes that you think are not hits before running MAGeCK; for example, remove genes that have extremely low expression or genes that have very few targeting sgRNAs (<4). Second, if you have a list of negative control genes (genes that you think are not essential, like AAVS1), you can specify the corresponding sgRNA IDs using the --control-sgrna option (see below), which allows MAGeCK to better estimate the null distribution. Third, if your replicates are paired samples, consider using the --paired option (see the question on paired samples above).
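The first suggestion, dropping genes with very few targeting sgRNAs, can be sketched as follows. The example assumes the library is available as a list of (sgRNA ID, gene) pairs; the function name drop_sparse_genes and the IDs are hypothetical, and reading or writing your actual library file is left out.

```python
from collections import Counter


def drop_sparse_genes(library, min_sgrnas=4):
    """Keep only sgRNAs whose gene has at least min_sgrnas sgRNAs.

    library is a list of (sgrna_id, gene) pairs.
    """
    per_gene = Counter(gene for _, gene in library)
    return [(sgrna_id, gene) for sgrna_id, gene in library
            if per_gene[gene] >= min_sgrnas]


# Hypothetical library: GENE_A has 4 sgRNAs, GENE_B has only 2.
library = [("A_%d" % i, "GENE_A") for i in range(1, 5)]
library += [("B_%d" % i, "GENE_B") for i in range(1, 3)]
filtered = drop_sparse_genes(library)
print(filtered)
```

Apply the same filter to the library or count table before running MAGeCK, so that the removed genes do not take part in the multiple comparison adjustment.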
A: This option tells MAGeCK to use provided negative control sgRNAs to generate the null distribution when calculating the p values. If this option is not specified, MAGeCK generates the null distribution of RRA scores by assuming all of the genes in the library are non-essential. This approach is sometimes over-conservative, and you can improve this if you know some genes are not essential. By providing the corresponding sgRNA IDs in the --control-sgrna option, MAGeCK will have a better estimation of p values.
In addition, you can use the list of negative control sgRNAs for normalization. If the --norm-method control option is specified, the median factor used for normalization is calculated from the control sgRNAs only, rather than from all the sgRNAs (the default).
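To illustrate what restricting the median factor to control sgRNAs means, here is a rough sketch of a median-of-ratios size factor computed either from all sgRNAs or from a chosen subset. It is an assumption-laden illustration, not MAGeCK's code: the function name median_factors is hypothetical, the counts are invented, and the exact formula MAGeCK uses may differ in its details.

```python
import math
from statistics import median


def median_factors(counts, sgrna_ids=None):
    """Per-sample median-of-ratios factors.

    counts: {sgrna_id: [count in sample 0, count in sample 1, ...]}
    sgrna_ids: optional subset (e.g. control sgRNAs) to compute factors from.
    """
    ids = list(counts) if sgrna_ids is None else [i for i in sgrna_ids if i in counts]
    ratios_per_sample = None
    for sgrna_id in ids:
        row = counts[sgrna_id]
        if min(row) <= 0:
            continue  # the geometric mean is undefined with zero counts
        geo_mean = math.exp(sum(math.log(x) for x in row) / len(row))
        if ratios_per_sample is None:
            ratios_per_sample = [[] for _ in row]
        for sample_index, count in enumerate(row):
            ratios_per_sample[sample_index].append(count / geo_mean)
    if ratios_per_sample is None:
        raise ValueError("no sgRNA with non-zero counts in every sample")
    return [median(ratios) for ratios in ratios_per_sample]


# Invented counts: sample 1 was sequenced about twice as deeply as sample 0.
counts = {
    "ctrl_1": [10, 20],
    "ctrl_2": [100, 200],
    "geneA_1": [50, 500],
}
control_factors = median_factors(counts, ["ctrl_1", "ctrl_2"])
print(control_factors)
```

Dividing each sample's counts by its factor puts the samples on a common scale; with the control subset, only sgRNAs that are expected not to change determine that scale.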
New since 0.5.9.3: We include a new demo (demo5) in the MAGeCK source code to demonstrate the usage of control-sgrnas. Besides, we have an additional --control-gene option to specify the control genes instead of control sgRNAs.
To use this option, you need to prepare a text file specifying the IDs of control sgRNAs, one line for one sgRNA ID. Here is an example of the file:
NonTargetingControlGuideForHuman_0001
NonTargetingControlGuideForHuman_0002
NonTargetingControlGuideForHuman_0003
NonTargetingControlGuideForHuman_0004
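A file in this format can be produced from a list of IDs with a short script. The sketch below uses hypothetical helper names (format_control_sgrnas, write_control_sgrnas); it writes one ID per line, skips blanks and duplicates, and forces Unix line breaks.

```python
def format_control_sgrnas(sgrna_ids):
    """Return file content with one unique, non-empty sgRNA ID per line."""
    unique_ids = []
    for sgrna_id in sgrna_ids:
        sgrna_id = sgrna_id.strip()
        if sgrna_id and sgrna_id not in unique_ids:
            unique_ids.append(sgrna_id)
    return "".join(sgrna_id + "\n" for sgrna_id in unique_ids)


def write_control_sgrnas(path, sgrna_ids):
    """Write the control sgRNA ID file to path."""
    # newline="\n" keeps Unix line breaks even when run on Windows.
    with open(path, "w", newline="\n") as handle:
        handle.write(format_control_sgrnas(sgrna_ids))


control_ids = ["NonTargetingControlGuideForHuman_%04d" % i for i in range(1, 5)]
print(format_control_sgrnas(control_ids), end="")
```

The resulting file is what you pass to the --control-sgrna option.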
There are several issues that you need to keep in mind:
A: MAGeCK generates .R and .Rnw files even if the "--pdf-report" option is not specified. You can copy these files to another computer where both R and pdflatex are properly installed, and use the following command to generate the PDF files:
Rscript *.R
Note that for the count command, the median-normalized read count file (.median_normalized.csv) should also be copied to the same directory. For the test command, the gene summary file (.gene_summary.txt) should also be copied to the same directory.
A: You may get some error messages like this:
Error in texi2dvi("recount_countsummary.tex", pdf = TRUE) : Running 'texi2dvi' on 'recount_countsummary.tex' failed.
This may be due to a LaTeX compatibility issue on your system. You can still get some of the figures generated by MAGeCK by adding the "--keep-tmp" option to keep intermediate files.
For the latest releases and version history, see our bitbucket repo.
2019.07.01 Version 0.5.9
2019.01.04 Version 0.5.8
2018.01.05 Version 0.5.7
2017.05.17 Version 0.5.6
2016.12.02 Version 0.5.5
2016.06.29 Version 0.5.4
2016.01.15 Version 0.5.3
2015.08.09 Version 0.5.2
2015.06.23 Version 0.5.1
2015.04.26 Version 0.5
2015.03.19 Version 0.4.4
2015.02.12 Version 0.4.3
2015.02.04 Version 0.4.2
2014.12.01 Version 0.4.1
2014.11.13 Version 0.4
2014.07.01 Version 0.3
2014.04.17 Version 0.2
2014.04.04 Version 0.1
Last edit: yun 2022-07-11
Hello,
I'm getting the following error message when running mageck test in --paired mode
"An error occurs while trying to compute p values. Quit.."
Up until then, the log file appears normal.
I can't figure out why this error is occurring. I'm trying to run analysis on x3 control and x3 treated samples. This error occurs if I try to run all the samples together (1+2+3), or if I try to run 1+3, leaving out sample 2. However, the program works fine if I run it on each sample individually, or if I run samples 1+2 or 2+3. The program also ran successfully on all samples when not using paired mode, so I'm confident I'm using the program correctly and that my input files are as they should be.
Could you please advise what would cause the program to error at the pvalue stage? Thank you :)