Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout (MAGeCK) is a computational tool for identifying important genes from genome-scale CRISPR-Cas9 knockout (GeCKO) screens.
MAGeCK is developed and maintained by Wei Li and Han Xu from Dr. Xiaole Shirley Liu's lab at the Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health.
MAGeCK is free, open-source software under the BSD license.
There are several ways to install MAGeCK.
To install MAGeCK through the bioconda channel, first download and install the Python 3 variant of the Miniconda Python distribution. Then, in the command line, type
conda install -c bioconda -c conda-forge mageck
That's it!
Optionally (but recommended), you can create an isolated software environment for mageck by executing
conda create -c bioconda -c conda-forge -n mageckenv mageck
in a terminal. The environment can be activated via
source activate mageckenv
To update mageck, run
conda update mageck
from within the environment.
This environment can be deactivated via
source deactivate
You can install MAGeCK-VISPR (which includes MAGeCK) using conda, a commonly used package management software. The instructions for installing MAGeCK-VISPR can be found in the MAGeCK-VISPR manual.
You can also run MAGeCK via a Docker image, which is automatically built upon each commit to our bitbucket source code.
To run through Docker image, install Docker on your own system, and follow the instructions in tutorials on running Docker images.
You can also download the software and install it by yourself. See the detailed instructions below.
The latest version of MAGeCK (0.5.9) can be downloaded here:
Or click the link here (in case the button points to a wrong file).
For earlier versions (< 0.5.4), the zip file is encrypted, but you can get the password easily by one of the following options:
You need to go to the Terminal to unzip and install the software. See the instructions for Installation below.
MAGeCK can be run on either a Mac or a Linux system. Since MAGeCK is written in Python and C, Python (version 3) and a C compiler are needed.
Other dependencies include numpy and scipy.
Due to the end of the Python 2 life cycle, mageck 0.5.9 and higher versions are not designed to run on Python 2.
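The requirements listed above (Python 3, numpy and scipy) can be checked before building from source. The following is a minimal sketch and not part of MAGeCK; the function name `check_requirements` is hypothetical.

```python
import importlib.util
import sys


def check_requirements(modules=("numpy", "scipy")):
    """Report whether Python 3 is running and whether each module can be found."""
    report = {"python3": sys.version_info.major >= 3}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_requirements().items():
        print(f"{item}: {'found' if ok else 'MISSING'}")
```

Any item reported as MISSING should be installed before running the setup script.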
Since version 0.5.9.3, MAGeCK's visualization module generates an R Markdown file (.Rmd) for the count and test subcommands. This allows users to easily create an HTML report webpage using RStudio. No additional dependencies are needed for running MAGeCK itself. However, generating the report webpage requires a computer with RStudio and the rmarkdown package installed.
The --pdf-report option, which was the main route for visualization before 0.5.9.3, requires two optional programs: R and pdflatex. MAGeCK relies on both to generate PDF reports when the --pdf-report option is used. If it is not possible to install them, you can still generate PDF reports by copying some MAGeCK output files to another computer on which R and pdflatex are properly installed. See Q and A for more information.
If you use the --pdf-report option, the R package xtable is required, and gplots and ggplot2 are optional. Use install.packages("xtable") and install.packages(c("gplots","ggplot2")) in R to install them.
You won't get any error messages if you don't have gplots, but with gplots installed you will get a more polished clustering figure in the PDF report of the count command.
You can run MAGeCK without --pdf-report option, and copy some files to another machine with these R packages to generate pdf report. See Q and A for more details.
You can still get some figures generated from MAGeCK, by adding the "--keep-tmp" option to keep intermediate files.
Since version 0.3, MAGeCK uses standard Python installation procedures (distutils) for compiling and installation of the software.
The installation procedure is extremely easy. First, download the source code, unzip it by using the following command (or just double-clicking it), and go into the directory in the command line:
tar xvzf mageck-0.5.4.tar.gz
cd mageck-0.5.4
After that, invoke python setup.py:
python setup.py install
And it is done! If you want MAGeCK to be installed in your own directory, use the following command instead:
python setup.py install --user
This is the easiest way to install mageck. An alternative approach is (you may have one additional step to set up the environment variables; see below)
python setup.py install --prefix=$HOME
where $HOME is the root directory you want to install into (usually the user home).
The manual installation is deprecated since version 0.3. Please refer to the installation instructions above.
After downloading the source code, follow the instructions below for manual installation.
In most systems you don't need to set up the environment variables. Just type "mageck" in the command line to see if the mageck program works.
If you get a "command not found" error, that indicates the environment variables are not properly set up. There are several additional steps to finish the installation. First you need to add the path of the mageck program to your PATH variable.
There are several different situations.
Set up the PATH variable by typing:
export PATH=$PATH:$HOME/bin
You first need to determine where MAGeCK is installed. See this Q and A for additional steps to determine the correct bin directory.
If your bin directory is located in /Users/john/.local/bin, then type the following:
export PATH=$PATH:/Users/john/.local/bin
You may also need to add the path of the MAGeCK module to the PYTHONPATH variable. Again, follow the steps above to determine the correct Python installation path (see the Q&A); the pythonX.Y part of the path depends on the Python version you installed MAGeCK with. This variable should be set as, for example,
export PYTHONPATH=/Users/john/.local/lib/python2.7/site-packages:$PYTHONPATH
To save the path configuration (so you don't have to type it every time), place the above command in your ~/.bashrc (for Linux) or ~/.bash_profile (for Mac).
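To confirm that the PATH setup took effect, a short check like the one below can help. This is only a sketch using the Python standard library; the helper names `locate_command` and `path_contains` are hypothetical, and the directory in the usage note is the example path from the text above.

```python
import os
import shutil


def locate_command(name="mageck"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def path_contains(directory, path_value=None):
    """Check whether a directory is listed in a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normpath(os.path.expanduser(directory))
    entries = [os.path.normpath(os.path.expanduser(p))
               for p in path_value.split(os.pathsep) if p]
    return wanted in entries


if __name__ == "__main__":
    print("mageck found at:", locate_command())
    print("bin directory on PATH:", path_contains("/Users/john/.local/bin"))
```

If `locate_command()` prints None, the "command not found" error described above is expected, and the bin directory still needs to be added to PATH.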
The experimental version of MAGeCK is available at bitbucket. Note that the source code on BitBucket is experimental and not fully tested, so it may not be stable or function well. It is strongly recommended to use the MAGeCK software downloaded from sourceforge or from bioconda.
Running MAGeCK is extremely easy and convenient. The demo folder contains two mini examples to go through all steps in MAGeCK. Simply execute the sh script in the command line in each example to run the demos. To see how you can enable visualization functions of MAGeCK in both demos, see the visualization manual.
Some advanced tutorial topics can be found in the Advanced Tutorial page.
Also check out the following videos on YouTube to learn how to install and run MAGeCK:
Tutorial 2: Comparison between samples
Check demo/demo1 folder in the source code for the first tutorial.
There is only one command line in the tutorial:
mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial -n demo
The parameters are explained as follows.
| Parameters | Meaning |
|---|---|
| mageck | The main portal of the MAGeCK program |
| test | A sub-command to ask MAGeCK to perform sgRNA and gene ranking based on provided read count tables |
| -k sample.txt | The provided read count table file. The format of the file is specified here. |
| -t HL60.final,KBM7.final | The treatment samples are defined as HL60.final,KBM7.final (or the 2nd and 3rd sample, starting from 0) in sample.txt. See input files for a detailed explanation. |
| -c HL60.initial,KBM7.initial | The control samples are defined as HL60.initial,KBM7.initial (or the 0th and 1st sample, starting from 0) in sample.txt. See input files for a detailed explanation. |
| -n demo | The prefix of the output files is demo, so you can expect output files named demo.sgrna_summary.txt, demo.gene_summary.txt, etc. |
An explanation of the output files can be found in the [output] page. For all available parameters, see the [usage] page.
You can also specify the treatment and control samples using sample index. For example,
mageck test -k sample.txt -t 2,3 -c 0,1 -n demo
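The correspondence between sample labels and zero-based sample indices can be sketched as follows. The header below is an assumption reconstructed from the parameter table above (initial samples at index 0 and 1, final samples at index 2 and 3), and `labels_to_indices` is a hypothetical helper, not a MAGeCK function.

```python
def labels_to_indices(header_fields, labels):
    """Map sample labels to zero-based sample indices.

    header_fields is the split first line of the count table. The first two
    columns (sgRNA name and gene) are not samples, so they are skipped.
    """
    samples = header_fields[2:]
    return [samples.index(label) for label in labels]


# Sample order implied by the tutorial text (an assumption about the demo file).
header = ["sgRNA", "Gene", "HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"]

treatment = labels_to_indices(header, ["HL60.final", "KBM7.final"])
control = labels_to_indices(header, ["HL60.initial", "KBM7.initial"])

print("-t", ",".join(map(str, treatment)), "-c", ",".join(map(str, control)))
```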
Check demo/demo2 folder in the source code for this tutorial.
This demo shows a mini example of how to go through the whole pipeline from raw fastq files. In this example, we have fastq files from two conditions, and we would like to find out which genes and sgRNAs differ significantly between conditions. The command lines used in the runmageck.sh script are:
mageck count -l library.txt -n demo --sample-label L1,CTRL --fastq test1.fastq test2.fastq
mageck test -k demo.count.txt -t L1 -c CTRL -n demo
The "test" command is the same as the first demo. The parameters of the "count" command are explained as follows.
| Parameters | Meaning |
|---|---|
| mageck | The main portal of the MAGeCK program |
| count | A sub-command to ask MAGeCK to generate sgRNA read count table. |
| -l library.txt | The provided sgRNA information, including the sgRNA id, the sequence, and the gene it is targeting. See input files for a detailed explanation. |
| -n demo | The prefix of the output files. |
| --sample-label L1,CTRL | The labels of the two samples are L1 (test1.fastq) and CTRL (test2.fastq). |
| --fastq test1.fastq test2.fastq | The provided fastq files, separated by space. (Technical replicates of the same sample can also be indicated using comma as a separator; for example, "sample1_replicate1.fastq,sample1_replicate2.fastq") |
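When there are many samples and technical replicates, it can be convenient to assemble the count command programmatically. The sketch below only builds the argument list from the options described in the table above; `build_count_command` is a hypothetical helper and nothing is executed.

```python
import shlex


def build_count_command(library, prefix, samples):
    """Assemble a `mageck count` argument list.

    samples maps each sample label to a list of fastq files. Several files
    under one label are treated as technical replicates and joined by commas.
    """
    labels = ",".join(samples)
    fastq_args = [",".join(files) for files in samples.values()]
    return ["mageck", "count", "-l", library, "-n", prefix,
            "--sample-label", labels, "--fastq", *fastq_args]


cmd = build_count_command(
    "library.txt", "demo",
    {"L1": ["test1.fastq"], "CTRL": ["test2.fastq"]},
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a machine where mageck is installed.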
After the first two demos, you have a basic sense of how MAGeCK works. In this demo, let us go through a real dataset which is more complicated, and see how to handle some practical problems, like the trimming of the 5' end.
The dataset we use comes from the following paper: Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library. In this paper, the authors performed CRISPR/Cas9 screening on mouse ESC cells and identified genes that are essential in these cells.
The fastq files of the screens are publicly available in the ENA archive. There are different replicates for one condition, but for simplicity, let us only download the following two fastq files and use them to test MAGeCK functions.
| Accession | Sample | Download Link |
|---|---|---|
| ERR376998 | one replicate of plasmid | ERR376998 |
| ERR376999 | one replicate of ESC | ERR376999 |
You can download these files, double click to unzip them (or use gunzip in the terminal), and place them into one separate folder:
gunzip ERR376998.fastq.gz
gunzip ERR376999.fastq.gz
The next step is to prepare the library file so MAGeCK will know which sgRNA targets which gene. If you are using one of the standard GeCKO libraries, you can just download the files from MAGeCK sourceforge. For non-standard libraries, you need to prepare the library file according to the library file format.
In this demo, you can generate the library file using Supplementary Data 2 (or Supplementary Table S7) from the paper, or download it directly from our collection of libraries (the file name is "yusa_library.csv.zip"). Double click to unzip it (or use "unzip" in the terminal).
**Note: since version 0.5.6, MAGeCK is able to automatically determine the trimming length and sgRNA length in most cases. Therefore, you don't need to go through this step unless MAGeCK fails to do so by itself.**
In many cases, your sequencing primer is not exactly in front of the first base of the guide RNA. This is indeed the case in this demo, where the first few bases of every read in the fastq file are identical. Make sure you know exactly how many bases to trim before running MAGeCK. You can ask the people who performed the experiment, or get this information by taking a look at the first few lines of the fastq files.
Here are the first few lines of ERR376998.fastq (only sequences are shown):
CTTGTGGAAAGGACGAAACACCGGTGAAGGTGCCGTTGTGTAGTTTTAGA
CTTGTGGAAAGGACGAAACACCGAGCAGCACAACAATATGGGTTTTAGAG
CTTGTGGAAAGGACGAAACACCGCTCTTGGGTTTGGATGTTTGTTTTAGA
CTTGTGGAAAGGACGAAACACCGTTTGGCGAGGGGAGCGCCGGTTTTAGA
......
You can see that the first 23 nucleotides are identical, so in this case you need to tell MAGeCK to trim the first 23 nucleotides to collect read counts (--trim-5 23). If the nucleotide length in front of sgRNA varies between different reads, use cutadapt to remove the adaptor sequences.
The sgRNA length can be determined from the experimental design. It is usually 20 nucleotides, but in this demo, the sgRNA length is 19.
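The manual inspection described above can be sketched in a few lines: the trimming length is estimated as the length of the prefix shared by all reads. This is an illustration on the four reads shown above, not the method MAGeCK uses internally. With only a handful of reads the shared prefix can be overestimated if the guides happen to start with the same base, so a real check should use many more reads.

```python
import os

# The first four read sequences of ERR376998.fastq, as listed above.
reads = [
    "CTTGTGGAAAGGACGAAACACCGGTGAAGGTGCCGTTGTGTAGTTTTAGA",
    "CTTGTGGAAAGGACGAAACACCGAGCAGCACAACAATATGGGTTTTAGAG",
    "CTTGTGGAAAGGACGAAACACCGCTCTTGGGTTTGGATGTTTGTTTTAGA",
    "CTTGTGGAAAGGACGAAACACCGTTTGGCGAGGGGAGCGCCGGTTTTAGA",
]


def shared_prefix_length(sequences):
    """Length of the longest prefix common to every sequence."""
    # os.path.commonprefix compares plain strings character by character
    return len(os.path.commonprefix(list(sequences)))


trim_5 = shared_prefix_length(reads)
sgrna_len = 19  # sgRNA length stated for this demo

# Candidate guide sequences: the bases right after the trimmed prefix.
guides = [read[trim_5:trim_5 + sgrna_len] for read in reads]

print(trim_5)     # 23
print(guides[0])  # GTGAAGGTGCCGTTGTGTA
```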
Now we have everything ready to generate count tables with MAGeCK. Place the two fastq files and the library file into the same directory, and in that directory, run MAGeCK in a terminal:
mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --fastq ERR376998.fastq ERR376999.fastq
This command also tells MAGeCK to assign labels to each library ("plasmid" for ERR376998.fastq, and "ESC1" for ERR376999.fastq), and output the file with prefix "escneg". Note that MAGeCK will automatically determine the length of the sgRNAs from the library, so you don't have to specify it here.
If it is running successfully, you will see one file "escneg.count.txt" collecting all read counts. The top lines are as follows:
sgRNA Gene plasmid ESC1
chr19:5884430-5884453 SLC25A45 13 32
chr11:58831475-58831498 OLFR312 94 108
chr4:49282352-49282375 E130309F12RIK 85 128
If you use the --pdf-report option (see Visualization), it will generate a nice PDF report of the sample statistics of the fastq files. Click Here to see the PDF results.
If you want to manually use the --trim-5 option determined in step 3, the command becomes:
mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --trim-5 23 --fastq ERR376998.fastq ERR376999.fastq
With the read count table, now you can compare ESC1 vs. plasmid condition to see which genes are negatively or positively selected:
mageck test -k escneg.count.txt -t ESC1 -c plasmid -n esccp
This command tells MAGeCK to compare ESC1 with plasmid in the read count table escneg.count.txt, and output results with prefix "esccp".
If successful, you should see a file "esccp.gene_summary.txt". The top lines are as follows:
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna
GTF2B 5 2.0462e-10 2.5851e-07 0.000707 1 5 1.0 1.0 1.0 19150 0
RPS5 5 5.9353e-10 2.5851e-07 0.000707 2 5 1.0 1.0 1.0 19149 0
RPL19 4 2.695e-09 2.5851e-07 0.000707 3 4 1.0 1.0 1.0 19148 0
KIF18B 5 1.0136e-08 2.5851e-07 0.000707 4 5 1.0 1.0 1.0 19146 0
....
You can immediately see that two ribosomal genes, RPS5 and RPL19, are among the top negatively selected genes. If you rank the genes by "pos|rank" (11th column), you will see TRP53 (mouse homolog of TP53) near the top of the positively selected genes (rank 2):
sort -k 11,11n esccp.gene_summary.txt | less
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna
ZFP945 5 1.0 1.0 0.999999 19150 0 9.6166e-07 5.4287e-06 0.05198 1 5
TRP53 5 0.95411 0.95409 0.999999 17901 0 1.0347e-06 5.4287e-06 0.05198 2 4
PDAP1 5 0.85937 0.86223 0.999999 15753 1 7.6412e-06 2.8178e-05 0.174505 3 2
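The same ranking can be done in Python, which avoids sorting the header line together with the data. The sketch below parses a few of the rows shown above; it splits on whitespace for illustration, and `parse_table` is a hypothetical helper. For a real gene_summary file you would read the file from disk instead of using an inline string.

```python
def parse_table(text):
    """Parse a whitespace-separated table with a header line into dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


# A few rows copied from the esccp.gene_summary.txt excerpts above.
summary = """\
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna
GTF2B 5 2.0462e-10 2.5851e-07 0.000707 1 5 1.0 1.0 1.0 19150 0
PDAP1 5 0.85937 0.86223 0.999999 15753 1 7.6412e-06 2.8178e-05 0.174505 3 2
TRP53 5 0.95411 0.95409 0.999999 17901 0 1.0347e-06 5.4287e-06 0.05198 2 4
ZFP945 5 1.0 1.0 0.999999 19150 0 9.6166e-07 5.4287e-06 0.05198 1 5
"""

genes = parse_table(summary)
by_pos_rank = sorted(genes, key=lambda g: int(g["pos|rank"]))

for gene in by_pos_rank[:3]:
    print(gene["id"], gene["pos|rank"], gene["pos|fdr"])
```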
As with the count command, if you use the --pdf-report option, a nice PDF file will be generated. Here is an example of the PDF file generated in this demo.
By now you should be quite familiar with the basic functions of MAGeCK. MAGeCK also provides additional functions for you to further explore the data, for example, testing the enrichment of pathways, plotting the top-ranked genes or genes you are interested in, etc. If you have further questions, feel free to ask in our google group. Enjoy your MAGeCK trip!
Since version 0.5, MAGeCK provides a new subcommand, mle, to calculate gene essentiality from CRISPR screens. Compared with the original algorithm in "test" subcommand, MAGeCK-mle uses a measurement called beta score to call gene essentialities: a positive beta score means a gene is positively selected, and a negative beta score means a gene is negatively selected. It is similar to the term log fold change in differential expression, and compared with the original RRA algorithm, this measurement has the following advantages:
This demo will help you go through all the steps in running the mle module.
**The demo/demo3 folder provides an example for running MAGeCK MLE, plus an optional copy number correction module (see advanced tutorials section).**
For simplicity, let's assume you already know how to generate read count table from fastq files; if not, check the third demo above. We will use the read count table presented in T Wang et al. Science 2014.
Download the read count table here.
The design matrix file indicates which sample is affected by which condition. It is generally a binary matrix indicating which sample (indicated by the first column) is affected by which condition (indicated by the first row). For the meanings of the design matrix, check the input file format page.
To create a design matrix file, copy the following content to a text editing software, and save it as a plain txt file:
Samples baseline HL60 KBM7
HL60.initial 1 0 0
KBM7.initial 1 0 0
HL60.final 1 1 0
KBM7.final 1 0 1
Remember the following rules of a design matrix file:
In the design matrix above, we have four samples, two corresponding to the initial states of two cell lines, and two corresponding to the final states of two cell lines. We design two conditions (HL60 and KBM7) that model the cell type-specific effects.
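Before running the mle module, it can be useful to sanity-check the design matrix. The sketch below parses the matrix shown above and verifies that every entry is 0 or 1, as the description of a binary matrix suggests. The function `read_design_matrix` is hypothetical and only illustrates the file layout; it is not how MAGeCK reads the file.

```python
DESIGN = """\
Samples baseline HL60 KBM7
HL60.initial 1 0 0
KBM7.initial 1 0 0
HL60.final 1 1 0
KBM7.final 1 0 1
"""


def read_design_matrix(text):
    """Return (conditions, {sample: {condition: 0 or 1}}) from design matrix text."""
    lines = [line.split() for line in text.strip().splitlines()]
    conditions = lines[0][1:]  # first row: condition names
    matrix = {}
    for fields in lines[1:]:
        sample, values = fields[0], [int(v) for v in fields[1:]]
        if len(values) != len(conditions):
            raise ValueError(f"{sample}: expected {len(conditions)} values")
        if any(v not in (0, 1) for v in values):
            raise ValueError(f"{sample}: entries must be 0 or 1")
        matrix[sample] = dict(zip(conditions, values))
    return conditions, matrix


conditions, matrix = read_design_matrix(DESIGN)
for sample, row in matrix.items():
    affected = [c for c in conditions if row[c] == 1]
    print(sample, "->", ", ".join(affected))
```

Each printed line lists the conditions that affect a sample; for example, HL60.final is affected by baseline and HL60.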
Now we have the minimum requirements to run the MAGeCK mle module. Assuming you save the design matrix file as "designmat.txt", type the following command to run
mageck mle -k leukemia.new.csv -d designmat.txt -n beta_leukemia
If successful, MAGeCK mle will generate three files, the log file, the gene_summary file (including gene beta scores), and the sgrna_summary file (including sgRNA efficiency probability predictions). Here are a few lines of the gene_summary file:
Gene sgRNA HL60|beta HL60|z HL60|p-value HL60|fdr HL60|wald-p-value HL60|wald-fdr KBM7|beta KBM7|z KBM7|p-value KBM7|fdr KBM7|wald-p-value KBM7|wald-fdr
RNF14 10 0.24927 0.72077 0.36256 0.75648 0.47105 0.9999 0.57276 1.6565 0.06468 0.32386 0.097625 0.73193
RNF10 10 0.10159 0.29373 0.92087 0.98235 0.76896 0.9999 0.11341 0.32794 0.90145 0.97365 0.74296 0.98421
RNF11 10 3.6354 10.513 0.0002811 0.021739 7.5197e-26 1.3376e-22 2.5928 7.4925 0.0014898 0.032024 6.7577e-14 1.33e-11
This file includes the beta scores in two conditions specified in the design matrix (HL60|beta and KBM7|beta), and the associated statistics. For more information, check the output format specification of gene_summary file.
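A short sketch of how the beta scores might be read and interpreted is given below, using the three example rows shown above. The sign rule follows the description of beta scores earlier in this tutorial; the FDR cutoff of 0.05 is a hypothetical choice for illustration only, and `selection_direction` is not a MAGeCK function.

```python
# Header and example rows copied from the gene_summary excerpt above.
SUMMARY = """\
Gene sgRNA HL60|beta HL60|z HL60|p-value HL60|fdr HL60|wald-p-value HL60|wald-fdr KBM7|beta KBM7|z KBM7|p-value KBM7|fdr KBM7|wald-p-value KBM7|wald-fdr
RNF14 10 0.24927 0.72077 0.36256 0.75648 0.47105 0.9999 0.57276 1.6565 0.06468 0.32386 0.097625 0.73193
RNF10 10 0.10159 0.29373 0.92087 0.98235 0.76896 0.9999 0.11341 0.32794 0.90145 0.97365 0.74296 0.98421
RNF11 10 3.6354 10.513 0.0002811 0.021739 7.5197e-26 1.3376e-22 2.5928 7.4925 0.0014898 0.032024 6.7577e-14 1.33e-11
"""


def selection_direction(beta):
    """Positive beta: positively selected; negative beta: negatively selected."""
    if beta > 0:
        return "positive"
    if beta < 0:
        return "negative"
    return "neutral"


lines = [line.split() for line in SUMMARY.strip().splitlines()]
header, rows = lines[0], lines[1:]
genes = {row[0]: dict(zip(header, row)) for row in rows}

FDR_CUTOFF = 0.05  # hypothetical threshold, chosen only for this example
for name, record in genes.items():
    for condition in ("HL60", "KBM7"):
        beta = float(record[f"{condition}|beta"])
        fdr = float(record[f"{condition}|fdr"])
        flag = "significant" if fdr < FDR_CUTOFF else "not significant"
        print(name, condition, beta, selection_direction(beta), flag)
```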
The Advanced tutorial page provides more complicated examples for experienced users.
The main portal of MAGeCK is the mageck program, which includes a couple of different subprograms:
There is also another subprogram plot that plots some figures of the genes you are interested in from the test results.
This subcommand tests and ranks sgRNAs and genes based on the read count tables provided.
usage:
usage: mageck test [-h] -k COUNT_TABLE
(-t TREATMENT_ID | --day0-label DAY0_LABEL)
[-c CONTROL_ID]
[--paired] [--norm-method {none,median,total,control}]
[--gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD]
[--adjust-method {fdr,holm,pounds}]
[--variance-estimation-samples VARIANCE_ESTIMATION_SAMPLES]
[--sort-criteria {neg,pos}]
[--remove-zero {none,control,treatment,both,any}]
[--remove-zero-threshold REMOVE_ZERO_THRESHOLD]
[--pdf-report]
[--gene-lfc-method {median,alphamedian,mean,alphamean,secondbest}]
[-n OUTPUT_PREFIX] [--control-sgrna CONTROL_SGRNA]
[--normcounts-to-file] [--skip-gene SKIP_GENE]
[--keep-tmp]
[--additional-rra-parameters ADDITIONAL_RRA_PARAMETERS]
[--cnv-norm CNV_NORM] [--cell-line CELL_LINE]
required arguments:
| Parameter | Explanation |
|---|---|
| -k COUNT_TABLE, --count-table COUNT_TABLE | Provide a tab-separated count table instead of sam files. Each line in the table should include sgRNA name (1st column), targeting gene (2nd column) and read counts in each sample. See input/#sgrna-read-count-file for a detailed description. |
| -t TREATMENT_ID, --treatment-id TREATMENT_ID | Sample label or sample index (0 as the first sample) in the count table as treatment experiments, separated by comma (,). If sample label is provided, the labels must match the labels in the first line of the count table; for example, "HL60.final,KBM7.final". For sample index, "0,2" means the 1st and 3rd samples are treatment experiments. See input/#sample-index for a detailed description. |
| --day0-label DAY0_LABEL | Specify the label for control sample (usually day 0 or plasmid). For every other sample label, the module will treat it as a treatment condition and compare with control sample. |
optional general arguments:
| Parameter | Explanation |
|---|---|
| -h, --help | show this help message and exit |
| -c CONTROL_ID, --control-id CONTROL_ID | Sample label or sample index in the count table as control experiments, separated by comma (,). Default is all the samples not specified in treatment experiments. See input/#sample-index for a detailed description. |
| --paired | Paired sample comparisons. In this mode, the number of samples in -t and -c must match and have an exact order in terms of samples. For example, "-t HL60.final,KBM7.final -c HL60.initial,KBM7.initial". |
| --norm-method {none,median,total,control} | Method for normalization, default median. If control is specified, the size factor will be estimated using control sgRNAs specified in --control-sgrna option. |
| --gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD | FDR threshold for gene test, default 0.25. |
| --adjust-method {fdr,holm,pounds} | Method for sgRNA-level p-value adjustment, including false discovery rate (fdr), Holm's method (holm), or Pounds's method (pounds). |
| --variance-estimation-samples VARIANCE_ESTIMATION_SAMPLES | Sample label or sample index for estimating variances, separated by comma (,). See -t/--treatment-id option for specifying samples. |
| --sort-criteria {neg,pos} | Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection. |
| --remove-zero {none,control,treatment,both,any} | Whether to remove zero-count sgRNAs in control and/or treatment experiments. Default: none (do not remove those zero-count sgRNAs). |
| --pdf-report | Generate pdf report of the analysis. |
| --gene-lfc-method {median,alphamedian,mean,alphamean,secondbest} | Method to calculate gene log fold changes (LFC) from sgRNA LFCs. Available methods include the median/mean of all sgRNAs (median/mean), or the median/mean of the sgRNAs that are ranked in front of the alpha cutoff in RRA (alphamedian/alphamean), or the sgRNA that has the second strongest LFC (secondbest). In the alphamedian/alphamean case, the number of sgRNAs corresponds to the "goodsgrna" column in the output, and the gene LFC will be set to 0 if no sgRNA is in front of the alpha cutoff. Default median. (new since v0.5.5) |
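To make the --gene-lfc-method choices more concrete, the sketch below summarizes a list of sgRNA log fold changes into one gene-level value for the median, mean and secondbest methods. It is an illustration only and not MAGeCK's code: "second strongest" is assumed here to mean the second largest absolute LFC, and the alphamedian/alphamean variants are left out because they depend on the RRA alpha cutoff. The input values are made up.

```python
import statistics


def gene_lfc(sgrna_lfcs, method="median"):
    """Summarize sgRNA-level log fold changes into one gene-level value."""
    if not sgrna_lfcs:
        raise ValueError("at least one sgRNA LFC is required")
    if method == "median":
        return statistics.median(sgrna_lfcs)
    if method == "mean":
        return statistics.mean(sgrna_lfcs)
    if method == "secondbest":
        # Assumption: "strongest" is interpreted as largest absolute LFC.
        ranked = sorted(sgrna_lfcs, key=abs, reverse=True)
        return ranked[1] if len(ranked) > 1 else ranked[0]
    raise ValueError(f"unsupported method: {method}")


example = [-2.0, -1.0, -0.5, 0.1]  # hypothetical sgRNA LFCs for one gene
for method in ("median", "mean", "secondbest"):
    print(method, gene_lfc(example, method))
```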
Optional arguments for input and output:
| Parameter | Explanation |
|---|---|
| -n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
| --control-sgrna CONTROL_SGRNA | A list of control sgRNAs for normalization and for generating the null distribution of RRA. See the format specification. |
| --normcounts-to-file | Write normalized read counts to file ({output-prefix}.normalized.txt). |
| --keep-tmp | Keep intermediate files. |
| --skip-gene SKIP_GENE | Skip genes in the report. By default, "NA" or "na" will be skipped. |
| --additional-rra-parameters ADDITIONAL_RRA_PARAMETERS | Additional arguments to run RRA. They will be appended to the command line for calling RRA. |
Optional arguments for CNV correction:
| Parameter | Explanation |
|---|---|
| --cnv-norm CNV_NORM | A matrix of copy number variation data across cell lines to normalize CNV-biased sgRNA scores prior to gene ranking. |
| --cell-line CELL_LINE | The name of the cell line to be used for copy number variation normalization. |
This subcommand collects sgRNA read count information from fastq files. The output count tables can be used directly in the test subcommand.
usage:
usage: mageck count [-h] -l LIST_SEQ
(--fastq FASTQ [FASTQ ...] | -k COUNT_TABLE)
[--norm-method {none,median,total,control}]
[--control-sgrna CONTROL_SGRNA]
[--sample-label SAMPLE_LABEL] [-n OUTPUT_PREFIX]
[--unmapped-to-file] [--keep-tmp] [--test-run]
[--trim-5 TRIM_5] [--sgrna-len SGRNA_LEN] [--count-n]
[--reverse-complement] [--pdf-report]
[--day0-label DAY0_LABEL] [--gmt-file GMT_FILE]
required arguments:
| Parameter | Explanation |
|---|---|
| -l LIST_SEQ, --list-seq LIST_SEQ | A file containing list of sgRNA names, the sequences and target genes, either in .txt or in .csv format. See input/#sgrna-library-file for more details. If this file is not provided, mageck will count all possible sgRNAs in the fastq. |
| --fastq FASTQ | Sample fastq/fastq.gz files (or bam files after v0.5.5. See advanced tutorial), separated by space; use comma (,) to indicate technical replicates of the same sample. For example, "--fastq sample1_replicate1.fastq,sample1_replicate2.fastq sample2_replicate1.fastq,sample2_replicate2.fastq" indicates two samples with 2 technical replicates for each sample. |
| -k COUNT_TABLE, --count-table COUNT_TABLE | The read count table file. Only 1 file is accepted. |
optional arguments for normalization:
| Parameter | Explanation |
|---|---|
| --norm-method {none,median,total,control} | Method for normalization, including "none" (no normalization), "median" (median normalization, default), "total" (normalization by total read counts), "control" (normalization by control sgRNAs specified by the --control-sgrna option). |
| --control-sgrna CONTROL_SGRNA | A list of control sgRNAs for normalization and for generating the null distribution of RRA. See the format specification. |
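As a rough illustration of what normalization by total read counts means, the sketch below rescales each sample so that all samples end up with the same total. The choice of the mean total as the common target is an assumption made for this example; MAGeCK's own implementation may scale differently. The counts are the three example rows of escneg.count.txt shown in the tutorial.

```python
def normalize_by_total(counts):
    """Scale each sample so that all samples share the same total count.

    counts maps a sample label to its list of raw sgRNA read counts.
    """
    totals = {sample: sum(values) for sample, values in counts.items()}
    if any(total == 0 for total in totals.values()):
        raise ValueError("a sample with zero total reads cannot be normalized")
    # Assumed target: the mean of the per-sample totals.
    target = sum(totals.values()) / len(totals)
    return {
        sample: [value * target / totals[sample] for value in values]
        for sample, values in counts.items()
    }


raw = {"plasmid": [13, 94, 85], "ESC1": [32, 108, 128]}
normalized = normalize_by_total(raw)
for sample, values in normalized.items():
    print(sample, [round(v, 2) for v in values], round(sum(values), 2))
```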
optional arguments for input and output:
| Parameter | Explanation |
|---|---|
| --sample-label SAMPLE_LABEL | Sample labels, separated by comma (,). The number of labels must equal the number of samples provided (in the --fastq option). Default "sample1,sample2,...". |
| -n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
| --unmapped-to-file | Save unmapped reads to file. |
| --keep-tmp | Keep intermediate files. |
| --test-run | Test running. If this option is on, MAGeCK will only process the first 1M records for each file. |
optional arguments for processing fastq files:
| Parameter | Explanation |
|---|---|
| --trim-5 TRIM_5 | Length of trimming the 5' of the reads. Default 0 |
| --sgrna-len SGRNA_LEN | Length of the sgRNA. Default 20. ATTENTION: after v 0.5.3, the program will automatically determine the sgRNA length from library file; so only use this if you turn on the --unmapped-to-file option. |
| --count-n | Count sgRNAs with Ns. By default, sgRNAs containing Ns will be discarded. |
| --reverse-complement | Reverse complement the sequences in library for read mapping. |
Optional arguments for quality controls:
| Parameter | Explanation |
|---|---|
| --pdf-report | Generate pdf report of the fastq files. |
| --day0-label DAY0_LABEL | Turn on the negative selection QC and specify the label for control sample (usually day 0 or plasmid). For every other sample label, the negative selection QC will compare it with day0 sample, and estimate the degree of negative selections in essential genes. |
| --gmt-file GMT_FILE | The pathway file used for QC, in GMT format. By default it will use the GMT file provided by MAGeCK. |
MAGeCK can also invoke GSEA (default) or RRA to test if a pathway is enriched in one particular gene ranking.
usage:
usage: mageck pathway [-h] --gene-ranking GENE_RANKING --gmt-file GMT_FILE
[-n OUTPUT_PREFIX] [--method {gsea,rra}]
[--single-ranking] [--sort-criteria {neg,pos}]
[--keep-tmp] [--ranking-column RANKING_COLUMN]
[--ranking-column-2 RANKING_COLUMN_2]
[--pathway-alpha PATHWAY_ALPHA]
[--permutation PERMUTATION]
required arguments:
| Parameter | Explanation |
|---|---|
| --gene-ranking GENE_RANKING | The gene ranking file generated by the gene test step. |
| --gmt-file GMT_FILE | The pathway file in GMT format. See input/#pathway-file-gmt for more details of the GMT file format. |
optional arguments:
| Parameter | Explanation |
|---|---|
| -h, --help | show this help message and exit |
| --single-ranking | The provided file is a (single) gene ranking file, either positive or negative selection. Only one enrichment comparison will be performed. |
| -n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
| --method {gsea,rra} | Method for testing pathway enrichment, including gsea (Gene Set Enrichment Analysis) or rra. Default gsea. |
| --sort-criteria {neg,pos} | Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection. |
| --keep-tmp | Keep intermediate files. |
| --ranking-column RANKING_COLUMN | Column number or label in gene summary file for gene ranking; can be either an integer of column number, or a string of column label. Default "2" (the 3rd column). |
| --ranking-column-2 RANKING_COLUMN_2 | Column number or label in gene summary file for gene ranking; can be either an integer of column number, or a string of column label. This option is used to determine the column for positive selections and is disabled if --single-ranking is specified. Default "8" (the 9th column). |
| --pathway-alpha PATHWAY_ALPHA | The default alpha value for RRA pathway enrichment. Default 0.25. |
| --permutation PERMUTATION | The number of permutations for gsea. Default 1000. |
The mle subcommand performs maximum-likelihood analysis of gene essentialities, instead of the RRA analysis.
usage:
usage: mageck.beta mle [-h] -k COUNT_TABLE
(-d DESIGN_MATRIX | --day0-label DAY0_LABEL)
[-n OUTPUT_PREFIX] [-i INCLUDE_SAMPLES]
[-b BETA_LABELS] [--control-sgrna CONTROL_SGRNA]
[--cnv-norm CNV_NORM] [--cnv-est CNV_EST] [--debug]
[--debug-gene DEBUG_GENE]
[--norm-method {none,median,total,control}]
[--genes-varmodeling GENES_VARMODELING]
[--permutation-round PERMUTATION_ROUND]
[--no-permutation-by-group]
[--max-sgrnapergene-permutation MAX_SGRNAPERGENE_PERMUTATION]
[--remove-outliers] [--threads THREADS]
[--adjust-method {fdr,holm,pounds}]
[--sgrna-efficiency SGRNA_EFFICIENCY]
[--sgrna-eff-name-column SGRNA_EFF_NAME_COLUMN]
[--sgrna-eff-score-column SGRNA_EFF_SCORE_COLUMN]
[--update-efficiency] [--bayes] [-p] [-w PPI_WEIGHTING]
[-e NEGATIVE_CONTROL]
required arguments:
| Parameter | Explanation |
|---|---|
| -k COUNT_TABLE, --count-table COUNT_TABLE | Provide a tab-separated count table. Each line in the table should include sgRNA name (1st column), target gene (2nd column) and read counts in each sample. See input/#sgrna-read-count-file for a detailed description. |
| -d DESIGN_MATRIX, --design-matrix DESIGN_MATRIX | Provide a design matrix, either a file name or a quoted string of the design matrix. For example, "1,1;1,0". The row of the design matrix must match the order of the samples in the count table (if --include-samples is not specified), or the order of the samples by the --include-samples option. |
| --day0-label DAY0_LABEL | Specify the label for control sample (usually day 0 or plasmid). For every other sample label, the MLE module will treat it as a single condition and generate a corresponding design matrix. |
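The quoted-string form of the design matrix (for example "1,1;1,0") can be read as rows separated by ";" and values separated by ",". The sketch below parses such a string and checks that the number of rows matches a list of sample labels, as required by the -d description above. Both helper names are hypothetical and the sample labels are placeholders.

```python
def parse_design_string(text):
    """Parse a design matrix string such as "1,1;1,0" into a list of rows."""
    rows = [[int(value) for value in row.split(",")] for row in text.split(";")]
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must have the same number of columns")
    return rows


def check_rows_match_samples(rows, sample_labels):
    """Each row of the design matrix must correspond to exactly one sample."""
    if len(rows) != len(sample_labels):
        raise ValueError(
            f"{len(rows)} rows in design matrix but {len(sample_labels)} samples"
        )
    return dict(zip(sample_labels, rows))


design = parse_design_string("1,1;1,0")
# Placeholder sample labels, in the order they would be passed to --include-samples.
print(check_rows_match_samples(design, ["sampleA", "sampleB"]))
```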
optional arguments for input and output:
| Parameter | Explanation |
|---|---|
| -n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
| -i INCLUDE_SAMPLES, --include-samples INCLUDE_SAMPLES | Specify the sample labels if the design matrix is not given by file in the --design-matrix option. Sample labels are separated by ",", and must match the labels in the count table. |
| -b BETA_LABELS, --beta-labels BETA_LABELS | Specify the labels of the variables (i.e., beta), if the design matrix is not given by file in the --design-matrix option. Should be separated by ",", and the number of labels must equal the number of columns of the design matrix, including baseline labels. Default value: "beta_0,beta_1,beta_2,...". |
| --control-sgrna CONTROL_SGRNA | A list of control sgRNAs. See the format specification. |
optional arguments for CNV correction:
| Parameter | Explanation |
|---|---|
| --cnv-norm CNV_NORM | A matrix of copy number variation data across cell lines to normalize CNV-biased sgRNA scores prior to gene ranking. |
optional arguments for MLE module:
| Parameter | Explanation |
|---|---|
| --debug | Debug mode to output detailed information of the running. |
| --debug-gene DEBUG_GENE | Debug mode to only run one gene with specified ID. |
| --norm-method {none,median,total,control} | Method for normalization, including "none" (no normalization), "median" (median normalization, default), "total" (normalization by total read counts), "control" (normalization by control sgRNAs specified by the --control-sgrna option). |
| --genes-varmodeling GENES_VARMODELING | The number of genes for mean-variance modeling. Default 1000. |
| --permutation-round PERMUTATION_ROUND | The rounds for permutation (integer). The permutation time is (# genes) * x for x rounds of permutation. Suggested value: 10 (may take longer time). Default 2. |
| --no-permutation-by-group | By default, gene permutation is performed separately, by their number of sgRNAs. Turning this option on will perform permutation on all genes together. This makes the program faster, but the p value estimation is accurate only if the number of sgRNAs per gene is approximately the same. |
| --max-sgrnapergene-permutation MAX_SGRNAPERGENE_PERMUTATION | Only permute genes by group if the number of sgRNAs per gene is smaller than this number. This will save a lot of time if some regions are targeted by a large number of sgRNAs (usually hundreds). Must be an integer. Default 100. |
| --remove-outliers | Try to remove outliers. Turning this option on will slow the algorithm. |
| --threads THREADS | Using multiple threads to run the algorithm. Default using only 1 thread. |
| --adjust-method {fdr,holm,pounds} | Method for sgRNA-level p-value adjustment, including false discovery rate (fdr), Holm's method (holm), or Pounds's method (pounds). |
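For readers unfamiliar with the p-value adjustments named in --adjust-method, the following Python sketch shows the textbook Benjamini-Hochberg (fdr) and Holm procedures. It is an illustration of the standard methods only, not MAGeCK's own code, and the function names are hypothetical. Pounds's method is not shown.

```python
def adjust_fdr(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def adjust_holm(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_max = 0.0
    # Walk from the smallest p-value up; each step uses a smaller multiplier.
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, pvalues[i] * (n - rank)))
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for illustration
print(adjust_fdr(raw))
print(adjust_holm(raw))
```

Holm controls the family-wise error rate and is therefore more conservative than the false discovery rate adjustment on the same input.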
optional arguments for the EM iteration:
| Parameter | Explanation |
|---|---|
| --sgrna-efficiency SGRNA_EFFICIENCY | An optional file of sgRNA efficiency prediction. The efficiency prediction will be used as an initial guess of the probability an sgRNA is efficient. Must contain at least two columns, one containing sgRNA ID, the other containing sgRNA efficiency prediction. |
| --sgrna-eff-name-column SGRNA_EFF_NAME_COLUMN | The sgRNA ID column in sgRNA efficiency prediction file (specified by the --sgrna-efficiency option). Default is 0 (the first column). |
| --sgrna-eff-score-column SGRNA_EFF_SCORE_COLUMN | The sgRNA efficiency prediction column in sgRNA efficiency prediction file (specified by the --sgrna-efficiency option). Default is 1 (the second column). |
| --update-efficiency | Iteratively update sgRNA efficiency during EM iteration. |
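The two column options above use 0-based indexes. As a sketch of how such a file could be read, the Python snippet below picks the ID and score columns by index. It assumes a tab-separated file; the function name read_efficiency and the scores in the example are made up for illustration and do not come from MAGeCK.

```python
import csv
import io


def read_efficiency(handle, name_column=0, score_column=1):
    """Map sgRNA ID -> efficiency score using 0-based column indexes.

    Assumption: the file is tab-separated. Lines whose score field is not
    numeric (for example a header line) are skipped.
    """
    table = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) <= max(name_column, score_column):
            continue
        try:
            table[fields[name_column]] = float(fields[score_column])
        except ValueError:
            continue
    return table


# Hypothetical file content: IDs borrowed from the library example, scores invented.
example = io.StringIO("sgRNA\tscore\ns_10007\t0.82\ns_10008\t0.35\n")
print(read_efficiency(example))
```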
The plot command generates graphics for selected genes. For interactive visualizations, use our new MAGeCK-VISPR algorithm.
usage:
usage: mageck plot [-h] -k COUNT_TABLE -g GENE_SUMMARY [--genes GENES]
[-s SAMPLES] [-n OUTPUT_PREFIX]
[--norm-method {none,median,total}] [--keep-tmp]
required arguments:
| Parameter | Explanation |
|---|---|
| -k COUNT_TABLE, --count-table COUNT_TABLE | Provide a tab-separated count table. |
| -g GENE_SUMMARY, --gene-summary GENE_SUMMARY | The gene summary file generated by the test command. |
optional arguments:
| Parameter | Explanation |
|---|---|
| -h, --help | show this help message and exit |
| --genes GENES | A list of genes to be plotted, separated by comma. Default: none. |
| -s SAMPLES, --samples SAMPLES | A list of samples to be plotted, separated by comma. Default: using all samples in the count table. |
| -n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
| --norm-method {none,median,total} | Method for normalization, default median. |
| --keep-tmp | Keep intermediate files. |
This subcommand allows you to generate comparison results directly from fastq files, with limited parameter settings available. The parameters for the run sub-command are included in the test and count sub-commands. See both sub-commands for more details. It is strongly suggested that users run the count and test commands separately, in order to gain finer control of the results.
These programs are used by MAGeCK internally, but can also be executed by users for other purposes.
RRA - Robust Rank Aggregation v 0.5.6.
Usage:
| Parameter | Explanation |
|---|---|
| -i input_data file | Input file name. Format: "item id" "group id" "list id" "value" ["probability"] ["chosen"] |
| -o output_file | Output file name. Format: "group id" "number of items in the group" "lo-value" "false discovery rate" |
| -p maximum_percentile | RRA only considers the items with percentile smaller than this parameter. Default=0.1 |
| --control control_sgrna_list | A list of control sgRNA names. |
| --permutation permutation_round | The number of rounds of permutation. Increase this value if the number of genes is small. Default 100. |
| --no-permutation-by-group | By default, gene permutation is performed separately, by their number of sgRNAs. Turning this option on will perform permutation on all genes together. This makes the program faster, but the p value estimation is accurate only if the number of sgRNAs per gene is approximately the same. |
| --skip-gene gene_name | Genes to skip from doing permutation. Specify it multiple times if you need to skip more than one gene. |
| --min-percentage-goodsgrna min_percentage | Filter genes that have too low a percentage of 'good sgrnas', i.e., sgrnas that fall below the -p threshold. Must be a number between 0-1. Default 0 (do not filter genes). |
| --min-number-goodsgrna min_number | Filter genes that have too few 'good sgrnas', i.e., sgrnas that fall below the -p threshold. Must be an integer. Default 0 (do not filter genes). |
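To give a feel for what the "lo-value" measures, here is a rough Python sketch of a robust-rank-aggregation style score with a percentile cutoff like -p. It follows the general idea of comparing each order statistic of a group's percentiles against what uniform random ranks would give, and taking the minimum. This is an illustration under stated assumptions, not the RRA program's source: the function names are hypothetical, and returning 1.0 when no item passes the cutoff is a choice made for this sketch.

```python
from math import comb


def order_stat_cdf(x, k, n):
    """P(k-th smallest of n independent Uniform(0,1) values <= x)."""
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(k, n + 1))


def rho_score(percentiles, max_percentile=0.1):
    """Smallest order-statistic probability among items below the cutoff."""
    n = len(percentiles)
    ranked = sorted(percentiles)
    candidates = [
        order_stat_cdf(u, k, n)
        for k, u in enumerate(ranked, start=1)
        if u < max_percentile
    ]
    # Assumption of this sketch: no item below the cutoff means no evidence.
    return min(candidates) if candidates else 1.0


# Two sgRNAs of one gene, both ranked near the top of the list.
print(rho_score([0.01, 0.02]))
# Two sgRNAs ranked in the middle and near the bottom.
print(rho_score([0.5, 0.9]))
```

A small score means the items of a group are ranked closer to the top than expected by chance; the permutation options above are then used to turn such scores into p values.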
mageckGSEA is a fast implementation of Gene Set Enrichment Analysis (GSEA) using C++. It's used by MAGeCK for quality controls and pathway enrichment tests. Compared with the official GSEA program, its main advantages are ease of use and extremely fast running speed.
In the gsea/demo folder, an example is provided to run GSEA. Use the following command to perform GSEA analysis based on the ranked gene list in demo1.txt (provided in the demo folder), tested on pathways defined in kegg.ribosome.gmt (provided in the demo folder). The scores in the 2nd column are used to rank genes (-c 1), and 10000 permutations are performed to get the p value:
mageckGSEA -r demo1.txt -g kegg.ribosome.gmt -c 1 -p 10000
You can either provide genes with their scores, as in demo1.txt (genes with smaller scores are ranked first):
SYNRG 0.715581582
SREK1 0.992306809
SLC25A46 0.057411873
COL4A5 0.36387645
CCDC22 -0.463887932
MVD 0.020897922
mageckGSEA will first rank genes based on the provided scores, as long as you indicate which column to use (-c 1).
Or you can just provide gene rankings, as in demo2.txt:
C5orf64
TTC17
MRPS27
PIGY
GPAA1
KIF4A
EPS15
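The score-based ranking described for demo1.txt can be sketched in Python as follows. The snippet sorts the demo1.txt lines shown above by the chosen column (0-based, so 1 is the 2nd column), with smaller scores first, and keeps the file order when the column is 0. The function name rank_genes is hypothetical; this only mirrors the described behavior and is not mageckGSEA's code.

```python
DEMO1 = """SYNRG 0.715581582
SREK1 0.992306809
SLC25A46 0.057411873
COL4A5 0.36387645
CCDC22 -0.463887932
MVD 0.020897922
"""


def rank_genes(text, score_column=1):
    """Return gene names ranked by the given 0-based column, smallest first.

    With score_column=0 the file order is taken as the ranking.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if score_column != 0:
        rows.sort(key=lambda row: float(row[score_column]))
    return [row[0] for row in rows]


print(rank_genes(DEMO1, score_column=1))
```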
The output is a tab-separated file to report the following statistics of GSEA:
Pathway Size ES p p_permutation FDR Ranking Hits LFC
KEGG_RIBOSOME 88 0.3262 0.00240772 0.0043 0.0043 0 32 0
| Item | Explanation |
|---|---|
| Pathway | The name of the pathway |
| Size | The size of the pathway, i.e., the number of genes |
| ES | Enrichment Score (ES) in GSEA |
| p | The p value of ES |
| p_permutation | The permutation p value of ES (usually more accurate than p) |
| FDR | False Discovery Rate of p_permutation |
| Ranking | The ranking of this pathway |
| Hits | The number of genes that are ranked before ES score. See "Leading Edge" analysis of GSEA |
| LFC | Log fold change (not implemented) |
USAGE:
mageckGSEA -r rank_file -g gmt_file
[-e] [-s] [-c score_column]
[-p perm_time] [-n pathway_name]
[-o output_file] [--] [--version] [-h]
| Parameter | Explanation |
|---|---|
| -e, --reverse_value | Reverse the order of the gene. |
| -s, --sort_byp | Sort the pathways by p value. |
| -c score_column, --score_column score_column | The column for gene scores. If you just want to use the ranking of the gene (located at the 1st column), use 0. Otherwise, specify which column should be used to rank the gene. The column number starts from 0. Default: 0. |
| -p perm_time, --perm_time perm_time | Permutations, default 1000. |
| -n pathway_name, --pathway_name pathway_name | Name of the pathway to be tested. If not found, will test all pathways. |
| -o output_file, --output_file output_file | The name of the output file. Use - to print to standard output. |
| -r rank_file, --rank_file rank_file | (required) Rank file. The first column of the rank file must be the gene name. |
| -g gmt_file, --gmt_file gmt_file | (required) The pathway annotation in GMT format. |
| --version | Displays version information and exits. |
| -h, --help | Displays usage information and exits. |
The sgRNA read count file will be used in -k parameter in the test or run sub-command.
The read count file should list the names of the sgRNA, the gene it is targeting, followed by the read counts in each sample. Items should be separated by tabs ('\t'). A header line is optional. For example, in the study of T. Wang et al. Science 2014, there are 4 CRISPR screening samples, and they are labeled as: HL60.initial, KBM7.initial, HL60.final, KBM7.final. Here are a few lines of the read count file:
sgRNA gene HL60.initial KBM7.initial HL60.final KBM7.final
A1CF_m52595977 A1CF 213 274 883 175
A1CF_m52596017 A1CF 294 412 1554 1891
A1CF_m52596056 A1CF 421 368 566 759
A1CF_m52603842 A1CF 274 243 314 855
A1CF_m52603847 A1CF 0 50 145 266
The count sub-command will output the read count file like this.
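As a quick way to inspect such a file outside MAGeCK, the Python sketch below reads a tab-separated count table into memory. It assumes the header line is present (the header is optional in MAGeCK, but this sketch relies on it for the sample labels), and the function name read_count_table is hypothetical.

```python
import csv
import io

# A few lines from the example above, tab-separated.
COUNT_TABLE = (
    "sgRNA\tgene\tHL60.initial\tKBM7.initial\tHL60.final\tKBM7.final\n"
    "A1CF_m52595977\tA1CF\t213\t274\t883\t175\n"
    "A1CF_m52596017\tA1CF\t294\t412\t1554\t1891\n"
    "A1CF_m52603847\tA1CF\t0\t50\t145\t266\n"
)


def read_count_table(handle):
    """Return (sample_labels, {sgRNA: (gene, [counts per sample])})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # assumption: the first line is a header
    samples = header[2:]
    counts = {}
    for sgrna, gene, *values in reader:
        counts[sgrna] = (gene, [int(value) for value in values])
    return samples, counts


samples, counts = read_count_table(io.StringIO(COUNT_TABLE))
print(samples)
print(counts["A1CF_m52603847"])
```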
In the -t/--treatment-id, -c/--control-id parameters, you can use either sample label or sample index to specify samples. If sample label is used, the labels must match the sample labels in the first line of the count table. For example, "HL60.final,KBM7.final".
You can also use sample index to specify samples. The index of the sample is the order it appears in the sgRNA read count file, starting from 0. The index is used in the -t/--treatment-id, -c/--control-id parameters. In the example above, there are four samples, and the index of each sample is as follows:
| sample | index |
|---|---|
| HL60.initial | 0 |
| KBM7.initial | 1 |
| HL60.final | 2 |
| KBM7.final | 3 |
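The label and index forms refer to the same columns. The Python sketch below shows one way to resolve either form to 0-based sample indexes, using the four labels from the table above. The function name resolve_samples is hypothetical, and how MAGeCK itself resolves ambiguous input is not covered here.

```python
LABELS = ["HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"]


def resolve_samples(spec, labels):
    """Turn a comma-separated list of labels or 0-based indexes into indexes."""
    indexes = []
    for token in spec.split(","):
        token = token.strip()
        if token in labels:
            indexes.append(labels.index(token))
        elif token.isdigit() and int(token) < len(labels):
            indexes.append(int(token))
        else:
            raise ValueError(f"unknown sample: {token!r}")
    return indexes


print(resolve_samples("HL60.final,KBM7.final", LABELS))
print(resolve_samples("2,3", LABELS))
```

Both calls select the same two samples, so "HL60.final,KBM7.final" and "2,3" are interchangeable for this count table.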
The design matrix is a txt file indicating the effects of different conditions on different samples. In this file, each row is a sample, each column is a condition, and the value is 1 or 0, indicating whether the sample (in the row) is affected by the condition (in the column).
Here is a simple example of the design matrix from the studies in T. Wang et al. Science 2014. The CRISPR screens are done on two cell lines, HL60 and KBM7, and four samples are generated, two corresponding to the initial states of two cell lines, and two corresponding to the final states of two cell lines. If you want to model the effects of two cell lines, you can have the design matrix as follows:
Samples baseline HL60 KBM7
HL60.initial 1 0 0
KBM7.initial 1 0 0
HL60.final 1 1 0
KBM7.final 1 0 1
Here are some important rules of the design matrix:
Note: different orders of the samples in the design matrix may change the results, because there are preprocessing steps to remove outliers. A good practice will be to always place initial samples (like day0 or plasmid) as the first rows in the design matrix.
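A design matrix file like the one above can be sanity-checked before running mle. The Python sketch below reads the example, assuming whitespace-separated fields with a header line of condition labels, and verifies that every entry is 0 or 1. The function name read_design_matrix is hypothetical.

```python
DESIGN = """Samples baseline HL60 KBM7
HL60.initial 1 0 0
KBM7.initial 1 0 0
HL60.final 1 1 0
KBM7.final 1 0 1
"""


def read_design_matrix(text):
    """Return (condition_labels, {sample: {condition: 0 or 1}})."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    conditions = lines[0][1:]
    matrix = {}
    for sample, *values in lines[1:]:
        row = [int(value) for value in values]
        if len(row) != len(conditions):
            raise ValueError(f"{sample}: expected {len(conditions)} values")
        if any(value not in (0, 1) for value in row):
            raise ValueError(f"{sample}: entries must be 0 or 1")
        matrix[sample] = dict(zip(conditions, row))
    return conditions, matrix


conditions, matrix = read_design_matrix(DESIGN)
# Samples affected by the HL60 condition:
print([sample for sample, row in matrix.items() if row["HL60"] == 1])
```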
When starting from fastq files, MAGeCK needs to know the sgRNA sequence and its targeting gene. Such information is provided in the sgRNA library file, and can be specified by the -l/--list-seq option in run or count subcommand.
The sgRNA library file can be provided either in .txt format or in .csv format. There are three columns in the library file: the sgRNA ID, the sequence, and the gene it is targeting. One example of the library file is provided as library.txt in demo2:
s_10007 TGTTCACAGTATAGTTTGCC CCNA1
s_10008 TTCTCCCTAATTGCTTGCTG CCNA1
s_10027 ACATGTTGCTTCCCCTTGCA CCNC
If provided in .csv format, the file will look like:
s_10007,TGTTCACAGTATAGTTTGCC,CCNA1
s_10008,TTCTCCCTAATTGCTTGCTG,CCNA1
s_10027,ACATGTTGCTTCCCCTTGCA,CCNC
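Both layouts carry the same three fields. As an illustration, the Python sketch below reads either form into a lookup from sgRNA sequence to (sgRNA ID, gene). It assumes that a line containing a comma is in .csv form and that any other line is whitespace-separated; the function name read_library is hypothetical.

```python
TXT_LINES = [
    "s_10007 TGTTCACAGTATAGTTTGCC CCNA1",
    "s_10008 TTCTCCCTAATTGCTTGCTG CCNA1",
    "s_10027 ACATGTTGCTTCCCCTTGCA CCNC",
]
CSV_LINES = [
    "s_10007,TGTTCACAGTATAGTTTGCC,CCNA1",
    "s_10008,TTCTCCCTAATTGCTTGCTG,CCNA1",
    "s_10027,ACATGTTGCTTCCCCTTGCA,CCNC",
]


def read_library(lines):
    """Map sgRNA sequence -> (sgRNA ID, target gene)."""
    library = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # None makes str.split break on any run of whitespace (tabs or spaces).
        separator = "," if "," in line else None
        sgrna_id, sequence, gene = line.split(separator)
        library[sequence] = (sgrna_id, gene)
    return library


print(read_library(TXT_LINES) == read_library(CSV_LINES))
```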
When using the --control-sgrna option, users need to provide a plain text file containing only negative control sgRNA IDs (one per line). For example,
NonTargetingControlGuideForHuman_0001
NonTargetingControlGuideForHuman_0002
NonTargetingControlGuideForHuman_0003
NonTargetingControlGuideForHuman_0004
Some systems may read only 1 control sgRNA ID. Please look at this Q&A for solutions.
The GMT file format stores the pathway information and is consistent with the GMT file in Gene Set Enrichment Analysis (GSEA). The details of the GMT format can be found at GSEA website.
You can also download different pathway files directly from GSEA MSigDB database. They can be used directly by MAGeCK.
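In the GMT format, each line holds one gene set: the set name, a description, and then the member genes, all separated by tabs. A minimal Python reader could look like the sketch below. The function name read_gmt is hypothetical, and the example line is illustrative only; it is not the actual content of any MSigDB file.

```python
import io


def read_gmt(handle):
    """Map gene set name -> list of member genes from a GMT-formatted file."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # need a name, a description and at least one gene
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


# Illustrative line only: a real pathway file lists many more genes.
example = io.StringIO("KEGG_RIBOSOME\texample description\tRPL3\tRPL8\tRPS2\n")
print(read_gmt(example))
```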
The sgRNA/gene mapping file will be used in the --gene-test parameter in the test or run sub-command.
This file should list the names of the sgRNAs and their corresponding genes, separated by the tab ('\t'). For example:
A1CF_m52595977 A1CF
A1CF_m52596017 A1CF
A1CF_m52596056 A1CF
A1CF_m52603842 A1CF
A1CF_m52603847 A1CF
A1CF_p52595870 A1CF
A1CF_p52595881 A1CF
A1CF_p52596023 A1CF
The output of MAGeCK consists of the following files:
The following files are the outputs of RRA. They are intermediate files and are deleted after MAGeCK finishes running. To see these files, use the --keep-tmp option in the MAGeCK test subcommand.
The following files are the inputs of RRA and will be deleted after MAGeCK is complete.
This file is generated by count command, and summarizes QC measurements of the fastq (or count table) files.
An example is as follows:
File Label Reads Mapped Percentage TotalsgRNAs Zerocounts GiniIndex NegSelQC NegSelQCPval NegSelQCPvalPermutation NegSelQCPvalPermutationFDR NegSelQCGene
S6_R1_001.fastq.gz LNCaP_Day21 15567122 13033442 0.8372 92817 2204 0.1472 0.68965 1.6688e-31 0 0 86
S5_R1_001.fastq.gz LNCaP_Day0 16659017 14497805 0.8703 92817 461 0.0996 0 1 1 1 0.0
The contents of each column are as follows. To help you evaluate the quality of the data, recommended values are given in parentheses.
| Column | Content |
|---|---|
| File | The fastq (or the count table) file used. |
| Label | The label of that fastq file assigned. |
| Reads | Total number of reads in the fastq file. (Recommended: 100~300 times the number of sgRNAs) |
| Mapped | Total number of reads that can be mapped to library |
| Percentage | Mapped percentage, calculated as Mapped/Reads (Recommended: at least 60%) |
| TotalsgRNAs | Total number of sgRNAs in the library |
| Zerocounts | Total number of missing sgRNAs (sgRNAs that have 0 counts) (Recommended: no more than 1%) |
| GiniIndex | The Gini Index of the read count distribution. A smaller value indicates a more even count distribution. (Recommended: around 0.1 for plasmid or initial state samples, and around 0.2-0.3 for negative selection samples) |
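The recommended values in this table can be turned into a quick check. The Python sketch below applies three of them to the numbers from the example above: at least 100 reads per sgRNA, at least 60% of reads mapped, and no more than 1% zero-count sgRNAs. The function name qc_flags is hypothetical, and treating the recommendations as hard cutoffs is a simplification made for this sketch.

```python
def qc_flags(reads, mapped, total_sgrnas, zerocounts):
    """Compare count summary values with the recommended ranges."""
    return {
        "coverage_ok": reads / total_sgrnas >= 100,       # reads per sgRNA
        "mapping_ok": mapped / reads >= 0.60,             # Percentage column
        "zerocount_ok": zerocounts / total_sgrnas <= 0.01,
    }


# Values taken from the two example rows above.
day21 = qc_flags(reads=15567122, mapped=13033442, total_sgrnas=92817, zerocounts=2204)
day0 = qc_flags(reads=16659017, mapped=14497805, total_sgrnas=92817, zerocounts=461)
print("LNCaP_Day21", day21)
print("LNCaP_Day0", day0)
```

In this example the Day21 sample exceeds the 1% zero-count recommendation (2204 of 92817 sgRNAs), while the Day0 sample passes all three checks.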
The following columns are used to evaluate the degree of negative selection in known essential genes. They are set only if you provide the --day0-label option. MAGeCK will run pathway analysis for each sample, and use several GSEA metrics to evaluate the quality of the samples.
| Column | Content |
|---|---|
| NegSelQC | The Enrichment Score (ES) of GSEA |
| NegSelQCPval | The p value of the GSEA analysis (Recommended: smaller than 1e-10) |
| NegSelQCPvalPermutation | The permutation p value |
| NegSelQCPvalPermutationFDR | The FDR of the permutation p value |
| NegSelQCGene | The number of essential genes found in the library that are evaluated for GSEA analysis. |
An example of the sgRNA ranking results is as follows:
sgrna Gene control_count treatment_count control_mean treat_mean LFC control_var adj_var score p.low p.high p.twosided FDR high_in_treatment
INO80B_m74682554 INO80B 0.0/0.0 1220.1598778/1476.14096301 0.810860655738 1348.15042041 10.70 0.0 19.0767988005 308.478081895 1.0 1.11022302463e-16 2.22044604925e-16 1.57651669497e-14 True
NHS_p17705966 NHS 1.62172131148/3.90887850467 2327.09368635/1849.95115143 2.76529990807 2088.52241889 9.54 2.6155440132 68.2450168229 252.480744404 1.0 1.11022302463e-16 2.22044604925e-16 1.57651669497e-14 True
The contents of each column are as follows.
| Column | Content |
|---|---|
| sgrna | sgRNA ID |
| Gene | The targeting gene |
| control_count | Normalized read counts in control samples |
| treatment_count | Normalized read counts in treatment samples |
| control_mean | Median read counts in control samples |
| treat_mean | Median read counts in treatment samples |
| LFC | The log2 fold change of sgRNA |
| control_var | The raw variance in control samples |
| adj_var | The adjusted variance in control samples |
| score | The score of this sgRNA |
| p.low | p-value (lower tail) |
| p.high | p-value (higher tail) |
| p.twosided | p-value (two sided) |
| FDR | false discovery rate |
| high_in_treatment | Whether the abundance is higher in treatment samples |
Note that this file has a different meaning in the mle subcommand: it records the estimated efficiency probability of the guides in the MLE model after the iteration terminates.
Note that by default, this value is 1 since --update-efficiency is turned off. The values will be between 0-1 if you turn this option on and/or if you explicitly set the --sgrna-efficiency parameter.
An example of the gene summary file is as follows:
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna neg|lfc pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna pos|lfc
ESPL1 12 6.4327e-10 7.558e-06 7.9e-05 1 11 -2.35 0.99725 0.99981 0.999992 615 0 -0.07
RPL18 12 6.4671e-10 7.558e-06 7.9e-05 2 11 -2.12 0.99799 0.99989 0.999992 620 0 -0.32
CDK1 12 2.6439e-09 7.558e-06 7.9e-05 3 12 -1.93 1.0 0.99999 0.999992 655 0 -0.12
The contents of each column are as follows.
| Column | Content |
|---|---|
| id | Gene ID |
| num | The number of targeting sgRNAs for each gene |
| neg|score | The RRA lo value of this gene in negative selection |
| neg|p-value | The raw p-value (using permutation) of this gene in negative selection |
| neg|fdr | The false discovery rate of this gene in negative selection |
| neg|rank | The ranking of this gene in negative selection |
| neg|goodsgrna | The number of "good" sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in negative selection. |
| neg|lfc | The log2 fold change of this gene in negative selection. The way to calculate gene lfc is controlled by the --gene-lfc-method option |
| pos|score | The RRA lo value of this gene in positive selection |
| pos|p-value | The raw p-value (using permutation) of this gene in positive selection |
| pos|fdr | The false discovery rate of this gene in positive selection |
| pos|rank | The ranking of this gene in positive selection |
| pos|goodsgrna | The number of "good" sgRNAs, i.e., sgRNAs whose ranking is below the alpha cutoff (determined by the --gene-test-fdr-threshold option), in positive selection. |
| pos|lfc | The log2 fold change of this gene in positive selection |
Genes are ranked by the p.neg field (by default). If you need a ranking by the p.pos, you can use the --sort-criteria option.
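For downstream filtering, the gene summary can be read by column name. The Python sketch below selects genes whose FDR in one direction is below a cutoff and orders them by rank. It assumes the file is tab-separated with the header shown above; only the first six columns of the example are reproduced for brevity. The function name significant_genes and the default cutoff of 0.05 are choices made for this sketch.

```python
import csv
import io

# First six columns of the example above, tab-separated.
SUMMARY = (
    "id\tnum\tneg|score\tneg|p-value\tneg|fdr\tneg|rank\n"
    "ESPL1\t12\t6.4327e-10\t7.558e-06\t7.9e-05\t1\n"
    "RPL18\t12\t6.4671e-10\t7.558e-06\t7.9e-05\t2\n"
    "CDK1\t12\t2.6439e-09\t7.558e-06\t7.9e-05\t3\n"
)


def significant_genes(handle, direction="neg", fdr_cutoff=0.05):
    """Return gene IDs with <direction>|fdr <= cutoff, ordered by rank."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = [row for row in reader
            if float(row[f"{direction}|fdr"]) <= fdr_cutoff]
    hits.sort(key=lambda row: int(row[f"{direction}|rank"]))
    return [row["id"] for row in hits]


print(significant_genes(io.StringIO(SUMMARY)))
```

Passing direction="pos" would use the pos|fdr and pos|rank columns of a full gene summary file instead.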
The gene_summary.txt produced by the mle subcommand is similar to the gene_summary.txt format above, except for a few new columns. Here is an example of the gene_summary.txt generated by the mle subcommand:
Gene sgRNA HL60|beta HL60|z HL60|p-value HL60|fdr HL60|wald-p-value HL60|wald-fdr KBM7|beta KBM7|z KBM7|p-value KBM7|fdr KBM7|wald-p-value KBM7|wald-fdr
RNF14 10 0.24927 0.72077 0.36256 0.75648 0.47105 0.9999 0.57276 1.6565 0.06468 0.32386 0.097625 0.73193
RNF10 10 0.10159 0.29373 0.92087 0.98235 0.76896 0.9999 0.11341 0.32794 0.90145 0.97365 0.74296 0.98421
RNF11 10 3.6354 10.513 0.0002811 0.021739 7.5197e-26 1.3376e-22 2.5928 7.4925 0.0014898 0.032024 6.7577e-14 1.33e-11
| Column | Content |
|---|---|
| Gene | Gene ID |
| sgRNA | The number of targeting sgRNAs for each gene |
| HL60|beta;KBM7|beta | The beta scores of this gene in conditions "HL60" and "KBM7", respectively. The conditions are specified in the design matrix as an input of the mle subcommand. |
| HL60|p-value | The raw p-value (using permutation) of this gene |
| HL60|fdr | The false discovery rate of this gene |
| HL60|z | The z-score associated with Wald test |
| HL60|wald-p-value | The p value using Wald test |
| HL60|wald-fdr | The false discovery rate of the Wald test |
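The z and wald-p-value columns in the example are consistent with a two-sided p-value from the standard normal distribution. The Python sketch below shows that relationship. This is an observation from the example values, not a statement about how MAGeCK computes them internally, and the function name wald_p_value is hypothetical.

```python
from math import erfc, sqrt


def wald_p_value(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return erfc(abs(z) / sqrt(2))


# z values from the HL60 columns of the example above.
for gene, z in [("RNF14", 0.72077), ("RNF10", 0.29373), ("RNF11", 10.513)]:
    print(gene, wald_p_value(z))
```

For RNF14 and RNF10 this gives values close to the listed HL60|wald-p-value entries (0.47105 and 0.76896).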
The output of the pathway summary is similar to the gene summary. Here is an example:
id num neg|score neg|p-value neg|fdr neg|rank neg|goodsgrna pos|score pos|p-value pos|fdr pos|rank pos|goodsgrna
KEGG_RIBOSOME 87 8.3272e-23 2.6473e-05 0.001238 1 50 0.051213 0.20927 0.841006 38 4
KEGG_SPLICEOSOME 125 3.7084e-08 2.6473e-05 0.001238 2 41 0.52219 0.80968 0.99902 149 13
KEGG_PROTEASOME 44 1.9586e-06 2.6473e-05 0.001238 3 18 0.52149 0.80905 0.99902 148 5
This table shows that the pathway KEGG_RIBOSOME has 87 genes, an RRA lo value of 8.3272e-23, a permutation p value of 2.6473e-05 (negative selection), an FDR of 0.001238, and a ranking of 1, with 50 genes below the alpha cutoff. This indicates that the genes in this pathway (i.e., ribosomal genes) are strongly negatively selected, which is expected in negative selection CRISPR experiments.
This file includes the logging information during the execution. For count command, it will list some basic statistics of the dataset at the end, including the number of reads, the number of reads mapped to the library, the number of zero-count sgRNAs, etc.
If the "--pdf-report" option is on for count or test command, MAGeCK may generate Rnw and R files that are used to create PDF files. MAGeCK calls the Sweave function in R to generate PDF files.
These files will be automatically deleted after the completion of each command. To keep these files, use the "--keep-tmp" option during the execution.
An example of the gene ranking file (.gene.high.txt or .gene.low.txt) is as follows:
group_id #_items_in_group lo_value FDR
RPL3 93 4.9169e-36 0.000080
RPL8 67 1.8232e-24 0.000080
RPS2 61 1.6928e-20 0.000080
RPS18 40 1.0152e-18 0.000080
The contents of each column are as follows.
| Column | Content |
|---|---|
| group_id | Gene ID |
| #_items_in_group | The number of targeting sgRNAs for each gene |
| lo_value | The raw p-value |
| FDR | The false discovery rate |
An example of the sgRNA ranking file (.plow.txt or .phigh.txt) is as follows. These files are the input of RRA.
sgrna symbol pool p.low prob chosen
Drug_0009853 TOP2A list -31.3383375285032 1 1
Drug_0010808 RPS11 list -29.865960506388134 1 1
The contents of each column are as follows.
| Column | Content |
|---|---|
| sgrna | sgRNA ID |
| symbol | Gene ID |
| pool | Deprecated column. Set all the values in this column to a single value (e.g., "list") |
| p.low | The score used to sort sgRNA (increasing order) |
| prob | Reserved column. Set to 1 |
| chosen | Reserved column. Set to 1 |
Download frequently used libraries
For your convenience, we provide a set of library files that are ready to be used in MAGeCK (in the -l/--list-seq option of the count command) in the libraries folder. You can also create your own library files, see sgrna-library-file for more details.
| File | Explanation |
|---|---|
| broadgpp-brunello-library-corrected.txt.zip | Human Brunello genome-wide library developed by Broad Institute |
| Human_GeCKOv2_Library_A_3_mageck.csv.zip | Human GeCKO v2 half-library A (can be used in either 1- or 2-plasmid systems) |
| Human_GeCKOv2_Library_B_1_mageck.csv.zip | Human GeCKO v2 half-library B |
| Human_GeCKOv2_Library_combine.csv.zip | Human GeCKO v2 combined library of A and B |
| mouse_geckov2_library_a_2_mageck.csv.zip | Mouse GeCKO v2 half-library A (can be used in either 1- or 2-plasmid systems) |
| mouse_geckov2_library_b_1_mageck.csv.zip | Mouse GeCKO v2 half-library B |
| mouse_geckov2_library_combine.csv.zip | Mouse GeCKO v2 combined library of A and B |
| GeCKOv1.txt.zip | GeCKO v1 library file (from the GeCKO Science paper) |
| human_sam_library.csv.zip | Human Synergistic Activation Mediator (SAM) pooled library (CRISPRa library), generated by Feng Zhang laboratory. |
| yusa_library.csv.zip | Mouse knockout library generated by Kosuke Yusa laboratory. |
| tim_library.txt.zip | Human CRISPR knockout library of 7,000 genes (from T. Wang Science 2014). |
| tim_science2015_library.txt.zip | Human CRISPR pooled library of 18,166 genes (from T. Wang Science 2015). |
For the latest releases and version history, see our bitbucket repo.
2019.07.01 Version 0.5.9
2019.01.04 Version 0.5.8
2018.01.05 Version 0.5.7
2017.05.17 Version 0.5.6
2016.12.02 Version 0.5.5
2016.06.29 Version 0.5.4
2016.01.15 Version 0.5.3
2015.08.09 Version 0.5.2
2015.06.23 Version 0.5.1
2015.04.26 Version 0.5
2015.03.19 Version 0.4.4
2015.02.12 Version 0.4.3
2015.02.04 Version 0.4.2
2014.12.01 Version 0.4.1
2014.11.13 Version 0.4
2014.07.01 Version 0.3
2014.04.17 Version 0.2
2014.04.04 Version 0.1