MAGeCK Wiki

Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout

Brought to you by: davidliwei

Tutorial

Running MAGeCK is straightforward. The demo folder contains two mini examples that go through all steps in MAGeCK. To run a demo, execute the sh script in its folder from the command line. To enable the visualization functions of MAGeCK in both demos, see the visualization manual.

Some advanced tutorial topics can be found in the Advanced Tutorial page.

Also check out the following videos on YouTube to learn how to install and run MAGeCK:

Tutorial 1: Installation

Tutorial 2: Comparison between samples

The first tutorial: starting from read count tables

Check the demo/demo1 folder in the source code for the first tutorial.

The tutorial consists of a single command:

mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial  -n demo

The parameters are explained as follows.

Parameters	Meaning
mageck	The main portal of the MAGeCK program
test	A sub-command to ask MAGeCK to perform sgRNA and gene ranking based on provided read count tables
-k sample.txt	The provided read count table file. The format of the file is specified here.
-t HL60.final,KBM7.final	The treatment samples are HL60.final and KBM7.final (the samples with index 2 and 3, counting from 0) in sample.txt. See input files for a detailed explanation.
-c HL60.initial,KBM7.initial	The control samples are HL60.initial and KBM7.initial (the samples with index 0 and 1, counting from 0) in sample.txt. See input files for a detailed explanation.
-n demo	The prefix of the output files is demo, so the output files will be named demo.sgrna_summary.txt, demo.gene_summary.txt, etc.

An explanation of the output files can be found in the [output] page. For all available parameters, see the [usage] page.

You can also specify the treatment and control samples by sample index (counting from 0). For example:

mageck test -k sample.txt -t 2,3 -c 0,1 -n demo
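
The relation between sample labels and sample indices can be sketched in a few lines of Python. This is only an illustration, not part of MAGeCK: the header line below is an assumed example of a read count table header (sgRNA ID, gene, then one column per sample), and the function name sample_indices is hypothetical.

```python
# Assumed header of a read count table; the real one comes from your own file.
header = "sgRNA\tGene\tHL60.initial\tKBM7.initial\tHL60.final\tKBM7.final"


def sample_indices(header_line, labels):
    """Return the 0-based sample index of each label.

    The first two columns (sgRNA ID and gene) are not samples,
    so sample 0 is the third column of the table.
    """
    samples = header_line.rstrip("\n").split("\t")[2:]
    return [samples.index(label) for label in labels]


treatment = sample_indices(header, ["HL60.final", "KBM7.final"])
control = sample_indices(header, ["HL60.initial", "KBM7.initial"])
print("-t", ",".join(map(str, treatment)), "-c", ",".join(map(str, control)))
```

With the assumed header, the printed values match the indices used in the command above.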

The second tutorial: starting from raw fastq files

Check the demo/demo2 folder in the source code for this tutorial.

This demo shows a mini example of how to go through the whole pipeline starting from raw fastq files. In this example, we have fastq files from two conditions, and we would like to find which genes and sgRNAs differ significantly between the conditions. The commands used in the runmageck.sh script are:

mageck count -l library.txt -n demo --sample-label L1,CTRL  --fastq test1.fastq test2.fastq 
mageck test -k demo.count.txt -t L1 -c CTRL -n demo

The "test" command works the same way as in the first demo. The parameters of the "count" command are explained as follows.

Parameters	Meaning
mageck	The main portal of the MAGeCK program
count	A sub-command to ask MAGeCK to generate the sgRNA read count table.
-l library.txt	The provided sgRNA information, including the sgRNA id, the sequence, and the gene it is targeting. See input files for a detailed explanation.
-n demo	The prefix of the output files.
--sample-label L1,CTRL	The labels of the two samples are L1 (test1.fastq) and CTRL (test2.fastq).
--fastq test1.fastq test2.fastq	The provided fastq files, separated by spaces. (Technical replicates of the same sample can also be indicated using a comma as a separator; for example, "sample1_replicate1.fastq,sample1_replicate2.fastq".)
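
To make the pairing between --sample-label and --fastq concrete, here is a minimal Python sketch of the convention described in the table: labels are separated by commas, samples by spaces, and technical replicates of one sample by commas. It only illustrates the convention and is not MAGeCK's own argument parser; the function name pair_samples is hypothetical.

```python
def pair_samples(sample_label, fastq_args):
    """Map each sample label to its list of fastq files.

    sample_label: the value given to --sample-label, e.g. "L1,CTRL"
    fastq_args:   the values given to --fastq, one entry per sample;
                  commas inside an entry separate technical replicates
    """
    labels = sample_label.split(",")
    groups = [arg.split(",") for arg in fastq_args]
    if len(labels) != len(groups):
        raise ValueError("need exactly one label per sample")
    return dict(zip(labels, groups))


pairs = pair_samples("L1,CTRL", ["test1.fastq", "test2.fastq"])
print(pairs)
```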

The third tutorial: going through a public CRISPR/Cas9 screening dataset

After the first two demos, you have a basic sense of how MAGeCK works. In this demo, let us go through a real dataset which is more complicated, and see how to handle some practical problems, like the trimming of the 5' end.

The dataset we use comes from the following paper: Genome-wide recessive genetic screening in mammalian cells with a lentiviral CRISPR-guide RNA library. In this paper, the authors performed CRISPR/Cas9 screening on mouse ESCs and identified genes that are essential in these cells.

Step 1: download the fastq files

The fastq files of the screens are publicly available in the ENA archive. There are several replicates for each condition, but for simplicity, let us download only the following two fastq files and use them to test MAGeCK functions.

Accession	Sample	Download Link
ERR376998	one replicate of plasmid	ERR376998
ERR376999	one replicate of ESC	ERR376999

You can download these files, double click to unzip them (or use gunzip in the terminal), and place them into one separate folder:

gunzip ERR376998.fastq.gz
gunzip ERR376999.fastq.gz

Step 2: prepare the library file

The next step is to prepare the library file so MAGeCK will know which sgRNA targets which gene. If you are using one of the standard GeCKO libraries, you can just download the files from MAGeCK sourceforge. For non-standard libraries, you need to prepare the library file according to the library file format.

In this demo, you can generate the library file using Supplementary Data 2 (or Supplementary Table S7) from the paper, or download it directly from our collection of libraries (the file name is "yusa_library.csv.zip"). Double click to unzip it (or use "unzip" in the terminal).

(Optional) Step 3: determine the trimming length and sgRNA length

**Note: since version 0.5.6, MAGeCK can automatically determine the trimming length and sgRNA length in most cases. Therefore, you can skip this step unless MAGeCK fails to do so by itself.**

In many cases, your sequencing primer is not exactly in front of the first base of the guide RNA. This is indeed the case in this demo, where the first few bases of every read in the fastq file are identical. Make sure you know exactly how many bases to trim before running MAGeCK. You can ask the people who performed the experiment, or get this information by looking at the first few lines of the fastq files.

Here are the first few lines of ERR376998.fastq (only sequences are shown):

CTTGTGGAAAGGACGAAACACCGGTGAAGGTGCCGTTGTGTAGTTTTAGA
CTTGTGGAAAGGACGAAACACCGAGCAGCACAACAATATGGGTTTTAGAG
CTTGTGGAAAGGACGAAACACCGCTCTTGGGTTTGGATGTTTGTTTTAGA
CTTGTGGAAAGGACGAAACACCGTTTGGCGAGGGGAGCGCCGGTTTTAGA
......

You can see that the first 23 nucleotides are identical, so in this case you need to tell MAGeCK to trim the first 23 nucleotides to collect read counts (--trim-5 23). If the nucleotide length in front of sgRNA varies between different reads, use cutadapt to remove the adaptor sequences.

The sgRNA length can be determined from the experimental design. It is usually 20 nucleotides, but in this demo, the sgRNA length is 19.
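
Looking for the shared leading bases can also be done programmatically. The following is a minimal Python sketch, not a MAGeCK function: it takes the four reads shown above and measures their longest common prefix. The function name common_prefix_length is hypothetical.

```python
import os

# The four read sequences shown above (from ERR376998.fastq).
reads = [
    "CTTGTGGAAAGGACGAAACACCGGTGAAGGTGCCGTTGTGTAGTTTTAGA",
    "CTTGTGGAAAGGACGAAACACCGAGCAGCACAACAATATGGGTTTTAGAG",
    "CTTGTGGAAAGGACGAAACACCGCTCTTGGGTTTGGATGTTTGTTTTAGA",
    "CTTGTGGAAAGGACGAAACACCGTTTGGCGAGGGGAGCGCCGGTTTTAGA",
]


def common_prefix_length(sequences):
    """Length of the longest prefix shared by all sequences."""
    # os.path.commonprefix compares plain strings character by character.
    return len(os.path.commonprefix(list(sequences)))


trim5 = common_prefix_length(reads)
print(trim5)  # prints 23
```

With only a handful of reads, the shared prefix can come out too long if the sgRNAs themselves happen to start with the same base, so in practice it is safer to check a larger sample of reads.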

Step 4: run the MAGeCK count command

Now we have everything ready to generate count tables with MAGeCK. Place the two fastq files and the library file into the same directory, and in that directory, run MAGeCK in the terminal:

mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --fastq ERR376998.fastq  ERR376999.fastq

This command also tells MAGeCK to assign a label to each sample ("plasmid" for ERR376998.fastq, and "ESC1" for ERR376999.fastq), and to write the output files with the prefix "escneg". Note that MAGeCK will automatically determine the length of the sgRNAs from the library, so you don't have to specify it here.

If it runs successfully, you will see a file "escneg.count.txt" containing all read counts. The top lines are as follows:

sgRNA   Gene    plasmid ESC1
chr19:5884430-5884453   SLC25A45        13      32
chr11:58831475-58831498 OLFR312 94      108
chr4:49282352-49282375  E130309F12RIK   85      128
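
As a quick sanity check on a count table, you can sum the reads per sample. The sketch below is a hedged illustration in Python, not part of MAGeCK: it assumes the table is tab-delimited with the sgRNA ID and gene in the first two columns, and uses only the three rows shown above. The function name sample_totals is hypothetical.

```python
import csv
import io

# The top lines of escneg.count.txt shown above, assumed to be tab-delimited.
count_table = """sgRNA\tGene\tplasmid\tESC1
chr19:5884430-5884453\tSLC25A45\t13\t32
chr11:58831475-58831498\tOLFR312\t94\t108
chr4:49282352-49282375\tE130309F12RIK\t85\t128
"""


def sample_totals(handle):
    """Sum the read counts of every sample column."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    labels = header[2:]  # skip the sgRNA and Gene columns
    totals = dict.fromkeys(labels, 0)
    for row in reader:
        for label, value in zip(labels, row[2:]):
            totals[label] += int(value)
    return totals


totals = sample_totals(io.StringIO(count_table))
print(totals)  # prints {'plasmid': 192, 'ESC1': 268}
```

To run it on a real file, pass an open file handle instead of the io.StringIO object.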

If you use the --pdf-report option (see Visualization), it will generate a nice PDF report of the sample statistics of the fastq files. Click Here to see the PDF results.

If you want to manually use the --trim-5 option determined in step 3, the command becomes:

mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --trim-5 23 --fastq ERR376998.fastq  ERR376999.fastq

Step 5: compare samples using the MAGeCK test subcommand

With the read count table, you can now compare the ESC1 condition with the plasmid condition to see which genes are negatively or positively selected:

mageck test -k escneg.count.txt -t ESC1 -c plasmid -n esccp

This command tells MAGeCK to compare ESC1 with plasmid in the read count table escneg.count.txt, and output results with prefix "esccp".

If successful, you should see a file "esccp.gene_summary.txt". The top lines are as follows:

id      num     neg|score  neg|p-value   neg|fdr neg|rank        neg|goodsgrna   pos|score  pos|p-value   pos|fdr pos|rank  pos|goodsgrna
GTF2B   5       2.0462e-10      2.5851e-07      0.000707        1       5       1.0     1.0     1.0     19150   0
RPS5    5       5.9353e-10      2.5851e-07      0.000707        2       5       1.0     1.0     1.0     19149   0
RPL19   4       2.695e-09       2.5851e-07      0.000707        3       4       1.0     1.0     1.0     19148   0
KIF18B  5       1.0136e-08      2.5851e-07      0.000707        4       5       1.0     1.0     1.0     19146   0
....

You can immediately see that two ribosomal genes, RPS5 and RPL19, are near the top of the negatively selected genes. If you rank the genes by "pos|rank" (the 11th column), you will see TRP53 (the mouse homolog of TP53) near the top of the positively selected genes:

sort -k 11,11n esccp.gene_summary.txt | less

id      num     neg|score  neg|p-value   neg|fdr neg|rank        neg|goodsgrna   pos|score  pos|p-value   pos|fdr pos|rank  pos|goodsgrna
ZFP945  5       1.0     1.0     0.999999        19150   0       9.6166e-07      5.4287e-06      0.05198 1  5
TRP53   5       0.95411 0.95409 0.999999        17901   0       1.0347e-06      5.4287e-06      0.05198 2  4
PDAP1   5       0.85937 0.86223 0.999999        15753   1       7.6412e-06      2.8178e-05      0.174505  3       2
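
The same ranking can be sketched in Python, addressing the column by its header name instead of its position, which also keeps the header line out of the sort. This is a minimal illustration, not part of MAGeCK: the three rows are copied from the tables in this tutorial, the file is assumed to be tab-delimited, and the function name rank_by is hypothetical.

```python
import csv
import io

HEADER = ["id", "num", "neg|score", "neg|p-value", "neg|fdr", "neg|rank",
          "neg|goodsgrna", "pos|score", "pos|p-value", "pos|fdr", "pos|rank",
          "pos|goodsgrna"]
ROWS = [
    ["GTF2B", "5", "2.0462e-10", "2.5851e-07", "0.000707", "1", "5",
     "1.0", "1.0", "1.0", "19150", "0"],
    ["TRP53", "5", "0.95411", "0.95409", "0.999999", "17901", "0",
     "1.0347e-06", "5.4287e-06", "0.05198", "2", "4"],
    ["ZFP945", "5", "1.0", "1.0", "0.999999", "19150", "0",
     "9.6166e-07", "5.4287e-06", "0.05198", "1", "5"],
]
# Rebuild a small tab-delimited table in the gene_summary layout.
gene_summary = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)


def rank_by(text, column):
    """Return the rows sorted by an integer rank column, best rank first."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return sorted(reader, key=lambda row: int(row[column]))


top_positive = [row["id"] for row in rank_by(gene_summary, "pos|rank")]
print(top_positive)  # prints ['ZFP945', 'TRP53', 'GTF2B']
```

Passing "neg|rank" instead gives the negatively selected genes first.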

As with the count command, if you use the --pdf-report option, a PDF file will be generated. Here is an example of the PDF file generated in this demo.

Final remarks

By now you should be quite familiar with the basic functions of MAGeCK. MAGeCK also provides additional functions for further exploring the data, for example, testing the enrichment of pathways, or plotting the top-ranked genes or the genes you are interested in. If you have further questions, feel free to ask in our Google group. Enjoy your MAGeCK trip!

The fourth tutorial: using the MAGeCK mle module

Since version 0.5, MAGeCK provides a new subcommand, mle, to calculate gene essentiality from CRISPR screens. Compared with the original algorithm in the "test" subcommand, MAGeCK-mle uses a measurement called the beta score to call gene essentiality: a positive beta score means a gene is positively selected, and a negative beta score means a gene is negatively selected. The beta score is similar to the log fold change in differential expression analysis. Compared with the original RRA algorithm, this measurement has the following advantages:

It has only one score for one gene, instead of two scores in RRA: one for positive selection, one for negative selection;
It allows a direct comparison across multiple conditions, or even experiments;
It is able to incorporate sgRNA efficiency information.

This demo will help you go through all the steps in running the mle module.

**The demo/demo3 folder provides an example for running MAGeCK MLE, plus an optional copy number correction module (see advanced tutorials section).**

Step 1: download the count table

For simplicity, let's assume you already know how to generate a read count table from fastq files; if not, check the third demo above. We will use the read count table presented in T Wang et al. Science 2014.

Download the read count table here.

Step 2: prepare the design matrix file

The design matrix file indicates which sample is affected by which condition. It is generally a binary matrix whose rows are samples (labeled in the first column) and whose columns are conditions (labeled in the first row). For the meaning of the design matrix, check the input file format page.

To create a design matrix file, copy the following content into a text editor, and save it as a plain txt file:

Samples        baseline        HL60        KBM7
HL60.initial   1               0           0
KBM7.initial   1               0           0
HL60.final     1               1           0
KBM7.final     1               0           1

Remember the following rules of a design matrix file:

The design matrix file must include a header line of condition labels;
The first column contains the sample labels, which must match the sample labels in the read count file;
The second column must be a "baseline" column that sets all values to "1";
Each element of the design matrix is either "0" or "1";
You must have at least one sample of "initial state" (e.g., day 0 or plasmid) that has only one "1" in the corresponding row. That only "1" must be in the baseline column.

In the design matrix above, we have four samples, two corresponding to the initial states of two cell lines, and two corresponding to the final states of two cell lines. We design two conditions (HL60 and KBM7) that model the cell type-specific effects.
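
The rules above can be checked mechanically before running MAGeCK. The following Python sketch is one possible way to do so; it is not part of MAGeCK, the function name check_design_matrix is hypothetical, and it assumes the file is whitespace-delimited as in the example above.

```python
DESIGN = """Samples        baseline        HL60        KBM7
HL60.initial   1               0           0
KBM7.initial   1               0           0
HL60.final     1               1           0
KBM7.final     1               0           1
"""


def check_design_matrix(text, count_samples):
    """Return a list of rule violations (an empty list means none found)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    if len(header) < 2:
        return ["the header needs a sample column and a baseline column"]
    problems = []
    has_initial = False
    for row in body:
        sample, values = row[0], row[1:]
        if sample not in count_samples:
            problems.append(f"{sample}: label not found in the read count table")
        if len(values) != len(header) - 1:
            problems.append(f"{sample}: expected {len(header) - 1} values")
            continue
        if any(value not in ("0", "1") for value in values):
            problems.append(f"{sample}: every element must be 0 or 1")
            continue
        if values[0] != "1":
            problems.append(f"{sample}: the baseline column must be 1")
        # An initial-state sample has its only 1 in the baseline column.
        if values[0] == "1" and values.count("1") == 1:
            has_initial = True
    if not has_initial:
        problems.append("no initial-state sample (a row whose only 1 is the baseline)")
    return problems


labels = {"HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"}
problems = check_design_matrix(DESIGN, labels)
print(problems)  # prints []
```

Here the set of labels is typed in by hand; in practice it would be read from the header of the read count file.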

Step 3: run the module

Now we have the minimum requirements to run the MAGeCK mle module. Assuming you saved the design matrix file as "designmat.txt", type the following command to run it:

mageck mle -k leukemia.new.csv -d designmat.txt -n beta_leukemia

If successful, MAGeCK mle will generate three files: the log file, the gene_summary file (including gene beta scores), and the sgrna_summary file (including sgRNA efficiency probability predictions). Here are a few lines of the gene_summary file:

Gene    sgRNA   HL60|beta       HL60|z  HL60|p-value    HL60|fdr        HL60|wald-p-value       HL60|wald-fdr   KBM7|beta       KBM7|z  KBM7|p-value    KBM7|fdr        KBM7|wald-p-value       KBM7|wald-fdr
RNF14   10      0.24927 0.72077 0.36256 0.75648 0.47105 0.9999  0.57276 1.6565  0.06468 0.32386 0.097625        0.73193
RNF10   10      0.10159 0.29373 0.92087 0.98235 0.76896 0.9999  0.11341 0.32794 0.90145 0.97365 0.74296 0.98421
RNF11   10      3.6354  10.513  0.0002811       0.021739        7.5197e-26      1.3376e-22      2.5928  7.4925  0.0014898       0.032024        6.7577e-14      1.33e-11

This file includes the beta scores in two conditions specified in the design matrix (HL60|beta and KBM7|beta), and the associated statistics. For more information, check the output format specification of gene_summary file.
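
To pull just the beta scores out of the gene_summary file, the "condition|statistic" naming of the columns can be used. The sketch below is a hedged Python illustration, not part of MAGeCK: it uses the header and two rows shown above, splits on whitespace for brevity, and the function name beta_scores is hypothetical.

```python
HEADER = ("Gene sgRNA HL60|beta HL60|z HL60|p-value HL60|fdr "
          "HL60|wald-p-value HL60|wald-fdr KBM7|beta KBM7|z KBM7|p-value "
          "KBM7|fdr KBM7|wald-p-value KBM7|wald-fdr")
ROWS = [
    "RNF10 10 0.10159 0.29373 0.92087 0.98235 0.76896 0.9999 "
    "0.11341 0.32794 0.90145 0.97365 0.74296 0.98421",
    "RNF11 10 3.6354 10.513 0.0002811 0.021739 7.5197e-26 1.3376e-22 "
    "2.5928 7.4925 0.0014898 0.032024 6.7577e-14 1.33e-11",
]


def beta_scores(header, rows):
    """Return {gene: {condition: beta score}} from gene_summary lines."""
    columns = header.split()
    # Map each condition name to the position of its "<condition>|beta" column.
    beta_columns = {name.split("|")[0]: index
                    for index, name in enumerate(columns)
                    if name.endswith("|beta")}
    scores = {}
    for line in rows:
        fields = line.split()
        scores[fields[0]] = {condition: float(fields[index])
                             for condition, index in beta_columns.items()}
    return scores


scores = beta_scores(HEADER, ROWS)
print(scores["RNF11"])  # prints {'HL60': 3.6354, 'KBM7': 2.5928}
```

In these rows, RNF11 has clearly positive beta scores in both conditions, which by the definition above indicates positive selection.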

Advanced tutorial

The Advanced tutorial page provides more complicated examples for experienced users.
