MAGeCK Wiki

Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout

Brought to you by: davidliwei

advanced_tutorial

Advanced Tutorials

Advanced Tutorials

Tutorial 1: Allow mismatches for read mapping

Since version 0.5.5, MAGeCK count module supports collecting read counts from BAM files. This will allow you to use a third-party aligner to map reads to the library with mismatches, providing more usable reads for the analysis. However, we still recommend directly using the fastq file in the count module (which does not allow any mismatches), because:

Some mismatches in the sgRNAs may have unwanted behaviors (have no on-target cleavages or have other off-target cleavages);
In most cases the read counts are enough if we allow no mismatches;
The mapping procedure is more complicated; for example, you need to know the exact length of 3' adapter sequence.

We will start with the mouse ESC demo which we already used in the tutorial. We will use Bowtie2, a commonly used read aligner, for alignments. Make sure you have the fastq files and the library file on your computer.

Step 1: install bowtie2 and samtools

Both bowtie2 and samtools are frequently used tools for high-throughput sequencing analysis. You can install them by yourself, or, if you have conda installed, install them through the bioconda channel using one simple command:

conda install -c bioconda bowtie2 samtools

See th manual of mageck-vispr for conda installations.

Step 2: build bowtie2 index

To do this, first convert the library file into fasta format, which is required for building index. For library file with CSV format (like the library file yusa_library.csv used here), you can do it using a simple awk command in unix and windows:

awk -F ',' '{print ">"$1"\n"$2}' yusa_library.csv > yusa_lib.fa

This command will convert one entry in the csv file like:

chr3:90056560-90056583,TAGATGCCATCCGTGCCTC,4933434E20RIK

into a fasta format like:

>chr3:90056560-90056583
TAGATGCCATCCGTGCCTC

For a library format with txt format, use awk without the "-F" option:

awk '{print ">"$1"\n"$2}' yusa_library.txt > yusa_library.fa

Next, run bowtie2-build to generate bowtie2 index

bowtie2-build yusa_lib.fa bowtie2_ind_yusa

Step 3: determine the 5' and 3' trimming length and sgRNA length

If you provide MAGeCK count module with fastq files, you only need to examine the exact 5' trimming length. But if this case you need to determine the exact 5' and 3' trimming length. Here are the first few lines of ERR376998.fastq (only sequences are shown):

CTTGTGGAAAGGACGAAACACCGGTGAAGGTGCCGTTGTGTAGTTTTAGA
CTTGTGGAAAGGACGAAACACCGAGCAGCACAACAATATGGGTTTTAGAG
CTTGTGGAAAGGACGAAACACCGCTCTTGGGTTTGGATGTTTGTTTTAGA
CTTGTGGAAAGGACGAAACACCGTTTGGCGAGGGGAGCGCCGGTTTTAGA
......

You can see that the first 23 nucleotides are identical, and the last 8 nucleotides are identical. Te sequences in the middle are exactly the sgRNA sequences.

If the trimming lengths are different, use cutadapt with -a and -b options to remove the adaptor sequences.

Step 4: run bowtie2 to map reads and generate bam files

With the information at hand, now we can run bowtie2 (or other aligners) to align reads. Here are some important aspects you need to keep in mind.

important notes for the alignment:

Make sure 5' and 3' adapters are properly removed before mapping. Either use cutadapt or -5/-3 option in bowtie2.
Make sure no reverse-complement mapping is allowed; otherwise, there will be multiple mappings for sgRNAs whose sequnces are reverse complemented. In bowtie2, the "--no-rc" parameter should be specified.
Carefully check the alignment strategy used in the aligner. Some sequences may map to multiple sgRNAs and are counted multiple times.
When building index, some aligners (like bowtie2) will remove sgRNAs with identical sequence. This will create some warning messages "sgRNA in the BAM file does not match th provided library file".

The command to map reads is as folows:

bowtie2 -x bowtie2_ind_yusa -U ERR376999.fastq -5 23 -3 8 --norc | samtools view -bS - > ERR376999.bam
bowtie2 -x bowtie2_ind_yusa -U ERR376998.fastq -5 23 -3 8 --norc | samtools view -bS - > ERR376998.bam

Bowtie2 is used to align reads, and samtools is used to convert into bam files. Here are the explanations of these parameters.

Parameters	Meaning
-x	the bowtie2 index file, generated through step 2.
-U	fastq files provided.
-5	the trimming length for 5' reads. The value (23) is determined in step 3.
-3	the trimming length for 3' reads. The value (8) is determined in step 3.
--norc	do not allow reverse complement mapping.

The command:

samtools view -bS -

will be used to convert sam file into bam file.

If running goes smoothly, you will see a message like:

[samopen] SAM header is present: 87438 sequences.
10093905 reads; of these:
   10093905 (100.00%) were unpaired; of these:
      1112643 (11.02%) aligned 0 times
      8910585 (88.28%) aligned exactly 1 time
      70677 (0.70%) aligned >1 times
      88.98% overall alignment rate

Step 5: collect reads through mageck count command

The count command is the same as before, instead you provide it with bam files:

mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --fastq ERR376998.bam ERR376999.bam

Note that in this case, the read count statistics may not be accurate as it will depend on the strategy of the read mapping. If one read is mapped to multiple locations, it will probably be counted multiple times.

Tutorial 2: Correct the effects from copy number variations

One well-known issue of CRISPR knock-out screens is that copy number variations (CNV) may affect the outcome of the screens and generate false positives (or false negatives). This is confirmed by several different groups (e.g., Aguirre et al., Munoz et al., Wang et al.)as well as our own experiments. Since version 0.5.6, MAGeCK provides a solution to correct CNV effects, based on the available CNV data for cell lines. The idea is to find an association between sgRNA scores (in RRA) and gene beta scores (in MLE) with CNV, and build a simple linear regression model to reduce the effect of CNVs.

MAGeCK provides to demos (demo3 and demo4) in the demo folder to guide you how to correct CNV effects. Here are the step-to-step tutorial, based on the two demos.

Step 1: prepare CNV profile

In demo3 or demo4, we provide an example of CNV profile (cnv_data.txt) to provie the CNV status for each gene. The tab-delimited file looks like this:

SYMBOL  HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
A1BG    0.1105
NAT2    0.0104
ADA     0.052299999999999999
CDH2    0.57869999999999999
AKT3    0.0567

The first line indicates a word "SYMBOL", followed by a cell line name (HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE). The following lines are the gene symbol, and the CNV status. CNV measurement is represented in log2 ratio; in other words, 0 means no CNV change, -1 means heterozygous loss, -2 means homozygous loss, and >1 means amplifications.

Requirements:

The first word in the first line must be "SYMBOL".
You can also have multiple columns to record different CNV status from different cell lines.

Many cell line CNV data can be directly downloaded from Broad CCLE project. Go to the CCLE page, click the "Show Available Data" link under "DNA Copy Number" section, and download the CNV profile "byGene" (see the image below).

Downloading CNV profiles from CCLE website

Step 2: Correct CNV effect in MAGeCK test

The demo4 folder provides an example to correct CNV effect in MAGeCK test. Here is the content in "run.sh":

mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial -n demo4 --cnv-norm cnv_data.txt --cell-line HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE

Note there are two additional parameters: --cnv-norm tells the program to read CNV profile from the specified file (cnv_data.txt), while --cell-line tells the program to use the CNV profile corresponding to the column named "HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE".

Requirements:

The cell line name in the "--cell-line" option must match the corresponding column names in CNV profile. Otherwise, the program assumes no CNV data is available (and does not do any correction).

Step 3: Correct CNV effect in MAGeCK MLE

The demo3 folder provides an example to correct CNV effect in MAGeCK test. Here is the content in "run.sh":

mageck mle -k leukemia.new.csv -d designmat.txt -n beta_leukemia --cnv-norm cnv_data.txt --permutation-round 2

Note there is one additional parameter: --cnv-norm tells the program to read CNV profile from the specified file (cnv_data.txt). You don't have to specify the cell line name, as MAGeCK will try to match sample labels in the design matrix for available CNV profiles. Take a look at the first line of the design matrix:

Samples        baseline        HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE        KBM7
HL60.initial   1               0           0
......

Requirements:

The sample labels in the design matrix must match the corresponding column names in CNV profile. Otherwise, the program assumes no CNV data is available (and does not do any correction).

Tutorial 3: Run MAGeCK on Docker Image

Once you have Docker installed, you don't need to install MAGeCK; just pull the pre-configured Docker image and run it!

On your terminal, first type

 docker pull davidliwei/mageck

This will download the latest mageck built.

Then, figure out the folder you want to work with -- the folder should have all the necessary files to run MAGeCK (e.g., count file or fastq file). For example, "/User/John/data". Then, from the command line, type:

 docker run -it --volume=/User/John/data:/work --workdir="/work"  davidliwei/mageck

If you are already in the folder you are working with in the terminal, you can also use the following command:

docker run -it --volume=`pwd`:/work --workdir="/work"  davidliwei/mageck

If everything goes well, the terminal shall displays a welcome message and the current version of MAGeCK:

Welcome to MAGeCK Docker
root@14fddf28ef4c:/work# 0.5.9.2

root@14fddf28ef4c:/work#

You can type "mageck" to test if it works.

Tutorial 4: make full use of MAGeCK MLE for more complicated experimental design (e.g., paired samples, time series)

Besides simple design matrix (i.e., one "beta score" for one type of sample), users can design more complicated design matrix for their needs including taking time series CRISPR screens into consideration.

A time-series drug treatment example

One particular example is given below. Genome-wide CRISPR screens are performed on a 4-week time series on cells treated with one drug and control (DMSO). In the end, users get the following samples:

Sample label	Time	Drug
day0	0 week (0w)
dmso-1wk	1 week	DMSO
e2-1wk	1 week	Estrogen (E2)
dmso-2wk	2 weeks	DMSO
e2-2wk	2 weeks	E2
dmso-3wk	3 weeks	DMSO
e2-3wk	3 weeks	E2
dmso-4wk	4 weeks	DMSO
e2-4wk	4 weeks	E2

Users can have multiple ways to set up the design matrix.

Option 1: a simple design matrix

Users can design one beta score (i.e., one column in the design matrix) for each sample type (1-4 weeks * drug type):

sample	baseline	dmso-1wk	e2-1wk	dmso-2wk	e2-2wk	dmso-3wk	e2-3wk	dmso-4wk	e2-4wk
day0	1	0	0	0	0	0	0	0	0
dmso-1wk	1	1	0	0	0	0	0	0	0
e2-1wk	1	0	1	0	0	0	0	0	0
dmso-2wk	1	0	0	1	0	0	0	0	0
e2-2wk	1	0	0	0	1	0	0	0	0
dmso-3wk	1	0	0	0	0	1	0	0	0
e2-3wk	1	0	0	0	0	0	1	0	0
dmso-4wk	1	0	0	0	0	0	0	1	0
e2-4wk	1	0	0	0	0	0	0	0	1

In the end, MAGeCK-VISPR will output eight beta scores (and their associated p values), each beta score corresponds to one of the eight samples (dmso-1wk,e2-1wk,dmso-2wk,e2-2wk,dmso-3wk,e2-3wk,dmso-4wk,e2-4wk). Further analysis can be performed (e.g., clustering) based on these beta scores

Option 2: consider early vs late time points, and drug treatment differences

Instead using the naive approach in option 1, users may want to get the genes that have differences in later time points (weeks 3,4) vs. early time points (weeks 1,2), as well as genes that show differences in drug treatment vs. DMSO controls. The design matrix can be set up as follows.

sample	baseline	common	latevsearly	E2
day0	1	0	0	0
dmso-1wk	1	1	0	0
e2-1wk	1	1	0	1
dmso-2wk	1	1	0	0
e2-2wk	1	1	0	1
dmso-3wk	1	1	1	0
e2-3wk	1	1	1	1
dmso-4wk	1	1	1	0
e2-4wk	1	1	1	1

In the end, MAGeCK-VISPR will output three beta scores (and their associated p values). The meanings are described below.

beta score	meaning
common	The corresponding gene, if this score is nonzero, has common signals across all the samples. In particular, this column should capture signals in early time points (weeks 1-2), DMSO samples, as "common" factor is the only non-zero factor for these samples. Negative score means this gene is essential.
latevsearly	The corresponding gene, if this score is nonzero, shows different behavior in late time points (weeks 3-4) vs. early time points in DMSO samples. Negative score means this gene becomes more essential in late time points.
E2	The corresponding gene, if this score is nonzero, shows difference in drug treated(E2) samples vs. DMSO samples. Negative score means this gene is more essential within E2 drug treatment.

Option 3: linear time points and drug treatments

Assuming genes may behave linearly in time points, users may design a matrix like the following.

sample	baseline	time	E2
day0	1	0	0
dmso-1wk	1	1	0
e2-1wk	1	1	1
dmso-2wk	1	2	0
e2-2wk	1	2	1
dmso-3wk	1	3	0
e2-3wk	1	3	1
dmso-4wk	1	4	0
e2-4wk	1	4	1

Note that we can have other positive numbers than "1" in the design matrix. Relatively, if a log fold change of a gene in week 1 is x, the expected log fold change of that gene in week 2 (and 3,4) would be 2x (3x, 4x), etc. This assumes a linear reduction (or enrichment) of the cell population on a log scale; for example, a cell with certain gene knockout reduces 50% of its population every week. Therefore the log2 fold change for weeks 1-4 would be -1, -2, -3, -4. *Note that this simplified linear scenario may not hold true in many cases. *

In the end, MAGeCK-VISPR will output two beta scores (and their associated p values). The meanings are described below.

beta score	meaning
time	The linear factor for the corresponding genes in time series, assuming the log fold change is linear. if the abs(time beta score) is high, this gene has consistent changes in DMSO samples. Negative score means this gene is essential.
E2	The corresponding genes, if the abs(E2 beta score) is high, shows difference in drug treated(E2) samples vs. DMSO samples. Negative score means this gene is more essential within E2 drug treatment.

We tested this model in our CRISPR screening dataset (E2 vs DMSO, weeks 1-4) on an ER+ breast cancer cell line (T47D). The results of two example genes are shown below.

Gene	time_beta	E2_beta
CSK	0.463	-0.58
FOXA1	0.027	-0.465

These results demonstrated CSK is a tumor suppressor gene shown in DMSO samples (score=0.463) but the function is somehow reverted in E2 (E2 score=-0.58), consistent with our previous findings. In contrast, FOXA1, a known oncogene in Estrogen and Estrogen Receptor signaling, has little effect on DMSO (score=0.027) but is depleted in E2 treated cells (score=-0.465).

Considering paired sample information in MLE

Note that in MAGeCK-RRA, users can add a --paired option to take paired sample information into consideration. In the MLE module, users can easily set up design matrix to consider paired sample. For example, users want to directly compare treatments vs controls, and the experiment is performed on two paired samples, where (treatment_rep1, control_rep1) and (treatment_rep2, control_rep2) are performed independently.

In MAGeCK-RRA, the corresponding command line would be

mageck test -t treatment_rep1,treatment_rep2 -c control_rep1,control_rep2 --paired

In MAGeCK-VISPR, the corresponding design matrix would be

sample	baseline	t_vs_c	rep2_vs_rep1
control_rep1	1	0	0
control_rep2	1	0	1
treatment_rep1	1	1	0
treatment_rep2	1	1	1

The meanings of the scores are as follows.

beta score	meaning
t_vs_c	The beta score differences between treatment vs control.
rep2_vs_rep1	This beta score models the differences between rep 1 and rep 2.

Tutorial 5: Include the sgRNA efficiency into mle calculation

This is the step 4 of the MAGeCK-MLE tutorial. Make sure you are familiar with the first 3 steps before trying this tutorial.

MAGeCK mle also allows an optional input of sgRNA efficiencies, which are calculated from SSC, a computational algorithm to predict sgRNA efficiencies from sgRNA sequence. To incorporate sgRNA efficiency information into the calculation, first download the SSC software, and compile the SSC program as indicated.

Next, we use the SSC program to calculate the sgRNA efficiencies. Download the library file related to this study from our collection of sgRNA libraries (the file name is "tim.library.txt.zip"), and unzip it into a plain txt file. Here are a few lines of the library file:

Row.names       Sequence        Gene.Symbol
INO80B_m74682554        CATGGCCATGGGCACCCGCC    INO80B
NHS_p17705966   AGGAGCTGCACCGCCACGCC    NHS
MED14_m40594623 AACCACCAGCTGGTCCCGCC    MED14

THe library file format is slightly different from the input of SSC, so we need to treak it a little bit. SSC requires the first column of the input file to be sgRNA sequence, and the second column to be sgRNA label, so use the following command to adjust the column orders:

perl -ane 'print "$F[1]\t$F[0]\t$F[2]\n"' tim.library.txt > tim.library.in

The first few lines of "tim.library.in" are as follows:

Sequence        Row.names       Gene.Symbol
CATGGCCATGGGCACCCGCC    INO80B_m74682554        INO80B
AGGAGCTGCACCGCCACGCC    NHS_p17705966   NHS
AACCACCAGCTGGTCCCGCC    MED14_m40594623 MED14

Now we can use SSC to calculate sgRNA efficiencies:

./SSC -l 20 -m SSC0.1/matrix/human_CRISPRi_20bp.matrix -i tim.library.in -o tim.library.out

where -i indicates the sgRNA sequence length, and SSC0.1/matrix/human_CRISPRi_20bp.matrix is the weight matrix provided in the SSC source. The few lines of "tim.library.out" are as follows:

CATGGCCATGGGCACCCGCC    INO80B_m74682554        INO80B  -0.005239
AGGAGCTGCACCGCCACGCC    NHS_p17705966   NHS     -0.018779
AACCACCAGCTGGTCCCGCC    MED14_m40594623 MED14   -0.018132

The last column (4th column) is the sgRNA efficiency score calculated by SSC.

Finally, use this file as input of the MAGeCK mle module:

mageck mle -k leukemia.new.csv -d designmat.txt -n beta_leukemia --sgrna-efficiency tim.library.ko.out --sgrna-eff-name-column 1 --sgrna-eff-score-column 3

Here, three optional arguments related to sgRNA efficiencies are provided. --sgrna-efficiency provides the file name of the efficiency prediction, --sgrna-eff-name-column provides the column index corresponding to the sgRNA names (index = 1 means the name column is the 2nd column), and the --sgrna-eff-score-column provides the column index corresponding to the sgRNA scores (index=3, or the 4th column).

This command will tell MAGeCK mle to incorporate sgRNA efficiency information into the calculation.

For more information of the SSC program, refer to the following publication:

Xu et al. Genome Research 2015 Aug;25(8):1147-57

Return to [Home]

Wiki: Home
Wiki: demo
Wiki: usage

Discussion

Wei Li - 2017-05-17

Downloading CNV profiles of cell lines in Broad CCLE page.

ccle_example.png

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

MAGeCK Wiki

Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout

advanced_tutorial

Advanced Tutorials

Tutorial 1: Allow mismatches for read mapping

Step 1: install bowtie2 and samtools

Step 2: build bowtie2 index

Step 3: determine the 5' and 3' trimming length and sgRNA length

Step 4: run bowtie2 to map reads and generate bam files

Step 5: collect reads through mageck count command

Tutorial 2: Correct the effects from copy number variations

Step 1: prepare CNV profile

Step 2: Correct CNV effect in MAGeCK test

Step 3: Correct CNV effect in MAGeCK MLE

Tutorial 3: Run MAGeCK on Docker Image

Tutorial 4: make full use of MAGeCK MLE for more complicated experimental design (e.g., paired samples, time series)

A time-series drug treatment example

Option 1: a simple design matrix

Option 2: consider early vs late time points, and drug treatment differences

Option 3: linear time points and drug treatments

Considering paired sample information in MLE

Tutorial 5: Include the sgRNA efficiency into mle calculation

Related

Discussion