Since version 0.5.5, MAGeCK count module supports collecting read counts from BAM files. This will allow you to use a third-party aligner to map reads to the library with mismatches, providing more usable reads for the analysis. However, we still recommend directly using the fastq file in the count module (which does not allow any mismatches), because:
We will start with the mouse ESC demo which we already used in the tutorial. We will use Bowtie2, a commonly used read aligner, for alignments. Make sure you have the fastq files and the library file on your computer.
Both bowtie2 and samtools are frequently used tools for high-throughput sequencing analysis. You can install them by yourself, or, if you have conda installed, install them through the bioconda channel using one simple command:
conda install -c bioconda bowtie2 samtools
See th manual of mageck-vispr for conda installations.
To do this, first convert the library file into fasta format, which is required for building index. For library file with CSV format (like the library file yusa_library.csv used here), you can do it using a simple awk command in unix and windows:
awk -F ',' '{print ">"$1"\n"$2}' yusa_library.csv > yusa_lib.fa
This command will convert one entry in the csv file like:
chr3:90056560-90056583,TAGATGCCATCCGTGCCTC,4933434E20RIK
into a fasta format like:
>chr3:90056560-90056583 TAGATGCCATCCGTGCCTC
For a library format with txt format, use awk without the "-F" option:
awk '{print ">"$1"\n"$2}' yusa_library.txt > yusa_library.fa
Next, run bowtie2-build to generate bowtie2 index
bowtie2-build yusa_lib.fa bowtie2_ind_yusa
If you provide MAGeCK count module with fastq files, you only need to examine the exact 5' trimming length. But if this case you need to determine the exact 5' and 3' trimming length. Here are the first few lines of ERR376998.fastq (only sequences are shown):
CTTGTGGAAAGGACGAAACACCGGTGAAGGTGCCGTTGTGTAGTTTTAGA
CTTGTGGAAAGGACGAAACACCGAGCAGCACAACAATATGGGTTTTAGAG
CTTGTGGAAAGGACGAAACACCGCTCTTGGGTTTGGATGTTTGTTTTAGA
CTTGTGGAAAGGACGAAACACCGTTTGGCGAGGGGAGCGCCGGTTTTAGA
......
You can see that the first 23 nucleotides are identical, and the last 8 nucleotides are identical. Te sequences in the middle are exactly the sgRNA sequences.
If the trimming lengths are different, use cutadapt with -a and -b options to remove the adaptor sequences.
With the information at hand, now we can run bowtie2 (or other aligners) to align reads. Here are some important aspects you need to keep in mind.
important notes for the alignment:
The command to map reads is as folows:
bowtie2 -x bowtie2_ind_yusa -U ERR376999.fastq -5 23 -3 8 --norc | samtools view -bS - > ERR376999.bam bowtie2 -x bowtie2_ind_yusa -U ERR376998.fastq -5 23 -3 8 --norc | samtools view -bS - > ERR376998.bam
Bowtie2 is used to align reads, and samtools is used to convert into bam files. Here are the explanations of these parameters.
Parameters | Meaning |
---|---|
-x | the bowtie2 index file, generated through step 2. |
-U | fastq files provided. |
-5 | the trimming length for 5' reads. The value (23) is determined in step 3. |
-3 | the trimming length for 3' reads. The value (8) is determined in step 3. |
--norc | do not allow reverse complement mapping. |
The command:
samtools view -bS -
will be used to convert sam file into bam file.
If running goes smoothly, you will see a message like:
[samopen] SAM header is present: 87438 sequences. 10093905 reads; of these: 10093905 (100.00%) were unpaired; of these: 1112643 (11.02%) aligned 0 times 8910585 (88.28%) aligned exactly 1 time 70677 (0.70%) aligned >1 times 88.98% overall alignment rate
The count command is the same as before, instead you provide it with bam files:
mageck count -l yusa_library.csv -n escneg --sample-label "plasmid,ESC1" --fastq ERR376998.bam ERR376999.bam
Note that in this case, the read count statistics may not be accurate as it will depend on the strategy of the read mapping. If one read is mapped to multiple locations, it will probably be counted multiple times.
One well-known issue of CRISPR knock-out screens is that copy number variations (CNV) may affect the outcome of the screens and generate false positives (or false negatives). This is confirmed by several different groups (e.g., Aguirre et al., Munoz et al., Wang et al.)as well as our own experiments. Since version 0.5.6, MAGeCK provides a solution to correct CNV effects, based on the available CNV data for cell lines. The idea is to find an association between sgRNA scores (in RRA) and gene beta scores (in MLE) with CNV, and build a simple linear regression model to reduce the effect of CNVs.
MAGeCK provides to demos (demo3 and demo4) in the demo folder to guide you how to correct CNV effects. Here are the step-to-step tutorial, based on the two demos.
In demo3 or demo4, we provide an example of CNV profile (cnv_data.txt) to provie the CNV status for each gene. The tab-delimited file looks like this:
SYMBOL HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE A1BG 0.1105 NAT2 0.0104 ADA 0.052299999999999999 CDH2 0.57869999999999999 AKT3 0.0567
The first line indicates a word "SYMBOL", followed by a cell line name (HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE). The following lines are the gene symbol, and the CNV status. CNV measurement is represented in log2 ratio; in other words, 0 means no CNV change, -1 means heterozygous loss, -2 means homozygous loss, and >1 means amplifications.
Requirements:
Many cell line CNV data can be directly downloaded from Broad CCLE project. Go to the CCLE page, click the "Show Available Data" link under "DNA Copy Number" section, and download the CNV profile "byGene" (see the image below).
The demo4 folder provides an example to correct CNV effect in MAGeCK test. Here is the content in "run.sh":
mageck test -k sample.txt -t HL60.final,KBM7.final -c HL60.initial,KBM7.initial -n demo4 --cnv-norm cnv_data.txt --cell-line HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
Note there are two additional parameters: --cnv-norm tells the program to read CNV profile from the specified file (cnv_data.txt), while --cell-line tells the program to use the CNV profile corresponding to the column named "HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE".
Requirements:
The demo3 folder provides an example to correct CNV effect in MAGeCK test. Here is the content in "run.sh":
mageck mle -k leukemia.new.csv -d designmat.txt -n beta_leukemia --cnv-norm cnv_data.txt --permutation-round 2
Note there is one additional parameter: --cnv-norm tells the program to read CNV profile from the specified file (cnv_data.txt). You don't have to specify the cell line name, as MAGeCK will try to match sample labels in the design matrix for available CNV profiles. Take a look at the first line of the design matrix:
Samples baseline HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE KBM7 HL60.initial 1 0 0 ......
Requirements:
Once you have Docker installed, you don't need to install MAGeCK; just pull the pre-configured Docker image and run it!
On your terminal, first type
docker pull davidliwei/mageck
This will download the latest mageck built.
Then, figure out the folder you want to work with -- the folder should have all the necessary files to run MAGeCK (e.g., count file or fastq file). For example, "/User/John/data". Then, from the command line, type:
docker run -it --volume=/User/John/data:/work --workdir="/work" davidliwei/mageck
If you are already in the folder you are working with in the terminal, you can also use the following command:
docker run -it --volume=`pwd`:/work --workdir="/work" davidliwei/mageck
If everything goes well, the terminal shall displays a welcome message and the current version of MAGeCK:
Welcome to MAGeCK Docker root@14fddf28ef4c:/work# 0.5.9.2 root@14fddf28ef4c:/work#
You can type "mageck" to test if it works.
Besides simple design matrix (i.e., one "beta score" for one type of sample), users can design more complicated design matrix for their needs including taking time series CRISPR screens into consideration.
One particular example is given below. Genome-wide CRISPR screens are performed on a 4-week time series on cells treated with one drug and control (DMSO). In the end, users get the following samples:
Sample label | Time | Drug |
---|---|---|
day0 | 0 week (0w) | |
dmso-1wk | 1 week | DMSO |
e2-1wk | 1 week | Estrogen (E2) |
dmso-2wk | 2 weeks | DMSO |
e2-2wk | 2 weeks | E2 |
dmso-3wk | 3 weeks | DMSO |
e2-3wk | 3 weeks | E2 |
dmso-4wk | 4 weeks | DMSO |
e2-4wk | 4 weeks | E2 |
Users can have multiple ways to set up the design matrix.
Users can design one beta score (i.e., one column in the design matrix) for each sample type (1-4 weeks * drug type):
sample | baseline | dmso-1wk | e2-1wk | dmso-2wk | e2-2wk | dmso-3wk | e2-3wk | dmso-4wk | e2-4wk |
---|---|---|---|---|---|---|---|---|---|
day0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
dmso-1wk | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
e2-1wk | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
dmso-2wk | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
e2-2wk | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
dmso-3wk | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
e2-3wk | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
dmso-4wk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
e2-4wk | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
In the end, MAGeCK-VISPR will output eight beta scores (and their associated p values), each beta score corresponds to one of the eight samples (dmso-1wk,e2-1wk,dmso-2wk,e2-2wk,dmso-3wk,e2-3wk,dmso-4wk,e2-4wk). Further analysis can be performed (e.g., clustering) based on these beta scores
Instead using the naive approach in option 1, users may want to get the genes that have differences in later time points (weeks 3,4) vs. early time points (weeks 1,2), as well as genes that show differences in drug treatment vs. DMSO controls. The design matrix can be set up as follows.
sample | baseline | common | latevsearly | E2 |
---|---|---|---|---|
day0 | 1 | 0 | 0 | 0 |
dmso-1wk | 1 | 1 | 0 | 0 |
e2-1wk | 1 | 1 | 0 | 1 |
dmso-2wk | 1 | 1 | 0 | 0 |
e2-2wk | 1 | 1 | 0 | 1 |
dmso-3wk | 1 | 1 | 1 | 0 |
e2-3wk | 1 | 1 | 1 | 1 |
dmso-4wk | 1 | 1 | 1 | 0 |
e2-4wk | 1 | 1 | 1 | 1 |
In the end, MAGeCK-VISPR will output three beta scores (and their associated p values). The meanings are described below.
beta score | meaning |
---|---|
common | The corresponding gene, if this score is nonzero, has common signals across all the samples. In particular, this column should capture signals in early time points (weeks 1-2), DMSO samples, as "common" factor is the only non-zero factor for these samples. Negative score means this gene is essential. |
latevsearly | The corresponding gene, if this score is nonzero, shows different behavior in late time points (weeks 3-4) vs. early time points in DMSO samples. Negative score means this gene becomes more essential in late time points. |
E2 | The corresponding gene, if this score is nonzero, shows difference in drug treated(E2) samples vs. DMSO samples. Negative score means this gene is more essential within E2 drug treatment. |
Assuming genes may behave linearly in time points, users may design a matrix like the following.
sample | baseline | time | E2 |
---|---|---|---|
day0 | 1 | 0 | 0 |
dmso-1wk | 1 | 1 | 0 |
e2-1wk | 1 | 1 | 1 |
dmso-2wk | 1 | 2 | 0 |
e2-2wk | 1 | 2 | 1 |
dmso-3wk | 1 | 3 | 0 |
e2-3wk | 1 | 3 | 1 |
dmso-4wk | 1 | 4 | 0 |
e2-4wk | 1 | 4 | 1 |
Note that we can have other positive numbers than "1" in the design matrix. Relatively, if a log fold change of a gene in week 1 is x, the expected log fold change of that gene in week 2 (and 3,4) would be 2x (3x, 4x), etc. This assumes a linear reduction (or enrichment) of the cell population on a log scale; for example, a cell with certain gene knockout reduces 50% of its population every week. Therefore the log2 fold change for weeks 1-4 would be -1, -2, -3, -4. *Note that this simplified linear scenario may not hold true in many cases. *
In the end, MAGeCK-VISPR will output two beta scores (and their associated p values). The meanings are described below.
beta score | meaning |
---|---|
time | The linear factor for the corresponding genes in time series, assuming the log fold change is linear. if the abs(time beta score) is high, this gene has consistent changes in DMSO samples. Negative score means this gene is essential. |
E2 | The corresponding genes, if the abs(E2 beta score) is high, shows difference in drug treated(E2) samples vs. DMSO samples. Negative score means this gene is more essential within E2 drug treatment. |
We tested this model in our CRISPR screening dataset (E2 vs DMSO, weeks 1-4) on an ER+ breast cancer cell line (T47D). The results of two example genes are shown below.
Gene | time_beta | E2_beta |
---|---|---|
CSK | 0.463 | -0.58 |
FOXA1 | 0.027 | -0.465 |
These results demonstrated CSK is a tumor suppressor gene shown in DMSO samples (score=0.463) but the function is somehow reverted in E2 (E2 score=-0.58), consistent with our previous findings. In contrast, FOXA1, a known oncogene in Estrogen and Estrogen Receptor signaling, has little effect on DMSO (score=0.027) but is depleted in E2 treated cells (score=-0.465).
Note that in MAGeCK-RRA, users can add a --paired option to take paired sample information into consideration. In the MLE module, users can easily set up design matrix to consider paired sample. For example, users want to directly compare treatments vs controls, and the experiment is performed on two paired samples, where (treatment_rep1, control_rep1) and (treatment_rep2, control_rep2) are performed independently.
In MAGeCK-RRA, the corresponding command line would be
mageck test -t treatment_rep1,treatment_rep2 -c control_rep1,control_rep2 --paired
In MAGeCK-VISPR, the corresponding design matrix would be
sample | baseline | t_vs_c | rep2_vs_rep1 |
---|---|---|---|
control_rep1 | 1 | 0 | 0 |
control_rep2 | 1 | 0 | 1 |
treatment_rep1 | 1 | 1 | 0 |
treatment_rep2 | 1 | 1 | 1 |
The meanings of the scores are as follows.
beta score | meaning |
---|---|
t_vs_c | The beta score differences between treatment vs control. |
rep2_vs_rep1 | This beta score models the differences between rep 1 and rep 2. |
This is the step 4 of the MAGeCK-MLE tutorial. Make sure you are familiar with the first 3 steps before trying this tutorial.
MAGeCK mle also allows an optional input of sgRNA efficiencies, which are calculated from SSC, a computational algorithm to predict sgRNA efficiencies from sgRNA sequence. To incorporate sgRNA efficiency information into the calculation, first download the SSC software, and compile the SSC program as indicated.
Next, we use the SSC program to calculate the sgRNA efficiencies. Download the library file related to this study from our collection of sgRNA libraries (the file name is "tim.library.txt.zip"), and unzip it into a plain txt file. Here are a few lines of the library file:
Row.names Sequence Gene.Symbol INO80B_m74682554 CATGGCCATGGGCACCCGCC INO80B NHS_p17705966 AGGAGCTGCACCGCCACGCC NHS MED14_m40594623 AACCACCAGCTGGTCCCGCC MED14
THe library file format is slightly different from the input of SSC, so we need to treak it a little bit. SSC requires the first column of the input file to be sgRNA sequence, and the second column to be sgRNA label, so use the following command to adjust the column orders:
perl -ane 'print "$F[1]\t$F[0]\t$F[2]\n"' tim.library.txt > tim.library.in
The first few lines of "tim.library.in" are as follows:
Sequence Row.names Gene.Symbol CATGGCCATGGGCACCCGCC INO80B_m74682554 INO80B AGGAGCTGCACCGCCACGCC NHS_p17705966 NHS AACCACCAGCTGGTCCCGCC MED14_m40594623 MED14
Now we can use SSC to calculate sgRNA efficiencies:
./SSC -l 20 -m SSC0.1/matrix/human_CRISPRi_20bp.matrix -i tim.library.in -o tim.library.out
where -i indicates the sgRNA sequence length, and SSC0.1/matrix/human_CRISPRi_20bp.matrix is the weight matrix provided in the SSC source. The few lines of "tim.library.out" are as follows:
CATGGCCATGGGCACCCGCC INO80B_m74682554 INO80B -0.005239 AGGAGCTGCACCGCCACGCC NHS_p17705966 NHS -0.018779 AACCACCAGCTGGTCCCGCC MED14_m40594623 MED14 -0.018132
The last column (4th column) is the sgRNA efficiency score calculated by SSC.
Finally, use this file as input of the MAGeCK mle module:
mageck mle -k leukemia.new.csv -d designmat.txt -n beta_leukemia --sgrna-efficiency tim.library.ko.out --sgrna-eff-name-column 1 --sgrna-eff-score-column 3
Here, three optional arguments related to sgRNA efficiencies are provided. --sgrna-efficiency provides the file name of the efficiency prediction, --sgrna-eff-name-column provides the column index corresponding to the sgRNA names (index = 1 means the name column is the 2nd column), and the --sgrna-eff-score-column provides the column index corresponding to the sgRNA scores (index=3, or the 4th column).
This command will tell MAGeCK mle to incorporate sgRNA efficiency information into the calculation.
For more information of the SSC program, refer to the following publication:
Xu et al. Genome Research 2015 Aug;25(8):1147-57
Return to [Home]
Wiki: Home
Wiki: demo
Wiki: usage
Downloading CNV profiles of cell lines in Broad CCLE page.