You can always ask questions on our Google group. Your questions are often shared by others, so please help us improve our algorithm by joining the Google group and asking your questions there!
A: Probably you are installing MAGeCK to your own directory, which is not recognized by Python. The solution is to set up the PYTHONPATH environment: see install/#setting-up-the-environment-variables for more details.
A: If you add the "--user" option during installation, the mageck executable is usually placed in a local directory ($HOME/bin or $HOME/.local/bin). Without this option, mageck is installed in the system bin directory (/usr/bin or /usr/sbin).
There are two ways you can check the path of MAGeCK. You can either type
which mageck
to determine the path of the mageck executable. Or, at the end of the installation, you will see a few lines of the log like this:
copying build/scripts-2.7/mageck -> /Users/john/.local/bin
changing mode of /Users/john/.local/bin/mageck to 755
running install_data
copying bin/RRA -> /Users/john/.local/bin
That means your mageck is installed at /Users/john/.local/bin. On the other hand, if you see a message like this:
copying build/scripts-2.7/mageck -> /Users/john/Library/Python/2.7/bin
changing mode of /Users/john/Library/Python/2.7/bin/mageck to 755
running install_data
copying bin/RRA -> /Users/john/Library/Python/2.7/bin
That means your mageck is installed at /Users/john/Library/Python/2.7/bin.
Depending on your system, the path may look like one of the following:
A: You can use a similar approach to locate the MAGeCK Python module, but look for a pattern like python2.7/site-packages. During installation, if you see a message like this:
copying bin/mageckGSEA -> /home/john/.pyenv/versions/2.7.13/bin
running install_egg_info
Removing /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages/mageck-0.5.6-py2.7.egg-info
Writing /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages/mageck-0.5.6-py2.7.egg-info
That means your MAGeCK python module is installed in /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages.
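If you no longer have the installation log, the same two locations can be checked from Python. This is a minimal sketch that only uses the standard library; it assumes the package is importable under the name "mageck" by the Python interpreter that runs the script, which may not be the interpreter your mageck command uses.

```python
import importlib.util
import shutil


def locate_mageck():
    """Return (executable_path, module_dir); either may be None if not found."""
    # Same lookup as typing `which mageck` in the terminal.
    exe = shutil.which("mageck")
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("mageck")
    module_dir = None
    if spec is not None and spec.submodule_search_locations:
        module_dir = list(spec.submodule_search_locations)[0]
    return exe, module_dir


if __name__ == "__main__":
    exe, module_dir = locate_mageck()
    print("mageck executable:", exe or "not found on PATH")
    print("mageck module:", module_dir or "not importable by this Python")
```

If the executable and the module come from different installations (for example, one from conda and one from a --user install), that mismatch is worth resolving first.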
A: This usually happens when you have both a conda version of MAGeCK and a previously installed version of MAGeCK. Even if your "mageck" command comes from conda, the libraries may still come from the previously installed MAGeCK. To solve this problem, you can manually update your installed MAGeCK to the latest version.
A: There are two different solutions to do this.
Solution 1: Uninstall the conda MAGeCK version using the following command:
conda uninstall mageck
You can always re-install MAGeCK later.
To avoid frequently uninstalling and reinstalling the software, consider using conda environments. For example, you can install the conda version of MAGeCK in a dedicated environment, so that it is available only while that environment is activated.
Here is an example. First, create a python 3 environment named "mageckenv":
conda create -n mageckenv anaconda python=3
Then activate the environment using the following command:
source activate mageckenv
Now, install mageck under that environment:
conda install -c bioconda mageck
You can use the MAGeCK conda version under the mageckenv environment now. To disable it, simply deactivate the environment:
source deactivate
Solution 2: The conda MAGeCK runs under Python 3, while the MAGeCK on SourceForge and Bitbucket runs under Python 2. So the best way to run an installed version other than the conda version is to create a Python 2 conda environment and run mageck under that environment.
To create a Python 2 environment when you have miniconda3 (where MAGeCK-VISPR is hosted), type the following command:
conda create -n py2k anaconda python=2
After that, you can activate the environment by typing
source activate py2k
If you run mageck now, it will invoke the installed version. You can also deactivate your environment by typing:
source deactivate
You may also need to manually edit the PATH variable such that the system will run your local mageck first. To do this, first locate the directory of mageck from your own installation (see the question "where is MAGeCK binary installed?"). If it's in /Users/john/.local/bin, then edit the PATH variable as follows:
export PATH=/Users/john/.local/bin:$PATH
Then you should be able to run your own installed mageck, not the conda mageck. For more information, go to Setting up the environment variables.
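To confirm which copy wins after editing PATH, it can help to list every mageck executable in lookup order. The sketch below is a hypothetical helper (the function name is ours, not part of MAGeCK) that walks the PATH directories the same way the shell does; the first entry it returns is the one that will run.

```python
import os


def executables_on_path(name, path=None):
    """List every executable called `name` on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            # Empty PATH entries are skipped in this sketch.
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            if candidate not in hits:
                hits.append(candidate)
    return hits


if __name__ == "__main__":
    for position, found in enumerate(executables_on_path("mageck"), start=1):
        print(position, found)
```

If the conda copy is still listed first, the export line above has probably not taken effect in the current shell.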
A: Usually you can pool the read counts for technical replicates of the same sample. To do this, use comma (,) to separate the fastq files of the technical replicates from the same sample in the --fastq option. For example, "--fastq sample1_replicate1.fastq,sample1_replicate2.fastq sample2_replicate1.fastq,sample2_replicate2.fastq" indicates two samples with 2 technical replicates for each sample.
For biological replicates, treat them as separate samples and use them together when doing the comparison; so MAGeCK can analyze the variance of these samples. For example in the test command, "-t sample1_bio_replicate1,sample1_bio_replicate2 -c sample2_bio_replicate1,sample2_bio_replicate2" compares 2 samples (with 2 biological replicates in each sample).
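When there are many files, assembling the --fastq argument by hand is error-prone. The following is a small sketch of how the comma/space convention described above could be generated from a mapping of samples to files. The helper name and the file names are illustrative; only the -l, -n, --sample-label and --fastq options, which appear elsewhere on this page, are used.

```python
def build_count_command(library, prefix, samples):
    """Build a `mageck count` argument list.

    `samples` maps each sample label to the list of fastq files that are
    technical replicates of that sample.
    """
    labels = list(samples)
    # Technical replicates of one sample are joined by commas;
    # different samples become separate arguments (spaces on the shell).
    fastq_args = [",".join(files) for files in samples.values()]
    return (["mageck", "count", "-l", library, "-n", prefix,
             "--sample-label", ",".join(labels), "--fastq"] + fastq_args)


if __name__ == "__main__":
    command = build_count_command(
        "library.txt", "demo",
        {
            "sample1": ["sample1_replicate1.fastq", "sample1_replicate2.fastq"],
            "sample2": ["sample2_replicate1.fastq", "sample2_replicate2.fastq"],
        },
    )
    print(" ".join(command))
```

Biological replicates would simply be separate keys in the mapping, so that each one gets its own label and its own column in the count table.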
A: Since version 0.5.6, MAGeCK can automatically determine the trimming length, even if the length differs between reads within the same fastq file. Alternatively, you can use cutadapt to trim adaptor sequences of variable length before running MAGeCK.
A: Since version 0.5, MAGeCK produces a "countsummary.txt" file containing all the statistics of the fastq files. If you use "--pdf-report" option, the statistics of fastq files are also in the PDF file from the test.
The statistics can also be found in the log file (for run and count command). Here is an example of the log file generated from count command (the last few lines):
INFO @ Mon, 02 Feb 2015 08:12:15: Summary of file sample1_R1.fastq:
INFO @ Mon, 02 Feb 2015 08:12:15: reads 45631055
INFO @ Mon, 02 Feb 2015 08:12:15: mappedreads 34300176
INFO @ Mon, 02 Feb 2015 08:12:15: zerosgrnas 119315
INFO @ Mon, 02 Feb 2015 08:12:15: label sample_1
INFO @ Mon, 02 Feb 2015 08:12:15: Summary of file sample2_R1.fastq:
INFO @ Mon, 02 Feb 2015 08:12:15: reads 36344414
INFO @ Mon, 02 Feb 2015 08:12:15: mappedreads 27042629
INFO @ Mon, 02 Feb 2015 08:12:15: zerosgrnas 119002
INFO @ Mon, 02 Feb 2015 08:12:15: label sample_2
It provides the total number of reads, the number of mapped reads, the number of sgRNAs with 0 read counts, and the sample label of the fastq file.
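If you want these numbers in a script rather than reading them off the log, the lines can be parsed with a couple of regular expressions. This is a sketch based only on the log excerpt shown above; the patterns are our assumption about the line format, and other MAGeCK versions may print the summary differently.

```python
import re

# Patterns follow the log excerpt shown above (an assumption, not a spec).
SUMMARY = re.compile(r"Summary of file (\S+?):?\s*$")
FIELD = re.compile(r":\s+(reads|mappedreads|zerosgrnas|label)\s+(\S+)\s*$")


def parse_count_log(lines):
    """Return {fastq file: {"reads": int, "mappedreads": int, ...}}."""
    stats, current = {}, None
    for line in lines:
        match = SUMMARY.search(line)
        if match:
            current = stats.setdefault(match.group(1), {})
            continue
        match = FIELD.search(line)
        if match and current is not None:
            key, value = match.groups()
            current[key] = value if key == "label" else int(value)
    return stats


def mapped_fraction(entry):
    """Fraction of reads that were mapped to the library."""
    return entry["mappedreads"] / entry["reads"]


EXAMPLE_LOG = """\
INFO @ Mon, 02 Feb 2015 08:12:15: Summary of file sample1_R1.fastq:
INFO @ Mon, 02 Feb 2015 08:12:15: reads 45631055
INFO @ Mon, 02 Feb 2015 08:12:15: mappedreads 34300176
INFO @ Mon, 02 Feb 2015 08:12:15: zerosgrnas 119315
INFO @ Mon, 02 Feb 2015 08:12:15: label sample_1
"""

if __name__ == "__main__":
    for fastq, entry in parse_count_log(EXAMPLE_LOG.splitlines()).items():
        print(fastq, entry["label"], f"{mapped_fraction(entry):.1%} mapped")
```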
A: We published a paper (MAGeCK-VISPR) to describe some quality control (QC) terms to help you determine the quality of your samples.
For simple QC terms, you can just take a look at the sample statistics. Generally, in a good negative selection sample, (1) the mapped reads should be over 60 percent of the total number of reads, and (2) the number of zero-count sgRNAs should be small (<5%, and preferably <1%). One exception is positive selection experiments, where the number of zero-count sgRNAs may be much higher, but the percentage of mapped reads should still be reasonably high.
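These two rules of thumb are easy to apply programmatically. The sketch below assumes that the zero-count percentage is taken relative to the total number of sgRNAs in the library, which you have to supply yourself; the function name and the returned keys are illustrative, and the thresholds are the ones quoted above for negative selection samples.

```python
def simple_qc(reads, mapped, zero_count, total_sgrnas,
              min_mapped=0.60, max_zero=0.05):
    """Apply the two rule-of-thumb checks for a negative selection sample.

    total_sgrnas is the library size (assumed denominator for the
    zero-count percentage).
    """
    if reads <= 0 or total_sgrnas <= 0:
        raise ValueError("reads and total_sgrnas must be positive")
    mapped_frac = mapped / reads
    zero_frac = zero_count / total_sgrnas
    return {
        "mapped_fraction": mapped_frac,
        "zero_fraction": zero_frac,
        "mapped_ok": mapped_frac > min_mapped,
        "zero_ok": zero_frac < max_zero,
    }
```

For a positive selection sample, a failed zero-count check is not necessarily a problem, so only the mapped-read check should be taken at face value there.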
You can also inspect the results by taking a look at the comparison results, see the related question below.
A: One possible reason is that you saved your library file or control sgRNA file in txt or csv format using Microsoft software (such as Excel). The line break representation can differ between Windows and Linux/Mac systems, which causes problems when the program reads these files.
One solution is to open your txt file in Microsoft Excel, copy all the contents (Ctrl+A, Ctrl+C), paste them into a plain text editor like Vim (Ctrl+V), and save the file in plain txt format.
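As an alternative to copying and pasting, the line breaks can be converted directly. This is a minimal sketch (the function name is ours) that rewrites Windows-style (CRLF) and old Mac-style (CR) line breaks as Unix-style (LF) ones; it does not touch anything else in the file.

```python
def normalize_newlines(src, dst):
    """Copy `src` to `dst`, converting CRLF and bare CR line breaks to LF."""
    with open(src, "rb") as handle:
        data = handle.read()
    # Replace CRLF first so that the remaining CRs are genuinely bare.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    with open(dst, "wb") as handle:
        handle.write(data)
```

For example, normalize_newlines("library.csv", "library.unix.csv") writes a converted copy and leaves the original file untouched.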
A: The reason is that numpy and scipy use MKL and OpenBLAS. Both libraries use multiple CPUs to accelerate numeric calculations (e.g., matrix operations). To limit the number of CPUs to 1 per thread, set the OMP_NUM_THREADS environment variable on Linux systems. In other words, before running the mageck mle command, type the following command in the terminal:
export OMP_NUM_THREADS=1
This solution comes from the discussion here.
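If you launch MAGeCK from a Python script instead of the terminal, the same variable can be set for the child process only. The sketch below uses hypothetical helper names; the command list is whatever mageck mle invocation you would normally type.

```python
import os
import subprocess


def single_thread_env(base=None):
    """Return a copy of the environment with OMP_NUM_THREADS set to 1."""
    env = dict(os.environ if base is None else base)
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_single_threaded(command):
    """Run `command` (a list of arguments) with the restricted environment."""
    return subprocess.run(command, env=single_thread_env(), check=True)
```

Because the variable is set on a copy of the environment, other programs started from the same script are not affected.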
A: Since version 0.5.9, MAGeCK RRA introduces paired comparison between treatments and controls (--paired option). This option allows MAGeCK to make full use of paired samples to boost the statistical power. It is especially useful if the data between two (or more) replicates is poorly correlated, and you want to find top hits that are consistent between paired samples.
Paired samples are usually biological replicates that have treatment and control conditions independently. For example, you have two replicates (r1, r2), and for each replicate you perform screens on treatment and control conditions separately. In the end you have four samples (treatment_r1, treatment_r2, control_r1, control_r2).
You can now run MAGeCK RRA to compare treatment and control conditions, but add an additional --paired parameter to tell MAGeCK that (treatment_r1, control_r1) and (treatment_r2, control_r2) are paired:
mageck test -k count.txt -t treatment_r1,treatment_r2 -c control_r1,control_r2 --paired
In the --paired mode, the number of samples in -t and -c must match, and the samples must be listed in exactly the same order.
The way MAGeCK deals with paired samples is to consider sgRNAs in paired samples as independent sgRNAs; therefore, it is equivalent to doubling the number of sgRNAs per gene (if you have two paired samples). The assumption of independence does not always hold, especially if the correlation between replicates is high. If this is the case, it may introduce false positives. Therefore, use the --paired option only if the correlation between paired samples is low, and you want to find consistent signals between paired replicates.
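One way to guarantee that -t and -c have the same length and the same order is to build both lists from explicit (treatment, control) pairs. The helper below is only a sketch with a name of our choosing; it uses the mageck test options shown on this page (-k, -t, -c, -n and --paired).

```python
def build_paired_test_command(count_table, pairs, prefix=None):
    """Build a `mageck test --paired` argument list.

    `pairs` is a list of (treatment_label, control_label) tuples, one per
    replicate, so both lists have the same length and order by construction.
    """
    if not pairs:
        raise ValueError("at least one (treatment, control) pair is required")
    treatments = [treatment for treatment, _ in pairs]
    controls = [control for _, control in pairs]
    command = ["mageck", "test", "-k", count_table,
               "-t", ",".join(treatments),
               "-c", ",".join(controls),
               "--paired"]
    if prefix:
        command += ["-n", prefix]
    return command


if __name__ == "__main__":
    pairs = [("treatment_r1", "control_r1"), ("treatment_r2", "control_r2")]
    print(" ".join(build_paired_test_command("count.txt", pairs)))
```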
A: First of all, make sure your sample statistics look good (see the related question in "Counting sgRNAs from fastq files"). Next, take a look at the rankings of some well-known genes. In negative selection experiments, you would expect some ribosomal genes and well-known oncogenes (for example, MYC, RAS, etc.) near the top. In positive selection experiments, TP53 usually has a high ranking.
Besides visually inspecting top-ranked genes, a good validation is to run the pathway command to test on MSigDB KEGG pathways (see MSigDB website). In negative selection experiments (usually some condition compared with the day 0 condition), you would expect to see a set of essential pathways ranking at the top, like ribosome, spliceosome, proteasome and cell cycle genes. If you see these pathways coming out, this is a good sign that your experiments are working. The smaller their RRA lo_value and p values, the better.
A: There are a couple of reasons why the top-ranked genes may have a high FDR. First, many CRISPR/Cas9 libraries include few sgRNAs (<7) for each gene. Since some of them may have low cutting efficiency or off-target effects, there may not be enough statistical power to detect essential genes. Second, if there are too many comparisons (or genes), the multiple comparison adjustment may lead to a high FDR estimation. Also, MAGeCK employs a pretty stringent statistical framework to evaluate statistical significance, so its FDR estimation may be conservative.
There are a couple of things you can do to increase the sensitivity. First, try to filter out genes that you think are not hits before running MAGeCK; for example, remove genes that have extremely low expression, or genes that have very few targeting sgRNAs (<4). Second, if you have a list of negative control genes (genes that you think are not essential, like AAVS1), you can specify the corresponding sgRNA IDs using the --control-sgrna option (see below), thus allowing MAGeCK to have a better estimation of the null distribution. Third, if your replicates are paired samples, consider using the --paired option (see the question on paired samples above).
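To see why the number of comparisons matters, and why filtering genes beforehand can help, here is a small illustration using the generic Benjamini-Hochberg adjustment. It is a textbook sketch with made-up p values, not a reproduction of MAGeCK's internal FDR code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # One strong gene (hypothetical p value) among otherwise null genes.
    top_p = 1e-5
    for n_genes in (1000, 20000):
        pvalues = [top_p] + [1.0] * (n_genes - 1)
        print(n_genes, "genes tested, adjusted value of the top gene:",
              benjamini_hochberg(pvalues)[0])
```

In this toy setting the top gene has the same raw p value in both runs, but its adjusted value grows in proportion to the number of genes tested.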
A: This option tells MAGeCK to use provided negative control sgRNAs to generate the null distribution when calculating the p values. If this option is not specified, MAGeCK generates the null distribution of RRA scores by assuming all of the genes in the library are non-essential. This approach is sometimes over-conservative, and you can improve this if you know some genes are not essential. By providing the corresponding sgRNA IDs in the --control-sgrna option, MAGeCK will have a better estimation of p values.
In addition, you can use the list of negative control sgRNAs to do the normalization. If --norm-method control option is specified, the median factor used for normalization will be calculated based on control sgRNAs only, rather than all the sgRNAs (by default).
New since 0.5.9.3: We include a new demo (demo5) in the MAGeCK source code to demonstrate the usage of control-sgrnas. Besides, we have an additional --control-gene option to specify the control genes instead of control sgRNAs.
To use this option, you need to prepare a text file specifying the IDs of control sgRNAs, one line for one sgRNA ID. Here is an example of the file:
NonTargetingControlGuideForHuman_0001
NonTargetingControlGuideForHuman_0002
NonTargetingControlGuideForHuman_0003
NonTargetingControlGuideForHuman_0004
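Before passing such a file to --control-sgrna, it may be worth checking that every ID in it actually occurs among your library's sgRNA IDs. The sketch below assumes that the IDs must match exactly; the function names are ours, and how you obtain the list of library IDs depends on your library file format.

```python
def read_control_ids(path):
    """Read control sgRNA IDs, one per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def missing_from_library(control_ids, library_ids):
    """Return control IDs that do not appear among the library's sgRNA IDs."""
    known = set(library_ids)
    return [control_id for control_id in control_ids if control_id not in known]
```

An empty result from missing_from_library means every control ID was found; any IDs it returns are likely typos or come from a different library.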
There are several issues that you need to keep in mind:
A: MAGeCK will generate .R and .Rnw files even if the "--pdf-report" option is not specified. You can copy these files to a new computer where both R and pdflatex are properly installed, and use the following command to generate PDF files:
Rscript *.R
Note that for the count command, the median-normalized read count file (.median_normalized.csv) should also be copied to the same directory. For the test command, the gene summary file (.gene_summary.txt) should also be copied to the same directory.
A: You may get some error messages like this:
Error in texi2dvi("recount_countsummary.tex", pdf = TRUE) :
  Running 'texi2dvi' on 'recount_countsummary.tex' failed.
This may be due to the system compatibility issue of latex. You can still get some figures generated from MAGeCK, by adding the "--keep-tmp" option to keep intermediate files.
I noticed in my mageck test result that the neg|p-value are not consistent.
Why is that, please? I thought the rank was according to the p-values...
Question: mappedreads 0 | MAGeCK v0.5.9.4 | Ubuntu 16.04.6 LTS 64-bit
I am using a public dataset (PRJNA542321), and library is Addgene #1000000049 (file csv : id, gRNA.sequence, Gene).
When I try to run the mageck count function, the software gives me the read info (e.g., reads 21339717), but says that mappedreads is zero.
Do you have any suggestion?
Thanks in advance.
Question: MAGeCK results substantially different to DESeq2 results
Dear all,
I performed a CRISPR activator screen using the Calabrese library (Sanson et al. 2018). I sorted my cells at the flow cytometer regarding a phenotype population as treatment group (sample) and whole population as control group (control). I performed three biological replicates. After DNA-sequencing, I trimmed the samples using cutadapt yielding only the targeting sequence of the gRNA. To map the trimmed sequences to the reference sequence set and to obtain a count matrix, I performed MAGeCK count: mageck count -l library.txt -n Calabrese --sample-label Sample1,Sample2,Sample3,Ctrl1,Ctrl2,Ctrl3 --fastq sample1.fastq sample2.fastq sample3.fastq control1.fastq control2.fastq control3.fastq
I fed this count matrix into the attached R script for DESeq2 analysis as well as into MAGeCK test for the enrichment analysis (mageck test -k Calabrese.count.txt -t Sample1,Sample2,Sample3 -c Ctrl1,Ctrl2,Ctrl3 -n Calabrese). I plotted the results in a volcano plot with the -log10 of the false discovery rate (Benjamini-Hochberg) on the y-axis and the log2 fold change on the x-axis. I attached the plots as well. As you can see, I get lots of depleted as well as enriched sgRNAs with DESeq2 and only enriched sgRNAs with MAGeCK. On top of that, the FDRs from MAGeCK are much lower than those calculated with DESeq2. If I overlap the significant results from the two methods, only 13.4% of the hits are shared.
I don't know where these substantial differences come from. I checked at the set of negative control gRNAs and they look fine in both methods. Do you have an explanation or suggestions how I can approach this problem?
I appreciate any help! Thank you a lot,
Andrea
I noticed a problem with CNV normalization. For example, CNV data for the HT1080 cell line from Q4 21 (DepMap) works, but the most recent 22Q2 does not, and there is an error message:
INFO @ Thu, 06 Oct 2022 11:57:08: Performing copy number normalization ...
Traceback (most recent call last):
  File "/usr/local/bin/mageck", line 66, in <module>
    main();
  File "/usr/local/bin/mageck", line 43, in main
    args=crisprseq_parseargs();
  File "/usr/local/lib/python3.10/site-packages/mageck/argsParser.py", line 258, in crisprseq_parseargs
    mageckmle_main(parsedargs=args); # ignoring the script path, and the sub command
  File "/usr/local/lib/python3.10/site-packages/mageck/mlemageck.py", line 225, in mageckmle_main
    betascore_piecewisenorm(allgenedict,CN_celllabel,CN_arr,CN_celldict,CN_genedict,selectGenes=genes2correct)
  File "/usr/local/lib/python3.10/site-packages/mageck/cnv_normalization.py", line 125, in betascore_piecewisenorm
    opt_bp = optimize.minimize(leastsq_bp,2,bounds=((1,np.percentile(CN_vals,99.9)),))
  File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_minimize.py", line 699, in minimize
    res = _minimize_lbfgsb(fun, x0, args, jac, bounds,
  File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_lbfgsb_py.py", line 362, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 285, in fun_and_grad
    self._update_fun()
  File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 251, in _update_fun
    self._update_fun_impl()
  File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 155, in update_fun
    self.f = fun_wrapped(self.x)
  File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 137, in fun_wrapped
    fx = fun(np.copy(x), *args)
  File "/usr/local/lib/python3.10/site-packages/mageck/cnv_normalization.py", line 115, in leastsq_bp
    (slope,intercept) = linreg_bp(bp)
  File "/usr/local/lib/python3.10/site-packages/mageck/cnv_normalization.py", line 110, in linreg_bp
    stats.linregress(CN_vals[CN_vals<=bp],score_vals[CN_vals<=bp])
  File "/usr/local/lib/python3.10/site-packages/scipy/stats/_stats_mstats_common.py", line 153, in linregress
    raise ValueError("Inputs must not be empty.")
Anyone has idea what is going on?
Dear
I have a matter to submit. I'm trying to do a paired analysis, however, I'm getting this error.
Error: incorrect number of dimensions in line 2 (16) compared with the header line (4). Please double-check your read count table file.
This is the head of my count matrix, and it doesn't seem to have any apparent problem:
$ head COUNTS_paired_sor_T0.count.txt
sgRNA Gene SOR_r1 SOR_r1 SOR_r1 SOR_r2 SOR_r2 SOR_r2 SOR_r2 T0_r1 T0_r1 T0_r1 T0_r2 T0_r2 T0_r2 T0_r2 T0_r2 T0_r2
Pgd_sg155_1 Pgd 2989 995 1685 908 1127 1264 1196 1110 666 413 959 1391 1357 1235 1375 1366
Smn1_sg208_4 Smn1 4792 1854 2605 1333 1772 2158 2242 2046 1201 710 1577 2433 2574 2037 2010 2033
Cyp27a1_sg044_3 Cyp27a1 3210 1307 1746 885 713 562 642 660 671 362 819 1281 770 647 616 673
Gca_sg079_3 Gca 4815 1544 2527 1571 869 810 866 880 860 533 1240 1895 1218 748 1046 831
Gstk1_sg087_5 Gstk1 2998 1130 1819 1037 2214 2459 2508 2431 849 443 1029 1724 3030 2230 2501 2508
Ddo_sg048_3 Ddo 2765 1176 1780 911 863 787 885 884 628 313 793 1511 1250 834 910 894
Ehhadh_sg057_3 Ehhadh 5063 1674 3002 1512 2347 2594 2740 2944 1091 680 1533 3007 2856 2361 2621 2293
Mapk1_sg117_2 Mapk1 5369 2006 3483 1893 877 1216 934 1069 1273 709 1522 2837 1654 1260 1305 1267
Rbm6_sg183_6 Rbm6 7618 2926 4661 2568 2136 2332 2215 2283 1434 790 1914 3226 2617 2044 2337 2007
the commands I used are the following:
mageck test -k COUNTS_paired_sor_T0.count.txt -t SOR_r1 SOR_r1 SOR_r1 SOR_r1 SOR_r2 SOR_r2 SOR_r2 SOR_r2 -c T0_r1 T0_r1 T0_r1 T0_r1 T0_r2 T0_r2 T0_r2 T0_r2 -n sor_T0_paired --paired
Can anyone suggest a solution?
thank you
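One quick thing to rule out with this kind of error is a mismatch between the number of fields in the header and in the data rows. Here is a minimal sketch, assuming the count table is tab-delimited (a header pasted with spaces instead of tabs would then be counted as fewer fields); the function name is ours and this is not how MAGeCK itself performs the check.

```python
def check_count_table(path, sep="\t"):
    """Return (number of header fields, [(line_number, n_fields), ...]).

    The list contains every data row whose field count differs from the header.
    """
    problems = []
    with open(path) as handle:
        header = handle.readline().rstrip("\r\n").split(sep)
        for number, line in enumerate(handle, start=2):
            fields = line.rstrip("\r\n").split(sep)
            if len(fields) != len(header):
                problems.append((number, len(fields)))
    return len(header), problems
```

Calling check_count_table("COUNTS_paired_sor_T0.count.txt") would show whether the header and the rows agree under that assumption.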