
QA

Wei Li

Q and A

You can always ask questions in our Google group. Your questions are often shared by other users, so please help us improve our algorithm by joining the Google group and asking your questions there!

Installation problems

I encountered an error after installation: "ImportError: No module named mageck". What is the problem?

A: You probably installed MAGeCK into your own directory, which Python does not search by default. The solution is to set the PYTHONPATH environment variable: see install/#setting-up-the-environment-variables for more details.

Where is MAGeCK binary installed?

A: If you add the "--user" option during installation, the mageck executable is usually placed in a local directory ($HOME/bin or $HOME/.local/bin). Without this option, mageck is installed in the system bin directory (/usr/bin or /usr/sbin).

There are two ways you can check the path of MAGeCK. You can either type

which mageck

to determine the path of the mageck executable. Or, at the end of the installation, you will see a few lines of the log like this:

copying build/scripts-2.7/mageck -> /Users/john/.local/bin
changing mode of /Users/john/.local/bin/mageck to 755
running install_data
copying bin/RRA -> /Users/john/.local/bin

That means your mageck is installed at /Users/john/.local/bin. On the other hand, if you see a message like this:

copying build/scripts-2.7/mageck -> /Users/john/Library/Python/2.7/bin
changing mode of /Users/john/Library/Python/2.7/bin/mageck to 755
running install_data
copying bin/RRA -> /Users/john/Library/Python/2.7/bin

That means your mageck is installed at /Users/john/Library/Python/2.7/bin.

Depending on your system, the path may look like one of the following:

  1. /Users/john/.local/bin
  2. /Users/john/Library/Python/2.7/bin
  3. /Users/john/bin
  4. /home/john/.pyenv/versions/2.7.13/bin

Where is MAGeCK python module installed?

A: You can use a similar approach to locate the MAGeCK python module, but look for a pattern like python2.7/site-packages. During installation, if you see a message like this:

copying bin/mageckGSEA -> /home/john/.pyenv/versions/2.7.13/bin
running install_egg_info
Removing /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages/mageck-0.5.6-py2.7.egg-info
Writing /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages/mageck-0.5.6-py2.7.egg-info

That means your MAGeCK python module is installed in /home/john/.pyenv/versions/2.7.13/lib/python2.7/site-packages.
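
As a cross-check, you can also ask Python itself where a module is loaded from. The following is a minimal sketch: module_dir is a hypothetical helper name (not part of MAGeCK), and it assumes that the interpreter you pass in is the same one that runs mageck.

```shell
# module_dir: hypothetical helper, not part of MAGeCK.
# Prints the directory from which the given Python module is imported.
# Usage: module_dir <module-name> [python-interpreter]
module_dir() {
    "${2:-python3}" -c '
import importlib, os, sys
module = importlib.import_module(sys.argv[1])
print(os.path.dirname(os.path.abspath(module.__file__)))
' "$1"
}

# Example (assumes mageck is importable by this interpreter):
# module_dir mageck python2.7
```

If the printed directory belongs to an installation you did not expect (for example, an older copy), that is likely the copy of the library your mageck command is using.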

I use conda to install the latest version of MAGeCK, but my system still calls an older version of MAGeCK. What is the problem?

A: This usually happens when you have both the conda version of MAGeCK and a previously installed version. Even if your "mageck" command comes from conda, the libraries may still come from the previously installed MAGeCK. To solve this problem, manually install the latest version of MAGeCK so that it replaces the older one.

I don't want to run the conda MAGeCK version, but instead the version I installed by myself. How can I do that?

A: There are two different solutions to do this.

Solution 1: Uninstall the conda MAGeCK version using the following command:

conda uninstall mageck

You can always re-install MAGeCK later.

To avoid frequently uninstalling and re-installing the software, consider using conda environments. For example, you can install the conda version of MAGeCK in a dedicated environment, so that it is available only while that environment is activated.

Here is an example. First, create a python 3 environment named "mageckenv":

conda create -n mageckenv anaconda python=3

Then activate the environment using the following command:

source activate mageckenv

Now, install mageck under that environment:

conda install -c bioconda mageck

You can use the MAGeCK conda version under the mageckenv environment now. To disable it, simply deactivate the environment:

source deactivate

Solution 2: The conda MAGeCK runs under python 3, while the MAGeCK from sourceforge and bitbucket runs under python 2. So the best way to run your own installed version instead of the conda version is to create a python 2 conda environment and run mageck under that environment.

To create a python 2 environment when you have miniconda3 (where MAGeCK-VISPR is hosted), type the following command:

conda create -n py2k anaconda python=2

After that, you can activate the environment by typing

source activate py2k

If you run mageck now, it will invoke the installed version. You can also deactivate your environment by typing:

source deactivate

You may also need to manually edit the PATH variable such that the system will run your local mageck first. To do this, first locate the directory of mageck from your own installation (see the question "where is MAGeCK binary installed?"). If it's in /Users/john/.local/bin, then edit the PATH variable as follows:

export PATH=/Users/john/.local/bin:$PATH

Then you should be able to run your own installed mageck, not the conda mageck. For more information, go to Setting up the environment variables.

Using MAGeCK

How to deal with biological replicates and technical replicates?

A: Usually you can pool the read counts of technical replicates of the same sample. To do this, use a comma (,) to separate the fastq files of the technical replicates of the same sample in the --fastq option. For example, "--fastq sample1_replicate1.fastq,sample1_replicate2.fastq sample2_replicate1.fastq,sample2_replicate2.fastq" specifies two samples with 2 technical replicates each.
For biological replicates, treat them as separate samples and use them together in the comparison, so that MAGeCK can estimate the variance across these samples. For example, in the test command, "-t sample1_bio_replicate1,sample1_bio_replicate2 -c sample2_bio_replicate1,sample2_bio_replicate2" compares 2 samples (with 2 biological replicates in each sample).
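
The two cases above can be sketched as complete commands. This is only an illustration: join_comma is a hypothetical helper, library.txt, count.txt and the "demo" prefix are placeholder names, and the commands are printed with echo (a dry run) rather than executed.

```shell
# join_comma: hypothetical helper that joins its arguments with commas.
join_comma() {
    local IFS=,
    echo "$*"
}

# Technical replicates: pool the fastq files of each sample with commas.
sample1=$(join_comma sample1_replicate1.fastq sample1_replicate2.fastq)
sample2=$(join_comma sample2_replicate1.fastq sample2_replicate2.fastq)

# Dry run: print the count command; remove "echo" to actually run it.
echo mageck count -l library.txt -n demo \
    --sample-label sample1,sample2 \
    --fastq "$sample1" "$sample2"

# Biological replicates: separate samples, grouped with commas in -t and -c.
# (count.txt is a placeholder for a count table containing these labels.)
echo mageck test -k count.txt -n demo \
    -t "$(join_comma sample1_bio_replicate1 sample1_bio_replicate2)" \
    -c "$(join_comma sample2_bio_replicate1 sample2_bio_replicate2)"
```

Note that a space separates different samples in --fastq, while a comma joins the files (or labels) that belong together.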

The --trim-5 option can only trim a fixed length of nucleotides before sgRNA, but what if the trimming length is different in different reads?

A: Since version 0.5.6, MAGeCK can determine the trimming length automatically, even if the length differs between reads within the same fastq file. Alternatively, you can use cutadapt to trim adaptor sequences of variable length before running MAGeCK.

How do I get the simple statistics of the fastq files?

A: Since version 0.5, MAGeCK produces a "countsummary.txt" file containing all the statistics of the fastq files. If you use the "--pdf-report" option, these statistics are also included in the generated PDF file.

The statistics can also be found in the log file (for the run and count commands). Here is an example of the log file generated by the count command (the last few lines):

INFO  @ Mon, 02 Feb 2015 08:12:15: Summary of file sample1_R1.fastq: 
INFO  @ Mon, 02 Feb 2015 08:12:15: reads        45631055 
INFO  @ Mon, 02 Feb 2015 08:12:15: mappedreads  34300176 
INFO  @ Mon, 02 Feb 2015 08:12:15: zerosgrnas   119315 
INFO  @ Mon, 02 Feb 2015 08:12:15: label        sample_1 
INFO  @ Mon, 02 Feb 2015 08:12:15: Summary of file sample2_R1.fastq: 
INFO  @ Mon, 02 Feb 2015 08:12:15: reads        36344414 
INFO  @ Mon, 02 Feb 2015 08:12:15: mappedreads  27042629 
INFO  @ Mon, 02 Feb 2015 08:12:15: zerosgrnas   119002 
INFO  @ Mon, 02 Feb 2015 08:12:15: label        sample_2

It provides the total number of reads, the number of mapped reads, the number of sgRNAs with 0 read counts, and the sample label of the fastq file.
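
If you want the mapping percentage per sample, it can be derived from these log lines. The sketch below assumes that the log has exactly the layout shown above (a "reads", a "mappedreads" and a "label" line per file, in that order); mapping_rate is a hypothetical helper name, not a MAGeCK command.

```shell
# mapping_rate: hypothetical helper, not part of MAGeCK.
# Prints "<label> <percentage of mapped reads>" for each sample in a count log.
# Usage: mapping_rate <logfile>
mapping_rate() {
    awk '
        NF < 2 { next }                       # skip empty or one-word lines
        $(NF-1) == "reads"       { reads = $NF }
        $(NF-1) == "mappedreads" { mapped = $NF }
        $(NF-1) == "label" && reads > 0 {
            printf "%s %.1f\n", $NF, 100 * mapped / reads
        }
    ' "$1"
}
```

For the example log above, this reports 75.2 for sample_1 and 74.4 for sample_2.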

How do I know the quality of my samples?

A: We published a paper (MAGeCK-VISPR) to describe some quality control (QC) terms to help you determine the quality of your samples.

For simple QC terms, you can just take a look at the sample statistics. Generally, in a good negative selection sample, (1) the mapped reads should be over 60 percent of the total number of reads, and (2) the number of zero-count sgRNAs should be small (<5%, and preferably <1%). One exception is positive selection experiments, where the number of zero-count sgRNAs may be much higher, but the percentage of mapped reads should still be reasonably high.
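
These two rules of thumb can be written down as a small check. This is a sketch only: qc_check is a hypothetical helper, and the total number of sgRNAs in the library is something you need to supply yourself, since it is not part of the log lines shown earlier.

```shell
# qc_check: hypothetical helper, not part of MAGeCK.
# Applies the thresholds for negative selection samples described above.
# Usage: qc_check <reads> <mappedreads> <zerosgrnas> <total_sgrnas>
qc_check() {
    awk -v r="$1" -v m="$2" -v z="$3" -v t="$4" 'BEGIN {
        map_pct  = 100 * m / r      # percentage of mapped reads
        zero_pct = 100 * z / t      # percentage of zero-count sgRNAs
        printf "mapped %.1f%% %s\n", map_pct, (map_pct > 60 ? "ok" : "low")
        printf "zero-count %.1f%% %s\n", zero_pct, \
            (zero_pct < 1 ? "ok" : (zero_pct < 5 ? "acceptable" : "high"))
    }'
}

# Example with made-up numbers:
# qc_check 1000 700 2 100
```

For positive selection experiments, only the first line of the output is meaningful, as noted above.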

You can also inspect the results by taking a look at the comparison results, see the related question below.

The program cannot read library file or control sgRNA file, but they look fine when I manually check these files. What happened?

A: One possible reason is that you saved your library file or control sgRNA file in txt or csv format using Microsoft software (such as Excel). Line breaks are represented differently on Windows and on Linux/Mac systems, which can prevent the program from reading these files correctly.

One solution is to open your txt file in Microsoft Excel, copy all the contents (Ctrl+A, Ctrl+C), paste them into a plain text editor like Vim (Ctrl+V), and save the file in plain txt format.
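
If you prefer the command line, the same clean-up can be done with standard tools. The sketch below assumes that the file uses Windows-style (CRLF) line breaks; to_unix is a hypothetical helper name, and the file names in the example are placeholders.

```shell
# to_unix: hypothetical helper, not part of MAGeCK.
# Removes carriage returns so that CRLF line breaks become plain LF.
# Usage: to_unix <input-file> <output-file>
to_unix() {
    tr -d '\r' < "$1" > "$2"
}

# Example:
# to_unix library_from_excel.csv library.csv
```

Files with CR-only line breaks (written by some older Mac programs) need a different fix, because deleting the carriage returns would merge all lines into one; in that case, use tr '\r' '\n' instead.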

The MLE module uses more CPU resources than expected, even if I specify the number of threads in --threads option. How to solve this problem?

A: The reason is that numpy and scipy use MKL and OpenBLAS. Both libraries use multiple CPUs to accelerate numeric calculations (e.g., matrix operations). To limit the number of CPUs to 1 per thread, set the OMP_NUM_THREADS environment variable on Linux systems. In other words, before running the mageck mle command, type the following command in the terminal:

export OMP_NUM_THREADS=1

This solution comes from the discussion here.

How to perform paired analysis?

A: Since version 0.5.9, MAGeCK RRA introduces paired comparison between treatments and controls (--paired option). This option allows MAGeCK to make full use of paired samples to boost the statistical power. It is especially useful if the data between two (or more) replicates is poorly correlated, and you want to find top hits that are consistent between paired samples.

Paired samples are usually biological replicates that have treatment and control conditions independently. For example, you have two replicates (r1, r2), and for each replicate you perform screens on treatment and control conditions separately. In the end you have four samples (treatment_r1, treatment_r2, control_r1, control_r2).

You can now run MAGeCK RRA to compare treatment and control conditions, but add an additional --paired parameter to tell MAGeCK that (treatment_r1, control_r1) and (treatment_r2, control_r2) are paired:

mageck test -k count.txt -t treatment_r1,treatment_r2 -c control_r1,control_r2 --paired

In --paired mode, the number of samples in -t and -c must match, and the samples must be listed in exactly the same order.

MAGeCK deals with paired samples by treating the sgRNAs in each pair as independent sgRNAs; therefore, it is equivalent to doubling the number of sgRNAs per gene (if you have two paired samples). This assumption of independence does not always hold, especially if the correlation between replicates is high, in which case it may introduce false positives. Therefore, use the --paired option only if the correlation between paired samples is low and you want to find signals that are consistent between paired replicates.
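
Before running in --paired mode, you can quickly verify that both label lists have the same length. This is a minimal sketch with hypothetical helper names (count_labels, check_paired); it only checks the number of comma-separated labels, not whether the order of the pairs is correct.

```shell
# count_labels: number of comma-separated labels in the first argument.
count_labels() {
    echo "$1" | awk -F, '{ print NF }'
}

# check_paired: hypothetical helper, not part of MAGeCK.
# Usage: check_paired <treatment-labels> <control-labels>
check_paired() {
    if [ "$(count_labels "$1")" -eq "$(count_labels "$2")" ]; then
        echo "ok"
    else
        echo "mismatch"
        return 1
    fi
}

# Example:
# check_paired treatment_r1,treatment_r2 control_r1,control_r2
```

Note that the labels within -t and -c are separated by commas, not spaces.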

Interpreting results

How do I know if my experiments work well?

A: First of all, make sure your sample statistics look good (see the related question in "Counting sgRNAs from fastq files"). Next, take a look at the rankings of some well-known genes. In negative selection experiments, you should expect some ribosomal genes and well-known oncogenes (for example, MYC, RAS, etc.) to rank near the top. In positive selection experiments, TP53 usually has a high ranking.

Besides visually inspecting top-ranked genes, a good validation is to run the pathway command to test MSigDB KEGG pathways (see the MSigDB website). In negative selection experiments (usually some condition compared with the day 0 condition), you should expect to see a set of essential pathways ranking at the top, such as ribosome, spliceosome, proteasome and cell cycle genes. If you see these pathways coming out, this is a good sign that your experiments are working. The smaller their RRA lo_value and p values, the better.

I see very few genes that are below a certain FDR cutoff (like 0.10). Why is that, and what should I do?

A: There are a couple of reasons why the top-ranked genes may have a high FDR. First, many CRISPR/Cas9 libraries have only a few sgRNAs (<7) for each gene. Since some of them may have low cutting efficiency or off-target effects, there may not be enough statistical power to detect essential genes. Second, if there are too many comparisons (or genes), the multiple comparison adjustment may lead to a high FDR estimate. Also, MAGeCK employs a fairly stringent statistical framework to evaluate statistical significance, so its FDR estimate may be conservative.

There are a couple of things you can do to increase the sensitivity. First, try to filter out genes that you think are not hits before running MAGeCK; for example, remove genes that have extremely low expression or genes that have very few targeting sgRNAs (<4). Second, if you have a list of negative control genes (genes that you think are not essential, like AAVS1), you can specify the corresponding sgRNA IDs using the --control-sgrna option (see below), thus allowing MAGeCK to better estimate the null distribution. Third, if your replicates are paired samples, consider using the --paired option (see "How to perform paired analysis?" above).

What does the --control-sgrna CONTROL_SGRNA option do? How to use this option?

A: This option tells MAGeCK to use provided negative control sgRNAs to generate the null distribution when calculating the p values. If this option is not specified, MAGeCK generates the null distribution of RRA scores by assuming all of the genes in the library are non-essential. This approach is sometimes over-conservative, and you can improve this if you know some genes are not essential. By providing the corresponding sgRNA IDs in the --control-sgrna option, MAGeCK will have a better estimation of p values.

In addition, you can use the list of negative control sgRNAs for normalization. If the --norm-method control option is specified, the median factor used for normalization is calculated from the control sgRNAs only, rather than from all the sgRNAs (the default).

New since 0.5.9.3: We include a new demo (demo5) in the MAGeCK source code to demonstrate the usage of control-sgrnas. Besides, we have an additional --control-gene option to specify the control genes instead of control sgRNAs.

To use this option, you need to prepare a text file specifying the IDs of control sgRNAs, one line for one sgRNA ID. Here is an example of the file:

NonTargetingControlGuideForHuman_0001
NonTargetingControlGuideForHuman_0002
NonTargetingControlGuideForHuman_0003
NonTargetingControlGuideForHuman_0004

There are several issues that you need to keep in mind:

  • You should have a sufficient number of negative control guides (>100 recommended) for accurate p value estimation and normalization.
  • It is known that for growth-based screens, non-targeting controls may lead to many false positives (e.g., Morgens et al. 2017). Use non-targeting controls carefully.
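
If your library file already marks the control guides, the control sgRNA file can be generated instead of typed by hand. The sketch below rests on several assumptions: the library is a comma-separated file with the columns id, sequence and gene (in that order), all control guides share one gene name ("NonTargeting" here is a made-up example), and make_control_list is a hypothetical helper name.

```shell
# make_control_list: hypothetical helper, not part of MAGeCK.
# Writes the IDs (column 1) of all sgRNAs whose gene (column 3) equals the
# given control gene name, one ID per line.
# Usage: make_control_list <library.csv> <control-gene-name> <output.txt>
make_control_list() {
    awk -F, -v gene="$2" '$3 == gene { print $1 }' "$1" > "$3"
}

# Example (file names are placeholders):
# make_control_list library.csv NonTargeting control_sgrna.txt
# mageck test -k count.txt -t treatment -c control \
#     --control-sgrna control_sgrna.txt --norm-method control
```

After generating the file, count its lines (for example with wc -l) to confirm that you have enough control guides, as recommended above.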

Visualization

The test or count command is successful but I have some problems producing the PDF file. How can I generate the PDF file?

A: MAGeCK will generate .R and .Rnw files even if the "--pdf-report" option is not specified. You can copy these files to another computer where both R and pdflatex are properly installed, and use the following command to generate the PDF files:

Rscript *.R

Note that for the count command, the median-normalized read count file (.median_normalized.csv) should also be copied to the same directory. For the test command, the gene summary file (.gene_summary.txt) should also be copied to the same directory.

I run into issues of generating pdf files using latex.

A: You may get some error messages like this:

Error in texi2dvi("recount_countsummary.tex", pdf = TRUE) :  
Running 'texi2dvi' on 'recount_countsummary.tex' failed.

This may be due to a system compatibility issue with latex. You can still get some of the figures generated by MAGeCK by adding the "--keep-tmp" option to keep intermediate files.


Discussion

  • Jon Xu

    Jon Xu - 2020-06-23

    I noticed in my mageck test result that the neg|p-value are not consistent.
    Why is that, please? I thought the rank was according to the p-values...

     
  • Fiammetta Falcone

    Question: mappedreads 0 | MAGeCK v0.5.9.4 | Ubuntu 16.04.6 LTS 64-bit

    I am using a public dataset (PRJNA542321), and library is Addgene #1000000049 (file csv : id, gRNA.sequence, Gene).

    When I am trying to run the mageck count function , the software give me the reads info, ex: reads 21339717 , but said that mappedreads are zero.
    Do you have any suggestion?
    Thanks in advance.

     
  • Andrea Neuner

    Andrea Neuner - 2021-08-14

    Question: MAGeCK results substantially different to DESeq2 results

    Dear all,

    I performed a CRISPR activator screen using the Calabrese library (Sanson et al. 2018). I sorted my cells at the flow cytometer regarding a phenotype population as treatment group (sample) and whole population as control group (control). I performed three biological replicates. After DNA-sequencing, I trimmed the samples using cutadapt yielding only the targeting sequence of the gRNA. To map the trimmed sequences to the reference sequence set and to obtain a count matrix, I performed MAGeCK count: mageck count -l library.txt -n Calabrese --sample-label Sample1,Sample2,Sample3,Ctrl1,Ctrl2,Ctrl3 --fastq sample1.fastq sample2.fastq sample3.fastq control1.fastq control2.fastq control3.fastq
    This count matrix I feeded into the attached R script for DESeq2 analysis as well as into MAGeCK test for the enrichment analysis (mageck test -k Calabrese.count.txt -t Sample1,Sample2,Sample3 -c Ctrl1,Ctrl2,Ctrl3 -n Calabrese). I plotted the results I got in a Volcano Plot with the -log10 of the false discovery rate (Benjamini-Hochberg) at the y-axis and the log2 fold change at the x-axis. I attached the plots as well. As you can see, I get lots of depleted as well as enriched sgRNAs with DESeq2 and only enriched sgRNAs with MAGeCK. On top, the FDRs of MAGeCK is much lower than those calculated with DESeq2. If I overlap the significant results I get by the two different methods, only 13.4% hits are shared.

    I don't know where these substantial differences come from. I checked at the set of negative control gRNAs and they look fine in both methods. Do you have an explanation or suggestions how I can approach this problem?

    I appreciate any help! Thank you a lot,
    Andrea

     
  • MiC

    MiC - 2022-10-06

    I noticed problem with using CNV normalization. For example CNV data for HT1080 cell line form Q4 21 (depmap) works but most recent 22Q2 doesn't and there is error message:

    INFO @ Thu, 06 Oct 2022 11:57:08: Performing copy number normalization ...
    Traceback (most recent call last):
    File "/usr/local/bin/mageck", line 66, in <module>
    main();
    File "/usr/local/bin/mageck", line 43, in main
    args=crisprseq_parseargs();
    File "/usr/local/lib/python3.10/site-packages/mageck/argsParser.py", line 258, in crisprseq_parseargs
    mageckmle_main(parsedargs=args); # ignoring the script path, and the sub command
    File "/usr/local/lib/python3.10/site-packages/mageck/mlemageck.py", line 225, in mageckmle_main
    betascore_piecewisenorm(allgenedict,CN_celllabel,CN_arr,CN_celldict,CN_genedict,selectGenes=genes2correct)
    File "/usr/local/lib/python3.10/site-packages/mageck/cnv_normalization.py", line 125, in betascore_piecewisenorm
    opt_bp = optimize.minimize(leastsq_bp,2,bounds=((1,np.percentile(CN_vals,99.9)),))
    File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_minimize.py", line 699, in minimize
    res = _minimize_lbfgsb(fun, x0, args, jac, bounds,
    File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_lbfgsb_py.py", line 362, in _minimize_lbfgsb
    f, g = func_and_grad(x)
    File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 285, in fun_and_grad
    self._update_fun()
    File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 251, in _update_fun
    self._update_fun_impl()
    File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 155, in update_fun
    self.f = fun_wrapped(self.x)
    File "/usr/local/lib/python3.10/site-packages/scipy/optimize/_differentiable_functions.py", line 137, in fun_wrapped
    fx = fun(np.copy(x), *args)
    File "/usr/local/lib/python3.10/site-packages/mageck/cnv_normalization.py", line 115, in leastsq_bp
    (slope,intercept) = linreg_bp(bp)
    File "/usr/local/lib/python3.10/site-packages/mageck/cnv_normalization.py", line 110, in linreg_bp
    stats.linregress(CN_vals[CN_vals<=bp],score_vals[CN_vals<=bp])
    File "/usr/local/lib/python3.10/site-packages/scipy/stats/_stats_mstats_common.py", line 153, in linregress
    raise ValueError("Inputs must not be empty.")

    I'm using function to prepare file for CNV normalization so format of this file is exactly the same.  Difference is only in cn numbers....
    

    Anyone has idea what is going on?

    Thanks
    MC
    
     
  • Pasquale

    Pasquale - 2023-09-11

    Dear
    I have a matter to submit. I'm trying to do a paired analysis, however, I'm getting this error.
    Error: incorrect number of dimensions in line 2 (16) compared with the header line (4). Please double-check your read count table file.
    This is the header of my count matrix and it doesn't seem to have apparently problem

    $ head COUNTS_paired_sor_T0.count.txt
    sgRNA Gene SOR_r1 SOR_r1 SOR_r1 SOR_r2 SOR_r2 SOR_r2 SOR_r2 T0_r1 T0_r1 T0_r1 T0_r2 T0_r2 T0_r2 T0_r2 T0_r2 T0_r2
    Pgd_sg155_1 Pgd 2989 995 1685 908 1127 1264 1196 1110 666 413 959 1391 1357 1235 1375 1366
    Smn1_sg208_4 Smn1 4792 1854 2605 1333 1772 2158 2242 2046 1201 710 1577 2433 2574 2037 2010 2033
    Cyp27a1_sg044_3 Cyp27a1 3210 1307 1746 885 713 562 642 660 671 362 819 1281 770 647 616 673
    Gca_sg079_3 Gca 4815 1544 2527 1571 869 810 866 880 860 533 1240 1895 1218 748 1046 831
    Gstk1_sg087_5 Gstk1 2998 1130 1819 1037 2214 2459 2508 2431 849 443 1029 1724 3030 2230 2501 2508
    Ddo_sg048_3 Ddo 2765 1176 1780 911 863 787 885 884 628 313 793 1511 1250 834 910 894
    Ehhadh_sg057_3 Ehhadh 5063 1674 3002 1512 2347 2594 2740 2944 1091 680 1533 3007 2856 2361 2621 2293
    Mapk1_sg117_2 Mapk1 5369 2006 3483 1893 877 1216 934 1069 1273 709 1522 2837 1654 1260 1305 1267
    Rbm6_sg183_6 Rbm6 7618 2926 4661 2568 2136 2332 2215 2283 1434 790 1914 3226 2617 2044 2337 2007
    the commands I used are the following:
    mageck test -k COUNTS_paired_sor_T0.count.txt -t SOR_r1 SOR_r1 SOR_r1 SOR_r1 SOR_r2 SOR_r2 SOR_r2 SOR_r2 -c T0_r1 T0_r1 T0_r1 T0_r1 T0_r2 T0_r2 T0_r2 T0_r2 -n sor_T0_paired --paired

    Can anyone suggest a solution?
    thank you

     
