The main entry point of MAGeCK is the mageck program, which provides several subcommands: test, count, pathway, mle, and run.
There is also a plot subcommand, which plots figures for the genes you are interested in from the test results.
The test subcommand tests and ranks sgRNAs and genes based on the read count tables provided.
usage:
usage: mageck test [-h] -k COUNT_TABLE (-t TREATMENT_ID | --day0-label DAY0_LABEL) [-c CONTROL_ID] [--paired] [--norm-method {none,median,total,control}] [--gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD] [--adjust-method {fdr,holm,pounds}] [--variance-estimation-samples VARIANCE_ESTIMATION_SAMPLES] [--sort-criteria {neg,pos}] [--remove-zero {none,control,treatment,both,any}] [--remove-zero-threshold REMOVE_ZERO_THRESHOLD] [--pdf-report] [--gene-lfc-method {median,alphamedian,mean,alphamean,secondbest}] [-n OUTPUT_PREFIX] [--control-sgrna CONTROL_SGRNA] [--normcounts-to-file] [--skip-gene SKIP_GENE] [--keep-tmp] [--additional-rra-parameters ADDITIONAL_RRA_PARAMETERS] [--cnv-norm CNV_NORM] [--cell-line CELL_LINE]
required arguments:
Parameter | Explanation |
---|---|
-k COUNT_TABLE, --count-table COUNT_TABLE | Provide a tab-separated count table instead of sam files. Each line in the table should include sgRNA name (1st column), targeting gene (2nd column) and read counts in each sample. See input/#sgrna-read-count-file for a detailed description. |
-t TREATMENT_ID, --treatment-id TREATMENT_ID | Sample label or sample index (0 as the first sample) in the count table as treatment experiments, separated by comma (,). If sample label is provided, the labels must match the labels in the first line of the count table; for example, "HL60.final,KBM7.final". For sample index, "0,2" means the 1st and 3rd samples are treatment experiments. See input/#sample-index for a detailed description. |
--day0-label DAY0_LABEL | Specify the label for control sample (usually day 0 or plasmid). For every other sample label, the module will treat it as a treatment condition and compare with control sample. |
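
The label-or-index convention for -t (and -c) can be sketched in a few lines of Python. This is an illustrative sketch only, not MAGeCK's own parsing code: the function name `resolve_samples` and the header column names `sgRNA` and `Gene` are assumptions made for the example.

```python
from typing import List


def resolve_samples(spec: str, sample_labels: List[str]) -> List[int]:
    """Map a comma-separated list of labels or 0-based indices to sample positions.

    `sample_labels` holds the sample columns only, i.e. the header of the
    count table without the sgRNA and gene columns.
    """
    tokens = [tok.strip() for tok in spec.split(",") if tok.strip()]
    if not tokens:
        raise ValueError("empty sample specification")
    # Treat the spec as indices only when every token is a non-negative integer.
    if all(tok.isdigit() for tok in tokens):
        indices = [int(tok) for tok in tokens]
        for idx in indices:
            if idx >= len(sample_labels):
                raise ValueError(f"sample index {idx} is out of range")
        return indices
    # Otherwise every token must match a label in the header exactly.
    missing = [tok for tok in tokens if tok not in sample_labels]
    if missing:
        raise ValueError(f"labels not found in the count table: {missing}")
    return [sample_labels.index(tok) for tok in tokens]


# Hypothetical header line; the first two column names are assumptions.
header_line = "sgRNA\tGene\tHL60.initial\tKBM7.initial\tHL60.final\tKBM7.final"
samples = header_line.split("\t")[2:]
print(resolve_samples("HL60.final,KBM7.final", samples))
print(resolve_samples("0,2", samples))
```

In this sketch, "0,2" selects the 1st and 3rd sample columns, matching the index convention described in the table above.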
optional general arguments:
Parameter | Explanation |
---|---|
-h, --help | show this help message and exit |
-c CONTROL_ID, --control-id CONTROL_ID | Sample label or sample index in the count table as control experiments, separated by comma (,). Default is all the samples not specified in treatment experiments. See input/#sample-index for a detailed description. |
--paired | Paired sample comparisons. In this mode, the number of samples in -t and -c must match, and the samples must be listed in the same order. For example, "-t HL60.final,KBM7.final -c HL60.initial,KBM7.initial". |
--norm-method {none,median,total,control} | Method for normalization, default median. If control is specified, the size factor will be estimated using control sgRNAs specified in --control-sgrna option. |
--gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD | FDR threshold for gene test, default 0.25. |
--adjust-method {fdr,holm,pounds} | Method for sgRNA-level p-value adjustment: false discovery rate (fdr), Holm's method (holm), or Pounds's method (pounds). |
--variance-estimation-samples VARIANCE_ESTIMATION_SAMPLES | Sample label or sample index for estimating variances, separated by comma (,). See -t/--treatment-id option for specifying samples. |
--sort-criteria {neg,pos} | Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection. |
--remove-zero {none,control,treatment,both,any} | Whether to remove zero-count sgRNAs in control and/or treatment experiments. Default: none (do not remove those zero-count sgRNAs). |
--pdf-report | Generate pdf report of the analysis. |
--gene-lfc-method {median,alphamedian,mean,alphamean,secondbest} | Method to calculate gene log fold changes (LFC) from sgRNA LFCs. Available methods include the median/mean of all sgRNAs (median/mean), the median/mean of the sgRNAs that are ranked in front of the alpha cutoff in RRA (alphamedian/alphamean), or the sgRNA that has the second strongest LFC (secondbest). In the alphamedian/alphamean case, the number of sgRNAs corresponds to the "goodsgrna" column in the output, and the gene LFC will be set to 0 if no sgRNA is in front of the alpha cutoff. Default median. (new since v0.5.5) |
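
The --gene-lfc-method choices can be illustrated with a small Python sketch. It is a reading of the descriptions above, not MAGeCK's implementation: the function name `gene_lfc` and the `good` mask (marking sgRNAs ranked in front of the RRA alpha cutoff) are hypothetical, and "second strongest" is assumed here to mean the second largest absolute LFC.

```python
from statistics import mean, median
from typing import Optional, Sequence


def gene_lfc(sgrna_lfcs: Sequence[float],
             method: str = "median",
             good: Optional[Sequence[bool]] = None) -> float:
    """Summarise the sgRNA log fold changes of one gene into a gene-level LFC."""
    if not sgrna_lfcs:
        raise ValueError("a gene needs at least one sgRNA")
    if method == "median":
        return median(sgrna_lfcs)
    if method == "mean":
        return mean(sgrna_lfcs)
    if method in ("alphamedian", "alphamean"):
        if good is None:
            raise ValueError("alpha methods need the mask of 'good' sgRNAs")
        selected = [lfc for lfc, keep in zip(sgrna_lfcs, good) if keep]
        if not selected:
            return 0.0  # no sgRNA in front of the alpha cutoff
        return median(selected) if method == "alphamedian" else mean(selected)
    if method == "secondbest":
        # Assumption: "strongest" is measured by absolute LFC.
        ranked = sorted(sgrna_lfcs, key=abs, reverse=True)
        return ranked[1] if len(ranked) > 1 else ranked[0]
    raise ValueError(f"unknown method: {method}")


lfcs = [-2.0, -1.5, -0.2, 0.4]
for name in ("median", "mean", "secondbest"):
    print(name, gene_lfc(lfcs, name))
print("alphamedian", gene_lfc(lfcs, "alphamedian", good=[True, True, False, False]))
```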
Optional arguments for input and output:
Parameter | Explanation |
---|---|
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
--control-sgrna CONTROL_SGRNA | A list of control sgRNAs for normalization and for generating the null distribution of RRA. See the format specification. |
--normcounts-to-file | Write normalized read counts to file ({output-prefix}.normalized.txt). |
--keep-tmp | Keep intermediate files. |
--skip-gene SKIP_GENE | Skip genes in the report. By default, "NA" or "na" will be skipped. |
--additional-rra-parameters ADDITIONAL_RRA_PARAMETERS | Additional arguments to run RRA. They will be appended to the command line for calling RRA. |
Optional arguments for CNV correction:
Parameter | Explanation |
---|---|
--cnv-norm CNV_NORM | A matrix of copy number variation data across cell lines to normalize CNV-biased sgRNA scores prior to gene ranking. |
--cell-line CELL_LINE | The name of the cell line to be used for copy number variation normalization. |
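
When mageck test is called from a script, it can help to assemble the argument list programmatically. The sketch below only builds and prints the command; it uses flags documented above, while the helper name `build_test_command` and the file name `counts.txt` are hypothetical.

```python
import shlex
from typing import List, Optional, Sequence


def build_test_command(count_table: str,
                       treatment: Sequence[str],
                       control: Optional[Sequence[str]] = None,
                       prefix: str = "sample1",
                       paired: bool = False) -> List[str]:
    """Assemble an argument list for `mageck test` from documented flags."""
    argv = ["mageck", "test", "-k", count_table, "-t", ",".join(treatment)]
    if control:
        argv += ["-c", ",".join(control)]
    if paired:
        # --paired requires the same number of samples in -t and -c.
        if not control or len(control) != len(treatment):
            raise ValueError("--paired needs matching -t and -c sample lists")
        argv.append("--paired")
    argv += ["-n", prefix]
    return argv


argv = build_test_command(
    "counts.txt",  # hypothetical count table file
    treatment=["HL60.final", "KBM7.final"],
    control=["HL60.initial", "KBM7.initial"],
    prefix="demo",
    paired=True,
)
print(shlex.join(argv))
# The list could then be passed to subprocess.run(argv, check=True).
```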
The count subcommand collects sgRNA read count information from fastq files. The output count tables can be used directly in the test subcommand.
usage:
usage: mageck count [-h] -l LIST_SEQ (--fastq FASTQ [FASTQ ...] | -k COUNT_TABLE) [--norm-method {none,median,total,control}] [--control-sgrna CONTROL_SGRNA] [--sample-label SAMPLE_LABEL] [-n OUTPUT_PREFIX] [--unmapped-to-file] [--keep-tmp] [--test-run] [--trim-5 TRIM_5] [--sgrna-len SGRNA_LEN] [--count-n] [--reverse-complement] [--pdf-report] [--day0-label DAY0_LABEL] [--gmt-file GMT_FILE]
required arguments:
Parameter | Explanation |
---|---|
-l LIST_SEQ, --list-seq LIST_SEQ | A file containing list of sgRNA names, the sequences and target genes, either in .txt or in .csv format. See input/#sgrna-library-file for more details. If this file is not provided, mageck will count all possible sgRNAs in the fastq. |
--fastq FASTQ | Sample fastq/fastq.gz files (or bam files after v0.5.5. See advanced tutorial), separated by space; use comma (,) to indicate technical replicates of the same sample. For example, "--fastq sample1_replicate1.fastq,sample1_replicate2.fastq sample2_replicate1.fastq,sample2_replicate2.fastq" indicates two samples with 2 technical replicates for each sample. |
-k COUNT_TABLE, --count-table COUNT_TABLE | The read count table file. Only 1 file is accepted. |
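
The --fastq convention (spaces separate samples, commas separate technical replicates) can be sketched as follows. This is an illustration of the convention rather than MAGeCK's parser; the function name `group_fastq` is hypothetical, and the default labels follow the "sample1,sample2,..." default of --sample-label.

```python
from typing import Dict, List, Optional, Sequence


def group_fastq(values: Sequence[str],
                labels: Optional[str] = None) -> Dict[str, List[str]]:
    """Group fastq files into samples; commas separate technical replicates."""
    groups = [[path for path in value.split(",") if path] for value in values]
    if any(not group for group in groups):
        raise ValueError("every sample needs at least one fastq file")
    if labels is None:
        names = [f"sample{i}" for i in range(1, len(groups) + 1)]
    else:
        names = [name.strip() for name in labels.split(",")]
        if len(names) != len(groups):
            raise ValueError("number of labels must equal number of samples")
    return dict(zip(names, groups))


# Values as the shell would pass them after --fastq (split on spaces).
fastq_args = [
    "sample1_replicate1.fastq,sample1_replicate2.fastq",
    "sample2_replicate1.fastq,sample2_replicate2.fastq",
]
for label, files in group_fastq(fastq_args).items():
    print(label, files)
```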
optional arguments for normalization:
Parameter | Explanation |
---|---|
--norm-method {none,median,total,control} | Method for normalization, including "none" (no normalization), "median" (median normalization, default), "total" (normalization by total read counts), "control" (normalization by control sgRNAs specified by the --control-sgrna option). |
--control-sgrna CONTROL_SGRNA | A list of control sgRNAs for normalization and for generating the null distribution of RRA. See the format specification. |
optional arguments for input and output:
Parameter | Explanation |
---|---|
--sample-label SAMPLE_LABEL | Sample labels, separated by comma (,). The number of labels must equal the number of samples provided (in --fastq option). Default "sample1,sample2,...". |
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
--unmapped-to-file | Save unmapped reads to file. |
--keep-tmp | Keep intermediate files. |
--test-run | Test running. If this option is on, MAGeCK will only process the first 1M records for each file. |
optional arguments for processing fastq files:
Parameter | Explanation |
---|---|
--trim-5 TRIM_5 | Number of bases to trim from the 5' end of the reads. Default 0. |
--sgrna-len SGRNA_LEN | Length of the sgRNA. Default 20. ATTENTION: after v 0.5.3, the program will automatically determine the sgRNA length from library file; so only use this if you turn on the --unmapped-to-file option. |
--count-n | Count sgRNAs with Ns. By default, sgRNAs containing Ns will be discarded. |
--reverse-complement | Reverse complement the sequences in library for read mapping. |
Optional arguments for quality controls:
Parameter | Explanation |
---|---|
--pdf-report | Generate pdf report of the fastq files. |
--day0-label DAY0_LABEL | Turn on the negative selection QC and specify the label for control sample (usually day 0 or plasmid). For every other sample label, the negative selection QC will compare it with day0 sample, and estimate the degree of negative selections in essential genes. |
--gmt-file GMT_FILE | The pathway file used for QC, in GMT format. By default it will use the GMT file provided by MAGeCK. |
MAGeCK can also invoke GSEA (default) or RRA to test if a pathway is enriched in one particular gene ranking.
usage:
usage: mageck pathway [-h] --gene-ranking GENE_RANKING --gmt-file GMT_FILE [-n OUTPUT_PREFIX] [--method {gsea,rra}] [--single-ranking] [--sort-criteria {neg,pos}] [--keep-tmp] [--ranking-column RANKING_COLUMN] [--ranking-column-2 RANKING_COLUMN_2] [--pathway-alpha PATHWAY_ALPHA] [--permutation PERMUTATION]
required arguments:
Parameter | Explanation |
---|---|
--gene-ranking GENE_RANKING | The gene ranking file generated by the gene test step. |
--gmt-file GMT_FILE | The pathway file in GMT format. See input/#pathway-file-gmt for more details of the GMT file format. |
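
For readers unfamiliar with GMT, each line holds one gene set: the set name, a description, and then the member genes, all tab-separated. The minimal parser below is a sketch under that assumption; the function name `parse_gmt` and the example gene members are illustrative only.

```python
from typing import Dict, Iterable, List


def parse_gmt(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse GMT lines: name <TAB> description <TAB> gene1 <TAB> gene2 ..."""
    pathways: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"malformed GMT line: {line!r}")
        name, genes = fields[0], fields[2:]  # fields[1] is the description
        pathways[name] = [gene for gene in genes if gene]
    return pathways


# Illustrative content; the gene members are made up for the example.
example = [
    "KEGG_RIBOSOME\tNA\tRPL3\tRPL4\tRPS2\n",
    "\n",
    "DEMO_SET\tan example set\tTP53\tMYC\n",
]
gene_sets = parse_gmt(example)
for name, genes in gene_sets.items():
    print(name, len(genes), genes)
```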
optional arguments:
Parameter | Explanation |
---|---|
-h, --help | show this help message and exit |
--single-ranking | The provided file is a (single) gene ranking file, either positive or negative selection. Only one enrichment comparison will be performed. |
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
--method {gsea,rra} | Method for testing pathway enrichment, including gsea (Gene Set Enrichment Analysis) or rra. Default gsea. |
--sort-criteria {neg,pos} | Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection. |
--keep-tmp | Keep intermediate files. |
--ranking-column RANKING_COLUMN | Column number or label in gene summary file for gene ranking; can be either an integer of column number, or a string of column label. Default "2" (the 3rd column). |
--ranking-column-2 RANKING_COLUMN_2 | Column number or label in gene summary file for gene ranking; can be either an integer of column number, or a string of column label. This option is used to determine the column for positive selections and is disabled if --single-ranking is specified. Default "8" (the 9th column). |
--pathway-alpha PATHWAY_ALPHA | The default alpha value for RRA pathway enrichment. Default 0.25. |
--permutation PERMUTATION | The number of permutations for GSEA. Default 1000. |
The mle subcommand performs maximum-likelihood analysis of gene essentialities, instead of the RRA analysis.
usage:
usage: mageck.beta mle [-h] -k COUNT_TABLE (-d DESIGN_MATRIX | --day0-label DAY0_LABEL) [-n OUTPUT_PREFIX] [-i INCLUDE_SAMPLES] [-b BETA_LABELS] [--control-sgrna CONTROL_SGRNA] [--cnv-norm CNV_NORM] [--cnv-est CNV_EST] [--debug] [--debug-gene DEBUG_GENE] [--norm-method {none,median,total,control}] [--genes-varmodeling GENES_VARMODELING] [--permutation-round PERMUTATION_ROUND] [--no-permutation-by-group] [--max-sgrnapergene-permutation MAX_SGRNAPERGENE_PERMUTATION] [--remove-outliers] [--threads THREADS] [--adjust-method {fdr,holm,pounds}] [--sgrna-efficiency SGRNA_EFFICIENCY] [--sgrna-eff-name-column SGRNA_EFF_NAME_COLUMN] [--sgrna-eff-score-column SGRNA_EFF_SCORE_COLUMN] [--update-efficiency] [--bayes] [-p] [-w PPI_WEIGHTING] [-e NEGATIVE_CONTROL]
required arguments:
Parameter | Explanation |
---|---|
-k COUNT_TABLE, --count-table COUNT_TABLE | Provide a tab-separated count table. Each line in the table should include sgRNA name (1st column), target gene (2nd column) and read counts in each sample. See input/#sgrna-read-count-file for a detailed description. |
-d DESIGN_MATRIX, --design-matrix DESIGN_MATRIX | Provide a design matrix, either a file name or a quoted string of the design matrix. For example, "1,1;1,0". The rows of the design matrix must match the order of the samples in the count table (if --include-samples is not specified), or the order of the samples given by the --include-samples option. |
--day0-label DAY0_LABEL | Specify the label for control sample (usually day 0 or plasmid). For every other sample label, the MLE module will treat it as a single condition and generate a corresponding design matrix. |
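
The quoted design-matrix string ("1,1;1,0": semicolons separate rows, commas separate columns) can be parsed as sketched below. The second function shows one plausible matrix for the --day0-label mode, assuming a baseline column of ones followed by one indicator column per non-day0 sample; that layout is an assumption, not something this page specifies, and both function names are hypothetical.

```python
from typing import List, Sequence


def parse_design_matrix(text: str, n_samples: int) -> List[List[int]]:
    """Parse a string such as "1,1;1,0" into rows (one row per sample)."""
    rows = [[int(cell) for cell in row.split(",")]
            for row in text.strip().split(";") if row.strip()]
    if len(rows) != n_samples:
        raise ValueError("one design-matrix row is needed per sample")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of columns")
    return rows


def design_from_day0(samples: Sequence[str], day0_label: str) -> List[List[int]]:
    """Assumed layout: baseline column, then one column per non-day0 sample."""
    if day0_label not in samples:
        raise ValueError(f"{day0_label!r} is not a sample label")
    conditions = [label for label in samples if label != day0_label]
    return [[1] + [1 if label == cond else 0 for cond in conditions]
            for label in samples]


print(parse_design_matrix("1,1;1,0", n_samples=2))
print(design_from_day0(["day0", "A", "B"], "day0"))
```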
optional arguments for input and output:
Parameter | Explanation |
---|---|
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
-i INCLUDE_SAMPLES, --include-samples INCLUDE_SAMPLES | Specify the sample labels if the design matrix is not given by file in the --design-matrix option. Sample labels are separated by ",", and must match the labels in the count table. |
-b BETA_LABELS, --beta-labels BETA_LABELS | Specify the labels of the variables (i.e., beta), if the design matrix is not given by file in the --design-matrix option. Should be separated by ",", and the number of labels must equal (# columns of design matrix), including baseline labels. Default value: "beta_0,beta_1,beta_2,...". |
--control-sgrna CONTROL_SGRNA | A list of control sgRNAs. See the format specification. |
Optional arguments for CNV correction:
Parameter | Explanation |
---|---|
--cnv-norm CNV_NORM | A matrix of copy number variation data across cell lines to normalize CNV-biased sgRNA scores prior to gene ranking. |
optional arguments for MLE module:
Parameter | Explanation |
---|---|
--debug | Debug mode to output detailed information of the running. |
--debug-gene DEBUG_GENE | Debug mode to only run one gene with specified ID. |
--norm-method {none,median,total,control} | Method for normalization, including "none" (no normalization), "median" (median normalization, default), "total" (normalization by total read counts), "control" (normalization by control sgRNAs specified by the --control-sgrna option). |
--genes-varmodeling GENES_VARMODELING | The number of genes for mean-variance modeling. Default 1000. |
--permutation-round PERMUTATION_ROUND | The rounds for permutation (integer). The permutation time is (# genes) * x for x rounds of permutation. Suggested value: 10 (may take longer time). Default 2. |
--no-permutation-by-group | By default, gene permutation is performed separately, by their number of sgRNAs. Turning this option on will perform permutation on all genes together. This makes the program faster, but the p value estimation is accurate only if the number of sgRNAs per gene is approximately the same. |
--max-sgrnapergene-permutation MAX_SGRNAPERGENE_PERMUTATION | Only permute genes by group if the number of sgRNAs per gene is smaller than this number. This will save a lot of time if some regions are targeted by a large number of sgRNAs (usually hundreds). Must be an integer. Default 100. |
--remove-outliers | Try to remove outliers. Turning this option on will slow the algorithm. |
--threads THREADS | Number of threads used to run the algorithm. Default 1. |
--adjust-method {fdr,holm,pounds} | Method for sgRNA-level p-value adjustment: false discovery rate (fdr), Holm's method (holm), or Pounds's method (pounds). |
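
To make the --adjust-method choices concrete, here is a plain-Python sketch of two of them. It assumes that fdr refers to the Benjamini-Hochberg procedure, which this page does not state explicitly, and it is not MAGeCK's own code; Pounds's method is not sketched.

```python
from typing import List, Sequence


def adjust_bh(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def adjust_holm(pvalues: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_max = 0.0
    # Walk from the smallest p-value up, enforcing monotonicity.
    for rank, idx in enumerate(order, start=1):
        running_max = max(running_max, min(1.0, (n - rank + 1) * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.5]
print([round(p, 4) for p in adjust_bh(pvals)])
print([round(p, 4) for p in adjust_holm(pvals)])
```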
optional arguments for the EM iteration:
Parameter | Explanation |
---|---|
--sgrna-efficiency SGRNA_EFFICIENCY | An optional file of sgRNA efficiency prediction. The efficiency prediction will be used as an initial guess of the probability an sgRNA is efficient. Must contain at least two columns, one containing sgRNA ID, the other containing sgRNA efficiency prediction. |
--sgrna-eff-name-column SGRNA_EFF_NAME_COLUMN | The sgRNA ID column in sgRNA efficiency prediction file (specified by the --sgrna-efficiency option). Default is 0 (the first column). |
--sgrna-eff-score-column SGRNA_EFF_SCORE_COLUMN | The sgRNA efficiency prediction column in sgRNA efficiency prediction file (specified by the --sgrna-efficiency option). Default is 1 (the second column). |
--update-efficiency | Iteratively update sgRNA efficiency during EM iteration. |
The plot command generates graphics for selected genes. For interactive visualizations, use our new MAGeCK-VISPR algorithm.
usage:
usage: mageck plot [-h] -k COUNT_TABLE -g GENE_SUMMARY [--genes GENES] [-s SAMPLES] [-n OUTPUT_PREFIX] [--norm-method {none,median,total}] [--keep-tmp]
required arguments:
Parameter | Explanation |
---|---|
-k COUNT_TABLE, --count-table COUNT_TABLE | Provide a tab-separated count table. |
-g GENE_SUMMARY, --gene-summary GENE_SUMMARY | The gene summary file generated by the test command. |
optional arguments:
Parameter | Explanation |
---|---|
-h, --help | show this help message and exit |
--genes GENES | A list of genes to be plotted, separated by comma. Default: none. |
-s SAMPLES, --samples SAMPLES | A list of samples to be plotted, separated by comma. Default: using all samples in the count table. |
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX | The prefix of the output file(s). Default sample1. |
--norm-method {none,median,total} | Method for normalization, default median. |
--keep-tmp | Keep intermediate files. |
The run subcommand allows you to generate comparison results directly from fastq files, with limited parameter settings available. The parameters for the run subcommand are those of the test and count subcommands; see both subcommands for more details. It is strongly suggested that users run the count and test commands separately, in order to gain finer control of the results.
These programs are used by MAGeCK internally, but can also be executed by users for other purposes.
RRA - Robust Rank Aggregation v 0.5.6.
Usage:
Parameter | Explanation |
---|---|
-i input_data file | Input file name. Format: "item id" "group id" "list id" "value" ["probability"] ["chosen"] |
-o output_file | Output file name. Format: "group id" "number of items in the group" "lo-value" "false discovery rate" |
-p maximum_percentile | RRA only considers the items with percentile smaller than this parameter. Default=0.1 |
--control control_sgrna_list | A list of control sgRNA names. |
--permutation permutation_round | The number of rounds of permutation. Increase this value if the number of genes is small. Default 100. |
--no-permutation-by-group | By default, gene permutation is performed separately, by their number of sgRNAs. Turning this option on will perform permutation on all genes together. This makes the program faster, but the p value estimation is accurate only if the number of sgRNAs per gene is approximately the same. |
--skip-gene gene_name | Genes to skip from doing permutation. Specify it multiple times if you need to skip more than one gene. |
--min-percentage-goodsgrna min_percentage | Filter genes that have too low a percentage of 'good sgrnas', i.e., sgrnas that fall below the -p threshold. Must be a number between 0 and 1. Default 0 (do not filter genes). |
--min-number-goodsgrna min_number | Filter genes that have too few 'good sgrnas', i.e., sgrnas that fall below the -p threshold. Must be an integer. Default 0 (do not filter genes). |
mageckGSEA is a fast implementation of Gene Set Enrichment Analysis (GSEA) in C++. It is used by MAGeCK for quality control and pathway enrichment tests. Compared with the official GSEA program, its main advantages are ease of use and very fast running speed.
In the gsea/demo folder, an example is provided to run GSEA. Use the following command to perform GSEA analysis based on the ranked gene list in demo1.txt (provided in the demo folder), tested on pathways defined in kegg.ribosome.gmt (provided in the demo folder). The scores in the 2nd column are used to rank genes (-c 1), and 10000 permutations are run to obtain the p value:
mageckGSEA -r demo1.txt -g kegg.ribosome.gmt -c 1 -p 10000
You can either provide genes with their scores, as in demo1.txt (genes with smaller scores are ranked first):
    SYNRG 0.715581582
    SREK1 0.992306809
    SLC25A46 0.057411873
    COL4A5 0.36387645
    CCDC22 -0.463887932
    MVD 0.020897922
mageckGSEA will first rank genes based on the provided scores, as long as you indicate which column to use (-c 1).
Or you can provide only the gene ranking, as in demo2.txt:
    C5orf64
    TTC17
    MRPS27
    PIGY
    GPAA1
    KIF4A
    EPS15
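
The two input styles can be mimicked in a few lines of Python. This sketch only illustrates the ranking rule described above (column numbers start from 0, and smaller scores are ranked first); it is not the mageckGSEA source, and the function name `rank_genes` is hypothetical.

```python
from typing import Iterable, List


def rank_genes(lines: Iterable[str], score_column: int = 0) -> List[str]:
    """Return gene names in ranked order.

    With score_column == 0 the file order is taken as the ranking;
    otherwise genes are sorted by the given column, smallest score first.
    """
    rows = [line.split() for line in lines if line.strip()]
    if score_column == 0:
        return [row[0] for row in rows]
    # sorted() is stable, so tied scores keep their order in the file.
    ordered = sorted(rows, key=lambda row: float(row[score_column]))
    return [row[0] for row in ordered]


demo1 = [
    "SYNRG 0.715581582",
    "SREK1 0.992306809",
    "SLC25A46 0.057411873",
    "COL4A5 0.36387645",
    "CCDC22 -0.463887932",
    "MVD 0.020897922",
]
print(rank_genes(demo1, score_column=1))
print(rank_genes(["C5orf64", "TTC17", "MRPS27"]))
```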
The output is a tab-separated file to report the following statistics of GSEA:
Pathway | Size | ES | p | p_permutation | FDR | Ranking | Hits | LFC |
---|---|---|---|---|---|---|---|---|
KEGG_RIBOSOME | 88 | 0.3262 | 0.00240772 | 0.0043 | 0.0043 | 0 | 32 | 0 |
Item | Explanation |
---|---|
Pathway | The name of the pathway |
Size | The size of the pathway, i.e., the number of genes |
ES | Enrichment Score (ES) in GSEA |
p | The p value of ES |
p_permutation | The permutation p value of ES (usually more accurate than p) |
FDR | False Discovery Rate of p_permutation |
Ranking | The ranking of this pathway |
Hits | The number of genes that are ranked before ES score. See "Leading Edge" analysis of GSEA |
LFC | Log fold change (not implemented) |
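
As a rough illustration of the ES and Hits columns, the sketch below computes the classic unweighted running-sum enrichment score over a ranked gene list. Whether mageckGSEA uses this exact weighting is not stated on this page, so treat it as a textbook approximation; the function name `enrichment_score` and the toy gene names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def enrichment_score(ranked: Sequence[str],
                     gene_set: Iterable[str]) -> Tuple[float, int]:
    """Unweighted running-sum ES and the number of set members seen at the peak.

    Assumes each gene appears once in `ranked`.
    """
    members = set(gene_set) & set(ranked)
    n_hit = len(members)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("the gene set must cover some, but not all, genes")
    running = 0.0
    best = 0.0
    hits_seen = 0
    hits_at_peak = 0
    for gene in ranked:
        if gene in members:
            running += 1.0 / n_hit   # step up on a pathway gene
            hits_seen += 1
        else:
            running -= 1.0 / n_miss  # step down on any other gene
        if abs(running) > abs(best):
            best = running
            hits_at_peak = hits_seen
    return best, hits_at_peak


ranking = ["A", "B", "C", "D", "E", "F"]  # toy ranking
print(enrichment_score(ranking, {"A", "B"}))
print(enrichment_score(ranking, {"E", "F"}))
```

A permutation p value could then be estimated by recomputing the score on shuffled rankings and counting how often it is at least as extreme as the observed one.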
USAGE:
mageckGSEA -r rank_file -g gmt_file [-e] [-s] [-c score_column] [-p perm_time] [-n pathway_name] [-o output_file] [--] [--version] [-h]
Parameter | Explanation |
---|---|
-e, --reverse_value | Reverse the order of the gene. |
-s, --sort_byp | Sort the pathways by p value. |
-c score_column, --score_column score_column | The column for gene scores. If you just want to use the ranking of the gene (located at the 1st column), use 0. Otherwise, specify which column should be used to rank the gene. The column number starts from 0. Default: 0. |
-p perm_time, --perm_time perm_time | Permutations, default 1000. |
-n pathway_name, --pathway_name pathway_name | Name of the pathway to be tested. If not found, will test all pathways. |
-o output_file, --output_file output_file | The name of the output file. Use - to print to standard output. |
-r rank_file, --rank_file rank_file | (required) Rank file. The first column of the rank file must be the gene name. |
-g gmt_file, --gmt_file gmt_file | (required) The pathway annotation in GMT format. |
--version | Displays version information and exits. |
-h, --help | Displays usage information and exits. |