Wei Li

Usage

The main entry point of MAGeCK is the mageck program, which includes several subprograms:

  • count: only collect sgRNA read counts from fastq files.
  • test: given a table of read counts, perform the sgRNA and gene ranking.
  • pathway: given a ranked gene list, test whether one pathway is enriched.
  • mle: perform maximum-likelihood estimation of gene essentiality scores.
  • run: collect sgRNA read counts from fastq files, and perform sgRNA and gene ranking (disabled since 0.5.4).

There is also a plot subprogram, which draws figures for the genes you are interested in from the test results.

  • plot: generate graphics for selected genes.

test

This subcommand tests and ranks sgRNAs and genes based on the read count tables provided.

usage:

  usage: mageck test [-h] -k COUNT_TABLE
                    (-t TREATMENT_ID | --day0-label DAY0_LABEL)
                    [-c CONTROL_ID]
                    [--paired] [--norm-method {none,median,total,control}]
                    [--gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD]
                    [--adjust-method {fdr,holm,pounds}]
                    [--variance-estimation-samples VARIANCE_ESTIMATION_SAMPLES]
                    [--sort-criteria {neg,pos}]
                    [--remove-zero {none,control,treatment,both,any}]
                    [--remove-zero-threshold REMOVE_ZERO_THRESHOLD]
                    [--pdf-report]
                    [--gene-lfc-method {median,alphamedian,mean,alphamean,secondbest}]
                    [-n OUTPUT_PREFIX] [--control-sgrna CONTROL_SGRNA]
                    [--normcounts-to-file] [--skip-gene SKIP_GENE]
                    [--keep-tmp]
                    [--additional-rra-parameters ADDITIONAL_RRA_PARAMETERS]
                    [--cnv-norm CNV_NORM] [--cell-line CELL_LINE]

required arguments:

Parameter Explanation
-k COUNT_TABLE, --count-table COUNT_TABLE Provide a tab-separated count table instead of sam files. Each line in the table should include sgRNA name (1st column), targeting gene (2nd column) and read counts in each sample. See input/#sgrna-read-count-file for a detailed description.
-t TREATMENT_ID, --treatment-id TREATMENT_ID Sample label or sample index (0 as the first sample) in the count table as treatment experiments, separated by comma (,). If sample label is provided, the labels must match the labels in the first line of the count table; for example, "HL60.final,KBM7.final". For sample index, "0,2" means the 1st and 3rd samples are treatment experiments. See input/#sample-index for a detailed description.
--day0-label DAY0_LABEL Specify the label for control sample (usually day 0 or plasmid). For every other sample label, the module will treat it as a treatment condition and compare with control sample.
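
The label-or-index convention for -t can be illustrated with a short Python sketch. This is not MAGeCK's own code: the function name resolve_samples and the order of the sample columns are assumptions made for illustration. Only the rule comes from the description above: a token is either a sample label from the first line of the count table or a 0-based sample index.

```python
def resolve_samples(sample_labels, spec):
    """Map a comma-separated list of labels or 0-based indices to sample positions."""
    indices = []
    for token in spec.split(","):
        token = token.strip()
        if token in sample_labels:
            # The token matches a label from the count table header.
            indices.append(sample_labels.index(token))
        elif token.isdigit() and int(token) < len(sample_labels):
            # The token is a 0-based sample index.
            indices.append(int(token))
        else:
            raise ValueError(f"unknown sample: {token!r}")
    return indices


# Hypothetical sample columns of a count table (after the sgRNA and gene columns).
labels = ["HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"]
print(resolve_samples(labels, "HL60.final,KBM7.final"))  # [2, 3]
print(resolve_samples(labels, "0,2"))                    # [0, 2]
```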

optional general arguments:

Parameter Explanation
-h, --help show this help message and exit
-c CONTROL_ID, --control-id CONTROL_ID Sample label or sample index in the count table as control experiments, separated by comma (,). Default is all the samples not specified in treatment experiments. See input/#sample-index for a detailed description.
--paired Paired sample comparisons. In this mode, -t and -c must list the same number of samples, in matching order. For example, "-t HL60.final,KBM7.final -c HL60.initial,KBM7.initial".
--norm-method {none,median,total,control} Method for normalization, default median. If control is specified, the size factor will be estimated using control sgRNAs specified in --control-sgrna option.
--gene-test-fdr-threshold GENE_TEST_FDR_THRESHOLD FDR threshold for gene test, default 0.25.
--adjust-method {fdr,holm,pounds} Method for sgRNA-level p-value adjustment, including false discovery rate (fdr), Holm's method (holm), or Pounds's method (pounds).
--variance-estimation-samples VARIANCE_ESTIMATION_SAMPLES Sample label or sample index for estimating variances, separated by comma (,). See -t/--treatment-id option for specifying samples.
--sort-criteria {neg,pos} Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection.
--remove-zero {none,control,treatment,both,any} Whether to remove zero-count sgRNAs in control and/or treatment experiments. Default: none (do not remove those zero-count sgRNAs).
--pdf-report Generate pdf report of the analysis.
--gene-lfc-method {median,alphamedian,mean,alphamean,secondbest} Method to calculate gene log fold changes (LFC) from sgRNA LFCs. Available methods include the median/mean of all sgRNAs (median/mean), the median/mean of the sgRNAs that are ranked in front of the alpha cutoff in RRA (alphamedian/alphamean), or the sgRNA that has the second strongest LFC (secondbest). In the alphamedian/alphamean case, the number of sgRNAs corresponds to the "goodsgrna" column in the output, and the gene LFC will be set to 0 if no sgRNA is in front of the alpha cutoff. Default median. (new since v0.5.5)
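
The --gene-lfc-method choices can be sketched in Python as follows. This is a minimal illustration, not MAGeCK's implementation. The function name gene_lfc and the good_lfcs argument (the LFCs of the sgRNAs in front of the RRA alpha cutoff) are hypothetical, and "second strongest" is assumed here to mean the second largest absolute LFC.

```python
from statistics import mean, median


def gene_lfc(sgrna_lfcs, method="median", good_lfcs=None):
    """Summarize sgRNA log fold changes into one gene-level LFC."""
    if method == "median":
        return median(sgrna_lfcs)
    if method == "mean":
        return mean(sgrna_lfcs)
    if method in ("alphamedian", "alphamean"):
        # Gene LFC is 0 when no sgRNA is in front of the alpha cutoff.
        if not good_lfcs:
            return 0.0
        return median(good_lfcs) if method == "alphamedian" else mean(good_lfcs)
    if method == "secondbest":
        # Assumption: "strongest" means largest absolute LFC.
        ranked = sorted(sgrna_lfcs, key=abs, reverse=True)
        return ranked[1] if len(ranked) > 1 else ranked[0]
    raise ValueError(f"unknown method: {method}")


lfcs = [-2.0, -1.0, 0.5]
print(gene_lfc(lfcs, "median"))                       # -1.0
print(gene_lfc(lfcs, "alphamedian", [-2.0, -1.0]))    # -1.5
print(gene_lfc(lfcs, "alphamean", []))                # 0.0
print(gene_lfc(lfcs, "secondbest"))                   # -1.0
```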

Optional arguments for input and output:

Parameter Explanation
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX The prefix of the output file(s). Default sample1.
--control-sgrna CONTROL_SGRNA A list of control sgRNAs for normalization and for generating the null distribution of RRA. See the format specification.
--normcounts-to-file Write normalized read counts to file ({output-prefix}.normalized.txt).
--keep-tmp Keep intermediate files.
--skip-gene SKIP_GENE Skip genes in the report. By default, "NA" or "na" will be skipped.
--additional-rra-parameters ADDITIONAL_RRA_PARAMETERS Additional arguments to run RRA. They will be appended to the command line for calling RRA.

Optional arguments for CNV correction:

Parameter Explanation
--cnv-norm CNV_NORM A matrix of copy number variation data across cell lines to normalize CNV-biased sgRNA scores prior to gene ranking.
--cell-line CELL_LINE The name of the cell line to be used for copy number variation normalization.
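
As an illustration of how the options above fit together, the following Python sketch assembles a mageck test command line and checks the --paired requirement before running anything. The helper name build_test_command and the file name counts.txt are hypothetical; only flags documented above are used.

```python
def build_test_command(count_table, treatment, control, prefix, paired=False):
    """Build an argument list for `mageck test` from documented options."""
    if paired and len(treatment) != len(control):
        # --paired requires the same number of samples in -t and -c.
        raise ValueError("--paired needs equal numbers of treatment and control samples")
    cmd = [
        "mageck", "test",
        "-k", count_table,
        "-t", ",".join(treatment),
        "-c", ",".join(control),
        "-n", prefix,
    ]
    if paired:
        cmd.append("--paired")
    return cmd


cmd = build_test_command(
    "counts.txt",                         # hypothetical count table file
    ["HL60.final", "KBM7.final"],
    ["HL60.initial", "KBM7.initial"],
    "demo",
    paired=True,
)
print(" ".join(cmd))
```

If mageck is installed, the list can be passed to subprocess.run(cmd, check=True). Building the command as a list avoids shell quoting problems with sample labels.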

count

This subcommand collects sgRNA read count information from fastq files. The output count tables can be used directly in the test subcommand.

usage:

 usage: mageck count [-h] -l LIST_SEQ 
                (--fastq FASTQ [FASTQ ...] | -k COUNT_TABLE)
                [--norm-method {none,median,total,control}]
                [--control-sgrna CONTROL_SGRNA]
                [--sample-label SAMPLE_LABEL] [-n OUTPUT_PREFIX]
                [--unmapped-to-file] [--keep-tmp] [--test-run]
                [--trim-5 TRIM_5] [--sgrna-len SGRNA_LEN] [--count-n]
                [--reverse-complement] [--pdf-report]
                [--day0-label DAY0_LABEL] [--gmt-file GMT_FILE]

required arguments:

Parameter Explanation
-l LIST_SEQ, --list-seq LIST_SEQ A file containing list of sgRNA names, the sequences and target genes, either in .txt or in .csv format. See input/#sgrna-library-file for more details. If this file is not provided, mageck will count all possible sgRNAs in the fastq.
--fastq FASTQ Sample fastq/fastq.gz files (or bam files after v0.5.5. See advanced tutorial), separated by space; use comma (,) to indicate technical replicates of the same sample. For example, "--fastq sample1_replicate1.fastq,sample1_replicate2.fastq sample2_replicate1.fastq,sample2_replicate2.fastq" indicates two samples with 2 technical replicates for each sample.
-k COUNT_TABLE, --count-table COUNT_TABLE The read count table file. Only 1 file is accepted.
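
The --fastq convention (spaces separate samples, commas separate technical replicates) can be shown with a small Python sketch. The function name group_fastq is hypothetical; the file names are those of the example above.

```python
def group_fastq(fastq_args):
    """Split each --fastq argument into its technical replicate files."""
    return [arg.split(",") for arg in fastq_args]


# The shell splits on spaces, so each sample arrives as one argument.
args = [
    "sample1_replicate1.fastq,sample1_replicate2.fastq",
    "sample2_replicate1.fastq,sample2_replicate2.fastq",
]
samples = group_fastq(args)
for number, replicates in enumerate(samples, start=1):
    print(f"sample {number}: {len(replicates)} technical replicates")
```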

optional arguments for normalization:

Parameter Explanation
--norm-method {none,median,total,control} Method for normalization, including "none" (no normalization), "median" (median normalization, default), "total" (normalization by total read counts), "control" (normalization by control sgRNAs specified by the --control-sgrna option).
--control-sgrna CONTROL_SGRNA A list of control sgRNAs for normalization and for generating the null distribution of RRA. See the format specification.

optional arguments for input and output:

Parameter Explanation
--sample-label SAMPLE_LABEL Sample labels, separated by comma (,). The number of labels must equal the number of samples provided (in --fastq option). Default "sample1,sample2,...".
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX The prefix of the output file(s). Default sample1.
--unmapped-to-file Save unmapped reads to file.
--keep-tmp Keep intermediate files.
--test-run Test run. If this option is on, MAGeCK will only process the first 1M records of each file.

optional arguments for processing fastq files:

Parameter Explanation
--trim-5 TRIM_5 Number of bases to trim from the 5' end of the reads. Default 0.
--sgrna-len SGRNA_LEN Length of the sgRNA. Default 20. ATTENTION: after v 0.5.3, the program will automatically determine the sgRNA length from library file; so only use this if you turn on the --unmapped-to-file option.
--count-n Count sgRNAs with Ns. By default, sgRNAs containing Ns will be discarded.
--reverse-complement Reverse complement the sequences in library for read mapping.
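
A rough Python sketch of what the read-processing options describe is given below. It is only an illustration under simple assumptions (fixed trimming, exact-length extraction), not MAGeCK's actual read mapping code. The function names extract_sgrna and reverse_complement are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement a DNA sequence (as --reverse-complement does for the library)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_sgrna(read, trim_5=0, sgrna_len=20, count_n=False):
    """Trim the 5' end and return the candidate sgRNA, or None if it is discarded."""
    candidate = read[trim_5:trim_5 + sgrna_len]
    if len(candidate) < sgrna_len:
        return None  # read too short after trimming
    if "N" in candidate.upper() and not count_n:
        return None  # sequences containing Ns are discarded by default
    return candidate


read = "TTTT" + "ACGTACGTACGTACGTACGT" + "GGG"
print(extract_sgrna(read, trim_5=4))   # ACGTACGTACGTACGTACGT
print(reverse_complement("AACG"))      # CGTT
```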

Optional arguments for quality controls:

Parameter Explanation
--pdf-report Generate pdf report of the fastq files.
--day0-label DAY0_LABEL Turn on the negative selection QC and specify the label for control sample (usually day 0 or plasmid). For every other sample label, the negative selection QC will compare it with day0 sample, and estimate the degree of negative selections in essential genes.
--gmt-file GMT_FILE The pathway file used for QC, in GMT format. By default it will use the GMT file provided by MAGeCK.

pathway

MAGeCK can also invoke GSEA (default) or RRA to test if a pathway is enriched in one particular gene ranking.

usage:

usage: mageck pathway [-h] --gene-ranking GENE_RANKING --gmt-file GMT_FILE
                  [-n OUTPUT_PREFIX] [--method {gsea,rra}]
                  [--single-ranking] [--sort-criteria {neg,pos}]
                  [--keep-tmp] [--ranking-column RANKING_COLUMN]
                  [--ranking-column-2 RANKING_COLUMN_2]
                  [--pathway-alpha PATHWAY_ALPHA]
                  [--permutation PERMUTATION]

required arguments:

Parameter Explanation
--gene-ranking GENE_RANKING The gene ranking file generated by the gene test step.
--gmt-file GMT_FILE The pathway file in GMT format. See input/#pathway-file-gmt for more details of the GMT file format.

optional arguments:

Parameter Explanation
-h, --help show this help message and exit
--single-ranking The provided file is a (single) gene ranking file, either positive or negative selection. Only one enrichment comparison will be performed.
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX The prefix of the output file(s). Default sample1.
--method {gsea,rra} Method for testing pathway enrichment, including gsea (Gene Set Enrichment Analysis) or rra. Default gsea.
--sort-criteria {neg,pos} Sorting criteria, either by negative selection (neg) or positive selection (pos). Default negative selection.
--keep-tmp Keep intermediate files.
--ranking-column RANKING_COLUMN Column number or label in gene summary file for gene ranking; can be either an integer of column number, or a string of column label. Default "2" (the 3rd column).
--ranking-column-2 RANKING_COLUMN_2 Column number or label in gene summary file for gene ranking; can be either an integer of column number, or a string of column label. This option is used to determine the column for positive selections and is disabled if --single-ranking is specified. Default "8" (the 9th column).
--pathway-alpha PATHWAY_ALPHA The default alpha value for RRA pathway enrichment. Default 0.25.
--permutation PERMUTATION The number of permutations for GSEA. Default 1000.

mle

The mle subcommand performs maximum-likelihood analysis of gene essentialities, instead of the RRA analysis.

usage:

     usage: mageck.beta mle [-h] -k COUNT_TABLE
                   (-d DESIGN_MATRIX | --day0-label DAY0_LABEL)
                   [-n OUTPUT_PREFIX] [-i INCLUDE_SAMPLES]
                   [-b BETA_LABELS] [--control-sgrna CONTROL_SGRNA]
                   [--cnv-norm CNV_NORM] [--cnv-est CNV_EST] [--debug]
                   [--debug-gene DEBUG_GENE]
                   [--norm-method {none,median,total,control}]
                   [--genes-varmodeling GENES_VARMODELING]
                   [--permutation-round PERMUTATION_ROUND]
                   [--no-permutation-by-group]
                   [--max-sgrnapergene-permutation MAX_SGRNAPERGENE_PERMUTATION]
                   [--remove-outliers] [--threads THREADS]
                   [--adjust-method {fdr,holm,pounds}]
                   [--sgrna-efficiency SGRNA_EFFICIENCY]
                   [--sgrna-eff-name-column SGRNA_EFF_NAME_COLUMN]
                   [--sgrna-eff-score-column SGRNA_EFF_SCORE_COLUMN]
                   [--update-efficiency] [--bayes] [-p] [-w PPI_WEIGHTING]
                   [-e NEGATIVE_CONTROL]

required arguments:

Parameter Explanation
-k COUNT_TABLE, --count-table COUNT_TABLE Provide a tab-separated count table. Each line in the table should include sgRNA name (1st column), target gene (2nd column) and read counts in each sample. See input/#sgrna-read-count-file for a detailed description.
-d DESIGN_MATRIX, --design-matrix DESIGN_MATRIX Provide a design matrix, either a file name or a quoted string of the design matrix. For example, "1,1;1,0". The rows of the design matrix must match the order of the samples in the count table (if --include-samples is not specified), or the order of the samples given by the --include-samples option.
--day0-label DAY0_LABEL Specify the label for control sample (usually day 0 or plasmid). For every other sample label, the MLE module will treat it as a single condition and generate a corresponding design matrix.
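
The two ways of specifying a design can be sketched in Python. This is a hedged illustration, not the MLE module's code. It assumes that ";" separates rows (samples) and "," separates columns in the quoted string, and that a day 0 design consists of a baseline column of ones plus one indicator column per non-day0 sample. The function names and the sample labels are hypothetical.

```python
def parse_design_matrix(text):
    """Parse a quoted design matrix such as "1,1;1,0" into a list of rows."""
    return [[int(value) for value in row.split(",")] for row in text.split(";")]


def day0_design(sample_labels, day0_label):
    """Build a design with a baseline column and one column per non-day0 sample."""
    conditions = [label for label in sample_labels if label != day0_label]
    matrix = []
    for label in sample_labels:
        # Baseline is 1 for every sample; each condition column marks one sample.
        matrix.append([1] + [1 if label == cond else 0 for cond in conditions])
    return matrix, ["baseline"] + conditions


print(parse_design_matrix("1,1;1,0"))   # [[1, 1], [1, 0]]
matrix, columns = day0_design(["day0", "drugA", "drugB"], "day0")
print(columns)                          # ['baseline', 'drugA', 'drugB']
print(matrix)                           # [[1, 0, 0], [1, 1, 0], [1, 0, 1]]
```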

optional arguments for input and output:

Parameter Explanation
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX The prefix of the output file(s). Default sample1.
-i INCLUDE_SAMPLES, --include-samples INCLUDE_SAMPLES Specify the sample labels if the design matrix is not given by file in the --design-matrix option. Sample labels are separated by ",", and must match the labels in the count table.
-b BETA_LABELS, --beta-labels BETA_LABELS Specify the labels of the variables (i.e., beta), if the design matrix is not given by file in the --design-matrix option. Should be separated by ",", and the number of labels must equal the number of columns of the design matrix, including baseline labels. Default value: "beta_0,beta_1,beta_2,...".
--control-sgrna CONTROL_SGRNA A list of control sgRNAs. See the format specification.

Optional arguments for CNV correction:

Parameter Explanation
--cnv-norm CNV_NORM A matrix of copy number variation data across cell lines to normalize CNV-biased sgRNA scores prior to gene ranking.

optional arguments for MLE module:

Parameter Explanation
--debug Debug mode to output detailed information of the running.
--debug-gene DEBUG_GENE Debug mode to only run one gene with specified ID.
--norm-method {none,median,total,control} Method for normalization, including "none" (no normalization), "median" (median normalization, default), "total" (normalization by total read counts), "control" (normalization by control sgRNAs specified by the --control-sgrna option).
--genes-varmodeling GENES_VARMODELING The number of genes for mean-variance modeling. Default 1000.
--permutation-round PERMUTATION_ROUND The number of permutation rounds (integer). The number of permutations is (# genes) * x for x rounds of permutation. Suggested value: 10 (may take longer). Default 2.
--no-permutation-by-group By default, gene permutation is performed separately, by their number of sgRNAs. Turning this option on will perform permutation on all genes together. This makes the program faster, but the p value estimation is accurate only if the number of sgRNAs per gene is approximately the same.
--max-sgrnapergene-permutation MAX_SGRNAPERGENE_PERMUTATION Only permute genes by group if the number of sgRNAs per gene is smaller than this number. This will save a lot of time if some regions are targeted by a large number of sgRNAs (usually hundreds). Must be an integer. Default 100.
--remove-outliers Try to remove outliers. Turning this option on will slow the algorithm.
--threads THREADS Number of threads used to run the algorithm. Default 1.
--adjust-method {fdr,holm,pounds} Method for sgRNA-level p-value adjustment, including false discovery rate (fdr), Holm's method (holm), or Pounds's method (pounds).

optional arguments for the EM iteration:

Parameter Explanation
--sgrna-efficiency SGRNA_EFFICIENCY An optional file of sgRNA efficiency prediction. The efficiency prediction will be used as an initial guess of the probability an sgRNA is efficient. Must contain at least two columns, one containing sgRNA ID, the other containing sgRNA efficiency prediction.
--sgrna-eff-name-column SGRNA_EFF_NAME_COLUMN The sgRNA ID column in sgRNA efficiency prediction file (specified by the --sgrna-efficiency option). Default is 0 (the first column).
--sgrna-eff-score-column SGRNA_EFF_SCORE_COLUMN The sgRNA efficiency prediction column in sgRNA efficiency prediction file (specified by the --sgrna-efficiency option). Default is 1 (the second column).
--update-efficiency Iteratively update sgRNA efficiency during EM iteration.
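
Reading an sgRNA efficiency file with the two column options can be sketched as follows. This is an illustration only: it assumes whitespace-separated columns and skips lines whose score is not numeric (such as a header), neither of which is specified above. The function name read_efficiency and the sgRNA IDs are hypothetical.

```python
def read_efficiency(lines, name_col=0, score_col=1):
    """Return {sgRNA ID: efficiency} using 0-based column positions."""
    scores = {}
    for line in lines:
        fields = line.split()
        if len(fields) <= max(name_col, score_col):
            continue  # not enough columns on this line
        try:
            scores[fields[name_col]] = float(fields[score_col])
        except ValueError:
            continue  # non-numeric score, e.g. a header line
    return scores


example = ["sgRNA\tefficiency", "sg_0001\t0.90", "sg_0002\t0.35"]
print(read_efficiency(example))  # {'sg_0001': 0.9, 'sg_0002': 0.35}
```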

plot

The plot command generates graphics for selected genes. For interactive visualizations, use our new MAGeCK-VISPR algorithm.

usage:

usage: mageck plot [-h] -k COUNT_TABLE -g GENE_SUMMARY [--genes GENES]
                   [-s SAMPLES] [-n OUTPUT_PREFIX]
                   [--norm-method {none,median,total}] [--keep-tmp]

required arguments:

Parameter Explanation
-k COUNT_TABLE, --count-table COUNT_TABLE Provide a tab-separated count table.
-g GENE_SUMMARY, --gene-summary GENE_SUMMARY The gene summary file generated by the test command.

optional arguments:

Parameter Explanation
-h, --help show this help message and exit
--genes GENES A list of genes to be plotted, separated by comma. Default: none.
-s SAMPLES, --samples SAMPLES A list of samples to be plotted, separated by comma. Default: using all samples in the count table.
-n OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX The prefix of the output file(s). Default sample1.
--norm-method {none,median,total} Method for normalization, default median.
--keep-tmp Keep intermediate files.

run (disabled since 0.5.4)

This subcommand allows you to generate comparison results directly from fastq files, with limited parameter settings available. The parameters for the run sub-command are included in the test and count sub-commands. See both sub-commands for more details. It is strongly suggested that users run the count and test commands separately, in order to gain finer control of the results.

Internal programs

These programs are used by MAGeCK internally, but can also be executed by users for other purposes.

RRA

RRA - Robust Rank Aggregation v 0.5.6.

Usage:

Parameter Explanation
-i input_data file Input file name. Format: "item id" "group id" "list id" "value" ["probability"] ["chosen"]
-o output_file Output file name. Format: "group id" "number of items in the group" "lo-value" "false discovery rate"
-p maximum_percentile RRA only considers the items with percentile smaller than this parameter. Default=0.1
--control control_sgrna_list A list of control sgRNA names.
--permutation permutation_round The number of rounds of permutation. Increase this value if the number of genes is small. Default 100.
--no-permutation-by-group By default, gene permutation is performed separately, by their number of sgRNAs. Turning this option on will perform permutation on all genes together. This makes the program faster, but the p value estimation is accurate only if the number of sgRNAs per gene is approximately the same.
--skip-gene gene_name Genes to skip from doing permutation. Specify it multiple times if you need to skip more than one gene.
--min-percentage-goodsgrna min_percentage Filter genes that have too low a percentage of 'good sgrnas', i.e., sgrnas that fall below the -p threshold. Must be a number between 0 and 1. Default 0 (do not filter genes).
--min-number-goodsgrna min_number Filter genes that have too few 'good sgrnas', i.e., sgrnas that fall below the -p threshold. Must be an integer. Default 0 (do not filter genes).
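
The two 'good sgrna' filters can be illustrated with a short Python sketch. It is not RRA's code: the function name, the gene names and the percentile values are hypothetical, and the comparisons are assumed to be inclusive (a gene is kept when it meets the minimum exactly).

```python
def passes_goodsgrna_filter(percentiles, max_percentile=0.1,
                            min_percentage=0.0, min_number=0):
    """Decide whether a gene is kept, given the percentiles of its sgRNAs."""
    if not percentiles:
        return False
    # A 'good' sgRNA has a percentile smaller than the -p threshold.
    good = sum(1 for p in percentiles if p < max_percentile)
    return good >= min_number and good / len(percentiles) >= min_percentage


genes = {
    "geneA": [0.01, 0.05, 0.50, 0.90],   # 2 good sgRNAs out of 4
    "geneB": [0.30, 0.60, 0.80, 0.02],   # 1 good sgRNA out of 4
}
kept = [g for g, p in genes.items() if passes_goodsgrna_filter(p, min_number=2)]
print(kept)  # ['geneA']
```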

mageckGSEA

mageckGSEA is a fast implementation of Gene Set Enrichment Analysis (GSEA) in C++. It is used by MAGeCK for quality controls and pathway enrichment tests. Compared with the official GSEA program, its main advantages are ease of use and extremely fast running speed.

In the gsea/demo folder, an example is provided to run GSEA. Use the following command to perform GSEA analysis based on the ranked gene list in demo1.txt (provided in the demo folder), tested on pathways defined in kegg.ribosome.gmt (provided in the demo folder). The scores in the 2nd column are used to rank genes (-c 1), and 10000 permutations are performed to obtain the p value (-p 10000):

 mageckGSEA -r demo1.txt -g kegg.ribosome.gmt  -c 1 -p 10000

You can provide genes with their scores, as in demo1.txt (genes with smaller scores are ranked first).

SYNRG   0.715581582
SREK1   0.992306809
SLC25A46        0.057411873
COL4A5  0.36387645
CCDC22  -0.463887932
MVD     0.020897922

mageckGSEA will first rank genes based on the provided scores, as long as you indicate which column to use (-c 1).

Or you can provide just the gene ranking, as in demo2.txt.

C5orf64
TTC17
MRPS27
PIGY
GPAA1
KIF4A
EPS15
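
The two input modes can be mimicked with a few lines of Python. This is a sketch of the ranking rule described above (smaller scores first, column numbers starting from 0, column 0 meaning "use the file order"), not the C++ implementation. The function name rank_genes is hypothetical.

```python
def rank_genes(lines, score_column=0, reverse=False):
    """Return gene names in ranked order from rank-file lines."""
    rows = [line.split() for line in lines if line.strip()]
    if score_column > 0:
        # Genes with smaller scores are ranked first.
        rows.sort(key=lambda row: float(row[score_column]))
    genes = [row[0] for row in rows]
    return genes[::-1] if reverse else genes


demo1 = [
    "SYNRG   0.715581582",
    "SREK1   0.992306809",
    "SLC25A46        0.057411873",
    "COL4A5  0.36387645",
    "CCDC22  -0.463887932",
    "MVD     0.020897922",
]
print(rank_genes(demo1, score_column=1))
# ['CCDC22', 'MVD', 'SLC25A46', 'COL4A5', 'SYNRG', 'SREK1']

demo2 = ["C5orf64", "TTC17", "MRPS27"]
print(rank_genes(demo2))  # order of the file is kept
```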

The output is a tab-separated file to report the following statistics of GSEA:

Pathway Size    ES  p   p_permutation   FDR Ranking Hits    LFC
KEGG_RIBOSOME   88  0.3262  0.00240772  0.0043  0.0043  0   32  0

Item Explanation
Pathway The name of the pathway
Size The size of the pathway, i.e., the number of genes
ES Enrichment Score (ES) in GSEA
p The p value of ES
p_permutation The permutation p value of ES (usually more accurate than p)
FDR False Discovery Rate of p_permutation
Ranking The ranking of this pathway
Hits The number of genes that are ranked before ES score. See "Leading Edge" analysis of GSEA
LFC Log fold change (not implemented)

USAGE:

 mageckGSEA  -r rank_file -g gmt_file 
                           [-e] [-s]  [-c score_column] 
                           [-p perm_time]   [-n pathway_name] 
                           [-o output_file]  [--] [--version] [-h]

Parameter Explanation
-e, --reverse_value Reverse the order of the gene.
-s, --sort_byp Sort the pathways by p value.
-c score_column, --score_column score_column The column for gene scores. If you just want to use the ranking of the gene (located at the 1st column), use 0. Otherwise, specify which column should be used to rank the gene. The column number starts from 0. Default: 0.
-p perm_time, --perm_time perm_time Permutations, default 1000.
-n pathway_name, --pathway_name pathway_name Name of the pathway to be tested. If not found, will test all pathways.
-o output_file, --output_file output_file The name of the output file. Use - to print to standard output.
-r rank_file, --rank_file rank_file (required) Rank file. The first column of the rank file must be the gene name.
-g gmt_file, --gmt_file gmt_file (required) The pathway annotation in GMT format.
--version Displays version information and exits.
-h, --help Displays usage information and exits.
