The sgRNA read count file is passed to the -k parameter of the test or run sub-command.
The read count file lists, for each sgRNA, its name, the gene it targets, and the read counts in each sample. Columns are separated by tabs ('\t'). A header line is optional. For example, in the study of T. Wang et al. (Science 2014) there are four CRISPR screening samples, labeled HL60.initial, KBM7.initial, HL60.final and KBM7.final. Here are a few lines of the read count file:
sgRNA gene HL60.initial KBM7.initial HL60.final KBM7.final
A1CF_m52595977 A1CF 213 274 883 175
A1CF_m52596017 A1CF 294 412 1554 1891
A1CF_m52596056 A1CF 421 368 566 759
A1CF_m52603842 A1CF 274 243 314 855
A1CF_m52603847 A1CF 0 50 145 266
The count sub-command produces a read count file in this format.
In the -t/--treatment-id and -c/--control-id parameters, you can specify samples by either sample label or sample index. If sample labels are used, they must match the sample labels in the first line (the header) of the count table. For example, "HL60.final,KBM7.final".
Alternatively, specify samples by index. A sample's index is its position among the sample columns of the sgRNA read count file, starting from 0. In the example above there are four samples, with the following indices:
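As an illustration of the layout described above, here is a minimal Python sketch that reads such a table. It is not MAGeCK's own parser: the function name `read_count_table` is hypothetical, and the sketch assumes that a header line is present and that counts are integers.

```python
import csv
import io

# Two data rows copied from the example above, joined with tabs.
EXAMPLE = (
    "sgRNA\tgene\tHL60.initial\tKBM7.initial\tHL60.final\tKBM7.final\n"
    "A1CF_m52595977\tA1CF\t213\t274\t883\t175\n"
    "A1CF_m52603847\tA1CF\t0\t50\t145\t266\n"
)

def read_count_table(handle):
    """Return (sample_labels, rows) from a tab-separated count table.

    Assumes the first line is a header (hypothetical helper, not part of MAGeCK).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_labels = header[2:]  # everything after the sgRNA and gene columns
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        sgrna, gene = fields[0], fields[1]
        counts = [int(x) for x in fields[2:]]
        if len(counts) != len(sample_labels):
            raise ValueError(
                f"{sgrna}: expected {len(sample_labels)} counts, got {len(counts)}"
            )
        rows.append((sgrna, gene, counts))
    return sample_labels, rows

labels, rows = read_count_table(io.StringIO(EXAMPLE))
print(labels)
print(rows[0])
```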
sample | index |
---|---|
HL60.initial | 0 |
KBM7.initial | 1 |
HL60.final | 2 |
KBM7.final | 3 |
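The label/index correspondence in the table can be sketched in Python as follows. The helper `resolve_samples` is hypothetical and only illustrates the mapping; it does not reproduce how MAGeCK itself parses -t/-c, and it accepts a label or an index per item, which is an assumption of this sketch.

```python
def resolve_samples(spec, sample_labels):
    """Map a comma-separated list of labels or 0-based indices to column indices."""
    indices = []
    for token in spec.split(","):
        token = token.strip()
        if token in sample_labels:
            # A sample label from the header line of the count table.
            indices.append(sample_labels.index(token))
        elif token.isdigit() and int(token) < len(sample_labels):
            # A 0-based sample index.
            indices.append(int(token))
        else:
            raise ValueError(f"unknown sample: {token!r}")
    return indices

sample_labels = ["HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"]
treatment = resolve_samples("HL60.final,KBM7.final", sample_labels)  # [2, 3]
control = resolve_samples("0,1", sample_labels)                      # [0, 1]
print(treatment, control)
```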
The design matrix is a plain-text file describing which conditions affect which samples. In this file, each row is a sample, each column is a condition, and each value is 1 or 0, indicating whether the sample (in the row) is affected by the condition (in the column).
Here is a simple example of a design matrix, based on the study of T. Wang et al. (Science 2014). The CRISPR screens were performed on two cell lines, HL60 and KBM7, and four samples were generated: two corresponding to the initial states of the two cell lines, and two corresponding to their final states. To model the effects in the two cell lines, the design matrix can be written as follows:
Samples baseline HL60 KBM7
HL60.initial 1 0 0
KBM7.initial 1 0 0
HL60.final 1 1 0
KBM7.final 1 0 1
Here are some important rules of the design matrix:
Note: the order of the samples in the design matrix may change the results, because the preprocessing steps that remove outliers depend on it. It is good practice to always place the initial samples (such as day0 or plasmid) in the first rows of the design matrix.
When starting from fastq files, MAGeCK needs to know each sgRNA sequence and the gene it targets. This information is provided in the sgRNA library file, which is specified with the -l/--list-seq option of the run or count sub-command.
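To make the row/column reading of the example matrix concrete, here is a small Python sketch that parses it and lists the samples affected by each condition. The function `parse_design_matrix` is a hypothetical helper, not MAGeCK code, and it assumes whitespace-separated fields with a header line. It checks only what is stated above: every value is 0 or 1 and every row has one value per condition.

```python
DESIGN = """\
Samples baseline HL60 KBM7
HL60.initial 1 0 0
KBM7.initial 1 0 0
HL60.final 1 1 0
KBM7.final 1 0 1
"""

def parse_design_matrix(text):
    """Return (conditions, {sample: [0/1, ...]}) from a design matrix."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    conditions = lines[0][1:]  # header: first field is the sample column name
    matrix = {}
    for fields in lines[1:]:
        sample, values = fields[0], fields[1:]
        if len(values) != len(conditions):
            raise ValueError(f"{sample}: expected {len(conditions)} values")
        if any(v not in ("0", "1") for v in values):
            raise ValueError(f"{sample}: values must be 0 or 1")
        matrix[sample] = [int(v) for v in values]
    return conditions, matrix

conditions, matrix = parse_design_matrix(DESIGN)

# For each condition (column), collect the samples (rows) marked with 1.
affected = {
    cond: [sample for sample, row in matrix.items() if row[i] == 1]
    for i, cond in enumerate(conditions)
}
print(affected)
```

Because the rows are kept in file order, the same structure can be used to check that the initial samples come first.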
The sgRNA library file can be provided either in .txt format or in .csv format. There are three columns in the library file: the sgRNA ID, the sequence, and the gene it is targeting. One example of the library file is provided as library.txt in demo2:
s_10007 TGTTCACAGTATAGTTTGCC CCNA1
s_10008 TTCTCCCTAATTGCTTGCTG CCNA1
s_10027 ACATGTTGCTTCCCCTTGCA CCNC
If provided in .csv format, the file will look like:
s_10007,TGTTCACAGTATAGTTTGCC,CCNA1
s_10008,TTCTCCCTAATTGCTTGCTG,CCNA1
s_10027,ACATGTTGCTTCCCCTTGCA,CCNC
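The two layouts carry the same three columns, which the following Python sketch illustrates by parsing both into the same structure. The function `parse_library` is hypothetical and is not how MAGeCK reads the file; it assumes there is no header line and chooses the separator by checking whether a line contains a comma.

```python
TXT_LINES = [
    "s_10007\tTGTTCACAGTATAGTTTGCC\tCCNA1",
    "s_10008\tTTCTCCCTAATTGCTTGCTG\tCCNA1",
    "s_10027\tACATGTTGCTTCCCCTTGCA\tCCNC",
]
CSV_LINES = [
    "s_10007,TGTTCACAGTATAGTTTGCC,CCNA1",
    "s_10008,TTCTCCCTAATTGCTTGCTG,CCNA1",
    "s_10027,ACATGTTGCTTCCCCTTGCA,CCNC",
]

def parse_library(lines):
    """Return {sgRNA ID: (sequence, gene)} from .txt or .csv style lines."""
    library = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption of this sketch: a comma means .csv, otherwise whitespace/tab.
        fields = line.split(",") if "," in line else line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        sgrna_id, sequence, gene = fields
        library[sgrna_id] = (sequence, gene)
    return library

from_txt = parse_library(TXT_LINES)
from_csv = parse_library(CSV_LINES)
print(from_txt == from_csv)
print(from_txt["s_10027"])
```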
When using the --control-sgrna option, users need to provide a plain text file containing only the negative control sgRNA IDs (one per line). For example,
NonTargetingControlGuideForHuman_0001
NonTargetingControlGuideForHuman_0002
NonTargetingControlGuideForHuman_0003
NonTargetingControlGuideForHuman_0004
Some systems may read only 1 control sgRNA ID. Please look at this Q&A for solutions.
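As a quick way to check how many IDs a control file actually contains, the following Python sketch reads one ID per line. The helper `read_control_ids` is hypothetical and not part of MAGeCK. It relies on `str.splitlines()`, which accepts Unix, Windows and old Mac line endings, so the count does not depend on where the file was created.

```python
def read_control_ids(text):
    """Return the non-empty lines of a control sgRNA file, one ID per line."""
    # str.splitlines() splits on "\n", "\r\n" and "\r" alike.
    return [line.strip() for line in text.splitlines() if line.strip()]

unix_text = (
    "NonTargetingControlGuideForHuman_0001\n"
    "NonTargetingControlGuideForHuman_0002\n"
    "NonTargetingControlGuideForHuman_0003\n"
    "NonTargetingControlGuideForHuman_0004\n"
)
# The same content with carriage-return-only line endings.
cr_text = unix_text.replace("\n", "\r")

ids_unix = read_control_ids(unix_text)
ids_cr = read_control_ids(cr_text)
print(len(ids_unix), len(ids_cr))
```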
The GMT file format stores pathway information and is the same GMT format used by Gene Set Enrichment Analysis (GSEA). The details of the GMT format can be found on the GSEA website.
You can also download pathway files from the GSEA MSigDB database. They can be used directly by MAGeCK.
The sgRNA/gene mapping file is passed to the --gene-test parameter of the test or run sub-command.
This file lists the names of the sgRNAs and their corresponding genes, separated by a tab ('\t'). For example:
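For orientation, each line of a GMT file holds a gene set name, a description, and then the member genes, all separated by tabs. The Python sketch below reads that layout; `parse_gmt` is a hypothetical helper, and the gene set name and description in the example line are made up for illustration.

```python
# One made-up GMT line: name, description, then the genes (tab-separated).
EXAMPLE_GMT = "EXAMPLE_CYCLIN_SET\tillustrative description\tCCNA1\tCCNC\n"

def parse_gmt(lines):
    """Return {gene set name: [genes]} from GMT-formatted lines."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        name, genes = fields[0], fields[2:]  # fields[1] is the description
        gene_sets[name] = [g for g in genes if g]
    return gene_sets

gene_sets = parse_gmt(EXAMPLE_GMT.splitlines())
print(gene_sets)
```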
A1CF_m52595977 A1CF
A1CF_m52596017 A1CF
A1CF_m52596056 A1CF
A1CF_m52603842 A1CF
A1CF_m52603847 A1CF
A1CF_p52595870 A1CF
A1CF_p52595881 A1CF
A1CF_p52596023 A1CF
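A minimal Python sketch of how such a mapping can be grouped by gene is shown below. The function `group_by_gene` is hypothetical and only illustrates the two-column, tab-separated layout; it is not taken from MAGeCK.

```python
from collections import defaultdict

# Three lines from the example above, with a tab between the two columns.
MAPPING = (
    "A1CF_m52595977\tA1CF\n"
    "A1CF_m52596017\tA1CF\n"
    "A1CF_p52595870\tA1CF\n"
)

def group_by_gene(text):
    """Return {gene: [sgRNA names]} from a tab-separated mapping file."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        sgrna, gene = line.split("\t")  # exactly two columns are expected
        groups[gene].append(sgrna)
    return dict(groups)

groups = group_by_gene(MAPPING)
print(groups)
```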