input
Wei Li

Input file specification

sgRNA read count file

The sgRNA read count file is passed with the -k parameter of the test or run sub-command.

The read count file lists the name of each sgRNA, the gene it targets, and then the read counts in each sample. Fields are separated by tabs ('\t'). A header line is optional. For example, the study of T. Wang et al. (Science 2014) includes 4 CRISPR screening samples, labeled HL60.initial, KBM7.initial, HL60.final, and KBM7.final. Here are a few lines of the read count file:

sgRNA           gene    HL60.initial    KBM7.initial    HL60.final      KBM7.final
A1CF_m52595977  A1CF    213             274             883             175
A1CF_m52596017  A1CF    294             412             1554            1891
A1CF_m52596056  A1CF    421             368             566             759
A1CF_m52603842  A1CF    274             243             314             855
A1CF_m52603847  A1CF    0               50              145             266

The count sub-command writes its read count file in this format.
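The layout described above can be checked before running an analysis. The following is a minimal Python sketch, not part of MAGeCK: the function name read_count_table is hypothetical, and it assumes that a header line is present and that all counts are integers.

```python
import csv
import io


def read_count_table(handle):
    """Parse a tab-separated read count table that has a header line.

    Returns (sample_labels, rows), where rows is a list of
    (sgrna_id, gene, [counts...]) tuples.
    Assumption: a header is present and counts are integers.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]  # every column after sgRNA and gene is a sample
    rows = []
    for line_no, fields in enumerate(reader, start=2):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(fields)}"
            )
        counts = [int(value) for value in fields[2:]]
        rows.append((fields[0], fields[1], counts))
    return samples, rows


example = (
    "sgRNA\tgene\tHL60.initial\tKBM7.initial\tHL60.final\tKBM7.final\n"
    "A1CF_m52595977\tA1CF\t213\t274\t883\t175\n"
    "A1CF_m52603847\tA1CF\t0\t50\t145\t266\n"
)
samples, rows = read_count_table(io.StringIO(example))
print(samples)
print(rows[0])
```

A row with a missing or extra field raises a ValueError that names the offending line, which is usually the quickest way to spot a file that was saved with spaces instead of tabs.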

Sample index

In the -t/--treatment-id and -c/--control-id parameters, you can specify samples by either sample label or sample index. If sample labels are used, they must match the sample labels in the first line of the count table. For example, "HL60.final,KBM7.final".

You can also specify samples by index. A sample's index is its position among the sample columns of the sgRNA read count file, counting from 0. In the example above, there are four samples, and the index of each sample is as follows:

sample index
HL60.initial 0
KBM7.initial 1
HL60.final 2
KBM7.final 3
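To make the label/index correspondence concrete, here is a small Python sketch. It is an illustration only, not MAGeCK's own parsing code: resolve_samples is a hypothetical helper, and trying labels before indices is an assumption of this sketch.

```python
def resolve_samples(spec, sample_labels):
    """Turn a comma-separated list of labels or 0-based indices into indices.

    Assumption: a token is treated as a label first, then as an index.
    """
    resolved = []
    for token in spec.split(","):
        token = token.strip()
        if token in sample_labels:
            resolved.append(sample_labels.index(token))
        elif token.isdigit() and int(token) < len(sample_labels):
            resolved.append(int(token))
        else:
            raise ValueError(f"unknown sample: {token!r}")
    return resolved


labels = ["HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"]
print(resolve_samples("HL60.final,KBM7.final", labels))  # [2, 3]
print(resolve_samples("0,1", labels))                    # [0, 1]
```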

design matrix file

The design matrix is a txt file indicating the effects of different conditions on different samples. In this file, each row is a sample, each column is a condition, and the value is 1 or 0, indicating whether the sample (in the row) is affected by the condition (in the column).

Here is a simple example of a design matrix, based on the study of T. Wang et al. (Science 2014). The CRISPR screens were done on two cell lines, HL60 and KBM7, and produced four samples: two corresponding to the initial states of the two cell lines, and two corresponding to their final states. To model the effects of the two cell lines, you can use the following design matrix:

Samples        baseline        HL60        KBM7
HL60.initial   1               0           0
KBM7.initial   1               0           0
HL60.final     1               1           0
KBM7.final     1               0           1

Here are some important rules of the design matrix:

  • The design matrix file must include a header line of condition labels.
  • The first column contains the sample labels, which must match the labels in the read count file (see the example in the sgRNA read count file section above).
  • The second column must be a "baseline" column in which every value is "1".
  • Every element of the design matrix is either "0" or "1".
  • At least one sample must be an "initial state" sample (e.g., day 0 or plasmid) whose row contains exactly one "1", and that "1" must be in the baseline column.

Note: the order of the samples in the design matrix may change the results, because there are preprocessing steps that remove outliers. It is good practice to always place initial samples (such as day 0 or plasmid) in the first rows of the design matrix.
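The design matrix rules listed above can be checked mechanically. The sketch below is a hypothetical helper (check_design_matrix is not a MAGeCK function). It assumes the file is whitespace- or tab-delimited and that sample labels contain no spaces, and it does not check the row order mentioned in the note.

```python
def check_design_matrix(text, count_labels):
    """Return a list of rule violations found in a design matrix.

    Assumptions: the first line is the header, fields are separated by
    whitespace, and sample labels contain no spaces.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    problems = []
    has_initial = False
    for row in rows:
        label, values = row[0], row[1:]
        if label not in count_labels:
            problems.append(f"{label}: not a sample label in the count table")
        if len(values) != len(header) - 1:
            problems.append(f"{label}: wrong number of values")
            continue
        if any(v not in ("0", "1") for v in values):
            problems.append(f"{label}: values must be 0 or 1")
            continue
        if values[0] != "1":
            problems.append(f"{label}: baseline value must be 1")
        elif values.count("1") == 1:
            has_initial = True  # baseline is the only 1 in this row
    if not has_initial:
        problems.append("no initial-state sample (baseline-only row)")
    return problems


labels = ["HL60.initial", "KBM7.initial", "HL60.final", "KBM7.final"]
design = """Samples        baseline        HL60        KBM7
HL60.initial   1               0           0
KBM7.initial   1               0           0
HL60.final     1               1           0
KBM7.final     1               0           1
"""
print(check_design_matrix(design, labels))  # [] for the example above
```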

sgRNA library file

When starting from fastq files, MAGeCK needs to know the sequence of each sgRNA and the gene it targets. This information is provided in the sgRNA library file, which is specified with the -l/--list-seq option of the run or count sub-command.

The sgRNA library file can be provided either in .txt format or in .csv format. There are three columns in the library file: the sgRNA ID, the sequence, and the gene it is targeting. One example of the library file is provided as library.txt in demo2:

s_10007 TGTTCACAGTATAGTTTGCC    CCNA1
s_10008 TTCTCCCTAATTGCTTGCTG    CCNA1
s_10027 ACATGTTGCTTCCCCTTGCA    CCNC

If provided in .csv format, the file will look like:

s_10007,TGTTCACAGTATAGTTTGCC,CCNA1
s_10008,TTCTCCCTAATTGCTTGCTG,CCNA1
s_10027,ACATGTTGCTTCCCCTTGCA,CCNC
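Since the two formats carry the same three columns, a single reader can handle both. The following Python sketch is illustrative only: read_library is a hypothetical name, the .txt variant is assumed to be whitespace- or tab-delimited, and rejecting duplicate sgRNA IDs is an assumption of this sketch rather than a documented MAGeCK rule.

```python
import csv
import io


def read_library(handle, fmt):
    """Read a three-column sgRNA library into {sgrna_id: (sequence, gene)}.

    fmt is "csv" for comma-separated files, anything else for .txt files.
    Assumption: .txt files are whitespace- or tab-delimited.
    """
    if fmt == "csv":
        records = csv.reader(handle)
    else:
        records = (line.split() for line in handle)
    library = {}
    for fields in records:
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {fields}")
        sgrna_id, sequence, gene = fields
        if sgrna_id in library:
            raise ValueError(f"duplicate sgRNA ID: {sgrna_id}")
        library[sgrna_id] = (sequence, gene)
    return library


txt_text = (
    "s_10007 TGTTCACAGTATAGTTTGCC    CCNA1\n"
    "s_10008 TTCTCCCTAATTGCTTGCTG    CCNA1\n"
    "s_10027 ACATGTTGCTTCCCCTTGCA    CCNC\n"
)
csv_text = (
    "s_10007,TGTTCACAGTATAGTTTGCC,CCNA1\n"
    "s_10008,TTCTCCCTAATTGCTTGCTG,CCNA1\n"
    "s_10027,ACATGTTGCTTCCCCTTGCA,CCNC\n"
)
from_txt = read_library(io.StringIO(txt_text), "txt")
from_csv = read_library(io.StringIO(csv_text), "csv")
print(from_txt == from_csv)
print(from_txt["s_10027"])
```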

negative control sgRNA list

When using the --control-sgrna option, users need to provide a plain text file containing only the negative control sgRNA IDs, one per line. For example,

NonTargetingControlGuideForHuman_0001
NonTargetingControlGuideForHuman_0002
NonTargetingControlGuideForHuman_0003
NonTargetingControlGuideForHuman_0004

Some systems may read only 1 control sgRNA ID. Please look at this Q&A for solutions.
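A quick way to see how many IDs a file really contains is to count them yourself. The Python sketch below uses a hypothetical helper, read_control_ids, and relies on str.splitlines, which recognizes \n, \r\n, and bare \r line endings. Whether line endings are the cause discussed in the Q&A is an assumption here, so treat this only as a diagnostic.

```python
def read_control_ids(text):
    """Return the non-empty, stripped lines of a control sgRNA list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# The same four IDs with three different line-ending styles.
unix_style = (
    "NonTargetingControlGuideForHuman_0001\n"
    "NonTargetingControlGuideForHuman_0002\n"
    "NonTargetingControlGuideForHuman_0003\n"
    "NonTargetingControlGuideForHuman_0004\n"
)
windows_style = unix_style.replace("\n", "\r\n")
bare_cr_style = unix_style.replace("\n", "\r")

for name, text in [
    ("LF", unix_style),
    ("CRLF", windows_style),
    ("CR", bare_cr_style),
]:
    print(name, len(read_control_ids(text)))
```

If this count differs from the number of control sgRNAs reported by your run, compare the file's line endings with what your system expects.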

pathway file (gmt)

The GMT file format stores pathway information and is the same GMT format used in Gene Set Enrichment Analysis (GSEA). The details of the GMT format can be found on the GSEA website.

You can also download pathway files from the GSEA MSigDB database. They can be used by MAGeCK without modification.
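In the GMT format, each line is tab-separated: the gene set name, a description, and then the member genes. A minimal Python sketch for reading such a file follows. The function name read_gmt and the two pathway names in the example are made up for illustration; only the gene symbols are taken from this page.

```python
import io


def read_gmt(handle):
    """Parse GMT lines into {set_name: [genes...]}.

    Each line is: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    Lines with no genes are skipped.
    """
    pathways = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # no genes on this line
        name, genes = fields[0], fields[2:]
        pathways[name] = [gene for gene in genes if gene]
    return pathways


# Hypothetical pathway names, used only to show the layout.
gmt_text = (
    "EXAMPLE_CYCLIN_SET\texample description\tCCNA1\tCCNC\n"
    "EXAMPLE_OTHER_SET\texample description\tA1CF\n"
)
pathways = read_gmt(io.StringIO(gmt_text))
print(pathways["EXAMPLE_CYCLIN_SET"])
```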

sgRNA/gene mapping file (deprecated after version 0.3)

The sgRNA/gene mapping file is passed with the --gene-test parameter of the test or run sub-command.

This file lists the names of the sgRNAs and their corresponding genes, separated by a tab ('\t'). For example:

A1CF_m52595977  A1CF
A1CF_m52596017  A1CF
A1CF_m52596056  A1CF
A1CF_m52603842  A1CF
A1CF_m52603847  A1CF
A1CF_p52595870  A1CF
A1CF_p52595881  A1CF
A1CF_p52596023  A1CF



