| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| data_sample.zip | 2016-08-03 | 111.4 MB | |
| create_d_set_from_list_of_files_instance.R | 2016-08-03 | 757 Bytes | |
| load_saved_d_set_and_label_self_locus.R | 2016-08-03 | 3.4 kB | |
| load_saved_d_set_and_label_self_locus_instance.R | 2016-08-03 | 374 Bytes | |
| create_d_set_from_list_of_files.R | 2016-08-03 | 6.0 kB | |
| list_of_filered_TF_tags_regionbinsize=500.RDS | 2016-08-03 | 913.8 kB | |
| list_of_file_files_TF_tag_filtered.txt | 2016-08-03 | 1.7 kB | |
| Compute_accessibility.R | 2016-08-03 | 2.5 kB | |
| Accessibility_per_region_bin.txt | 2016-08-03 | 223.7 kB | |
| roc_plot.R | 2016-08-03 | 3.3 kB | |
| gffRead.R | 2016-08-03 | 1.2 kB | |
| read_gff_annot.R | 2016-08-03 | 2.6 kB | |
| closestinterval2_fast.R | 2016-08-03 | 3.6 kB | |
| closestinterval2_sorted.R | 2016-08-03 | 4.0 kB | |
| fimo_read.R | 2016-08-03 | 1.7 kB | |
| closestinterval2.R | 2016-08-03 | 3.5 kB | |
| README.txt | 2016-08-03 | 4.8 kB | |
| Totals: 17 Items | | 112.6 MB | 0 |
#This document describes the Inference of Accessibility (IA) algorithm, which is used to compute DNA accessibility in bacteria from ChIP-seq data.
This pipeline is called BRACIL+IA because binding motifs are inferred with the BRACIL method (Gomes et al., Gen. Res. 2014) and the IA algorithm is then applied.
Citation: Gomes ALC, Wang HH (2016) The Role of Genome Accessibility in Transcription Factor Binding in Bacteria. PLoS Comput Biol 12(4): e1004891. doi:10.1371/journal.pcbi.1004891
This document is divided into four parts:
I. Requirements
II. Data preparation
III. Computing DNA accessibility
IV. Peak calling de novo and downstream analysis
I. Requirements
To run BRACIL+IA you need the following:
(i) A set of bacterial ChIP-seq data obtained under the same growth condition:
The original analysis was performed using the dataset from: http://networks.systemsbiology.net/mtb/
(ii) MEME/FIMO package (performs motif finding and scanning):
- http://meme.nbcr.net/meme/
(iii) A peak caller of your choice:
(e.g. MACS)
(iv) BRACIL (deconvolves ChIP-seq data and identifies binding motifs):
- Gomes et al., Gen. Res. 2014; doi: 10.1101/gr.161711.113;
- https://sourceforge.net/projects/bracil/
(v) R:
- BRASILIA was written in R.
Items (ii) and (iii) can be replaced according to user preference; the code should then be adapted accordingly.
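Before running the pipeline, it can help to confirm that the external tools listed above are reachable from R. The snippet below is a minimal sketch only: the executable names `meme`, `fimo`, and `macs2` are assumptions about a typical installation (MACS in particular may be installed under a different name) and should be edited to match your setup.

```r
# Hypothetical executable names; adjust them to your installation
# (for example, MACS may be installed as "macs2" or "macs3").
tools <- c("meme", "fimo", "macs2")

# Sys.which() returns the full path of each executable, or "" if it is
# not found on the PATH.
found <- Sys.which(tools)

for (tool in tools) {
  status <- if (nzchar(found[[tool]])) found[[tool]] else "NOT FOUND on PATH"
  cat(sprintf("%-6s %s\n", tool, status))
}

# Report the R version used to run the scripts.
cat("R version:", as.character(getRversion()), "\n")
```

A tool reported as not found is not necessarily missing; it may simply be installed outside the PATH visible to R.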
II. Data preparation
(i) Transform ChIP-seq coverage into `.cov` format. The `.cov` is a 4 column format with columns: genome_position, count_total, count_of_negative_strand_reads, cov_of_positive_strand_reads.
(ii) Run BRACIL. The BRACIL output includes a motif file (`.meme`).
- Run a peak caller of your choice (e.g. MACS) before BRACIL to identify enriched regions.
(iii) Select the ChIP-seq experiments with a valid binding motif (e.g. E-value < 10^-4).
(iv) Run FIMO to scan the organism genome for the binding motif of each experiment.
(v) Create `d_set`, the data frame with the organized data:
- Use the script create_d_set_from_list_of_files.R (see create_d_set_from_list_of_files_instance.R for an example).
- The input is a four-column file; each row contains:
  - `TF_tag` (a unique label identifying the ChIP-seq experiment)
  - `regions_set_file` (a file containing enriched regions; it is not essential for inferring genome accessibility, but it is important for de novo ChIP-seq peak analysis)
  - `cov_file` (the ChIP-seq data in `.cov` format)
  - `fimo_file` (the FIMO output from scanning the organism genome for the motif corresponding to `TF_tag`)
- `d_set` is saved with the default name `list_of_filered_TF_tags_regionbinsize=500.RDS`.
- `d_set` contains the following fields:
  - "TF_tag" (a unique label identifying the ChIP-seq experiment)
  - "mean_cov_Z" (the total count of coverage per ChIP-seq experiment; equivalent to Z in the definition of the partition function)
  - "region_bin" (the center of the region)
  - "l10_pvalue_max" (the maximum log10(p-value) of the motif matches)
  - "n_motifs" (the number of motifs found in the region)
  - "l10_pvalue_all" (product of log10(p-value) of all motifs inside the region)
  - "variable" (it has the single value `fw+rv`, indicating that coverage was computed from forward + reverse coverage)
  - "mean_cov" (mean coverage of the corresponding `TF_tag` in `region_bin`)
  - "called_as_enriched" (boolean; its value is 1 when the region belongs to a called peak)
  - "regions_bin_size" (bin size; 500 is the default value)
  - "mean_cov_n" (normalized coverage, equal to mean_cov/mean_cov_Z)
  - "TF_locus" (gene name for the TF)
  - "self_locus" (it is `true` if region_bin overlaps with the locus that contains genes)
  - "region_bin_factor" (region_bin represented as a `factor`)
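To make the relation between the `.cov` input and a few of the `d_set` fields concrete, here is a minimal sketch on toy data. It is not the code in create_d_set_from_list_of_files.R. Several points are assumptions made for illustration: that `.cov` files are tab-delimited without a header, that count_total is the sum of the two strand counts, that bins are consecutive non-overlapping windows whose center is `region_bin`, and that `mean_cov_Z` is the sum of `mean_cov` over the bins of one experiment. Only the relation mean_cov_n = mean_cov/mean_cov_Z is taken directly from the field list above.

```r
# Toy coverage in the 4-column .cov layout described in step (i).
# Assumptions: tab-delimited, no header line, and
# count_total = negative-strand + positive-strand counts.
set.seed(1)
n_pos <- 2000
neg <- rpois(n_pos, lambda = 5)
pos <- rpois(n_pos, lambda = 5)
cov_toy <- data.frame(genome_position = seq_len(n_pos),
                      count_total = neg + pos,
                      count_neg = neg,
                      count_pos = pos)

cov_file <- tempfile(fileext = ".cov")
write.table(cov_toy, cov_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back the way a .cov file might be loaded.
cov <- read.table(cov_file, header = FALSE,
                  col.names = c("genome_position", "count_total",
                                "count_neg", "count_pos"))

# Assumption: bins are consecutive, non-overlapping windows of
# regions_bin_size bp, and region_bin is the window center.
regions_bin_size <- 500
bin_index <- (cov$genome_position - 1) %/% regions_bin_size
cov$region_bin <- bin_index * regions_bin_size + regions_bin_size / 2

# mean_cov: mean total (fw+rv) coverage per bin.
d_toy <- aggregate(count_total ~ region_bin, data = cov, FUN = mean)
names(d_toy)[names(d_toy) == "count_total"] <- "mean_cov"
d_toy$TF_tag <- "TF_example"  # hypothetical label

# Assumption: mean_cov_Z is taken here as the sum of mean_cov over all
# bins of the experiment; the real script may define it differently.
d_toy$mean_cov_Z <- sum(d_toy$mean_cov)

# From the field list: mean_cov_n = mean_cov / mean_cov_Z.
d_toy$mean_cov_n <- d_toy$mean_cov / d_toy$mean_cov_Z
d_toy$region_bin_factor <- factor(d_toy$region_bin)

print(d_toy)
```

Since the saved `d_set` has an `.RDS` extension, it can presumably be loaded with `readRDS("list_of_filered_TF_tags_regionbinsize=500.RDS")` and inspected with `str()` to compare its columns against the list above.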
III. Computing DNA accessibility
Run `Compute_accessibility.R`.
- This is an example script that runs the pipeline. It outputs Accessibility_per_region_bin.txt.
- Adapt the code and paths as needed.
IV. Peak calling de novo and downstream analysis
(i) The model provides a quantitative metric of genome accessibility that can be used to test hypotheses and to investigate the role of genome accessibility in functional features.
(ii) The genome accessibility predictions can be used for de novo ChIP-seq peak prediction. Re-scale the accessibility parameters to the range 0 to 1, and do the same for the motif parameters. Use the sum accessibility_{0:1} + motif_score_{0:1} as the metric for de novo peak calling.
(iii) The model can be used to predict transcription factor (TF) binding for TFs whose binding PWM is available, without the need to perform ChIP-seq experiments.
(iv) I encourage others to improve the method of de novo ChIP-seq prediction. The current method is a proof of principle, and there is still room for improvement.
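The scoring described in item (ii) can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: "re-scale from 0 to 1" is read as min-max scaling, the toy values are invented, and the column names `accessibility`, `motif_score`, and `denovo_score` are hypothetical. The motif score is assumed to be oriented so that larger values mean stronger matches.

```r
# Min-max rescaling to [0, 1] (assumed reading of "re-scale from 0 to 1").
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant input: avoid 0/0
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy values, invented for illustration only.
toy <- data.frame(
  region_bin    = c(250, 750, 1250, 1750, 2250),
  accessibility = c(0.2, 1.5, 0.9, 3.0, 0.1),  # hypothetical accessibility estimates
  motif_score   = c(4.0, 6.5, 2.0, 5.0, 7.0)   # hypothetical, higher = stronger match
)

toy$accessibility_01 <- rescale01(toy$accessibility)
toy$motif_score_01   <- rescale01(toy$motif_score)

# Metric from item (ii): accessibility_{0:1} + motif_score_{0:1}.
toy$denovo_score <- toy$accessibility_01 + toy$motif_score_01

# Rank regions from most to least likely to be bound under this metric.
ranked <- toy[order(toy$denovo_score, decreasing = TRUE), ]
print(ranked[, c("region_bin", "denovo_score")])
```

Because both terms lie between 0 and 1, the combined score lies between 0 and 2. Any threshold on it for calling a peak would have to be chosen by the user, for example by comparison against regions with called peaks.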