This README file provides a brief tutorial on how to use BRACIL.
The details of the method are described in the Gomes et al. manuscript.

Last update: Aug/01/2014

I. Requirements
II. Quick usage
III. Background overview
IV. Detailed description of input data
V. Description of the main tasks provided by the code.
	(i). Resolution estimation
	(ii). Cooperative interaction prediction
	(iii). Estimation of binding event p-value.
VI. Variations in binding event prediction



%%I. Requirements:

Before using BRACIL, make sure you have the MEME and MCR packages installed.
The MEME package is necessary for motif discovery.
MCR stands for MATLAB Compiler Runtime; it contains the libraries necessary to run the compiled version of the code.

The links for both packages are:

MEME:
  -http://meme.nbcr.net/meme/
MCR:
  -http://www.mathworks.com/products/compiler/mcr/
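
A quick way to confirm that the MEME tools are reachable before running BRACIL is to look them up on the PATH. The sketch below is only a convenience check written for this README, not part of BRACIL. It assumes the MEME Suite executables are named `meme` and `fimo`. The MCR is a set of shared libraries rather than a command, so it is not checked here.

```python
import shutil


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# "meme" and "fimo" are assumed to be the installed executable names.
for tool, path in find_tools(["meme", "fimo"]).items():
    print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as not found, add its install directory to PATH before running the quick usage scripts.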




%%II. Quick usage

We provide a compiled version of BRACIL that permits you to run it from a command line.
It was compiled and tested on a Linux environment.

Download the data and the compiled file from SourceForge. There are three bash files for a quick start, one for each task: binding site identification, cooperative interaction prediction, and p-value estimation.
Each file lists the paths of the files necessary to run BRACIL. Make sure to edit them according to where your data is located. The files `27_08_covset.txt`, `h37rv_annot.gt`, and `h37rv.gt` also contain paths and should be edited as well.

After editing, the quick usage is:

sh test_bracil_pipeline_rescue.sh
(for binding site identification)

sh test_bracil_cooperative_interaction.sh
(for cooperative interaction prediction)

sh test_bracil_event_pvalue.sh
(for p-value estimation)
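
Editing the paths in the table files by hand is error-prone, so the step can be scripted. The sketch below is a hypothetical helper, not shipped with BRACIL. It assumes that each line of a table file starts with the file path (the first column) and rewrites only a leading directory prefix, leaving the rest of the line untouched. The directory names and the chromosome label in the example are made up.

```python
def repoint_table(text, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix at the start of every line of a table file."""
    lines = []
    for line in text.splitlines():
        if line.startswith(old_prefix):
            # Only the leading path prefix changes; delimiters and labels are kept.
            line = new_prefix + line[len(old_prefix):]
        lines.append(line)
    return "\n".join(lines) + "\n"


# Hypothetical content of a genome table (path, chromosome label).
example = "/old/location/data/resolution/h37rv.fa\tchrA\n"
print(repoint_table(example, "/old/location", "/home/user/bracil"), end="")
```

Read a table file into a string, pass it through `repoint_table`, and write the result back to apply the change.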



%%III. Background overview

BRACIL refines the resolution of enriched regions by predicting binding sites at high resolution.
It does so by integrating ChIP-seq coverage with the genome sequence and using
a blind-deconvolution algorithm that identifies the binding sites. The enriched regions to be refined can be obtained with any peak caller, according to user preference.

This integrated model provides a more robust method to identify binding sites at high resolution, with better sensitivity and specificity than methods based only on peak callers or on motif sequence.
It uses both sequence conservation and ChIP-seq coverage to predict binding site locations.
BRACIL also predicts cooperative interaction.

The details about the method are in the manuscript Gomes et al., Decoding ChIP-Seq peaks with a double-binding signal refines binding peaks to single-nucleotide and predicts cooperative interaction. (in press at the time this README file was written).
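
To make the idea of deconvolution concrete, the sketch below simulates the forward problem only: coverage is modeled as a sum of copies of one impulse response, with one copy centered on each binding site. It is not BRACIL's code. The Gumbel shape is borrowed from the default `ftype` option described in section V; the scale of 20 bp, the site positions, and all function names are made-up values for illustration.

```python
import math


def gumbel_pdf(x, mu=0.0, beta=20.0):
    """Gumbel density with mode mu and scale beta (illustrative parameters)."""
    z = (x - mu) / beta
    return math.exp(-(z + math.exp(-z))) / beta


def simulate_coverage(length, sites, beta=20.0):
    """Sum one scaled impulse response per (position, amplitude) site."""
    cov = [0.0] * length
    for pos, amp in sites:
        for x in range(length):
            cov[x] += amp * gumbel_pdf(x - pos, beta=beta)
    return cov


# One isolated site, and two nearby sites whose signals overlap.
single = simulate_coverage(300, [(100, 1.0)])
double = simulate_coverage(300, [(100, 1.0), (130, 1.0)])
print("peak of single-site coverage:", max(range(300), key=single.__getitem__))
print("coverage at 115 (single, double):", single[115], double[115])
```

Deconvolution is the inverse task: given the coverage, recover the site positions and amplitudes. It is called blind when the impulse response itself must also be estimated from the data.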



%%IV. Detailed description of input data


The folders (…)/data/resolution and (…)/data/cooperative_interaction contain the test dataset for BRACIL.
The files are:

(…)/data/resolution:

  27_08_covset.txt : `covset` file. Two-column file. The first column indicates the coverage file and the second column the corresponding chromosome. Make sure to update the paths in this file.
  27_8_sample.cov : Text version of the 27_8 coverage.
  27_8_sample.cov.mat : Binary version of the 27_8 coverage. It is faster to read and requires less disk space.
  DosR_motif.meme : Instance of a DosR motif predicted by MEME.
  h37rv_4thMarkovOrder.bfile : 4th-order Markov background file of the h37rv genome. It is required by MEME and FIMO.
  h37rv_annot.gt : Annotation genome table. Two-column file. The first column is the path to an annotation file and the second column is the corresponding chromosome. It is not essential.
  h37rv.fa : FASTA file with the h37rv genome sequence.
  h37rv.gff : Annotation file for h37rv.
  h37rv.gt : Genome table for h37rv. Two-column file. The first column indicates the path to the FASTA file and the second column indicates the corresponding chromosome.
  reference_sites_chauhan.txt : Reference set of DosR binding sites described by Chauhan et al.
  regions_chauhan.txt : Regions that contain the reference sites in `reference_sites_chauhan.txt`.
  regions_set.txt : Set of enriched regions for the DosR dataset.
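
Before a long run it can help to check that a text coverage file is well formed. The sketch below reads the four-column layout given for `cov_file` in section V (genome_position, cov_total_count, count_negative_strand_tags, cov_positive_strand_tag). It assumes whitespace-separated numeric columns, which this README does not state explicitly, and the sample values are invented.

```python
from typing import List, NamedTuple


class CovRow(NamedTuple):
    position: int
    total: float
    negative: float
    positive: float


def parse_cov(text: str) -> List[CovRow]:
    """Parse a text coverage file, raising ValueError on malformed lines."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"line {lineno}: expected 4 columns, got {len(fields)}")
        pos, total, neg, plus = fields
        rows.append(CovRow(int(float(pos)), float(total), float(neg), float(plus)))
    return rows


# Invented sample with two positions.
sample = "1 5 2 3\n2 7 3 4\n"
for row in parse_cov(sample):
    print(row)
```

The check covers only the text version; the `.mat` version is a MATLAB binary file and cannot be read this way.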


(…)/data/cooperative_interaction:

binding_sites_chauhan.txt : Two-column file listing chromosome name and binding site locations. Binding sites are taken from the Chauhan et al. paper.
binding_sites_predicted.txt : Two-column file listing chromosome name and binding site locations predicted by BRACIL in the Chauhan regions.
bracil_27.08_params.txt : Instance of a BRACIL binding event prediction output file. This file is needed to input the parameters of the impulse response.
BRACIL_finalpeaks_mline0.out : Instance of a BRACIL binding event prediction output file. This file is needed to input the parameters of the impulse response.
regions_positive_control_p00eq0.09.txt : Regions file that contains the p00 probability. `p00` indicates the probability of a non-binding conformation and is a necessary input to test for cooperative interaction. This file has 4 important columns: [chromosome region_start region_end p00].


%%V. Description of the main tasks provided by the code.
	(i). Resolution estimation


The main function for BRACIL binding site prediction is `BRACIL_pipeline_rescue`.
It outputs the files `BRACIL_cov_only.out` and `BRACIL_finalpeaks_mline%d.out` with the binding site predictions.
`BRACIL_cov_only.out` predicts binding site locations based only on the ChIP-seq coverage.
`BRACIL_finalpeaks_mline%d.out` predicts binding site locations refined by motif discovery.

The function usage with its inputs is shown in the file `test_bracil_pipeline_rescue.sh`.

The inputs of `BRACIL_pipeline_rescue` are described below:

1. Coverage_file_set:
    The input for the coverage data. It accepts three formats: `tagAlign_count_set`, `cov_list`, and `tagAlign_count_set_list`.
    All formats can be used to analyze organisms with multiple chromosomes.
    The `tagAlign_count_set` format is a 4-column file with the columns:
                genome_position, strand, number_of_tags_count, chromosome.
    Strand is `+` or `-` to indicate the positive or negative strand.
    The `cov_list` format is a two-column file with the following information:
                cov_file, chromosome_label.
    `cov_file` indicates the coverage file for the chromosome given by `chromosome_label`. The coverage file contains 4 columns:
                genome_position, cov_total_count, count_negative_strand_tags, cov_positive_strand_tag
    The cov_file can be a text file or a `.mat` (MATLAB binary) file.
    The `tagAlign_count_set_list` format is a two-column file listing a `tagAlign_count_set` file for each chromosome, as follows:
                tagAlign_count_set_file, chromosome_label

2. Coverage_file_type: `tagAlign_count`, `cov` or `tagAlign_count_set_list`, according to the input used in coverage_file_set.

3. regions_file_set:
    A 10-column file (only the first 4 columns are meaningful).
    Format:
    contiguous, start, stop, region_id nan nan nan nan nan nan

4. output_path
    - Path where the output files will be saved.
5. output_tag
    - An additional name that will be added to output_path. Warning: do not use the underscore character "_" in this tag. It may cause conflicts in name parsing.

6. genome_table_file
    - A two-column file indicating where the genome files are located:
    Col1=genome_file Col2=contiguous.
    Genome_file is a FASTA file; a FASTA file saved as `.mat` (MATLAB binary) can be used instead for speed.

7. mset
    A 3-element vector that indicates how to execute motif finding.
    We use MEME as the motif finding algorithm.
    mset = [mline top_perc extra_edges], where:
    mline : a number indicating the MEME query to be used (check `meme_type` for more options).
        10 : -dna -revcomp -minw  8 -maxw 30  -mod oops -bfile
        12 : -dna -minw  8 -maxw 30  -mod oops -bfile
    top_perc : fraction of the top enriched regions used to create the `pre-motif` fasta file.
    extra_edges : each subsequence in the `pre-motif` fasta file corresponds to
the genome sequence that spans plus/minus extra_edges bp around a predicted binding site.

NOTICE: the number in `mline` will appear in the name of the output file
`BRACIL_finalpeaks_mline%d.out`.

8. ftype
    Defines the impulse function. By default, use "gumbel".

9. bg_file
    A background file for MEME query. It contains frequency of DNA letters 
in the queried genome. Check MEME documentation for details.

10. annotation_table_file
#DEPRECATED PART OF THE CODE. THIS PART IS NOT NECESSARY
    - A two-column file indicating the genome annotation for each chromosome.
Similar to genome_table_file.
    Col1=annotation_file Col2=contiguous.
    annotation_file is a `gff` file, where columns 4, 5, 7, and 9 are
important to characterize the feature, corresponding to, respectively:
feature_start_position feature_end_position feature_strand feature_gene_id_name.

11. weak_site_log10_threshold : minimum threshold to consider a motif as a potential binding site. The value is given in `-log10` units (e.g. weak_site_log10_threshold = 3 considers only motifs with p_value <= 10^-3). I used the value of 2.5 for DosR data (highly informative motif) and 3 for Eukaryote data (lowly informative motif).

12. strong_site_log10_threshold : defines the boundary between weak and strong sites. It is important for the penalty classification. I used the value of 4 for DosR data (highly informative motif) and 5 for Eukaryote data (lowly informative motif).

13. alpha_rescue : penalty for considering a weak site. It takes a value in [0, 1], with 0 indicating no penalty and 1 indicating a penalty proportional to the sum of squares of the coverage.
The penalty is proportional to the number of weak sites used for deconvolution. I used a low value for an inclusive run (e.g. 0.02 or 0.01) and a higher value for a more conservative run (e.g. 0.1).

14. d_lim : Maximum distance to consider `double binding` signal. Defining d_lim=0 indicates only `single binding` signal. Use d_lim=50 to consider the double binding signal.
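
The two thresholds in items 11 and 12 split motif hits into three groups by their `-log10` p-value. The sketch below shows that split using the DosR values quoted above (2.5 and 4) as defaults. It is an illustration, not BRACIL's code: the function name and the labels are hypothetical, and treating a hit exactly at the strong threshold as "strong" is an assumption.

```python
import math


def classify_site(p_value, weak_log10=2.5, strong_log10=4.0):
    """Label a motif hit as 'strong', 'weak', or 'ignored' from its p-value."""
    if p_value <= 0:
        raise ValueError("p_value must be positive")
    score = -math.log10(p_value)
    if score >= strong_log10:
        return "strong"
    if score >= weak_log10:
        return "weak"  # only weak sites are subject to the alpha_rescue penalty
    return "ignored"


for p in (1e-5, 5e-4, 1e-2):
    print(p, classify_site(p))
```

Raising `weak_site_log10_threshold` removes low-confidence motif hits before deconvolution, while `alpha_rescue` controls how costly it is to keep the weak ones that remain.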




	(ii). Cooperative interaction prediction

The function that predicts cooperative interaction is `BRACIL_cooperative_interaction_test`. Its inputs are:

regions_set_with_p00_file : regions file containing 4 columns [chr start stop p00]. `chr` indicates the chromosome label, `start` and `stop` indicate the region interval, and `p00` indicates the probability of the non-binding configuration.
binding_sites_file : two-column file [chromosome binding_site_center].
coverage_file_set : same as above (subitem `i`).
coverage_file_type : same as above (subitem `i`).
ip_rate_type : `high` or `low`. Indicates the assumption about ip_rate.
bracil_file : instance of a BRACIL output file. It is needed to load the parameters that will be used.
output_file : path to which the prediction will be written.
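
Since the test takes both a regions file and a binding sites file, a quick pre-check is to list which binding sites fall inside each region. The sketch below does this on in-memory tuples that mirror the two file layouts ([chr start stop p00] and [chromosome binding_site_center]). It is a hypothetical helper, not part of BRACIL; the chromosome labels and coordinates are invented, and region boundaries are assumed to be inclusive.

```python
def sites_per_region(regions, sites):
    """Group binding site centers by the region that contains them.

    regions: list of (chrom, start, stop, p00)
    sites:   list of (chrom, position)
    """
    grouped = {}
    for chrom, start, stop, _p00 in regions:
        inside = sorted(pos for c, pos in sites if c == chrom and start <= pos <= stop)
        grouped[(chrom, start, stop)] = inside
    return grouped


# Invented example: one region with p00 = 0.09 and four candidate sites.
regions = [("chrA", 100, 200, 0.09)]
sites = [("chrA", 150), ("chrA", 120), ("chrA", 500), ("chrB", 150)]
for region, positions in sites_per_region(regions, sites).items():
    print(region, positions)
```

Regions that end up with fewer than two sites are worth inspecting by hand before interpreting any cooperative interaction result for them.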


	(iii). Estimation of binding event p-value.

`BRACIL_event_pvalue` estimates the p-value of binding events in an input region. Its inputs are:

covc_file : coverage file (see subitem `i`).
region_start : region start
region_stop : region stop
region_chromosome : region chromosome label
n_repetitions : number of samples used to estimate the p-value. It limits the p-value resolution to 1/n_repetitions.
output_path : path where output is saved
genome_file : fasta_file corresponding to `chromosome label`.
bracil_refined_input_file : BRACIL refined file from binding event prediction (see subitem `i`). It is required to extract impulse response parameters.
meme_file : meme file used to generate `BRACIL refined`. It has same `mline` label as BRACIL refined file.
bg_file : background file for motif scan (see MEME/FIMO documentation).
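
The resolution limit of 1/n_repetitions follows from how a sampling-based p-value is computed. The sketch below is a generic empirical p-value, shown only to illustrate that limit. How BRACIL generates its samples is not described in this README, so the function name and the numbers are purely illustrative.

```python
def empirical_p_value(observed, null_samples):
    """Fraction of null samples that are at least as extreme as the observed value."""
    n = len(null_samples)
    if n == 0:
        raise ValueError("at least one null sample is required")
    k = sum(1 for s in null_samples if s >= observed)
    return k / n


null = [1, 2, 3, 10]                  # 4 repetitions: p-values come in steps of 1/4
print(empirical_p_value(5, null))     # one sample (10) is >= 5
print(empirical_p_value(100, null))   # no sample is >= 100
```

With n samples the estimate can only take the values 0, 1/n, 2/n, and so on. A result of 0 therefore means "smaller than 1/n_repetitions", and a larger n_repetitions is needed to resolve smaller p-values.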





%%VI. Variations in binding event prediction


We provide 4 variations of the BRACIL binding event prediction, each requiring a different input.

- BRACIL_pipeline_rescue_trf:
    It requires the training set (where the impulse response is estimated) as input. By default, BRACIL uses only the 16 most enriched regions.

- BRACIL_pipeline_rescue_motif_input:
    This version requires a motif file, `memefile`, as input. It uses this motif file to predict the locations of the potential binding sites that will be used for deconvolution.

- BRACIL_pipeline_rescue_nth_round:
    This version re-iterates motif discovery and binding event prediction. It requires as input the round number and a `pre_final_peaks` file. The pre_final_peaks file represents the refined file from a previous round.

- BRACIL_post_motif_and_post_training_pipeline:
    This version skips the step that trains the impulse response parameters. It requires as input a `bracil_covonly` file, which represents the BRACIL prediction based only on ChIP-seq coverage, a `bracil_final_peaks` file, which represents the BRACIL prediction refined by motif discovery, and a motif input. The user would need this version to run BRACIL in parallel.




Source: README, updated 2014-08-02