SubPatCNV Code

Approximate association pattern mining algorithm for CNVs.

Brought to you by: njohnsonumn

File	Date	Author	Commit
code	2014-12-19	Rui Kuang	[eae895] Initial commit
datasets	2014-12-19	Rui Kuang	[eae895] Initial commit
matlab_code	2014-12-19	Rui Kuang	[eae895] Initial commit
scripts	2014-12-19	Rui Kuang	[eae895] Initial commit
README.txt	2014-12-19	Rui Kuang	[9f0360] add readme.txt

Read Me

Toolbox Steps
/*Preprocessing data*/
1. In scripts/, execute create_file_structure.sh to set up the file structure. You will
need to edit line 3 to set the dataset name.
2. Transform the raw data into a matrix-form text file. Matrix form is an n x (m+2)
matrix, where n is the number of probes and m is the number of samples. The first two
columns must be the chromosome number and the base pair location of the probe
(that's the +2 part). There should be no header row and no probe labels. Place the
text file in datasets/your_dataset/data/
3. In matlab_code/, execute binarize_data.m on the matrix text file. You will need to
edit lines 12 and 13 to set your dataset and text file names. This creates one file per
chromosome for each CNV event type (amplification and deletion) in
datasets/your_dataset/data/datafiles/
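
Before running binarize_data.m, it can help to confirm that the matrix text file
really has the n x (m+2) shape described in step 2. The following is a minimal
sketch in Python, not part of the toolbox. The function name check_matrix_file is
hypothetical, and the sketch assumes that every field, including the chromosome
column, is numeric (so a header row or probe label shows up as an error).

```python
def check_matrix_file(lines):
    """Validate rows of a tab-delimited n x (m+2) matrix.

    `lines` is any iterable of text lines (an open file works).
    Returns (n_probes, n_samples); raises ValueError on the first problem.
    Assumption: all fields, including the chromosome column, are numeric.
    """
    n_probes = 0
    n_cols = None
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        fields = line.split("\t")
        if n_cols is None:
            n_cols = len(fields)
            if n_cols < 3:
                raise ValueError(
                    "need chromosome, bp location, and at least one sample column"
                )
        elif len(fields) != n_cols:
            raise ValueError(
                f"line {lineno}: expected {n_cols} columns, got {len(fields)}"
            )
        try:
            for value in fields:
                float(value)
        except ValueError:
            raise ValueError(
                f"line {lineno}: non-numeric field (header row or probe label?)"
            ) from None
        n_probes += 1
    if n_cols is None:
        raise ValueError("file is empty")
    # The first two columns are chromosome and bp location (the +2 part).
    return n_probes, n_cols - 2
```

To use it, open your matrix file from datasets/your_dataset/data/ and pass the
file object to check_matrix_file. The returned pair is (n, m).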

/*Running SubPatCNV algorithm*/
4. In scripts/, execute run_experiments.sh to run the SubPatCNV algorithm on your
data. You will need to edit line 4 to set the dataset name. Results are written per
chromosome, separately for amplification and deletion CNV events, to
datasets/your_dataset/data/outfiles/

/*Visualizing results*/
5. You can visualize the results by running any of the scripts in matlab_code/. You
will need to edit the first few lines in each for your dataset.
• num_patterns_figure.m: Plot of the number of discovered patterns with
respect to support value.
• pattern_figures.m: Heatmap of patterns discovered with individual patient
clinical variables labeled.
• chr_pattern_figures.m: Plot of pattern locations and patient-subset clinical
variable associations on a specific chromosome.
• genome_pattern_figures.m: Plot of pattern location along the genome.
• pattern_size_dist.m: Plot of the (normalized) pattern size distributions.
• oncogene_coverage_figure.m: Plot of the oncogene coverage by patterns
discovered by SubPatCNV.

File formats:
• log2_matrix_file.txt: Must be a tab-delimited text file where rows are the probes
and columns are the samples. The first two columns must be the chromosome and the
probe bp location. All columns after that are samples.
• oncogene_data.txt: Must be a tab-delimited text file that contains information on
the oncogenes that the user is interested in analyzing. Rows are oncogenes. Columns
are: oncogene name, chromosome, bp start, bp stop.
• clinical_variables.txt: Must be a tab-delimited text file that contains the clinical
variable labels for each sample. The first row of the file must be a header. Rows are
samples. Columns are the different clinical variables.
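
The two auxiliary file formats above can be checked in the same spirit before
running the visualization scripts. This is a minimal sketch in Python, not part of
the toolbox. The function names are hypothetical, and it rests on assumptions the
README does not state: oncogene_data.txt has no header row, bp start and bp stop
are integers, and clinical_variables.txt has exactly one row per sample in the
matrix file.

```python
def check_oncogene_rows(lines):
    """Parse rows of: oncogene name, chromosome, bp start, bp stop.

    Assumptions: no header row; bp start and bp stop are integers.
    Returns a list of (name, chromosome, start, stop) tuples.
    """
    genes = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"line {lineno}: expected 4 columns, got {len(fields)}")
        name, chrom, start, stop = fields
        try:
            start, stop = int(start), int(stop)
        except ValueError:
            raise ValueError(f"line {lineno}: bp start/stop must be integers") from None
        if start > stop:
            raise ValueError(f"line {lineno}: bp start is after bp stop")
        genes.append((name, chrom, start, stop))
    return genes


def check_clinical_rows(lines, n_samples):
    """Check a header row followed by one row per sample.

    Assumption: the number of data rows equals n_samples, the number of
    sample columns in the matrix file. Returns the header as a list.
    """
    rows = [ln.rstrip("\r\n").split("\t") for ln in lines if ln.rstrip("\r\n")]
    if not rows:
        raise ValueError("file is empty; a header row is required")
    header, samples = rows[0], rows[1:]
    if len(samples) != n_samples:
        raise ValueError(f"expected {n_samples} sample rows, got {len(samples)}")
    for rowno, row in enumerate(samples, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"row {rowno}: expected {len(header)} columns, got {len(row)}"
            )
    return header
```

Both functions accept an open file object or a list of lines. The value passed as
n_samples is the number of sample columns in log2_matrix_file.txt.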
