SubPatCNV Code
Approximate association pattern mining algorithm for CNVs.
Brought to you by:
njohnsonumn
File | Date | Author | Commit |
---|---|---|---|
code | 2014-12-19 | | [eae895] Initial commit |
datasets | 2014-12-19 | | [eae895] Initial commit |
matlab_code | 2014-12-19 | | [eae895] Initial commit |
scripts | 2014-12-19 | | [eae895] Initial commit |
README.txt | 2014-12-19 | | [9f0360] add readme.txt |
Toolbox Steps

/*Preprocessing data*/

1. In scripts/ execute create_file_structure.sh to set up the file structure. You will need to edit line 3 to set the dataset name.

2. Transform raw data into a matrix form text file. Matrix form is an n x (m+2) matrix where n is the number of probes and m is the number of samples. The first two columns must be the chromosome number and the base pair location of the probe (that's the +2 part). There should be no additional header or probe labels. Place the text file in datasets/your_dataset/data/

3. In matlab_code/ execute binarize_data.m on the matrix text file. You will need to edit lines 12 and 13 for your dataset and text file name. This will create files for each chromosome, for amplification and deletion CNV events each, in datasets/your_dataset/data/datafiles/

/*Running SubPatCNV algorithm*/

4. In scripts/ execute run_experiments.sh to run the SubPatCNV algorithm on your data. You will need to edit line 4 to set the dataset name. The results will be created for each chromosome and for amplification and deletion CNV events in datasets/your_dataset/data/outfiles/

/*Visualizing results*/

5. You can visualize the results by running any of the scripts in matlab_code/. You will need to edit the first few lines in each for your dataset.

• num_patterns_figure.m: Plot of the number of discovered patterns with respect to support value.
• pattern_figures.m: Heatmap of patterns discovered with individual patient clinical variables labeled.
• chr_pattern_figures.m: Plot of pattern location and subset patient clinical variable associations on a specific chromosome.
• genome_pattern_figures.m: Plot of pattern location along the genome.
• pattern_size_dist.m: Plot of the (normalized) pattern size distributions.
• oncogene_coverage_figure.m: Plot of the oncogene coverage by patterns discovered by SubPatCNV.

File formats:

• log2_matrix_file.txt: Must be a tab delimited text file where rows are the probes and columns are the samples. First two columns must be chromosome and probe bp location. All columns after are samples.
• oncogene_data.txt: Must be a tab delimited text file that contains oncogene information that the user is interested in analyzing. Rows are oncogenes. Columns are: oncogene name, chromosome, bp start, bp stop.
• clinical_variables.txt: Must be a tab delimited text file that contains the clinical variable labels for each sample. Must have header as first row in file. Rows are samples. Columns are the different clinical variables.
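
Before running binarize_data.m, it may help to check that the matrix text file really has the n x (m+2) layout described for log2_matrix_file.txt. The Python sketch below is not part of the SubPatCNV toolbox. The function name check_log2_matrix is hypothetical. It also assumes every field is numeric, including the chromosome column, since the README calls it the "chromosome number". Adjust that check if your data encodes chromosomes differently.

```python
def check_log2_matrix(text):
    """Validate matrix-form text and return (n_probes, n_samples).

    Assumptions (not confirmed by the toolbox itself): fields are
    tab delimited and every field, including the chromosome column,
    parses as a number.
    """
    # Ignore blank lines such as a trailing newline at end of file.
    rows = [line for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("file is empty")

    width = None
    for row_number, line in enumerate(rows, start=1):
        fields = line.split("\t")
        if width is None:
            width = len(fields)
            # Need chromosome, bp location, and at least one sample.
            if width < 3:
                raise ValueError("expected at least 3 tab-delimited columns")
        elif len(fields) != width:
            raise ValueError(
                f"row {row_number}: {len(fields)} columns, expected {width}"
            )
        try:
            for field in fields:
                float(field)
        except ValueError:
            # A header row or probe label would end up here.
            raise ValueError(
                f"row {row_number}: non-numeric field (header or probe label?)"
            ) from None

    # The first two columns are chromosome and bp location (the +2 part).
    return len(rows), width - 2


# Three probes, two samples.
sample = "1\t15000\t0.12\t-0.40\n1\t27500\t0.05\t0.33\n2\t9100\t-1.10\t0.02\n"
print(check_log2_matrix(sample))  # (3, 2)
```

To check a real file, read it with open(path).read() and pass the result to the function. A ValueError then points at the first row that breaks the layout.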