

Anonymous dmitri.lvov Alexander Favorov

ReadMeValidation

Introduction

Here we describe the technical details of the validation procedure. It is described more generally in the Validation section of the Readme file.

Data flow

First, APSampler runs. Its data flow is described in the Readme.

The patterns_validation.pl script reads the configuration file, the statistics file and, if it exists, the null-distribution file. It writes to stdout, so the user chooses where to redirect the output. The name of the null-distribution file is <config_file_name>.null-distribution.

Validation data flow

When null_statistics_gather.pl is started, it takes two parameters: the configuration file name (<config_file_name>) and the number of permutations (N). The gathering has 3 steps; the overall rule for each action is: if the output file does not exist, create it. If it exists and is younger than its input, do nothing (everything needed is already done), unless the output is empty, in which case rewrite it. If the output exists but is older than its input, report an error and stop.
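The create/skip/error rule above can be sketched in shell (an illustrative rendition only; the actual scripts implement it in Perl, and the run_step helper is hypothetical):

```shell
# Hypothetical helper illustrating the freshness rule applied at each step.
run_step() {
  input="$1"; output="$2"; shift 2
  if [ ! -e "$output" ] || [ ! -s "$output" ]; then
    "$@"                                  # output missing or empty: (re)create it
  elif [ "$output" -nt "$input" ]; then
    :                                     # output younger than input: nothing to do
  else
    echo "error: $output is older than $input" >&2
    return 1                              # stale output: report and stop
  fi
}
```

run_step invokes the given command only when the output is missing or empty, skips the work when the output is already up to date, and fails when the output is stale.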

The first step (prepare.pl) prepares the gathering run. Its inputs are the configuration file and the disease level file. Its outputs are N configuration files <config_file_name>.permut.## (## is the number of the file) and N permuted disease level files named <disease-level-file>.permut.##. Each permuted config refers to its corresponding disease level file.

The next step (sample.pl) is sampling. Each <config_file_name>.permut.## is processed by APSampler (<statistics-file-name>.permut.## and <report-file-name>.permut.## are the output; the <config_file_name>.permut.## files prepared at the previous step name them as outputs). Then patterns_validation.pl is run on the statistics with the special switch -d, which tells it to gather the distribution of p-values rather than to output validated and filtered patterns. The output is <config_file_name>.null-distribution.##. Since everything happens on the N permuted datasets independently, this step parallelises well.

The last step (combine.pl) combines all the permutation-wise distributions (the <config_file_name>.null-distribution.## files are inputs, along with the permuted configs) into one final null distribution, <config_file_name>.null-distribution.

If we have more than 2 disease levels, the Fisher validation is impossible, so we must map all the levels onto a "case/control" scheme or use another validation method. The following configuration parameters control the mapping.

validation_fisher_threshold=level
all disease levels higher than level are interpreted as cases (1); the rest are controls (0)

validation_fisher_control_min=level
a more precise mapping of disease levels onto the case/control scheme: the minimum level for controls

validation_fisher_control_max=level
a more precise mapping of disease levels onto the case/control scheme: the maximum level for controls

validation_fisher_patient_min=level
a more precise mapping of disease levels onto the case/control scheme: the minimum level for cases

validation_fisher_patient_max=level
a more precise mapping of disease levels onto the case/control scheme: the maximum level for cases

permut_random_seed_1=long_1 permut_random_seed_2=long_2
The random seeds for the random generator. One can obtain a different permutation test for the same data by changing these parameters.
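For example, a study with disease levels 0-3 could be mapped to controls {0, 1} and cases {2, 3} with a config fragment like the following (the level boundaries and seed values are illustrative only):

```
validation_fisher_control_min=0
validation_fisher_control_max=1
validation_fisher_patient_min=2
validation_fisher_patient_max=3
permut_random_seed_1=12345
permut_random_seed_2=67890
```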

User-callable scripts

patterns_validation.pl

This is a Perl script that performs validation.

perl patterns_validation.pl config_file_name

The script converts the APSampler output (the statistics file) to a human-readable form. Simple (Fisher or Goodman-Kruskal) validation is always performed. If the statistics file contains all the patterns (see keep_patterns=keep_all in the APSampler Readme), it also performs the Bonferroni and Benjamini-Hochberg (FDR) multiple-hypothesis corrections. If the file <config_file_name>.null-distribution with the null hypothesis statistics exists, it additionally evaluates the permutation and Westfall-Young p-values for each pattern and the permutation-based FDR. That file is prepared by a null_statistics_gather.pl run.

The patterns_validation.pl script filters its output by default. It hides every pattern that contains a stronger sub-pattern (filters out the weaker supersets) and removes every pattern whose p-values are all higher than 0.05 and whose FDRs are all higher than 0.1. You can customise the filtering behaviour with command-line switches.

null_statistics_gather.pl

To validate the APSampler results by a permutation test, one first collects the null hypothesis statistics and then runs the patterns_validation.pl script, which detects the presence of <config_file_name>.null-distribution and calculates the permutation-based values for each output pattern.

The toolkit is essentially a set of Perl scripts that are run one by one, or whose runs can be automated. The three main service scripts are located in the ./PermutationTest/ directory (relative to the working folder; alternatively, the service script folder can be linked from the working one). They are: prepare.pl, which prepares the permuted config files and data files; sample.pl, which runs APSampler and patterns_validation.pl on each permuted dataset; and finally combine.pl, which gathers the results of all runs into one null statistics file. These three are called by the null_statistics_gather.pl script, which is located in the work directory after the link_me run.

This script collects the parameters from the user and runs the three scripts above in order, passing those parameters to each of them. The script does not use any parallel computation facilities that may exist in the infrastructure; thus, simply running it without any editing will not parallelise anything and is suitable only for sequential computation. See Parallel versions of null_statistics_gather.pl.

perl null_statistics_gather.pl <config_file_name> <number_of_permutations>

The toolkit comprises several scripts:

  • prepare.pl
  • sample.pl
  • combine.pl

Below is a short top-down description of each of them, which should allow users to run the parallel option of APSampler. In the parameter descriptions, parameters in angle brackets are mandatory and parameters in square brackets are optional.

prepare.pl

perl PermutationTest/prepare.pl <config_file_name> <number_of_permutations>

Prepares a configuration file and a permuted disease level file for each permutation number from 1 to <number_of_permutations>. If both files already exist for one of the permutations and they are younger than the main config file, it leaves them unchanged. If any of them exists and is older than the config, it reports an error and does nothing.

It writes the new configuration files and level data files that will be used for the permutation runs by APSampler. The files are named like <config_file_name>.permut.##. They all share the same APSampler and patterns_validation.pl parameters, as well as the genetic data file, with the original configuration. What differs are the disease level data files and the names of the output (statistics and report) data files: they all have the suffix .permut.## appended to the names used in the original (non-permuted) run. Each disease level file <disease-level-file>.permut.## is a permutation of the <disease-level-file> file. If possible, the permutation is done in a balanced manner.
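For illustration, after a run like perl PermutationTest/prepare.pl config 3, assuming the disease level file is named levels and ## is zero-padded to two digits (the exact numbering format follows the scripts), one would expect files along these lines:

```
config.permut.01   levels.permut.01
config.permut.02   levels.permut.02
config.permut.03   levels.permut.03
```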

All the permutations are seeded from the permut_random_seed_## values (or long_random_seed_## if they are absent), so the data is reproducible: if we restart the validation with a larger number of permutations (the new N2 is larger than the old N1), the first N1 results are the same, so we do not have to rewrite them.

sample.pl

Once the input data (in this case, the various level and config files) for all the APSampler runs is ready, the actual sampling can start. This is the slowest stage. The script runs the samplings sequentially, one by one, but in fact the only part common to all of them is the read-only gene data file, so the stage is easy to parallelise. The naming scheme even allows running on different computers and then copying back the results (the output files and the <config-file-name>.##.null-distribution files for all the runs). In other words, if you are parallelising the computation, run this script many times on different cores. The input parameters allow you to specify which permutations run on which machine.

perl PermutationTest/sample.pl <config_file_name> <upper_permutation_number>

if you want to start from the first permuted config, or:

perl PermutationTest/sample.pl <config_file_name> <lower_permutation_number> <upper_permutation_number>

In the following examples, <config_file_name> is "config".

For example, if there are 2 cores available, then for a total of 100 permutations you can run perl PermutationTest/sample.pl config 50 on core 1 and perl PermutationTest/sample.pl config 51 100 on core 2.

A --single command-line switch after the name of the configuration file says to run only one permutation; its number is the second parameter:

perl PermutationTest/sample.pl <config_file_name> --single <number>

Example:

perl PermutationTest/sample.pl config --single 31

A --single-var switch after the config file name says to run only one permutation; the third parameter is the name of the environment variable to read the permutation number from:

perl PermutationTest/sample.pl <config_file_name> --single-var <varname>

Example:

perl PermutationTest/sample.pl config --single-var SGE_TASK_ID

A --single-var-plus-shift switch says to run only one permutation. The third parameter is the name of the environment variable to read; the shift (the fourth parameter) is then added to its value, and the sum is the number of the permutation to sample. For example, with SGE_TASK_ID equal to 3 and a shift of 14, permutation 17 is sampled.

perl PermutationTest/sample.pl <config_file_name> --single-var-plus-shift <varname> <shift>

Example:

perl PermutationTest/sample.pl config --single-var-plus-shift SGE_TASK_ID 14

For each run, the input is the config <config-file-name>.permut.## and the permuted disease level data <levels-file-name>.permut.##; the output is the null statistics for the run, <config-file-name>.##.null-distribution (## is the number of the run).

combine.pl

After the permuted runs have finished, the results must be combined and the null statistics finally gathered. This is done with the PermutationTest/combine.pl script. At this point, all the per-run files must be in the folder. The script combines the distributions into one file named <config_file_name>.null-distribution. Once it is created, the patterns_validation.pl script will calculate the permutation-based p-value and FDR, along with other values, for each pattern.

The input is the configuration file and all the <config-file-name>.##.null-distribution files; the output is <config_file_name>.null-distribution. The script backs up the existing <config_file_name>.null-distribution if the file exists and is valid. If the file corresponds to the current run completely (it has the same number of distributions as the current run and the timestamps of all the components are earlier), the script does nothing.

perl PermutationTest/combine.pl <config_file_name> <number_of_permutations>

clean_after_null_statistics_gather.pl

null_statistics_gather.pl produces a lot of files; this script cleans them up.

perl clean_after_null_statistics_gather.pl <config_file_name>

Parallel versions of null_statistics_gather.pl

There is an additional ./PermutationTest/null_statistics_gather/ folder inside the ./PermutationTest/ folder that contains some more versions of null_statistics_gather that can be copied and used instead of the usual null_statistics_gather.pl.

  • null_statistics_gather_fork.pl The script uses the simplest and most common parallel model (fork + wait). The part run in parallel is sample.pl. The script takes a second numeric parameter, core_number, that specifies how many cores to use:
    perl null_statistics_gather_fork.pl <config_file_name> [desired_number_of_permutations] [core_number]

  • null_statistics_gather_sge.pl The script uses SGE (Oracle Grid Engine, formerly Sun Grid Engine) with the array job model to submit many sample.pl copies:
    perl null_statistics_gather_sge.pl <config_file_name> [desired_number_of_permutations] [core_number]
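The core-splitting idea behind the fork version can be sketched in shell (an illustrative rendition of how N permutations might be divided among workers; the real script is Perl and uses fork + wait):

```shell
# Divide N permutations into CORES contiguous ranges, one worker per range.
# Illustrative only: echo stands in for launching a background worker.
CONFIG=config; N=100; CORES=4
CHUNK=$(( (N + CORES - 1) / CORES ))   # permutations per worker, rounded up
i=1
while [ "$i" -le "$CORES" ]; do
  LO=$(( (i - 1) * CHUNK + 1 ))
  HI=$(( i * CHUNK ))
  if [ "$HI" -gt "$N" ]; then HI=$N; fi
  # a real runner would do: perl PermutationTest/sample.pl "$CONFIG" "$LO" "$HI" &
  echo "worker $i: permutations $LO-$HI"
  i=$(( i + 1 ))
done
# wait   # a real runner would block here, then run combine.pl
```

With N=100 and CORES=4, the workers get the ranges 1-25, 26-50, 51-75 and 76-100; these are exactly the <lower_permutation_number> <upper_permutation_number> pairs passed to sample.pl.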

We do not distinguish 'core' from 'thread' in this documentation.

The parallel versions are given as examples; a user interested in parallel computation can write their own variants of the parallel script. The two examples use completely different parallelisation and dependency schemes, and we hope these two are enough as a template set.

All the sampling output that is usually used to monitor the sampling by watching stdout is redirected to <config-file-name>.nsg-p.out for the parent process (nsg-p stands for "null statistics gather, parent") and <config-file-name>.nsg-ch.##.out, where ## is the thread number.

Internal tools

permut (executable)

Provides balanced or ordinary permutations of the input data. A console-interface executable. Located in ./PermutationTest

cumulative_statistics_file.pm

A Perl module that contains methods to read, write and modify the null distribution. Located in ./PermutationTest

Exact_4_Pole_Fisher.pm

Calculates the exact Fisher's p-value for a 2x2 contingency table. Uses Stirling's approximation for factorials. Located in ./Statistics
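For reference, the exact probability of a single 2x2 table with cells a, b, c, d (row sums a+b and c+d, column sums a+c and b+d, total n) under the hypergeometric model, from which the exact test's p-value is accumulated, is:

```latex
p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}
```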

Goodman_Kruskal_gamma.pm

The module provides a function that computes a p-value for a 2*n contingency table, n>=2. The p-value is calculated for the Goodman-Kruskal gamma value that characterises the contingency table. Located in ./Statistics To run the validation in this mode when you have multi-levelled input, comment out only those lines in the .config file that are related to Fisher's test.
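For reference, the Goodman-Kruskal gamma is defined from the numbers of concordant ($N_c$) and discordant ($N_d$) pairs of observations in the table:

```latex
\gamma = \frac{N_c - N_d}{N_c + N_d}
```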

pjacklam folder (normal distribution)

Here we use only the errf and normal functions from the collection of statistical solutions by Peter J. Acklam; refer to the author's website. Located in ./pjacklam


We use the <parameter> notation for obligatory parameters and [parameter] for optional parameters of scripts, e.g.
perl perl_script <obligatory parameter> [optional parameter]


Related

Wiki: Readme
Wiki: ReadmeQuickStart