APSampler Wiki

APSampler is a tool that finds multifactorial association patterns

Brought to you by: dimalvov, favorov

ReadmeQuickStart

Authors: Anonymous

APSampler quick start guide

Introduction

This page shows how to get started quickly with APSampler. A more detailed description of the software and its output is in Readme, which also describes the input data formats in detail. Examples of input data are included in the source package, in the levels.example and genes.example files. An example configuration file that links these data files is cfile.example. You can start with these three files as they are, or you can modify them. The configuration file tags are explained in Readme. The algorithm itself is presented on the APSampler web site.

Requirements

The APSampler source bundle is distributed as C/C++ source code, which is compiled into the core software, and a set of Perl and Bash scripts that form the framework to run the software and to process the results. Any UNIX-like system is suitable. If you run Windows on your computer, we recommend installing Cygwin. The formal requirements are: a C/C++ compiler, GNU make, Bash and Perl 5. The user is expected to be familiar with simple UNIX concepts, e.g. 'home directory'.

Data flow

APSampler reads the configuration file, which contains the names of the disease level file and the genotyping data file to read. All three input files are required. APSampler outputs the reportfile and the statisticsfile. The statisticsfile is not convenient for human reading; extracting the information from it is one of the purposes of the validation procedure.

Compilation

The installation procedure depends on whether you have downloaded source code or Windows executables. If you are installing from source code, you need to compile it first.

First, get the source code from the APSampler git repo or unpack the APSampler source archive into a directory you choose to contain the APSampler software. This directory is not the working directory for your project. Then run:

make

If everything is OK, after a screen of compilation and linking messages you will get your command line prompt back, and an APSampler executable file will appear in your ~/APSampler/src directory. Its name is APSampler.exe if you use Windows and APSampler otherwise. To test whether it works, say:

APSampler --version

If everything goes well, you will get something like:

APSampler version is Open Source Release 1.15 (version 3.6.15)

Preparing the work directory

Now, choose where you want to keep the data and results of the test APSampler run.

You can create your own configuration and data files from scratch, or you can copy all three *.example files from ~/APSampler/src and use them as templates.

Installation

Windows binary

If you use the Windows binary bundle, extract the content of the APSAMPLER_#_#_#.zip archive (#_#_# is the version number; look for it in the APSampler download folder) into the working directory. Each working directory has to carry the unpacked executables, scripts and folders.

Source bundle

If you use the source bundle and you have already compiled it, it is not necessary to copy the results into the working directory. Instead, link the needed files by using the link_me script.

Configuration

Configuration consists of just preparing the input data and the configuration (options) file, which is the main parameter given to APSampler. Detailed descriptions of the data and configuration file formats are in Readme. A simple description follows.

Disease (trait) level file

The file has no header. It contains a numeric description of the phenotype trait, e.g.:

1 and 2 for control and case; or 1, 2 and 3 for control, case 1 and case 2, where case 2 is more severe than case 1.

Here is an example for 5 persons having 3 levels:

Genetic data file

The file has a header line with the names of the loci. The columns (loci) are tab- or space-delimited, and inside each column the codes of the variants on each of the two chromosomes are separated by a space. Here is an example for 5 persons:

 SNP1   SNP2   SNP3
 a b    f f    c t     
 a b    0 f    c c
 b a    g g    c c
 a a    f f    0 t
 b b    g g    t c

0's denote missing data by default.

Configuration file

There is a default configuration in the ~/APSampler/src/cfile.example file. To get started quickly, you probably need to change only the following parameters:

gene_sets_number=N
N (a natural number) is the size of the sample we work with (and so, it is the number of gene sets).

loci_in_set=L
L (a natural number) is the number of loci that are involved in the search.

allele_variants=comma-delimited-list-of-natural-numbers
The loci can be biallelic (SNPs, deletions, etc.) or polyallelic, so this tag gives the number of variants at each locus. The list contains one number per locus. The order of the loci is the same as in the genetic data file.

genome_data_file=filename
The filename is the name of the genome data file.

disease_levels_file=filename
The filename is the name of the file describing the level of disease or of another phenotype trait of interest.
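As a rough cross-check, the first three values can be derived from the genetic data file itself. The Python sketch below is not part of APSampler. It assumes the layout of the genetic data example on this page (a header with one name per locus, two whitespace-separated variant codes per locus in every row, 0 for missing data) and assumes that every locus in the file takes part in the search. The function name summarize_genotypes is made up for illustration.

```python
# Illustrative helper, not part of the APSampler package.
GENOTYPES = """\
SNP1   SNP2   SNP3
a b    f f    c t
a b    0 f    c c
b a    g g    c c
a a    f f    0 t
b b    g g    t c
"""

def summarize_genotypes(text, missing="0"):
    """Return (number of persons, locus names, distinct variants seen per locus)."""
    lines = [line for line in text.splitlines() if line.strip()]
    loci = lines[0].split()
    variants = [set() for _ in loci]
    for row_number, row in enumerate(lines[1:], start=1):
        codes = row.split()
        if len(codes) != 2 * len(loci):
            raise ValueError(
                f"row {row_number}: expected {2 * len(loci)} codes, got {len(codes)}"
            )
        for i in range(len(loci)):
            # Two codes per locus, one for each chromosome.
            for code in codes[2 * i:2 * i + 2]:
                if code != missing:
                    variants[i].add(code)
    return len(lines) - 1, loci, [len(v) for v in variants]

persons, loci, variant_counts = summarize_genotypes(GENOTYPES)
config_lines = [
    f"gene_sets_number={persons}",
    f"loci_in_set={len(loci)}",
    "allele_variants=" + ",".join(str(n) for n in variant_counts),
]
print("\n".join(config_lines))
```

The counts are only what is observed in the data: a variant that exists at a locus but never occurs in the sample is not counted, so the printed values are a starting point to review rather than a definitive configuration.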

More than one project in a folder

The validation procedure creates a lot of intermediate files, so it is better to keep each project (data, config and results) in a separate directory. If you prefer to keep two or more projects in one folder, just keep their configuration file names different enough to trace what happens.

Linking the software and the work directory

With your work directory as the current directory, say:

~/APSampler/src/link_me

If everything is OK, all the Perl and validation scripts are now linked to your working directory and it is ready to run the pattern identification process. The validation procedure and scripts are described in detail in ReadmeValidation and Readme#Validation.

Running

At this point all the input data have been prepared and the working directory is the current directory.

Run APSampler

Invoke APSampler as follows:

./APSampler <config-file-name>

The program will then search for the optimal patterns. The name of the resulting file is given by the statisticsfile= line in the configuration file. The file is plain text, but it is hardly human-readable. The next task is to sort the findings, validate them and compute the statistical parameters.

Validation

There are two levels of validation in the APSampler package: simple validation and full (permutation-based) validation. See the Readme file for more details.

Normally you would want to run permutation validation on all results. However, running simple validation first can save a lot of time, because permutation validation usually takes long: if the results do not pass simple validation, you can stop, since they will not pass the more stringent permutation tests either.

Simple validation

As soon as APSampler has finished, you can start the validation script. Say

perl patterns_validation.pl <config-file-name> > simple-validation-results-file-name

The result is an output file listing the patterns sorted by p-value. For each pattern, the validation script builds a contingency table and calculates the p-value of Fisher's exact test or of Kruskal's gamma. If the configuration line keep_patterns= has the value all during both the APSampler run and the validation script run, the script also computes the Bonferroni correction and the Benjamini-Hochberg FDR value for each pattern.
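For readers unfamiliar with the two corrections, here is a minimal Python sketch of the textbook Bonferroni and Benjamini-Hochberg procedures. It is not taken from patterns_validation.pl, whose implementation may differ in detail, and the p-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

p_values = [0.001, 0.02, 0.03, 0.5]  # hypothetical p-values for four patterns
corrected = bonferroni(p_values)
q_values = benjamini_hochberg(p_values)
print(corrected)
print(q_values)
```

Bonferroni controls the chance of any false positive and grows linearly with the number of tested patterns, while the Benjamini-Hochberg q-value controls the expected share of false discoveries and is never larger than the Bonferroni value for the same pattern.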

The output record in the result file for each pattern looks like:

Pattern contains 3 informative alleles: 
Ex_locus_5:C,T; Ex_locus_6:2_loc_6.

Fisher 4-pole table:
      ctrl      case
        18         3     carriers
        90       112  noncarriers

Fisher's exact p-value = 0.0002557199
OR=0.13393  CI(95%)=[0.03824..0.46904]

Corrected (Bonferroni) p-value = 0.33576020311669
q-value (Benjamini-Hochberg FDR) = 0.33576020311669
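The OR and CI lines of this record can be checked by hand. The Python sketch below recomputes them from the four counts, assuming the odds ratio compares carriers among cases with carriers among controls and that the interval is the Woolf (log-scale) interval with z = 1.96; these assumptions are not stated in the source, but they agree with the record to about four decimal places. The two-sided Fisher test shown (summing all tables that are no more probable than the observed one) is one common definition and is likewise an assumption about what the script does.

```python
from math import comb, exp, log, sqrt

# Counts from the record: columns are ctrl/case, rows are carriers/noncarriers.
ctrl_carriers, case_carriers = 18, 3
ctrl_noncarriers, case_noncarriers = 90, 112

# Odds of carrying the pattern among cases divided by the odds among controls.
odds_ratio = (case_carriers * ctrl_noncarriers) / (ctrl_carriers * case_noncarriers)

# Woolf interval: log(OR) +/- z * SE, with SE from the reciprocals of the four counts.
se_log_or = sqrt(
    1 / ctrl_carriers + 1 / case_carriers + 1 / ctrl_noncarriers + 1 / case_noncarriers
)
z = 1.96  # assumed normal quantile for a 95% interval
ci_low = exp(log(odds_ratio) - z * se_log_or)
ci_high = exp(log(odds_ratio) + z * se_log_or)

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )

p_value = fisher_two_sided(ctrl_carriers, case_carriers, ctrl_noncarriers, case_noncarriers)
print(f"OR={odds_ratio:.5f}  CI(95%)=[{ci_low:.5f}..{ci_high:.5f}]")
print(f"Fisher's exact p-value = {p_value:.10f}")
```

An odds ratio below 1 with an interval that excludes 1, as here, means the pattern is carried less often by cases than by controls.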

Permutation-based validation

We see in the example above that the Bonferroni correction appears to be too strict. At the same time, we need some kind of multiple hypothesis correction for the p-values. The more sophisticated option is permutation-based validation. It takes much more time than the original run. To prepare it, say:

perl null_statistics_gather.pl <config-file-name> <N>

N is the number of permutations; at least 50 is recommended. The script prepares statistics for N runs with a level file that is permuted in a balanced manner. The procedure prepares N permuted data files, runs APSampler on each of them, validates the results of each run while gathering the statistics of all the p-values and, finally, collects the N resulting null distributions into the file <config-file-name>.null-distribution. Once this file exists, you can say

perl patterns_validation.pl <config-file-name> > full-validation-results-file-name

and you will obtain the result file. Its format is similar to that of the simple validation file, but it also contains lines like:

Permutation (Westfall-Young) p-value = 0.1
FDR = 0.2

for each pattern.
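To give an idea of what a permutation p-value means, here is a minimal Python sketch of the single-step min-P adjustment in the spirit of Westfall and Young. It is not the APSampler implementation: it assumes each permuted run is summarized only by its smallest p-value, and all numbers except the first observed p-value are hypothetical.

```python
def min_p_adjust(observed_p, null_min_p):
    """For each observed p-value, the share of permuted runs whose best
    (smallest) p-value is at least as small."""
    n = len(null_min_p)
    return [sum(1 for q in null_min_p if q <= p) / n for p in observed_p]

# Hypothetical summary of 10 permuted runs: the smallest p-value found in each.
null_min_p = [0.0001, 0.002, 0.004, 0.006, 0.01, 0.02, 0.03, 0.05, 0.08, 0.2]
observed_p = [0.0002557199, 0.015]

adjusted = min_p_adjust(observed_p, null_min_p)
print(adjusted)
```

With this estimator the smallest nonzero adjusted p-value is 1/N, so N = 50 cannot resolve values below 0.02; this is one reason to use more permutations when time allows.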

The validation reference

The validation scripts use the same configuration file as the main program does; some tags are used only by the validation scripts.

The scripts and the configuration tags are described in more detail in the validation readme file.

Interpreting the results

The file obtained from the full validation is the final output of the APSampler package. Further actions are related to interpreting the results and are done manually.

See some explanations in the main readme file.

Using shell scripts

There are two shell scripts in the src directory and they are copied into any working directory by the link_me script.

do_sample

To run the do_sample script, say

bash do_sample <config-file-name>

and the script will run APSampler and then the simple validation. It also creates a flag file .sampled.<config-file-name> to let the second script, do_validate, know that the sampling is over.

The result is in <config-file-name>.validation-simple file.

do_validate

To run do_validate script, say

bash do_validate <config-file-name> <N>

N is the number of permuted runs; the recommended value is 50. The script will prepare the permuted run results, then wait for the .sampled.<config-file-name> flag and run the validation again.

The result is in <config-file-name>.validation-full file.
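The flag file is what lets the two scripts run side by side: do_validate can start its permuted runs at once and only needs the sampling results at the very end. The Python sketch below illustrates that handshake. It is not how the Bash scripts are written; the polling loop, the timeout and the function name wait_for_flag are assumptions made for illustration.

```python
import os
import tempfile
import time

def wait_for_flag(path, timeout=5.0, poll=0.1):
    """Return True once the flag file exists, or False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)

with tempfile.TemporaryDirectory() as workdir:
    # Flag name follows the .sampled.<config-file-name> convention described above.
    flag = os.path.join(workdir, ".sampled.cfile.example")
    seen_before = wait_for_flag(flag, timeout=0.2)  # sampling not finished yet
    open(flag, "w").close()                         # what the sampling step would do
    seen_after = wait_for_flag(flag, timeout=0.2)

print(seen_before, seen_after)
```

A stale flag left over from an earlier run would make the second step proceed too early, which is another reason to keep each project in its own directory.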

We use the <parameter> notation for obligatory parameters and the [parameter] for optional parameters for scripts, e.g.
perl perl_script <obligatory parameter> [optional parameter]

Wiki: APSamplerWikiHome
Wiki: Readme
Wiki: ReadmeValidation