In this manual you will find information about installing, configuring and using the APSampler software.
APSampler is a tool for searching for complex associations between multi-locus genetic data and multi-level phenotypic traits of individuals. The input is data on, for example, patients and healthy controls genotyped for a number of candidate SNPs or genes, and the goal would be, for example, to determine the set of SNPs or genes that best separates cases from controls. The main difficulty in such a problem is that, given multiple loci and multiple alleles, the number of all possible classifiers tends to be extremely large. Therefore, a Markov chain Monte Carlo (MCMC) method is applied to reduce the space of solutions and sample only from regions where a good classifier is likely to be found. Once a set of classifiers is found, the results have to be validated, and this is done using a number of well-known methods. In the case of a single disease level (case/control), the resulting classifier divides the space of healthy and ill individuals, and the result is represented in a classic contingency table. The odds ratio and Fisher's p-value are calculated and displayed.
The use of ranks and the Wilcoxon-Mann-Whitney statistic allows analysis of multi-level diseases, which is useful in, for example, Multiple Sclerosis research, where the EDSS scale is often used to describe the disease state. In this case, a multi-pole Fisher table is constructed to represent the results. APSampler is not limited to Multiple Sclerosis data analysis.
APSampler is a live open source project that changes extensively and often, so you may want to check the update page once in a while to see whether any updates are available.
APSampler is distributed as C source files and Perl scripts; it runs under Unix using the native gcc and perl. To use it on Windows you will need a Unix emulator, such as Cygwin. A Perl interpreter is also required (ActivePerl, or the one available with the Cygwin installation; tick the Perl box in the Cygwin installer when installing it).
To gain the most benefit from the program, the user should understand at least the interpretation of the relevant statistical methods, in addition to understanding the nature of the data. The main statistical terms to understand are the odds ratio, 2-pole and multi-pole contingency tables, and Fisher's p-value.
Basic computer skills, such as running a program with parameters from the command prompt and understanding what a home directory is, are required.
Knowledge of UNIX, bash, and Perl is not required but is recommended; at a minimum, you should understand how to run a program from the terminal.
See Advanced Bash-Scripting Guide in English and Advanced Bash-Scripting Guide in Russian for reference.
APSampler reads a configuration file. This file contains the names of the disease level file and the genotyping data file to read. All three input files (config, levels, and genes) are necessary. APSampler outputs a report file and a statistics file; the latter carries information about the findings. The names of the output files are also set in the configuration file. If at least one of the output files already exists, the program prints a message and stops without doing anything.
The statistics file is a text file, but it is not convenient for human reading. Extracting the information from this file is one of the purposes of the validation procedure.
The installation procedure depends on whether you have downloaded source code or Windows executables. If you are installing from source code, you need to compile it first.
The source installation package consists of the folder APSampler/src containing the source code, Perl and Bash scripts, help, and example files. Copy the downloaded archive and extract it into your home directory. Obviously, you can choose any other directory for APSampler, but in this documentation we will refer to it as ~/APSampler. Then cd to APSampler/src and say
make
If "make" generates errors such as a missing g++, you may need to install the g++ compiler.
If the makefile fails to determine your GCC version automatically, say:
make GCC3=Y
or
make GCC4=Y
depending on which version you use.
If everything is OK, after a screen of compilation and linking messages you will get your command line prompt back, and an APSampler executable file will appear in ~/APSampler/src. Its name is APSampler.exe if you use Windows and APSampler otherwise. To test whether it works, say:
APSampler --version
If everything goes well, you will get something like:
APSampler version is Open Source Release 1.15 (version 3.6.15)
Actually, the make call builds two executable files: APSampler, and permut in the PermutationTest folder.
There are some conventions that ensure pleasant and safe use of APSampler. The idea is that all the installation files are kept in one directory, the actual work is done in another folder, and a new folder is created for each calculation project. This ensures that no important files are deleted from the installation while working (APSampler produces some files that you would occasionally need to delete if you used the same folder for all calculations, and this could result in deleting something important). If you use the Windows binary, copy all the program files and folders into each working directory. If you use the source and compile it, you can link the program files and folders into the working directory. The linkage is maintained by UNIX symbolic links (basically shortcuts); linking is automatic, and the script that runs it is called "link_me".
If you use the Windows binary bundle, extract the contents of the APSAMPLER_##_##_## folder of the archive (##_##_## is the version number) into the working directory. Each working directory must carry the unpacked executable, scripts, and folders.
If you use the source bundle and have already compiled it somewhere, it is not necessary to copy the results into the working directory. Instead, you can link the executable files, scripts, and folders from the /src directory using the link_me script.
This part describes the actions to be taken to run a data analysis with APSampler. Examples of all the files described below can be found in the installation directory with the extension .example. The main file in each run is the configuration file; it is where all parameters are brought together and file names are specified.
In all the input files (configuration, data), lines that start with ## are treated as comments.
The disease (phenotype trait) level file is a plain text file that carries N real numbers. These numbers represent the level of the trait. The absolute level of the value does not matter; the information is coded in the comparisons. Thus (1,2,3) means the same as (0.1,0.2,10). The order of the values corresponds to the order of the lines in the genes data file: the N-th value in the disease level file and the (N+1)-th line in the genome data file describe the same person, since there is no header in the level file and there is one in the genes file.
If the program meets a line in the disease level file that cannot be parsed as a real value, it omits the corresponding person's information completely. This is useful if we want to narrow the sample. For example, if there are healthy controls (0) and EDSS level (1,2,3) persons and you would like to compare only healthy controls with EDSS level 3 patients, replace 1 and 2 with some word, e.g. "omit". Similarly, to compare all healthy versus all ill, simply replace all 1, 2, 3 with one number, say 1. This ensures that there is no need to keep track of the relation between the levels and genes files: the number of records in both stays the same. Do not forget to create a separate folder for each calculation, since after substituting e.g. 1,2,3 with 1, the backward substitution is not possible.
Here is an example for 5 persons having 3 levels:
1
2
2
3
1
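The recoding described above can be scripted in one line. The following is only a sketch; the file names levels.txt and levels_0_vs_3.txt and the level values are hypothetical:

```shell
# Hypothetical level file with healthy controls (0) and EDSS levels 1-3:
printf '0\n1\n2\n3\n0\n' > levels.txt
# Replace levels 1 and 2 with the word "omit", so that only the
# 0-vs-3 comparison remains; the number of lines stays the same.
awk '{ if ($1 == 1 || $1 == 2) print "omit"; else print $1 }' levels.txt > levels_0_vs_3.txt
```

After this, levels_0_vs_3.txt contains 0, omit, omit, 3, 0, and the persons with levels 1 and 2 are skipped by APSampler while the line count still matches the genes file.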
The genome data file is organized as plain text containing 1+N lines. The first line (header) contains L tab- or space-delimited words that are the names of the L loci investigated in the program run. All the N remaining lines (2..1+N) carry genotyping results. Each of them contains 2*L tab- or space-delimited words that give the names of the alleles on the two chromosomes at each locus. The two alleles corresponding to the same locus are adjacent. There is a special allele name "0" (the number zero) that means the allele of the person is unknown (e.g. it was not typed). Such a hole in the data does not prevent the person from being included in the statistics for patterns that do not need the unknown allele.
Here is an example for 5 persons genotyped for 3 SNPs:
SNP1 SNP2 SNP3
a b f f c t
a b 0 f c c
b a g g c c
a a f f 0 t
b b g g t c
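A quick sanity check on a genome data file is to verify that every data line carries exactly 2*L fields, where L is the number of locus names in the header. This is only a sketch; the file name genes.txt is hypothetical, and the file content is the example above:

```shell
# Recreate the 3-SNP, 5-person example file from above (hypothetical name).
printf 'SNP1 SNP2 SNP3\na b f f c t\na b 0 f c c\nb a g g c c\na a f f 0 t\nb b g g t c\n' > genes.txt
# L = number of locus names in the header; each data line must have 2*L fields.
L_COUNT=$(head -n 1 genes.txt | wc -w)
awk -v n="$((2 * L_COUNT))" '
    NR > 1 && NF != n { print "line " NR ": " NF " fields, expected " n; bad = 1 }
    END { if (!bad) print "OK" }' genes.txt
```

For the example file the check prints OK; a malformed line would be reported with its line number and field count.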
The configuration file is a plain text file that describes, line by line, the task for a program run. Each line has the following format:
Tag = something
Any line tail after ## is ignored as a comment, as usual. Spaces and tabs are ignored everywhere in the line.
We support Boolean, integer, double, string, and integer-list tag value types. Each type has a corresponding procedure for reading its tags. A Boolean tag value is one of the strings "true", "yes", "1", or "on", meaning true, or "false", "no", "0", or "off", meaning false. Any integer tag, in a list or alone, uses C radix notation: 0x20 = 040 = 32. Be careful: 040 read as a double means 40, not 32. An integer list is comma-, semicolon-, or colon-delimited. The notation for doubles is the traditional one. The configuration file consists of three parts: the input data description, where you specify some information on the input data; the parameter description, which specifies parameters for the actual calculations; and the output description.
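To illustrate the value types, here is a small configuration fragment; the tags are described later in this manual, and the values are examples only:

```
use_recombination = yes        ## Boolean: true/yes/1/on or false/no/0/off
patterns_number = 2            ## integer: 0x20, 040 and 32 all mean 32
recombination_probability = 0.3  ## double, traditional notation
allele_variants = 2,2,3        ## integer list, comma-, semicolon- or colon-delimited
```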
All lines of the configuration file that are not used by APSampler are simply ignored, meaning that if a parameter name is written incorrectly and the parameter is not obligatory, the program will not produce an error. This also means that the configuration file can store configuration data for other programs created by APSampler users.
To run APSampler without tackling the statistical parameters, you only need to specify the actual names of the genes and disease level files, the number of gene sets (lines) in the genome file, the number of loci, and the number of allele variants for each locus. The following configuration file lines are obligatory:
gene_sets_number=N
N (a natural number) is the size of the sample we work with (i.e. the number of gene sets).
loci_in_set=L
L (a natural number) is the number of loci involved in the search.
allele_variants=comma-delimited-list-of-natural-numbers
The loci can be biallelic (SNPs, deletions, etc.) or polyallelic, so this list gives the number of allele variants at each locus, one number per locus. The order of loci is the same as in the gene data file. The list can be written as 2,2,2,2,2,3,2,2,2,2 as well as 2*5,3,2*4 (the latter is better for long lists).
genome_data_file=filename
The filename is the name of the genome data file.
disease_levels_file=filename
The filename is the name of the file describing the level of the disease or other phenotype trait of interest.
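Putting the obligatory tags together for the 5-person, 3-SNP example data above gives a minimal configuration file like the following (the file names are hypothetical; the output file tags are described in the output section):

```
## minimal APSampler configuration for the example data (hypothetical names)
gene_sets_number = 5          ## 5 persons
loci_in_set = 3               ## 3 SNPs
allele_variants = 2,2,2       ## each SNP is biallelic
genome_data_file = genes.txt
disease_levels_file = levels.txt
statistics_output_file = example.stat
report_output_file = example.report
```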
Although these parameters are pre-configured to suit standard situations, you may need to adjust them for a particular application. Parameters are given here with example values for better comprehension.
A Boolean can be 'off', 'on', 'yes', 'no', 'true', 'false', 1, or 0.
patterns_number=N
An obligatory tag; N is the number of patterns in the patternset to look for. Usually 2 or 3; see (1) for more details.
proposal=semi-dirichlet or dirichlet or flat
The kind of proposal distribution for a point mutation. flat: the proposal is drawn from a flat distribution. The other two require the monoallelic test for every allele in every locus; this is the result of the usual test for a set of patterns, applied to a set consisting of one pattern that contains only the allele of interest. The distribution is the resulting allelewise Dirichlet distribution. semi-dirichlet: the locus position is drawn from a flat distribution, the resulting allele from the Dirichlet. dirichlet: both the locus position and the allele are drawn from the Dirichlet.
null_hypothesis_prior=p
The prior probability for a pattern to be not associated with the disease. It must be between 0 and 1, or auto (or omitted, or -1). The auto value depends on the study type (see the next tag). For gwas, it is quite low (11/number of loci). For candidate, we set it to 0.5.
study = gwas | candidate | unknown. The latter is the default; in that case, we decide what the study is based on the number of loci: 256 is the threshold between candidate and gwas.
zero_prior=p
The a priori probability of drawing 0 as the proposal in a point mutation step (i.e. the prior probability that a given locus is not included in a pattern associated with the disease). If it is -1, auto, or omitted, the probability equals 1 - 1/(2n), where n is the number of loci with a nonzero number of allele variants in the dictionary. We want the prior expectation of the number of alleles in a pattern to be 1.
steps_number=N
Deprecated; max_warm_up_steps_number is used now.
report_every_n_points=N
A record is written to the report_output_file every N points visited (steps). A usual N is 100000.
notice_every_n_points=N
A notification string is written to stdout every N points (steps). A usual N is 100000.
gather_statistics_not_less_then=N
How long we want the sampler to gather the 'best patternsets' and 'best patterns' lists. A usual value is 1000000.
patternsets_to_keep=N
The length of the 'best-of' list of patternsets. If keep_patterns is all, it is not needed and N can be set to 0. Otherwise, a usual value is 2587.
keep_patterns=all or as-is
as-is means keeping only the patterns in the best patternsets; all requires keeping all the patterns. The latter option makes the simple validation the most informative, but significantly slows down the whole process.
generate_first_point_from_data=boolean
Obsolete; kept for compatibility. true (the default) is a synonym for first_point = selections; false is a synonym for first_point = given. It works only if first_point is not given explicitly.
generated_first_point_number=N
N is the ordinal number of the original patternset in the list of combinations. It works only with generate_first_point_from_data=yes. The default is 1.
first_point = zero | given | selections | combinations | ladder. How the first point is generated. If given, we do not generate anything and just read the original_pattern_set_file, or start with zeroes. zero is simply a shortcut to start with zeroes. random generates the first patternset independently from the priors. If the value is selections (the default), ladder, or combinations, we first generate the list of the alleles that have nonzero strength. For selections, we generate different selections of these alleles for each pattern, one by one; generated_first_point_number is the number of the selection for the first pattern. First, we take 1-element selections (the alleles themselves), then pairs, etc. If we do not have enough selections, we leave the remaining patterns empty. If the value is ladder, we initialize the first pattern with all nonzero alleles, the second with one allele less, and so on; if there are more patterns than nonzero alleles, some patterns remain zero-filled. generated_first_point_number works in the same way as for selections. combinations does the same as selections, but the size of the combinations (AKA selections) is the same for all the patterns and is automatically chosen to provide the most different selections.
original_pattern_set_file=pfile
We read the original patternset from the file pfile.
long_random_seed_1=long_1 long_random_seed_2=long_2
The random seeds for the random number generator. You can obtain another chain for the same data by changing these parameters.
use_recombination=boolean
There are two possible kinds of updates between APSampler steps: a point mutation, which changes only one allele, and a recombination, which recombines two patterns in the current patternset. This key says whether to use recombination steps during sampling. yes is a good value.
max_mute_steps_number=N
A synonym for max_warm_up_steps_number.
max_warm_up_steps_number=N
The maximal number of steps before the sampler starts gathering statistics. The warm-up in a regular run is restricted by a data-based threshold, so you do not need this parameter for an exploratory run. Still, if you need more predictable behaviour on noisy data, especially if you intend to run the permutation validation, the parameter is necessary. The most natural value is 0, which means no warm-up, i.e. statistics is gathered from the first step.
The default value is the gather_statistics_not_less_then value.
unknown_allele_tag=word
A word or symbol that denotes missing data in the genes file. By default, missing data are represented with 0's, but if 0 is something meaningful, another tag can be used.
recombination_probability=p
If recombination steps are used (parameter use_recombination), p (in (0,1)) is the probability of a recombination attempt versus a point mutation attempt. See (1) for more details.
forbid_void_attempts=boolean
Whether to block the probability engine from suggesting a move that does not change anything (i.e. uses the same point as the previous one in the chain). Normally yes; use no only if you intend to use the information about sampler visits for each set of patterns in a frequentist manner.
patternsets_to_keep_ratio=double
Can be used to store a proportion of patternsets rather than the explicit number given via the patternsets_to_keep parameter.
statistics_output_file=statisticsfilename
The file name for the statistics output.
statistics_output_file_format=yaml or matrix
In old versions (3.6.1 and earlier) the only format was matrix. That format was overloaded with zeroes and very expensive. The current (>=3.6.2) default is yaml.
reliability_level_for_detailed_report=p
This tag regulates the log level of the report. It sets up a level such that any point with a reliability higher than the level is reported with a lot of sampler details. These output points are in addition to the regular report output. The default value is 1.1 (the reliability is always between 0 and 1, so no point can satisfy it).
report_output_file=reportfilename
Filename of the report file.
snapshots_file=snapshotfilename
Where to write snapshots. Snapshots are step-by-step chain reports; they are useful for understanding what happens if you see something strange.
Default is stdout.
snapshots_number=unsigned long
The number of snapshots to output. If <=0, no snapshots are written. If >0, that number of snapshots will be written sequentially from the start point.
The default is 0 (no snapshots).
Since the validation scripts use the same configuration file as the main program does, there are some tags that are used only by the validation scripts. See their description in the validation Readme.
The validation procedure creates a lot of intermediate files, so it is better to keep each project (data, config and results) in a separate directory. If you prefer to keep two or more projects in one folder, just keep their configuration file names different enough to trace what happens.
While keeping your work directory as the current, say:
~/APSampler/src/link_me
If everything is OK, all the Perl and sh utility scripts are now linked into your working directory, and the directory is ready to run the pattern identification process. The validation procedure and scripts are described in ReadmeValidation. Actually, you can swap the two last steps (data preparation and linking), but it is more convenient to do them in this order.
The APSampler software package consists of several parts: C executables and Perl scripts. Pattern sampling from the space of all possible patterns is done by the executable file APSampler (APSampler.exe for Windows); the results validation is done by multiple Perl scripts that are described in more detail in ReadmeValidation. After link_me has finished, all the necessary files are linked into the working project folder. The following parts describe how to run them and in which order. Before starting, ensure you have correctly configured the input as per the configuration chapter and that all input files are located in the project working directory.
Invoke APSampler by typing the following (here and throughout the text, names in angle brackets are to be substituted with real file names) when the project's working directory is current:
./APSampler <config-file-name>
You should see some changes on the screen as APSampler performs the pattern sampling. It reports a line periodically (how often depends on the report_every_n_points parameter in the configuration file). Once it is finished, it indicates this by writing Done. The run results in the program finding the optimal patterns. The resulting file name is pointed to by the statistics_output_file= tagged line in the configuration file. The file is text, but it is hardly human-readable. Normally you do not want to look inside, as these are only preliminary results without any statistical validation. Readability is not necessary: the next task is to sort the findings, validate them, and then compute statistical parameters. The statistics file is the input for this procedure, which is referred to as validation in the APSampler documentation.
The validation procedure does a lot of things. Both simple validation and permutation-based validation are done by the patterns_validation.pl script. It performs the whole validation procedure; if it finds the null-distribution data file (its name is <config-file-name>.null-distribution), it evaluates both the permutation-based and the simple p-values and the corresponding MHC. If the file exists, we call it permutation-based validation; if it does not, it is simple validation. The null-distribution data file is prepared by the null_statistics_gather.pl script.
Once APSampler has finished, you can start the validation script. Say
perl patterns_validation.pl <config-file-name> > simple-validation-results-file-name
The result is an output file enumerating the patterns, sorted by p-value. The validation script builds a contingency table and calculates the exact Fisher's test, or the Kruskal's gamma p-value, for each pattern. The odds ratio is also calculated for each pattern (if any of the categories of the contingency table is zero, Peto's estimate of the odds ratio is used to avoid division by zero). If the configuration line keep_patterns= had the value all during both the APSampler and the validation script run, the script also evaluates the Bonferroni correction and the Benjamini-Hochberg FDR value for each pattern.
The output record in the result file for each pattern looks like:
Pattern contains 3 informative alleles:
Ex_locus_5:C,T; Ex_locus_6:2_loc_6.
Fisher 4-pole table:
ctrl case
18 3 carriers
90 112 noncarriers
Fisher's exact p-value = 0.0002557199
OR=0.13393 CI(95%)=[0.03824..0.46904]
Corrected (Bonferroni) p-value = 0.33576020311669
q-value (Benjamini-Hochberg FDR) = 0.33576020311669
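The odds ratio in the record above can be cross-checked directly from the 2x2 table (a sketch, using awk only for the arithmetic):

```shell
# OR = (case carriers * ctrl noncarriers) / (ctrl carriers * case noncarriers)
#    = (3 * 90) / (18 * 112)
awk 'BEGIN { printf "OR=%.5f\n", (3 * 90) / (18 * 112) }'   # prints OR=0.13393
```

This matches the OR=0.13393 reported by the validation script.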
We see in the example above that the Bonferroni correction appears to be too strict. At the same time, we need some kind of multiple hypothesis correction for the p-values. The more sophisticated option is the permutation-based validation. It works by performing the same APSampler run on data that differ from the original only by a balanced permutation of the case/control tags. The data from these runs come from runs whose characteristics (data dimension, run parameters) are absolutely the same as in the original run, but with any association we looked for removed. So, everything found in these data is stochastic in nature rather than a real association, and thus the findings of the permuted runs are used to form the null distribution. To prepare it, say:
perl null_statistics_gather.pl <config-file-name> <N>
N is the number of permutations; at least 50 is recommended. The script will prepare statistics for N runs with level files permuted in a balanced manner. The procedure prepares N permuted data files, then runs APSampler on each of them, validates the results of each run while gathering the statistics of all the p-values, and, finally, collects all the N resulting null distributions into the file <config-file-name>.null-distribution. Once the file exists, you can say
perl patterns_validation.pl <config-file-name> > full-validation-results-file-name
and you will obtain the result file. Its format is similar to that of the simple validation file, but it contains lines like:
Fisher's exact p-value = ...
OR=0.06510 CI(95%)=...
Corrected (Bonferroni) p-value = ...
q-value (Benjamini-Hochberg FDR) = ...
Standard permutation p-value = ...
Permutation (Westfall-Young) p-value = ...
FDR = ...
for each pattern.
The file obtained from this full validation is the final file of the APSampler package. Further actions are related to interpreting the results and are done manually.
Interpreting and choosing the results is the least automated part of the APSampler software package and is currently done manually. Below are just a few hints on the standard actions to perform on the results file.
There are two shell scripts in the src directory; they are linked into any working directory by the link_me script. At some point you may want to launch the whole calculation and check back only when the results are available. This is possible using these scripts.
To run the do_sample script, say
bash do_sample <config-file-name>
and the script will run APSampler and then the simple validation. It also creates a flag file .sampled.<config-file-name> to let the second script, do_validate, know that the sampling is over.
The result is in <config-file-name>.validation-simple file.
To run do_validate script, say
bash do_validate <config-file-name> <N>
N is the number of permuted runs; 50 is recommended. The script will prepare the permuted run results, then wait for the .sampled.<config-file-name> flag and run the validation again.
The result is in <config-file-name>.validation-full file.
1. Favorov, A. V., Andreewski, T. V., Sudomoina, M. A., Favorova, O. O., Parmigiani, G., and Ochs, M. F. (2005). A Markov chain Monte Carlo technique for identification of combinations of allelic variants underlying complex diseases in humans. Genetics, 171(4), 2113-2121. PMID: 16118183.
We use the <parameter> notation for obligatory parameters and the [parameter] for optional parameters for scripts, e.g.
perl perl_script <obligatory parameter> [optional parameter]
Wiki: APSamplerWikiHome
Wiki: ReadmeQuickStart
Wiki: ReadmeValidation