This document describes the APSampler DAta CONversion toolkit, which was built to meet the natural need to convert data between the established data formats used for analysis in genomics. Additionally, the toolkit provides a script for composing an APSampler configuration file.
More information on the data format used by APSampler can be found in the APSampler documentation.
Specifically, the conversion directions are: APSampler to LGEN (atolec), LGEN to APSampler (ltac), and PED to APSampler (ptac).
All scripts are written in Perl, so check that you have it installed. To run the scripts, place them next to your data files and call them from the command prompt. The common pattern for calling the scripts is:
perl <name of the script> --<name of the parameter 1> <parameter 1 value> .. etc.
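The pattern above can also be assembled programmatically, for instance when several conversions have to be run in a row. The following is a minimal sketch, not part of the toolkit: the helper name build_command and the mask value "study" are hypothetical, and only the script name ltac.pl and the --mask parameter come from this document.

```python
import shlex


def build_command(script, **params):
    """Assemble 'perl <script> --<name> <value> ...' as an argument list."""
    argv = ["perl", script]
    for name, value in params.items():
        # Every parameter is passed as a --name value pair, as in the pattern above.
        argv += [f"--{name}", str(value)]
    return argv


# "study" is a made-up common file name used only for illustration.
argv = build_command("ltac.pl", mask="study")
print(shlex.join(argv))  # prints: perl ltac.pl --mask study
```

The resulting list can be handed to subprocess.run(argv, check=True) if the call should actually be executed from Python.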
A more detailed explanation of each script follows.
atolec converts files from the APSampler format described in the documentation into the LGEN format, which is compatible with PLINK and other popular software (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#long). You can either specify an APSampler control file, e.g. --cfile cfile_sclerosis.txt, or specify the levels and genes files separately: --levels levels_scl.txt --genes snps_scl.txt. The only difference is that in the first case the levels and genes files are read from the locations referred to in the control file. The script writes three files to the directory, corresponding to the LGEN format: MAP.map, LGEN.lgen, FAM.fam.
ltac does the opposite of atolec: it converts the LGEN format to the APSampler format, taking the .map, .lgen and .fam files as input. The script can be called in either of two ways:
perl ltac.pl --lgen <name of .lgen file> --fam <name of .fam file> --map <name of .map file>
perl ltac.pl --mask <common name for the 3 files assuming different extensions .fam, .map, .lgen>
The files <name of .lgen file>.levels and <name of .lgen file>.genes will be produced as output.
NOTE: Do not delete your original files after conversion. Some valuable information may not be transferred to APSampler because of format differences (for example, family information from .fam or locus distance information from .map, neither of which is used in APSampler).
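To make the input side of this conversion more concrete, the sketch below reads rows in the PLINK long format, where each row is assumed to hold five whitespace-separated columns: family ID, individual ID, SNP ID, allele 1 and allele 2. It only illustrates the layout and is not how ltac.pl works internally. The function name read_lgen and the sample rows are invented for this example.

```python
from collections import defaultdict


def read_lgen(lines):
    """Group long-format genotype rows by (family ID, individual ID)."""
    genotypes = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        fid, iid, snp, allele1, allele2 = fields
        genotypes[(fid, iid)][snp] = (allele1, allele2)
    return dict(genotypes)


# Made-up rows, for illustration only.
sample = [
    "FAM1 IND1 rs0001 A G",
    "FAM1 IND1 rs0002 C C",
    "FAM2 IND2 rs0001 A A",
]
table = read_lgen(sample)
print(table[("FAM1", "IND1")])  # prints: {'rs0001': ('A', 'G'), 'rs0002': ('C', 'C')}
```

Note that these rows carry genotypes only. Family structure and locus positions live in the .fam and .map files, which is why the originals are worth keeping.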
ptac converts .ped and .map files in the widely used PED format (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped) into the APSampler .levels and .genes format. The script can be called in either of two ways:
perl ptac.pl --ped <name of .ped file> --map <name of .map file>
perl ptac.pl --mask <common name for the 2 files with extensions .ped, .map>
The output is the <name of lgenfile>.levels and <name of lgenfile>.genes files, which you can use with APSampler.
NOTE: PED is strictly binary in terms of locus description (each gene may have only two variants: 1 or 0, A or G, and so on), whereas APSampler is truly multilocus and multiallelic. Therefore, and to avoid the hassle of justifying a dichotomization, we do not provide conversion to PED format. However, a dichotomizer exists and can be provided by us on request.
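The binary nature of PED mentioned in the note can be checked on a data set before conversion. The sketch below assumes the standard PLINK PED layout: six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two allele columns per locus, with 0 marking a missing allele. The function names and sample rows are invented for this illustration and are not part of ptac.pl.

```python
def parse_ped_line(line):
    """Split one PED row into its six leading columns and per-locus allele pairs."""
    fields = line.split()
    meta, alleles = fields[:6], fields[6:]
    if len(alleles) % 2:
        raise ValueError("expected two allele columns per locus")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return meta, genotypes


def alleles_per_locus(ped_lines, missing="0"):
    """Collect the set of distinct non-missing alleles observed at each locus."""
    seen = []
    for line in ped_lines:
        _, genotypes = parse_ped_line(line)
        if not seen:
            seen = [set() for _ in genotypes]
        for locus_alleles, pair in zip(seen, genotypes):
            locus_alleles.update(a for a in pair if a != missing)
    return seen


# Made-up rows with two loci each, for illustration only.
sample = [
    "FAM1 IND1 0 0 1 2 A G C C",
    "FAM2 IND2 0 0 2 1 A A C T",
]
print([len(s) for s in alleles_per_locus(sample)])  # prints: [2, 2]
```

A count above 2 at any locus would indicate data that does not fit the two-allele model described above.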
Once the .levels and .genes files have been created, you need to compose an APSampler control file in order to run APSampler. This can take extra time, because you have to write down some information about the data content. The wric utility makes this procedure easy. To run:
perl wric --levels <affection levels file> --genes <genotype file>
which will produce <levels file name>.cfile as output.
This script is intended to mark up the results of validation. You can run it on the validation file alone, or additionally supply enrichment information in the ONIONTREE XML format. In both cases the result is stored in an HTML file. When used with validation results only, it generates random colors to mark the loci involved in the results, which makes the output easier to read. When an ONIONTREE XML file is also provided, and it includes information on groupings of loci by some parameter (e.g. inclusion of loci in KEGG pathways), the script checks whether the APSampler patterns fall into pathways and marks those patterns that fall entirely within a single pathway. Any descriptors for loci can be used, not just pathways, but then you have to generate the ONIONTREE file yourself (see link above for schema description). For pathways, a tool is available to query KEGG for the pathways of a list of loci (see ASAP).
To run:
perl markup.pl <cfile>
For the script to be used for enrichment, a file named oniontree.xml must exist in the same directory.
In the call patterns for the scripts, we use the <parameter> notation for obligatory parameters and the [parameter] notation for optional parameters, e.g.
perl perl_script <obligatory parameter> [optional parameter]