Menu

Modules

Pasi Rastas

Lep-MAP3 (LM3) consists of multiple modules. Each of these modules can run by

java -cp path moduleName [parameters]

, where "path" is the directory where the class files are located. Each module requires some parameters, simply running "java -cp path moduleName" lists these parameters and a short description of each parameter.

The most commonly used modules are ParentCall2, Filtering2, SeparateChromosomes2, JoinSingles2All and OrderMarkers2. These as analogous to the modules (ParentCall, Filtering, SeparateChromosomes, JoinSingles and OrderMarkers) of Lep-MAP2.

In order to map millions of markers (e.g. whole genome sequencing), modules SeparateIndenticals, JoinSingles2Indeticals and JoinLGs can be used. However, these are still in very preliminary form. Moreover, as JoinSingles2All and SeparateChromosomes2 now utilise multiple cores, these could probably be used as well on larger data.

In order to verify the linkage mapping results, module LMPlot can be used. It will output the Lep-MAP graph in Graphviz DOT language format. The output can be visualized with dot or with xdot.py from https://github.com/jrfonseca/xdot.py .

ParentCall2

The module ParentCall2 is used to call (possible missing or erroneous) parental genotypes and markers on sex chromosomes (XY or ZW system, assumes known offspring sex). Grandparents and half-sib familes are also supported (especially the grandparents should be provided if there is data on them).

The input of ParentCall2 consists of genotype likelihoods (posteriors) for each 10 possible SNP genotypes AA, AC, AG, AT, CC, CG, CT, GG, GT and TT. Other kind of variants (like indels) can be given as input by specifying them as SNPs (e.g. AA=homozygote indel, AT=heterozygote indel, TT=no indel). Typically these likelihoods can be obtained from sequencing data or from SNP assays. The output is also in the same likelihood format.

The first 6 lines presents the pedigree. First line is the family name, second individual name, third and fourth are the father and mother. Line 5 containts the sex of each individual (1 male, 2 female, 0 unknown) and the last line is the phenotype (can be 0 for all individuals, this is not currently used). The likelihoods can be provided from line 7 forward (columns must match) or on a separate file given as parameter posteriorFile or vcfFile. Finally, columns 1-2 give marker names (scaffold and pos) for genotypes, and can be any value for pedigree part. Thus, make sure that each line has n+2 tab separated columns if there are n individuals and column i + 2 gives the genotype and pedigree information on individual i.

Note that Lep-MAP2 is provided with scripts to convert linkage files to posteriors. These scripts make it easy to use ParentCall2 also for data only in genotype format. However, for best maps, genotype likelihoods (without called genotypes) should be obtained especially on sequencing data.

Examples:
java ParentCall2 data=post_with_pedigree.txt >data.call

zcat post.gz|java ParentCall2 data=pedigree.txt posteriorFile=-|gzip >data.call.gz

zcat data.vcf.gz|java ParentCall2 data=pedigree.txt vcfFile=-|gzip >data.call.gz

Filtering2

The Filtering2 module handles filtering of the data, i.e. filtering markers based on, e.g. high segregation distortion (dataTolerance) and excess number of missing genotypes (missingLimit). This module outputs the filtered data in the same format to be used with other modules and for further analysis (e.g. QTL mapping).

Example :
java Filtering2 data=data1.call dataTolerance=0.001 >data_f.call

zcat data.call.gz|java Filtering2 data=- dataTolerance=0.001|gzip >data_f.call.gz

SeparateChromosomes2

The SeparateChromosomes2 module assigns markers into linkage groups (LGs) by computing all pair-wise LOD scores between markers and joins markers that with LOD score higher than a user given parameter lodLimit. The output is a map file where line i (not commented by #) corresponds to marker i and gives its LG name (number).

Example :
java SeparateChromosomes2 data=data_f.call lodLimit=5 >map5.txt

java SeparateChromosomes2 data=data_f.call lodLimit=5 theta=0.05 >map5.5.txt

(parameter numThreads utilises multiple cores)

JoinSingles2All

The JoinSingles2All module assigns singular markers to existing LGs by computing LOD scores between each single marker and markers from the existing LGs. It needs a map file as input parameter and then it outputs a new map file with possible more markers assigned to LGs.

Example :
java -cp bin/ JoinSingles2All map=map5.txt data=data_f.call lodLimit=4 >map5_js.txt

zcat data_f.call.gz|java -cp bin/ JoinSingles2All map=map5.txt data=- lodLimit=3 lodDifference=2 >map5_js.txt

(parameter numThreads utilises muliple cores)

(iterated joinSingles2All yields same result as iterating it until no markers can be added)
java -cp bin/ JoinSingles2All map=map5.txt data=data_f.call lodLimit=4 iterate=1 >map5_js_iterated.txt

OrderMarkers2

OrderMarkers2 orders the markers within each LG by maximizing the likelihood of the data given the order. Either a map file (map=map_file) or output of previous run of OrderMarkers2 (evaluateOrder=order_file) have to be provided (only the first column (the marker name or lg name) is read from these files).

The output consists of marker name (numbered as 1..N) with paternal and maternal map position on columns 1-3. To provide identical output as in Lep-MAP2, there is also 0 as an error estimate on column 4.
If the option outputPhasedData=1 (or 2) is given, columns 5- give the phased data on each marker and family as well as the number of maternal and paternal recombinations. Each print (string of 0 and 1) will first give the paternal and then maternal segregation pattern.

Important parameters for OrderMarkers2 (defaults should be ok, however, sometimes better results can be obtained by tinkering these):
recombination1 (and 2): recombination rate for male (female)
interference1 (and 2): recombination interference for male (female)
scale: to cope with more markers than individuals
minError: genotype likelihoods are capped to this

faster marker ordering:
identicalLimit: adjacent markers are collapsed if their likelihoods are within this limit
numThreads: maximum number of threads to use

Examples :
java -cp bin/ OrderMarkers2 map=map5_js.txt data=data_f.call chromosome=1 >order1.txt

java -cp bin/ OrderMarkers2 evaluateOrder=order1.txt data=data_f.call >order1.2.txt

Compute map distances from a given order:
java -cp bin/ OrderMarkers2 evaluateOrder=order2.txt data=data_f.call improveOrder=0 >order1.3.txt

Compute sex averaged map distances from a given order
java -cp bin/ OrderMarkers2 evaluateOrder=order1.txt data=data_f.call improveOrder=0 sexAveraged=1 >order1.SA.txt

achiasmatic meiosis (no recombination in male):
java -cp bin/ OrderMarkers2 map=map.txt data=data_f.call recombination1=0

achiasmatic meiosis (no recombination in female):
java -cp bin/ OrderMarkers2 map=map.txt data=data_f.call recombination2=0

It is typically more convinient to order each chromosome separately
java -cp bin/ OrderMarkers2 map=mapBig.txt data=dataBig.call chromosome=1 >order1.1.txt
...
java -cp bin/ OrderMarkers2 map=mapBig.txt data=dataBig.call chromosome=N >orderN.1.txt

Hints for OrderMarkers2:

To be sure on the marker order, it is wise to repeat marker ordering a few times. Order with best likelihood is typically the most correct.

The errors in marker orders can be detected by making Lep-MAP graph and visualising it. See the top of this page for more information. (It is difficult to produce and comprehend LOD score plots for very large datasets)

LM3 ouputs the number of recombinations for each offspring (to the error stream, use 2>err.txt to store these). Problematic individuals can be spotted as those recombining too many times.

It can be wise to remove markers at the map ends that cause long gaps.

BE SURE that your individuals match the pedigree. For example, use Lep-MAP3 IBD module
(with genotype likelihood data before ParentCall), to verify that your individuals are full-sibs.

LMPlot

LMPlot outputs a Lep-MAP graph in Graphviz DOT language format from a given order. It can be used to verify the linkage mapping results.

The output can be visualized with dot or with xdot.py from https://github.com/jrfonseca/xdot.py .

Example:

for every chromosome :
java -cp bin/ OrderMarkers2 map=map.txt data=data_f.call outputPhasedData=1 chromosome=1 >order_wp_1.txt
java -cp bin/ LMPlot order_wp_1.txt >order_wp_1.dot
xdot.py order_wp_1.dot

or:
java -cp bin/ OrderMarkers2 evaluateOrder=order1.txt data=data_f.call outputPhasedData=1 >order_wp_1.txt
java -cp bin/ LMPlot order_wp_1.txt >order_wp_1.dot
xdot.py order_wp_1.dot

The nodes of the graph are numbered in the order they first occur in the map (order_wp1_txt). The edge labels give the index for individual haplotype that recombines (changes). If the order is (about) correct, the node number should be in order 1,2, ..., N when following the nodes from one end of the chain to the other. Also the size of the nodes gives information how common each marker is, uncommon markers at the ends could be erroneous.

IBD

Examples:

zcat post_from_pipeline.gz|awk '{if (NR==1) {print;print;print;print;print}; print}'|java IBD data=- >ibd.txt

#Then Listing ibd values in descending order
sort -n -r -k 3,3 ibd.txt|less

#or relatedness to a potential parent
sort -n -r -k 3,3 ibd.txt|grep -w possible_parent|less

zcat post_from_pipeline.gz|awk '{if (NR==1) {print;print;print;print;print}; print}'|java IBD data=- parents=file:parents.txt >parenthood.txt

#listing the combinations with ascending error-rate
sort -n -k 4,4 parenthood.txt|grep -v NA|less

WGS data

There are WGS versions of modules separating markers into linkage groups (SeparateIdenticals, JoinSingles2Identicals, ...). However, the current version of SeparateChromosomes2 is fast enough to run within a few days even on several millions of markers on a computer cluster with enough cores (say 24 cores and numThreads=24). Using SeparateChromosomes2 is the preferred way as it is much simpler to run. Further note that you can refine linkage group assignments and mask markers by providing map file to SeparateChromosomes2 (map=file).

Phasing and QTL mapping

It is easiest to use data phased by LM3 for QTL mapping. This can be obtained with OrderMarkers2 using parameters grandparentPhase=1, outputPhasedData=1 (or 2) to output the grandparental phased data.
With outputPhasedData=1, there won't be any missing data making QTL analysis straighforward.

The phased data can be converted to fully informative "genotype" data by map2gentypes.awk script. If you provide parameter fullData=1 to this script, also the pedigree information and parents are given out. However, as the order file does not contain individual names, the individuals will be re-named with running numbers. The offspring are in the same order as in the input data and parents are given as the first individuals of every family. These new parental genotypes are always "1 2". Moreover, this data is phased so that the first digit of the genotypes is inherited from father and the second from mother. If the data is in grandparental phase, this also applies to the parents.

The grandparentPhase=1 uses only those markers that can be phased using the grandparents. However, to obtain maximum number of markers, you can construct the maps without grandParentPhase=1, then evaluate the order and get phase using grandparentPhase=1, and finally match the phases of the two data on the common markers. The script phasematch.awk can be used for matching the phases.

To be continued...


Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.