Menu

LM3 Home

Pasi Rastas
Attachments
example_data.txt (4195 bytes)
pipeline.png (9488 bytes)

Lep-MAP3: Documentation

Pasi Rastas (C), 2014-2019, pasi.rastas@gmail.com

Lep-MAP3 (LM3) is a novel linkage map construction software suite. It can handle millions of markers and thousands of individuals possibly on multiple families. Input genotype data can be from genome sequencing (RADseq or whole genome sequencing), SNP assay, microsatellites or any mixture of them. However, as the input data should be given as genotype likelihoods (posteriors), genome sequencing and SNP assay data is preferred.

If you use Lep-MAP3, please cite
P. Rastas. Lep-MAP3: Robust linkage mapping even for low-coverage whole genome sequencing data, Bioinformatics, https://doi.org/10.1093/bioinformatics/btx494 .

The following wiki will document its features. Each module of LM3 is documented below with examples on simulated data. The used data can be downloaded from here.

Introduction

Lep-MAP3 (LM3) consists of multiple modules. Each of these modules can run by

java -cp path moduleName [parameters]

, where "path" is the directory where the class files are located. Each module requires some parameters, simply running "java -cp path moduleName" lists these parameters and a short description of each parameter.

The most commonly used modules are ParentCall2, Filtering2, SeparateChromosomes2, JoinSingles2All and OrderMarkers2. These as analogous to the modules (ParentCall, Filtering, SeparateChromosomes, JoinSingles and OrderMarkers) of Lep-MAP2.

In order to map millions of markers (e.g. whole genome sequencing), modules SeparateIndenticals, JoinSingles2Indeticals and JoinLGs can be used. However, these are still in very preliminary form. Moreover, as JoinSingles2All and SeparateChromosomes2 now utilise multiple cores, these could probably be used as well on larger data.

In order to verify the linkage mapping results, module LMPlot can be used. It will output the Lep-MAP graph in Graphviz DOT language format. The output can be visualized with dot or with xdot.py from https://github.com/jrfonseca/xdot.py .

Installation

LM3 is implemented in java. To run it, java runtime evironment is required (java.com). The compiled java classes and source code can be downloaded from this sourceforge page.

Using Lep-MAP3 non-Linux systems

Lep-MAP3 is developed in the Linux environment. The examples below are not neccessary working in other operation systems. Here are some hints from Lep-MAP users on how to use in other systems.

macOS

At least the zcat command is not working in the same way in mac as in Linux. However, replacing zcat by
gunzip -c
should do the trick. For example:

gunzip -c post.gz|java ParentCall2 data=pedigree_t.txt posteriorFile=- removeNonInformative=1|gzip >data.call.gz

Windows

The use of compressed files is difficult in Windows. Using uncompressed files should work normally, e.g.

java -cp bin ParentCall2 data=pedigree_t.txt vcfFile=file.vcf removeNonInformative=1 >data.call
java -cp bin Filtering2 data=data.call dataTolerance=0.001 >data.filt
...

Modules


Figure 1. Default Lep-MAP3 pipeline

ParentCall2

The module ParentCall2 is used to call (possible missing or erroneous) parental genotypes and markers on sex chromosomes (XY or ZW system, assumes known offspring sex). Grandparents and half-sib familes are also supported (especially the grandparents should be provided if there is data on them).

The input of ParentCall2 consists of genotype likelihoods (posteriors) for each 10 possible SNP genotypes AA, AC, AG, AT, CC, CG, CT, GG, GT and TT. Other kind of variants (like indels) can be given as input by specifying them as SNPs (e.g. AA=homozygote indel, AT=heterozygote indel, TT=no indel). Typically these likelihoods can be obtained from sequencing data or from SNP assays. The output is also in the same likelihood format.

The first 6 lines presents the pedigree. First line is the family name, second individual name, third and fourth are the father and mother. Line 5 containts the sex of each individual (1 male, 2 female, 0 unknown) and the last line is the phenotype (can be 0 for all individuals, this is not currently used). The likelihoods can be provided from line 7 forward (columns must match) or on a separate file given as parameter posteriorFile or vcfFile. Finally, columns 1-2 give marker names (scaffold and pos) for genotypes, and can be any value for pedigree part. Thus, make sure that each line has n+2 tab separated columns if there are n individuals and column i + 2 gives the genotype and pedigree information on individual i.

Example pedigree (in correct transpose, should be tab separated) is below:

CHR POS F   F   F   F   F   F
CHR POS female  male    progeny_1   progeny_2   progeny_3   progeny_4
CHR POS 0   0   male    male    male    male
CHR POS 0   0   female  female  female  female
CHR POS 2   1   0   0   0   0
CHR POS 0   0   0   0   0   0

If you have a pedigree file (pedigree.txt) in transpose (6 columns, one row for each individual), it can be converted to proper LM3 format by following command:

./transpose_tab pedigree.txt|awk '{print "CHR\tPOS\t" $0}' >ped_t.txt

Note that Lep-MAP2 is provided with scripts to convert linkage files to posteriors. These scripts make it easy to use ParentCall2 also for data only in genotype format. However, for best maps, genotype likelihoods (without called genotypes) should be obtained especially on sequencing data.

Examples:

java ParentCall2 data=post_with_pedigree_t.txt removeNonInformative=1 >data.call
zcat post.gz|java ParentCall2 data=pedigree_t.txt posteriorFile=- removeNonInformative=1|gzip >data.call.gz

First few lines and columns of actual data can be seen in example_data.txt.

zcat data.vcf.gz|java ParentCall2 data=pedigree_t.txt vcfFile=- removeNonInformative=1|gzip >data.call.gz

Calling markers on autosomes and on X chromosome (X-)

java ParentCall2 data=... XLimit=2 ...

Calling markers on autosomes and on Z chromosome (Z-)

java ParentCall2 data=... ZLimit=2 ...

The markers called on the sex chromosome have a star (*) after the position field, e.g. "contig1 2345" becomes "contig1 2345*".

Converting data from other formats to Lep-MAP3

JoinMap loc files:

awk -f locsingle.awk file.loc|awk -f loc2genotypes.awk|awk -f genotypes2post.awk |java -cp Lep-MAP3/bin ... data=- ...

Genotype assay data (in tab separated affx_genotypes.txt):

SNP   IND1 IND2 IND3 IND4   ...
AX-01 AA   AB   BB   NoCall ...
AX-02 AB   AA   AA   BB     ...
...
awk -f affx2post.awk affx_genotypes.txt|java -cp Lep-MAP3/bin ParentCall2 data=pedigree.txt posteriorFile=- ...

Lep-MAP2 files (linkage):

java -cp Lep-MAP2/bin Transpose data.linkage|awk '{print "CHR\tPOS\t"$0}' |awk -f genotypes2post.awk >data.post

ParentCallPloidy

The polyploid version of ParentCall2. It has a limited number of options, and supports only single family data. Script convert2diploid.awk converts simplex markers from ParentCallPloidy output to diploid data to be used in Lep-MAP3. Parameter ploidy must be set in Pileup2Likelihoods, ParentCallPloidy and convert2diploid.awk.

Filtering2

The Filtering2 module handles filtering of the data, i.e. filtering markers based on, e.g. high segregation distortion (dataTolerance) and excess number of missing genotypes (missingLimit). This module outputs the filtered data in the same format to be used with other modules and for further analysis (e.g. QTL mapping).

Note that Filtering2 is best suited for multi-family data especially with default dataTolerance(=0.01). For single family data, distortionLod=1 in SeparateChromosomes2 and JoinSingles2All can provide a better solution to deal with distorted markers. This is because Filtering can cause long gaps on single family crosses. If filtering is used on such data, ofter a smaller dataTolerance is more suitable (like 0.001 or 0.0001).

Example :

java Filtering2 data=data1.call dataTolerance=0.001 >data_f.call
zcat data.call.gz|java Filtering2 data=- dataTolerance=0.001|gzip >data_f.call.gz

SeparateChromosomes2

The SeparateChromosomes2 module assigns markers into linkage groups (LGs) by computing all pair-wise LOD scores between markers and joins markers that with LOD score higher than a user given parameter lodLimit. The output is a map file where line i (not commented by #) corresponds to marker i and gives its LG name (number).

Example :

java SeparateChromosomes2 data=data_f.call lodLimit=5 >map5.txt
java SeparateChromosomes2 data=data_f.call lodLimit=5 theta=0.05 >map5.5.txt
java SeparateChromosomes2 data=data.call lodLimit=5 distortionLod=1 >map5.nofilt.txt

(parameter numThreads utilises multiple cores)

The size distribution of linkage groups can be obtained like this:

sort map5.txt|uniq -c|sort -n

If you get the snp names to a file snps.txt by

awk '(NR>=7)' data_f.call|cut -f 1,2 >snps.txt

You can see the genomic coordinates of markers in each linkage group by

paste snps.txt map5.txt|awk '($3>0)'

JoinSingles2All

The JoinSingles2All module assigns singular markers to existing LGs by computing LOD scores between each single marker and markers from the existing LGs. It needs a map file as input parameter and then it outputs a new map file with possible more markers assigned to LGs.

Example :

java -cp bin/ JoinSingles2All map=map5.txt data=data_f.call lodLimit=4 >map5_js.txt
zcat data_f.call.gz|java -cp bin/ JoinSingles2All map=map5.txt data=- lodLimit=3 lodDifference=2 >map5_js.txt

(parameter numThreads utilises muliple cores)

(iterated joinSingles2All yields same result as iterating it until no markers can be added)

java -cp bin/ JoinSingles2All map=map5.txt data=data_f.call lodLimit=4 iterate=1 >map5_js_iterated.txt

The size distribution of linkage groups can be obtained like this:

cut -f 1 map5_js.txt|sort|uniq -c|sort -n

OrderMarkers2

OrderMarkers2 orders the markers within each LG by maximizing the likelihood of the data given the order. Either a map file (map=map_file) or output of previous run of OrderMarkers2 (evaluateOrder=order_file) have to be provided (only the first column (the marker name or lg name) is read from these files).

The output consists of marker name (numbered as 1..N) with paternal and maternal map position on columns 1-3. To provide identical output as in Lep-MAP2, there is also 0 as an error estimate on column 4.
If the option outputPhasedData=1 (or 2) is given, columns 5- give the phased data on each marker and family as well as the number of maternal and paternal recombinations. Each print (string of 0 and 1) will first give the paternal and then maternal segregation pattern.

Important parameters for OrderMarkers2 (defaults should be ok, however, sometimes better results can be obtained by tinkering these):
recombination1 (and 2): recombination rate for male (female)
interference1 (and 2): recombination interference for male (female)
scale: to cope with more markers than individuals
minError: genotype likelihoods are capped to this

faster marker ordering:
identicalLimit: adjacent markers are collapsed if their likelihoods are within this limit
numThreads: maximum number of threads to use

Examples :

java -cp bin/ OrderMarkers2 map=map5_js.txt data=data_f.call chromosome=1 >order1.txt
java -cp bin/ OrderMarkers2 evaluateOrder=order1.txt data=data_f.call >order1.2.txt

Compute map distances from a given order:

java -cp bin/ OrderMarkers2 evaluateOrder=order2.txt data=data_f.call improveOrder=0 >order1.3.txt

Compute sex averaged map distances from a given order

java -cp bin/ OrderMarkers2 evaluateOrder=order1.txt data=data_f.call improveOrder=0 sexAveraged=1 >order1.SA.txt

achiasmatic meiosis (no recombination in male):

java -cp bin/ OrderMarkers2 map=map.txt data=data_f.call recombination1=0

achiasmatic meiosis (no recombination in female):

java -cp bin/ OrderMarkers2 map=map.txt data=data_f.call recombination2=0

It is typically more convinient to order each chromosome separately

java -cp bin/ OrderMarkers2 map=mapBig.txt data=dataBig.call chromosome=1 >order1.1.txt
...
java -cp bin/ OrderMarkers2 map=mapBig.txt data=dataBig.call chromosome=N >orderN.1.txt

Hints for OrderMarkers2:

Converting marker numbers (1..N) back to genomic coordinates (order.txt is the output from OrderMarkers2 using data data.call.gz coming from ParentCall2):

zcat data.call.gz|cut -f 1,2|awk '(NR>=7)' >snps.txt
#note that first line of snps.txt contains "CHR POS"
awk -vFS="\t" -vOFS="\t" '(NR==FNR){s[NR-1]=$0}(NR!=FNR){if ($1 in s) $1=s[$1];print}' snps.txt order.txt >order.mapped
#because of first line of snps.txt, we use NR-1 instead of NR

To be sure on the marker order, it is wise to repeat marker ordering a few times. Order with best likelihood is typically the most correct.

The errors in marker orders can be detected by making Lep-MAP graph and visualising it. See the top of this page for more information. (It is difficult to produce and comprehend LOD score plots for very large datasets)

LM3 ouputs the number of recombinations for each offspring (to the error stream, use 2>err.txt to store these). Problematic individuals can be spotted as those recombining too many times.

It can be wise to remove markers at the map ends that cause long gaps.

BE SURE that your individuals match the pedigree. For example, use Lep-MAP3 IBD module
(with genotype likelihood data before ParentCall), to verify that your individuals are full-sibs.

LMPlot

LMPlot outputs a Lep-MAP graph in Graphviz DOT language format from a given order. It can be used to verify the linkage mapping results.

The output can be visualized with dot or with xdot.py from https://github.com/jrfonseca/xdot.py .

Example:

for every chromosome :

java -cp bin/ OrderMarkers2 map=map.txt data=data_f.call outputPhasedData=1 chromosome=1 >order_wp_1.txt
java -cp bin/ LMPlot order_wp_1.txt >order_wp_1.dot
xdot.py order_wp_1.dot

or:

java -cp bin/ OrderMarkers2 evaluateOrder=order1.txt data=data_f.call outputPhasedData=1 >order_wp_1.txt
java -cp bin/ LMPlot order_wp_1.txt >order_wp_1.dot
xdot.py order_wp_1.dot

The nodes of the graph are numbered in the order they first occur in the map (order_wp1_txt). The edge labels give the index for individual haplotype that recombines (changes). If the order is (about) correct, the node number should be in order 1,2, ..., N when following the nodes from one end of the chain to the other. Also the size of the nodes gives information how common each marker is, uncommon markers at the ends could be erroneous. The erroneous edges are highlited with red color.

IBD

Examples:

zcat post_from_pipeline.gz|java IBD posteriorFile=- >ibd.txt

Then Listing ibd values in descending order

sort -n -r -k 3,3  ibd.txt|less

or relatedness to a potential parent

sort -n -r -k 3,3  ibd.txt|grep -w possible_parent|less

Faster runtime can be obtained by taking a subset of markers (10% here) and using more threads (numThreads)

zcat post_from_pipeline.gz|awk '(NR==1||rand()<0.1) print}'|java IBD posteriorFile=- numThreads=8 >ibd_faster.txt

You can also input vcf file

zcat file.vcf.gz|java IBD vcfFile=- numThreads=8 >ibd_from_vcf.txt

Calculating Mendel error rates:

zcat post_from_pipeline.gz|java IBD posteriorFile=-  parents=file:parents.txt >parenthood.txt

Listing the combinations with ascending error-rate

sort -n -k 4,4 parenthood.txt|grep -v NA|less

WGS data

There are WGS versions of modules separating markers into linkage groups (SeparateIdenticals, JoinSingles2Identicals, ...). However, the current version of SeparateChromosomes2 is fast enough to run within a few days even on several millions of markers on a computer cluster with enough cores (say 24 cores and numThreads=24). Using SeparateChromosomes2 is the preferred way as it is much simpler to run. Further note that you can refine linkage group assignments and mask markers by providing map file to SeparateChromosomes2 (map=file).

Phasing and QTL mapping

It is easiest to use data phased by LM3 for QTL mapping. This can be obtained with OrderMarkers2 using parameters grandparentPhase=1, outputPhasedData=1 (or 2) to output the grandparental phased data.
With outputPhasedData=1, there won't be any missing data making QTL analysis straighforward.

The phased data can be converted to fully informative "genotype" data by map2gentypes.awk script. If you provide parameter fullData=1 to this script, also the pedigree information and parents are given out. However, as the order file does not contain individual names, the individuals will be re-named with running numbers. The offspring are in the same order as in the input data (within each family, families are in the order of first occurring individual) and parents are given as the first individuals of every family. These new parental genotypes are always "1 2". Moreover, this data is phased so that the first digit of the genotypes is inherited from father and the second from mother. If the data is in grandparental phase, this also applies to the parents.

The grandparentPhase=1 uses only those markers that can be phased using the grandparents. However, to obtain maximum number of markers, you can construct the maps without grandParentPhase=1, then evaluate the order and get phase using grandparentPhase=1, and finally match the phases of the two data on the common markers. The script phasematch.awk can be used for matching the phases.

QTL pipeline

A basic QTL pipeline has been now added to the LM3 git. This include scripts qtl.R, qtlPerm.R, and example data qtlphenotypes.txt and qtldata1.txt. The phenotypes are listed in the same order as in the data, there are phenotypes for the parents as well (but these are not used).

This 32 family example data was generated as following (order1.txt is from OrderMarkers2, either denovo or in the physical order):

awk -vfullData=1 -f map2genotypes.awk order1.txt >qtldata1.12

The LOD plot is generated by running qtl.R, significance by permutation test can be calulated by qtlPerm.R. To use these for your own data, you have to change them a bit.

Sequencing data processing pipeline

The sequencing data processing pipeline for Lep-MAP3 is based on "SAMtools mpileup" and Pileup2Likelihoods module (old pipeline uses custom scripts
pileupParser2.awk and pileup2posterior.awk, provided in scripts.zip. Replace "java -cp bin Pileup2Likelihoods" with awk -f pileupParser2.awk|awk -f pileup2posterior.awk" to use these). The usage is

samtools mpileup -q 10 -Q 10 -s $(cat sorted_bams)|java -cp bin/ Pileup2Likelihoods|gzip >post.gz

This command requires two files, sorted_bams and mapping.txt, both containing exactly one line listing the file names for sorted bams and individual names, respectively and in the same order. If the data of each individual is in its own bam, then the files can be same (but it is more clear to remove the bam suffix from the individual names). Please note that this pipeline does not work with the old version of samtools (0.X).

For example (3 individuals in 4 bams):
sorted_bams: 1.bam 1a.bam 2.bam 3.bam
mapping.txt: 1 1 2 3

The post.gz can be used as input for ParentCall2, e.g.

zcat post.gz|java ParentCall2 data=pedigree.txt posteriorFile=- removeNonInformative=1 ... |gzip >post.call.gz

This SNP calling pipeline can be run parallel for each contig. Below is a sketch of this :
1. You need index files for each of you bams (e.g. samtools index fille1.bam; samtools index file2.bam; ...)
2. Construct file contigs.txt containing all your contig names (e.g. grep ">" ref.fasta|awk '{print substr($1,2)}' >contigs.txt)
3. Then create commands (SNP_calling.txt) to create individual post files for each contig

for i in $(cat contigs.txt)
do 
echo "samtools mpileup -r \"$i\" -q 10 -Q 10 -s \$(cat sorted_bams)|java -cp bin Pileup2Likelihoods|gzip >\"$i\".post.gz"
done >SNP_calling.txt
  1. Run commands in parellel (16 cores)
parallel --jobs 16 < SNP_calling.txt
  1. And finally (after parallel finishes) join the individual files (assuming there are no other files with post.gz suffix)
zcat *.post.gz | awk '(NR==1 || ($1!="CHR"))'|gzip >all_post.gz

Selfing crosses

As Lep-MAP3 assumes two parents for each family, a selfing crosses cannot be directly analysed with it. However, it is possible to add two dummy parents (one male and another female) to the pedigree. Data for the single parent is not really needed, but the grandparents (say two individuals from different lines crossed to form the parent) can be used. With data on these grandparents use grandparentPhase=1 in OrderMarkers2, without grandparental data use selfingPhase=1. The phasing results without these flags can be problematic.

Please try increasing the scale parameter, especially for higher order (S2, S3,...) selfing (and maybe other) crosses. Value "2M/N" might be a safe choice to try instead of the default ("M/N"). You can provide the value literally as

java OrderMarkers2 ... scale=2M/N

, Lep-MAP3 will replace M (number of individuals) and N (number of markers) with the proper values based on the input data. If the marker density is low, scale parameter does not work. Then increasing recombination and interference parameters might work similarly (e.g. to 0.03 to achieve same effect, 0.03*0.03 ~ 0.001).

Also note the new Lep-MAP3 update. There is now "heterozygoteRate=NUM" parameter for Filtering2 to take into account the higher rate of homozygotes in selfing crosses (S2=0.25, S3=0.125,...).

Linkage mapping without parental data

Lep-MAP3 can be used to make maps without parents for F1 single family data as follows:

First step is to use ignoreParentOrder=1 in ParentCall2 (you have to add dummy parents to the pedigree in order to have a full-sib family). This will call the parental genotypes by arbitrary assigning which parent is informative.

You can run Filtering2, SeparateChromosome2 and JoinSingles2All directly on this data (assuming most markers are bi-allelic). The markers where both parents are informative should put markers into linkage groups corresponding to chromosomes. So you should be able to put markers into right number of linkage groups, for example as follows:

zcat post.gz|java -cp bin ParentCall2 data=ped.txt posteriorFile=- removeNonInformative=1 ignoreParentOrder=1|gzip >data.call.gz
zcat data.call.gz|java -cp bin SeparateChromosomes2 data=- lodLimit=10 distortionLod=1 sizeLimit=X >map10.txt

However, running OrderMarkers2 on this data yields a weird result as there are mixed markers where only mother or father are informative.

To proceed, you have to first swap the parental genotypes (manually) so that first parent is always informative:

#data.call.gz is from ParentCall2 with ignoreParentOrder=1
zcat data.call.gz|awk -f allPaternal.awk|gzip >data_ap.call.gz

Now running SeparateChromosomes2 with informativeMask=1 on this data should split each previous linkage group (without informativeMask) into 2 groups. If you had 22 groups in map10.txt, the new map should have 44 groups.

zcat data_ap.call.gz|java -cp bin SeparateChromosomes2 data=- lodLimit=10 distortionLod=1 informativeMask=1 sizeLimit=X >map10_ap.txt
#to validate 
paste map10.txt map10_ap.txt|awk '($1>0&&$2>0)'|sort|uniq -c|sort -n

Then you have to swap the parental genotypes on one of these groups for each original linkage group...

paste map10.txt map10_ap.txt|awk '($1>0&&$2>0)'|sort|uniq -c|sort -n -r|awk '{if (!($2 in d)) d[$2]=$3}END{for (i in d) print d[i]}' >flipped.txt

awk '(NR==FNR){f[$1]=1}(NR!=FNR){if (FNR==1) print; else print (f[$1]+0)}' flipped.txt map10_ap.txt >map10.flip

zcat data_ap.call.gz|awk -f somePaternal.awk map10.flip -|gzip >data.final.gz

And finally, you can run OrderMarkers2 on this data.final.gz (with swapped parental genotypes) and use map10.txt as the linkage groups. You can get correct male and female maps as well using this technique, but you don't know which is male and which is female (unless the recombination patterns are clearly separable).

Alternatively, you can use refineParentOrder=1 in OrderMarkers2 and minLod=0 in SeparateChromosomes and JoinSingles2All. This should work easier for multi-family data as well.

The wiki uses Markdown syntax.

Project Members: