Recent changes to LM3 Home

LM3 Home modified by Pasi Rastas

Pasi Rastas — Mon, 05 Aug 2024 08:25:56 -0000

--- v74
+++ v75
@@ -440,7 +440,7 @@
 zcat data.call.gz|java -cp bin SeparateChromosomes2 data=- lodLimit=10 distortionLod=1 sizeLimit=X >map10.txt
 ~~~

-However, running OrderMarkers2 on this data yield a weird result as there are mixed markers where only mother or father are informative.
+However, running OrderMarkers2 on this data yields a weird result, because markers where only the mother is informative are mixed with markers where only the father is informative.

 To proceed, you have to first swap the parental genotypes (manually) so that first parent is always informative:
 ~~~
@@ -467,7 +467,7 @@

 And finally, you can run OrderMarkers2 on this data.final.gz (with swapped parental genotypes) and use map10.txt as the linkage groups. You can get correct male and female maps as well using this technique, but you don't know which is male and which is female (unless the recombination patterns are clearly separable).

-Alternatively, you can use refineParentOrder=1 in OrderMarkers2 and minLod=0 in SeparateChromosomes and JoinSingles2All. This should work easier for on multi-family data as well.
+Alternatively, you can use refineParentOrder=1 in OrderMarkers2 and minLod=0 in SeparateChromosomes2 and JoinSingles2All. This should be easier and should work on multi-family data as well.

 The wiki uses [Markdown](/p/lep-map3/wiki/markdown_syntax/) syntax.

LM3 Home modified by Pasi Rastas

Pasi Rastas — Thu, 01 Aug 2024 12:53:33 -0000

--- v73
+++ v74
@@ -440,7 +440,7 @@
 zcat data.call.gz|java -cp bin SeparateChromosomes2 data=- lodLimit=10 distortionLod=1 sizeLimit=X >map10.txt
 ~~~

-However, you cannot run OrderMarkers2 as there are mixed markers where only mother or father are informative.
+However, running OrderMarkers2 on this data yield a weird result as there are mixed markers where only mother or father are informative.

 To proceed, you have to first swap the parental genotypes (manually) so that first parent is always informative:
 ~~~
@@ -456,7 +456,7 @@
 paste map10.txt map10_ap.txt|awk '($1>0&&$2>0)'|sort|uniq -c|sort -n
 ~~~

-Then you have to swap the parental genotypes on one of these groups for each linkage group...
+Then you have to swap the parental genotypes on one of these groups for each original linkage group...
 ~~~
 paste map10.txt map10_ap.txt|awk '($1>0&&$2>0)'|sort|uniq -c|sort -n -r|awk '{if (!($2 in d)) d[$2]=$3}END{for (i in d) print d[i]}' >flipped.txt

LM3 Home modified by Pasi Rastas

Pasi Rastas — Tue, 30 Jul 2024 09:07:25 -0000

--- v72
+++ v73
@@ -431,7 +431,7 @@

 Lep-MAP3 can be used to make maps without parents for F1 single family data as follows:

-First step is to use ignoreParentOrder=1 in ParentCall2 (you have to add dummy parents to the pedigree). This will call the parental genotypes arbitrary assigning which parent is informative.
+The first step is to use ignoreParentOrder=1 in ParentCall2 (you have to add dummy parents to the pedigree in order to have a full-sib family). This will call the parental genotypes by arbitrarily assigning which parent is informative.

 You can run Filtering2, SeparateChromosomes2 and JoinSingles2All directly on this data (assuming most markers are bi-allelic). The markers where both parents are informative should put markers into linkage groups corresponding to chromosomes. So you should be able to put markers into the right number of linkage groups, for example as follows:

@@ -448,10 +448,10 @@
 zcat data.call.gz|awk -f allPaternal.awk|gzip >data_ap.call.gz
 ~~~

-Now running SeparateChromosomes2 with informativeMask=1 on this data should split each previous linkage group (without informativeMask) into 2 groups. 
-
-~~~
-zcat data_ap.call.gz|java -cp bin SeparateChromosomes2 data=- lodLimit=10 distortionLod=1 informativeMask=1 >map10_ap.txt
+Now running SeparateChromosomes2 with informativeMask=1 on this data should split each previous linkage group (without informativeMask) into 2 groups. If you had 22 groups in map10.txt, the new map should have 44 groups.
+
+~~~
+zcat data_ap.call.gz|java -cp bin SeparateChromosomes2 data=- lodLimit=10 distortionLod=1 informativeMask=1 sizeLimit=X >map10_ap.txt
 #to validate 
 paste map10.txt map10_ap.txt|awk '($1>0&&$2>0)'|sort|uniq -c|sort -n
 ~~~
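The validation step can be tried on toy data first. The sketch below is only an illustration under assumptions: the two files are made-up stand-ins for map10.txt and map10_ap.txt (a header line, then one linkage group number per marker, 0 for unassigned), and the `per_lg` summary is an extra convenience that is not part of Lep-MAP3. If the split worked, each original group should be paired with two groups of the new map.

```shell
export LC_ALL=C   # byte-order sorting and comparisons, for reproducible output
tmp=$(mktemp -d)
# Toy stand-ins for map10.txt and map10_ap.txt (values are made up):
# a header line, then one linkage group number per marker, 0 = unassigned.
printf '%s\n' '#toy' 1 1 1 1 2 2 2 0 >"$tmp/map10.txt"
printf '%s\n' '#toy' 1 1 3 3 2 2 4 4 >"$tmp/map10_ap.txt"

# Same validation pipeline: markers per (original LG, split LG) pair.
pairs=$(paste "$tmp/map10.txt" "$tmp/map10_ap.txt" | awk '($1>0&&$2>0)' | sort | uniq -c | sort -n)
printf '%s\n' "$pairs"

# Extra summary (assumption, not part of Lep-MAP3): number of distinct
# split groups seen for each original LG; the expectation is 2 for each.
per_lg=$(printf '%s\n' "$pairs" | awk '{n[$2]++} END {for (g in n) print g, n[g]}' | sort -n)
printf '%s\n' "$per_lg"
rm -r "$tmp"
```

An original group that is paired with one split group, or with more than two, probably needs a closer look before continuing.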
@@ -465,7 +465,7 @@
 zcat data_ap.call.gz|awk -f somePaternal.awk map10.flip -|gzip >data.final.gz
 ~~~

-And finally, you can run OrderMarkers2 on this data.final.gz (with swapped parental genotypes).
+And finally, you can run OrderMarkers2 on this data.final.gz (with swapped parental genotypes) and use map10.txt as the linkage groups. You can get correct male and female maps as well using this technique, but you don't know which is male and which is female (unless the recombination patterns are clearly separable).

 Alternatively, you can use refineParentOrder=1 in OrderMarkers2 and minLod=0 in SeparateChromosomes and JoinSingles2All. This should work easier for on multi-family data as well.

LM3 Home modified by Pasi Rastas

Pasi Rastas — Tue, 30 Jul 2024 08:58:27 -0000

--- v71
+++ v72
@@ -437,7 +437,7 @@

 ~~~
 zcat post.gz|java -cp bin ParentCall2 data=ped.txt posteriorFile=- removeNonInformative=1 ignoreParentOrder=1|gzip >data.call.gz
-zcat data.call.gz|java -cp bin SeparateChromosomes2 data=- lodLimit=10 distortionLod=1 >map10.txt
+zcat data.call.gz|java -cp bin SeparateChromosomes2 data=- lodLimit=10 distortionLod=1 sizeLimit=X >map10.txt
 ~~~

 However, you cannot run OrderMarkers2 as there are mixed markers where only mother or father are informative.
@@ -456,11 +456,18 @@
 paste map10.txt map10_ap.txt|awk '($1>0&&$2>0)'|sort|uniq -c|sort -n
 ~~~

-Then you have to swap the parental genotypes on one of these groups for each linkage group... TO BE CONTINUED
-
-And finally, you can run OrderMarkers2 on this data (with swapped parental genotypes).
-
-Alternatively, you can use refineParentOrder=1 in OrderMarkers2 and minLod=0 in SeparateChromosomes and JoinSingles2All. This should work for on multi-family data as well.
+Then you have to swap the parental genotypes on one of these groups for each linkage group...
+~~~
+paste map10.txt map10_ap.txt|awk '($1>0&&$2>0)'|sort|uniq -c|sort -n -r|awk '{if (!($2 in d)) d[$2]=$3}END{for (i in d) print d[i]}' >flipped.txt
+
+awk '(NR==FNR){f[$1]=1}(NR!=FNR){if (FNR==1) print; else print (f[$1]+0)}' flipped.txt map10_ap.txt >map10.flip
+
+zcat data_ap.call.gz|awk -f somePaternal.awk map10.flip -|gzip >data.final.gz
+~~~
+
+And finally, you can run OrderMarkers2 on this data.final.gz (with swapped parental genotypes).
+
+Alternatively, you can use refineParentOrder=1 in OrderMarkers2 and minLod=0 in SeparateChromosomes and JoinSingles2All. This should work easier for on multi-family data as well.
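The flipped.txt and map10.flip commands of the manual procedure can be traced on toy data. This is a sketch under assumptions: the two map files are made-up stand-ins (a header line, then one linkage group number per marker), and somePaternal.awk is not run here. It only shows what the two awk steps produce: one split group per original linkage group (the one sharing the most markers), and a per-marker 0/1 flag.

```shell
export LC_ALL=C   # byte-order sorting and comparisons, for reproducible output
tmp=$(mktemp -d)
# Toy stand-ins for map10.txt and map10_ap.txt (values are made up).
printf '%s\n' '#toy' 1 1 1 1 1 2 2 2 >"$tmp/map10.txt"
printf '%s\n' '#toy' 1 1 1 3 3 2 2 4 >"$tmp/map10_ap.txt"

# For each original LG, keep the split group that shares the most markers with it.
paste "$tmp/map10.txt" "$tmp/map10_ap.txt" | awk '($1>0&&$2>0)' | sort | uniq -c | sort -n -r |
  awk '{if (!($2 in d)) d[$2]=$3}END{for (i in d) print d[i]}' >"$tmp/flipped.txt"

# Copy the header line, then print 1 for markers whose split group is
# listed in flipped.txt and 0 for all other markers.
awk '(NR==FNR){f[$1]=1}(NR!=FNR){if (FNR==1) print; else print (f[$1]+0)}' \
  "$tmp/flipped.txt" "$tmp/map10_ap.txt" >"$tmp/map10.flip"

sort -n "$tmp/flipped.txt"
cat "$tmp/map10.flip"
```

In this toy case the selected groups are 1 and 2, so only the markers in split groups 3 and 4 get the flag 0.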

 The wiki uses [Markdown](/p/lep-map3/wiki/markdown_syntax/) syntax.

LM3 Home modified by Pasi Rastas

Pasi Rastas — Mon, 29 Jul 2024 13:08:09 -0000

--- v70
+++ v71
@@ -427,6 +427,40 @@

 Also note the new Lep-MAP3 update. There is now "heterozygoteRate=NUM" parameter for Filtering2 to take into account the higher rate of homozygotes in selfing crosses (S2=0.25, S3=0.125,...).

+##Linkage mapping without parental data
+
+Lep-MAP3 can be used to make maps without parents for F1 single family data as follows:
+
+First step is to use ignoreParentOrder=1 in ParentCall2 (you have to add dummy parents to the pedigree). This will call the parental genotypes arbitrary assigning which parent is informative.
+
+You can run Filtering2, SeparateChromosomes2 and JoinSingles2All directly on this data (assuming most markers are bi-allelic). The markers where both parents are informative should put markers into linkage groups corresponding to chromosomes. So you should be able to put markers into the right number of linkage groups, for example as follows:
+
+~~~
+zcat post.gz|java -cp bin ParentCall2 data=ped.txt posteriorFile=- removeNonInformative=1 ignoreParentOrder=1|gzip >data.call.gz
+zcat data.call.gz|java -cp bin SeparateChromosomes2 data=- lodLimit=10 distortionLod=1 >map10.txt
+~~~
+
+However, you cannot run OrderMarkers2 as there are mixed markers where only mother or father are informative.
+
+To proceed, you have to first swap the parental genotypes (manually) so that first parent is always informative:
+~~~
+#data.call.gz is from ParentCall2 with ignoreParentOrder=1
+zcat data.call.gz|awk -f allPaternal.awk|gzip >data_ap.call.gz
+~~~
+
+Now running SeparateChromosomes2 with informativeMask=1 on this data should split each previous linkage group (without informativeMask) into 2 groups. 
+
+~~~
+zcat data_ap.call.gz|java -cp bin SeparateChromosomes2 data=- lodLimit=10 distortionLod=1 informativeMask=1 >map10_ap.txt
+#to validate 
+paste map10.txt map10_ap.txt|awk '($1>0&&$2>0)'|sort|uniq -c|sort -n
+~~~
+
+Then you have to swap the parental genotypes on one of these groups for each linkage group... TO BE CONTINUED
+
+And finally, you can run OrderMarkers2 on this data (with swapped parental genotypes).
+
+Alternatively, you can use refineParentOrder=1 in OrderMarkers2 and minLod=0 in SeparateChromosomes and JoinSingles2All. This should work for on multi-family data as well.

 The wiki uses [Markdown](/p/lep-map3/wiki/markdown_syntax/) syntax.

LM3 Home modified by Pasi Rastas

Pasi Rastas — Wed, 13 Dec 2023 13:00:09 -0000

--- v69
+++ v70
@@ -134,6 +134,10 @@
 java -cp Lep-MAP2/bin Transpose data.linkage|awk '{print "CHR\tPOS\t"$0}' |awk -f genotypes2post.awk >data.post
 ~~~

+### ParentCallPloidy
+
+The polyploid version of ParentCall2. It has a limited number of options and supports only single family data. The script convert2diploid.awk converts simplex markers from ParentCallPloidy output to diploid data to be used in Lep-MAP3. The parameter ploidy must be set in Pileup2Likelihoods, ParentCallPloidy and convert2diploid.awk.
+

 ### Filtering2

LM3 Home modified by Pasi Rastas

Pasi Rastas — Wed, 22 Mar 2023 08:49:46 -0000

--- v68
+++ v69
@@ -355,7 +355,7 @@
 It is easiest to use data phased by LM3 for QTL mapping. This can be obtained with OrderMarkers2 using parameters grandparentPhase=1, outputPhasedData=1 (or 2) to output the grandparental phased data. 
 With outputPhasedData=1, there won't be any missing data making QTL analysis straighforward.

-The phased data can be converted to fully informative "genotype" data by map2gentypes.awk script. If you provide parameter fullData=1 to this script, also the pedigree information and parents are given out. However, as the order file does not contain individual names, the individuals will be re-named with running numbers. The offspring are in the same order as in the input data (withing each family, families are in the order of first occurring individual) and parents are given as the first individuals of every family. These new parental genotypes are always "1 2". Moreover, this data is phased so that the first digit of the genotypes is inherited from father and the second from mother. If the data is in grandparental phase, this also applies to the parents. 
+The phased data can be converted to fully informative "genotype" data by map2gentypes.awk script. If you provide parameter fullData=1 to this script, also the pedigree information and parents are given out. However, as the order file does not contain individual names, the individuals will be re-named with running numbers. The offspring are in the same order as in the input data (within each family, families are in the order of first occurring individual) and parents are given as the first individuals of every family. These new parental genotypes are always "1 2". Moreover, this data is phased so that the first digit of the genotypes is inherited from father and the second from mother. If the data is in grandparental phase, this also applies to the parents. 

 The grandparentPhase=1 uses only those markers that can be phased using the grandparents. However, to obtain maximum number of markers, you can construct the maps without grandParentPhase=1, then evaluate the order and get phase using grandparentPhase=1, and finally match the phases of the two data on the common markers. The script phasematch.awk can be used for matching the phases.

LM3 Home modified by Pasi Rastas

Pasi Rastas — Wed, 22 Mar 2023 08:48:16 -0000

--- v67
+++ v68
@@ -23,15 +23,15 @@

 The most commonly used modules are ParentCall2, Filtering2, SeparateChromosomes2, JoinSingles2All and OrderMarkers2. These are analogous to the modules (ParentCall, Filtering, SeparateChromosomes, JoinSingles and OrderMarkers) of Lep-MAP2.

-In order to map millions of markers (e.g. whole genome sequencing), modules, JoinSingles2All and SeparateChromosomes2 can be run faster by utilising multiple cores (numThreads parameter). Moreover, samplePairs in SeparateChromosomes2 makes it much faster without much difference in the output. A typical value for samplePairs could be 0.1 to obtain 10x speedup.
+In order to map millions of markers (e.g. whole genome sequencing), modules SeparateIndenticals, JoinSingles2Indeticals and JoinLGs can be used. However, these are still in very preliminary form. Moreover, as JoinSingles2All and SeparateChromosomes2 now utilise multiple cores, these could probably be used as well on larger data.

 In order to verify the linkage mapping results, module LMPlot can be used. It will output the Lep-MAP graph in [Graphviz](http://www.graphviz.org/) DOT language format. The output can be visualized with dot or with xdot.py from https://github.com/jrfonseca/xdot.py . 
 ## Installation

-LM3 is implemented in java. To run it, java runtime environment is required ([java.com](http://java.com)). The compiled java classes and source code can be downloaded from this sourceforge page.  
+LM3 is implemented in java. To run it, java runtime evironment is required ([java.com](http://java.com)). The compiled java classes and source code can be downloaded from this sourceforge page.  

 ## Using Lep-MAP3 non-Linux systems
-Lep-MAP3 is developed in the Linux environment. The examples below are not necessary working in other operation systems. Here are some hints from Lep-MAP users on how to use in other systems.
+Lep-MAP3 is developed in the Linux environment. The examples below are not neccessary working in other operation systems. Here are some hints from Lep-MAP users on how to use in other systems.

 ### macOS
 At least the zcat command is not working in the same way in mac as in Linux. However, replacing zcat by 
@@ -62,7 +62,7 @@

 The input of ParentCall2 consists of genotype likelihoods (posteriors) for each 10 possible SNP genotypes AA, AC, AG, AT, CC, CG, CT, GG, GT and TT. Other kind of variants (like indels) can be given as input by specifying them as SNPs (e.g. AA=homozygote indel, AT=heterozygote indel, TT=no indel). Typically these likelihoods can be obtained from sequencing data or from SNP assays. The output is also in the same likelihood format. 

-The first 6 lines presents the pedigree. First line is the family name, second individual name, third and fourth are the father and mother. Line 5 contains the sex of each individual (1 male, 2 female, 0 unknown) and the last line is the phenotype (can be 0 for all individuals, this is not currently used). The likelihoods can be provided from line 7 forward (columns must match) or on a separate file given as parameter posteriorFile or vcfFile. Finally, columns 1-2 give marker names (scaffold and pos) for genotypes, and can be any value for pedigree part. Thus, make sure that each line has n+2 tab separated columns if there are n individuals and column i + 2 gives the genotype and pedigree information on individual i.
+The first 6 lines presents the pedigree. First line is the family name, second individual name, third and fourth are the father and mother. Line 5 containts the sex of each individual (1 male, 2 female, 0 unknown) and the last line is the phenotype (can be 0 for all individuals, this is not currently used). The likelihoods can be provided from line 7 forward (columns must match) or on a separate file given as parameter posteriorFile or vcfFile. Finally, columns 1-2 give marker names (scaffold and pos) for genotypes, and can be any value for pedigree part. Thus, make sure that each line has n+2 tab separated columns if there are n individuals and column i + 2 gives the genotype and pedigree information on individual i.

 Example pedigree (in correct transpose, should be tab separated) is below:
 ~~~
@@ -139,7 +139,7 @@

 The Filtering2 module handles filtering of the data, i.e. filtering markers based on, e.g. high segregation distortion (dataTolerance) and excess number of missing genotypes (missingLimit). This module outputs the filtered data in the same format to be used with other modules and for further analysis (e.g. QTL mapping). 

-Note that Filtering2 is best suited for multi-family data especially with default dataTolerance(=0.01). (Note: Now the default dataTolerance is 0.001) For single family data, distortionLod=1 in SeparateChromosomes2 and JoinSingles2All can provide a better solution to deal with distorted markers. This is because Filtering can cause long gaps on single family crosses. If filtering is used on such data, often a smaller dataTolerance is more suitable (like 0.001 or 0.0001). 
+Note that Filtering2 is best suited for multi-family data especially with default dataTolerance(=0.01). For single family data, distortionLod=1 in SeparateChromosomes2 and JoinSingles2All can provide a better solution to deal with distorted markers. This is because Filtering can cause long gaps on single family crosses. If filtering is used on such data, ofter a smaller dataTolerance is more suitable (like 0.001 or 0.0001). 

 Example :
 ~~~
@@ -176,7 +176,7 @@
 sort map5.txt|uniq -c|sort -n
 ~~~

-If you get the SNP names to a file snps.txt by
+If you get the snp names to a file snps.txt by
 ~~~
 awk '(NR>=7)' data_f.call|cut -f 1,2 >snps.txt
 ~~~
@@ -199,9 +199,9 @@
 zcat data_f.call.gz|java -cp bin/ JoinSingles2All map=map5.txt data=- lodLimit=3 lodDifference=2 >map5_js.txt
 ~~~

-(parameter numThreads utilises multiple cores)
-
-(iterated joinSingles2All yields almost the same result as iterating it until no markers can be added) 
+(parameter numThreads utilises muliple cores)
+
+(iterated joinSingles2All yields same result as iterating it until no markers can be added) 
 ~~~
 java -cp bin/ JoinSingles2All map=map5.txt data=data_f.call lodLimit=4 iterate=1 >map5_js_iterated.txt
 ~~~
@@ -256,7 +256,7 @@
 java -cp bin/ OrderMarkers2 map=map.txt data=data_f.call recombination2=0
 ~~~

-It is typically more convenient to order each chromosome separately
+It is typically more convinient to order each chromosome separately
 ~~~
 java -cp bin/ OrderMarkers2 map=mapBig.txt data=dataBig.call chromosome=1 >order1.1.txt
 ...
@@ -281,7 +281,7 @@
 It can be wise to remove markers at the map ends that cause long gaps.

 BE SURE that your individuals match the pedigree. For example, use Lep-MAP3 IBD module
-(with genotype likelihood data before ParentCall2), to verify that your individuals are full-sibs.
+(with genotype likelihood data before ParentCall), to verify that your individuals are full-sibs.

 ### LMPlot

@@ -305,7 +305,7 @@
 xdot.py order_wp_1.dot
 ~~~

-The nodes of the graph are numbered in the order they first occur in the map (order_wp1_txt). The edge labels give the index for individual haplotype that recombines (changes). If the order is (about) correct, the node number should be in order 1,2, ..., N when following the nodes from one end of the chain to the other. Also the size of the nodes gives information how common each marker is, uncommon markers at the ends could be erroneous. The erroneous edges are highlighted with red color. 
+The nodes of the graph are numbered in the order they first occur in the map (order_wp1_txt). The edge labels give the index for individual haplotype that recombines (changes). If the order is (about) correct, the node number should be in order 1,2, ..., N when following the nodes from one end of the chain to the other. Also the size of the nodes gives information how common each marker is, uncommon markers at the ends could be erroneous. The erroneous edges are highlited with red color. 

 ### IBD

@@ -314,7 +314,7 @@
 ~~~
 zcat post_from_pipeline.gz|java IBD posteriorFile=- >ibd.txt
 ~~~
-Then listing IBD values in descending order
+Then Listing ibd values in descending order
 ~~~
 sort -n -r -k 3,3  ibd.txt|less
 ~~~
@@ -333,6 +333,7 @@
 zcat file.vcf.gz|java IBD vcfFile=- numThreads=8 >ibd_from_vcf.txt
 ~~~

+
 Calculating Mendel error rates:

 ~~~
@@ -346,14 +347,15 @@

 ## WGS data

-There are WGS versions of modules separating markers into linkage groups (SeparateIdenticals, JoinSingles2Identicals, ...). However, the current version of SeparateChromosomes2 is fast enough to run within a few days even on several millions of markers on a computer cluster with enough cores (say 24 cores and numThreads=24). Using SeparateChromosomes2 is the preferred way as it is much simpler to run. Further note that you can refine linkage group assignments and mask markers by providing map file to SeparateChromosomes2 (map=file).  Also samplePairs makes SeparateChromosomes2 run faster.
+There are WGS versions of modules separating markers into linkage groups (SeparateIdenticals, JoinSingles2Identicals, ...). However, the current version of SeparateChromosomes2 is fast enough to run within a few days even on several millions of markers on a computer cluster with enough cores (say 24 cores and numThreads=24). Using SeparateChromosomes2 is the preferred way as it is much simpler to run. Further note that you can refine linkage group assignments and mask markers by providing map file to SeparateChromosomes2 (map=file).  
+

 ## Phasing and QTL mapping

 It is easiest to use data phased by LM3 for QTL mapping. This can be obtained with OrderMarkers2 using parameters grandparentPhase=1, outputPhasedData=1 (or 2) to output the grandparental phased data. 
-With outputPhasedData=1, there won't be any missing data making QTL analysis straightforward.
-
-The phased data can be converted to fully informative "genotype" data by map2gentypes.awk script. If you provide parameter fullData=1 to this script, also the pedigree information and parents are given out. However, as the order file does not contain individual names, the individuals will be re-named with running numbers. The offspring are in the same order as in the input data and parents are given as the first individuals of every family. These new parental genotypes are always "1 2". Moreover, this data is phased so that the first digit of the genotypes is inherited from father and the second from mother. If the data is in grandparental phase, this also applies to the parents. 
+With outputPhasedData=1, there won't be any missing data making QTL analysis straighforward.
+
+The phased data can be converted to fully informative "genotype" data by map2gentypes.awk script. If you provide parameter fullData=1 to this script, also the pedigree information and parents are given out. However, as the order file does not contain individual names, the individuals will be re-named with running numbers. The offspring are in the same order as in the input data (withing each family, families are in the order of first occurring individual) and parents are given as the first individuals of every family. These new parental genotypes are always "1 2". Moreover, this data is phased so that the first digit of the genotypes is inherited from father and the second from mother. If the data is in grandparental phase, this also applies to the parents. 

 The grandparentPhase=1 uses only those markers that can be phased using the grandparents. However, to obtain maximum number of markers, you can construct the maps without grandParentPhase=1, then evaluate the order and get phase using grandparentPhase=1, and finally match the phases of the two data on the common markers. The script phasematch.awk can be used for matching the phases.

@@ -361,11 +363,11 @@

 A basic QTL pipeline has been now added to the LM3 git. This include scripts qtl.R, qtlPerm.R, and example data qtlphenotypes.txt and qtldata1.txt. The phenotypes are listed in the same order as in the data, there are phenotypes for the parents as well (but these are not used).

-This 32 family example data was generated as following (order1.txt is from OrderMarkers2, either *de novo* or in the physical order):
+This 32 family example data was generated as following (order1.txt is from OrderMarkers2, either denovo or in the physical order):

 `awk -vfullData=1 -f map2genotypes.awk order1.txt >qtldata1.12`

-The LOD plot is generated by running qtl.R, significance by permutation test can be calculated by qtlPerm.R. To use these for your own data, you have to change them a bit. 
+The LOD plot is generated by running qtl.R, significance by permutation test can be calulated by qtlPerm.R. To use these for your own data, you have to change them a bit. 

 ## Sequencing data processing pipeline 

@@ -376,7 +378,7 @@
 samtools mpileup -q 10 -Q 10 -s $(cat sorted_bams)|java -cp bin/ Pileup2Likelihoods|gzip >post.gz
 ~~~

-This command requires two files, sorted_bams and mapping.txt, both containing exactly one line listing the file names for sorted bams and individual names, respectively and in the same order. If the data of each individual is in its own bam, then the files can be same (but it is more clear to remove the bam suffix from the individual names). Please note that this pipeline does not work with the old version of samtools (0.X) due software bugs in old samtools. 
+This command requires two files, sorted_bams and mapping.txt, both containing exactly one line listing the file names for sorted bams and individual names, respectively and in the same order. If the data of each individual is in its own bam, then the files can be same (but it is more clear to remove the bam suffix from the individual names). Please note that this pipeline does not work with the old version of samtools (0.X). 

 For example (3 individuals in 4 bams):
 sorted_bams: 1.bam 1a.bam 2.bam 3.bam
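Before starting the pipeline, the two one-line files can be checked for consistency. The sketch below is an assumption-laden illustration, not part of Lep-MAP3: the file contents are hypothetical (two BAM files for one individual, one for another), and the check only confirms that each file has exactly one line and that both list the same number of entries.

```shell
tmp=$(mktemp -d)
# Hypothetical inputs: two BAM files for individual ind1, one for ind2.
echo 'ind1.bam ind1_rerun.bam ind2.bam' >"$tmp/sorted_bams"
echo 'ind1 ind1 ind2' >"$tmp/mapping.txt"

# Print "ok" if both files have exactly one line and the same number of
# whitespace-separated entries, otherwise "mismatch".
check=$(awk '{lines[FILENAME]++; entries[FILENAME]+=NF}
  END {
    a = ARGV[1]; b = ARGV[2]
    ok = (lines[a] == 1 && lines[b] == 1 && entries[a] == entries[b])
    print (ok ? "ok" : "mismatch")
  }' "$tmp/sorted_bams" "$tmp/mapping.txt")
echo "$check"
```

The check does not verify that the names are in the same order as the BAM files, so that still has to be confirmed by eye.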

LM3 Home modified by Pasi Rastas

Pasi Rastas — Fri, 10 Mar 2023 13:17:30 -0000

--- v66
+++ v67
@@ -23,15 +23,15 @@

 The most commonly used modules are ParentCall2, Filtering2, SeparateChromosomes2, JoinSingles2All and OrderMarkers2. These are analogous to the modules (ParentCall, Filtering, SeparateChromosomes, JoinSingles and OrderMarkers) of Lep-MAP2.

-In order to map millions of markers (e.g. whole genome sequencing), modules SeparateIndenticals, JoinSingles2Indeticals and JoinLGs can be used. However, these are still in very preliminary form. Moreover, as JoinSingles2All and SeparateChromosomes2 now utilise multiple cores, these could probably be used as well on larger data.
+In order to map millions of markers (e.g. whole genome sequencing), the modules JoinSingles2All and SeparateChromosomes2 can be run faster by utilising multiple cores (numThreads parameter). Moreover, samplePairs in SeparateChromosomes2 makes it much faster without much difference in the output. A typical value for samplePairs could be 0.1 to obtain a 10x speedup.

 In order to verify the linkage mapping results, module LMPlot can be used. It will output the Lep-MAP graph in [Graphviz](http://www.graphviz.org/) DOT language format. The output can be visualized with dot or with xdot.py from https://github.com/jrfonseca/xdot.py . 
 ## Installation

-LM3 is implemented in java. To run it, java runtime evironment is required ([java.com](http://java.com)). The compiled java classes and source code can be downloaded from this sourceforge page.  
+LM3 is implemented in java. To run it, java runtime environment is required ([java.com](http://java.com)). The compiled java classes and source code can be downloaded from this sourceforge page.  

 ## Using Lep-MAP3 non-Linux systems
-Lep-MAP3 is developed in the Linux environment. The examples below are not neccessary working in other operation systems. Here are some hints from Lep-MAP users on how to use in other systems.
+Lep-MAP3 is developed in the Linux environment. The examples below are not necessary working in other operation systems. Here are some hints from Lep-MAP users on how to use in other systems.

 ### macOS
 At least the zcat command is not working in the same way in mac as in Linux. However, replacing zcat by 
@@ -62,7 +62,7 @@

 The input of ParentCall2 consists of genotype likelihoods (posteriors) for each 10 possible SNP genotypes AA, AC, AG, AT, CC, CG, CT, GG, GT and TT. Other kind of variants (like indels) can be given as input by specifying them as SNPs (e.g. AA=homozygote indel, AT=heterozygote indel, TT=no indel). Typically these likelihoods can be obtained from sequencing data or from SNP assays. The output is also in the same likelihood format. 

-The first 6 lines presents the pedigree. First line is the family name, second individual name, third and fourth are the father and mother. Line 5 containts the sex of each individual (1 male, 2 female, 0 unknown) and the last line is the phenotype (can be 0 for all individuals, this is not currently used). The likelihoods can be provided from line 7 forward (columns must match) or on a separate file given as parameter posteriorFile or vcfFile. Finally, columns 1-2 give marker names (scaffold and pos) for genotypes, and can be any value for pedigree part. Thus, make sure that each line has n+2 tab separated columns if there are n individuals and column i + 2 gives the genotype and pedigree information on individual i.
+The first 6 lines present the pedigree. First line is the family name, second individual name, third and fourth are the father and mother. Line 5 contains the sex of each individual (1 male, 2 female, 0 unknown) and the last line is the phenotype (can be 0 for all individuals, this is not currently used). The likelihoods can be provided from line 7 forward (columns must match) or on a separate file given as parameter posteriorFile or vcfFile. Finally, columns 1-2 give marker names (scaffold and pos) for genotypes, and can be any value for pedigree part. Thus, make sure that each line has n+2 tab-separated columns if there are n individuals and column i + 2 gives the genotype and pedigree information on individual i.

 Example pedigree (in correct transpose, should be tab separated) is below:
 ~~~
@@ -139,7 +139,7 @@

 The Filtering2 module handles filtering of the data, i.e. filtering markers based on, e.g. high segregation distortion (dataTolerance) and excess number of missing genotypes (missingLimit). This module outputs the filtered data in the same format to be used with other modules and for further analysis (e.g. QTL mapping). 

-Note that Filtering2 is best suited for multi-family data especially with default dataTolerance(=0.01). For single family data, distortionLod=1 in SeparateChromosomes2 and JoinSingles2All can provide a better solution to deal with distorted markers. This is because Filtering can cause long gaps on single family crosses. If filtering is used on such data, ofter a smaller dataTolerance is more suitable (like 0.001 or 0.0001). 
+Note that Filtering2 is best suited for multi-family data especially with default dataTolerance(=0.01). (Note: Now the default dataTolerance is 0.001) For single family data, distortionLod=1 in SeparateChromosomes2 and JoinSingles2All can provide a better solution to deal with distorted markers. This is because Filtering can cause long gaps on single family crosses. If filtering is used on such data, often a smaller dataTolerance is more suitable (like 0.001 or 0.0001). 

 Example :
 ~~~
@@ -176,7 +176,7 @@
 sort map5.txt|uniq -c|sort -n
 ~~~

-If you get the snp names to a file snps.txt by
+If you get the SNP names to a file snps.txt by
 ~~~
 awk '(NR>=7)' data_f.call|cut -f 1,2 >snps.txt
 ~~~
@@ -199,9 +199,9 @@
 zcat data_f.call.gz|java -cp bin/ JoinSingles2All map=map5.txt data=- lodLimit=3 lodDifference=2 >map5_js.txt
 ~~~

-(parameter numThreads utilises muliple cores)
-
-(iterated joinSingles2All yields same result as iterating it until no markers can be added) 
+(parameter numThreads utilises multiple cores)
+
+(iterated joinSingles2All yields almost the same result as iterating it until no markers can be added) 
 ~~~
 java -cp bin/ JoinSingles2All map=map5.txt data=data_f.call lodLimit=4 iterate=1 >map5_js_iterated.txt
 ~~~
@@ -256,7 +256,7 @@
 java -cp bin/ OrderMarkers2 map=map.txt data=data_f.call recombination2=0
 ~~~

-It is typically more convinient to order each chromosome separately
+It is typically more convenient to order each chromosome separately
 ~~~
 java -cp bin/ OrderMarkers2 map=mapBig.txt data=dataBig.call chromosome=1 >order1.1.txt
 ...
@@ -281,7 +281,7 @@
 It can be wise to remove markers at the map ends that cause long gaps.

 BE SURE that your individuals match the pedigree. For example, use Lep-MAP3 IBD module
-(with genotype likelihood data before ParentCall), to verify that your individuals are full-sibs.
+(with genotype likelihood data before ParentCall2), to verify that your individuals are full-sibs.

 ### LMPlot

@@ -305,7 +305,7 @@
 xdot.py order_wp_1.dot
 ~~~

-The nodes of the graph are numbered in the order they first occur in the map (order_wp1_txt). The edge labels give the index for individual haplotype that recombines (changes). If the order is (about) correct, the node number should be in order 1,2, ..., N when following the nodes from one end of the chain to the other. Also the size of the nodes gives information how common each marker is, uncommon markers at the ends could be erroneous. The erroneous edges are highlited with red color. 
+The nodes of the graph are numbered in the order they first occur in the map (order_wp1_txt). The edge labels give the index for individual haplotype that recombines (changes). If the order is (about) correct, the node number should be in order 1,2, ..., N when following the nodes from one end of the chain to the other. Also, the size of the nodes gives information on how common each marker is; uncommon markers at the ends could be erroneous. The erroneous edges are highlighted with red color. 

 ### IBD

@@ -314,7 +314,7 @@
 ~~~
 zcat post_from_pipeline.gz|java IBD posteriorFile=- >ibd.txt
 ~~~
-Then Listing ibd values in descending order
+Then listing IBD values in descending order
 ~~~
 sort -n -r -k 3,3  ibd.txt|less
 ~~~
@@ -333,7 +333,6 @@
 zcat file.vcf.gz|java IBD vcfFile=- numThreads=8 >ibd_from_vcf.txt
 ~~~

-
 Calculating Mendel error rates:

 ~~~
@@ -347,13 +346,12 @@

 ## WGS data

-There are WGS versions of modules separating markers into linkage groups (SeparateIdenticals, JoinSingles2Identicals, ...). However, the current version of SeparateChromosomes2 is fast enough to run within a few days even on several millions of markers on a computer cluster with enough cores (say 24 cores and numThreads=24). Using SeparateChromosomes2 is the preferred way as it is much simpler to run. Further note that you can refine linkage group assignments and mask markers by providing map file to SeparateChromosomes2 (map=file).  
-
+There are WGS versions of modules separating markers into linkage groups (SeparateIdenticals, JoinSingles2Identicals, ...). However, the current version of SeparateChromosomes2 is fast enough to run within a few days even on several millions of markers on a computer cluster with enough cores (say 24 cores and numThreads=24). Using SeparateChromosomes2 is the preferred way as it is much simpler to run. Further note that you can refine linkage group assignments and mask markers by providing map file to SeparateChromosomes2 (map=file).  Also samplePairs makes SeparateChromosomes2 run faster.

 ## Phasing and QTL mapping

 It is easiest to use data phased by LM3 for QTL mapping. This can be obtained with OrderMarkers2 using parameters grandparentPhase=1, outputPhasedData=1 (or 2) to output the grandparental phased data. 
-With outputPhasedData=1, there won't be any missing data making QTL analysis straighforward.
+With outputPhasedData=1, there won't be any missing data making QTL analysis straightforward.

 The phased data can be converted to fully informative "genotype" data by the map2genotypes.awk script. If you provide parameter fullData=1 to this script, the pedigree information and parents are output as well. However, as the order file does not contain individual names, the individuals will be renamed with running numbers. The offspring are in the same order as in the input data and parents are given as the first individuals of every family. These new parental genotypes are always "1 2". Moreover, this data is phased so that the first digit of the genotypes is inherited from father and the second from mother. If the data is in grandparental phase, this also applies to the parents. 

@@ -363,11 +361,11 @@

 A basic QTL pipeline has now been added to the LM3 git. This includes scripts qtl.R, qtlPerm.R, and example data qtlphenotypes.txt and qtldata1.txt. The phenotypes are listed in the same order as in the data; there are phenotypes for the parents as well (but these are not used).

-This 32 family example data was generated as following (order1.txt is from OrderMarkers2, either denovo or in the physical order):
+This 32-family example data was generated as follows (order1.txt is from OrderMarkers2, either *de novo* or in the physical order):

 `awk -vfullData=1 -f map2genotypes.awk order1.txt >qtldata1.12`

-The LOD plot is generated by running qtl.R, significance by permutation test can be calulated by qtlPerm.R. To use these for your own data, you have to change them a bit. 
+The LOD plot is generated by running qtl.R; significance by permutation test can be calculated by qtlPerm.R. To use these for your own data, you have to change them a bit. 

 ## Sequencing data processing pipeline 

@@ -378,7 +376,7 @@
 samtools mpileup -q 10 -Q 10 -s $(cat sorted_bams)|java -cp bin/ Pileup2Likelihoods|gzip >post.gz
 ~~~

-This command requires two files, sorted_bams and mapping.txt, both containing exactly one line listing the file names for sorted bams and individual names, respectively and in the same order. If the data of each individual is in its own bam, then the files can be same (but it is more clear to remove the bam suffix from the individual names). Please note that this pipeline does not work with the old version of samtools (0.X). 
+This command requires two files, sorted_bams and mapping.txt, both containing exactly one line listing the file names for sorted bams and individual names, respectively and in the same order. If the data of each individual is in its own bam, then the files can be the same (but it is clearer to remove the bam suffix from the individual names). Please note that this pipeline does not work with the old version of samtools (0.X) due to software bugs in old samtools. 

 For example (3 individuals in 4 bams):
 sorted_bams: 1.bam 1a.bam 2.bam 3.bam
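
Since sorted_bams and mapping.txt must each hold exactly one line with entries in the same order, a quick count comparison can catch mismatches before running the pipeline. The following is only a sketch, not part of Lep-MAP3: the mapping.txt content (`1 1 2 3`) is an assumed mapping for the 4-bam example above, and the temporary directory is used purely for illustration.

```shell
# Hypothetical sanity check (not a Lep-MAP3 tool): both files should contain
# exactly one line, with the same number of whitespace-separated entries.
tmp=$(mktemp -d)
printf '1.bam 1a.bam 2.bam 3.bam\n' >"$tmp/sorted_bams"
# Assumed mapping: individual 1 owns the first two bams.
printf '1 1 2 3\n' >"$tmp/mapping.txt"

# Count entries and lines in each file.
n_bams=$(awk '{n += NF} END {print n + 0}' "$tmp/sorted_bams")
n_names=$(awk '{n += NF} END {print n + 0}' "$tmp/mapping.txt")
lines_bams=$(awk 'END {print NR}' "$tmp/sorted_bams")
lines_names=$(awk 'END {print NR}' "$tmp/mapping.txt")

if [ "$lines_bams" -eq 1 ] && [ "$lines_names" -eq 1 ] && [ "$n_bams" -eq "$n_names" ]; then
  result="ok"
else
  result="mismatch"
fi
echo "$result: $n_bams bams, $n_names names"
rm -rf "$tmp"
```

For real data, point the two awk counts at your own sorted_bams and mapping.txt instead of the temporary files.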

LM3 Home modified by Pasi Rastas

Pasi Rastas — Wed, 01 Mar 2023 12:55:51 -0000

--- v65
+++ v66
@@ -106,6 +106,7 @@
 ~~~
 java ParentCall2 data=... ZLimit=2 ...
 ~~~
+The markers called on the sex chromosome have a star (`*`) after the position field, e.g. "contig1 2345" becomes "contig1 2345*".

 ####Converting data from other formats to Lep-MAP3