Hi Pasi - I have two families that I'm building maps for, one map per family. In one family I have two diploid parents genotyped and offspring come from haploid tissue. In the other family, I have only one diploid parent genotyped and the offspring are also from haploid tissue. Likelihoods for the haploid genotypes (e.g., for A) were input as likelihoods for homozygote diploid genotypes (e.g., A/A).
I would like to confirm the ParentCall2 command and flags.
I've set both commands to be : ParentCall2 data=ped_file.txt posteriorFile=post_file.txt XLimit=0 ignoreParentOrder=1 outputParentPosterior=1 > data.call.gz
In the family with two parents genotyped, 44541 / 80997 (71132 informative) markers were called Z/X, while in the family with one parent genotyped I had all 52625 / 52625 (39077 informative) markers called Z/X.
Do I have the commands set up correctly, and is the contrasting outputs of the Z/X reasonable (where in one family all markers were called Z/X and in the other only a proportion were called Z/X)?
Thanks!
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
one last thing - I was a little surprised to see that not all markers were informative - I ensured that at least one parent had a genotype, that the number of unique alleles across parental genotypes was > 1, and that the number of unique alleles across offspring was also > 1.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Could you explain a bit more, I don't quite understand how you can have haploid offspring but two parents?
If you are using XLimit flag, then the haploid allele is coming from the mother? I think you should then get the same result with all homozygote father. If you set the father's likelihoods to "1 0 0 0 0 0 0 0 0 0", it would probably work better (AA). If you have data on the mother, you could put there as well, but probably it works even when omitted (ten 1s, all missing).
(CORRECTION: Put father as "1 0 0 0 0 0 0 0 0 0" instead of 1 0 0 0 1 0 0 1 0 1).
And I would remove flags ignoreParentOrder=1 and outputParentPosterior=1, probably these won't help here. XLimit will then convert data to maternally informative, e.g. offspring to A/A or A/T instead of A/A or T/T. Remember to put all your offspring as males!
Hope this helps and please ask again if not.
Cheers,
Pasi
Last edit: Pasi Rastas 2025-04-01
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
XLimit=NUM is a good choice here. It finds father as X- and mother as XX, and male offspring as X-. It is exactly how your data should be analysed. (NUM could be 0-3)
If the families are >= 10 offspring, you probably don't need parental data at all. Just code father as homozygote (e.g. AA) and mother as missing. With less offspring, the mother data might make a difference,
Cheers,
Pasi
Last edit: Pasi Rastas 2025-04-01
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I'm now on the SeparateChromosomes2 step, and neither of my families ended up with output with the correct number of linkage groups.
I iterated all combinations of the following:
- all possible 'informativeMask' : '0123', '1', '2', '3', '23', '13', '123'
- for lodLimit - I tried all odd numbers between 0 and 50 (including 1 and 49)
- sizeLimit - either 90 or 100
- all iterations I set distortionLod=1 since I have one family
However as you can see with these figures, lodLimits less than about 5 seemed to be best for the data and iterations of SeparateChromosomes2 that I ran. The red line is the expected number of linkaged groups (n=13).
To move forward, I was thinking of exploring lodLimit between 0 and 10, in 0.5 increments, and exploring lower limits of sizeLimit. Do you see anything wrong with this - I've seen you comment to others that use low lodLimit that they should explore higher values. Was curious on your opinion here.
It can be tricky to get to correct linkage groups. I am not exactly sure how you have analysed your data but there seems to be good signal. You are close.
1) I think the informativeMask should be 1 or 2, depending how you coded the data to LM3. (If the data is haploid and from one parent?)
2) distortionLod=1 is good choice here.
3) sizeLimit could be omitted, instead just look at the size distribution of the markers. If there is large dip in the size between 13 and 14, just remove groups >=14 (manually or re-run with proper sizeLimit). You can also see how the large groups are split into smaller ones when you increase LOD, the actual group split can be seen by comparing different clusterings from SeparateChromosomes2. If you know the size distribution of chromosomes, this helps as well (all chrs about equal size vs. large and small chrs).
4) If the two families have common markers, joining them increases information and eases the (initial) clustering of the markers.
Cheers,
Pasi
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
1) I'm happy to provide specific information if you'd like. The quick version is the exome capture genetic data (100 offspring per family) is from a species with haploid chromosome number of 13 (2n=26) and is output from samtools - the genome size is about 16Gbp. I have 53,000 or 81,000 SNPs depending on family (snps filtered for < 20% missing data and MAF, and informativeness like non-homozygote parental genotypes etc). I've encoded haploid genotypes as homozygous, and I also have diploid data for parents - for one family I have diploid data for one parent, and the other family I have diploid data for both parents.
- I only ran one unique command for each family for ParentCall2 and Filtering2. So I only start to explore parameter choices with SeparateChromosomes2:
- ParentCall2 data={pedfile} posteriorFile={postfile} XLimit=0 ignoreParentOrder=1 outputParentPosterior=1 > {outfile}
- Filtering2 data={f} dataTolerance=0.01 > {f2}
2) understood
3) this species has variable chromosome lengths (inferred from karyotyping and other linkage maps), but does not have a chromosome-scale reference genome. With this in mind, and the fact that we're using exome capture, I would be hesitant to try and use eg snps/LG as a heuristic to choose among different outputs.
4) unfortunately while the two families were both genotyped similarly, they are from different varieties of the species and we expect their linkage maps to be different.
Last edit: Brandon 2025-03-11
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I would remove parameters ignoreParentOrder=1 and outputParentPosterior=1 as extra complication. I would use default dataTolerance=0.001 or even smaller with 100 offspring.
If you use XLimit, then just code father as homozygote (AA) and mother as missing. And put all offspring as males. This is the way.
I think the data can be analysed as selfing cross as well. Just put two dummy parents for each family (no XLimit then) and read the wiki about selfing crosses (some parameters have to be altered, e.g. selfingPhase=1, heterozygoteRate=0.01).
Cheers,
Pasi
Last edit: Pasi Rastas 2025-04-01
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
A couple quick questions. When you say code father as homozygote do you mean puting encoding the likelihoods as [1, 0, 0, 0, 1, 0, 0, 1, 0, 1]? with ['A/A', 'A/C', 'A/G', 'A/T', 'C/C', 'C/G', 'C/T', 'G/G', 'G/T', 'T/T'] - the mother would be all 0
By 'using' XLimit, do you mean specifying it in the command? I found some weird patters if I did not use it at all in the command. And by 'no XLimit' do you mean leaving it out of the command. I had assumed that not specifying XLimit would fall back on the default of inf?
Thanks
Brandon
Last edit: Brandon 2025-03-14
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Option1: The father as ([1, 0, 0, 0, 0, 0, 0, 0, 0, 0]) and mother as [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]. All offspring as males (sex=1). Then you provide XLimit=2 for ParentCall2. You can use informativeMask=2 for all steps after this (but probably it is not needed).
Option2: two dummy parents for both families. No XLimit for ParentCall2 but selfingPhase=1, heterozygoteRate=0.01 for other steps (all markers kept should have both dummy parents informative).
Cheers,
Pasi
Last edit: Pasi Rastas 2025-04-01
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
When running with true mother genotype likelihoods, father as 0 for all genotype likelihoods, and offspring as 0 sex, and using the same parameter exploration as above, I do get some output from SeparateChromosomes2 with the correct number of linkage groups (13). Even output with more linkage groups will have similar number of markers within linkage groups as the outputs with 13 linakge groups.
Is there any reason I shouldn't move forward with these other runs?
Thanks!
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I don't understand how this is working:) Clearly there is signal in the data as you can get correct number of linkage groups. If you get a good map there is no need to change anything.
By specifying the sex as 0, ParentCall2 will take an average likelihood between setting each offspring either male or female. I have never used ParentCall2 this way.
If you put all likelihoods as 0, it will effectively normalise it to all 1 so this is ok.
If you want me to take a look at your data, please send a small subset of the genotype data (100-1000 markers?) by email. I am happy the check it.
Cheers,
Pasi
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Hi Pasi - I have two families that I'm building maps for, one map per family. In one family I have two diploid parents genotyped and offspring come from haploid tissue. In the other family, I have only one diploid parent genotyped and the offspring are also from haploid tissue. Likelihoods for the haploid genotypes (e.g., for A) were input as likelihoods for homozygote diploid genotypes (e.g., A/A).
I would like to confirm the ParentCall2 command and flags.
I've set both commands to be : ParentCall2 data=ped_file.txt posteriorFile=post_file.txt XLimit=0 ignoreParentOrder=1 outputParentPosterior=1 > data.call.gz
In the family with two parents genotyped, 44541 / 80997 (71132 informative) markers were called Z/X, while in the family with one parent genotyped I had all 52625 / 52625 (39077 informative) markers called Z/X.
Do I have the commands set up correctly, and is the contrasting outputs of the Z/X reasonable (where in one family all markers were called Z/X and in the other only a proportion were called Z/X)?
Thanks!
for the family with only one parent genotyped - I input 0 for all genotype likelihoods for the missing parent.
Last edit: Brandon 2025-01-16
one last thing - I was a little surprised to see that not all markers were informative - I ensured that at least one parent had a genotype, that the number of unique alleles across parental genotypes was > 1, and that the number of unique alleles across offspring was also > 1.
Dear Brandon,
Thank you for you question.
Could you explain a bit more, I don't quite understand how you can have haploid offspring but two parents?
If you are using XLimit flag, then the haploid allele is coming from the mother? I think you should then get the same result with all homozygote father. If you set the father's likelihoods to "1 0 0 0 0 0 0 0 0 0", it would probably work better (AA). If you have data on the mother, you could put there as well, but probably it works even when omitted (ten 1s, all missing).
(CORRECTION: Put father as "1 0 0 0 0 0 0 0 0 0" instead of 1 0 0 0 1 0 0 1 0 1).
And I would remove flags ignoreParentOrder=1 and outputParentPosterior=1, probably these won't help here. XLimit will then convert data to maternally informative, e.g. offspring to A/A or A/T instead of A/A or T/T. Remember to put all your offspring as males!
Hope this helps and please ask again if not.
Cheers,
Pasi
Last edit: Pasi Rastas 2025-04-01
Hi Pasi,
The tissue is megagametophyte haploid tissue taken from conifer seeds with known parent(s).
There was a similar dataset in the past, but maybe I misunderstood and it was an option to use XLimit=0 OR convert haplotype genotypes to homozygote genotypes - https://sourceforge.net/p/lep-map3/discussion/general/thread/525c83d8fe/?limit=25#a69c
Do you think having XLimit=0 is a good choice here?
Dear Brandon,
XLimit=NUM is a good choice here. It finds father as X- and mother as XX, and male offspring as X-. It is exactly how your data should be analysed. (NUM could be 0-3)
If the families are >= 10 offspring, you probably don't need parental data at all. Just code father as homozygote (e.g. AA) and mother as missing. With less offspring, the mother data might make a difference,
Cheers,
Pasi
Last edit: Pasi Rastas 2025-04-01
Thanks, Pasi.
I'm now on the SeparateChromosomes2 step, and neither of my families ended up with output with the correct number of linkage groups.
I iterated all combinations of the following:
- all possible 'informativeMask' : '0123', '1', '2', '3', '23', '13', '123'
- for lodLimit - I tried all odd numbers between 0 and 50 (including 1 and 49)
- sizeLimit - either 90 or 100
- all iterations I set distortionLod=1 since I have one family
However as you can see with these figures, lodLimits less than about 5 seemed to be best for the data and iterations of SeparateChromosomes2 that I ran. The red line is the expected number of linkaged groups (n=13).
To move forward, I was thinking of exploring lodLimit between 0 and 10, in 0.5 increments, and exploring lower limits of sizeLimit. Do you see anything wrong with this - I've seen you comment to others that use low lodLimit that they should explore higher values. Was curious on your opinion here.
Thanks!
Dear Brandon,
It can be tricky to get to correct linkage groups. I am not exactly sure how you have analysed your data but there seems to be good signal. You are close.
1) I think the informativeMask should be 1 or 2, depending how you coded the data to LM3. (If the data is haploid and from one parent?)
2) distortionLod=1 is good choice here.
3) sizeLimit could be omitted, instead just look at the size distribution of the markers. If there is large dip in the size between 13 and 14, just remove groups >=14 (manually or re-run with proper sizeLimit). You can also see how the large groups are split into smaller ones when you increase LOD, the actual group split can be seen by comparing different clusterings from SeparateChromosomes2. If you know the size distribution of chromosomes, this helps as well (all chrs about equal size vs. large and small chrs).
4) If the two families have common markers, joining them increases information and eases the (initial) clustering of the markers.
Cheers,
Pasi
Hi Pasi,
1) I'm happy to provide specific information if you'd like. The quick version is the exome capture genetic data (100 offspring per family) is from a species with haploid chromosome number of 13 (2n=26) and is output from samtools - the genome size is about 16Gbp. I have 53,000 or 81,000 SNPs depending on family (snps filtered for < 20% missing data and MAF, and informativeness like non-homozygote parental genotypes etc). I've encoded haploid genotypes as homozygous, and I also have diploid data for parents - for one family I have diploid data for one parent, and the other family I have diploid data for both parents.
- I only ran one unique command for each family for ParentCall2 and Filtering2. So I only start to explore parameter choices with SeparateChromosomes2:
- ParentCall2 data={pedfile} posteriorFile={postfile} XLimit=0 ignoreParentOrder=1 outputParentPosterior=1 > {outfile}
- Filtering2 data={f} dataTolerance=0.01 > {f2}
2) understood
3) this species has variable chromosome lengths (inferred from karyotyping and other linkage maps), but does not have a chromosome-scale reference genome. With this in mind, and the fact that we're using exome capture, I would be hesitant to try and use eg snps/LG as a heuristic to choose among different outputs.
4) unfortunately while the two families were both genotyped similarly, they are from different varieties of the species and we expect their linkage maps to be different.
Last edit: Brandon 2025-03-11
Dear Brandon,
Thank you for your detailed information.
I would remove parameters ignoreParentOrder=1 and outputParentPosterior=1 as extra complication. I would use default dataTolerance=0.001 or even smaller with 100 offspring.
If you use XLimit, then just code father as homozygote (AA) and mother as missing. And put all offspring as males. This is the way.
I think the data can be analysed as selfing cross as well. Just put two dummy parents for each family (no XLimit then) and read the wiki about selfing crosses (some parameters have to be altered, e.g. selfingPhase=1, heterozygoteRate=0.01).
Cheers,
Pasi
Last edit: Pasi Rastas 2025-04-01
Thanks Pasi.
A couple quick questions. When you say code father as homozygote do you mean puting encoding the likelihoods as [1, 0, 0, 0, 1, 0, 0, 1, 0, 1]? with ['A/A', 'A/C', 'A/G', 'A/T', 'C/C', 'C/G', 'C/T', 'G/G', 'G/T', 'T/T'] - the mother would be all 0
By 'using' XLimit, do you mean specifying it in the command? I found some weird patters if I did not use it at all in the command. And by 'no XLimit' do you mean leaving it out of the command. I had assumed that not specifying XLimit would fall back on the default of inf?
Thanks
Brandon
Last edit: Brandon 2025-03-14
Dear Brandon,
Option1: The father as ([1, 0, 0, 0, 0, 0, 0, 0, 0, 0]) and mother as [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]. All offspring as males (sex=1). Then you provide XLimit=2 for ParentCall2. You can use informativeMask=2 for all steps after this (but probably it is not needed).
Option2: two dummy parents for both families. No XLimit for ParentCall2 but selfingPhase=1, heterozygoteRate=0.01 for other steps (all markers kept should have both dummy parents informative).
Cheers,
Pasi
Last edit: Pasi Rastas 2025-04-01
Hi Pasi,
I tried this, but all outputs from SeparateChromosomes2 had all markers as singles.
I tried 5704 different combinations of
- xlimit : default / not specified, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 30, 45, 60, 75, 90, 105
- lodlimit : 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98
- sizelimit : 50, 90
- theta : 0.03, 0.05
- informative mask : 2 (only 2)
When running with true mother genotype likelihoods, father as 0 for all genotype likelihoods, and offspring as 0 sex, and using the same parameter exploration as above, I do get some output from SeparateChromosomes2 with the correct number of linkage groups (13). Even output with more linkage groups will have similar number of markers within linkage groups as the outputs with 13 linakge groups.
Is there any reason I shouldn't move forward with these other runs?
Thanks!
Dear Brandon,
I don't understand how this is working:) Clearly there is signal in the data as you can get correct number of linkage groups. If you get a good map there is no need to change anything.
By specifying the sex as 0, ParentCall2 will take an average likelihood between setting each offspring either male or female. I have never used ParentCall2 this way.
If you put all likelihoods as 0, it will effectively normalise it to all 1 so this is ok.
If you want me to take a look at your data, please send a small subset of the genotype data (100-1000 markers?) by email. I am happy the check it.
Cheers,
Pasi
Thanks, Pasi. I sent an email from my gmail.
Dear Brandon,
This was my mistake, you have to put father as any homozygote like AA ("1 0 0 0 0 0 0 0 0 0 "). I corrected this to the previous messages as well...
Cheers,
Pasi