Hi Pasi,
I have been using your mapping software for a while to build a high-density map, which we have used to tie together our genome assembly.
Now I am conducting QTL mapping, and would like to use the phase info to increase the number of markers in our analysis. Here is my design for creating an outbred F2 between red and yellow monkeyflowers:
R x Y     R x Y    (four grandparents)
  F1   x    F1     (two F1 parents)
       F2s         (many F2s)
Using a series of manual filtering steps, I have been able to identify ~650 markers where Red and Yellow alleles can be tracked all of the way through this cross. Using the map distances between these markers and their genotypes, I have conducted a preliminary QTL analysis, which has revealed a massive QTL that coincides with a known flower color locus, so I feel really good about this set of markers.
So, next I have exported phase info for a much larger set of markers (~7500) for hundreds of individuals. From some previous posts, I understand that the four different phase categories (00, 11, 01, 10) represent the different combinations of parental haplotypes (chromosomes) that a marker can have. First, is that correct?
Second, if that is correct, I figure I should be able to use the subset of markers for which I know the grandparental alleles to convert those phase categories into genotypes for markers that are less informative. In other words, one phase category should always be associated with individuals that inherited two red alleles, another should always be associated with individuals that have two yellow alleles, and the remaining two categories should be associated with heterozygotes. Is this correct?
Last edit: Sean Stankowski 2016-10-19
Dear Sean,
Thank you for your question.
The phase reported by LM (00,01,10,11) is relative to the genotype data. I have an earlier post about it.
However, I think it might be easier to use the flag outputPhasedData=1 to output the full data in phased format. Then you have to figure out how to map this to your grandparental phase (R and Y) using the subset of markers for which you have this information, and you are done.
Cheers,
Pasi
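The mapping Pasi suggests could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of Lep-MAP: it assumes each F2 individual's phase at a marker is already available as a two-character string (paternal bit first, maternal bit second), that the bits are only 0 or 1, and that a given bit refers to the same parental haplotype at every marker of a linkage group. The function names `learn_phase_map` and `phase_to_genotype` are hypothetical.

```python
from collections import Counter

# Swap a phase bit for the other haplotype of the same parent.
FLIP = {"0": "1", "1": "0"}


def learn_phase_map(phases, genotypes):
    """Infer which phase bit carries the R allele in each F1 parent.

    phases    -- per-individual strings such as "01"
                 (assumed order: paternal bit, then maternal bit)
    genotypes -- known genotypes ("RR", "RY" or "YY") for the same
                 individuals at an anchor marker
    """
    pat_votes, mat_votes = Counter(), Counter()
    for phase, genotype in zip(phases, genotypes):
        pat_bit, mat_bit = phase[0], phase[1]
        if genotype == "RR":
            # Both inherited haplotypes carry R.
            pat_votes[pat_bit] += 1
            mat_votes[mat_bit] += 1
        elif genotype == "YY":
            # Both inherited haplotypes carry Y, so R sits on the other bit.
            pat_votes[FLIP[pat_bit]] += 1
            mat_votes[FLIP[mat_bit]] += 1
        # Heterozygotes are skipped: they do not say which parent gave R.
    if not pat_votes:
        raise ValueError("no homozygous individuals to learn from")
    # Majority vote, so a few genotype errors do not change the result.
    pat_r = pat_votes.most_common(1)[0][0]
    mat_r = mat_votes.most_common(1)[0][0]
    pat_map = {pat_r: "R", FLIP[pat_r]: "Y"}
    mat_map = {mat_r: "R", FLIP[mat_r]: "Y"}
    return pat_map, mat_map


def phase_to_genotype(phase, pat_map, mat_map):
    """Translate one phase string into "RR", "RY" or "YY"."""
    alleles = sorted(pat_map[phase[0]] + mat_map[phase[1]])
    return "".join(alleles)
```

Once the two maps have been learned from the anchor markers of a linkage group, `phase_to_genotype` can be applied to the phases of the less informative markers on the same group, provided the consistency assumption above holds for your output.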
Dear Pasi
Thanks for your response! I have gone through the process of outputting the phased data for our F2 individuals, but there does not appear to be any correspondence between the phase information and the marker genotypes.
Could you please take a look at the attached screenshot, which shows part of a lepmap2 output file for LG4 that includes phased data for the F2s? To illustrate the problem, the top of the image shows data for three markers that I selected from this LG. The markers differ in their parental genotypes, and thus their segregation patterns also differ. For these markers, I have pasted genotype data from the .linkage file directly under the phased data for a sample of the individuals. To do this, I assumed that the order of individuals in the output file is the same as in the .loc file from joinmap. Is that correct? I have also included the genotypes of the F1 parents for each marker. For the third marker, I have also recoded the genotypes in the parents and F2 according to whether the allele comes from a red (R) or yellow (Y) flowered grandparent, which we determined by manually screening each marker before we built the map.
Here is the problem I am having: for each marker, the four phase categories do not correspond to specific genotypes among the individuals. For example, for the first marker, phase 11 is associated with all 4 possible genotypes (55, 56, 57, 67). Similarly, individuals with genotype 57 are associated with all 4 phases. The same pattern can be seen for markers 2 and 3 (and all other markers).
This is not at all what I expected to see based on my understanding of what the phased data is supposed to tell us. For example, for the first marker, where the parents are genotypes 56 and 57, the first F2 individual in our file (#392) with genotype 67 should have phase 11. Individual #11 with genotype 57 should be phase 01, etc. In addition, at the third marker, where both F1 parents are RY heterozygotes, I expected the RR (88) and YY (99) categories in the F2 to each be associated with their own phase category across the entire set of individuals; the other two phase categories should be associated with the two heterozygous genotypes (89, 98), both RY. For the second marker, where the parents are 33/34, I would expect 33 genotypes in the F2 to be consistently associated with two of the phase categories, and 34 to be associated with the other two phase categories. Is my thinking correct here, or am I missing something critical?
I have gone through the data files multiple times to make sure that the problem does not lie in a clerical error related to my incorrect matching of individuals or marker ids from different files. So at this point, I am stumped on how to proceed. Do you have any suggestions?
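The kind of check described in this post can be sketched as a cross-tabulation of phase category against genotype for one marker. This is a minimal illustration, assuming the per-individual phases and genotypes have already been extracted into two equal-length lists in the same individual order; the function names `phase_genotype_table` and `concordance` are hypothetical.

```python
from collections import Counter, defaultdict


def phase_genotype_table(phases, genotypes):
    """Count how often each genotype occurs within each phase category."""
    if len(phases) != len(genotypes):
        raise ValueError("phases and genotypes must have the same length")
    table = defaultdict(Counter)
    for phase, genotype in zip(phases, genotypes):
        table[phase][genotype] += 1
    return table


def concordance(table):
    """Fraction of individuals whose genotype is the most common one
    for their phase category (1.0 means each phase maps to one genotype)."""
    total = sum(sum(counts.values()) for counts in table.values())
    agree = sum(max(counts.values()) for counts in table.values())
    return agree / total
```

If the individuals and phases are paired correctly, the concordance should be close to 1.0, with the remainder reflecting genotype errors. A value near what random pairing would give points to a mismatch in how phases and individuals were lined up.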
Dear Sean,
Your example looks a bit worrying.
However, outputPhasedData=1 will output the most likely data, which is not necessarily consistent with the genotypes. In this case, the genotype error estimate should be elevated for the first markers (where it is likely that the markers with high error will stack). Is there any variation in the phased output? In the screenshot, the phased data seems identical for all markers.
Maybe we should continue the discussion by email? I will post here the final solution to your problem, when we figure it out.
Cheers,
Pasi
Dear Sean and all,
I think I figured this out. The phased output of LM is not interlaced (the first individual being the first and second characters); instead, the first half of the phased data holds the paternal patterns and the second half the maternal ones. Sorry for the confusion.
Cheers,
Pasi
Hi Pasi,
Thank you for looking into this. I have just checked this for one LG, and I can confirm that everything now makes sense.
To clarify, if you have n individuals in your dataset, the phased data for individual i can be obtained by combining the value at position i (paternal) and position n+i (maternal).
Sean
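Sean's clarification can be sketched in a few lines. This is a minimal example, assuming the phased data for one marker has already been pulled out of the output file as a single string of 2n characters (n paternal values followed by n maternal values); the function name `pair_phases` is hypothetical, and the positions are 0-based here, whereas the description above counts from 1.

```python
def pair_phases(phased):
    """Return one two-character phase per individual.

    The first half of `phased` is taken as the paternal values and the
    second half as the maternal values, so individual i combines the
    characters at 0-based positions i and n + i.
    """
    if len(phased) % 2 != 0:
        raise ValueError("expected an even number of phase characters")
    n = len(phased) // 2
    paternal, maternal = phased[:n], phased[n:]
    return [p + m for p, m in zip(paternal, maternal)]


# Four individuals: paternal values "0101", maternal values "0011".
phases = pair_phases("01010011")
# phases == ["00", "10", "01", "11"]
```

Reading the same string as interlaced pairs would instead give "01", "01", "00", "11", which is the misreading that made the phases look unrelated to the genotypes.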