Users are encouraged to post questions here regarding the BP&P
software. When reporting apparent bugs or other unexpected program
behavior please state the version of the program that you are using as
well as any parameter settings.
Anonymous
2013-01-05
I wonder if BP&P could be used to delimit bacterial species, assuming that the populations have had no horizontal gene transfer and no (or low levels of) recombination. What special care should we take when using BP&P with bacterial populations?
I appreciate your answers and comments,
Jose
j.castillo@proinpa.org
The current method assumes that the different sequence loci are freely recombining (unlinked). One issue with bacterial species is the limited recombination/horizontal transfer which creates dependence among sequences/genes/chromosomes across the genome. One strategy would be to use a single sequence locus but that would likely have low power.
Anonymous
2013-02-12
I am performing a species delimitation analysis with mtDNA and two phased nDNA loci. When I run the program I get an error stating that I have more sequences at locus two (26) than allowed by the control file. This locus has 13 individuals with two haplotypes each. I was wondering what the maximum number of sequences per locus may be? Thanks.
Chris
This sounds like a problem with the control file format. Can you post the content of your control file?
Anonymous
2013-02-13
I think I figured it out. It had to do with how I was numbering my individuals in the data file. For example, I was assigning the same sequence ID to different phased alleles, so in the Imap and control file there were not enough sequences per species. I specified new sequence IDs for the second allele for each nuclear gene but included these in the same 'species' as the first allele in the Imap file. I hope this is correct.
Chris
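The fix described above (a separate sequence ID for each phased allele, with both alleles assigned to the same species) can be sketched as follows. This is only an illustration: the `a`/`b` suffixes, the function name and the two-column "ID, species" layout are assumptions made here, so compare the result with the example Imap files shipped in the package before using it.

```python
def phased_imap_lines(individuals, alleles=("a", "b")):
    """Build Imap-style lines in which every phased allele has its own ID.

    individuals maps an individual's name to its species/population label.
    The allele suffixes are a hypothetical naming convention.
    """
    lines = []
    for name, species in individuals.items():
        for suffix in alleles:
            # Both alleles of one individual map to the same species.
            lines.append(f"{name}{suffix}\t{species}")
    return lines


if __name__ == "__main__":
    example = {"ind01": "A", "ind02": "A", "ind03": "B"}
    for line in phased_imap_lines(example):
        print(line)
```

With 13 individuals and two haplotypes each, a mapping built this way contains 26 entries for the locus, which matches the 26 sequences mentioned in the question.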
Anonymous
2013-02-19
Hello,
I have performed a species delimitation analysis for ten closely related and recently diverged species and I have two questions about the output.
First, although the ratios of theta to tau are the same for both algorithms, the absolute values of theta and tau differ between the two rjMCMC algorithms, with the mean values from algorithm 1 being 2-4 times greater than those from algorithm 0, although in some cases the 95% credible intervals overlap. Should I be worried about this? I have done eight replicate runs of each algorithm and have found this behavior to be consistent across runs.
Second, I am interested in measuring the divergence between species and subclades in units of Ne generations. I have a known substitution rate, and can convert theta to Ne and tau to time. However, I'm not sure which is the appropriate value of theta to use. Should I calculate the geometric mean of the mean theta values across the entire species tree?
Thank you for your help,
Ron
First, Algorithms 0 and 1 should produce identical results, so the difference is a concern. Do you see similar differences when you turn off speciesdelimitation and use a fixed tree?
I am not sure about your second question. The model assumes that each population has its own Ne, so you need to decide which one to use. Also, since tau/theta = (time*mu)/(4 Ne mu) = time/(4 Ne), you don't have to know the mutation rate, because mu cancels.
ziheng
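A small numerical check of the cancellation described above, as a sketch only. The function names and the example numbers are made up for illustration; the only relations used are tau = time*mu and theta = 4 Ne mu from the reply.

```python
import math


def tau_over_theta(tau, theta):
    """Divergence time expressed in units of 4*Ne generations."""
    return tau / theta


def ratio_from_raw(time_generations, ne, mu):
    """Recompute tau/theta from time, Ne and mu to show that mu cancels."""
    tau = time_generations * mu      # tau = time * mu
    theta = 4 * ne * mu              # theta = 4 * Ne * mu
    return tau_over_theta(tau, theta)


if __name__ == "__main__":
    # Hypothetical values: two different mutation rates give the same ratio.
    r1 = ratio_from_raw(1000, 500, 1e-8)
    r2 = ratio_from_raw(1000, 500, 5e-9)
    print(r1, r2, math.isclose(r1, r2))
```

Because the ratio is in units of 4 Ne generations, multiplying it by 4 gives the divergence in units of Ne generations. Which population's theta to use remains a choice the user has to make, as the reply notes.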
Anonymous
2013-03-21
I am trying to use MCcoal for simulating and then analyzing the simulated data with BPP.
I was wondering if the heredity scalars are working in both programs. I just tried to specify a file with these scalars but MCcoal reported a problem. A file with locus rates was apparently read without problems.
thanks for your help
sincerely,
Arley Camargo
I am still learning how to use SourceForge. Sorry this is such a late reply.
This question has perhaps been answered already.
Anyway, bpp can deal with heredity scalars, but MCcoal can't. I guess the way to go may be for you to write simple (perl) scripts to generate the MCcoal control file and simulate the alignments for different scalars separately, and then merge the alignments into one data file. That way you know how many loci should be generated for each scalar. The scalar is used to multiply the thetas, but the taus should remain unchanged.
ziheng
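The bookkeeping suggested above might look like the sketch below (written in Python rather than perl). It only groups loci by heredity scalar and scales the thetas; it does not write an MCcoal control file, whose format is not reproduced here. All names and numbers are hypothetical.

```python
from collections import Counter


def loci_per_scalar(scalars):
    """Count how many loci must be simulated for each heredity scalar."""
    return Counter(scalars)


def scaled_parameters(thetas, taus, scalar):
    """Multiply every theta by the scalar; leave the taus unchanged."""
    return [theta * scalar for theta in thetas], list(taus)


if __name__ == "__main__":
    # Hypothetical design: three loci with scalar 1.0 and one with 0.25.
    scalars = [1.0, 1.0, 1.0, 0.25]
    thetas, taus = [0.01, 0.02], [0.005]
    for scalar, n_loci in loci_per_scalar(scalars).items():
        sim_thetas, sim_taus = scaled_parameters(thetas, taus, scalar)
        print(f"scalar={scalar}: simulate {n_loci} loci "
              f"with thetas={sim_thetas}, taus={sim_taus}")
```

Each group would then be simulated in its own MCcoal run and the resulting alignments concatenated into one data file, in the same locus order as the scalar file given to bpp.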
Anonymous
2013-05-22
I am trying to run BPP2.2 to delimit 5 possible fungal species within a species complex. I have run the program with each of the possible starting trees, but the following generations always run using the 1111 tree, so the branches are not collapsing and I am not getting posterior species model probabilities. I have 70 sequences for each of 3 loci. I suspect my .ctl file (attached) has an error.
Anonymous
2013-06-05
I am getting the exact same thing. Have you received an answer to your question, John? I wrote my control file identically to other control files, so I don't think that is the issue. I've also started on all five starting trees, varied theta and tau combinations (x3) and used both algorithm settings (0 and 1) with the same result every time: first generation it goes to the fully resolved species tree with a probability of 1.000 and stays there through all 50,000 generations...
Thanks,
KPW
This is not necessarily any indication of a problem. It is possible that the completely delimited model has probability 1. In that case the program will not visit other delimitations. If you are worried you could try artificially splitting one of your populations into two groups, create a guide tree with the groups as sister species and see whether that node is collapsed in the posterior distribution of models.
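One way to set up the artificial split suggested above is sketched below. Assigning members to the two groups at random is an assumption made here (the reply only says "artificially"), and the `_1`/`_2` labels and the list-of-pairs representation of the Imap are hypothetical. The guide tree and the species line of the control file would still need to be edited by hand so that the two new groups appear as sister species.

```python
import random


def split_population(imap, population, seed=0):
    """Randomly divide one population of an Imap into two artificial groups.

    imap is a list of (sequence_id, population) pairs. Members of
    `population` are relabelled `<population>_1` or `<population>_2`.
    """
    rng = random.Random(seed)
    members = [seq_id for seq_id, pop in imap if pop == population]
    rng.shuffle(members)
    first_group = set(members[: len(members) // 2])

    new_imap = []
    for seq_id, pop in imap:
        if pop == population:
            label = f"{population}_1" if seq_id in first_group else f"{population}_2"
            new_imap.append((seq_id, label))
        else:
            new_imap.append((seq_id, pop))
    return new_imap


if __name__ == "__main__":
    imap = [("s1", "A"), ("s2", "A"), ("s3", "A"), ("s4", "A"), ("s5", "B")]
    for seq_id, pop in split_population(imap, "A"):
        print(seq_id, pop)
```

If the method is behaving well, the node joining the two artificial groups should be collapsed in the posterior distribution of models.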
The control file looks o.k. Run the program on the command line and
observe its behavior, and if it behaves in the same way as it does on
the example datasets, it should be fine. Version 2.2 includes quite a
few data examples. For example, the two datasets that we analyzed in
our recent Genetics paper are in the package. You can try to
duplicate our results in the paper, and then prepare your files in the
same format.
Most likely the analysis supports the fully resolved tree, so model
1111 has posterior prob ~100%. The results are probably correct
especially if you get the same results with different starting trees.
You have many sequences at each locus, so the dataset is quite large.
You can set nloci to 1 or 2 instead of 3 to see whether the prob becomes
smaller with fewer loci. Also check whether the priors are reasonable
and change them to see whether they have an impact. (Look at the
explanations of those priors in the document.)
thetaprior = 2 2000 # gamma(a, b) for theta
tauprior = 2 20000 1 # gamma(a, b) for root tau & Dirichlet(a) for other tau's
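To judge whether priors like the two lines above are reasonable, it can help to look at the mean and spread they imply. The sketch below assumes gamma(a, b) is parameterised by shape a and rate b, so that the mean is a/b; please confirm this against the program documentation. The function name is made up for illustration.

```python
import math


def gamma_mean_sd(a, b):
    """Mean and standard deviation of a gamma prior with shape a and rate b."""
    mean = a / b
    sd = math.sqrt(a) / b
    return mean, sd


if __name__ == "__main__":
    # Values taken from the control-file lines quoted in this post.
    for name, a, b in [("thetaprior", 2, 2000), ("tauprior (root)", 2, 20000)]:
        mean, sd = gamma_mean_sd(a, b)
        print(f"{name}: mean={mean:g}, sd={sd:g}")
```

Under that assumption the theta prior has mean 0.001 and the root tau prior has mean 0.0001. Changing b by a factor of 10 in either direction is a simple way to check how sensitive the posterior model probabilities are to the priors.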
I just saw the post below by KPW and Bruce's reply. Yes, it seems
that there may not be any numerical problem. It is just that the
method is favoring the fully resolved model, with posterior ~100%.
I am interested in the question whether the method (correctly
implemented, without any computational problems) oversplits. We don't
know much about this. The method seems often to favour many species
(or even the fully resolved tree) in empirical datasets.
Nevertheless, in simulations, it does not oversplit. You know your
species; if you believe the method is splitting two populations that
should be one species into two species, that may be of interest to many
people. It is possible that the simulations (there are only 2
of these, see below) missed some important features of the real
process and the simulation results are not that relevant, but in that
case, we need to know what the important features are.
best,
ziheng
Camargo, A., M. Morando, L. J. Avila, and J. W. Sites. 2012. Species delimitation with ABC and other coalescent-based methods: a test of accuracy with simulations and an empirical example with lizards of the Liolaemus darwinii complex (Squamata: Liolaemidae). Evolution 66:2834-2849.
Rannala, B., and Z. Yang. 2013. Improved reversible jump algorithms for Bayesian species delimitation. Genetics 194:245-253.
Zhang, C., D.-X. Zhang, T. Zhu, and Z. Yang. 2011. Evaluation of a Bayesian coalescent method of species delimitation. Syst. Biol. 60:747-761.
Anonymous
2013-07-19
Good day,
I have now rerun my dataset numerous times to check for oversplitting. In each series of BPP runs I used both algorithm 1 and 0 and used 3-4 starting trees. In the first series, I created a "false" node by dividing one clade from my coalescent species tree into two clades. The posterior probability of this clade was always low, so oversplitting did not occur. In the second series, I split the same clade by choosing a small subclade within it that had a posterior probability of 0.64. At the end of these runs, the posterior probability of this node was always 1.00, so as I understand it, oversplitting did occur. However, there was an error reported at the end of each run: error in scanfile (). In the third series, I split a different clade from my coalescent species tree by choosing a small subclade within it that had a posterior probability of 0.93. In this case, the BPP runs returned posterior probabilities of about 0.98 each time. This, if I am correct, would not be considered a case of oversplitting. These third-series runs also produced the same error message: error in scanfile (). Each of the clades I split had 1.00 support in my coalescent species tree. I am very interested in your evaluation of these conditions and results.
Thank you,
John
Anonymous
2013-06-02
Hi
I am running BP&P for species delimitation with 16 species in my guide tree and 130 sequences for 6 loci. The analysis initiates, sets the parameters and calculates the likelihood, but it does not seem to proceed to print out the percentage progress indicator, acceptance proportions and the posteriors for the parameters.
Have I set something incorrectly?
Thanks!
I don't know, but the problem is that the program is having trouble reading the sequence data file. Did you look at the files in the package? You have to use the same format.
ziheng yang
Anonymous
2013-07-15
I can't seem to get the acceptance proportion of my GBtj finetune parameter into the 0.3-0.4 range, no matter what I change the corresponding finetune parameter to in my ctl file. It stays around 0.64 no matter what. I must be missing something. Any advice?
This is not a problem, and it has been noted before. Basically the GBtj move changes one coalescent time tj in a gene tree. It is a very small move and does not change the likelihood and prior much, so the acceptance proportion is high.
ziheng
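The general point, that a move which barely changes the posterior is accepted most of the time, can be seen in a toy Metropolis sampler. The sketch below is not the GBtj move and uses none of the program's code; it samples a standard normal target with a sliding-window proposal, and all names are hypothetical.

```python
import math
import random


def acceptance_rate(step, n_iter=20000, seed=1):
    """Fraction of accepted proposals for a Metropolis sampler on N(0, 1).

    Proposals are drawn uniformly from a window of half-width `step`
    around the current value.
    """
    rng = random.Random(seed)
    x = 0.0
    accepted = 0
    for _ in range(n_iter):
        proposal = x + rng.uniform(-step, step)
        # Log ratio of target densities for a standard normal.
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
            accepted += 1
    return accepted / n_iter


if __name__ == "__main__":
    for step in (0.1, 1.0, 10.0):
        print(f"window {step}: acceptance {acceptance_rate(step):.2f}")
```

Small windows give acceptance proportions close to 1 and large windows give low ones. In this toy the window can always be widened to bring the proportion down, whereas the reply above explains that for GBtj the proportion stays high regardless, which is not a problem.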
Anonymous
2013-08-02
Regarding my post from 5-22-13
I have now had the opportunity to run BPP multiple times on my dataset to test for over splitting. I have run 3 tests. I have run both algorithm 0 and 1 and used 3 starting trees for each test.
I have 5 clades with nodes supported at posterior probabilities of 1 from my coalescent run in MrBayes. These 5 clades are supported during my BPP runs at 100% (1111 on the guide tree). I then took Dr. Rannala's suggestion and ran a test by creating an artificial clade to see if it would be supported in the posterior distribution. I created an artificial split of one of my well-supported clades. This node was not supported by BPP (11110). Next I split the clade using a small sub-branch within it with a posterior probability of 0.93; that node was supported at 100 percent (11111) in BPP. I then split the clade a third time using a small sub-branch with a posterior probability of 0.64; that node was also supported at 100 percent (11111) in BPP. The program seems to be running correctly. I am still trying to determine the species delimitation of this group and am wondering whether these tests indicate oversplitting, or whether I should be satisfied with the original result of my 5-clade run.
Thank you for your advice,
John
With only 3 loci, using the data to create additional splits of subclades with high posterior probability in the guide tree could lead to over-splitting (we haven't really explored this issue with such a small number of loci). However, this effect would presumably disappear (posterior probabilities would decrease) if additional loci were then added.
Anonymous
2013-10-13
If I understand correctly, you are using the reconstructed gene trees, perhaps with posterior probabilities from MrBayes, to define clades/populations, so there is a selection bias, and you would expect bpp to tend to suggest those populations as distinct species. It is like multiple comparisons.
If you make splits at random, it sounds like bpp does not oversplit.
It is a good question how one should evaluate whether bpp oversplits.
Ziheng
I am not sure what the problem may be. Can you copy the last few lines of the screen output here?
ziheng
Hey guys.
I am using BPP with three loci for two populations. I have checked all my files several times, and the program always shows me the same error.
What does that mean?
I have already run it on Windows and Mac, and the error is the same.
Thanks in advance
P.S. I do not know if it matters, but my first locus is phased...