For my models, when monitoring deviance and pD manually, I consistently get a
couple of samples of pD that are extremely high. These seem like outliers for
purposes of computing DIC as deviance + pD. (Deviance estimates do not get
such outliers.) Is it kosher to use median pD and median deviance instead of
mean pD and mean deviance?
Not really. The whole theory is based on expected utility. Although rare,
those few samples that give high pD values are telling you something. Your
model gives posterior support to two different parameter values that give
inconsistent predictions.
Fair enough that high pD values should be meaningful, but I get pD values over
200k for a model that should have no more than a few hundred parameters! The
graph size that JAGS reports is about 3000, which, as I understand it,
includes parameters and data. If it's not kosher to discard outlier
estimates of pD, then I really wonder if JAGS is estimating pD correctly.
Can I please follow up on this very old thread? I have seen the same behaviour pointed out by Jack, which however doesn't happen when running the model with R2OpenBUGS. Is there an obvious reason why these outliers (massive values in a very small number of simulations for the parameters, which result in massive values for the deviance and thus pV) aren't picked up by BUGS too?
This requires a little background to answer.
JAGS and BUGS do slightly different things here. BUGS implements the original definition of pD = Dbar - Dhat, as described by Spiegelhalter et al. in the original DIC paper. I did not think this was quite right. In the discussion of that paper I proposed an alternative definition of pD that solved some problems with the original formula. Later I wrote a paper rigorously justifying it. However - and this is the important part - I did not claim I was inventing anything new, just polishing the DIC idea. Hence I also called my penalty pD and the resulting penalized deviance DIC. This is the version of DIC implemented in JAGS. Most of the time it will be very similar to the original DIC. However, in some cases it is not.
I think this phenomenon of extremely large pD values may be a convergence issue. JAGS estimates the pD penalty using parallel chains. But the estimator assumes that all chains have converged. If you have a multimodal posterior with one chain getting stuck on a mode, or perhaps one of the chains takes an excursion into the tail of the posterior distribution, then you might get this phenomenon.
So, to come back to Jack's original suggestion, the best option for numerical stability might be to take a robust mean of pD to filter out the excursions.
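To make the two classical penalties concrete, here is a toy sketch in plain Python (an invented normal-mean model, not JAGS output; the penalty that JAGS's dic.samples reports is estimated differently, from parallel chains, and is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: normal likelihood with known sigma = 1, unknown mean theta
y = rng.normal(0.5, 1.0, size=20)

# Stand-in for MCMC output: draws from the flat-prior posterior of theta
theta = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=5000)

def deviance(t):
    # -2 * log-likelihood of the data at theta = t
    return -2 * np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - t) ** 2)

D = np.array([deviance(t) for t in theta])

Dbar = D.mean()                # posterior mean deviance
Dhat = deviance(theta.mean())  # deviance at the posterior mean
pD = Dbar - Dhat               # original Spiegelhalter et al. penalty (BUGS)
pV = D.var() / 2               # variance-based penalty discussed in this thread
DIC = Dbar + pD

# With a single parameter, both penalties come out close to 1
print(round(pD, 2), round(pV, 2))
```

When the chains have converged and the posterior is well behaved, the two penalties agree closely, which is why the discrepancies described below are informative.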
Thank you, Martyn --- I had read your 2008 paper and have gone back to it already, so I was aware of the slight but fundamental differences in the way in which the two versions of pD (and hence DIC) are computed.
I was curious about the issue with the few very extreme simulations for the deviance --- I can give more details, but ultimately, there's a handful of simulations where some of the main parameters get estimates that are way off the chart with respect to their full posterior. The underlying model is a simple logistic regression.
I've tested various versions of the prior for the baseline alpha[i] in each study (that's essentially the "Magnesium" example of Spiegelhalter et al 2004). Here there's no need to go overboard and use alpha[i] ~ dnorm(0,0.00001), and the prior above is already "wide" enough --- but I think the issue is that it still induces some considerable conflict with the data (for all the studies, the underlying observed y's are very small).
I think what happens is that JAGS finds a few simulations where alpha[i] are drawn with values closer to logit(0.5), because the prior says this is possible, but as the data contradict this the deviance shoots off to very large values.
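That mechanism is easy to check outside any sampler. A hedged toy calculation (made-up counts of the rare-event kind described above, not the actual magnesium data) shows how a single draw of alpha[i] near 0, i.e. an event probability near 0.5, makes the binomial deviance explode:

```python
import math

def binom_deviance(y, n, p):
    # -2 * log-likelihood of one binomial observation y ~ Bin(n, p)
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return -2 * (log_choose + y * math.log(p) + (n - y) * math.log(1 - p))

y, n = 5, 1000  # hypothetical study: 5 events in 1000 patients

d_typical = binom_deviance(y, n, 0.005)  # p near the observed event rate
d_outlier = binom_deviance(y, n, 0.5)    # p = 0.5, i.e. alpha drawn near 0

print(round(d_typical, 1), round(d_outlier, 1))  # the second is enormous
```

A few such draws are enough to dominate the sample variance of the deviance even when the rest of the chain behaves perfectly.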
I think pV = var(deviance)/2 highlights this as it gets very large, which is actually helpful --- but, if you know what I mean, it kind of overdoes it...
pD as computed by dic.samples gives a much more reasonable value (for lack of a better word, which in fact may be "interpretable", in this case), so I don't think there's an issue in anything that JAGS does... And the value from dic.samples is aligned with that obtained by running BUGS.
I was just wondering why this behaviour (with the exact same model and data) just doesn't occur with BUGS, which seems more able to discard the (too wide) range suggested in the prior and thus have more stable estimates for the deviance...
Ultimately, I think that's an interesting case --- a holistic approach is of course to go to a fuller evaluation --- even if I didn't know BUGS, I'd compute pD using dic.samples and probably would use a rough estimate for pD a la BUGS using
median(deviance) - deviance(mean parameter configuration)
(with the latter term using the full likelihood etc) --- with the caveat you mention, Martyn, that this is really not what the theoretical rationale would suggest you do... (But I think as an approx it does OK...).
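As a sanity check on that approximation, here is a toy sketch (plain Python with an invented normal model, not the magnesium example) comparing the mean-based and median-based plug-in estimates when a handful of simulations take an excursion; the median version stays stable, though, as cautioned, it is not what the theory asks for and here is in fact biased low:

```python
import numpy as np

rng = np.random.default_rng(7)

y = rng.normal(0.0, 1.0, size=30)

def deviance(t):
    # -2 * log-likelihood of the data at theta = t (sigma fixed at 1)
    return -2 * np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - t) ** 2)

# Well-behaved posterior draws, plus 5 "excursion" draws far out in the tail
good = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=4995)
bad = rng.normal(8.0, 0.1, size=5)
theta = np.concatenate([good, bad])

D = np.array([deviance(t) for t in theta])
Dhat = deviance(theta.mean())

pD_mean = D.mean() - Dhat        # standard plug-in: dragged up by the outliers
pD_median = np.median(D) - Dhat  # robust variant discussed above
pV = D.var() / 2                 # hit hardest of all by the excursions

print(round(pD_mean, 2), round(pD_median, 2), round(pV, 1))
```

With only 5 outlying draws out of 5000, pV ends up orders of magnitude above the effective number of parameters, mirroring the behaviour reported in this thread.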
It may be confusing for the practitioner --- the whole reason I got into this was preparing a relatively complex example for my students/book...
When I get a sec, I'll test with yet another prior in which alpha[i] are not centered around 0.
Does the above make sense?
OK, just in case this is interesting for people, I've done a bit more exploration on this...
So, the model is something like
and the data are from Spiegelhalter et al (2002) --- the "magnesium" meta-analysis
(yes: the last study is waaaaaay bigger than the others); and then
--- I have purposely set the options so that, I think, the two models should be identical. In this setup, as you say Martyn, convergence is poor --- but it's especially bad with JAGS. In general, increasing the burn-in and thinning or increasing the total number of iterations does help. BUT: convergence is never as good in JAGS as it is in BUGS...
I am not sure whether there's an artificial effect in either though --- is it BUGS making it look better than it is? Or is it JAGS failing to get full convergence when BUGS can?
As I had originally seen, if you improve the fit, say using n.burnin=4000 and n.thin=18 (or something even better in terms of running for longer), then eventually there's only <5 "bad" simulations in the JAGS run. Which means that the estimates for the model parameters are basically identical in terms of the 95% posterior interval (because they exclude the massive outliers). But the deviance and p_V are still affected of course, so, whatever I do, p_V gets estimated to a very large number by JAGS...
Interestingly, dic.samples() manages to retrieve a decent value for p_D --- very much aligned with the one returned by BUGS and sensible with respect to the nominal number of parameters (32).
If I change the model and consider a simple hierarchical version (where the delta parameters are exchangeable), then things are fine --- the model doesn't have any convergence issue and all the estimates are OK.
Thanks for looking into this. This is due to a bug in R2jags which means that the burn-in phase is completely skipped. It does not matter what you specify for n.burnin, the samples will be collected from the first iteration. This means that there is an extremely large transient at the beginning of the samples which inflates the apparent variance of the deviance.
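The mechanism is easy to reproduce. A hedged sketch (a toy AR(1) trace standing in for a sampler's deviance monitor, not R2jags itself) shows how keeping the initial transient inflates pV while discarding a burn-in does not:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "deviance trace": an AR(1) chain started far from its stationary
# distribution, standing in for a run whose burn-in was skipped
n_iter, rho = 6000, 0.95
trace = np.empty(n_iter)
trace[0] = 500.0  # wildly over-dispersed starting value
for t in range(1, n_iter):
    # stationary distribution is N(10, 1 / (1 - rho**2))
    trace[t] = 10 + rho * (trace[t - 1] - 10) + rng.normal()

pV_no_burnin = trace.var() / 2      # transient included, as with the bug
pV_burnin = trace[1000:].var() / 2  # transient discarded

print(round(pV_no_burnin, 1), round(pV_burnin, 1))
```

The transient decays quickly, so the posterior summaries of the parameters can look almost unaffected while the variance of the deviance, and hence pV, is inflated by an order of magnitude or more.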
Yes, I figured that too! I think I've fixed the R2jags function --- I'll send them a PR on their GitHub.