
pD: using median rather than mean?

  • Jack Tanner

    Jack Tanner - 2012-03-31

    For my models, when monitoring deviance and pD manually, I consistently get a
    couple of samples of pD that are extremely high. These look like outliers for
    the purposes of computing DIC as deviance + pD. (The deviance estimates do not
    show such outliers.) Is it kosher to use the median pD and median deviance
    instead of the mean pD and mean deviance?
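
    (For concreteness, a minimal rjags sketch of this kind of manual monitoring ---
    assuming m is an already-compiled jags.model with at least two parallel chains,
    since the pD monitor pools over chains:)

    library(rjags)
    load.module("dic")   # provides the "deviance" and "pD" monitors
    s <- jags.samples(m, c("deviance", "pD"), n.iter = 5000, type = "trace")
    dev <- as.vector(s$deviance)  # total deviance, one value per iteration/chain
    pd  <- as.vector(s$pD)        # pD is pooled over chains, one value per iteration
    summary(pd)                   # the occasional extreme draw shows up here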

     
  • Martyn Plummer

    Martyn Plummer - 2012-04-02

    Not really. The whole theory is based on expected utility. Although rare,
    those few samples that give high pD values are telling you something. Your
    model gives posterior support to two different parameter values that give
    inconsistent predictions.

     
  • Jack Tanner

    Jack Tanner - 2012-04-02

    Fair enough that high pD values should be meaningful, but I get pD values over
    200k for a model that should have no more than a few hundred parameters! The
    graph size that JAGS reports is about 3000, which, as I understand it,
    includes parameters and data. If it's not kosher to discard outlier
    estimates of pD, then I really wonder if JAGS is estimating pD correctly.

     
  • Gianluca Baio

    Gianluca Baio - 2024-06-18

    Can I please follow up on this very old thread? I have seen the same behaviour
    that Jack pointed out, which however doesn't happen when running the model with
    R2OpenBUGS. Is there an obvious reason why these outliers (massive values, in a
    very small number of simulations, for the parameters, which result in massive
    values for the deviance and thus for pV) aren't picked up by BUGS too?

     
  • Martyn Plummer

    Martyn Plummer - 2024-06-19

    This requires a little background to answer.

    JAGS and BUGS do slightly different things here. BUGS implements the original definition of pD = Dbar - Dhat, as described by Spiegelhalter et al in the original DIC paper. I did not think this was quite right. In the discussion of that paper I proposed an alternative definition of pD that solved some problems with the original formula. Later I wrote a paper rigorously justifying it. However - and this is the important part - I did not claim I was inventing anything new, just polishing the DIC idea. Hence I also called my penalty pD and the resulting penalized deviance DIC. This is the version of DIC implemented in JAGS. Most of the time it will be very similar to the original DIC. However, in some cases it is not.
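
    (For reference, the JAGS version of the penalty is what rjags::dic.samples
    returns --- a sketch, assuming m is a compiled model with at least two parallel
    chains:)

    library(rjags)
    d <- dic.samples(m, n.iter = 10000, type = "pD")  # Plummer-style penalty
    d                                  # mean deviance, penalty, penalized deviance
    sum(d$deviance) + sum(d$penalty)   # DIC = Dbar + pD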

    I think this phenomenon of extremely large pD values may be a convergence issue. JAGS estimates the pD penalty using parallel chains. But the estimator assumes that all chains have converged. If you have a multimodal posterior with one chain getting stuck on a mode, or perhaps one of the chains takes an excursion into the tail of the posterior distribution, then you might get this phenomenon.

    So, to come back to Jack's original suggestion, the best option for numerical stability might be to take a robust mean of pD to filter out the excursions.
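
    (For instance, with pd the vector of per-iteration pD samples --- a sketch, the
    trimming fraction being entirely up for debate:)

    mean(pd, trim = 0.005)   # discard the most extreme 0.5% in each tail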

     
  • Gianluca Baio

    Gianluca Baio - 2024-06-20

    Thank you, Martyn --- I had read your 2008 paper and have already gone back to
    it, so I was aware of the slight but fundamental differences in the way the two
    versions of pD (and hence DIC) are computed.

    I was curious about the issue with the few very extreme simulations for the
    deviance --- I can give more details, but ultimately, there's a handful of
    simulations where some of the main parameters get estimates that are way off
    the chart with respect to their full posterior. The underlying model is a
    simple logistic regression

    y[i] ~ dbin(pi[i], n[i])
    logit(pi[i]) <- alpha[i] + delta[i]*Trt[i]
    alpha[i] ~ dnorm(0,0.0625)
    ...
    

    I've tested various versions of the prior for the baseline alpha[i] in each
    study (this is essentially the "Magnesium" example of Spiegelhalter et al 2004).
    Here there's no need to go overboard and use alpha[i] ~ dnorm(0,0.00001), and
    the prior above is already "wide" enough --- but I think the issue is that it
    still induces some considerable conflict with the data (as the underlying
    observed y's are very small in all the studies).

    I think what happens is that JAGS finds a few simulations where the alpha[i]
    are drawn with values close to logit(0.5) = 0 (i.e. pi[i] close to 0.5),
    because the prior says this is possible; but as the data contradict this, the
    deviance shoots off to very large values.

    I think pV = var(deviance)/2 highlights this as it gets very large, which is actually helpful --- but, if you know what I mean, it kind of overdoes it...
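
    (In R terms, with dev the vector of monitored deviance draws:)

    pV <- var(dev) / 2   # a handful of huge draws inflates this quadratically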

    pD as computed by dic.samples gives much more reasonable values (for lack of a
    better word, which in fact may be "interpretable" in this case), so I don't
    think there's an issue in anything that JAGS does... And the value from
    dic.samples is aligned with the one obtained by running BUGS.

    I was just wondering why this behaviour (with the exact same model and data) just doesn't occur with BUGS, which seems more able to discard the (too wide) range suggested in the prior and thus have more stable estimates for the deviance...

    Ultimately, I think this is an interesting case --- the holistic approach is of
    course to go for a fuller evaluation. Even if I didn't know BUGS, I'd compute
    pD using dic.samples and would probably also use a rough BUGS-style estimate of
    pD, i.e.
    median(deviance) - deviance(mean parameter configuration)
    (with the latter term using the full likelihood etc.) --- with the caveat you
    mention, Martyn, that this is really not what the theoretical rationale would
    suggest you do... (But I think as an approximation it does OK...)

    It may be confusing for the practitioner --- the whole reason I got into this
    was preparing a relatively complex example for my students/book...

    When I get a sec, I'll test with yet another prior in which the alpha[i] are
    not centered around 0.

    Does the above make sense?

     
  • Gianluca Baio

    Gianluca Baio - 2024-06-21

    OK, just in case this is interesting for people, I've done a bit more exploration on this...

    So, the model is something like

    no_pooling=function(){
      for(i in 1:S) {
        # Sampling distributions
        r1[i] ~ dbin(pi1[i], n1[i]) 
        r2[i] ~ dbin(pi2[i], n2[i])
        logit(pi1[i]) <- alpha[i]
        logit(pi2[i]) <- alpha[i] + delta[i]
        # Baselines (NB: sd=4 => precision = 0.0625)
        alpha[i] ~ dnorm(0,0.0625)
        # Study-specific treatment effects (sd=2 => precision = 0.25)
        delta[i] ~ dnorm(0,0.25)
      }
    }
    

    and the data are from Spiegelhalter et al (2002) --- the "magnesium" meta-analysis

    data=list(
    r1=c(2,23,7,1,8,9,3,1,11,7,12,13,8,118,17,2103),
    n1=c(36,135,200,46,148,56,23,21,75,27,80,33,122,1157,108,29039),
    r2=c(1,9,2,1,10,1,1,0,6,1,2,5,4,90,4,2216),
    n2=c(40,135,200,48,150,59,25,22,76,27,89,23,130,1159,107,29011),
    S=16
    )
    

    (yes: the last study is waaaaaay bigger than the others); and then

    mjags=R2jags::jags(
      data=data,
      parameters.to.save=c(
        "alpha","delta"
      ),
      inits=function(){
        list(alpha=rnorm(16,0,4),delta=rnorm(16,0,2))
      },n.chains=2,n.iter=40000,n.burnin=38000,n.thin=1,DIC=TRUE,
      model.file=no_pooling
    )
    
    mbugs=R2OpenBUGS::bugs(
      data=data,
      parameters.to.save=c(
        "alpha","delta"
      ),
      inits=function(){
        list(alpha=rnorm(16,0,4),delta=rnorm(16,0,2))
      },n.chains=2,n.iter=40000,n.burnin=38000,n.thin=1,DIC=TRUE,
      model.file=no_pooling
    )
    

    --- I have purposely set the options so that, I think, the two runs should be
    identical. In this setup, as you say, Martyn, convergence is poor --- but it's
    especially bad with JAGS. In general, increasing the burn-in and thinning, or
    increasing the total number of iterations, does help. BUT: convergence is never
    as good in JAGS as it is in BUGS...
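
    (A quick way to look at this, sketched with coda --- R2jags fits convert to an
    mcmc.list:)

    library(coda)
    mc <- as.mcmc(mjags)                   # one mcmc object per chain
    gelman.diag(mc, multivariate = FALSE)  # Rhat, parameter by parameter
    R2jags::traceplot(mjags, varname = "deviance", ask = FALSE)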

    I am not sure whether there's an artificial effect in either, though --- is it
    BUGS making it look better than it is? Or is it JAGS failing to reach full
    convergence when BUGS can?

    As I had originally seen, if you improve the fit, say using n.burnin=4000 and
    n.thin=18 (or something even better in terms of running for longer), then
    eventually there are only <5 "bad" simulations in the JAGS run. This means that
    the estimates for the model parameters are basically identical in terms of the
    95% posterior interval (because that excludes the massive outliers). But the
    deviance and pV are still affected, of course, so, whatever I do, pV gets
    estimated to a very large number by JAGS...
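
    (If I read the two output objects right, the penalties to compare sit in:)

    mjags$BUGSoutput$pD   # R2jags reports the variance-based penalty, i.e. pV
    mbugs$pD              # R2OpenBUGS reports the original Dbar - Dhat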

    Interestingly, dic.samples() manages to retrieve a decent value for pD --- very
    much aligned with the one returned by BUGS, and sensible with respect to the
    nominal number of parameters (32).
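
    (In case it's useful, I believe the underlying model object survives inside the
    R2jags fit, so --- assuming mjags$model is still live --- the call is just:)

    rjags::dic.samples(mjags$model, n.iter = 2000, type = "pD")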

    If I change the model and consider a simple hierarchical version (where the delta parameters are exchangeable), then things are fine --- the model doesn't have any convergence issue and all the estimates are OK.

     
  • Martyn Plummer

    Martyn Plummer - 2024-06-21

    Thanks for looking into this. This is due to a bug in R2jags which means that the burn-in phase is completely skipped. It does not matter what you specify for n.burnin, the samples will be collected from the first iteration. This means that there is an extremely large transient at the beginning of the samples which inflates the apparent variance of the deviance.
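
    A sketch of a workaround until that is fixed: drive the burn-in explicitly with
    rjags, reusing the model function and data from above (write.model comes from
    R2OpenBUGS):

    library(rjags)
    load.module("dic")                          # so "deviance" can be monitored
    R2OpenBUGS::write.model(no_pooling, "no_pooling.txt")
    m <- jags.model("no_pooling.txt", data = data, n.chains = 2,
                    inits = function() list(alpha = rnorm(16, 0, 4),
                                            delta = rnorm(16, 0, 2)))
    update(m, 38000)                            # the burn-in actually runs here
    s <- coda.samples(m, c("alpha", "delta", "deviance"), n.iter = 2000)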

     
  • Gianluca Baio

    Gianluca Baio - 2024-06-21

    Yes, I figured that too! I think I've fixed the R2jags function --- I'll send
    them a PR on their GitHub.

     
