Treating a Derived Quantity as Data

  • I am having some trouble using a derived sum of an estimated latent state (z) as data to estimate the population-level probability that z = 1.

    The general model layout is a logistic regression with observation error. Observed data, y, is an imperfect detection of latent state, z. I have a single categorical covariate, x1.

    Model Specification:
    for (i in 1:n){
    y[i] ~ Binomial(z[i], p.obs)
    z[i] ~ Bern(ilogit(X[i, ] %*% beta))
    }
    beta ~ N(0, 1.5)
    p.obs ~ Beta(1, 1)
    sum(z) ~ Binomial(n, prob)

    # Code:
    # Example Model for SourceForge
    
    # setwd('')
    
    # Simulate Data
    n = 100
    
    # Single categorical covariate
    x1 = rbinom(n, 1, 0.2)
    X = model.matrix(~x1)
    
    # Transformations
    logit <- function(p) log(p/(1-p))
    expit <- function(x) exp(x)/(exp(x)+1)
    
    # Logistic Regression Coefficients
    beta = c(-1, 1)
    # P(z=1)
    p = expit(X %*% beta)
    # Latent State
    z = rbinom(n, 1, p)
    # Data, with observation error
    y = rep(0, n)
    p.obs = .7
    y[z==1] = rbinom(sum(z==1), 1, p.obs)
    
    data = list(y = y, X = X, n = n)
    inits = list(beta=c(0,0), p.obs = 0.5, prob=sum(y)/n)
    
    sink('JAGS.R')
      cat(" model{
        beta[1] ~ dnorm(0, 1.5)
        beta[2] ~ dnorm(0, 1.5)
        p.obs ~ dbeta(1, 1)
        prob ~ dbeta(1, 1)
    
        for (i in 1:n){
          p[i] <- ilogit(X[i, ] %*% beta)
          z[i] ~ dbern(p[i])
          y[i] ~ dbern(p.obs)
        }
        sum(z) ~ dbin(prob, n)
      }
    ", fill = TRUE)
    
    sink()
    
    library(rjags)
    library(runjags)
    library(coda)
    
    n.adapt = 1000
    n.update = 1000
    n.iter = 10000
    
    # inits is a single list of starting values, so it is re-used for
    # each chain; n.chains must be set explicitly
    jm = jags.model("JAGS.R", data = data, inits = inits,
                    n.chains = 3, n.adapt = n.adapt)
    update(jm, n.iter=n.update)
    
    cm = coda.samples(jm, variable.names = c('beta', 'p.obs', 'prob'), 
                      n.iter = n.iter)
    # End Code
    

    I have had several ideas about this:

    1) Treat the estimate of z as data within each iteration and simply model P(z=1) as binomial. This is what I attempted to do above, but JAGS will not accept sum() on the left-hand side of a stochastic relation. A colleague says that he has run into this sort of problem many times because JAGS doesn't allow defining a node more than once (a minimal sketch follows after this list), e.g.:
    count <- sum(z)
    count ~ dbin(prob, n)

    2) Simply take the mean of p. Would this be equivalent? E.g.:
    prob <- mean(p)
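
    To make the failure in idea 1 concrete, here is a minimal sketch of the offending pattern (not working code; the commented-out line is the one JAGS rejects, because count would then be defined twice):

    model{
      ## Legal: a node may be defined once, here deterministically,
      ## and monitored like any other quantity.
      count <- sum(z)
      ## Illegal: the same node cannot also appear on the left-hand
      ## side of '~'. JAGS treats this as a second definition of
      ## 'count' and refuses to compile.
      # count ~ dbin(prob, n)
      for (i in 1:n){
        p[i] <- ilogit(X[i, ] %*% beta)
        z[i] ~ dbern(p[i])
      }
    }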

    The background is that z is disease status and we want population-level prevalence, P(z=1).

    Any feedback is appreciated.
    Thanks,
    Nathan

    PS: I will be traveling for the holidays, so please don't be too annoyed if I miss any posts next week.

     
    Last edit: Nathan Galloway 2013-12-19
  • Martyn Plummer
    2014-01-14

    I took a somewhat longer break over the holidays...

    This is an interesting question. But before giving you the answer, I need to fix your model:

    model{
        beta[1] ~ dnorm(0, 1.5)
        beta[2] ~ dnorm(0, 1.5)
        ###p.obs ~ dbeta(1, 1)
        p.obs <- 0.7
        prob ~ dbeta(1, 1)
    
        for (i in 1:n){
          p[i] <- ilogit(X[i, ] %*% beta)
          z[i] ~ dbern(p[i])
          y[i] ~ dbin(p.obs, z[i])
        }
    }
    

    I've made two changes. Firstly, I have changed the definition of y so that it corresponds to the R code that generated the data. So if z[i]==0 then we must have y[i]=0, and if z[i]==1 then we have probability p.obs of observing y[i]=1.
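
    A quick way to convince yourself in R (rbinom draws from the same distribution that dbin describes):

        ## With size 0 a binomial draw is always 0, so y[i] ~ dbin(p.obs, z[i])
        ## forces y[i] = 0 whenever z[i] = 0 ...
        rbinom(5, size = 0, prob = 0.7)          # 0 0 0 0 0
        ## ... and reduces to a Bernoulli(p.obs) draw whenever z[i] = 1
        mean(rbinom(1e5, size = 1, prob = 0.7))  # roughly 0.7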

    The second change is to fix p.obs at 0.7 (the true value). With these data you cannot estimate p.obs because there is no replication: to estimate p.obs you would need either a validation study, with y and z both observed on the same individuals, or a reliability study, with two independent measurements of y on the same individuals.
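
    If you could obtain such a validation sub-sample, handling it in JAGS would be simple: supply z as data with NA for the unvalidated individuals, since nodes with missing values are treated as unobserved. A hypothetical sketch, with m validated individuals:

        ## Hypothetical validation design: true status z is known for the
        ## first m individuals, NA (unobserved) for the rest. With this
        ## extra information, p.obs ~ dbeta(1, 1) becomes estimable.
        m <- 20
        z.partial <- ifelse(seq_len(n) <= m, z, NA)
        data <- list(y = y, X = X, n = n, z = z.partial)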

    With these data, you have to assume p.obs is known or, as a sensitivity analysis, you can put a strongly informative prior around it, bearing in mind that any uncertainty will feed directly into the posterior. Trying to put a non-informative prior on p.obs will result in an unidentifiable model and no doubt convergence problems.
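
    For example (the hyperparameters here are purely illustrative), the fixed value could be replaced by

        ## Strongly informative prior for the detection probability:
        ## Beta(70, 30) has mean 0.7 and sd of roughly 0.046. Vary the
        ## hyperparameters to see how much the prevalence estimates move.
        p.obs ~ dbeta(70, 30)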

    Now, your question is how to estimate the prevalence in the population. There are three ways of interpreting this. The first two are:

    1) Your n individuals constitute the whole population
    2) Your n individuals represent a sample from a much larger population.

    If your n individuals constitute the whole population then the population prevalence is, by definition, the mean of z. So you can add this derived quantity to your BUGS model:

        p.hat1 <- mean(z)
    

    and monitor p.hat1.
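
    On the R side nothing else changes; just add the new node to the monitor list, e.g.:

        ## Monitor the derived within-sample prevalence alongside the
        ## regression coefficients
        cm <- coda.samples(jm, variable.names = c("beta", "p.hat1"),
                           n.iter = n.iter)
        summary(cm)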

    If these n individuals are a sub-sample of a larger population and you want to estimate the prevalence in this hyper-population then you have to work a bit harder. You need to simulate a new individual from the population who is not in the sample. To do this, we need to add a model for the covariate X.

        ## Model for covariate X[2]
        pi ~ dbeta(1, 1)
        for (i in 1:n){
           X[i,2] ~ dbern(pi)
        }
    

    Here I've added a new parameter pi, the prevalence of the covariate X[,2].

    Then we can simulate sampling a new individual from the population

        ## Sample new individual from the population
        x2.new ~ dbern(pi)
        ## Prevalence in randomly drawn individual
        p.hat2 <- ilogit(beta[1] + x2.new*beta[2])
    

    Monitoring p.hat2 gives you an estimate of the prevalence in the hyper-population.

    Both p.hat1 and p.hat2 will have the same posterior mean, but the variance of p.hat2 will be larger, as it includes some sampling error (i.e. your n individuals are representative of the population, but do not describe it exactly).
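
    For reference, here is the whole model with both derived quantities in place (assembled from the fragments above, with p.obs fixed as discussed):

    model{
        beta[1] ~ dnorm(0, 1.5)
        beta[2] ~ dnorm(0, 1.5)
        p.obs <- 0.7
    
        ## Model for covariate X[,2]
        pi ~ dbeta(1, 1)
    
        for (i in 1:n){
          X[i,2] ~ dbern(pi)
          p[i] <- ilogit(X[i, ] %*% beta)
          z[i] ~ dbern(p[i])
          y[i] ~ dbin(p.obs, z[i])
        }
    
        ## 1) Prevalence when these n individuals are the whole population
        p.hat1 <- mean(z)
    
        ## 2) Prevalence in the hyper-population: a new individual
        ## sampled from outside the data
        x2.new ~ dbern(pi)
        p.hat2 <- ilogit(beta[1] + x2.new*beta[2])
    }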

    3) The third possibility is to monitor

        p.hat3 <- mean(p[])

    This is not so easy to interpret. It represents the prevalence in a hypothetical replicate of this sample of n individuals, with the same values of X but (conditionally) independent values of y. It is a half-way house between the other two possibilities and so has an intermediate variance.
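
    If p.hat3 <- mean(p[]) is added to the model as well, all three can be monitored together and the ordering of the posterior standard deviations checked directly:

        ## Compare the three prevalence summaries: the posterior sd should
        ## increase from p.hat1 (this population) through p.hat3 (replicate
        ## sample) to p.hat2 (new individual from the hyper-population)
        cm <- coda.samples(jm, variable.names = c("p.hat1", "p.hat2", "p.hat3"),
                           n.iter = n.iter)
        summary(cm)$statistics[, c("Mean", "SD")]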