## Treating a Derived Quantity as Data

• Nathan Galloway
2013-12-19

I am having some trouble using the derived sum of an estimated latent state (z) to estimate the population-level probability that z = 1.

The general model layout is a logistic regression with observation error. Observed data, y, is an imperfect detection of latent state, z. I have a single categorical covariate, x1.

Model specification:

for (i in 1:n){
  y[i] ~ Binomial(z[i], p.obs)
  z[i] ~ Bernoulli(ilogit(X[i, ] %*% beta))
}
beta ~ Normal(0, 1.5)
p.obs ~ Beta(1, 1)
sum(z) ~ Binomial(n, prob)

```
# Code:
# Example Model for SourceForge

# setwd('')

# Simulate Data
n = 100

# Single categorical covariate
x1 = rbinom(n, 1, 0.2)
X = model.matrix(~x1)

# Transformations
logit <- function(p) log(p/(1-p))
expit <- function(x) exp(x)/(exp(x)+1)

# Logistic Regression Coefficients
beta = c(-1, 1)
# P(z=1)
p = expit(X %*% beta)
# Latent State
z = rbinom(n, 1, p)
# Data, with observation error
y = rep(0, n)
p.obs = .7
y[z==1] = rbinom(sum(z==1), 1, p.obs)

data = list(y = y, X = X, n = n)
inits = list(beta=c(0,0), p.obs = 0.5, prob=sum(y)/n)

sink('JAGS.r')
cat(" model{
beta[1] ~ dnorm(0, 1.5)  # note: dnorm's second argument is a precision in JAGS
beta[2] ~ dnorm(0, 1.5)
p.obs ~ dbeta(1, 1)
prob ~ dbeta(1, 1)

for (i in 1:n){
p[i] <- ilogit(X[i, ] %*% beta)
z[i] ~ dbern(p[i])
y[i] ~ dbern(p.obs)
}
sum(z) ~ dbin(n, prob)
}
", fill = TRUE)

sink()

library(rjags)
library(runjags)
library(coda)

n.update= 1000
n.iter= 10000

jm = jags.model('JAGS.r', data = data, inits = inits)
update(jm, n.iter = n.update)

cm = coda.samples(jm, variable.names = c('beta', 'p.obs', 'prob'),
n.iter = n.iter)
# End Code
```

1) Treat the estimate of z as data within each iteration and simply model P(z = 1) as binomial. This is what I attempted to do above, but JAGS doesn't like sum() on the left-hand side of a stochastic relation. A colleague says he has run into this sort of problem many times because JAGS doesn't allow a node to be defined more than once, e.g.:
count <- sum(z)
count ~ dbin(n, prob)
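(To spell out the colleague's point in JAGS terms: every node may have exactly one defining relation, so a node created with `<-` cannot also appear on the left of `~`. Also note that JAGS's `dbin` takes the probability first and the size second. A sketch of the constructions the compiler rejects:

```
## Neither of these fragments will compile:
sum(z) ~ dbin(prob, n)   # a function call cannot be the subject of '~'

count <- sum(z)          # 'count' is defined here...
count ~ dbin(prob, n)    # ...and defined again here -> compile error
```
)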

2) Simply take the mean of p. Would this be equivalent? E.g.:
prob <- mean(p)

The background is that z is disease status and we want population-level prevalence, P(z=1).

Any feedback is appreciated.
Thanks,
Nathan

ps. I will be traveling for the holidays so please don't be too annoyed if I miss any posts next week.

Last edit: Nathan Galloway 2013-12-19

• Martyn Plummer
2014-01-14

I took a somewhat longer break over the holidays...

This is an interesting question. But before giving you the answer, I need to fix your model:

```
model{
beta[1] ~ dnorm(0, 1.5)
beta[2] ~ dnorm(0, 1.5)
###p.obs ~ dbeta(1, 1)
p.obs <- 0.7
prob ~ dbeta(1, 1)

for (i in 1:n){
p[i] <- ilogit(X[i, ] %*% beta)
z[i] ~ dbern(p[i])
y[i] ~ dbin(p.obs, z[i])
}
}
```

I've made two changes. Firstly, I have changed the definition of y so that it corresponds to the R code that generated the data. So if `z[i]==0` then we must have `y[i]=0`, and if `z[i]==1` then we have probability `p.obs` of observing `y[i]=1`.

The second thing I did is to fix `p.obs` to 0.7 (the true value). With these data you cannot estimate `p.obs` because there is no replication, i.e. to estimate `p.obs` you would need either a validation study with `y` and `z` both observed on the same individuals, or a reliability study with two independent measurements of `y` on the same individuals.

With these data, you have to assume `p.obs` is known or, as a sensitivity analysis, you can put a strongly informative prior around it, bearing in mind that any uncertainty will feed directly into the posterior. Trying to put a non-informative prior on `p.obs` will result in an unidentifiable model and no doubt convergence problems.
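For the sensitivity analysis suggested here, one minimal sketch is to replace the flat `dbeta(1, 1)` prior with a tightly concentrated Beta prior. The `dbeta(70, 30)` values below are illustrative choices of my own (mean 0.7, sd ≈ 0.046), not anything from the thread:

```
## Strongly informative prior for the detection probability,
## centred on the assumed true value 0.7
## (Beta(70, 30): mean 0.7, sd about 0.046 -- illustrative values)
p.obs ~ dbeta(70, 30)
```

Widening or narrowing this prior shows how much of its uncertainty feeds through into the posterior for the prevalence.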

Now, your question is how to estimate the prevalence in the population. There are three ways of interpreting this. The first two are:

1) Your `n` individuals constitute the whole population
2) Your `n` individuals represent a sample from a much larger population.

If your `n` individuals constitute the whole population then the population prevalence is, by definition, the mean of `z`. So you can add this derived quantity to your bugs model

```
p.hat1 <- mean(z)
```

and monitor `p.hat1`.

If these `n` individuals are a sub-sample of a larger population and you want to estimate the prevalence in this hyper-population then you have to work a bit harder. You need to simulate a new individual from the population who is not in the sample. To do this, we need to add a model for the covariate X.

```
## Model for covariate X[,2]
pi ~ dbeta(1, 1)
for (i in 1:n){
X[i,2] ~ dbern(pi)
}
```

Here I've added a new parameter `pi`, the prevalence of `X[,2]`.

Then we can simulate sampling a new individual from the population

```
## Sample new individual from the population
x2.new ~ dbern(pi)
## Prevalence in randomly drawn individual
p.hat2 <- ilogit(beta[1] + x2.new*beta[2])
```

Monitoring `p.hat2` gives you an estimate of the prevalence in the hyper population.

Both `p.hat1` and `p.hat2` will have the same posterior mean, but the variance of `p.hat2` will be larger, as it includes some sampling error (i.e. your `n` individuals are representative of the population, but do not describe it exactly).

3) The third possibility is to monitor

```
p.hat3 <- mean(p[])
```

This is not so easy to interpret. It represents the prevalence in a hypothetical replicate of this sample of `n` individuals, with the same values of `X` but (conditionally) independent values of `y`. It is a half-way house between the two other possibilities and so has an intermediate variance.
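Pulling these pieces together, the full corrected model with all three derived prevalence estimates might look like the sketch below (assembled from the fragments above, with `p.obs` fixed at its true value as recommended):

```
model{
  beta[1] ~ dnorm(0, 1.5)
  beta[2] ~ dnorm(0, 1.5)
  p.obs <- 0.7                # fixed: not estimable without replication

  ## Model for covariate X[,2]
  pi ~ dbeta(1, 1)
  for (i in 1:n){
    X[i,2] ~ dbern(pi)
  }

  for (i in 1:n){
    p[i] <- ilogit(X[i, ] %*% beta)
    z[i] ~ dbern(p[i])
    y[i] ~ dbin(p.obs, z[i])
  }

  ## 1) Prevalence if these n individuals are the whole population
  p.hat1 <- mean(z)

  ## 2) Prevalence in the hyper-population: simulate a new individual
  x2.new ~ dbern(pi)
  p.hat2 <- ilogit(beta[1] + x2.new*beta[2])

  ## 3) Prevalence in a hypothetical replicate of this sample
  p.hat3 <- mean(p[])
}
```

Monitoring `p.hat1`, `p.hat2`, and `p.hat3` (e.g. via `variable.names` in `coda.samples`) then gives all three estimates from a single run.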