## using the zeros trick with censored/truncated data document.SUBSCRIPTION_OPTIONS = { "thing": "thread", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

2013-07-10
2013-07-11
• Charlotte Boyd - 2013-07-10

I am working on some analysis that involves truncated Gaussian random fields. After reading the JAGS manual, I’m not clear whether these data should be regarded as truncated or censored. It’s probably easiest to explain using a rainfall example. We want to analyze the spatial distribution of rainfall, but rainfall less than some known threshold value is difficult to measure and is recorded as 0, yet we want to include these censored/truncated values in our analysis.

I started by developing a non-spatial toy model in JAGS using both the T() function and the dinterval() function. Assuming y = log(rainfall) is normally-distributed, I generated a standard normal variable, truncated/censored a percentage of lower values. I then used JAGS to sample from the posterior distributions of the mean and variance. The following code produces fairly unbiased estimates:

for(i in 1:N ) {
is.observed[i] ~ dinterval(y[i], c[i]);
y[i] ~ dnorm(mu, tau);}

But, for the spatial analysis, I need to use a non-standard likelihood function, so I also ran tests using the zeros trick. Both sets of models lead to consistently biased estimates of the parameters – the mean is always over-estimated and the standard deviation is always underestimated.

If I use the zeros trick on uncensored/truncated data, I again get fairly unbiased estimates. So the bias seems to be a function of the combination of censored/truncated data and the zeros trick. If anyone has any suggestions on how to address this, I’d would very much appreciate it. I'd also appreciate any advice on whether the dinterval() or T() function is more appropriate here.

(Note: In the spatial analysis, I’m interested in estimating the parameters of the spatial process from data that are known to be drawn from a standard normal distribution, hence I’m comfortable using y[i] ~ dnorm(0, 1) in the code below. Samples from the posterior distributions of the spatial parameters also seem to be biased, although there are likely additional factors at work here, in particular the influence of spatial autocorrelation on the posterior distributions of censored/truncated values.)

## Simulate some data

set.seed(6)
N <- 300
Y <- rnorm(N)
Y <- (Y-mean(Y))/sd(Y)

qp <- qnorm(0.4) # -0.2533471
y.upper <- Y[Y > qp]
y.lower <- Y[Y <= qp]

N.upper <- length(y.upper)
N.lower <- length(y.lower)
N.upper # 180

y.lower <- rep(NA, N.lower)
y <- c(y.upper, y.lower)

# The model

model = cat("
model
{
tau ~ dgamma(0.1,0.1);
sigma <- 1/sqrt(tau);
mu ~ dnorm(0,1);

```for(i in 1:N ) {
isObserved[i] ~ dinterval(y[i], limitvec[i]);
y[i] ~ dnorm(0, 1); # y ~ N(0,1)

zeroes[i] ~ dpois(phi[i]) # likelihood is exp(- phi[i])
phi[i] <- 0.5 * log(pow(sigma,2)) + 0.5 * pow((y[i] - mu) / sigma, 2) + C;} # NLL + C
```

}
",file="censored_model.txt")

# Set up simulations

C <- 10
zeroes <- rep(0,N)

limitvec <- rep(qp, N)
isObserved <- c(rep(TRUE,N.upper), rep(FALSE,N.lower))

# Initialise the chains

muInit = 0
tauInit = 1
yInit = rep(qp-1, N)
yInit[isObserved] = NA
jags.inits <- function(){list("mu"=muInit, "tau" = tauInit, "y"=yInit)}

# MCMC parameters

mcmcchains <- 1
mcmcthin <- 10
mcmcburn <- 1000
samples2Save <- 1000 * mcmcthin

jags.data = list(y = y, N = N, isObserved = as.numeric(isObserved), limitvec = limitvec, zeroes = zeroes, C = C)
jags.params = c("mu", "sigma")
model.location = paste("censored_model.txt",sep="")

# Run model

jags.model <- jags(jags.data, inits=jags.inits, parameters.to.save = jags.params, model.file=model.location, n.chains = mcmcchains, n.thin = mcmcthin, n.burnin=mcmcburn, n.iter =(mcmcburn+samples2Save), DIC = F)

# The model

model = cat("
model
{
tau ~ dgamma(0.1,0.1);
sigma <- 1/sqrt(tau);
mu ~ dnorm(0,1);

```for(i in 1:N.lower) {y.lower[i] ~ dnorm(0,1)T(,qp);} # y ~ N(0,1)

# Data plus censored 'data'
Y[1:N.upper] <- y.upper;
Y[(N.upper+1):N] <- y.lower;

for(i in 1:N) {
zeroes[i] ~ dpois(phi[i]) # likelihood is exp(- phi[i])
phi[i] <- 0.5 * log(pow(sigma,2)) + 0.5 * pow((Y[i] - mu) / sigma, 2) + C;} # NLL + C
```

}

",file="truncated_model.txt")

# Set up simulations

C <- 10
zeroes <- rep(0,N)

# MCMC parameters

mcmcchains <- 1
mcmcthin <- 10
mcmcburn <- 1000
samples2Save <- 1000 * mcmcthin

jags.data = list("qp", "y.upper", "N", "N.upper", "N.lower","zeroes", "C")
jags.params = c("mu", "sigma")
model.location = paste("truncated_model.txt",sep="")

# Run model

jags.model <- jags(jags.data, inits=NULL, parameters.to.save = jags.params, model.file=model.location, n.chains = mcmcchains, n.thin = mcmcthin, n.burnin=mcmcburn, n.iter =(mcmcburn+samples2Save), DIC = F)

• Martyn Plummer - 2013-07-11

Your data are censored, not truncated. So forget about the second model.

The key to understanding the difference between censoring and truncation is to consider what a replicate data set might look like. Since your data are simulated, this is easy. If you remove the random seed, then every time you rerun the code that generates the data, different elements of `Y` will fall below the threshold `qp` and will be set to `NA`. This makes your data censored. If every time you generated a new set of data the same elements were above and below the threshold, your data would be truncated.

The reason your first model is not working is that it does not have the correct likelihood. Here is a version that is correct.

```model {
tau ~ dgamma(0.1, 0.1)
sigma <- 1/sqrt(tau)
mu ~ dnorm(0,1)

for(i in 1:N ) {
zeroes[i] ~ dpois(phi[i]) # likelihood is exp(- phi[i])

Y[i] <- ifelse(isObserved[i], y[i], qp)

phi[i] <- -log(ifelse(isObserved[i],
dnorm(Y[i], mu, sigma),
pnorm(Y[i], mu, sigma))) + C
}
}
```

First we reconstruct the data `Y` setting `Y[i]` to be `y[i]` for the observed values, and to be the censoring threshold `qp` for the censored values. Then we define the likelihood as the density of `y[i]` for the observed values, and the tail probability of `qp` for the censored values. Here I'm using the built-in functions `dnorm` and `pnorm` for the normal density and tail probability respectively.

Note that there is no attempt to simulate the unobserved values of `y[i]`. This is just as well, because we use the zeros trick precisely in those situations when we can't define how the data are generated in the BUGS language.

So it is possible to work with censored data, but the downside is that you need an analytic expression for the tail probability. This might make the technique somewhat limited in application.

• Charlotte Boyd - 2013-07-11

Hi Martyn

Many thanks indeed for both the clarification of censored vs truncated data and for the corrections to the model.

For the spatial analysis, my challenge is that I actually need to simulate the censored values of y[i] in order to estimate the parameters of the Gaussian random field over a complete rather than censored process. But, your response has helped me to rethink my approach to this problem.

Many thanks

Charlotte