using the zeros trick with censored/truncated data

2013-07-10
2013-07-11
  • Charlotte Boyd

    Charlotte Boyd - 2013-07-10

    I am working on some analysis that involves truncated Gaussian random fields. After reading the JAGS manual, I’m not clear whether these data should be regarded as truncated or censored. It’s probably easiest to explain using a rainfall example. We want to analyze the spatial distribution of rainfall, but rainfall less than some known threshold value is difficult to measure and is recorded as 0, yet we want to include these censored/truncated values in our analysis.

    I started by developing a non-spatial toy model in JAGS using both the T() function and the dinterval() function. Assuming y = log(rainfall) is normally-distributed, I generated a standard normal variable, truncated/censored a percentage of lower values. I then used JAGS to sample from the posterior distributions of the mean and variance. The following code produces fairly unbiased estimates:

    for(i in 1:N ) {
    is.observed[i] ~ dinterval(y[i], c[i]);
    y[i] ~ dnorm(mu, tau);}

    But, for the spatial analysis, I need to use a non-standard likelihood function, so I also ran tests using the zeros trick. Both sets of models lead to consistently biased estimates of the parameters – the mean is always over-estimated and the standard deviation is always underestimated.

    If I use the zeros trick on uncensored/truncated data, I again get fairly unbiased estimates. So the bias seems to be a function of the combination of censored/truncated data and the zeros trick. If anyone has any suggestions on how to address this, I’d would very much appreciate it. I'd also appreciate any advice on whether the dinterval() or T() function is more appropriate here.

    Many thanks in advance

    (Note: In the spatial analysis, I’m interested in estimating the parameters of the spatial process from data that are known to be drawn from a standard normal distribution, hence I’m comfortable using y[i] ~ dnorm(0, 1) in the code below. Samples from the posterior distributions of the spatial parameters also seem to be biased, although there are likely additional factors at work here, in particular the influence of spatial autocorrelation on the posterior distributions of censored/truncated values.)

    Simulate some data

    set.seed(6)
    N <- 300
    Y <- rnorm(N)
    Y <- (Y-mean(Y))/sd(Y)

    qp <- qnorm(0.4) # -0.2533471
    y.upper <- Y[Y > qp]
    y.lower <- Y[Y <= qp]

    N.upper <- length(y.upper)
    N.lower <- length(y.lower)
    N.upper # 180

    y.lower <- rep(NA, N.lower)
    y <- c(y.upper, y.lower)

    Using dinterval():

    The model

    model = cat("
    model
    {
    tau ~ dgamma(0.1,0.1);
    sigma <- 1/sqrt(tau);
    mu ~ dnorm(0,1);

    for(i in 1:N ) {
        isObserved[i] ~ dinterval(y[i], limitvec[i]);
        y[i] ~ dnorm(0, 1); # y ~ N(0,1)
    
        zeroes[i] ~ dpois(phi[i]) # likelihood is exp(- phi[i])
        phi[i] <- 0.5 * log(pow(sigma,2)) + 0.5 * pow((y[i] - mu) / sigma, 2) + C;} # NLL + C
    

    }
    ",file="censored_model.txt")

    Set up simulations

    C <- 10
    zeroes <- rep(0,N)

    limitvec <- rep(qp, N)
    isObserved <- c(rep(TRUE,N.upper), rep(FALSE,N.lower))

    Initialise the chains

    muInit = 0
    tauInit = 1
    yInit = rep(qp-1, N)
    yInit[isObserved] = NA
    jags.inits <- function(){list("mu"=muInit, "tau" = tauInit, "y"=yInit)}

    MCMC parameters

    mcmcchains <- 1
    mcmcthin <- 10
    mcmcburn <- 1000
    samples2Save <- 1000 * mcmcthin

    jags.data = list(y = y, N = N, isObserved = as.numeric(isObserved), limitvec = limitvec, zeroes = zeroes, C = C)
    jags.params = c("mu", "sigma")
    model.location = paste("censored_model.txt",sep="")

    Run model

    jags.model <- jags(jags.data, inits=jags.inits, parameters.to.save = jags.params, model.file=model.location, n.chains = mcmcchains, n.thin = mcmcthin, n.burnin=mcmcburn, n.iter =(mcmcburn+samples2Save), DIC = F)

    Using T():

    The model

    model = cat("
    model
    {
    tau ~ dgamma(0.1,0.1);
    sigma <- 1/sqrt(tau);
    mu ~ dnorm(0,1);

    for(i in 1:N.lower) {y.lower[i] ~ dnorm(0,1)T(,qp);} # y ~ N(0,1)
    
    # Data plus censored 'data'
    Y[1:N.upper] <- y.upper;
    Y[(N.upper+1):N] <- y.lower;
    
    for(i in 1:N) {
        zeroes[i] ~ dpois(phi[i]) # likelihood is exp(- phi[i])
        phi[i] <- 0.5 * log(pow(sigma,2)) + 0.5 * pow((Y[i] - mu) / sigma, 2) + C;} # NLL + C
    

    }

    ",file="truncated_model.txt")

    Set up simulations

    C <- 10
    zeroes <- rep(0,N)

    MCMC parameters

    mcmcchains <- 1
    mcmcthin <- 10
    mcmcburn <- 1000
    samples2Save <- 1000 * mcmcthin

    jags.data = list("qp", "y.upper", "N", "N.upper", "N.lower","zeroes", "C")
    jags.params = c("mu", "sigma")
    model.location = paste("truncated_model.txt",sep="")

    Run model

    jags.model <- jags(jags.data, inits=NULL, parameters.to.save = jags.params, model.file=model.location, n.chains = mcmcchains, n.thin = mcmcthin, n.burnin=mcmcburn, n.iter =(mcmcburn+samples2Save), DIC = F)

     
  • Martyn Plummer

    Martyn Plummer - 2013-07-11

    Your data are censored, not truncated. So forget about the second model.

    The key to understanding the difference between censoring and truncation is to consider what a replicate data set might look like. Since your data are simulated, this is easy. If you remove the random seed, then every time you rerun the code that generates the data, different elements of Y will fall below the threshold qp and will be set to NA. This makes your data censored. If every time you generated a new set of data the same elements were above and below the threshold, your data would be truncated.

    The reason your first model is not working is that it does not have the correct likelihood. Here is a version that is correct.

    model {
       tau ~ dgamma(0.1, 0.1)
       sigma <- 1/sqrt(tau)
       mu ~ dnorm(0,1)
    
       for(i in 1:N ) {
          zeroes[i] ~ dpois(phi[i]) # likelihood is exp(- phi[i])
    
          Y[i] <- ifelse(isObserved[i], y[i], qp)
    
          phi[i] <- -log(ifelse(isObserved[i], 
                         dnorm(Y[i], mu, sigma),
                         pnorm(Y[i], mu, sigma))) + C
       }
    }
    

    First we reconstruct the data Y setting Y[i] to be y[i] for the observed values, and to be the censoring threshold qp for the censored values. Then we define the likelihood as the density of y[i] for the observed values, and the tail probability of qp for the censored values. Here I'm using the built-in functions dnorm and pnorm for the normal density and tail probability respectively.

    Note that there is no attempt to simulate the unobserved values of y[i]. This is just as well, because we use the zeros trick precisely in those situations when we can't define how the data are generated in the BUGS language.

    So it is possible to work with censored data, but the downside is that you need an analytic expression for the tail probability. This might make the technique somewhat limited in application.

     
  • Charlotte Boyd

    Charlotte Boyd - 2013-07-11

    Hi Martyn

    Many thanks indeed for both the clarification of censored vs truncated data and for the corrections to the model.

    For the spatial analysis, my challenge is that I actually need to simulate the censored values of y[i] in order to estimate the parameters of the Gaussian random field over a complete rather than censored process. But, your response has helped me to rethink my approach to this problem.

    Many thanks

    Charlotte

     

Log in to post a comment.

Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:





No, thanks