Menu

Simple linear regression with Bernoulli submodel

Help
2014-08-20
2014-08-28
  • Stephen Gregory

    Stephen Gregory - 2014-08-20

    I have data on fish age and length collected over several years. Length has been recorded for all fish but age is not always recorded but it can be estimated from fish length.

    I want to look at changes in the mean length of the young-of-year (age 0) fish over time. If all fish were age 0, then I would have a simple linear regression:

    model {
      for (i in 1:n){
        y[i] ~ dnorm(mu[i], tau)
        mu[i] <- a + (b * x[i])
      }
      a ~ dnorm(0, 1e-3)
      b ~ dnorm(0, 1e-3)
      tau <- pow(sigma, -2)
      sigma ~ dunif(0, 100)
    }
    

    where y are observed lengths, x is years and n is the number of fish. (I will investigate a year random effect and hierarchical priors, among other things – this is just a simple illustration.)

    Fish, however, can be of any age in [0, 1, 2, 3]. Since fish age and length are related, I would like to include an age “submodel” in the mean length analysis to assign an age to unaged fish. By including an age submodel, my hope is to allow uncertainty in fish age to propagate to the mean length analysis.

    My idea is to convert the age problem to a Bernoulli problem, whereby fish of age 0 are coded 1 and all other fish are coded 0. The Bernoulli submodel would be something like:

    model {
      for (i in 1:n){
        w[i] ~ dbern(p[i])
        logit(p[i]) <- a + (b * x[i])
      }
      a ~ dnorm(0, 1e-3)
      b ~ dnorm(0, 1e-3)
    }
    

    where w are age codes, x is years and n is the number of fish. (Again, I will investigate random effects and hierarchical priors, etc.)

    I want to combine these models but I’m not sure how to do it. These are the ideas I’ve had:

    1. estimate the mean length likelihood for age 0 fish only using an approach similar to a hurdle model but for continuous data, or
    2. estimate the mean length likelihood for age 0 and age >0 fish separately but disregard the age >0 estimates, or
    3. estimate the mean length likelihood for age 0 fish only.

    My initial thoughts are: I don’t think that idea 1 is appropriate; idea 2 would require an indexing trick; and 3 I don’t know how to do this (hence ideas 1 and 2).

    Below is my closest solution using the jags ifelse function but I’m concerned that setting nu to 0 for all age >0 fish affects the estimation of sigma (and perhaps a1 and b1).

    model{
      for(i in 1:N){
        w[i] ~ dbern(p[i])
        logit(p[i]) <- a0 + (b0 * x1[i])
    
        test[i] <- step(w[i] - 1) + 1
    
        y[i] ~ dnorm(nu[i], tau)
        nu[i] <- ifelse(test[i] == 1, y.hat[i], 0)
        y.hat[i] <- a1[test[i]] + (b1[test[i]] * x2[i])
      }
      a0 ~ dnorm(0, 1e-3)
      b0 ~ dnorm(0, 1e-3)
      a1[2] <- 0
      a1[1] ~ dnorm(0, 1e-3)
      b1[2] <- 0
      b1[1] ~ dnorm(0, 1e-3)
      tau <- pow(sigma, -2)
      sigma ~ dunif(0, 100)
    }
    

    Here is some R code that will run the above model (paste the code into a file called "tmp.jags" in the working directory):

    ## data
    n <- 100
    x1 <- 1:n
    a0 <- 0.1 # estimate should be ~ -5
    b0 <- 5 # estimate should be ~ 0.1
    z <- (a0 * x1) - b0
    w <- sapply(1 / (1 + exp(-z)), function(p) {rbinom(1, 1, p)})
    x2 <- rnorm(n, 10, 2)
    a11 <- 0.1 # estimate should be ~ 0.1
    b11 <- 5 # estimate should be ~ 5
    a12 <- 0
    b12 <- 0
    epsilon <- rnorm(n, 0, 1)
    y <- ifelse(w == 0, a11 + (b11 * x2) + epsilon, a12 + (b12 * x2) + epsilon)
    dat <- list(w = w, y = y, x1 = x1, x2 = x2, n = n)
    
    ## monitors
    params <- c('a0', 'b0' ,'a1', 'b1', 'sigma', 'y.hat', 'test')
    
    ## run it
    require(R2jags)
    out <- jags(model.file = 'tmp.jags',
                data = dat,
                parameters.to.save = params,
                n.chains = 3,
                n.iter = 10000,
                n.burnin = 1000,
                n.thin = 10,
                jags.seed = 123)
    
    ## print medians
    print(out$BUGSoutput$median[params])
    

    Can anyone suggest a way to combine a Bernoulli and simple linear regression for this problem?

    Thanks in advance for any help.
    Stephen

     
    • Stephen Gregory

      Stephen Gregory - 2014-08-25

      Perhaps the answer is to split sigma into a vector of two elements, as in:

      model{
        for(i in 1:N){
          w[i] ~ dbern(p[i])
          logit(p[i]) <- a0 + (b0 * x1[i])
      
          test[i] <- step(w[i] - 1) + 1
      
          y[i] ~ dnorm(y.hat[i], tau[test[i]])
          y.hat[i] <- a1[test[i]] + (b1[test[i]] * x2[i])
      
        }
      
        a0 ~ dnorm(0, 1e-3)
        b0 ~ dnorm(0, 1e-3)
        a1[2] <- 0
        a1[1] ~ dnorm(0, 1e-3)
        b1[2] <- 0
        b1[1] ~ dnorm(0, 1e-3)
        tau[1] <- pow(sigma[1], -2)
        sigma[1] ~ dunif(0, 100)
        tau[2] <- pow(sigma[2], -2)
        sigma[2] ~ dunif(0, 100)
      
      }
      

      Does this sound reasonable?

      Thanks in advance for any help.
      Stephen

       
      • Martyn Plummer

        Martyn Plummer - 2014-08-25

        You want something more like this. As well as giving a prior distribution to sigma[1] you need to do the same for a1[2] and b1[2] instead of setting them to zero.

        model{
          for(i in 1:N){
            w[i] ~ dbern(p[i])
            logit(p[i]) <- a0 + (b0 * x1[i])
        
            test[i] <- step(w[i] - 1) + 1
        
            y[i] ~ dnorm(y.hat[i], tau[test[i]])
            y.hat[i] <- a1[test[i]] + (b1[test[i]] * x2[i])
        
          }
        
          a0 ~ dnorm(0, 1e-3)
          b0 ~ dnorm(0, 1e-3)
        
          for (j in 1:2) {
             a1[j] ~ dnorm(0, 1e-3)
             b1[j] ~ dnorm(0, 1e-3)
             tau[j] <- pow(sigma[j], -2)
             sigma[j] ~ dunif(0, 100)
          }
        }
        
         
  • Stephen Gregory

    Stephen Gregory - 2014-08-26

    Thanks, Martyn. Very kind of you to respond.

    So that I'm completely clear: even though I am not interested in parameters a1[2], b1[2] and sigma[2], I should nevertheless give them normal priors. Is that right? If so, why? Perhaps it is to help the model run?

     
  • Martyn Plummer

    Martyn Plummer - 2014-08-27

    Yes they are nuisance parameters. But you still need them. At each iteration, the model has to classify each fish into one of two groups (w=0 or w=1). The probability that a fish belongs to one group or another is based on the likelihood ratio between the two possibilities. To calculate this likelihood ratio, you do need to include a plausible sub-model for the lengths of the older fish as well as the younger fish.

    For example, in your previous model you set all older fish to have length zero by setting a1[2] and b1[2] to zero. This is an implausible model, so in practice the model will classify most fish of non-zero length in group w=0.

     
  • Stephen Gregory

    Stephen Gregory - 2014-08-28

    I see. That makes good sense. Thanks v much for explaining!

    Stephen

     

Log in to post a comment.