Russell Almond - 2014-03-22

I'm trying to fit a hierarchical mixture model to some data consisting of pause times during typing for a number of students who are writing essays. My theory is that there are some rare components in this distribution, corresponding to high-level events in the writing process, that I can pick up with a hierarchical model but not by looking at any individual.

The problem is that the eta0 and nu0 parameters in this model (basically, the ones related to the precision of the higher-level part of the model) mix very slowly. Here is my first model:

############################################################
## General K component mixture model, no covariates.
############################################################
##
## Fixed Structural Parameters
## I -- number of essays, i -- index for essays
## J[I] -- number of events per essay, j -- index for events,
## Jmax <- max(J)
## K -- number of mixture components
##
## Data
## Y[I,Jmax] -- log pause time
## Note that this is a ragged array, so that Y[i,j] for j>J[i] will be NA
##
## Latent Variables
## Z[I,Jmax] -- indicator for mixture components
##
## 1st Level parameters
## pi[I,K] -- mixture probabilities
## mu[I,K] -- component means
## tau[I,K] -- component precisions
## sigma[I,K] <- 1/sqrt(tau[I,K])
##
## 2nd Level hyperparameters
## alpha[K] -- mixture component proportions
## mu0[K] -- grand means for components
## eta0[K] -- precisions for components
## tau0[K] -- scale factors for components
## nu0[K] -- degrees of freedom for scaled components
##
## 3rd level fixed hyperparameters
data {
  for (k in 1:K) {
    mu0m[k] <- 0                        # mu0 prior mean
    mu0p[k] <- .00001                   # mu0 prior precision
    eta0a[k] <- 0.1                     # eta0 prior shape
    eta0b[k] <- 0.1                     # eta0 prior rate
    tau0a[k] <- 1.1                     # tau0 prior shape
    tau0b[k] <- 1.1                     # tau0 prior rate
    nu0a[k] <- 1.1                      # nu0 prior shape
    nu0b[k] <- 1.1                      # nu0 prior rate
    alpha0a[k] <- 1.1                   # alpha prior shape
    alpha0b[k] <- 2                     # alpha prior rate
  }
}

model{

  ## Hyperpriors for mixtures parameters
  for (k in 1:K) {
    mu0[k]~dnorm(mu0m[k],mu0p[k])
    eta0[k]~dgamma(eta0a[k],eta0b[k])
    tau0[k] ~ dgamma(tau0a[k],tau0b[k])
    nu0[k] ~ dgamma(nu0a[k],nu0b[k])
    alpha[k]~dgamma(alpha0a[k],alpha0b[k])
  }

  ## Priors for individual components
  for(i in 1:I){
    pi[i,1:K] ~ ddirch(alpha[]+.01)
    for (k in 1:K) {
      mu[i,k]~dnorm(mu0[k],eta0[k])
      tau[i,k]~dgamma(tau0[k]+.01,nu0[k]+.01)
    }
  }

  ## Likelihood for data
  for (i in 1:I) {
    for (j in 1:J[i]) {
###     Z[i,j] ~ dcat(pi[i,1:K])
###     Y[i,j] ~dnorm(mu[i,Z[i,j]], tau[i,Z[i,j]])
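###     dnormmix (from the JAGS "mix" module) marginalizes Z out directly: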
      Y[i,j] ~ dnormmix(mu[i,],tau[i,],pi[i,])
    }
  }

}

As I mentioned before, I was getting slow mixing with this model, so I tried rewriting the distributions for mu and log(tau) so that they looked more like regressions. In other words, I set

theta ~ dnorm(0,1)
mu <- mu0 + beta0*theta
eta ~ dnorm(0,1)
log(tau) <- tau0 + gamma0*eta

so that beta0 and gamma0 play the role of standard deviations rather than precisions.

The hope was that it would be easier to estimate regression coefficients than precisions.

This is the rewritten model.

############################################################
## General K component mixture model, no covariates.
############################################################
##
## Fixed Structural Parameters
## I -- number of essays, i -- index for essays
## J[I] -- number of events per essay, j -- index for events,
## Jmax <- max(J)
## K -- number of mixture components
##
## Data
## Y[I,Jmax] -- log pause time
## Note that this is a ragged array, so that Y[i,j] for j>J[i] will be NA
##
## 1st Level parameters
## pi[I,K] -- mixture probabilities
## mu[I,K] -- component means
## tau[I,K] -- component precisions
## sigma[I,K] <- 1/sqrt(tau[I,K])
##
## 2nd level random effects  (allows reparameterizing as a regression
## instead of estimating precisions)
## theta[I,K] -- component effect for mean
## eta[I,K] -- component effect for log precision
##
## 2nd Level hyperparameters
## alpha[K] -- mixture component proportions
## mu0[K] -- grand means for components
## beta0[K] -- sd of mean distributions
## tau0[K] -- grand mean of log precisions
## gamma0[K] -- sd of log precisions
##
## 3rd level fixed hyperparameters
data {
  for (k in 1:K) {
    mu0m[k] <- 0                        # mu0 prior mean
    mu0p[k] <- .00001                   # mu0 prior precision
    beta0m[k] <- 1                      # beta0 prior mean
    beta0p[k] <- 0.001                  # beta0 prior precision
    tau0m[k] <- 0                       # log tau0 prior mean
    tau0p[k] <- 0.001                   # log tau0 prior precision
    gamma0m[k] <- 1                     # gamma0 prior mean
    gamma0p[k] <- 0.01                  # gamma0 prior precision
    alpha0mu[k] <- .5                   # alpha0 Dirichlet prior weight
  }
  alphaNr <- 6                          # alpha0N prior shape
  alphaNlam <- .001                     # alpha0N prior rate
}

model{

  ## Hyperpriors for mixtures parameters
  alpha0[1:K]~ddirch(alpha0mu[1:K])
  alpha0N ~dgamma(alphaNr,alphaNlam)
  for (k in 1:K) {
    mu0[k] ~ dnorm(mu0m[k],mu0p[k])
    beta0[k] ~ dnorm(beta0m[k],beta0p[k])
    tau0[k] ~ dnorm(tau0m[k],tau0p[k])
    gamma0[k] ~ dnorm(gamma0m[k],gamma0p[k])
    alpha[k] <- alpha0[k]*alpha0N
  }

  ## Priors for individual components
  for(i in 1:I){
    pi[i,1:K] ~ ddirch(alpha[]+.01)
    for (k in 1:K) {
      ## reparameterize mu[i,k] = mu0[k] + beta0[k]*theta[i,k]
      theta[i,k] ~ dnorm(0,1)
      mu[i,k] <- mu0[k] + beta0[k]*theta[i,k]
      eta[i,k] ~ dnorm(0,1)
      tau[i,k] <- exp(tau0[k] + gamma0[k]*eta[i,k]) + .01 # keep tau away from 0
    }
  }

  ## Likelihood for data
  for (i in 1:I) {
    for (j in 1:J[i]) {
###     Z[i,j] ~ dcat(pi[i,1:K])
###     Y[i,j] ~dnorm(mu[i,Z[i,j]], tau[i,Z[i,j]])
      Y[i,j] ~ dnormmix(mu[i,],tau[i,],pi[i,])
    }
  }

}

The problem with this model is that fairly often one of the nodes gets stuck at a point of infinite likelihood, particularly with K=4 (the case I'm most interested in). Testing is also pretty slow, because J is several hundred for each student (I've cut the sample down to 100 students just to get reasonable run times).

I've tried both constraining beta0 and gamma0 to reasonable ranges, e.g., T(.01,100), and letting them run free as shown here (taking the absolute value on the R side when I'm done).
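That is, the truncated runs replaced those two hyperpriors with:

  for (k in 1:K) {
    beta0[k] ~ dnorm(beta0m[k],beta0p[k]) T(.01,100)
    gamma0[k] ~ dnorm(gamma0m[k],gamma0p[k]) T(.01,100)
  }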

Any suggestions for next steps to try would be appreciated.