Handling categorical values whose validity can change every iteration

  • Andrew R


    I implemented LDA and ran into an issue with z, a categorical variable that indicates word n's topic in document d. (My model is below for reference.) The validity of z[d, n]'s value depends only on the other z values. For example, with two topics, the words in document 1 may be topic 1 in one iteration and topic 2 in another iteration. Both of these are equally valid as long as the words all belong to the same topic.

    The issue is that this makes the mean of each variable meaningless. Continuing with the above example, the mean topic of every word in every document might be approximately 1.5.

    # 2 topics
    # doc 1: 1, 2, 3, 4
    # doc 2: 1, 2, 5, 6
               Mean     SD Naive SE Time-series SE
    z[1,1]   1.4780 0.4996 0.009121       0.017988
    z[1,2]   1.4847 0.4998 0.009126       0.018407
    z[1,3]   1.4893 0.5000 0.009128       0.020094
    z[1,4]   1.4767 0.4995 0.009120       0.021691
    z[2,1]   1.4857 0.4999 0.009126       0.017175
    z[2,2]   1.4960 0.5001 0.009130       0.018065
    z[2,3]   1.5130 0.4999 0.009127       0.020807
    z[2,4]   1.5090 0.5000 0.009129       0.021343
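    As a minimal illustration of where these means come from (with made-up draws, not my actual chain), a sampler that spends about half its iterations in each of the two equally valid labellings produces per-word means near 1.5 even though every individual iteration is coherent:

    ```python
    # Hypothetical draws of z[1,1] from a chain that label-switches:
    # roughly half the iterations assign the word to topic 1, the other
    # half to topic 2, so the posterior mean carries no topic information.
    draws = [1] * 500 + [2] * 500

    mean = sum(draws) / len(draws)
    print(mean)  # 1.5 -- uninformative about which topic the word belongs to
    ```
    
    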

    What is the best way to handle variables with this behavior? Should z[1,1] be the topic that appears in the most iterations? Should it be the topic that appears in the last iteration? Or is there a better way to handle it?

    Thanks in advance,

    # w, V, D, K, doclens, alpha, and beta are observed
    model {
        for (d in 1:D) {
            theta[d, 1:K] ~ ddirch(alpha)
            for (n in 1:doclens[d]) {
                z[d, n] ~ dcat(theta[d, 1:K])
                w[d, n] ~ dcat(phi[z[d, n], 1:V])
            }
        }
        for (k in 1:K) {
            phi[k, 1:V] ~ ddirch(beta)
        }
    }
  • Martyn Plummer

    This is called label switching, and is a common problem in mixture models. To make sense of the parameters you may need to use a relabelling algorithm. You can find a review of label switching here:

    On a positive note, the fact that all the posterior distributions look the same is a sign that your Markov chain has good mixing, which is not always the case with mixture models.
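    As a rough sketch of the idea behind one relabelling step (this is an illustration, not a specific published algorithm): for each saved draw of the z vector, pick the permutation of topic labels that best agrees with some reference labelling (for example, the last draw), and apply it before summarizing. The function name and brute-force permutation search are my own; the search is only feasible for small K, since it tries all K! permutations.

    ```python
    import itertools

    def relabel(draws, reference):
        """Align each MCMC draw's topic labels to a reference labelling.

        draws: list of label vectors (one per iteration), integer labels 1..K.
        reference: a label vector to align against, e.g. the final draw.

        For each draw, choose the permutation of labels that maximizes
        agreement with the reference, then apply it. Brute force over all
        K! permutations, so only practical for small K.
        """
        labels = sorted(set(reference))
        aligned = []
        for draw in draws:
            best = max(
                itertools.permutations(labels),
                # perm[l - 1] is the new label assigned to old label l
                key=lambda perm: sum(perm[l - 1] == r
                                     for l, r in zip(draw, reference)),
            )
            aligned.append([best[l - 1] for l in draw])
        return aligned
    ```

    After aligning, per-word summaries such as the most frequent (modal) topic across iterations become meaningful, because the two labellings of the same partition are no longer averaged together.
    
    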

  • Andrew R

    Thanks for the reference, Martyn. That's exactly what I was looking for.