Andrew R
2013-04-09
Hi,
I implemented LDA and ran into an issue with z, the categorical variable that indicates the topic of word n in document d. (My model is below for reference.) The topic labels are only meaningful relative to the other z values: with two topics, the words in document 1 may be assigned topic 1 in one iteration and topic 2 in another. Both assignments are equally valid as long as the words stay in the same topic together.
The issue is that this makes the posterior mean of each z[d, n] meaningless. Continuing the example above, the mean topic of every word in every document might come out to approximately 1.5.
# 2 topics
# doc 1: 1, 2, 3, 4
# doc 2: 1, 2, 5, 6

          Mean    SD      Naive SE  Time-series SE
z[1,1]  1.4780  0.4996  0.009121  0.017988
z[1,2]  1.4847  0.4998  0.009126  0.018407
z[1,3]  1.4893  0.5000  0.009128  0.020094
z[1,4]  1.4767  0.4995  0.009120  0.021691
z[2,1]  1.4857  0.4999  0.009126  0.017175
z[2,2]  1.4960  0.5001  0.009130  0.018065
z[2,3]  1.5130  0.4999  0.009127  0.020807
z[2,4]  1.5090  0.5000  0.009129  0.021343
What is the best way to handle variables with this behavior? Should z[1,1] be the topic that appears in the most iterations? Should it be the topic that appears in the last iteration? Or is there a better way to handle it?
Thanks in advance,
Andrew
# w, V, D, K, doclens, alpha, and beta are observed
model {
    for (d in 1:D) {
        theta[d, 1:K] ~ ddirch(alpha)
        for (n in 1:doclens[d]) {
            z[d, n] ~ dcat(theta[d, 1:K])
            w[d, n] ~ dcat(phi[z[d, n], 1:V])
        }
    }
    for (k in 1:K) {
        phi[k, 1:V] ~ ddirch(beta)
    }
}
Martyn Plummer
2013-04-11
This is called label switching, and it is a common problem in mixture models. To make sense of the parameters you may need to use a relabelling algorithm. You can find a review of label switching here:
http://www.stats.ox.ac.uk/~cholmes/Reports/mcmc_label_switching_holmes.pdf
On a positive note, the fact that all the posterior distributions look the same is a sign that your Markov chain has good mixing, which is not always the case with mixture models.
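For anyone finding this thread later: one crude relabelling strategy from that literature is to pick a reference draw and, for each MCMC iteration, apply the topic-label permutation that best agrees with it. Here is a minimal sketch in Python (not JAGS); the data layout and function name are assumptions for illustration, and trying every permutation is only practical for small K.

```python
from itertools import permutations

def relabel(draws, ref=0):
    """Align topic labels across MCMC draws.

    draws: list of lists; draws[s][i] is the (0-based) topic of
    flattened word i in iteration s.  Returns the relabelled draws.
    """
    reference = draws[ref]
    k = max(max(d) for d in draws) + 1  # number of topics
    out = []
    for draw in draws:
        # Choose the label permutation that maximises agreement
        # with the reference draw, then apply it to this draw.
        best = max(
            permutations(range(k)),
            key=lambda p: sum(p[z] == r for z, r in zip(draw, reference)),
        )
        out.append([best[z] for z in draw])
    return out

# Two-topic example: the second draw is the first with labels swapped.
draws = [[0, 0, 1, 1], [1, 1, 0, 0]]
print(relabel(draws))  # both draws map to [0, 0, 1, 1]
```

After relabelling, the per-word posterior mode (the topic appearing in the most iterations) becomes a sensible summary, which it is not under label switching.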
Andrew R
2013-04-15
Thanks for the reference, Martyn. That's exactly what I was looking for.