Created by: Raphael Cohen
Date: Dec 2, 2012
This script aligns two topic models produced by MALLET (http://mallet.cs.umass.edu/).
Reciprocal topic pairs are reported together with their Jensen-Shannon (JS) divergence.
A pair (i,j), with topic i from the first model (M1) and topic j from the second model (M2), is reciprocal when its distance
is the minimum over all pairs (i,k) for k in M2 and over all pairs (l,j) for l in M1, i.e. each topic is the other's best match.
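The reciprocal-pair definition can be sketched as follows. This is a minimal illustration, not necessarily how this script implements it: the function names are hypothetical, the logarithm base (2) is an assumption, and each topic is assumed to be a list of probabilities over the same vocabulary in the same order.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two equal-length distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def reciprocal_pairs(m1, m2):
    """Return (divergence, i, j) for every pair where i and j are each other's best match.

    m1, m2: dicts mapping a topic id to its word distribution (hypothetical layout).
    """
    dist = {(i, j): js_divergence(p, q) for i, p in m1.items() for j, q in m2.items()}
    best_for_i = {i: min(m2, key=lambda j: dist[i, j]) for i in m1}
    best_for_j = {j: min(m1, key=lambda i: dist[i, j]) for j in m2}
    return [(dist[i, j], i, j) for i, j in best_for_i.items() if best_for_j[j] == i]
```

Topics without a reciprocal partner are simply left out, which is why fewer pairs than topics may be reported.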
This is useful for:
1. Qualitatively comparing different modeling parameters or algorithms
2. Identifying stable topics across several runs of the same model
Input: two topic-state gz files produced by MALLET,
for example after running:
[MALLET DIR]/bin/mallet train-topics --input data.mallet --num-topics 25 --num-iterations 2000 --output-state topic-state.gz
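A topic-state file can be turned into per-topic word counts roughly as shown below. This is a sketch under assumptions: it assumes the usual MALLET state layout (header lines starting with "#", then one token per line ending in the word and its topic id), and read_topic_word_counts is a hypothetical helper name, not part of MALLET or necessarily of this script.

```python
import gzip
from collections import Counter, defaultdict


def read_topic_word_counts(path):
    """Count how often each word is assigned to each topic in a gzipped state file."""
    counts = defaultdict(Counter)
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip header lines such as the column names, alpha and beta
            fields = line.split()
            if len(fields) < 6:
                continue  # skip blank or malformed lines
            # Assumed column order: doc source pos typeindex type topic,
            # so the word and the topic id are the last two fields.
            word, topic = fields[-2], fields[-1]
            counts[topic][word] += 1
    return counts
```

Topic ids are kept as strings here, matching the quoted ids in the example result.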
Feel free to use / change this code at your own risk.
USAGE:
Option 1 - use the default smoothing factor
%python JS-divergence.py topic-state1.gz topic-state2.gz
Option 2 - specify a smoothing factor as the third argument
%python JS-divergence.py topic-state1.gz topic-state2.gz 0.0000001
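How the smoothing factor is applied is not described here. One common choice, shown purely as an assumption, is additive smoothing: the factor is added to every word count over the shared vocabulary before normalizing, so that no word has zero probability. The name smoothed_distribution is hypothetical.

```python
def smoothed_distribution(word_counts, vocabulary, smoothing=1e-7):
    """Turn raw word counts for one topic into a probability list over `vocabulary`.

    word_counts: dict mapping word -> count for a single topic.
    vocabulary:  ordered list of all words seen in either model.
    smoothing:   small constant added to every count (assumed additive smoothing).
    """
    total = sum(word_counts.get(w, 0) for w in vocabulary) + smoothing * len(vocabulary)
    return [(word_counts.get(w, 0) + smoothing) / total for w in vocabulary]
```

Using the same vocabulary order for both models keeps the resulting lists directly comparable.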
Result (example):
JS Divergence t1 t2
(0.5524645751867814, '15', '11')
(0.1312120698128315, '20', '10')
(0.06103903882230567, '24', '12')
(0.03749075669779891, '6', '20')
(0.09601937201648371, '18', '19')
(0.025672544059170105, '9', '18')
(0.1120611237407785, '2', '3')
(0.11165026591229285, '10', '24')
(0.05849937442765494, '3', '5')
(0.1523135314850376, '23', '6')
(0.15335010058956877, '22', '9')
(0.026982916171330196, '11', '8')