#103 Bugs(?) in GoodTuringProbDist and WittenBellProbDist

peter ljunglöf

When training and testing an HMM tagger with different estimators, there are problems with GoodTuring and WittenBell.

GoodTuring gets extremely bad accuracy, and WittenBell raises a ZeroDivisionError. I'm not an expert on GoodTuring and WittenBell, so there's a small chance I have called them incorrectly, but my guess is that there are bugs in their implementations.
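For reference, here is a minimal sketch of the textbook (unsmoothed) Good-Turing estimate, written independently of NLTK. The function name `good_turing_prob` and its signature are hypothetical, and I have not checked whether GoodTuringProbDist computes its estimate this way. The sketch only shows one known weakness of the unsmoothed formula: a sample seen r times gets probability zero whenever no sample was seen exactly r + 1 times, which typically hits the most frequent samples.

```python
from collections import Counter


def good_turing_prob(counts, sample, bins):
    """Unsmoothed Good-Turing estimate of P(sample).

    counts: Counter mapping each observed sample to its frequency.
    bins:   assumed total number of possible sample types.
    (Hypothetical helper for illustration, not the NLTK implementation.)
    """
    n = sum(counts.values())                      # N: total observed tokens
    count_of_counts = Counter(counts.values())    # N_r: types seen exactly r times
    r = counts[sample]
    # N_0 is the number of types never observed, which depends on bins.
    n_r = (bins - len(counts)) if r == 0 else count_of_counts[r]
    n_r_plus_1 = count_of_counts[r + 1]
    if n == 0 or n_r == 0:
        return 0.0
    # P = (r + 1) * N_{r+1} / (N_r * N)
    return (r + 1) * n_r_plus_1 / (n_r * n)


# Toy data: 'a' seen 3 times, 'b' twice, 'c' once, 5 possible types in total.
toy_counts = Counter("aaabbc")
for s in ["a", "b", "c", "unseen"]:
    print(s, good_turing_prob(toy_counts, s, bins=5))
```

In this toy data the most frequent sample, 'a', gets probability zero because nothing was seen four times. If something similar happens to frequent tags or words in the HMM, it could explain very low accuracy, but that is only a guess.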

Attached is the Python file I used for testing; its output is shown below. NOTE: the test file needs the patched version of, which was submitted as patch #1997742.

Training 450 sentences, 10431 tokens

Training using estimator: Laplace
Testing 50 sentences, 1280 tokens
Test result: 67.6%

Training using estimator: ELE
Testing 50 sentences, 1280 tokens
Test result: 75.2%

Training using estimator: Lidstone 0.1
Testing 50 sentences, 1280 tokens
Test result: 82.6%

Training using estimator: GoodTuring
Testing 50 sentences, 1280 tokens
Test result: 13.1%

Training using estimator: WittenBell
Testing 50 sentences, 1280 tokens
Traceback (most recent call last):
  File "/var/folders/Hl/Hldpn1ooEeSui0EMU5q1zE+++TM/-Tmp-/py1805CCK", line 34, in <module>
    acc = nltk.tag.accuracy(hmm, testC)
  File "/Library/Python/2.5/site-packages/nltk/tag/", line 82, in accuracy
    test_tokens += list(tagger.tag(untag(sent)))
  File "/Library/Python/2.5/site-packages/nltk/tag/", line 182, in tag
    path = self.best_path(unlabeled_sequence)
  File "/Library/Python/2.5/site-packages/nltk/tag/", line 226, in best_path
  File "/Library/Python/2.5/site-packages/nltk/tag/", line 204, in _create_cache
    X[i, j] = self._transitions[si].logprob(self._states[j])
  File "/Library/Python/2.5/site-packages/nltk/", line 316, in logprob
    p = self.prob(sample)
  File "/Library/Python/2.5/site-packages/nltk/", line 897, in prob
    return self._T / float(self._Z * (self._N + self._T))
ZeroDivisionError: float division
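The failing expression is T / (Z * (N + T)). Assuming the usual Witten-Bell meaning of these symbols (N = observed tokens, T = observed types, Z = types never observed, i.e. bins minus T), the denominator is zero in two cases: when the distribution was built from no observations at all (N = T = 0), or when every possible type has been observed (Z = 0). The standalone sketch below reproduces both cases. The function name `witten_bell_prob` is hypothetical and the code is not taken from NLTK; I have not confirmed which of the two cases the HMM hits.

```python
from collections import Counter


def witten_bell_prob(counts, sample, bins):
    """Witten-Bell estimate of P(sample), with no guard against a zero denominator.

    counts: Counter mapping each observed sample to its frequency.
    bins:   assumed total number of possible sample types.
    (Hypothetical helper for illustration, not the NLTK implementation.)
    """
    n = sum(counts.values())   # N: total observed tokens
    t = len(counts)            # T: distinct observed types
    z = bins - t               # Z: types never observed
    c = counts[sample]
    if c > 0:
        # Seen samples are discounted to c / (N + T).
        return c / (n + t)
    # Unseen samples share the reserved mass T / (N + T) equally among Z types.
    return t / (z * (n + t))


# Normal case: some types seen, some not.
print(witten_bell_prob(Counter("aab"), "a", bins=4))
print(witten_bell_prob(Counter("aab"), "zzz", bins=4))

# Degenerate cases that make the denominator zero.
for label, counts, bins in [
    ("no observations (N = T = 0)", Counter(), 4),
    ("all types observed (Z = 0)", Counter("aab"), 2),
]:
    try:
        witten_bell_prob(counts, "zzz", bins)
    except ZeroDivisionError as err:
        print(label, "->", type(err).__name__)
```

In an HMM, a state that never occurs as the source of a transition in the training data would have an empty transition distribution, which matches the first case. That seems a plausible trigger for the traceback above, since the error is raised while looking up `self._transitions[si]`.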


