From: Steven Bird <sb@cs...> - 2008-10-25 06:33:54
I propose that the Brown Corpus categories in our corpus API be
addressable by meaningful words in addition to the current meaningless
letters (keeping the existing letter-based access for backwards
compatibility?).
At present we have:
>>> from nltk.corpus import brown
>>> brown.categories()
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'r']
>>> brown.words(categories='a')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.sents(categories=['a', 'b', 'c'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]
The only way to find out what one of these letters stands for is to
look it up somewhere, e.g. in the NLTK book, or here:
http://icame.uib.no/brown/bcm.html
Here are the suggested terms:
a Press: Reportage -> "news"
b Press: Editorial -> "editorial"
c Press: Reviews -> "reviews"
d Religion -> "religion"
e Skill and Hobbies -> "hobbies"
f Popular Lore -> "lore"
g Belles-Lettres -> "literature"
h Government -> "government"
j Learned -> "learned"
k Fiction: General -> "fiction"
l Fiction: Mystery -> "mystery"
m Fiction: Science -> "science_fiction"
n Fiction: Adventure -> "adventure"
p Fiction: Romance -> "romance"
r Humor -> "humor"
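
To make the backwards-compatibility idea concrete, here is a rough
sketch of how both spellings could be accepted. It is only an
illustration: BROWN_CATEGORY_NAMES and to_letters() are hypothetical
names, not part of the current corpus API, and the sketch assumes the
reader itself keeps working with the letter codes.

```python
# Hypothetical mapping, copied from the table above (not existing NLTK API).
BROWN_CATEGORY_NAMES = {
    "a": "news", "b": "editorial", "c": "reviews", "d": "religion",
    "e": "hobbies", "f": "lore", "g": "literature", "h": "government",
    "j": "learned", "k": "fiction", "l": "mystery", "m": "science_fiction",
    "n": "adventure", "p": "romance", "r": "humor",
}

# Reverse lookup: meaningful word -> original letter code.
NAME_TO_LETTER = {name: letter for letter, name in BROWN_CATEGORY_NAMES.items()}


def to_letters(categories):
    """Translate category letters and/or names into letter codes.

    Accepts a single string or a list of strings, so old letter-based
    calls keep working unchanged.
    """
    if isinstance(categories, str):
        categories = [categories]
    letters = []
    for cat in categories:
        if cat in BROWN_CATEGORY_NAMES:      # already a letter code
            letters.append(cat)
        elif cat in NAME_TO_LETTER:          # one of the proposed names
            letters.append(NAME_TO_LETTER[cat])
        else:
            raise ValueError("unknown Brown category: %r" % (cat,))
    return letters
```

With something like this in place, a call such as
brown.sents(categories=to_letters(['news', 'b'])) would select the same
files as categories=['a', 'b'] does today.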
Please let me know what you think.