This page lists the datasets and models which are available at:
https://sourceforge.net/projects/jobimtext/files/data/
We also computed models for various time slices of the Google Books data.
The following datasets are available and can be used for calculating a new thesaurus. Each file contains one sentence per line.
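Since each file holds one sentence per line, a corpus can be streamed without loading it into memory. The following is a minimal sketch, assuming an uncompressed UTF-8 text file; the function name `iter_sentences` and the encoding are assumptions for illustration and are not part of the JoBimText distribution.

```python
from typing import Iterator


def iter_sentences(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one sentence per non-empty line of a plain-text corpus file.

    Assumes the file is uncompressed and UTF-8 encoded (an assumption,
    not something stated for the downloadable archives).
    """
    with open(path, encoding=encoding) as handle:
        for line in handle:
            sentence = line.strip()  # drop the newline and surrounding whitespace
            if sentence:  # skip blank lines
                yield sentence
```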
Dataset | Download | Holing System | Word Count | Feature Count | Word Feature Count | Word Feature Significances | Similarities |
---|---|---|---|---|---|---|---|
en_wikipedia | I II | Malt Parser & Lemmatized | all | all | all | all | top 200 |
en_wikipedia | I | Malt Parser & Lemmatized | none | none | none | top 1000/word | top 100 |
en_news10M | I | Malt Parser & Lemmatized | all | all | all | all | top 200 |
en_news120M | I II III | Stanford Parser & Lemmatized | all | all | all | all | top 200 |
en_news120M pruned mysql | I | Stanford Parser & Lemma | all | none | none | top 1000/word for 100k frequent words | top 200 for 100k frequent words |
en_news120M pruned | I | Stanford Parser & Lemma | all | none | none | top 1000/word for 100k frequent words | top 200 for 100k frequent words |
en_news120M pruned | I | 3gram w. hole at pos. 2 | all | none | none | top 1000/word for 100k frequent words | top 200 for 100k frequent words |
de_news70M pruned | I | 3gram w. hole at pos. 2 | all | none | none | top 1000/word for 100k frequent words | top 200 for 100k frequent words |
google_books | I | dependency parses | all | none | none | top 1000/word for words with wordcount > 100 | top 200 for words with wordcount > 100 |
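To illustrate the "3gram w. hole at pos. 2" holing system listed in the table: each 3-gram contributes its middle word as the term, and the two surrounding words, with a placeholder in the middle, as the context feature. The sketch below is only an illustration under assumptions: the function name `trigram_holing`, the `@` placeholder, and the `left_@_right` feature string are choices made here and may differ from the format used in the released files.

```python
from typing import Iterator, List, Tuple

HOLE = "@"  # placeholder for the removed word; an assumption for this sketch


def trigram_holing(tokens: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (word, feature) pairs from every 3-gram, with the hole at position 2."""
    # zip over three shifted views of the token list to enumerate all 3-grams
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        yield word, f"{left}_{HOLE}_{right}"


pairs = list(trigram_holing("the cat sat down".split()))
# pairs == [("cat", "the_@_sat"), ("sat", "cat_@_down")]
```

Sentences with fewer than three tokens yield no pairs in this sketch.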
Dataset | Download | Holing System | Clustering |
---|---|---|---|
news120M | I | Stanford Parser & Lemma | Chinese Whispers |
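The "top 1000/word" and "top 200" entries in the model table describe pruned models, where only the highest-ranked features or similar words are kept for each word. As a rough sketch of what such pruning means (not the JoBimText implementation), the hypothetical helper `top_k_per_word` below keeps the `k` highest-scoring entries per word from `(word, item, score)` triples; the triple layout is an assumption for this example.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_k_per_word(
    scores: Iterable[Tuple[str, str, float]], k: int
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep the k highest-scoring (item, score) pairs for each word."""
    grouped: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for word, item, score in scores:
        grouped[word].append((item, score))
    # heapq.nlargest returns the k largest entries in descending score order
    return {
        word: heapq.nlargest(k, pairs, key=lambda pair: pair[1])
        for word, pairs in grouped.items()
    }
```

Calling `top_k_per_word(rows, 200)` on similarity triples would correspond to a "top 200" similarity list per word.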