Datasets and Precomputed Models

This page lists the datasets and precomputed models that are available at:
https://sourceforge.net/projects/jobimtext/files/data/

We have also computed models for various time slices of the Google Books data.

Datasets

The following datasets are available and can be used for computing a new thesaurus. The files are formatted with one sentence per line.
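To illustrate the one-sentence-per-line format, the sketch below counts sentences and token frequencies in such a file. The whitespace tokenization and the inline toy corpus are assumptions for the example only; the actual pipeline applies a holing system (parser or n-gram) to each line.

```python
# Count sentences and token frequencies in a one-sentence-per-line corpus.
# Whitespace tokenization and the toy corpus are illustrative assumptions.
from collections import Counter
from io import StringIO

def word_counts(lines):
    """Return (sentence_count, Counter of token frequencies)."""
    counts = Counter()
    n = 0
    for line in lines:
        tokens = line.split()
        if tokens:            # skip empty lines
            n += 1
            counts.update(tokens)
    return n, counts

corpus = StringIO("the cat sat\nthe dog ran\n")
n, counts = word_counts(corpus)
```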

Distributional Thesaurus Models

| Dataset | Download | Holing System | Word Count | Feature Count | Word Feature Count | Word Feature Significances | Similarities |
|---|---|---|---|---|---|---|---|
| en_wikipedia | I II | Malt Parser & Lemmatized | all | all | all | all | top 200 |
| en_wikipedia | I | Malt Parser & Lemmatized | none | none | none | top 1000/word | top 100 |
| en_news10M | I | Malt Parser & Lemmatized | all | all | all | all | top 200 |
| en_news120M | I II III | Stanford Parser & Lemmatized | all | all | all | all | top 200 |
| en_news120M pruned mysql | I | Stanford Parser & Lemmatized | all | none | none | top 1000/word for 100k frequent words | top 200 for 100k frequent words |
| en_news120M pruned | I | Stanford Parser & Lemmatized | all | none | none | top 1000/word for 100k frequent words | top 200 for 100k frequent words |
| en_news120M pruned | I | 3-gram w. hole at pos. 2 | all | none | none | top 1000/word for 100k frequent words | top 200 for 100k frequent words |
| de_news70M pruned | I | 3-gram w. hole at pos. 2 | all | none | none | top 1000/word for 100k frequent words | top 200 for 100k frequent words |
| google_books | I | dependency parses | all | none | none | top 1000/word for words with word count > 100 | top 200 for words with word count > 100 |
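The table columns mirror the stages of the thesaurus computation: word/feature counts, word-feature counts, significance scores, pruned per-word feature lists, and finally per-word similarity lists. The sketch below is one minimal way to realize these stages; using LMI as the significance measure and counting shared features as the similarity score are assumptions of this example, not a specification of the released models.

```python
# Minimal sketch of a distributional-thesaurus computation over (word, feature)
# observations: counts -> significance scores -> top-p feature lists per word
# -> similarity as the number of shared features -> top-n similar words.
# LMI as the significance measure is an assumption of this sketch.
import math
from collections import Counter, defaultdict

def thesaurus(pairs, p=1000, top=200):
    w = Counter(wd for wd, _ in pairs)     # word counts
    f = Counter(ft for _, ft in pairs)     # feature counts
    wf = Counter(pairs)                    # word-feature counts
    n = len(pairs)
    # Lexicographer's Mutual Information: count * log2(P(w,f) / (P(w) * P(f)))
    lmi = {(wd, ft): c * math.log2((c * n) / (w[wd] * f[ft]))
           for (wd, ft), c in wf.items()}
    # Keep only the p most significant features per word.
    feats = defaultdict(list)
    for (wd, ft), _score in sorted(lmi.items(), key=lambda x: -x[1]):
        if len(feats[wd]) < p:
            feats[wd].append(ft)
    # Similarity of two words = number of pruned features they share.
    inv = defaultdict(list)
    for wd, fl in feats.items():
        for ft in fl:
            inv[ft].append(wd)
    sims = defaultdict(Counter)
    for ft, words in inv.items():
        for a in words:
            for b in words:
                if a != b:
                    sims[a][b] += 1
    return {wd: [x for x, _ in c.most_common(top)] for wd, c in sims.items()}
```

With toy observations such as `[("cat", "A"), ("dog", "A"), ("cat", "B")]`, "cat" and "dog" come out as mutually most similar because they share feature "A".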

Sense Clusters

| Dataset | Download | Holing System | Clustering |
|---|---|---|---|
| news120M | I | Stanford Parser & Lemmatized | Chinese Whispers |
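Chinese Whispers, the clustering algorithm named in the table, assigns every node its own label and then repeatedly lets each node adopt the label with the highest total edge weight among its neighbours. A minimal sketch (the toy graph in the test is an assumption; in sense clustering the nodes would be similar words and the edges their similarity scores):

```python
# Minimal sketch of Chinese Whispers graph clustering: each node iteratively
# adopts the label carrying the most edge weight among its neighbours.
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=20, seed=0):
    rng = random.Random(seed)
    graph = defaultdict(dict)
    for a, b, w in edges:
        graph[a][b] = w
        graph[b][a] = w
    labels = {node: node for node in graph}   # every node starts as its own cluster
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)                    # random update order each pass
        for node in nodes:
            scores = defaultdict(float)
            for nb, w in graph[node].items():
                scores[labels[nb]] += w       # neighbours vote, weighted by edge
            if scores:
                labels[node] = max(scores, key=scores.get)
    return labels
```

Nodes that end up with the same label form one cluster; for word graphs, each cluster is read as one sense of the target word.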
