JoBimText Wiki

Linking Language to Knowledge with Distributional Semantics

Status: Beta

Brought to you by: apanchenko, biem-tuda, coppolab, eugenso, and 4 others

models

This page lists the datasets and models which are available at:
https://sourceforge.net/projects/jobimtext/files/data/

We also computed models for various time slices of the Google Books data.

The following datasets are available and can be used to compute a new thesaurus. Each file contains one sentence per line.

en_news10M: This dataset is taken from the Leipzig Corpora Collection (LCC). It consists of 10 million English sentences taken from news web pages.
en_wikipedia: This dataset is constructed using English Wikipedia. It consists of 35.9 million sentences.
en_google_books: This dataset was constructed by Yoav Goldberg (A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books, *SEM 2013) and can be downloaded here.
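
Because each corpus file holds one sentence per line, it can be streamed sentence by sentence without loading the whole file into memory. The following is a minimal Python sketch, assuming an unpacked, UTF-8 encoded plain-text file; the function name `iter_sentences` and the tiny sample file are made up for illustration and are not part of JoBimText.

```python
import os
import tempfile
from typing import Iterator


def iter_sentences(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield one sentence per non-empty line of a corpus file.

    Assumption: the file is plain text with one sentence per line.
    """
    with open(path, encoding=encoding) as handle:
        for line in handle:
            sentence = line.strip()
            if sentence:  # skip blank lines
                yield sentence


# Tiny hypothetical sample standing in for a real corpus file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("The cat sat on the mat.\n\nDogs bark loudly.\n")
    sample_path = tmp.name

sentences = list(iter_sentences(sample_path))
os.remove(sample_path)
print(len(sentences), "sentences read")
```

For the real downloads, the path would point at the unpacked corpus file instead of the temporary sample.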

Dataset	Download	Holing System	Word Count	Feature Count	Word Feature Count	Word Feature Significances	Similarities
en_wikipedia	I II	Malt Parser & Lemmatized	all	all	all	all	top 200
en_wikipedia	I	Malt Parser & Lemmatized	none	none	none	top 1000/word	top 100
en_news10M	I	Malt Parser & Lemmatized	all	all	all	all	top 200
en_news120M	I II III	Stanford Parser & Lemmatized	all	all	all	all	top 200
en_news120M pruned mysql	I	Stanford Parser & Lemma	all	none	none	top 1000/word for 100k frequent words	top 200 for 100k frequent words
en_news120M pruned	I	Stanford Parser & Lemma	all	none	none	top 1000/word for 100k frequent words	top 200 for 100k frequent words
en_news120M pruned	I	3gram w. hole at pos. 2	all	none	none	top 1000/word for 100k frequent words	top 200 for 100k frequent words
de_news70M pruned	I	3gram w. hole at pos. 2	all	none	none	top 1000/word for 100k frequent words	top 200 for 100k frequent words
google_books	I	dependency parses	all	none	none	top 1000/word for words with wordcount > 100	top 200 for words with wordcount > 100
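
The pruned models in the table keep only the highest-ranked entries per word (for example "top 200"), and only for the most frequent words. The kind of pruning this implies can be sketched in Python as follows. This is an illustration only: it assumes in-memory `(word, other, score)` triples and a word-count dictionary, and the function name `prune`, the toy data, and the tiny thresholds are hypothetical. It does not reflect the JoBimText implementation or the format of the downloadable files.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, float]  # (word, similar word or feature, score)


def prune(
    triples: Iterable[Triple],
    word_counts: Dict[str, int],
    top_words: int,
    top_n: int,
) -> Dict[str, List[Tuple[str, float]]]:
    """Keep the top_n highest-scoring entries for the top_words most frequent words."""
    # Select the most frequent words by their corpus count.
    frequent = set(heapq.nlargest(top_words, word_counts, key=word_counts.get))

    # Group the entries of the retained words.
    by_word: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for word, other, score in triples:
        if word in frequent:
            by_word[word].append((other, score))

    # Keep only the best-scoring entries per word, highest score first.
    return {
        word: heapq.nlargest(top_n, entries, key=lambda entry: entry[1])
        for word, entries in by_word.items()
    }


# Toy data (made up for illustration).
counts = {"cat": 50, "dog": 40, "axolotl": 1}
similarities = [
    ("cat", "dog", 0.9),
    ("cat", "mouse", 0.4),
    ("cat", "tiger", 0.7),
    ("dog", "cat", 0.9),
    ("axolotl", "newt", 0.8),
]
pruned = prune(similarities, counts, top_words=2, top_n=2)
print(pruned)
```

With `top_words=100_000` and `top_n=200`, the same function would correspond to the "top 200 for 100k frequent words" setting named in the table.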

Dataset	Download	Holing System	Clustering
news120M	I	Stanford Parser & Lemma	Chinese Whispers
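
For readers unfamiliar with Chinese Whispers, the following Python sketch shows the general idea of the algorithm on a small weighted graph: every node starts in its own class and repeatedly adopts the class with the highest total edge weight among its neighbours. It is a simplified illustration, not the configuration used to produce the download above. The function name, the toy graph, and the deterministic tie-breaking (the larger label wins) are assumptions made for this example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str, float]  # (node, node, weight), undirected


def chinese_whispers(
    edges: Iterable[Edge], iterations: int = 20, seed: int = 0
) -> Dict[str, int]:
    """Assign a class label to every node of a weighted, undirected graph."""
    rng = random.Random(seed)

    # Build a symmetric adjacency map.
    neighbours: Dict[str, Dict[str, float]] = defaultdict(dict)
    for a, b, weight in edges:
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    nodes = sorted(neighbours)
    # Every node starts in its own class.
    labels = {node: index for index, node in enumerate(nodes)}

    for _ in range(iterations):
        rng.shuffle(nodes)  # visit nodes in random order
        changed = False
        for node in nodes:
            # Sum the edge weights per class among the neighbours.
            scores: Dict[int, float] = defaultdict(float)
            for other, weight in neighbours[node].items():
                scores[labels[other]] += weight
            # Adopt the strongest class; ties go to the larger label
            # (an assumption made here to keep the sketch deterministic).
            best = max(scores, key=lambda label: (scores[label], label))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:  # stop early once the labelling is stable
            break
    return labels


# Toy graph: two disconnected triangles of similar words (made up).
toy_edges = [
    ("cat", "dog", 1.0), ("dog", "mouse", 1.0), ("cat", "mouse", 1.0),
    ("car", "bus", 1.0), ("bus", "train", 1.0), ("car", "train", 1.0),
]
clusters = chinese_whispers(toy_edges)
print(clusters)
```

On the toy graph, the animal words end up sharing one label and the vehicle words another, since labels can only spread along edges.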