Name                      Modified     Size
readme.txt                2012-06-18   973 Bytes
lda.zip                   2012-04-29   228.1 MB
collocations.zip          2012-04-29   25.7 MB
lineBasedFingerPrint.py   2011-05-09   2.8 kB
Totals: 4 items, 253.8 MB, 0 downloads / week
Created 4/29/2012 by Raphael Cohen

Corpus redundancy manager

Redundancy caused by copy-and-paste operations in text biases machine learning models for NLP.
This module takes a directory and produces a list of a subset of the files in that directory, such that the similarity between any two files in the subset stays below a given upper bound.
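
The selection idea described above can be illustrated with a minimal sketch. It is not the code in lineBasedFingerPrint.py: the fingerprint definition (whitespace-stripped lines of at least min_len characters), the similarity measure (shared fingerprints divided by the size of the smaller set), the greedy keep-or-drop order, and all function names are assumptions made for illustration only.

```python
import os


def line_fingerprints(text, min_len=30):
    # Assumption: a fingerprint is any whitespace-stripped line of at least
    # min_len characters; shorter lines are ignored.
    return set(line.strip() for line in text.splitlines()
               if len(line.strip()) >= min_len)


def similarity(a, b):
    # Assumption: fraction of the smaller fingerprint set shared with the other.
    if not a or not b:
        return 0.0
    return len(a & b) / float(min(len(a), len(b)))


def select_subset(docs, max_similarity=0.2, min_len=30):
    """Greedily keep documents whose similarity to every kept one is within the bound.

    docs is a list of (name, text) pairs; the kept names are returned in input order.
    """
    kept = []  # (name, fingerprint set) pairs accepted so far
    for name, text in docs:
        fp = line_fingerprints(text, min_len)
        if all(similarity(fp, other) <= max_similarity for _, other in kept):
            kept.append((name, fp))
    return [name for name, _ in kept]


def select_from_directory(directory, max_similarity=0.2, min_len=30):
    # Read every regular file in the directory (sorted by name) and filter it.
    docs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path) as handle:
                docs.append((name, handle.read()))
    return select_subset(docs, max_similarity, min_len)
```

In this sketch the result depends on the order in which files are visited, since an earlier file always wins over a later near-duplicate; the real script may resolve such ties differently.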

##### Installation #####

Download lineBasedFingerPrint.py.
Requires Python 2.6.

##### Usage #####
python lineBasedFingerPrint.py [directory] [MaxSimilarity:default-0.2] [FingerPrintLen:default-30]

Example:
% python lineBasedFingerPrint.py trainRed5-8 0.5 30
This produces a list of a non-redundant subset of the files in the directory trainRed5-8 (supplied in collocations.zip), with a maximum similarity of 0.5.
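
The usage line implies one required positional argument and two optional ones with defaults of 0.2 and 30. A minimal sketch of how such arguments could be read is shown here; the function name parse_args and the error handling are hypothetical and are not taken from lineBasedFingerPrint.py.

```python
def parse_args(argv):
    """Parse [directory] [MaxSimilarity] [FingerPrintLen] from a list of strings.

    argv excludes the script name. Defaults follow the usage line:
    MaxSimilarity 0.2 and FingerPrintLen 30.
    """
    if not argv:
        # Hypothetical behaviour: the directory is treated as required.
        raise ValueError("usage: lineBasedFingerPrint.py [directory] "
                         "[MaxSimilarity] [FingerPrintLen]")
    directory = argv[0]
    max_similarity = float(argv[1]) if len(argv) > 1 else 0.2
    fingerprint_len = int(argv[2]) if len(argv) > 2 else 30
    return directory, max_similarity, fingerprint_len
```

For instance, parse_args(sys.argv[1:]) would be the natural call from a script entry point, after importing sys.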

##### Experiments included in project #####
collocations.zip contains the synthetic-corpus experiments on collocation identification with Text-NSP.
lda.zip contains the synthetic-corpus experiments on topic modeling with Mallet.
Source: readme.txt, updated 2012-06-18