Name | Modified | Size | Downloads / Week |
---|---|---|---|
readme.txt | 2012-06-18 | 973 Bytes | |
lda.zip | 2012-04-29 | 228.1 MB | |
collocations.zip | 2012-04-29 | 25.7 MB | |
lineBasedFingerPrint.py | 2011-05-09 | 2.8 kB | |
Totals: 4 Items | | 253.8 MB | 0 |
Created 4/29/2012 by Raphael Cohen

Corpus redundancy manager

Redundancy due to cut-and-paste operations in text creates bias in machine learning for NLP. This module takes a directory and produces a list of a subset of the files in that directory, with an upper bound on the similarity between any two files.

##### Installation #####
Download lineBasedFingerPrint.py
Requires Python 2.6

##### Usage #####
python lineBasedFingerPrint.py [directory] [MaxSimilarity:default-0.2] [FingerPrintLen:default-30]

For example:

%python lineBasedFingerPrint.py trainRed5-8 0.5 30

will produce a list of a non-redundant subset of the files in directory trainRed5-8 (supplied in collocations.zip) with a maximum similarity of 0.5.

##### Experiments included in project #####
collocations.zip contains the experiments with synthetic corpora pertaining to collocation identification with Text-NSP.
lda.zip contains the experiments with synthetic corpora pertaining to topic modeling with Mallet.
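To illustrate the idea of selecting a subset of files with an upper bound on pairwise similarity, here is a minimal sketch. It is not the code in lineBasedFingerPrint.py. It assumes that a fingerprint is the hash of a line at least `min_len` characters long, that similarity is the number of shared fingerprints divided by the size of the smaller fingerprint set, and that files are considered greedily in sorted name order. The function names (`line_fingerprints`, `similarity`, `select_non_redundant`, `load_directory`) are hypothetical, and the sketch targets Python 3 rather than the Python 2.6 the module requires.

```python
import hashlib
import os


def line_fingerprints(text, min_len=30):
    """Hash every stripped line of at least min_len characters (assumed scheme)."""
    prints = set()
    for line in text.splitlines():
        line = line.strip()
        if len(line) >= min_len:
            prints.add(hashlib.sha1(line.encode("utf-8")).hexdigest())
    return prints


def similarity(a, b):
    """Shared fingerprints over the smaller set size; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def select_non_redundant(docs, max_similarity=0.2, min_len=30):
    """Greedily keep documents whose similarity to every kept one is bounded.

    docs maps a file name to its text. Returns the list of kept names.
    """
    kept = []  # list of (name, fingerprint set) pairs
    for name in sorted(docs):
        prints = line_fingerprints(docs[name], min_len)
        if all(similarity(prints, other) <= max_similarity for _, other in kept):
            kept.append((name, prints))
    return [name for name, _ in kept]


def load_directory(directory):
    """Read every regular file in directory into a name -> text mapping."""
    docs = {}
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8", errors="replace") as handle:
                docs[name] = handle.read()
    return docs
```

Under these assumptions, `select_non_redundant(load_directory("trainRed5-8"), 0.5, 30)` would play the role of the example command in the Usage section. The greedy order matters: a file that is kept early can exclude later near-duplicates, so a different ordering can give a different subset.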