FAQ

Similarity Calculations using Hadoop

We recommend using only features that occur with fewer than 1000 words and keeping only positive significance scores. We also recommend the LMI significance measure, keeping only the 1000 features per term with the highest significance scores. Normally it is sufficient to keep the 200 most similar terms for each term. This is achieved with:

python generateHadoopScript.py dataset 1000 0 0 1000 LMI 200 

The computation of the similarities stops with an exception, e.g. ERROR 2997: Encountered IOException. File pig/FreqSig1000.pig does not exist.

This error is caused by a wrong parameter passed to generateHadoopScript.py when generating the script that runs the Hadoop pipeline: a number was given where a significance measure is expected. The following parameters can be used to generate the script:

python generateHadoopScript.py dataset 1000 0 0 1000 LMI 200 
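
To catch this mistake before the pipeline runs, the argument list can be checked up front. The following is only a minimal sketch and not part of generateHadoopScript.py: the function names are hypothetical, and it assumes the script takes exactly the seven arguments shown in the command above, with the significance measure in sixth position.

```python
def is_number(value):
    """Return True if the string parses as a number."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def check_arguments(args):
    """Return a list of problems found in the argument list (empty if none).

    Assumption: seven arguments, in the order used in the FAQ command,
    with the significance measure (e.g. LMI) at index 5.
    """
    problems = []
    if len(args) != 7:
        problems.append("expected 7 arguments, got %d" % len(args))
        return problems
    measure = args[5]
    if is_number(measure):
        # A number here matches the situation described above, where a
        # number was given instead of a significance measure.
        problems.append(
            "argument 6 should be a significance measure such as LMI, "
            "got the number %s" % measure
        )
    return problems


if __name__ == "__main__":
    good = "dataset 1000 0 0 1000 LMI 200".split()
    bad = "dataset 1000 0 0 1000 1000 200".split()
    print(check_arguments(good))
    print(check_arguments(bad))
```

The check only rejects a numeric value in the measure position; it does not verify that the name is a measure the pipeline actually supports.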

If the error persists, you may be using a version older than 0.0.6 and should upgrade to the latest version. If you do not want to upgrade, follow the documentation found here.

