JoBimText Wiki

Linking Language to Knowledge with Distributional Semantics

Status: Beta

Brought to you by: apanchenko, biem-tuda, coppolab, eugenso, and 4 others

FAQ

Similarity Calculations using Hadoop
- Which parameters are recommended for generating the Hadoop similarity pipeline?
- The similarity computation stops with an exception, e.g. ERROR 2997: Encountered IOException. File pig/FreqSig1000.pig does not exist.

Similarity Calculations using Hadoop

Which parameters are recommended for generating the Hadoop similarity pipeline?

We recommend using only features that occur with fewer than 1000 words and keeping only positive significance scores. We also recommend the LMI significance measure, keeping only the 1000 features per term with the highest significance scores. Normally it is sufficient to keep the 200 most similar terms for each term. This is achieved with:

python generateHadoopScript.py dataset 1000 0 0 1000 LMI 200

The similarity computation stops with an exception, e.g. ERROR 2997: Encountered IOException. File pig/FreqSig1000.pig does not exist.

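This error is caused by a wrong parameter passed to generateHadoopScript.py when generating the script that runs the Hadoop pipeline: a number was given where a significance measure was expected. This is presumably why the missing file name in the message contains a number (FreqSig1000.pig) rather than the name of a measure such as LMI. The following parameters can be used to generate the script: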

python generateHadoopScript.py dataset 1000 0 0 1000 LMI 200

If the error persists, you may be using a version older than 0.0.6 and should upgrade to the latest version. If you do not want to upgrade, follow the documentation found here.
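
The parameter mix-up described above can be caught before the script is generated. The following is a minimal sketch, not part of JoBimText: the helper `check_args` is hypothetical, and it assumes that the argument order matches the example call, so that the significance measure is the sixth argument after the script name. It only checks that this argument is not a plain number; it does not know which measure names the script accepts.

```python
import re

# Assumption: the argument order follows the example call
#   python generateHadoopScript.py dataset 1000 0 0 1000 LMI 200
# so the significance measure is the sixth argument after the script name.
EXPECTED_ARG_COUNT = 7
MEASURE_POSITION = 5  # zero-based index into the argument list

# Matches plain integers and decimals such as "1000", "-1" or "0.5".
NUMBER = re.compile(r"^-?\d+(\.\d+)?$")


def check_args(args):
    """Return a list of problems found in the argument list (empty if none)."""
    if len(args) != EXPECTED_ARG_COUNT:
        return ["expected %d arguments, got %d" % (EXPECTED_ARG_COUNT, len(args))]
    measure = args[MEASURE_POSITION]
    if NUMBER.match(measure):
        return ["argument %d should name a significance measure such as LMI, "
                "but is the number %s" % (MEASURE_POSITION + 1, measure)]
    return []


# Usage: the recommended call passes, a call with a number in place of LMI does not.
print(check_args("dataset 1000 0 0 1000 LMI 200".split()))
print(check_args("dataset 1000 0 0 1000 1000 200".split()))
```

The first call reports no problems, while the second reports that a number was given where the significance measure is expected, which is the situation that leads to the missing pig/FreqSig1000.pig file.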