Calculate Semantic Similarity

Name             Modified      Size        Downloads / Week
semantics.arj    2014-07-02    54.5 MB
semantics.zip    2014-07-02    94.2 MB
Readme.txt       2014-07-02    2.6 kB
Totals: 3 Items                148.6 MB    0

CHANGES:
The bug regarding the call to a dictionary folder has been fixed.
I have added two more folders with different dictionaries:
collocation_dict
category_dict
-To use one of these dictionaries instead of the default WORDS dictionary, make the corresponding adjustments in ./semantics/Compare.java.
-For example, to use collocation_dict instead, make the following changes:
Line 32:
File f = new File("./collocation_dict/"+Character.toString(a.charAt(0))+"/"+Character.toString(a.charAt(1))+"/"+a+".txt");

Line 36:
File f = new File("./collocation_dict/"+Character.toString(b.charAt(0))+"/"+Character.toString(b.charAt(1))+"/"+b+".txt");

Line 81:
String filename = "./collocation_dict/"+Character.toString(g.charAt(0))+"/"+Character.toString(g.charAt(1))+"/"+g+".txt";
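
The three edits above repeat the same path pattern, so one way to avoid touching several lines on every switch is to build the path in a single place. The following is only a sketch: the class DictionaryPaths, the constant DICTIONARY and the method fileFor are hypothetical names that do not exist in the distributed sources, and the folder name WORDS for the default dictionary is assumed from the note above. Unlike the original lines, it rejects words shorter than two characters, since charAt(1) would otherwise throw.

```java
import java.io.File;

// Hypothetical helper, NOT part of the distributed sources.
class DictionaryPaths {
    // Folder to read from. "collocation_dict" and "category_dict" come from the
    // change notes; "WORDS" as the default folder name is an assumption.
    static final String DICTIONARY = "collocation_dict";

    // Builds ./<dictionary>/<first char>/<second char>/<word>.txt,
    // the same layout used in the Compare.java lines above.
    static File fileFor(String dictionary, String word) {
        if (word == null || word.length() < 2) {
            // Guard added for this sketch: the layout needs two leading characters.
            throw new IllegalArgumentException("word needs at least two characters: " + word);
        }
        return new File("./" + dictionary + "/" + word.charAt(0) + "/"
                + word.charAt(1) + "/" + word + ".txt");
    }

    public static void main(String[] args) {
        // Prints the path that would be opened for the word "station".
        System.out.println(fileFor(DICTIONARY, "station").getPath());
    }
}
```

With a helper like this, switching dictionaries would mean changing the DICTIONARY constant only, provided the three lines in Compare.java are rewritten to call it.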


PLEASE BE AWARE: There are 98,376 words in the collection, and each word has its own directory path and TXT file, which keeps lookups fast in Java.
However, the large number of files may cause development environments such as Eclipse or NetBeans to slow down, freeze, or crash, so use it at your own risk!



Here is an example run of USAGE.java (which should be simple and self-explanatory for any Java developer; no Javadocs are necessary):

Similarity between the sentences
-Pete and Rob have found a dog near the station.
-Pete and Rob have never found a dog near the station.
 is: 1.0000000000000002

Similarity between the sentences
-Patricia found a dog near the station.
-It was a dog who found Pete and Rob under the snow.
 is: 0.7319250547113999


Similarity between the sentences
-Patricia found a dog near the station.
-I am fine, thanks!
 is: 0.0


Similarity between the sentences
-Hello there, how are you?
-I am fine, thanks!
 is: 0.28819520885211747



This program finds the semantic similarity between sentences, according to the categories of their words.
It is an enhancement of the vector-space analysis found within Classifier4J, which does not take the semantic meanings of words into account.
Furthermore, the vector-space analysis of Classifier4J does not work well with short sentences, while this enhancement does.
A new dictionary of categories was developed from the EOWL word list, with the categories for each word calculated from DISCO's semantics.
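
To illustrate why a plain vector-space comparison struggles with short sentences, here is a minimal sketch of cosine similarity over raw word counts. It is not Classifier4J's code; the class TermVectorSimilarity and its methods are hypothetical, and the tokenizer (lower-casing and splitting on non-letters) is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of word-count cosine similarity; not Classifier4J's API.
class TermVectorSimilarity {
    // Counts how often each lower-cased word occurs in the sentence.
    static Map<String, Integer> termCounts(String sentence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : sentence.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine of the angle between two sparse count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // an empty sentence has no direction to compare
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static double similarity(String s1, String s2) {
        return cosine(termCounts(s1), termCounts(s2));
    }

    public static void main(String[] args) {
        System.out.println(similarity("Hello there, how are you?", "I am fine, thanks!"));
    }
}
```

The two sentences in main share no words, so every product in the dot sum is missing and this sketch scores them 0.0, whereas the example run above reports a non-zero similarity for the same pair.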
The semantic categories came from: 

en-BNC-20080721          119 million tokens    122,000    1.7 GB
en-PubMedOA-20070501     181 million tokens    60,000     864 MB
en-wikipedia-20080101    267 million tokens    220,000    5.9 GB

The result is a tool that is stable and several gigabytes smaller than DISCO, yet more powerful than Classifier4J's vector-space analysis.
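
The general idea of comparing sentences by the categories of their words, rather than by the words themselves, can be sketched as follows. This is an assumption-laden illustration and not the code in Compare.java: the class CategoryVectorSimilarity is hypothetical, the TOY_CATEGORIES map is invented purely for the example and stands in for the on-disk dictionary, and the actual weighting used by the tool may differ. It needs Java 9 or later for Map.of and List.of.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of category-based cosine similarity; not the tool's code.
class CategoryVectorSimilarity {
    // Invented toy data standing in for the real dictionary files.
    static final Map<String, List<String>> TOY_CATEGORIES = Map.of(
            "hello", List.of("greeting"),
            "thanks", List.of("greeting", "politeness"),
            "dog", List.of("animal"),
            "puppy", List.of("animal"));

    // Replaces each known word by its categories and counts the categories.
    static Map<String, Integer> categoryCounts(String sentence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : sentence.toLowerCase().split("[^a-z]+")) {
            // Words missing from the toy map contribute nothing.
            for (String category : TOY_CATEGORIES.getOrDefault(token, List.of())) {
                counts.merge(category, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine of the angle between two sparse count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // no known categories on one side
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static double similarity(String s1, String s2) {
        return cosine(categoryCounts(s1), categoryCounts(s2));
    }

    public static void main(String[] args) {
        // No shared words, but both sentences map to the toy category "animal".
        System.out.println(similarity("A dog barked.", "The puppy slept."));
    }
}
```

Because the vectors are built from categories, two short sentences with no words in common can still overlap, which is the property a word-count comparison lacks.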





Source: Readme.txt, updated 2014-07-02