| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| semantics.arj | 2014-07-02 | 54.5 MB | |
| semantics.zip | 2014-07-02 | 94.2 MB | |
| Readme.txt | 2014-07-02 | 2.6 kB | |
| Totals: 3 Items | | 148.6 MB | 0 |
CHANGES:
The bug regarding the call to a dictionary folder has been fixed.
I have added two more folders with different dictionaries:
collocation_dict
category_dict
-Should you use one of these dictionaries instead of the default WORDS dictionary, make the corresponding adjustments in ./semantics/Compare.java
-Let's assume you want to use collocation_dict instead. Make the following changes:
Line 32:
File f = new File("./collocation_dict/"+Character.toString(a.charAt(0))+"/"+Character.toString(a.charAt(1))+"/"+a+".txt");
Line 36:
File f = new File("./collocation_dict/"+Character.toString(b.charAt(0))+"/"+Character.toString(b.charAt(1))+"/"+b+".txt");
Line 81:
String filename = "./collocation_dict/"+Character.toString(g.charAt(0))+"/"+Character.toString(g.charAt(1))+"/"+g+".txt";
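The three edits above repeat the same path pattern, so the folder name has to be changed in three places. As a minimal sketch (not part of the shipped Compare.java), the pattern could be kept in one helper. The class name DictionaryPaths, the constant DICT_DIR and the method dictionaryFile are hypothetical names used only for illustration. Rejecting words shorter than two characters is also an assumption, made because the pattern reads charAt(1).

```java
import java.io.File;

public class DictionaryPaths {
    // Hypothetical constant: the single place to switch dictionaries,
    // e.g. "./category_dict/" instead of "./collocation_dict/".
    static final String DICT_DIR = "./collocation_dict/";

    // Builds <dictDir><first char>/<second char>/<word>.txt,
    // the same layout used in the README lines above.
    static File dictionaryFile(String dictDir, String word) {
        // Assumption: words shorter than two characters are rejected,
        // because the layout needs both charAt(0) and charAt(1).
        if (word == null || word.length() < 2) {
            throw new IllegalArgumentException(
                    "word needs at least two characters: " + word);
        }
        return new File(dictDir
                + word.charAt(0) + "/"
                + word.charAt(1) + "/"
                + word + ".txt");
    }

    public static void main(String[] args) {
        File f = dictionaryFile(DICT_DIR, "station");
        // Prints the path with the platform's separator character.
        System.out.println(f.getPath());
    }
}
```

With a helper like this, switching to category_dict would mean changing only the DICT_DIR value.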
PLEASE BE AWARE: There are 98,376 words in the collection, and each word has its own directory path and TXT file, which keeps lookups fast in Java.
However, this may cause some development environments, such as Eclipse or NetBeans, to slow down, freeze, or crash, so use it at your own risk!
Here is an example run of USAGE.java (which should be simple and self-explanatory for any Java developer; no Javadoc necessary):
Similarity between the sentences
-Pete and Rob have found a dog near the station.
-Pete and Rob have never found a dog near the station.
is: 1.0000000000000002
Similarity between the sentences
-Patricia found a dog near the station.
-It was a dog who found Pete and Rob under the snow.
is: 0.7319250547113999
Similarity between the sentences
-Patricia found a dog near the station.
-I am fine, thanks!
is: 0.0
Similarity between the sentences
-Hello there, how are you?
-I am fine, thanks!
is: 0.28819520885211747
This program finds semantic similarities between sentences, based on the categories of their words.
It is an enhancement of the Vector-Space analysis found within Classifier4J, which does not take the semantic meanings of words into account.
Furthermore, Classifier4J's Vector-Space analysis does not work well with short sentences, while this enhancement does.
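To illustrate the limitation described above, here is a minimal sketch of a plain vector-space comparison: cosine similarity over word counts. It is neither Classifier4J's code nor this program's method. The class name PlainVectorSpace and its methods are hypothetical, and the simple letters-only tokenizer is an assumption.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class PlainVectorSpace {
    // Lower-cases the sentence and counts each word (term-frequency vector).
    static Map<String, Integer> termFrequencies(String sentence) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Cosine of the angle between the two term-frequency vectors.
    static double cosine(String s1, String s2) {
        Map<String, Integer> a = termFrequencies(s1);
        Map<String, Integer> b = termFrequencies(s2);

        // Only words present in both sentences contribute to the dot product.
        Set<String> shared = new HashSet<>(a.keySet());
        shared.retainAll(b.keySet());

        double dot = 0.0;
        for (String w : shared) {
            dot += a.get(w) * b.get(w);
        }
        double normA = 0.0;
        for (int c : a.values()) {
            normA += c * c;
        }
        double normB = 0.0;
        for (int c : b.values()) {
            normB += c * c;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty sentence has no direction to compare
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Neither pair shares a single word, so both lines print 0.0.
        System.out.println(cosine("Patricia found a dog near the station.",
                                  "I am fine, thanks!"));
        System.out.println(cosine("Hello there, how are you?",
                                  "I am fine, thanks!"));
    }
}
```

Because only literal word overlap counts, the greeting and its reply score 0.0 here, whereas the example run above reports 0.28819520885211747 for the same pair.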
A new dictionary of categories was developed from the EOWL word list, with the categories for each word calculated from DISCO's semantics.
The semantic categories came from:
en-BNC-20080721 119 million tokens 122,000 1.7 GB
en-PubMedOA-20070501 181 million tokens 60,000 864 MB
en-wikipedia-20080101 267 million tokens 220,000 5.9 GB
The result is a tool that is stable and several gigabytes smaller than DISCO, yet more powerful than Classifier4J's Vector-Space analysis.