Name | Modified | Size | Downloads / Week |
---|---|---|---|
semantics.arj | 2014-07-02 | 54.5 MB | |
semantics.zip | 2014-07-02 | 94.2 MB | |
Readme.txt | 2014-07-02 | 2.6 kB | |
Totals: 3 Items | | 148.6 MB | 0 |
CHANGES:
The bug regarding the call to a dictionary folder has been fixed. I have added two more folders with different dictionaries:
    collocation_dict
    category_dict

- Should you use one of these dictionaries instead of the default WORDS dictionary, make the proper adjustments in ./semantics/Compare.java
- Let's assume you want to use collocation_dict instead. Make the following changes:

    Line 32: File f = new File("./collocation_dict/"+Character.toString(a.charAt(0))+"/"+Character.toString(a.charAt(1))+"/"+a+".txt");
    Line 36: File f = new File("./collocation_dict/"+Character.toString(b.charAt(0))+"/"+Character.toString(b.charAt(1))+"/"+b+".txt");
    Line 81: String filename = new File("./collocation_dict/"+Character.toString(g.charAt(0))+"/"+Character.toString(g.charAt(1))+"/"+g+".txt");

Only the folder name inside each path should need to change; keep the rest of each line as it appears in your copy of Compare.java.

PLEASE BE AWARE:
There are 98,376 words in the collection, and each word has a unique directory path and a TXT file, which is great for Java's speed. However, this may cause some development environments, such as Eclipse or NetBeans, to slow down, freeze or crash, so use it at your own risk!

Here is an example run of USAGE.java (which should be simple and self-explanatory for any Java developer, no javadocs necessary):

Similarity between the sentences
-Pete and Rob have found a dog near the station.
-Pete and Rob have never found a dog near the station.
is: 1.0000000000000002

Similarity between the sentences
-Patricia found a dog near the station.
-It was a dog who found Pete and Rob under the snow.
is: 0.7319250547113999

Similarity between the sentences
-Patricia found a dog near the station.
-I am fine, thanks!
is: 0.0

Similarity between the sentences
-Hello there, how are you?
-I am fine, thanks!
is: 0.28819520885211747

This program finds the semantic similarity between sentences according to the categories of their words. It is an enhancement of the Vector-Space analysis found within Classifier4J, which does not take the semantic meanings of the words into account.
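As a rough illustration of the vector-space idea, the sketch below compares two sentences by the cosine of their category-count vectors. It is not the code in Compare.java: the class name CategoryCosine, its method names, and the tiny in-memory word-to-category map are all assumptions made for this example, whereas the real tool reads its categories from the dictionary folders.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: cosine similarity over category counts.
// This is NOT the project's Compare.java, only an illustration of the idea.
final class CategoryCosine {

    // Counts how often each category occurs in a sentence.
    // wordToCategory is an assumed in-memory stand-in for the dictionary files.
    static Map<String, Integer> categoryCounts(String sentence, Map<String, String> wordToCategory) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty token
            }
            String category = wordToCategory.get(token);
            if (category != null) {
                counts.merge(category, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine of the angle between two sparse count vectors; 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up categories, for demonstration only.
        Map<String, String> lexicon = new HashMap<>();
        lexicon.put("dog", "animal");
        lexicon.put("cat", "animal");
        lexicon.put("station", "place");

        Map<String, Integer> first = categoryCounts("A dog near the station.", lexicon);
        Map<String, Integer> second = categoryCounts("The cat at the station.", lexicon);
        System.out.println(cosine(first, second));
    }
}
```

In this sketch the two sentences share no content word except "station", yet they produce the same category counts, so the score is 1 up to floating-point rounding. A comparison based on the words alone would score them lower.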
Furthermore, the Vector-Space analysis of Classifier4J does not work well with short sentences, while this enhancement does. A new dictionary of categories was developed from the EOWL word list, with the categories for each word calculated from DISCO's semantics. The semantic categories came from:

    en-BNC-20080721         119 million tokens    122,000    1.7 GB
    en-PubMedOA-20070501    181 million tokens     60,000    864 MB
    en-wikipedia-20080101   267 million tokens    220,000    5.9 GB

The result is a tool that is stable and several gigabytes smaller than DISCO, yet more powerful than Classifier4J's Vector-Space analysis.
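For the dictionary switch described under CHANGES, the path pattern repeated on lines 32, 36 and 81 could be kept in one place so that the folder name is changed only once. The helper below is a sketch of that idea, not part of the distributed sources: the class name DictionaryPaths, the method entryFor and the constant DICTIONARY_ROOT are hypothetical, and rejecting words shorter than two characters is a choice made for this example.

```java
import java.io.File;

// Hypothetical helper, not part of the distributed sources.
final class DictionaryPaths {

    // Assumed constant: point it at ./category_dict or the default folder as needed.
    static final String DICTIONARY_ROOT = "./collocation_dict";

    // Builds <root>/<first letter>/<second letter>/<word>.txt,
    // the same layout as the paths quoted for Compare.java.
    static File entryFor(String root, String word) {
        if (word == null || word.length() < 2) {
            // The quoted lines call charAt(1), so shorter words cannot be mapped this way.
            throw new IllegalArgumentException("word must have at least two characters");
        }
        return new File(root + "/" + word.charAt(0) + "/" + word.charAt(1) + "/" + word + ".txt");
    }

    public static void main(String[] args) {
        File entry = entryFor(DICTIONARY_ROOT, "station");
        System.out.println(entry.getPath() + " exists: " + entry.exists());
    }
}
```

With a helper like this, switching dictionaries would mean editing DICTIONARY_ROOT alone instead of three separate lines.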