Using Other Knowledge Bases
S-Match uses two interfaces to extract knowledge out of natural language labels:
- ILinguisticOracle provides access to linguistic knowledge, such as lemmas and senses
- ISenseMatcher provides access to background knowledge, such as relations between senses
To access a knowledge base, one should provide an implementation of these two interfaces.
The default configuration file, s-match.properties, provides a default linguistic oracle and default background knowledge, both based on WordNet 2.1. WordNet 2.1 is accessed through the extJWNL library, which is configured using the file_properties.xml configuration file.
Using Other Wordnets
S-Match uses extJWNL to access WordNet-like databases. extJWNL offers several ways to access the database files; two of them are of interest here:
- file-based access: dictionary files are accessed as they are. This method is slower, but requires little memory.
- map-based access: dictionary files are converted first into HashMaps, serialized, and then accessed. This method is faster, but requires more memory.
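The trade-off between the two access modes can be illustrated with a small, self-contained sketch. It does not use extJWNL and does not read the real WordNet dictionary format: the class AccessModesSketch, its method names, and the toy "lemma, tab, gloss" file layout are all assumptions made for illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy "lemma<TAB>gloss" file, not the real WordNet dict format.
class AccessModesSketch {

    // File-based style: scan the file on every lookup; nothing is kept in memory.
    static String lookupInFile(Path dict, String lemma) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(dict, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab > 0 && line.substring(0, tab).equals(lemma)) {
                    return line.substring(tab + 1);
                }
            }
        }
        return null; // lemma not found
    }

    // Map-based style: read everything once into a HashMap, then answer from memory.
    static Map<String, String> loadIntoMap(Path dict) throws IOException {
        Map<String, String> map = new HashMap<>();
        for (String line : Files.readAllLines(dict, StandardCharsets.UTF_8)) {
            int tab = line.indexOf('\t');
            if (tab > 0) {
                map.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return map;
    }

    public static void main(String[] args) throws IOException {
        Path dict = Files.createTempFile("toy-dict", ".txt");
        Files.write(dict, List.of("lake\tbody of water", "river\tlarge stream"),
                StandardCharsets.UTF_8);

        // Each call re-reads the file: slow per lookup, minimal memory.
        System.out.println(lookupInFile(dict, "river"));

        // One pass to load, then constant-time lookups: fast, but the data lives in memory.
        Map<String, String> map = loadIntoMap(dict);
        System.out.println(map.get("river"));

        Files.delete(dict);
    }
}
```

Both lookups return the same entry; only the cost profile differs. With a dictionary the size of GeoWordNet, the in-memory variant is where the larger heap requirement comes from.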
S-Match uses WordNet-like dictionaries as linguistic knowledge (during preprocessing, or "offline" processing, via ILinguisticOracle) and as background knowledge (during reasoning, or "online" processing via ISenseMatcher). For the second interface, we provide two implementations:
- WordNet, which uses extJWNL via file-based or map-based access.
- InMemoryWordNetBinaryArray, which uses an internal cache.
The second implementation significantly speeds up the "online" processing. These implementations give flexibility in choosing between speed and memory requirements.
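To give an intuition for why an in-memory cache speeds up "online" processing, here is a minimal sketch of one possible layout: relation pairs packed into a sorted array and queried by binary search. This is only a guess prompted by the name InMemoryWordNetBinaryArray; we do not claim it reflects the actual S-Match implementation. The class BinaryArrayCacheSketch and the integer sense identifiers are hypothetical.

```java
import java.util.Arrays;

// Hypothetical cache layout: (source, target) sense pairs packed into sorted longs.
class BinaryArrayCacheSketch {
    private final long[] keys; // sorted packed pairs

    BinaryArrayCacheSketch(int[][] pairs) {
        keys = new long[pairs.length];
        for (int i = 0; i < pairs.length; i++) {
            keys[i] = pack(pairs[i][0], pairs[i][1]);
        }
        Arrays.sort(keys); // sort once, at cache creation time
    }

    // Combine two 32-bit identifiers into one 64-bit key.
    static long pack(int source, int target) {
        return ((long) source << 32) | (target & 0xFFFFFFFFL);
    }

    // O(log n) membership test, with no disk access and no per-entry objects.
    boolean contains(int source, int target) {
        return Arrays.binarySearch(keys, pack(source, target)) >= 0;
    }

    public static void main(String[] args) {
        // Toy relation: sense 3 is related to senses 7 and 4, sense 1 to sense 2.
        BinaryArrayCacheSketch cache =
                new BinaryArrayCacheSketch(new int[][] {{3, 7}, {1, 2}, {3, 4}});
        System.out.println(cache.contains(3, 4));
        System.out.println(cache.contains(4, 3));
    }
}
```

The cost of such a design is paid up front: building and holding the arrays needs memory, which is consistent with the larger heap settings used when creating the cache.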
The GeoWordNet
GeoWordNet provides a large and rich knowledge base in WordNet format, and it can be used with S-Match.
Using InMemoryWordNetBinaryArray
Configuring S-Match
Here we provide a step-by-step guide and sample configuration files for configuring S-Match to use GeoWordNet.
- Edit the bin\match-manager.cmd (or .sh) script and change -Xmx256M -Xms256M to -Xmx6G -Xms6G to allow more memory.
- Create the data\wordnet\geowordnet folder where the knowledge base will be stored. Create the following subfolders:
- dict for dictionary files,
- cache for InMemoryWordNetBinaryArray cache files, and
- map for map-based version of dictionary files
- Download the full version of GeoWordNet in dict format, geowordnet-dict-full-20110330.zip, and unpack it into the data\wordnet\geowordnet\dict folder. Alternatively, you might use a smaller compat version, which contains less data but also has smaller memory requirements.
- In the conf folder, create a file_properties-gwn.xml configuration file for extJWNL. This file provides file-based access to the dictionary files.
- In the conf folder, create an s-match-gwn.properties configuration file for S-Match. This file will point S-Match to the GeoWordNet knowledge base.
- In the conf folder, create an s-match-create-wn-caches-gwn.properties configuration file for S-Match. This file will be used to create a cache for the GeoWordNet knowledge base.
- To cache the new knowledge base, run the following command in the bin folder. It should create several files in the data\wordnet\geowordnet\cache folder:
match-manager.cmd wntoflat -config=..\conf\s-match-create-wn-caches-gwn.properties
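The folder layout from step 2 can also be created programmatically. The following is a minimal sketch using only the standard java.nio.file API; the class GeoWordNetFolders and its createLayout method are hypothetical helpers, not part of S-Match, and the distribution root is passed in as an argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: creates data/wordnet/geowordnet/{dict,cache,map}.
class GeoWordNetFolders {

    static Path createLayout(Path distributionRoot) throws IOException {
        Path base = distributionRoot.resolve(Paths.get("data", "wordnet", "geowordnet"));
        // dict: dictionary files, cache: InMemoryWordNetBinaryArray caches, map: map-based files
        for (String sub : new String[] {"dict", "cache", "map"}) {
            Files.createDirectories(base.resolve(sub)); // no error if it already exists
        }
        return base;
    }

    public static void main(String[] args) throws IOException {
        // Without an argument, demonstrate on a temporary directory instead of a real install.
        Path root = args.length > 0
                ? Paths.get(args[0])
                : Files.createTempDirectory("s-match-demo");
        System.out.println("Created " + createLayout(root));
    }
}
```

Run it with the path of the S-Match distribution as the only argument to create the folders in place.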
Running the matching
Now, to run the matching, execute bin\match-manager.cmd as usual, adding -config=..\conf\s-match-gwn.properties so that S-Match uses the new knowledge base. Remember that for matching to work correctly, preprocessing and matching must be done using the same knowledge base. For example, to match the example classifications c.txt and w.txt using the GeoWordNet knowledge base, run the following:
- to convert the files into XML format which stores the preprocessing information:
match-manager.cmd convert ..\test-data\cw\c.txt ..\test-data\cw\c.xml -config=..\conf\s-match-Tab2XML.properties
match-manager.cmd convert ..\test-data\cw\w.txt ..\test-data\cw\w.xml -config=..\conf\s-match-Tab2XML.properties
- to preprocess the contexts using the GeoWordNet knowledge base:
match-manager.cmd offline ..\test-data\cw\c.xml ..\test-data\cw\c-gwn.xml -config=..\conf\s-match-gwn.properties
match-manager.cmd offline ..\test-data\cw\w.xml ..\test-data\cw\w-gwn.xml -config=..\conf\s-match-gwn.properties
- to match the contexts using the GeoWordNet knowledge base:
match-manager.cmd online ..\test-data\cw\c-gwn.xml ..\test-data\cw\w-gwn.xml ..\test-data\cw\result-cw-gwn.txt -config=..\conf\s-match-gwn.properties
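The sequence above can be captured in a small dry-run sketch that builds the five command lines in order. It does not execute anything; it only prints the commands. The class MatchPipelineSketch is hypothetical, and the script name and paths are copied from the commands above. Its one design point is that a single configuration argument feeds both the offline and the online steps, so preprocessing and matching cannot accidentally use different knowledge bases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Dry-run sketch: builds the command lines from the steps above, does not run them.
class MatchPipelineSketch {
    // Values copied from the walkthrough; adjust them to your distribution.
    static final String SCRIPT = "match-manager.cmd";
    static final String DATA = "..\\test-data\\cw\\";
    static final String TAB2XML = "-config=..\\conf\\s-match-Tab2XML.properties";

    // kbConfig is used for BOTH the offline and the online steps.
    static List<List<String>> commands(String kbConfig) {
        String config = "-config=" + kbConfig;
        String[] names = {"c", "w"};
        List<List<String>> steps = new ArrayList<>();
        for (String n : names) { // convert the text files to XML
            steps.add(Arrays.asList(SCRIPT, "convert",
                    DATA + n + ".txt", DATA + n + ".xml", TAB2XML));
        }
        for (String n : names) { // preprocess with the chosen knowledge base
            steps.add(Arrays.asList(SCRIPT, "offline",
                    DATA + n + ".xml", DATA + n + "-gwn.xml", config));
        }
        // match the two preprocessed contexts with the same knowledge base
        steps.add(Arrays.asList(SCRIPT, "online",
                DATA + "c-gwn.xml", DATA + "w-gwn.xml", DATA + "result-cw-gwn.txt", config));
        return steps;
    }

    public static void main(String[] args) {
        for (List<String> step : commands("..\\conf\\s-match-gwn.properties")) {
            System.out.println(String.join(" ", step));
        }
    }
}
```

Passing a different properties file to commands changes the knowledge base for the preprocessing and matching steps together, while the conversion steps keep using s-match-Tab2XML.properties.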
Using WordNet
This configuration requires less memory for conversion than the previous one, but it is slower during matching.
- Follow steps 2-6 from the Configuring S-Match section above.
- Ensure that the file MatchManager.java contains the following lines. Note the order: the multiword cache is created first.
private void convertWordNetToFlat(Properties properties) throws SMatchException {
    // First call: preprocessor caches, including the multiword cache.
    DefaultContextPreprocessor.createWordNetCaches(CONTEXT_PREPROCESSOR_KEY, properties);
    // Second call: sense matcher caches.
    InMemoryWordNetBinaryArray.createWordNetCaches(GLOBAL_PREFIX + SENSE_MATCHER_KEY, properties);
}
- Run ant jar in the main folder of the distribution to compile the sources and update s-match.jar. See HowToBuild for details.
- Run the partial conversion to create the multiword cache:
match-manager.cmd wntoflat -config=..\conf\s-match-create-wn-caches-gwn.properties
- Stop the conversion after the multiword cache (usually stored in data/geowordnet/cache/multiwords.hash) is created:
Creating WordNet caches...
Creating multiword hash...
Multiwords: xxx
Done
- In the conf folder, create an s-match-gwn-wn.properties configuration file for S-Match. This file will point S-Match to the GeoWordNet knowledge base.
- When running the matching (see steps b and c above), use the s-match-gwn-wn.properties configuration file:
- to convert the files into XML format: the same as above
- to preprocess the contexts using the GeoWordNet knowledge base:
match-manager.cmd offline ..\test-data\cw\c.xml ..\test-data\cw\c-gwn.xml -config=..\conf\s-match-gwn-wn.properties
match-manager.cmd offline ..\test-data\cw\w.xml ..\test-data\cw\w-gwn.xml -config=..\conf\s-match-gwn-wn.properties
- to match the contexts using the GeoWordNet knowledge base:
match-manager.cmd online ..\test-data\cw\c-gwn.xml ..\test-data\cw\w-gwn.xml ..\test-data\cw\result-cw-gwn.txt -config=..\conf\s-match-gwn-wn.properties
The Stanford Wordnet Project
The Stanford Wordnet Project provides several automatically created knowledge bases in WordNet format, including sense-clustered and augmented wordnets. It is possible to use these wordnets with S-Match in a similar way to GeoWordNet.
The MultiWordNet
MultiWordNet provides WordNet-like semantic knowledge bases in several languages. It can be used with S-Match via extJWNL; there is an import procedure. If you already have a MultiWordNet license and database files, please contact us.
Attachments
- s-match-create-wn-caches-gwn.properties (1.9 KB), added by autayeu 8 months ago: S-Match configuration to cache GeoWordNet
- file_properties-gwn.xml (2.9 KB), added by autayeu 8 months ago: extJWNL configuration for file-based access to GeoWordNet
- s-match-gwn.properties (8.0 KB), added by autayeu 8 months ago: S-Match configuration to use GeoWordNet
- s-match-gwn-wn.properties (7.1 KB), added by autayeu 8 months ago: S-Match configuration file for matching with GeoWordNet using WordNet.java as the ISenseMatcher implementation