|Version 82 (modified by autayeu, 3 years ago)|
Semantic Matching Algorithms
Generic Semantic Matching
S-Match is a semantic matching algorithm. It takes two input files that contain tree-like structures and computes the semantic relation for each pair of nodes from the two trees. The computation is performed in four steps:
- Step 1: for all labels L in the two trees, compute the concepts of labels, C@L. The Preprocessor component executes this step.
- Step 2: for all nodes N in the two trees, compute the concepts at nodes, C@N. The Classifier component executes this step.
- Step 3: for all pairs of labels in the two trees, compute the relations among C@Ls. The Element level matcher library component executes this step.
- Step 4: for all pairs of nodes in the two trees, compute the relations among C@Ns. The Structure level matcher component executes this step.
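The four steps can be illustrated with a deliberately simplified sketch. It is not the S-Match implementation: the class FourStepSketch and its methods are hypothetical, a concept of label is approximated by the set of lower-cased tokens, a concept at node by the union of those sets along the path to the root, and the relation symbols (=, <, >, ?) are assumed for illustration only. Matching is reduced to token equality and set containment, whereas the real element level and structure level matchers use linguistic knowledge and logical reasoning.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy illustration of the four steps; NOT the S-Match implementation.
class FourStepSketch {
    // Hypothetical minimal tree node: a label plus a link to the parent.
    static final class Node {
        final String label;
        final Node parent;
        final Set<String> cLab = new TreeSet<>();  // concept of label (step 1)
        final Set<String> cNode = new TreeSet<>(); // concept at node (step 2)

        Node(String label, Node parent) {
            this.label = label;
            this.parent = parent;
        }
    }

    // Step 1: approximate the concept of a label by its lower-cased tokens.
    static void computeConceptOfLabel(Node n) {
        for (String token : n.label.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                n.cLab.add(token);
            }
        }
    }

    // Step 2: the concept at a node is the conjunction of the concepts of
    // labels on the path to the root (step 1 must have run for all ancestors).
    static void computeConceptAtNode(Node n) {
        for (Node cur = n; cur != null; cur = cur.parent) {
            n.cNode.addAll(cur.cLab);
        }
    }

    // Steps 3 and 4, collapsed: atoms match only if the tokens are equal,
    // and a conjunction with more atoms denotes a narrower concept.
    static char relation(Set<String> source, Set<String> target) {
        if (source.equals(target)) return '=';
        if (source.containsAll(target)) return '<'; // source is less general
        if (target.containsAll(source)) return '>'; // source is more general
        return '?';                                 // unknown in this toy
    }
}
```

In this toy, a node "History" under "Courses" is less general than a root node "courses" in the other tree, because its concept at node contains an additional conjunct.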
Minimal Semantic Matching
The mapping returned by the generic S-Match algorithm may contain many redundant links. The minimal mapping algorithm filters out the redundant links. No further filtering of the resulting mapping is possible without losing its property of minimality. For example, the conf/s-match-minimal.properties file contains the settings for minimal matching. A minimal mapping can be computed from scratch by using a suitable tree matcher, or it can be obtained by filtering an existing mapping. The TreeMatcher configuration file key specifies the tree matcher for the matching process. It should point to an implementation of the ITreeMatcher interface, such as the OptimizedStageTreeMatcher class, which computes the minimal mapping.
As an alternative to a minimal tree matcher, a mapping filter can be used to minimize a mapping. The MappingFilter configuration file key selects the filter to apply to the mapping. S-Match implements several filters. They all implement the IMappingFilter interface, and include RedundantMappingFilterEQ for removing redundant relations, RetainRelationsMappingFilter for retaining only relations of a specific type, and RandomSampleMappingFilter for sampling a mapping.
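The idea of a filter that retains only relations of a specific type can be sketched as follows. This is an illustration only and does not use the IMappingFilter interface: the class RetainRelationsSketch, the Link type and the relation symbols are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a relation-retaining filter; not the S-Match class.
class RetainRelationsSketch {
    // Hypothetical mapping element: source node, target node, relation symbol.
    static final class Link {
        final String source;
        final String target;
        final char relation;

        Link(String source, String target, char relation) {
            this.source = source;
            this.target = target;
            this.relation = relation;
        }
    }

    // Returns a new list with the links whose relation symbol occurs in
    // allowedRelations; the input mapping is left unchanged.
    static List<Link> retain(List<Link> mapping, String allowedRelations) {
        List<Link> result = new ArrayList<>();
        for (Link link : mapping) {
            if (allowedRelations.indexOf(link.relation) >= 0) {
                result.add(link);
            }
        }
        return result;
    }
}
```

Returning a new list rather than modifying the input keeps the original mapping available, which is convenient when several filters are compared on the same result.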
Structure Preserving Semantic Matching
In many cases it is desirable to match structurally identical elements of both the source and the target parts of the input. This is especially the case when comparing signatures of functions such as web service descriptions or APIs. Structure Preserving Semantic Matching (SPSM) is a variant of the basic semantic matching algorithm. SPSM can be useful for facilitating the process of automatic web services composition, returning a set of possible mappings between the functions and their parameters.
Currently, SPSM is implemented by SPSMMappingFilter, which uses the mapping results of DefaultNodeMatcher to compute the similarity score between the input lightweight ontologies and then filters the links so that they comply with the SPSM rules.
The configuration file conf/s-match-spsm.properties contains an example setting for SPSM. The TreeMatcher configuration file key specifies the tree matcher for the matching process. It should point to an implementation of the ITreeMatcher interface, in this case the SPSMTreeMatcher class, which computes the Structure Preserving mappings. The key TreeMatcher.SPSMTreeMatcher.nodeMatcher specifies the default node matcher to be used, for example DefaultNodeMatcher. The TreeMatcher.SPSMTreeMatcher.spsmFilter key should point to SPSMMappingFilter.
Command Line Interface (CLI)
The S-Match command line interface consists of several parts: configuration files, commands, their arguments and options. Broadly speaking, the processing follows this scheme, implemented by MatchManager.java, which:
- loads the configuration file (default or specified),
- overrides configuration properties from command line, if any,
- sets the properties (and thus initializes the components),
- takes in the specified command,
- takes in the command arguments,
- executes the command by calling the appropriate API methods.
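The scheme above can be sketched as a small argument parser. This is not the code of MatchManager.java: the class CliSketch and its fields are hypothetical, and only the -config= and -property= option syntax and the default configuration path are taken from this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of the processing scheme; not MatchManager itself.
class CliSketch {
    String configFile = "../conf/s-match.properties"; // default named on this page
    final Properties overrides = new Properties();    // from -property=key=value
    String command;                                   // first non-option argument
    final List<String> arguments = new ArrayList<>(); // remaining non-option arguments

    static CliSketch parse(String... args) {
        CliSketch cli = new CliSketch();
        for (String arg : args) {
            if (arg.startsWith("-config=")) {
                cli.configFile = arg.substring("-config=".length());
            } else if (arg.startsWith("-property=")) {
                String pair = arg.substring("-property=".length());
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    cli.overrides.setProperty(pair.substring(0, eq), pair.substring(eq + 1));
                }
            } else if (cli.command == null) {
                cli.command = arg;
            } else {
                cli.arguments.add(arg);
            }
        }
        return cli;
    }

    // Values given on the command line win over values read from the file.
    Properties effective(Properties fromFile) {
        Properties result = new Properties();
        result.putAll(fromFile);
        result.putAll(overrides);
        return result;
    }
}
```

The sketch stops at the point where the real tool would initialize the components from the effective properties and dispatch the command to the API.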
To run the S-Match sample matching task, execute bin/all-cw.cmd (Windows) or all-cw.sh (Linux). S-Match assumes that it is invoked with bin as the working (current) directory, and all relative paths in the configuration files are written with this assumption. The paths are not hardcoded, however, and can be changed.
The following commands are available in S-Match CLI:
|wntoflat||Converts the WordNet dictionary into an internal binary format and saves it locally to speed up element-level matching. The bin/create-wn-caches script is an example of using this command.|
|convert||input-file output-file||Converts the input-file in the input format into the output-file in the output format. The input and output file names are passed as parameters for this command.|
|offline||input-file output-file||Reads the input-file, computes the concepts of labels and the concepts at nodes (the 1st and 2nd steps of the S-Match algorithm), and renders the output into the output-file.|
|online||source-file target-file output-file||Reads the source-file and the target-file and performs the semantic matching between labels and nodes, that is the 3rd and 4th steps of S-Match algorithm. Writes the results into the output-file.|
|filter||source-file target-file input-file output-file||Reads the mapping between the source-file and the target-file from the input-file, filters it and writes it into the output-file.|
The following options are available in S-Match CLI:
|-config=||file.properties||Specifies the properties file which configures the components and the run-time parameters. See the example configuration files, such as conf/s-match.properties and conf/s-match-Tab2XML.properties.|
|-property=||key=value||Overrides the property key from the current properties file with the value value. It is possible to use it multiple times.|
A configuration file is essential to proper S-Match operation. S-Match has a default configuration file which is read when no -config= option is specified. The default configuration file is loaded from ../conf/s-match.properties.
Configuration files provide a convenient way to configure the system. However, at times it is useful to override one or two of the many options in the configuration file with another value. For example, if one conducts a series of experiments and wants to run a threshold value of a matcher through a certain range, the -property= command line option is a convenient way to accomplish this. The option can be used multiple times, and thus can override several parameters in a configuration file.
The bin/match-manager script provides a convenient way to call S-Match via the command line. It configures the classpath, Log4J, and JVM memory options. So, for example, to run a convert command one can execute:
./match-manager.sh convert ../test-data/cw/c.txt ../test-data/cw/c.xml -config=../conf/s-match-Tab2XML.properties
For further details see MatchManager.java, which provides S-Match command line interface, and sample configuration files in the conf folder of the distribution.
Graphical User Interface (GUI)
S-Match provides a graphical user interface for editing contexts and mappings. This interface can be launched using bin/runGUI.sh or bin/runGUI.cmd.
The functionality of the GUI depends on the currently selected configuration file. Drop-down box 1 lists available configuration files and on selection reconfigures S-Match accordingly. For example, which files will be listed in the open context dialog depends on which context loader is configured in the currently selected configuration file. The same applies to the matching procedure and saving: s-match.properties contains default configuration which loads and saves contexts into *.xml files and creates mappings using default algorithm, s-match-minimal.properties is useful for creating minimal mappings, s-match-Tab2XML.properties is useful for loading files in tab-delimited format.
Text field 2 shows the location of currently loaded mapping and, if the mapping has been modified and has not been saved yet, the * symbol is added to the end of mapping location. Similar text fields are located above the left and right trees, for the source and the target (Text field 5) contexts respectively.
Toolbar 3 contains controls for managing the mapping, such as loading, saving and link creation. To create a new link, select the source node in the source context and the target node in the target context, then execute the "Add Link" action by clicking the respective button or pressing the Ctrl+Alt+Ins key combination.
Toolbar 4 contains controls for managing the source context, such as loading, saving, adding and deleting nodes. A similar toolbar is available on the right side for the target context.
The links between contexts are displayed using link nodes with respective relation icons 8. A link node is located under the node where the link starts. After such a link node is selected, the target of the link is located and visualized in the opposite tree. Link nodes allow the link relation to be edited by choosing the respective relation from the link relation drop-down box 9.
If during tree navigation one selects a link, the corresponding node in the target is located and displayed. Sometimes due to lack of screen space and size of the context the complete path to the root node is not visible and in this case the GUI coalesces the intermediate nodes, as shown by the coalesced nodes icon 6. The names of the coalesced nodes are still displayed (abbreviated when necessary) using coalesced nodes tool tip 7. To uncoalesce the node one can double click the icon or press Uncoalesce all button 10.
Let us walk through the actions needed to create a mapping between the example classifications included in the distribution, test-data/c.txt and test-data/w.txt:
- c.txt and w.txt are in tab-delimited format, therefore one needs to select the configuration s-match-Tab2XML.properties which contains tab-delimited format context loader
- To load c.txt as source context, press Open Source context button, navigate to and select test-data/c.txt
- To load w.txt as target context, press Open Target context button, navigate to and select test-data/w.txt
- To create a default mapping, change the configuration to s-match.properties
- Select Mapping->Create to preprocess the contexts, run the matching and create the mapping
- Select Mapping->Save to save the mapping
If one wants to create a minimal mapping, one should select the s-match-minimal.properties configuration and then run the matching process by selecting Mapping->Create. Similarly, if one needs to save the mapping or context in a different format, load from a different format, or use another matcher, such as SPSM, one should first select the appropriate configuration and then execute the action.
Application Program Interface (API)
HowToCallAPI page explains how to work with S-Match API. The API itself is described in the S-Match API Javadocs. The following sections provide an overview of the data structures and the components of S-Match.
The S-Match framework operates on tree-like data structures. Many inputs for matching, such as database schemas or classifications, can be converted into a tree. We call such inputs contexts. Context information can be accessed through the IContext interface. A context contains nodes, each of which has a parent node and child nodes. Basic node information can be accessed through the INode interface. In addition to the structural information provided by the INode interface, a node contains information about the original label and the formulas. This information is provided by the INodeData interface. It includes the concept of label cLabFormula, the concept at node cNodeFormula, the node id and the set of atomic concepts of label (ACoLs). An atomic concept of label approximately corresponds to one token, and the information about it is provided by the IAtomicConceptOfLabel interface. An atomic concept of label may contain several senses, each represented by an instance of the ISense interface. A sense is identified by its part of speech (POS) and an id in a dictionary (provided by a linguistic oracle, see ILinguisticOracle).
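The structural part of a context can be pictured with a minimal tree sketch. The class ContextSketch and its Node type are hypothetical and much simpler than IContext, INode and INodeData: they keep only the label, the parent link and the children, and omit formulas, ids and atomic concepts of label.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical, simplified picture of a context tree; not the S-Match types.
class ContextSketch {
    static final class Node {
        final String label;
        Node parent;                                // null for the root
        final List<Node> children = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }

        // Creates a child with the given label, links it both ways and returns it.
        Node addChild(String childLabel) {
            Node child = new Node(childLabel);
            child.parent = this;
            children.add(child);
            return child;
        }

        // Labels from the root down to this node, following the parent links.
        List<String> pathFromRoot() {
            LinkedList<String> path = new LinkedList<>();
            for (Node n = this; n != null; n = n.parent) {
                path.addFirst(n.label);
            }
            return path;
        }
    }
}
```

The path from the root matters because the concept at a node is built from the concepts of the labels along that path.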
The result of the matching process is a mapping between two contexts, source and target. A mapping is represented by the IMapping interface. A mapping is basically a set of mapping elements, represented by the IMappingElement interface. Mappings are produced by a mapping factory, represented by the IMappingFactory interface. Two types of mapping implementations are supplied with S-Match: matrix-based and hash-based. Matrix-based implementation MatrixMapping.java is backed by a matrix (see IMatchMatrix and MatchMatrix.java) and provides fast access to the mapping. Hash-based implementation HashMapping.java is backed by a HashSet and provides memory-efficient option for large mappings. An alternative solution is to use a matrix mapping with a sparse matrix implementation (see JavaSparseArray.java).
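The trade-off between the two kinds of mapping can be sketched as follows. The types in MappingSketch are hypothetical and do not reproduce IMapping, MatrixMapping.java or HashMapping.java. In particular, the hash-backed variant here uses a HashMap keyed by the node index pair, which is an assumption made for brevity, while the page states that HashMapping.java is backed by a HashSet.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of matrix-backed versus hash-backed storage.
class MappingSketch {
    static final char IDK = '?'; // assumed symbol for "no known relation"

    interface Mapping {
        void set(int source, int target, char relation);
        char get(int source, int target);
    }

    // Dense storage: constant-time access, memory proportional to
    // sourceSize * targetSize even when few links exist.
    static final class MatrixBacked implements Mapping {
        private final char[][] cells;

        MatrixBacked(int sourceSize, int targetSize) {
            cells = new char[sourceSize][targetSize];
            for (char[] row : cells) {
                Arrays.fill(row, IDK);
            }
        }

        public void set(int source, int target, char relation) {
            cells[source][target] = relation;
        }

        public char get(int source, int target) {
            return cells[source][target];
        }
    }

    // Sparse storage: memory proportional to the number of stored links.
    static final class HashBacked implements Mapping {
        private final Map<Long, Character> links = new HashMap<>();

        // Packs two non-negative int indexes into one long key.
        private static long key(int source, int target) {
            return ((long) source << 32) | (target & 0xffffffffL);
        }

        public void set(int source, int target, char relation) {
            if (relation == IDK) {
                links.remove(key(source, target));
            } else {
                links.put(key(source, target), relation);
            }
        }

        public char get(int source, int target) {
            Character relation = links.get(key(source, target));
            return relation == null ? IDK : relation;
        }

        int storedLinks() {
            return links.size();
        }
    }
}
```

For two contexts with many nodes and few links, the dense variant allocates a cell for every node pair, while the sparse variant stores only the links that exist.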
S-Match framework is built out of components. Components can be instantiated and configured by the MatchManager and other components. A component should implement the IConfigurable interface. A base class for components is provided in the Configurable.java. Each component has its own configuration file key, which is used as a prefix for other component configuration keys.
Components can be normal and global. Global components are introduced to support services reusable by several other components. For example, mappings are produced in several places (3rd and 4th step), therefore it makes sense to make a mapping factory a global component. Global components have the Global. prefix before the name of their configuration key.
S-Match contains the following components:
Classifiers perform the 2nd step of the algorithm. They build the concept at node formulas and convert them into conjunctive normal form (CNF). A classifier should implement the IContextClassifier interface. See IContextClassifier for a list of available implementations.
Matchers compute the semantic relations between concepts of label and concepts at node. There are two types of matchers in S-Match:
Element level matchers compute the relations between concepts of label, that is, implement step 3 of the S-Match algorithm. Element level matchers are usually bundled in a library. An element-level matcher library should implement the IMatcherLibrary interface. See IMatcherLibrary for a list of available implementations.
Structure Level or Tree matchers compute the relations between concepts at node, that is, implement step 4 of S-Match algorithm. A tree matcher should implement the ITreeMatcher interface. See ITreeMatcher for a list of available implementations.
This component loads contexts and mappings from file or database.
Oracles and Background Knowledge
This component provides access to linguistic knowledge and background knowledge. The component contains linguistic oracles, which provide access to linguistic knowledge, such as base forms and senses and should implement the ILinguisticOracle interface. See ILinguisticOracle for a list of available implementations.
The background knowledge is accessed via the component which implements the ISenseMatcher interface. The background knowledge component provides access to relations between the senses of words. It should accept the senses given out by the linguistic oracle and provide the relations between them. See ISenseMatcher for a list of available implementations.
The preprocessor creates concepts of label out of natural language labels, that is, performs the 1st step of S-Match algorithm. A preprocessor should implement the IContextPreprocessor interface. See IContextPreprocessor for a list of available implementations.
The renderer renders the context or mapping into file or database. There are two types of renderers in S-Match:
Mapping renderers render the mapping, which is the result of matching, into a file or a database. A mapping renderer should implement the IMappingRenderer interface. See IMappingRenderer for a list of available implementations.
Configuration files define the implementations of the components and constants used to run the program. There are three types of configuration in S-Match.
The conf/s-match.properties, conf/s-match-minimal.properties and conf/s-match-Tab2XML.properties files are examples of configuration settings for different matching scenarios. They configure S-Match components such as the classifier, loader, renderer, preprocessor, matcher, filter and decider.
MatchManager instantiates and configures the initial list of components, which, in turn, read their configuration and might instantiate their own subcomponents. Only those components described in the configuration file are instantiated. Thus it is possible to avoid loading of relatively heavy components when they are not needed or to manage a memory footprint of S-Match by using appropriate components and trading memory for speed or vice versa. The s-match.properties configuration file is the most complete general-purpose configuration file.
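Instantiating only the configured components can be sketched with plain reflection. This is an assumption about the general technique, not a description of how MatchManager does it: the class LazyComponentsSketch is hypothetical, and any class with a public no-argument constructor can stand in for a component class in the configuration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: create only the components that the configuration names.
class LazyComponentsSketch {
    // For each component key, looks up a class name in the configuration and
    // creates an instance; keys without a configured value are skipped.
    static Map<String, Object> instantiate(Properties config, List<String> componentKeys) {
        Map<String, Object> components = new LinkedHashMap<>();
        for (String key : componentKeys) {
            String className = config.getProperty(key);
            if (className == null) {
                continue; // not configured, so nothing is loaded for this key
            }
            try {
                Object instance = Class.forName(className)
                        .getDeclaredConstructor()
                        .newInstance();
                components.put(key, instance);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot create component " + key, e);
            }
        }
        return components;
    }
}
```

Because a missing key simply results in no instance, a configuration that omits a heavy component never pays for loading it.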
A configuration file is a set of key=value pairs. A configuration file is divided into sections, a section per component. A section is a set of keys with a common prefix. A component configuration section has the following format:
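The notion of a section as a set of keys with a common prefix can be sketched with java.util.Properties. The class ConfigSectionSketch is hypothetical, and the class names used as values in the usage example are placeholders rather than real S-Match classes. Only the prefixes TreeMatcher and MappingFilter and the key TreeMatcher.SPSMTreeMatcher.nodeMatcher come from this page.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper that groups configuration keys by component prefix.
class ConfigSectionSketch {
    // Parses key=value lines from a string into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the component key itself plus every key of the form "prefix.*".
    static Map<String, String> section(Properties props, String prefix) {
        Map<String, String> result = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.equals(prefix) || key.startsWith(prefix + ".")) {
                result.put(key, props.getProperty(key));
            }
        }
        return result;
    }
}
```

For example, given the lines TreeMatcher=example.SPSMTreeMatcher, TreeMatcher.SPSMTreeMatcher.nodeMatcher=example.DefaultNodeMatcher and MappingFilter=example.SomeFilter, calling section(props, "TreeMatcher") returns the first two entries and leaves out the filter.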
Components prefixed with Global. can be reused in configuration of other components. These are instantiated once and are accessible to be reused. For example, declare the global component
and reuse it as a subcomponent
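The effect of a global component, created once and shared by every component that refers to it, can be sketched as follows. The class GlobalComponentSketch is hypothetical. The convention that a subcomponent key refers to a global component by using its Global.-prefixed key as the value is an assumption made for this sketch, and the keys and class names in the usage example are placeholders.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical resolver that shares components declared with the Global. prefix.
class GlobalComponentSketch {
    // Stand-in for a real component; it only remembers its class name.
    static final class Component {
        final String className;

        Component(String className) {
            this.className = className;
        }
    }

    private final Properties config;
    private final Map<String, Component> globals = new HashMap<>();
    int created = 0; // number of component instances created so far

    GlobalComponentSketch(Properties config) {
        this.config = config;
    }

    // Resolves the component configured under the given key. A value that
    // starts with "Global." is treated as a reference to a shared instance.
    Component resolve(String key) {
        String value = config.getProperty(key);
        if (value == null) {
            return null;
        }
        if (!value.startsWith("Global.")) {
            return create(value);
        }
        Component shared = globals.get(value);
        if (shared == null) {
            String className = config.getProperty(value);
            if (className == null) {
                throw new IllegalArgumentException("Undeclared global component: " + value);
            }
            shared = create(className);
            globals.put(value, shared);
        }
        return shared;
    }

    private Component create(String className) {
        created++;
        return new Component(className);
    }
}
```

With Global.MappingFactory=example.SomeMappingFactory declared once and two subcomponent keys both set to Global.MappingFactory, both calls to resolve return the same instance and only one component is created.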
The initial list of components that MatchManager tries to instantiate includes:
- Linguistic Oracle, configuration prefix LinguisticOracle, see ILinguisticOracle
- Sense Matcher, configuration prefix SenseMatcher, see ISenseMatcher
- Mapping Factory, configuration prefix MappingFactory, see IMappingFactory
- Context Loader, configuration prefix ContextLoader, see IContextLoader
- Context Renderer, configuration prefix ContextRenderer, see IContextRenderer
- Mapping Loader, configuration prefix MappingLoader, see IMappingLoader
- Mapping Renderer, configuration prefix MappingRenderer, see IMappingRenderer
- Mapping Filter, configuration prefix MappingFilter, see IMappingFilter
- Context Preprocessor, configuration prefix ContextPreprocessor, see IContextPreprocessor
- Context Classifier, configuration prefix ContextClassifier, see IContextClassifier
- Matcher Library, configuration prefix MatcherLibrary, see IMatcherLibrary
- Tree Matcher, configuration prefix TreeMatcher, see ITreeMatcher
For configuration options available for a specific component see comments in s-match.properties or component javadoc or source file.
S-Match uses JWNL (Java WordNet Library) to access WordNet. JWNL is configured by conf/file_properties.xml, which sets the dictionary class path, the WordNet version, the local path of the WordNet dictionary, etc.
S-Match uses Log4J to log information, debug and error messages. conf/log4j.properties configures the Log4J. It configures the logging properties such as date time pattern, maximum file size, file path etc. See Log4J documentation for information about how to edit this configuration file.