Background Knowledge Datasets
Background knowledge is the set of true facts used by semantic tools to draw their conclusions. For instance, it may state that a dog is an animal, or that Rome is a city and is part of Italy.
Recent evaluations of matching systems show that the lack of background knowledge, most often domain-specific knowledge, is one of their key problems. In fact, most state-of-the-art systems show low recall (<30%) on tasks that involve matching thousands of nodes, while on toy examples their recall was most often around 90%.
WordNet, although not specifically designed for this purpose, is the de facto background knowledge in many semantic applications. Unfortunately, its coverage of geographic information (and of domain-specific knowledge in general) is very limited. In addition, WordNet does not provide latitude and longitude coordinates or other information that is of fundamental importance in geo-spatial applications.
To overcome these limitations we created GeoWordNet.
A geo-spatial ontology is an ontology consisting of geo-spatial classes (e.g. lake, city), entities (e.g., Lago di Molveno, Trento), their metadata (e.g. latitude and longitude coordinates) and relations between them (e.g., part-of, instance-of). GeoWordNet is a multilingual geo-spatial ontology built from the full integration of WordNet, GeoNames and the Italian part of MultiWordNet.
Database Version Details
GeoWordNet has been tested with PostgreSQL 8.3. Imported GeoWordNet requires about 361 MB of disk space. GeoWordNet takes about 2.5 minutes to import using the procedure described below.
- Create a new database for GeoWordNet called geowordnet. Choose UTF-8 encoding.
- Create GeoWordNet tables using the following script:
-- Table: concept
-- DROP TABLE concept;
CREATE TABLE concept (
  con_id integer NOT NULL,
  "name" text,
  gloss text,
  lang text NOT NULL,
  provenance text
);

-- Table: relation
-- DROP TABLE relation;
CREATE TABLE relation (
  src_con_id integer NOT NULL,
  trg_con_id integer NOT NULL,
  "name" text NOT NULL,
  lang text
);

-- Table: entity
-- DROP TABLE entity;
CREATE TABLE entity (
  entity_id integer NOT NULL,
  "name" text,
  con_id integer,
  lang text,
  latitude real,
  longitude real,
  provenance text
);

-- Table: part_of
-- DROP TABLE part_of;
CREATE TABLE part_of (
  src_entity_id integer NOT NULL,
  trg_entity_id integer NOT NULL
);

-- Table: alternative_name_eng
-- DROP TABLE alternative_name_eng;
CREATE TABLE alternative_name_eng (
  entity_id integer,
  "name" text
);

-- Table: alternative_name_ita
-- DROP TABLE alternative_name_ita;
CREATE TABLE alternative_name_ita (
  entity_id integer,
  "name" text
);
- Download GeoWordNet from http://sourceforge.net/projects/s-match/files/datasets/
- Extract GeoWordNet into the c:\geowordnet folder.
- Ensure client_encoding is set to UTF-8 and execute the following commands in the SQL Query interface:
copy concept from 'C:/geowordnet/concept.csv' header csv;
copy relation from 'C:/geowordnet/relation.csv' header csv;
copy entity from 'C:/geowordnet/entity.csv' header csv;
copy part_of from 'C:/geowordnet/part_of.csv' header csv;
copy alternative_name_eng from 'C:/geowordnet/alternative_name_eng.csv' header csv;
copy alternative_name_ita from 'C:/geowordnet/alternative_name_ita.csv' header csv;
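If the archive is extracted somewhere other than C:/geowordnet, the copy commands need a different path. The following is a minimal Python sketch that regenerates them for an arbitrary folder. The function name copy_statements is hypothetical, and the sketch assumes each table has a CSV file of the same name, as in the commands above.

```python
# Table names taken from the CREATE TABLE script; each is assumed to have
# a CSV file with the same base name in the extraction folder.
TABLES = [
    "concept",
    "relation",
    "entity",
    "part_of",
    "alternative_name_eng",
    "alternative_name_ita",
]


def copy_statements(folder="C:/geowordnet"):
    """Return one COPY statement per GeoWordNet table for CSV files in `folder`."""
    base = folder.rstrip("/")
    return [
        f"copy {table} from '{base}/{table}.csv' header csv;"
        for table in TABLES
    ]


if __name__ == "__main__":
    # Print the statements so they can be pasted into the SQL Query interface.
    print("\n".join(copy_statements()))
```

The script only prints SQL text and does not connect to the database. The path must be readable by the PostgreSQL server process, because a server-side copy reads the file on the server.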
WordNet or Dict Version Details
GeoWordNet has two dict versions: full and compatible.
The compatible version strictly follows the Princeton WordNet dict format. It is compatible with the original binary wnb.exe from WordNet 2.1 for Windows and should be compatible with UNIX versions too. However, the limitations of the format translate into the following limitations of the release:
- only ASCII names. Names that contain non-ASCII characters were excluded
- contains a maximum of 780 relations per synset
- contains a maximum of 16 lemmas per synset
- contains only 687200 entities
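As an illustration of these limits, the sketch below checks whether a synset would fit the compatible version unchanged. The function names and the idea of checking a synset as a list of lemmas plus a relation count are assumptions made for this example. They do not describe how the release was actually produced.

```python
MAX_LEMMAS = 16      # lemma limit per synset, as listed above
MAX_RELATIONS = 780  # relation limit per synset, as listed above


def is_ascii(name):
    """Return True if every character of `name` is in the ASCII range."""
    return all(ord(ch) < 128 for ch in name)


def fits_compatible(lemmas, relation_count):
    """Check a hypothetical synset against the compatible-version limits."""
    return (
        len(lemmas) <= MAX_LEMMAS
        and relation_count <= MAX_RELATIONS
        and all(is_ascii(lemma) for lemma in lemmas)
    )


if __name__ == "__main__":
    print(fits_compatible(["Trento"], 3))
    print(fits_compatible(["Forlì"], 3))
```

A name such as Trento passes the ASCII check, while a name with an accented character such as Forlì does not.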
This version has been tested with:
The full version follows the spirit of the Princeton WordNet format. It overcomes several limitations while remaining easily readable by most libraries (possibly with a patch). This version breaks the following limitations of the format:
- 8 digit offsets: due to the file size 9 digits are necessary
- relations limit: some synsets have more than 999 relations
- lemmas limit: some synsets have more than 16 lemmas
- ASCII limit: it contains Unicode characters in UTF-8 encoding
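The offset limitation is the one most likely to trip up a strict reader. In the Princeton format a synset offset is an 8-digit, zero-filled decimal byte offset, so it cannot address positions beyond 99,999,999 bytes. The sketch below shows the difference between a strict and a relaxed offset parser. It is an illustration only: the function parse_offset is hypothetical, and the patches mentioned on this page may work differently.

```python
import re

STRICT_OFFSET = re.compile(r"\d{8}")     # classic format: exactly 8 digits
RELAXED_OFFSET = re.compile(r"\d{8,9}")  # also accepts 9-digit offsets


def parse_offset(token, strict=False):
    """Parse a synset offset token, raising ValueError if it is malformed."""
    pattern = STRICT_OFFSET if strict else RELAXED_OFFSET
    if not pattern.fullmatch(token):
        raise ValueError(f"unexpected offset token: {token!r}")
    return int(token)


if __name__ == "__main__":
    print(parse_offset("00001740"))
    print(parse_offset("123456789"))
    try:
        parse_offset("123456789", strict=True)
    except ValueError as err:
        print("strict reader rejects it:", err)
```

A library that hard-codes the 8-digit width behaves like the strict variant and needs a change of this kind before it can read the full version.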
Wherever possible, compatibility with WordNet 3.0 is kept, for example in the ordering of senses and lemmas and in the assignment of lexicographer file names. How compatible this version is with existing libraries depends on how well they support UTF-8 encoding and how strictly they follow the WordNet dict format. This version has been tested with the following libraries:
- extJWNL (with the configuration file gwn_properties.xml, in which lines 21-22 differ from the standard one)
- JWI (with a patch: patch-edu.mit.jwi-2.1.5.txt)
- URCS WordNet Browser (with a patch: patch-urcs-wordnet-browser-1.0.txt)
Offset maps are provided. For both versions (compatible and full):
- geonamesid_geowordnetid.txt: GeoNames location id -> GeoWordNet entity id
- concepts.txt: GeoWordNet concept id -> WordNet 3.0 synset offset
- entities.txt: GeoWordNet entity id -> WordNet 3.0 synset offset
Separately for each version:
- concept-offsets.txt: GeoWordNet concept id -> GeoWordNet 3.0 synset offset
- entity-offsets.txt: GeoWordNet entity id -> GeoWordNet 3.0 synset offset
- offsetmap.noun: WordNet 3.0 synset offset -> GeoWordNet 3.0 synset offset
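The maps can be chained, for example to go from a GeoNames location id to a synset offset through the GeoWordNet entity id. The sketch below assumes that each map file holds one whitespace-separated pair per line. This page does not specify the file syntax, so check the actual files and adjust the parsing if needed. The function names and the ids in the usage example are made up for illustration.

```python
import io


def load_map(lines):
    """Build a dict from lines of the assumed form '<key> <value>'."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines or lines in an unexpected shape
        key, value = parts
        mapping[key] = value  # keep strings so zero-padded offsets survive
    return mapping


def geonames_to_offset(geonames_id, geonames_to_entity, entity_to_offset):
    """Chain two maps: GeoNames id -> entity id -> synset offset (or None)."""
    entity_id = geonames_to_entity.get(geonames_id)
    if entity_id is None:
        return None
    return entity_to_offset.get(entity_id)


if __name__ == "__main__":
    # Made-up ids standing in for geonamesid_geowordnetid.txt and entity-offsets.txt.
    g2e = load_map(io.StringIO("1000001 42\n1000002 43\n"))
    e2o = load_map(io.StringIO("42 000012345\n"))
    print(geonames_to_offset("1000001", g2e, e2o))
```

With real data, the two StringIO objects would be replaced by open file handles for the corresponding map files.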
RDF Version Details
The RDF version of GeoWordNet follows the design of WordNet 3.0 in RDF and can be used in combination with its schema and content. It is linked to the GeoNames Ontology. The geowordnet-geonames.rdf file contains these links, encoded using rdfs:seeAlso from RDF Schema and dc:source from the Dublin Core Metadata Initiative.
This version can be downloaded or used online. For online use, the URIs of this version are dereferenceable through http://geowordnet.semanticmatching.org. Semantic Web browsers can get an RDF/XML rendering of the symmetric concise bounded description of a resource by using the HTTP Accept request header to explicitly ask for the application/rdf+xml type. You can also override the request headers of your browser by adding a .rdf suffix to the URL. For offline use, the .rdf files of the distribution can be loaded similarly to (and together with) those from WordNet 3.0 in RDF, for example by merging them using Jena.
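The two ways of asking for RDF/XML can be sketched with the Python standard library as follows. The resource path used in the example is a placeholder, not a real GeoWordNet URI, and the helper names are hypothetical. The sketch only builds the request and the suffixed URL. It does not contact the server.

```python
from urllib.request import Request

BASE = "http://geowordnet.semanticmatching.org"


def rdf_request(resource_url):
    """Build a request that asks for RDF/XML through the Accept header."""
    return Request(resource_url, headers={"Accept": "application/rdf+xml"})


def rdf_suffix_url(resource_url):
    """Return the URL variant that selects RDF/XML via the .rdf suffix."""
    return resource_url.rstrip("/") + ".rdf"


if __name__ == "__main__":
    # "/some-resource" is a placeholder path used only for illustration.
    url = BASE + "/some-resource"
    request = rdf_request(url)
    print(request.get_header("Accept"))
    print(rdf_suffix_url(url))
```

To actually fetch the description, the request object can be passed to urllib.request.urlopen, provided the service is reachable.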