For my next project I think I will try to bring WordNet http://www.cogsci.princeton.edu/~wn/ into the OpenNLP framework by translating the C code into Java.
I suggest a package name of opennlp.dictionary.wordnet
Although WordNet contains a good general dictionary/lexicon/thesaurus, it does not contain all the information needed for other NLP tasks. Particular NLP algorithms require additional information (e.g. cat for Grok), so I will try to provide ways of merging in the extra information. Particular domains also have specialist words and multiword expressions (MWEs) which are not in WordNet; again, I will provide ways of adding these words and MWEs.
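To make the merging idea concrete, here is a minimal sketch in modern Java. Every name in it (Entry, SimpleDictionary, MapDictionary, LayeredDictionary) is hypothetical and comes from neither WordNet, Grok nor OpenNLP. It assumes that extra information can be modelled as key/value attributes, and that a domain word list can sit behind the general dictionary as a second layer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical entry: a lemma plus free-form attributes (e.g. a "cat" value). */
final class Entry {
    final String lemma;
    final Map<String, String> attributes = new LinkedHashMap<>();

    Entry(String lemma) {
        this.lemma = lemma;
    }
}

/** Hypothetical minimal lookup interface. */
interface SimpleDictionary {
    Optional<Entry> lookup(String lemma);
}

/** In-memory dictionary standing in for WordNet or for a domain word list. */
final class MapDictionary implements SimpleDictionary {
    private final Map<String, Entry> entries = new LinkedHashMap<>();

    MapDictionary add(String lemma, String key, String value) {
        entries.computeIfAbsent(normalise(lemma), Entry::new).attributes.put(key, value);
        return this;
    }

    @Override
    public Optional<Entry> lookup(String lemma) {
        return Optional.ofNullable(entries.get(normalise(lemma)));
    }

    /** Lower-cases the lemma and joins multiword expressions with underscores. */
    static String normalise(String lemma) {
        return lemma.trim().toLowerCase(Locale.ROOT).replace(' ', '_');
    }
}

/** Consults several dictionaries in order and merges their attributes. */
final class LayeredDictionary implements SimpleDictionary {
    private final List<SimpleDictionary> layers = new ArrayList<>();

    LayeredDictionary addLayer(SimpleDictionary layer) {
        layers.add(layer);
        return this;
    }

    @Override
    public Optional<Entry> lookup(String lemma) {
        Entry merged = null;
        for (SimpleDictionary layer : layers) {
            Optional<Entry> hit = layer.lookup(lemma);
            if (hit.isPresent()) {
                if (merged == null) {
                    merged = new Entry(MapDictionary.normalise(lemma));
                }
                // Layers added earlier win when two layers define the same key.
                for (Map.Entry<String, String> e : hit.get().attributes.entrySet()) {
                    merged.attributes.putIfAbsent(e.getKey(), e.getValue());
                }
            }
        }
        return Optional.ofNullable(merged);
    }
}
```

With a general layer added first and a domain layer second, a lookup returns the general attributes plus whatever the domain layer adds. A word or MWE found only in the domain layer is still returned. Whether earlier or later layers should win on a conflict is a design choice; this sketch simply picks the former.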
As a first step in merging with the Gate http://gate.ac.uk/ project, which is focused on Information Extraction (IE), I will use the Gate interface packages (gate, gate.event, gate.util).
Is what I am suggesting sensible? Are there any other features you would like added?
This is interesting indeed. What is the license of their code? They should probably be contacted before doing anything like this to make sure it is okay. Note that it doesn't have to be under the LGPL --- just needs to be OSI certified.
Given your contribution of the multiword expression package to Grok, I'm giving you developer access to OpenNLP. You can set up a new cvs module for this and get working on it as soon as you like. Call it "Dictionary" or "dictionary" if that suits you. Are you familiar enough with CVS to do this?
I'm curious to see how working with Gate will be. I like what I've seen, but I had to stop looking at it because other things became more pressing. Hamish and Kalina gave a presentation at Edinburgh in February, which got me interested in Gate again (the previous time Gann and I had looked at it, Gate wasn't open source). But it looks great, and I think the OpenNLP stuff can be greatly improved by tagging on to their effort and learning from how they do stuff.
I have translated all the C code into Java and got it to compile. The code runs and can look up a word in the index files correctly, but fails when looking up the word in the main database. It should be working in a week or two, but there is a fair amount of tidying up I wish to do before releasing it.
I have not started on the interface and factory methods yet. Neither have I worked out how to add specialised words and additional dictionaries to the main dictionary.
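For readers following along, the two-step lookup described above (index file first, then the main database) can be sketched as below. This is not the translated code. It assumes the layout I understand WordNet's wndb documentation to describe: each index line lists lemma, pos, synset_cnt, p_cnt, the pointer symbols, sense_cnt, tagsense_cnt and then the synset offsets, and each offset is the byte position of a record in the matching data file. Check that against the WordNet version you use. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the field layout is an assumption taken from the wndb format. */
final class WnIndexSketch {

    /** Extracts the synset byte offsets from one line of an index file. */
    static List<Long> synsetOffsets(String indexLine) {
        String[] f = indexLine.trim().split("\\s+");
        int synsetCount = Integer.parseInt(f[2]);
        int pointerCount = Integer.parseInt(f[3]);
        // Skip lemma, pos, synset_cnt, p_cnt, the pointer symbols,
        // then sense_cnt and tagsense_cnt.
        int first = 4 + pointerCount + 2;
        List<Long> offsets = new ArrayList<>();
        for (int i = 0; i < synsetCount; i++) {
            offsets.add(Long.parseLong(f[first + i]));
        }
        return offsets;
    }

    /**
     * Reads the data-file record at the given byte offset and checks that
     * the record's first field repeats that offset.
     */
    static String readSynsetLine(RandomAccessFile data, long offset) throws IOException {
        data.seek(offset);
        String line = data.readLine();
        String firstField = (line == null) ? "" : line.split(" ", 2)[0];
        if (!firstField.matches("\\d+") || Long.parseLong(firstField) != offset) {
            throw new IOException("No synset record starts at byte offset " + offset);
        }
        return line;
    }
}
```

The check in readSynsetLine is cheap and catches the usual translation slips, such as treating the offset as a line number or a character index rather than a byte position.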
There is no separate license on their source code, but this license is included in license.h:
"This software and database is being provided to you, the LICENSEE, by \n",
"Princeton University under the following license. By obtaining, using \n",
"and/or copying this software and database, you agree that you have \n",
"read, understood, and will comply with these terms and conditions.: \n \n",
"Permission to use, copy, modify and distribute this software and \n",
"database and its documentation for any purpose and without fee or \n",
"royalty is hereby granted, provided that you agree to comply with \n",
"the following copyright notice and statements, including the disclaimer, \n",
"and that the same appear on ALL copies of the software, database and \n",
"documentation, including modifications that you make for internal \n",
"use or for distribution. \n \n",
"WordNet 1.7 Copyright 2001 by Princeton University. All rights reserved. \n \n",
"THIS SOFTWARE AND DATABASE IS PROVIDED \"AS IS\" AND PRINCETON \n",
"UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR \n",
"IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON \n",
"UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT- \n",
"ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE \n",
"OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT \n",
"INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR \n",
"OTHER RIGHTS. \n \n",
"The name of Princeton University or Princeton may not be used in \n",
"advertising or publicity pertaining to distribution of the software \n",
"and/or database. Title to copyright in this software, database and \n",
"any associated documentation shall at all times remain with \n",
"Princeton University and LICENSEE agrees to preserve same. \n", NULL};
Mike Atkinson
Ha, now I find out! I thought I had done a search for WordNet implementations, but somehow completely missed the ones on SourceForge!
Java implementations of WordNet
http://sourceforge.net/projects/javawn/
A mixture of java and C++, has SQL interface to a WordNet database.
WordNet 1.6?
--------------------------------------------------------------------------------
http://sourceforge.net/projects/jwn/
JWordNet is a pure Java standalone object-oriented interface to the WordNet
database of lexical relationships.
Has a GUI interface.
WordNet 1.6?
--------------------------------------------------------------------------------
http://sourceforge.net/projects/jsemcor/
JSemCor is a pure java interface to the Brown semantic concordance corpus used
in conjunction with Wordnet. The purpose of these libraries is to provide basic
semantic concordance functionality for word sense disambiguation.
Active development, but immature.
--------------------------------------------------------------------------------
http://sourceforge.net/projects/jwordnet/
JWNL is a Java API for accessing the WordNet relational dictionary. WordNet is
widely used for developing NLP applications, and a Java API such as JWNL will
allow developers to more easily use Java for building NLP applications.
Converts to an object database (which can be serialized separately, and restored).
Uses property files (for internationalisation) and XML to define implementations
of interface.
Good structure.
WordNet 1.6
Last update 2001-12-10?
--------------------------------------------------------------------------------
http://sourceforge.net/projects/wnjn/
JNI Java native interface to the WordNet lexicographic database
(www.cogsci.princeton.edu/~wn.)
Thin wrapper round the WordNet C implementation.
last update 2002-1-14
--------------------------------------------------------------------------------
http://sourceforge.net/projects/opennlp/
My project, a straightforward translation from the C code with a fast GUI, but
lacking in features.
I've completed an initial implementation which includes command line and GUI interfaces. It is a fairly straightforward translation of the C code. I have only really changed it to speed it up (by an order of magnitude). It works with the WordNet 1.7 database, and probably with WordNet 1.6 (although I have not tested this).
The next stage is to work on a general dictionary interface. Some of the projects in my last post will act as inputs. I will contact the various leaders, developers and users of these libraries to see about combining our efforts.
But all that is a couple of weeks away; first I am going to take a break until the second week in April.
In the meantime, if someone could check that the code builds and works on Unix, Linux or Mac, it would be much appreciated.
Mike Atkinson
Cool stuff!
The only thing is that I actually prefer that the "Dictionary" module be outside of the "opennlp" module just like the "Hylo" module is.
As per the post I have just made about release management, I would like to have things separated in that manner for projects of significantly different functionality.
Let me know when I can go ahead and make this move so that I don't disrupt your development, and then I'll test out the code and add scripts for Unix.
Yes, go ahead. I put it there by mistake.
I've split the Dictionary source off from the "opennlp" module and created a new module called "Dictionary". I've filled out the build structure with the jar files needed to compile, so give it a whirl and make sure it compiles for you.
I've also put gui.bat and run.bat into the Dictionary/bin directory. I noticed that they use directories which are specific to your system and wondered if you could use some more general system properties for these. For example:
-DWNHOME=C:\nlp\opennlp\opennlp\dictionary\src\ -DWNSEARCHDIR=C:\nlp\WordNet\dict\
Can we instead have a property OPENNLP_DICTIONARY_HOME and let these bits of information be found from that path? Looks like some things from WordNet might need to be added to the Dictionary/src/data directory?
Once you have that set up, I'll whip up a few unix scripts to put in the bin directory. Also, could you rename gui.bat and run.bat to be names relating to the package?
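One way the single-property idea could work is sketched below. It assumes WNHOME maps to the src directory under OPENNLP_DICTIONARY_HOME and WNSEARCHDIR to src/data beneath it. That layout is a guess based on the paths mentioned above, not a description of the actual module, and the class name DictionaryPaths is hypothetical.

```java
import java.io.File;
import java.util.Properties;

/** Hypothetical helper: derives the WordNet properties from one home property. */
final class DictionaryPaths {
    static final String HOME = "OPENNLP_DICTIONARY_HOME";

    /**
     * Returns a copy of props in which WNHOME and WNSEARCHDIR are filled in
     * from OPENNLP_DICTIONARY_HOME unless they were set explicitly.
     */
    static Properties resolve(Properties props) {
        Properties out = new Properties();
        out.putAll(props);
        String home = props.getProperty(HOME);
        if (home == null) {
            return out; // nothing to derive the paths from
        }
        // Assumed layout: <home>/src and <home>/src/data.
        File src = new File(home, "src");
        setIfAbsent(out, "WNHOME", src.getPath());
        setIfAbsent(out, "WNSEARCHDIR", new File(src, "data").getPath());
        return out;
    }

    private static void setIfAbsent(Properties p, String key, String value) {
        if (p.getProperty(key) == null) {
            p.setProperty(key, value);
        }
    }
}
```

The scripts would then pass only -DOPENNLP_DICTIONARY_HOME=<path> to java, and the code would call DictionaryPaths.resolve(System.getProperties()) at startup. Anyone who keeps the WordNet database elsewhere can still override WNSEARCHDIR explicitly.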