| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 2001 |     |     |     |     |     |     |     | 5   |     | 14  | 37  | 13  |
| 2002 | 14  |     |     | 15  |     |     |     |     |     |     | 3   | 2   |
| 2003 | 4   |     | 1   | 2   |     |     |     |     |     |     |     | 4   |
| 2004 | 1   | 3   |     |     | 4   | 3   | 1   | 6   |     |     |     |     |
| 2005 |     |     |     |     |     |     |     |     |     | 17  | 3   |     |
| 2006 |     |     |     |     |     |     |     |     |     |     | 23  |     |
| 2007 |     |     | 7   | 17  | 1   |     |     |     |     |     |     |     |
| 2008 |     |     |     | 1   |     |     |     | 3   | 20  |     | 15  | 2   |
| 2009 | 38  | 4   | 20  |     |     |     |     |     |     |     |     |     |
| 2010 |     |     |     |     |     | 4   |     | 17  | 26  |     | 2   |     |
From: Jason B. <jas...@us...> - 2001-11-14 17:39:59
Update of /cvsroot/maxent/maxent/src/java/opennlp/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv5243/src/java/opennlp/maxent

Modified Files:
	TrainEval.java
Log Message:
Fixed minor bug in which TrainEval was calling the GIS trainer with the
iteration and cutoff arguments swapped.

Index: TrainEval.java
===================================================================
RCS file: /cvsroot/maxent/maxent/src/java/opennlp/maxent/TrainEval.java,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** TrainEval.java	2001/10/23 14:06:53	1.1.1.1
--- TrainEval.java	2001/11/14 17:39:56	1.2
***************
*** 61,65 ****
  
      public static MaxentModel train(EventStream events, int cutoff) {
!         return GIS.trainModel(events, cutoff, 100);
      }
  
--- 61,65 ----
  
      public static MaxentModel train(EventStream events, int cutoff) {
!         return GIS.trainModel(events, 100, cutoff);
      }
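For reference, the corrected argument order is iterations first, then cutoff. A minimal sketch of a caller under that assumption (the classes come from this package; the method body mirrors the fixed revision above):

    import opennlp.maxent.*;

    public class TrainExample {
        // GIS.trainModel(events, iterations, cutoff): with the arguments
        // swapped, the cutoff was silently being used as the iteration count.
        public static MaxentModel train(EventStream events, int cutoff) {
            return GIS.trainModel(events, 100, cutoff);
        }
    }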
From: Jason B. <jas...@us...> - 2001-11-14 17:39:25
Update of /cvsroot/maxent/maxent/lib
In directory usw-pr-cvs1:/tmp/cvs-serv5119/lib

Modified Files:
	java-getopt.jar
Log Message:
Added proper version of getopt jar.

Index: java-getopt.jar
===================================================================
RCS file: /cvsroot/maxent/maxent/lib/java-getopt.jar,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
Binary files /tmp/cvsCJ3W7d and /tmp/cvs4cOU4h differ
From: Jason B. <jas...@us...> - 2001-11-06 15:59:46
Update of /cvsroot/maxent/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv25995

Modified Files:
	CHANGES build.xml
Log Message:
Changed version in CVS to 1.2.3

Index: CHANGES
===================================================================
RCS file: /cvsroot/maxent/maxent/CHANGES,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** CHANGES	2001/11/06 15:04:59	1.1
--- CHANGES	2001/11/06 15:59:43	1.2
***************
*** 1,5 ****
  1.2.2
  -----
! Easier to keep homepage up-to-date.
  
  Added IntegerPool, which manages a pool of shareable, read-only Integer
--- 1,7 ----
  1.2.2
  -----
! Added "exe" to build.xml to create stand alone jar files. Also added
! structure and "homepage" target so that it is easier to keep homepage
! up-to-date.
  
  Added IntegerPool, which manages a pool of shareable, read-only Integer

Index: build.xml
===================================================================
RCS file: /cvsroot/maxent/maxent/build.xml,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** build.xml	2001/11/06 15:04:59	1.6
--- build.xml	2001/11/06 15:59:43	1.7
***************
*** 10,14 ****
  <property name="Name" value="Maxent"/>
  <property name="name" value="maxent"/>
! <property name="version" value="1.2.2"/>
  <property name="year" value="2001"/>
--- 10,14 ----
  <property name="Name" value="Maxent"/>
  <property name="name" value="maxent"/>
! <property name="version" value="1.2.3"/>
  <property name="year" value="2001"/>
***************
*** 139,144 ****
  <tar tarfile="${name}-${version}-src.tar"
       basedir="../"
!      includes="${name}/**"
!      excludes="**/CVS" />
  <gzip src="${name}-${version}-src.tar"
        zipfile="../${name}-${version}-src.tgz" />
--- 139,146 ----
  <tar tarfile="${name}-${version}-src.tar"
       basedir="../"
!      includes="${name}/**" >
!      <exclude name="${name}/docs/api/**"/>
!      <exclude name="**/CVS"/>
!      </tar>
  <gzip src="${name}-${version}-src.tar"
        zipfile="../${name}-${version}-src.tgz" />
From: Jason B. <jas...@us...> - 2001-11-06 15:05:55
Update of /cvsroot/maxent/maxent/lib
In directory usw-pr-cvs1:/tmp/cvs-serv27939/lib

Added Files:
	jakarta-ant-optional.jar
Log Message:

--- NEW FILE: jakarta-ant-optional.jar ---
(binary jar contents omitted; 1724 lines suppressed)
From: Jason B. <jas...@us...> - 2001-11-06 15:05:02
Update of /cvsroot/maxent/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv27381

Modified Files:
	build.xml
Added Files:
	CHANGES
Removed Files:
	RELEASE
Log Message:
Added target exe for building a stand alone jar file.

--- NEW FILE: CHANGES ---
1.2.2
-----

Easier to keep homepage up-to-date.

Added IntegerPool, which manages a pool of shareable, read-only Integer
objects in a fixed range. This is used in several places where Integer
objects were previously created and then GCed.

In various places, used Collections API features to speed things up. For
example, java.util.List.toArray() will do a System.arraycopy if given a
big enough array to copy into. This is, therefore, much faster.

Added getAllOutcomes() method in GISModel to show the String names of all
outcomes matched up with their normalized probabilities.

1.2.0
-----

Changed license to LGPL.

Added build.xml file to be used with the build tool Jakarta Ant.

Work sponsored by Electric Knowledge: Added BasicEnglishAffixes class to
perform basic morphological stemming for English.

Fixed a bug pointed out by Chieu Hai Leong in which the model would not
train properly in situations where the number of features was constant
(and therefore no correction features need to be computed).

The top level package name has changed from quipu to opennlp. Thus, what
was "quipu.maxent" is now "opennlp.maxent". See http://www.opennlp.com for
more details.

Lots of little tweaks to reduce memory consumption.

Several changes sponsored by eTranslate Inc:

* The new opennlp.maxent.io subpackage. The input/output system for models
  is now designed to facilitate storing and retrieving models in different
  formats. At present, GISModelReaders and GISModelWriters have been
  supplied for reading and writing plain text and binary files, and either
  can be gzipped or uncompressed. The OldFormatGISModelReader can be used
  to read old models (from Maxent 1.0), and also provides command line
  functionality to convert old models to the new format. Also,
  SuffixSensitiveGISModelReader and SuffixSensitiveGISModelWriter have
  been provided to allow models to be stored and retrieved appropriately
  based on the suffixes of their file names.

* Model training no longer automatically persists the new model to disk.
  Instead it returns a GISModel object which can then be persisted using
  an object from one of the GISModelWriter classes.

* Model training now relies on EventStreams rather than EventCollectors.
  An EventStream is fed to the DataIndexer directly without the developer
  needing to invoke the DataIndexer class explicitly. A good way to feed
  an EventStream the data it needs to form events is to use a DataStream
  that can return Objects from a data source in a format- and
  OS-independent manner. An implementation of the DataStream interface for
  reading plain text files line by line is provided in the
  opennlp.maxent.PlainTextByLineDataStream class. In order to retain
  backwards compatibility, the EventCollectorAsStream class is provided as
  a wrapper over the EventCollectors used in Maxent 1.0.

* GISModel is now thread-safe. Thus, one maxent application can service
  multiple evaluations in parallel with a single model.

* The opennlp.maxent.ModelReplacementManager class has been added to allow
  a maxent application to replace its current maxent model with a newly
  trained one in a thread-safe manner without stopping the servicing of
  requests. An alternative to the ModelReplacementManager is to use a
  DomainToModelMap object to record the mapping between different data
  domains and models which are optimized for them. This class allows
  models to be swapped in a thread-safe manner as well.

* The GIS class is now a factory which invokes a new GISTrainer whenever a
  new model is being trained. Since GISTrainer has only local variables
  and methods, multiple models can be trained simultaneously on different
  threads. With the previous implementation, requests to train new models
  were forced to queue up.

1.0
-----

Reworked the GIS algorithm to use efficient data structures:

* Tables matching things like predicates, probabilities, and correction
  values to their outcomes now use OpenIntDoubleHashMaps.

* Several functions over OpenIntDoubleHashMaps are now defined, and most
  of the work of the iteration loop is in fact done by these.

Events with the same outcome and contextual predicates are collapsed to
reduce the number of tokens which must be iterated through in several
loops. The number of times each event "type" is seen is then stored in an
array (numTimesEventSeen) to provide the proper values.

GISModel uses less memory for models with many outcomes, and is much
faster on them as well. Performance is roughly the same for models with
only two outcomes.

More code documentation.

Fully compatible with models built using version 0.2.0.

0.2.0
-----

Initial release of fully functional maxent package.

Index: build.xml
===================================================================
RCS file: /cvsroot/maxent/maxent/build.xml,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** build.xml	2001/10/28 01:33:35	1.5
--- build.xml	2001/11/06 15:04:59	1.6
***************
*** 10,14 ****
  <property name="Name" value="Maxent"/>
  <property name="name" value="maxent"/>
! <property name="version" value="1.2.1"/>
  <property name="year" value="2001"/>
--- 10,14 ----
  <property name="Name" value="Maxent"/>
  <property name="name" value="maxent"/>
! <property name="version" value="1.2.2"/>
  <property name="year" value="2001"/>
***************
*** 112,115 ****
--- 112,133 ----
          basedir="${build.dest}" />
+   </target>
+ 
+ 
+   <!-- =================================================================== -->
+   <!-- Creates Jar file with all other needed jars built in.               -->
+   <!-- =================================================================== -->
+   <target name="exe" depends="compile">
+     <jar jarfile="${build.dir}/${name}-${DSTAMP}.jar"
+          basedir="${build.dest}" />
+     <jlink outfile="${build.dir}/${name}-exe-${version}.jar">
+       <mergefiles>
+         <pathelement path="${build.dir}/${name}-${DSTAMP}.jar"/>
+         <pathelement location="${lib.dir}/gnu-regexp.jar"/>
+         <pathelement location="${lib.dir}/java-getopt.jar"/>
+         <pathelement location="${lib.dir}/colt.jar"/>
+       </mergefiles>
+     </jlink>
+     <delete file="${build.dir}/${name}-${DSTAMP}.jar" />
    </target>

--- RELEASE DELETED ---
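A minimal sketch of the train-then-persist workflow these notes describe, using the class names given in the changelog above. The wrapped collector (MyEventCollector) and the output file name are hypothetical placeholders, and the writer's constructor signature (a model plus a java.io.File, with a persist() method) is assumed here rather than confirmed by the changelog:

    import java.io.File;
    import opennlp.maxent.*;
    import opennlp.maxent.io.*;

    public class TrainAndPersist {
        public static void main(String[] args) throws Exception {
            // Wrap a Maxent 1.0-style collector as an EventStream
            // (MyEventCollector is a hypothetical implementation).
            EventStream es = new EventCollectorAsStream(new MyEventCollector());

            // Training now returns the model rather than writing it to disk.
            GISModel model = GIS.trainModel(es, 100, 0);

            // Persist it; the suffix-sensitive writer picks the on-disk
            // format from the suffixes of the file name (assumed API).
            new SuffixSensitiveGISModelWriter(model,
                    new File("mymodel.txt.gz")).persist();
        }
    }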
From: Jason B. <jas...@us...> - 2001-11-06 15:03:53
Update of /cvsroot/maxent/maxent/src/java/opennlp/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv26703/java/opennlp/maxent

Modified Files:
	MaxentModel.java
Log Message:
Added getAllOutcomes() method to MaxentModel interface.

Index: MaxentModel.java
===================================================================
RCS file: /cvsroot/maxent/maxent/src/java/opennlp/maxent/MaxentModel.java,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** MaxentModel.java	2001/10/23 14:06:53	1.1.1.1
--- MaxentModel.java	2001/11/06 15:03:49	1.2
***************
*** 41,45 ****
   * containing the highest probability in the double[].
   *
!  * @param outcomes array of probabilities obtained from the eval method.
   * @return the String name of the best outcome
   *
--- 41,47 ----
   * containing the highest probability in the double[].
   *
!  * @param outcomes A <code>double[]</code> as returned by the
!  *            <code>eval(String[] context)</code>
!  *            method.
   * @return the String name of the best outcome
   *
***************
*** 47,50 ****
--- 49,68 ----
      public String getBestOutcome (double[] outcomes);
  
+     /**
+      * Return a string matching all the outcome names with all the
+      * probabilities produced by the <code>eval(String[] context)</code>
+      * method.
+      *
+      * @param outcomes A <code>double[]</code> as returned by the
+      *            <code>eval(String[] context)</code>
+      *            method.
+      * @return String containing outcome names paired with the normalized
+      *         probability (contained in the <code>double[] ocs</code>)
+      *         for each one.
+      */
+     public String getAllOutcomes (double[] outcomes);
+ 
+ 
      /**
       * Gets the String name of the outcome associated with the index i.
From: Jason B. <jas...@us...> - 2001-11-06 12:17:01
Update of /cvsroot/maxent/maxent/src/java/opennlp/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv28147

Modified Files:
	GISModel.java
Log Message:
Added method getAllOutcomes which returns a string matching up the outcome
names with the normalized probabilities found for a given call to eval().

Index: GISModel.java
===================================================================
RCS file: /cvsroot/maxent/maxent/src/java/opennlp/maxent/GISModel.java,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** GISModel.java	2001/10/23 14:06:53	1.1.1.1
--- GISModel.java	2001/11/06 12:16:58	1.2
***************
*** 137,140 ****
--- 137,173 ----
  
      /**
+      * Return a string matching all the outcome names with all the
+      * probabilities produced by the <code>eval(String[] context)</code>
+      * method.
+      *
+      * @param ocs A <code>double[]</code> as returned by the
+      *            <code>eval(String[] context)</code>
+      *            method.
+      * @return String containing outcome names paired with the normalized
+      *         probability (contained in the <code>double[] ocs</code>)
+      *         for each one.
+      */
+     public final String getAllOutcomes (double[] ocs) {
+         if (ocs.length != ocNames.length) {
+             return "The double array sent as a parameter to GISModel.getAllOutcomes() must not have been produced by this model.";
+         }
+         else {
+             StringBuffer sb = new StringBuffer(ocs.length*2);
+             String d = Double.toString(ocs[0]);
+             if (d.length() > 6)
+                 d = d.substring(0,7);
+             sb.append(ocNames[0]).append("[").append(d).append("]");
+             for (int i = 1; i<ocs.length; i++) {
+                 d = Double.toString(ocs[i]);
+                 if (d.length() > 6)
+                     d = d.substring(0,7);
+                 sb.append(" ").append(ocNames[i]).append("[").append(d).append("]");
+             }
+             return sb.toString();
+         }
+     }
+ 
+ 
+     /**
       * Return the name of an outcome corresponding to an int id.
       *
***************
*** 145,148 ****
--- 178,182 ----
          return ocNames[i];
      }
+ 
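A short sketch of the new method in use, assuming a trained GISModel and a String[] of contextual predicates; the variable names and the sample output are illustrative:

    double[] probs = model.eval(context);            // one probability per outcome
    System.out.println(model.getAllOutcomes(probs)); // e.g. "T[0.94321] F[0.05678]"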
From: Jason B. <jas...@us...> - 2001-11-06 12:14:39
Update of /cvsroot/maxent/maxent/src/java/opennlp/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv26707

Modified Files:
	GIS.java
Log Message:
Added a trainModel method which takes only an EventStream and assumes a
cutoff of 0 and 100 iterations.

Index: GIS.java
===================================================================
RCS file: /cvsroot/maxent/maxent/src/java/opennlp/maxent/GIS.java,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** GIS.java	2001/10/23 14:06:53	1.1.1.1
--- GIS.java	2001/11/06 12:14:35	1.2
***************
*** 34,37 ****
--- 34,50 ----
  
      /**
+      * Train a model using the GIS algorithm, assuming 100 iterations and no
+      * cutoff.
+      *
+      * @param eventStream The EventStream holding the data on which this model
+      *                    will be trained.
+      * @return The newly trained model, which can be used immediately or saved
+      *         to disk using an opennlp.maxent.io.GISModelWriter object.
+      */
+     public static GISModel trainModel(EventStream eventStream) {
+         return trainModel(eventStream, 100, 0, printMessages);
+     }
+ 
+     /**
       * Train a model using the GIS algorithm.
       *
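In use, the new overload is a one-liner; the explicit equivalent is shown for comparison (eventStream stands in for whatever stream the caller has built):

    GISModel model = GIS.trainModel(eventStream); // defaults: 100 iterations, cutoff 0
    // equivalent to GIS.trainModel(eventStream, 100, 0, printMessages)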
From: Eric F. <er...@us...> - 2001-10-31 23:18:12
Update of /cvsroot/maxent/maxent/lib
In directory usw-pr-cvs1:/tmp/cvs-serv32393

Modified Files:
	colt.jar
Log Message:
Version of colt 1.0.1 which does not include RngPack or Corejava.Format,
both of which have licenses that are incompatible with the LGPL and
commercial applications.

Index: colt.jar
===================================================================
RCS file: /cvsroot/maxent/maxent/lib/colt.jar,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
Binary files /tmp/cvs2OiYPF and /tmp/cvsWmFdFl differ
From: Jason B. <jas...@us...> - 2001-10-30 11:40:28
Update of /cvsroot/maxent/maxent/docs
In directory usw-pr-cvs1:/tmp/cvs-serv30201/docs

Added Files:
	.cvsignore
Log Message:
Block the api directory from being complained about by CVS.

--- NEW FILE: .cvsignore ---
api
From: Eric F. <er...@us...> - 2001-10-28 03:17:06
Update of /cvsroot/maxent/maxent/src/java/opennlp/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv21891/src/java/opennlp/maxent

Modified Files:
	DataIndexer.java
Added Files:
	IntegerPool.java
Removed Files:
	build.xml
Log Message:
performance enhancements in DataIndexer
added IntegerPool class for pooling Integer objects.
removed deprecated build file.

--- NEW FILE: IntegerPool.java ---
///////////////////////////////////////////////////////////////////////////////
// Copyright (c) 2001, Eric D. Friedman All Rights Reserved.
//
// This library is free software; you can redistribute it and/or
// modify it under the terms of the GNU Lesser General Public
// License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU Lesser General Public
// License along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
///////////////////////////////////////////////////////////////////////////////
package opennlp.maxent;

/**
 * A pool of read-only, unsigned Integer objects within a fixed,
 * non-sparse range. Use this class for operations in which a large
 * number of Integer wrapper objects will be created.
 *
 * Created: Sat Oct 27 10:59:11 2001
 *
 * @author Eric Friedman
 * @version $Id: IntegerPool.java,v 1.1 2001/10/28 03:17:03 ericdf Exp $
 */
public class IntegerPool {
    private Integer[] _table;

    /**
     * Creates an IntegerPool with 0..size Integer objects.
     *
     * @param size the size of the pool.
     */
    public IntegerPool (int size) {
        _table = new Integer[size];
        for (int i = 0; i < size; i++) {
            _table[i] = new Integer(i);
        } // end of for (int i = 0; i < size; i++)
    }

    /**
     * Returns the shared Integer wrapper for <tt>value</tt> if it is
     * inside the range managed by this pool. if <tt>value</tt> is
     * outside the range, a new Integer instance is returned.
     *
     * @param value an <code>int</code> value
     * @return an <code>Integer</code> value
     */
    public Integer get(int value) {
        if (value < _table.length && value >= 0) {
            return _table[value];
        } else {
            return new Integer(value);
        }
    }
} // IntegerPool

Index: DataIndexer.java
===================================================================
RCS file: /cvsroot/maxent/maxent/src/java/opennlp/maxent/DataIndexer.java,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** DataIndexer.java	2001/10/23 14:06:53	1.1.1.1
--- DataIndexer.java	2001/10/28 03:17:03	1.2
***************
*** 20,24 ****
  import java.util.*;
- 
  /**
   * An indexer for maxent model data which handles cutoffs for uncommon
--- 20,23 ----
***************
*** 36,39 ****
--- 35,39 ----
      public String[] predLabels;
      public String[] outcomeLabels;
+     private static final IntegerPool intPool = new IntegerPool(50);
  
      /**
***************
*** 45,50 ****
       */
      public DataIndexer(EventStream eventStream) {
!         this(eventStream, 0);
      }
  
      /**
--- 45,151 ----
       */
      public DataIndexer(EventStream eventStream) {
!         this(eventStream, 0);
!     }
! 
!     private List computeEventCounts(EventStream eventStream,
!                                     Map count) {
!         List events = new ArrayList();
!         while (eventStream.hasNext()) {
!             Event ev = eventStream.nextEvent();
!             events.add(ev);
!             String[] ec = ev.getContext();
!             for (int j=0; j<ec.length; j++) {
!                 Counter counter = (Counter)count.get(ec[j]);
!                 if (counter!=null) {
!                     counter.increment();
!                 } else {
!                     count.put(ec[j], new Counter());
!                 }
!             }
!         }
!         return events;
!     }
! 
!     private void applyCutoff(Map count, int cutoff) {
!         if (cutoff == 0) {
!             return; // nothing to do
!         }
! 
!         for (Iterator cit=count.keySet().iterator(); cit.hasNext();) {
!             String pred = (String)cit.next();
!             if (! ((Counter)count.get(pred)).passesCutoff(cutoff)) {
!                 cit.remove();
!             }
!         }
      }
+ 
+     private ComparableEvent[] index(List events,
+                                     Map count) {
+         Map omap = new HashMap(), pmap = new HashMap();
+ 
+         int numEvents = events.size();
+         int outcomeCount = 0;
+         int predCount = 0;
+         int[] uncompressedOutcomeList = new int[numEvents];
+         List uncompressedContexts = new ArrayList();
+ 
+         for (int eventIndex=0; eventIndex<numEvents; eventIndex++) {
+             Event ev = (Event)events.get(eventIndex);
+             String[] econtext = ev.getContext();
+ 
+             Integer predID, ocID;
+             String oc = ev.getOutcome();
+ 
+             if (omap.containsKey(oc)) {
+                 ocID = (Integer)omap.get(oc);
+             } else {
+                 ocID = intPool.get(outcomeCount++);
+                 omap.put(oc, ocID);
+             }
+ 
+             List indexedContext = new ArrayList();
+             for (int i=0; i<econtext.length; i++) {
+                 String pred = econtext[i];
+                 if (count.containsKey(pred)) {
+                     if (pmap.containsKey(pred)) {
+                         predID = (Integer)pmap.get(pred);
+                     } else {
+                         predID = intPool.get(predCount++);
+                         pmap.put(pred, predID);
+                     }
+                     indexedContext.add(predID);
+                 }
+             }
+             uncompressedContexts.add(indexedContext);
+             uncompressedOutcomeList[eventIndex] = ocID.intValue();
+         }
+         outcomeLabels = new String[omap.size()];
+         for (Iterator i=omap.keySet().iterator(); i.hasNext();) {
+             String oc = (String)i.next();
+             outcomeLabels[((Integer)omap.get(oc)).intValue()] = oc;
+         }
+         omap = null;
+ 
+         predLabels = new String[pmap.size()];
+         for (Iterator i = pmap.keySet().iterator(); i.hasNext();) {
+             String n = (String)i.next();
+             predLabels[((Integer)pmap.get(n)).intValue()] = n;
+         }
+         pmap = null;
+ 
+         ComparableEvent[] eventsToCompare = new ComparableEvent[numEvents];
+ 
+         for (int i=0; i<numEvents; i++) {
+             List ecLL = (List)uncompressedContexts.get(i);
+             int[] ecInts = new int[ecLL.size()];
+             for (int j=0; j<ecInts.length; j++) {
+                 ecInts[j] = ((Integer)ecLL.get(j)).intValue();
+             }
+             eventsToCompare[i] =
+                 new ComparableEvent(uncompressedOutcomeList[i], ecInts);
+         }
+ 
+         return eventsToCompare;
+     }
  
      /**
***************
*** 57,200 ****
       */
      public DataIndexer(EventStream eventStream, int cutoff) {
!         System.out.println("Indexing events");
! 
!         System.out.print("\tComputing event counts...  ");
!         LinkedList events = new LinkedList();
!         HashMap count = new HashMap();
!         //for(int tid=0; tid<events.length; tid++) {
!         while (eventStream.hasNext()) {
!             Event ev = eventStream.nextEvent();
!             events.addLast(ev);
!             String[] ec = ev.getContext();
!             for (int j=0; j<ec.length; j++) {
!                 Counter counter = (Counter)count.get(ec[j]);
!                 if (counter!=null) {
!                     counter.increment();
!                 } else {
!                     count.put(ec[j], new Counter());
!                 }
!             }
!         }
!         System.out.println("done.");
! 
!         System.out.print("\tPerforming cutoff of " + cutoff + "...  ");
!         HashSet useablePreds = new HashSet();
!         String pred;
!         for (Iterator cit=count.keySet().iterator(); cit.hasNext();) {
!             pred = (String)cit.next();
!             if (((Counter)count.get(pred)).passesCutoff(cutoff))
!                 useablePreds.add(pred);
!         }
!         System.out.println("done.");
! 
!         System.out.print("\tIndexing...  ");
!         HashMap omap = new HashMap();
!         HashMap pmap = new HashMap();
! 
!         int numEvents = events.size();
!         int[] uncompressedOutcomeList = new int[numEvents];
!         LinkedList uncompressedContexts = new LinkedList();
!         int outcomeCount = 0;
!         int predCount = 0;
!         for (int eventIndex=0; eventIndex<numEvents; eventIndex++) {
!             Event ev = (Event)events.removeFirst();
!             String[] econtext = ev.getContext();
! 
!             Integer predID, ocID;
!             String oc = ev.getOutcome();
! 
!             if (omap.containsKey(oc)) {
!                 ocID = (Integer)omap.get(oc);
!             } else {
!                 ocID = new Integer(outcomeCount++);
!                 omap.put(oc, ocID);
!             }
!             LinkedList indexedContext = new LinkedList();
!             for (int i=0; i<econtext.length; i++) {
!                 pred = econtext[i];
!                 if (useablePreds.contains(pred)) {
!                     if (pmap.containsKey(pred)) {
!                         predID = (Integer)pmap.get(pred);
!                     } else {
!                         predID = new Integer(predCount++);
!                         pmap.put(pred, predID);
!                     }
!                     indexedContext.add(predID);
!                 }
!             }
!             uncompressedContexts.addLast(indexedContext);
!             uncompressedOutcomeList[eventIndex] = ocID.intValue();
!         }
!         useablePreds = null;
! 
!         outcomeLabels = new String[omap.size()];
!         for (Iterator i=omap.keySet().iterator(); i.hasNext();) {
!             String oc = (String)i.next();
!             outcomeLabels[((Integer)omap.get(oc)).intValue()] = oc;
!         }
!         omap = null;
! 
!         predLabels = new String[pmap.size()];
!         for (Iterator i = pmap.keySet().iterator(); i.hasNext();) {
!             String n = (String)i.next();
!             predLabels[((Integer)pmap.get(n)).intValue()] = n;
!         }
!         pmap = null;
!         ComparableEvent[] eventsToCompare = new ComparableEvent[numEvents];
!         for (int i=0; i<numEvents; i++) {
!             LinkedList ecLL = (LinkedList)uncompressedContexts.removeFirst();
!             int[] ecInts = new int[ecLL.size()];
!             for (int j=0; j<ecInts.length; j++)
!                 ecInts[j] = ((Integer)ecLL.removeFirst()).intValue();
!             eventsToCompare[i] =
!                 new ComparableEvent(uncompressedOutcomeList[i], ecInts);
!         }
!         uncompressedContexts = null;
!         uncompressedOutcomeList = null;
! 
!         System.out.println("done.");
!         System.out.print("Sorting and merging events...  ");
!         Arrays.sort(eventsToCompare);
!         ComparableEvent ce = eventsToCompare[0];
!         LinkedList uniqueEvents = new LinkedList();
!         LinkedList newGroup = new LinkedList();
!         for (int i=0; i<numEvents; i++) {
!             if (ce.compareTo(eventsToCompare[i]) == 0) {
!                 newGroup.addLast(eventsToCompare[i]);
!             } else {
!                 ce = eventsToCompare[i];
!                 uniqueEvents.addLast(newGroup);
!                 newGroup = new LinkedList();
!                 newGroup.addLast(eventsToCompare[i]);
!             }
!         }
!         uniqueEvents.addLast(newGroup);
!         int numUniqueEvents = uniqueEvents.size();
!         System.out.println("done. Reduced " + eventsToCompare.length
!                            + " events to " + numUniqueEvents + ".");
!         contexts = new int[numUniqueEvents][];
!         outcomeList = new int[numUniqueEvents];
!         numTimesEventsSeen = new int[numUniqueEvents];
!         for (int i=0; i<numUniqueEvents; i++) {
!             LinkedList group = (LinkedList)uniqueEvents.removeFirst();
!             numTimesEventsSeen[i] = group.size();
!             ComparableEvent nextCE = (ComparableEvent)group.getFirst();
!             outcomeList[i] = nextCE.outcome;
!             contexts[i] = nextCE.predIndexes;
!         }
!         System.out.println("Done indexing.");
      }
- }
--- 158,222 ----
       */
      public DataIndexer(EventStream eventStream, int cutoff) {
!         Map count;
!         List events;
!         System.out.println("Indexing events");
!         System.out.print("\tComputing event counts...  ");
!         count = new HashMap();
!         events = computeEventCounts(eventStream,count);
!         //for(int tid=0; tid<events.length; tid++) {
!         System.out.println("done.");
!         System.out.print("\tPerforming cutoff of " + cutoff + "...  ");
!         applyCutoff(count, cutoff);
!         System.out.println("done.");
!         System.out.print("\tIndexing...  ");
!         ComparableEvent[] eventsToCompare = index(events,count);
!         // done with event list
!         events = null;
!         // done with predicate counts
!         count = null;
!         System.out.println("done.");
!         System.out.print("Sorting and merging events...  ");
!         Arrays.sort(eventsToCompare);
!         ComparableEvent ce = eventsToCompare[0];
!         List uniqueEvents = new ArrayList();
!         List newGroup = new ArrayList();
!         int numEvents = eventsToCompare.length;
!         for (int i=0; i<numEvents; i++) {
!             if (ce.compareTo(eventsToCompare[i]) == 0) {
!                 newGroup.add(eventsToCompare[i]);
!             } else {
!                 ce = eventsToCompare[i];
!                 uniqueEvents.add(newGroup);
!                 newGroup = new ArrayList();
!                 newGroup.add(eventsToCompare[i]);
!             }
!         }
!         uniqueEvents.add(newGroup);
!         int numUniqueEvents = uniqueEvents.size();
!         System.out.println("done. Reduced " + eventsToCompare.length
!                            + " events to " + numUniqueEvents + ".");
!         contexts = new int[numUniqueEvents][];
!         outcomeList = new int[numUniqueEvents];
!         numTimesEventsSeen = new int[numUniqueEvents];
!         for (int i=0; i<numUniqueEvents; i++) {
!             List group = (List)uniqueEvents.get(i);
!             numTimesEventsSeen[i] = group.size();
!             ComparableEvent nextCE = (ComparableEvent)group.get(0);
!             outcomeList[i] = nextCE.outcome;
!             contexts[i] = nextCE.predIndexes;
!         }
!         System.out.println("Done indexing.");
      }
  }

--- build.xml DELETED ---
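The pooling behavior of the new class in a nutshell, using the IntegerPool code shown above (the pool size and values here are arbitrary):

    IntegerPool pool = new IntegerPool(50);
    Integer a = pool.get(7);   // shared, pre-allocated instance
    Integer b = pool.get(7);   // same object as a, so a == b holds
    Integer c = pool.get(99);  // outside 0..49: freshly allocated on each call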
From: Eric F. <er...@us...> - 2001-10-28 01:33:37
Update of /cvsroot/maxent/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv31674

Modified Files:
	build.xml
Log Message:
fixed build.classpath

Index: build.xml
===================================================================
RCS file: /cvsroot/maxent/maxent/build.xml,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** build.xml	2001/10/28 01:25:56	1.4
--- build.xml	2001/10/28 01:33:35	1.5
***************
*** 41,45 ****
  <path id="build.classpath">
!   <pathelement location="${lib.dir}/xerces.jar"/>
    <pathelement location="${lib.dir}/java-getopt.jar"/>
    <pathelement location="${lib.dir}/colt.jar"/>
--- 41,45 ----
  <path id="build.classpath">
!   <pathelement location="${lib.dir}/gnu-regexp.jar"/>
    <pathelement location="${lib.dir}/java-getopt.jar"/>
    <pathelement location="${lib.dir}/colt.jar"/>
From: Eric F. <er...@us...> - 2001-10-28 01:25:59
Update of /cvsroot/maxent/maxent/src/java/opennlp/maxent/io
In directory usw-pr-cvs1:/tmp/cvs-serv30238/src/java/opennlp/maxent/io

Modified Files:
	GISModelReader.java
Log Message:
removed GNURegex dependency in GISModelReader by dropping dependency on
"PerlHelp" class. GNURegex is still needed to compile the module, because
PerlHelp is still here, but it's not needed at runtime anymore (for
sentence detection, at least). Perhaps PerlHelp should go away? Or be
moved elsewhere?

added build.classpath to build.xml
added .cvsignore for build output

Index: GISModelReader.java
===================================================================
RCS file: /cvsroot/maxent/maxent/src/java/opennlp/maxent/io/GISModelReader.java,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** GISModelReader.java	2001/10/23 14:06:53	1.1.1.1
--- GISModelReader.java	2001/10/28 01:25:56	1.2
***************
*** 18,23 ****
  package opennlp.maxent.io;
  
- import opennlp.maxent.*;
  import cern.colt.map.*;
  
  /**
--- 18,24 ----
  package opennlp.maxent.io;
  
  import cern.colt.map.*;
+ import java.util.StringTokenizer;
+ import opennlp.maxent.*;
  
  /**
***************
*** 72,154 ****
       */
      public GISModel getModel () throws java.io.IOException {
!       checkModelType();
!       int correctionConstant = getCorrectionConstant();
!       double correctionParam = getCorrectionParameter();
!       String[] outcomeLabels = getOutcomes();
!       int[][] outcomePatterns = getOutcomePatterns();
!       String[] predLabels = getPredicates();
!       OpenIntDoubleHashMap[] params = getParameters(outcomePatterns);
! 
!       return new GISModel(params,
!                           predLabels,
!                           outcomeLabels,
!                           correctionConstant,
!                           correctionParam);
      }
  
      protected void checkModelType () throws java.io.IOException {
!       String modelType = readUTF();
!       if (!modelType.equals("GIS"))
!           System.out.println("Error: attempting to load a "+modelType+
!                              " model as a GIS model."+
!                              " You should expect problems.");
      }
  
      protected int getCorrectionConstant () throws java.io.IOException {
!       return readInt();
      }
  
      protected double getCorrectionParameter () throws java.io.IOException {
!       return readDouble();
      }
  
      protected String[] getOutcomes () throws java.io.IOException {
!       int numOutcomes = readInt();
!       String[] outcomeLabels = new String[numOutcomes];
!       for (int i=0; i<numOutcomes; i++) outcomeLabels[i] = readUTF();
!       return outcomeLabels;
      }
  
      protected int[][] getOutcomePatterns () throws java.io.IOException {
!       int numOCTypes = readInt();
!       int[][] outcomePatterns = new int[numOCTypes][];
!       for (int i=0; i<numOCTypes; i++) {
!           String[] info = PerlHelp.split(readUTF());
!           int[] infoInts = new int[info.length];
!           for (int j=0; j<info.length; j++)
!               infoInts[j] = Integer.parseInt(info[j]);
!           outcomePatterns[i] = infoInts;
!       }
!       return outcomePatterns;
      }
  
      protected String[] getPredicates () throws java.io.IOException {
!       NUM_PREDS = readInt();
!       String[] predLabels = new String[NUM_PREDS];
!       for (int i=0; i<NUM_PREDS; i++)
!           predLabels[i] = readUTF();
!       return predLabels;
      }
  
      protected OpenIntDoubleHashMap[] getParameters (int[][] outcomePatterns)
!       throws java.io.IOException {
!       OpenIntDoubleHashMap[] params = new OpenIntDoubleHashMap[NUM_PREDS];
!       int pid=0;
!       for (int i=0; i<outcomePatterns.length; i++) {
!           for (int j=0; j<outcomePatterns[i][0]; j++) {
!               params[pid] = new OpenIntDoubleHashMap();
!               for (int k=1; k<outcomePatterns[i].length; k++) {
!                   double d = readDouble();
!                   params[pid].put(outcomePatterns[i][k], d);
!               }
!               params[pid].trimToSize();
!               pid++;
!           }
!       }
!       return params;
      }
- }
--- 73,155 ----
       */
      public GISModel getModel () throws java.io.IOException {
!         checkModelType();
!         int correctionConstant = getCorrectionConstant();
!         double correctionParam = getCorrectionParameter();
!         String[] outcomeLabels = getOutcomes();
!         int[][] outcomePatterns = getOutcomePatterns();
!         String[] predLabels = getPredicates();
!         OpenIntDoubleHashMap[] params = getParameters(outcomePatterns);
! 
!         return new GISModel(params,
!                             predLabels,
!                             outcomeLabels,
!                             correctionConstant,
!                             correctionParam);
      }
  
      protected void checkModelType () throws java.io.IOException {
!         String modelType = readUTF();
!         if (!modelType.equals("GIS"))
!             System.out.println("Error: attempting to load a "+modelType+
!                                " model as a GIS model."+
!                                " You should expect problems.");
      }
  
      protected int getCorrectionConstant () throws java.io.IOException {
!         return readInt();
      }
  
      protected double getCorrectionParameter () throws java.io.IOException {
!         return readDouble();
      }
  
      protected String[] getOutcomes () throws java.io.IOException {
!         int numOutcomes = readInt();
!         String[] outcomeLabels = new String[numOutcomes];
!         for (int i=0; i<numOutcomes; i++) outcomeLabels[i] = readUTF();
!         return outcomeLabels;
      }
  
      protected int[][] getOutcomePatterns () throws java.io.IOException {
!         int numOCTypes = readInt();
!         int[][] outcomePatterns = new int[numOCTypes][];
!         for (int i=0; i<numOCTypes; i++) {
!             StringTokenizer tok = new StringTokenizer(readUTF(), " ");
!             int[] infoInts = new int[tok.countTokens()];
!             for (int j = 0; tok.hasMoreTokens(); j++) {
!                 infoInts[j] = Integer.parseInt(tok.nextToken());
!             }
!             outcomePatterns[i] = infoInts;
!         }
!         return outcomePatterns;
      }
  
      protected String[] getPredicates () throws java.io.IOException {
!         NUM_PREDS = readInt();
!         String[] predLabels = new String[NUM_PREDS];
!         for (int i=0; i<NUM_PREDS; i++)
!             predLabels[i] = readUTF();
!         return predLabels;
      }
  
      protected OpenIntDoubleHashMap[] getParameters (int[][] outcomePatterns)
!         throws java.io.IOException {
!         OpenIntDoubleHashMap[] params = new OpenIntDoubleHashMap[NUM_PREDS];
!         int pid=0;
!         for (int i=0; i<outcomePatterns.length; i++) {
!             for (int j=0; j<outcomePatterns[i][0]; j++) {
!                 params[pid] = new OpenIntDoubleHashMap();
!                 for (int k=1; k<outcomePatterns[i].length; k++) {
!                     double d = readDouble();
!                     params[pid].put(outcomePatterns[i][k], d);
!                 }
!                 params[pid].trimToSize();
!                 pid++;
!             }
!         }
!         return params;
      }
  }
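For reference, a standalone sketch of the tokenizing idiom that replaces the regex-based PerlHelp.split in getOutcomePatterns above; the input string here is an illustrative stand-in for a line read from a model file:

    import java.util.StringTokenizer;

    public class TokenizeDemo {
        public static void main(String[] args) {
            // Parse a space-separated outcome-pattern line into ints.
            StringTokenizer tok = new StringTokenizer("3 14 15 92", " ");
            int[] infoInts = new int[tok.countTokens()];
            for (int j = 0; tok.hasMoreTokens(); j++) {
                infoInts[j] = Integer.parseInt(tok.nextToken());
            }
            // infoInts is now {3, 14, 15, 92}
        }
    }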
From: Eric F. <er...@us...> - 2001-10-28 01:25:59
Update of /cvsroot/maxent/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv30238

Modified Files:
	build.xml
Added Files:
	.cvsignore
Log Message:
removed GNURegex dependency in GISModelReader by dropping dependency on
"PerlHelp" class. GNURegex is still needed to compile the module, because
PerlHelp is still here, but it's not needed at runtime anymore (for
sentence detection, at least). Perhaps PerlHelp should go away? Or be
moved elsewhere?

added build.classpath to build.xml
added .cvsignore for build output

--- NEW FILE: .cvsignore ---
build

Index: build.xml
===================================================================
RCS file: /cvsroot/maxent/maxent/build.xml,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** build.xml	2001/10/23 15:24:24	1.3
--- build.xml	2001/10/28 01:25:56	1.4
***************
*** 40,43 ****
--- 40,49 ----
  <filter token="verbose" value="true"/>
  
+ <path id="build.classpath">
+   <pathelement location="${lib.dir}/xerces.jar"/>
+   <pathelement location="${lib.dir}/java-getopt.jar"/>
+   <pathelement location="${lib.dir}/colt.jar"/>
+ </path>
+ 
  </target>
***************
*** 94,97 ****
--- 100,104 ----
      destdir="${build.dest}"
      debug="${debug}"
+     classpathref="build.classpath"
      optimize="${optimize}"/>
  </target>
From: Jason B. <jas...@us...> - 2001-10-24 16:44:46
Update of /cvsroot/maxent/maxent/docs
In directory usw-pr-cvs1:/tmp/cvs-serv19269/docs

Modified Files:
	index.html
Log Message:
Changed last modified date.

Index: index.html
===================================================================
RCS file: /cvsroot/maxent/maxent/docs/index.html,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** index.html	2001/10/23 14:30:33	1.1
--- index.html	2001/10/24 16:44:43	1.2
***************
*** 44,48 ****
  <hr>
  <A href="http://sourceforge.net"> <IMG
  src="http://sourceforge.net/sflogo.php?group_id=5961&type=1" width="88"
! height="31" border="0"></A> <br>2000-12-06
  </body>
  </html>
--- 44,48 ----
  <hr>
  <A href="http://sourceforge.net"> <IMG
  src="http://sourceforge.net/sflogo.php?group_id=5961&type=1" width="88"
! height="31" border="0"></A> <br>2001 October 23
  </body>
  </html>
From: Jason B. <jas...@us...> - 2001-10-23 15:24:27
Update of /cvsroot/maxent/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv24670

Modified Files:
	build.xml
Log Message:
Improved Javadoc info. Removed filtering on prepare-src target.

Index: build.xml
===================================================================
RCS file: /cvsroot/maxent/maxent/build.xml,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** build.xml	2001/10/23 14:21:46	1.2
--- build.xml	2001/10/23 15:24:24	1.3
***************
*** 1,4 ****
  <!-- $Id$ -->
- <!-- This build file based on the JDOM build.xml file. -->
  
  <project default="package" basedir=".">
--- 1,3 ----
***************
*** 81,85 ****
  <!-- copy src files -->
! <copy todir="${build.src}" filtering="on">
    <fileset dir="${src.dir}"/>
  </copy>
--- 80,84 ----
  <!-- copy src files -->
! <copy todir="${build.src}" >
    <fileset dir="${src.dir}"/>
  </copy>
***************
*** 132,136 ****
    excludes="**/CVS" />
  <gzip src="${name}-homepage.tar"
!       zipfile="${name}-homepage.tgz" />
  <delete file="${name}-homepage.tar" />
--- 131,135 ----
    excludes="**/CVS" />
  <gzip src="${name}-homepage.tar"
!       zipfile="${build.dir}/${name}-homepage.tgz" />
  <delete file="${name}-homepage.tar" />
***************
*** 150,155 ****
    splitindex="true"
    noindex="false"
!   windowtitle="${Name} API"
!   doctitle="${Name}"
    bottom="Copyright © ${year} Jason Baldridge and Gann Bierner. All Rights Reserved."
    />
--- 149,154 ----
    splitindex="true"
    noindex="false"
!   windowtitle="opennlp.${name}"
!   doctitle="The <a href="http://opennlp.sf.net">OpenNLP</a> ${Name} API v${version}"
    bottom="Copyright © ${year} Jason Baldridge and Gann Bierner. All Rights Reserved."
    />
From: Jason B. <jas...@us...> - 2001-10-23 14:41:24
Update of /cvsroot/maxent/maxent
In directory usw-pr-cvs1:/tmp/cvs-serv3052

Modified Files:
	RELEASE
Log Message:

Index: RELEASE
===================================================================
RCS file: /cvsroot/maxent/maxent/RELEASE,v
retrieving revision 1.1.1.1
retrieving revision 1.2
diff -C2 -d -r1.1.1.1 -r1.2
*** RELEASE	2001/10/23 14:06:52	1.1.1.1
--- RELEASE	2001/10/23 14:41:22	1.2
***************
*** 1,2 ****
--- 1,8 ----
+ 1.2.1
+ -----
+ 
+ Easier to keep homepage up-to-date.
+ 
+ 
  1.2.0
  -----
From: Jason B. <jas...@us...> - 2001-10-23 14:30:36
|
Update of /cvsroot/maxent/maxent/docs In directory usw-pr-cvs1:/tmp/cvs-serv31023 Added Files: details.html howto.html index.html whatismaxent.html Log Message: Updated the homepage. --- NEW FILE: details.html --- <!doctype html public "-//w3c//dtd html 4.0 transitional//en"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> <meta name="GENERATOR" content="Mozilla/4.5 [en] (X11; I; SunOS 5.5.1 sun4u) [Netscape]"> </head> <body text="#000000" bgcolor="#FFFFFF" link="#0000EE" vlink="#551A8B" alink="#FF0000"> We have tried to make the quipu.maxent implementation easy to use. To create a model, one needs (of course) the training data, and then implementations of two interfaces in the quipu.maxent package, EventStream and ContextGenerator. These have fairly simple specifications, and example implementations can be found in the <a href="http://grok.sourceforge.net">OpenNLP Grok Library</a> preprocessing components. More details are given in the quipu.maxent <a href="howto.html">HOWTO</a>. <p>We have also set in place some interfaces and code to make it easier to automate the training and evaluation process (the Evalable interface and the TrainEval class). It is not necessary to use this functionality, but if you do you'll find it much easier to see how well your models are doing. The quipu.grok.preprocess.namefind package is an example of a maximum entropy component which uses this functionality. <p>We have managed to use several techniques to reduce the size of the models when writing them to disk, which also means that reading in a model for use is much quicker than with less compact encodings of the model. This was especially important to us since we use many maxent models in the Grok library, and we wanted the start up time and the physical size of the library to be as minimal as possible. As of version 1.2.0, maxent has an io package which greatly simplifies the process of loading and saving models in different formats. <p>The opennlp.maxent package was built by <a href="http://www.cogsci.ed.ac.uk/~jmb/">Jason Baldridge</a>, <a href="http://www.cis.upenn.edu/~tsmorton/">Tom Morton</a>, and <a href="http://www.cogsci.ed.ac.uk/~gbierner">Gann Bierner</a>. We owe a big thanks to <a href="http://www.research.ibm.com/people/a/adwaitr/">Adwait Ratnaparkhi</a> for his work on maximum entropy models for natural language processing applications. His <a href="ftp://ftp.cis.upenn.edu/pub/ircs/tr/97-08.ps.Z">introduction to maxent for NLP</a> and <a href="ftp://ftp.cis.upenn.edu/pub/ircs/tr/98-15/98-15.ps.gz">dissertation</a> are what really made opennlp.maxent and our Grok maxent components (POS tagger, end of sentence detector, tokenizer, name finder) possible! <br> </body> </html> --- NEW FILE: howto.html --- <!doctype html public "-//w3c//dtd html 4.0 transitional//en"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> <meta name="GENERATOR" content="Mozilla/4.5 [en] (X11; I; SunOS 5.5.1 sun4u) [Netscape]"> </head> <body text="#000000" bgcolor="#FFFFFF" link="#0000EE" vlink="#551A8B" alink="#FF0000"> <center><b><font size=+2>opennlp.maxent HOWTO</font></b></center> <br> [Note: this HOWTO is now out of date, though it still should be helpful for newcomers. To mention just briefly, you should have a look at the EventStream interface and consider using that instead of the EventCollector class. 
Actually, the opennlp.grok.preprocess.sentdetect package discussed in this HOWTO has been updated to work with the maxent 1.2.0 as of Grok version 0.5.2, so you can download <a href="http://grok.sf.net">Grok</a> and have a look at that to see what is different. I'll see if I can update this document sometime in the near future, but it isn't a high priority just now. If you have any questions, do not hesitate to post them on the <href="https://sourceforge.net/forum/forum.php?forum_id=18385">help forum</a>. -- Jason] <p>We've tried to make it fairly easy to build and use maxent models, but you need two things to start with: 1) an understanding of feature selection for maxent modeling, and 2) Java skills or the ability to read some example Java code and turn it into what you need. I'll write a very basic summary of what goes on with feature selection. For more details refer to some of the papers mentioned in <a href="whatismaxent.html">here.</a> <p>Features in maxent are functions from outcomes (classes) and contexts to true or false. To take an example from Adwait Ratnaparkhi's part of speech tagger, a useful feature might be: <p> feature (outcome, context) = { 1 if outcome=DETERMINER <br> { && currentword(context) = "that" <br> { 0 otherwise <p>Your job, as a person creating a model of a classification task, is to select the features that will be useful in making decisions. One thing to keep in mind, especially if you are reading any papers on maxent, is that the theoretical representation of these features is not the same as how they are represented in the implementation. (Actually, you really don't need to know the theoretical side to start selecting features with opennlp.maxent.) If you are familiar with feature selection for Adwait Ratnaparkhi's maxent implementation, you should have no problems since our implementation uses features in the same manner as his. Basically, features like the example above are reduced, for your purposes, to the <i>contextual predicate </i>portion of the feature, i.e. currentword(context)="that" (in the implementation this will further reduce to "current=that" or even just "that"). From this point on, I'll forget theory and discuss features from the perspective of the implementation, but for correctness I'll point out that whenever I say feature, I am actually talking about a contextual predicate which will expand into several features (however, this is entirely hidden from the user, so don't worry if you don't understand). <p>So, say you want to implement a program which uses maxent to find names in a text., such as: <blockquote><i>He succeeds Terrence D. Daniels, formerly a W.R. Grace vice chairman, who resigned.</i></blockquote> If you are currently looking at the word <i>Terrence</i> and are trying to decide if it is a name or not, examples of the kinds of features you might use are "previous=succeeds", "current=Terrence", "next=D.", and "currentWordIsCapitalized". You might even add a feature that says that "Terrence" was seen as a name before. <p>Here's how this information translates into the implementation. Let's assume that you already have a trained model for name finding available, that you have created an instance of the MaxentModel interface using that model, and that you are at currently looking at <i>Terrence</i> in the example sentence above. 
To ask the model whether it believes that <i>Terrence </i>is a name or not, you send a String[] with all of the features (such as those discussed above) to the model by calling the method: <blockquote><b>public double[] eval(String[] context);</b></blockquote> The double[] which you get back will contain the probabilities of the various outcomes which the model has assigned based on the features which you sent it. The indexes of the double[] are actually paired with outcomes. For example, the outcomes associated with the probabilites might be "TRUE" for index 0 and "FALSE" for index 1. To find the String name of a particular index outcome, call the method: <blockquote><b>public String getOutcome(int i);</b></blockquote> Also, if you have gotten back double[] after calling <b>eval </b>and are interested in only the outcome which the model assigns the highest probability, you can call the method: <blockquote><b>public String getBestOutcome(double[] outcomes);</b></blockquote> And this will return the String name of that most likely outcome. <br> <p>In order to make context collection process nicely modularized, you need to implement the ContextGenerator interface: <blockquote><b>public interface ContextGenerator {</b> <p><b> /**</b> <br><b> * Builds up the list of contextual predicates given an Object.</b> <br><b> */</b> <br><b> public String[] getContext(Object o);</b> <p><b>}</b></blockquote> In Grok, the Object that we usually pass is a opennlp.opennlp.util.Pair which contains a StringBuffer and the Integer index of the position we are currently at in the StringBuffer. However, you can pass whatever Object you like as long as your implementation of ContextGenerator can deal with it. and produce a String[] with all of the relevant features (contextual predicates) in it. An example is given from the opennlp.grok.preprocess.sentdetect.SDContextGenerator implementation of the opennlp.maxent.ContextGenerator interface. <blockquote><b> /**</b> <br><b> * </b>Builds up the list of features, anchored around a position within the <br> * StringBuffer. 
<br><b> */</b> <br><b> public String[] getContext(Object o) {</b> <br><b> StringBuffer sb = (StringBuffer)((Pair)o).a;</b> <br><b> int position = ((Integer)((Pair)o).b).intValue();</b> <p><b> int lastIndex = sb.length()-1;</b> <p><b> int prefixStart = PerlHelp.previousSpaceIndex(sb, position);</b> <br><b> int prevStart = PerlHelp.previousSpaceIndex(sb, prefixStart);</b> <p><b> int suffixEnd = PerlHelp.nextSpaceIndex(sb, position, lastIndex);</b> <br><b> int nextEnd = PerlHelp.nextSpaceIndex(sb, suffixEnd, lastIndex);</b> <p><b> String prefix, previous, suffix, next;</b> <p><b> prefix = sb.substring(prefixStart, position).trim();</b> <p><b> previous = sb.substring(prevStart, prefixStart).trim();</b> <p><b> if (position == lastIndex) {</b> <br><b> suffix = "";</b> <br><b> next = "";</b> <br><b> } else {</b> <br><b> suffix = sb.substring(position+1,suffixEnd).trim();</b> <br><b> next = sb.substring(suffixEnd, nextEnd).trim();</b> <br><b> }</b> <p><b> ArrayList collectFeats = new ArrayList();</b> <br><b> if (!prefix.equals("")) collectFeats.add("x="+prefix);</b> <br><b> if (PerlHelp.capRE.isMatch(prefix)) collectFeats.add("xcap");</b> <br><b> if (!previous.equals("")) collectFeats.add("v="+previous);</b> <br><b> if (!suffix.equals("")) collectFeats.add("s="+suffix);</b> <br><b> if (!next.equals("")) collectFeats.add("n="+next);</b> <p><b> String[] context = new String[collectFeats.size()];</b> <br><b> for (int i=0; i<collectFeats.size(); i++)</b> <br><b> context[i] = (String)collectFeats.get(i);</b> <p><b> return context;</b> <br><b>}</b></blockquote> Basically, it just runs around the StringBuffer collecting features that we thought would be useful for the end of sentence detection task. <br>You might notice some odd things such as "v=" and "n=" --- these are just abbreviations for "previous" and "next". It is a good idea to use abbreviations for such features since they are generated from the data, and when you train your model, there may be several thousand of features with the form "previous=X" where X is the word preceding a possible sentence ending punctuation mark in the training data. All of these feature names must then be saved to disk eventually, and if you use, for example, "v" instead of "previous", you'll save a significant amount of disk space. 
<p>The SDContextGenerator and the sentence detection model are then used by the method <b>sentDetect </b>in opennlp.grok.preprocess.SentenceDetectorME method as follows (the ContextGenerator has the name "cgen"): <blockquote><b> public String[] sentDetect(String s) {</b> <br><b> StringBuffer sb = new StringBuffer(s);</b> <br><b> REMatch[] enders = PerlHelp.peqRE.getAllMatches(sb);</b> <p><b> int index = 0;</b> <br><b> String sent;</b> <br><b> for (int i=0; i<enders.length; i++) {</b> <br><b> int j = enders[i].getStartIndex();</b> <br><b> probs = <u>model.eval(cgen.getContext(new Pair(sb,new Integer(j))));</u></b> <br><b> if (model.getBestOutcome(probs).equals("T")) {</b> <br><b> sent = sb.substring(index, j+1).trim();</b> <br><b> if (sent.length() > 0) sents.add(sent);</b> <br><b> index=j+1;</b> <br><b> }</b> <br><b> }</b> <p><b> if (index < sb.length()) {</b> <br><b> sent = sb.substring(index).trim();</b> <br><b> if (sent.length() > 0) sents.add(sent);</b> <br><b> }</b> <p><b> String[] sentSA = new String[sents.size()];</b> <br><b> for (int i=0; i<sents.size(); i++)</b> <br><b> sentSA[i] = ((String)sents.get(i)).trim();</b> <br><b> sents.clear();</b> <br><b> return sentSA;</b> <br><b>}</b></blockquote> So that is basically what you need to know to use models! Now, how do you train a new model? For this, you'll want to implement the EventCollector interface: <blockquote>p<b>ublic interface EventCollector {</b> <br><b> public Event[] getEvents();</b> <br><b> public Event[] getEvents(boolean evalMode);</b> <br><b>}</b></blockquote> A class which implements EventCollector should take the data (which it is organizing into events) as an argument to a constructor. For most packages in opennlp.grok.preprocess, we use java.io.Reader objects, as the following segment of the opennlp.grok.preprocess.SDEventCollector shows: <br><b></b> <blockquote><b>public class SDEventCollector implements EventCollector {</b> <br><b> private ContextGenerator cg = new SDContextGenerator();</b> <br><b> private BufferedReader br;</b> <br><b> </b> <br><b> public SDEventCollector(Reader data) {</b> <br><b> br = new BufferedReader(data);</b> <br><b> }</b></blockquote> <b> ...</b><b></b> <p>The <b>getEvents</b> methods required by the interface are then implemented as follows: <blockquote><b>public Event[] getEvents() {</b> <br><b> return getEvents(false);</b> <br><b> }</b><b></b> <p><b>public Event[] getEvents(boolean evalMode) {</b> <br><b> ArrayList elist = new ArrayList();</b> <br><b> int numMatches;</b> <br><b> </b> <br><b> try {</b> <br><b> String s = br.readLine();</b> <br><b> while (s != null) {</b> <br><b> StringBuffer sb = new StringBuffer(s);</b> <br><b> REMatch[] enders = PerlHelp.peqRE.getAllMatches(sb);</b> <br><b> numMatches = enders.length;</b> <br><b> for (int i=0; i<numMatches; i++) {</b> <br><b> int j = enders[i].getStartIndex();</b><b></b> <p><b> Event e;</b> <br><b> String[] context =</b> <br><b> cg.getContext(new Pair(sb, new Integer(j)));</b> <br><b> </b> <br><b> if (i == numMatches-1) {</b> <br><b> e = new Event("T", context);</b> <br><b> } else {</b> <br><b> e = new Event("F", context);</b> <br><b> }</b> <br><b> </b> <br><b> elist.add(e);</b> <br><b> }</b> <br><b> s = br.readLine();</b> <br><b> }</b> <br><b> } catch (Exception e) { e.printStackTrace(); }</b><b></b> <p><b> Event[] events = new Event[elist.size()];</b> <br><b> for(int i=0; i<events.length; i++)</b> <br><b> events[i] = (Event)elist.get(i);</b><b></b> <p><b> return events;</b> <br><b>}</b></blockquote> Basically, this just walks 
through the data, asks the ContextGenerator for contexts, and throws an event outcome onto it to create a opennlp.maxent.Event object. Notice that we ignore the boolean <b>evalMode</b> in this implementation, which is because the SentenceDetectorME has not yet been set up for the nice automatic evaluation stuff made possible by the Evalable interface and TrainEval class. See the opennlp.grok.preprocess.namefind and opennlp.grok.preprocess.postag packages for examples which take advantage of the evaluation code. <p>Once you have both your ContextGenerator and EventCollector implementations as well as your training data in hand, you can train up a model. opennlp.maxent has an implementation of Generalized Iterative Scaling (opennlp.maxent.GIS) which you can use for this purpose. Write some code somewhere to make a call to the method <b>GIS.trainModel</b>, which will ultimately save a model in a location which you have specified. <blockquote><b>public static void trainModel(String modelpath, String modelname, DataIndexer di, int iterations) { ... }</b></blockquote> The <i>modelpath</i> is the directory where you want the model saved, the <i>modelname</i> is however you want to call the model, and the <i>iterations</i> are the number of times the training procedure should iterate when finding the model's parameters. You shouldn't need more than 100 iterations, and when you are first trying to create your model, you'll probably want to use fewer so that you can iron out problems without waiting each time for all those iterations, which can be quite a while depending on the task. The DataIndexer is an object that pulls in all those events that your EventCollector has gathered and then manipulates them into a format that is much more efficient for the training procedure to work with. There is nothing complicated here --- you just need to create a DataIndexer with the events and an integer that is the cutoff for the number of times a feature must have been seen in order to be considered in the model. <blockquote><b>public DataIndexer(Event[] events, int cutoff) { ... }</b></blockquote> You can also call the constructor <b>DataIndexer(Event[] events)</b>, which assumes a cutoff of 0. An example of code which does all of these steps to create a model follows (from opennlp.grok.preprocess.sentdetect.SentenceDetectorME): <blockquote><b>public static void main(String[] args) {</b> <br><b> try {</b> <br><b> FileReader datafr = new FileReader(new File(args[0]));</b> <br><b> String outdir = args[1];</b> <br><b> String modelname = args[2];</b> <br><b> DataIndexer di = new DataIndexer(new SDEventCollector(datafr).getEvents(), 3);</b> <br><b> GIS.trainModel(outdir, modelname, di, 100);</b> <br><b> } catch (Exception e) {</b> <br><b> e.printStackTrace();</b> <br><b> }</b> <br><b> </b> <br><b>}</b></blockquote> Once the training is done, GIS dumps the model out as two files, one containing the model's parameters in binary format and the other containing information such as the different outcomes, the outcomes which have been associated with particular features, and the features themselves. They are saved (automatically gzipped) with the names <i>modelname</i>.mep.gz and <i>modelname</i>.mei.gz, respectively ("mep" for maxent parameters and "mei" for maxent info). 
<p>Now that you have your models dumped to disk, you can create an instance of opennlp.maxent.MaxentModel by calling the constructor <b>GISModel(String modellocation, String modelname)</b>, which assumes that the two files for the model are gzipped with the mei and mep suffixes that GIS saved them as. So if you had just trained a model called "MyClassificationTask" which is saved in the directory /myproject/classify/ (as the files MyClassificationTask.mep.gz and MyClassificationTask.mei.gz), you would create your model instance by calling <b>GISModel("/myproject/classify/", "MyClassificationTask")</b>. Make sure that you have the trailing directory separator '/' on the location. (Note: we use a Unix example here, but it should work for other OS types as well.) Alternatively, you can create the model by using the constructor which takes InputStreams for the parameters and info files: <b>GISModel(InputStream modelinfo, InputStream modelparams)</b>. See opennlp.grok.preprocess.sentdetect.EnglishSentenceDetectorME for an example of this. <p>That's it! Hopefully, with this little HOWTO and the example implementations available in opennlp.grok.preprocess, you'll be able to get maxent models up and running without too much difficulty. Please let me know if any parts of this HOWTO are particularly confusing and I'll try to make things more clear. I would also welcome "patches" to this document if you feel like making changes yourself. Also, feel free to take the opennlp.grok.preprocess implementations of ContextGenerator and EventCollector and modify and use them for your own purposes (they are GPL'ed). </body> </html> --- NEW FILE: index.html --- <!doctype html public "-//w3c//dtd html 4.0 transitional//en"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> <meta name="GENERATOR" content="Mozilla/4.5 [en] (X11; I; SunOS 5.5.1 sun4u) [Netscape]"> </head> <body text="#000000" bgcolor="#FFFFFF" link="#0000EE" vlink="#551A8B" alink="#FF0000"> <center><b><font color="#333300"><font size=+4>The OpenNLP Maximum Entropy Package</font></font></b></center> <br> <br> The opennlp.maxent package is a mature Java package for training and using maximum entropy models. This web page contains some details about maximum entropy and using the opennlp.maxent package. It is updated periodically, so check out the <a href="https://sourceforge.net/project/?group_id=5961">Sourceforge page for Maxent</a> for the latest news. You can also ask questions and join in discussions on the <a href="https://sourceforge.net/forum/?group_id=5961">forums</a>. <br> <br> <a href="https://sourceforge.net/project/showfiles.php?group_id=5961">Download</a> the latest version of maxent.
<br> <br> [<a href="details.html">Implementation details</a>] [<a href="whatismaxent.html">About Maxent</a>] [<a href="howto.html">opennlp.maxent HOWTO</a>] [<a href="api/index.html">opennlp.maxent API</a>] [<a href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/maxent/">CVS</a>] <br> <hr> <A href="http://sourceforge.net"> <IMG src="http://sourceforge.net/sflogo.php?group_id=5961&type=1" width="88" height="31" border="0"></A> <br>2000-12-06 </body> </html> --- NEW FILE: whatismaxent.html --- <!doctype html public "-//w3c//dtd html 4.0 transitional//en"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> <meta name="GENERATOR" content="Mozilla/4.5 [en] (X11; I; SunOS 5.5.1 sun4u) [Netscape]"> </head> <body text="#000000" bgcolor="#FFFFFF" link="#0000EE" vlink="#551A8B" alink="#FF0000"> <center><b><font size=+1>What is maximum entropy?</font></b></center> <p>It will be simplest to quote from <a href="http://www-nlp.Stanford.EDU/~manning/">Manning</a> and <a href="ftp://parcftp.xerox.com/pub/qca/schuetze.html">Schutze</a>* (p. 589): <blockquote>Maximum entropy modeling is a framework for integrating information from many heterogeneous information sources for classification. The data for a classification problem is described as a (potentially large) number of features. These features can be quite complex and allow the experimenter to make use of prior knowledge about what types of information are expected to be important for classification. Each feature corresponds to a constraint on the model. We then compute the maximum entropy model, the model with the maximum entropy of all the models that satisfy the constraints. This term may seem perverse, since we have spent most of the book trying to minimize the (cross) entropy of models, but the idea is that we do not want to go beyond the data. If we chose a model with less entropy, we would add `information' constraints to the model that are not justified by the empirical evidence available to us. Choosing the maximum entropy model is motivated by the desire to preserve as much uncertainty as possible.</blockquote> So that gives a rough idea of what the maximum entropy framework is: don't assume anything about your probability distribution other than what you have observed. <p>On the engineering level, using maxent is an excellent way of creating programs which perform very difficult classification tasks very well. For example, precision and recall figures for programs using maxent models have reached the state of the art on tasks like part of speech tagging, sentence detection, prepositional phrase attachment, and named entity recognition. An added benefit is that the person creating a maxent model only needs to inform the training procedure of the event space, and need not worry about independence between features. <p>While I am personally involved only in the use of maxent models in natural language processing, the framework is certainly quite general and useful for a much wider variety of fields. In fact, maximum entropy modeling was originally developed for statistical physics. <p>For a very in-depth discussion of how maxent can be used in natural language processing, I recommend reading <a href="http://www.research.ibm.com/people/a/adwaitr/">Adwait Ratnaparkhi</a>'s <a href="ftp://ftp.cis.upenn.edu/pub/ircs/tr/98-15/98-15.ps.gz">dissertation.
</a> Also, check out Berger, Della Pietra, and Della Pietra's paper <a href="http://citeseer.nj.nec.com/cs?profile=314155%2C67650%2C1%2C0.25%2CDownload&rd=http%3A//www.cs.cmu.edu/%7Eaberger/ps/compling.ps">A Maximum Entropy Approach to Natural Language Processing</a>, which provides an excellent introduction and discussion of the framework. <p><i>*<a href="http://www.sultry.arts.usyd.edu.au/fsnlp/promo/">Foundations of Statistical Natural Language Processing</a></i>. Christopher D. Manning and Hinrich Schutze. <br>Cambridge, Mass.: MIT Press, c1999. </body> </html> |
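The description in whatismaxent.html above can also be stated compactly in symbols. As a sketch in the conventional notation of the maxent literature (none of these symbols come from the package itself): among all conditional distributions whose feature expectations match the empirical expectations,

$$ p^{*} = \arg\max_{p \in \mathcal{C}} H(p), \qquad \mathcal{C} = \{\, p : E_{p}[f_i] = E_{\tilde{p}}[f_i] \ \text{for every feature } f_i \,\}, $$

the chosen model \(p^{*}\) is the one with maximum entropy, and it always takes the log-linear form

$$ p(o \mid c) = \frac{1}{Z(c)} \exp\Big( \sum_i \lambda_i f_i(c,o) \Big), \qquad Z(c) = \sum_{o'} \exp\Big( \sum_i \lambda_i f_i(c,o') \Big), $$

where \(c\) is a context, \(o\) an outcome, and the \(f_i\) are the features. A training procedure such as GIS does nothing more than iteratively estimate the weights \(\lambda_i\) until the constraints are satisfied.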
From: Jason B. <jas...@us...> - 2001-10-23 14:21:49
|
Update of /cvsroot/maxent/maxent In directory usw-pr-cvs1:/tmp/cvs-serv26689 Modified Files: build.xml Log Message: Added target for creating homepage easily. Updated version to 1.2.1. Index: build.xml =================================================================== RCS file: /cvsroot/maxent/maxent/build.xml,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** build.xml 2001/10/23 14:06:52 1.1.1.1 --- build.xml 2001/10/23 14:21:46 1.2 *************** *** 11,15 **** <property name="Name" value="Maxent"/> <property name="name" value="maxent"/> ! <property name="version" value="1.2.0"/> <property name="year" value="2001"/> --- 11,15 ---- <property name="Name" value="Maxent"/> <property name="name" value="maxent"/> ! <property name="version" value="1.2.1"/> <property name="year" value="2001"/> *************** *** 28,32 **** <property name="build.src" value="./build/src"/> <property name="build.dest" value="./build/classes"/> ! <property name="build.javadocs" value="./build/apidocs"/> <property name="dist.root" value="./dist"/> --- 28,32 ---- <property name="build.src" value="./build/src"/> <property name="build.dest" value="./build/classes"/> ! <property name="build.javadocs" value="./docs/api"/> <property name="dist.root" value="./dist"/> *************** *** 120,123 **** --- 120,137 ---- zipfile="../${name}-${version}-src.tgz" /> <delete file="${name}-${version}-src.tar" /> + </target> + + + <!-- =================================================================== --> + <!-- Creates the homepage --> + <!-- =================================================================== --> + <target name="homepage" depends="init,javadoc"> + <tar tarfile="${name}-homepage.tar" + basedir="./docs/" + includes="**" + excludes="**/CVS" /> + <gzip src="${name}-homepage.tar" + zipfile="${name}-homepage.tgz" /> + <delete file="${name}-homepage.tar" /> </target> |
From: Jason B. <jas...@us...> - 2001-10-23 14:21:06
|
Update of /cvsroot/maxent/maxent/docs In directory usw-pr-cvs1:/tmp/cvs-serv26438/docs Log Message: Directory /cvsroot/maxent/maxent/docs added to the repository |
From: Jason B. <jas...@us...> - 2001-10-23 14:19:30
|
Update of /cvsroot/maxent/maxent In directory usw-pr-cvs1:/tmp/cvs-serv25658 Modified Files: build.sh Log Message: Classpath built automatically from all jars in lib directory. Index: build.sh =================================================================== RCS file: /cvsroot/maxent/maxent/build.sh,v retrieving revision 1.1.1.1 retrieving revision 1.2 diff -C2 -d -r1.1.1.1 -r1.2 *** build.sh 2001/10/23 14:06:52 1.1.1.1 --- build.sh 2001/10/23 14:19:27 1.2 *************** *** 20,24 **** fi ! LOCALCLASSPATH=./lib/ant.jar${PS}./lib/jaxp.jar${PS}./lib/crimson.jar${PS}$JAVA_HOME/lib/tools.jar${PS}./lib/colt.jar${PS}./lib/gnu-regexp.jar${PS}./lib/java-getopt.jar ANT_HOME=./lib --- 20,32 ---- fi ! LOCALCLASSPATH=$JAVA_HOME/lib/tools.jar ! # add in the dependency .jar files ! DIRLIBS=lib/*.jar ! for i in ${DIRLIBS} ! do ! if [ "$i" != "${DIRLIBS}" ] ; then ! LOCALCLASSPATH=$LOCALCLASSPATH${PS}"$i" ! fi ! done ANT_HOME=./lib |
From: Jason B. <jas...@us...> - 2001-10-23 14:06:56
|
Update of /cvsroot/maxent/maxent In directory usw-pr-cvs1:/tmp/cvs-serv20177 Log Message: Reimporting sources after massive restructuring. Status: Vendor Tag: opennlp Release Tags: restructured N maxent/build.xml N maxent/AUTHORS N maxent/COMMANDLINE N maxent/build.sh N maxent/README N maxent/LICENSE N maxent/RELEASE N maxent/src/java/opennlp/maxent/AllEnglishAffixes.txt N maxent/src/java/opennlp/maxent/BasicEnglishAffixes.java N maxent/src/java/opennlp/maxent/BinToAscii.java N maxent/src/java/opennlp/maxent/ComparableEvent.java N maxent/src/java/opennlp/maxent/ComparablePredicate.java N maxent/src/java/opennlp/maxent/ContextGenerator.java N maxent/src/java/opennlp/maxent/Counter.java N maxent/src/java/opennlp/maxent/DataIndexer.java N maxent/src/java/opennlp/maxent/DataStream.java N maxent/src/java/opennlp/maxent/DomainToModelMap.java N maxent/src/java/opennlp/maxent/Evalable.java N maxent/src/java/opennlp/maxent/Event.java N maxent/src/java/opennlp/maxent/EventCollector.java N maxent/src/java/opennlp/maxent/EventCollectorAsStream.java N maxent/src/java/opennlp/maxent/EventStream.java N maxent/src/java/opennlp/maxent/GIS.java N maxent/src/java/opennlp/maxent/GISFormat N maxent/src/java/opennlp/maxent/GISModel.java N maxent/src/java/opennlp/maxent/GISTrainer.java N maxent/src/java/opennlp/maxent/MaxentModel.java N maxent/src/java/opennlp/maxent/ModelDomain.java N maxent/src/java/opennlp/maxent/ModelReplacementManager.java N maxent/src/java/opennlp/maxent/ModelSetter.java N maxent/src/java/opennlp/maxent/PerlHelp.java N maxent/src/java/opennlp/maxent/PlainTextByLineDataStream.java N maxent/src/java/opennlp/maxent/TrainEval.java N maxent/src/java/opennlp/maxent/build.xml N maxent/src/java/opennlp/maxent/io/BinToAscii.java N maxent/src/java/opennlp/maxent/io/BinaryGISModelReader.java N maxent/src/java/opennlp/maxent/io/BinaryGISModelWriter.java N maxent/src/java/opennlp/maxent/io/GISModelReader.java N maxent/src/java/opennlp/maxent/io/GISModelWriter.java N maxent/src/java/opennlp/maxent/io/OldFormatGISModelReader.java N maxent/src/java/opennlp/maxent/io/PlainTextGISModelReader.java N maxent/src/java/opennlp/maxent/io/PlainTextGISModelWriter.java N maxent/src/java/opennlp/maxent/io/SuffixSensitiveGISModelReader.java N maxent/src/java/opennlp/maxent/io/SuffixSensitiveGISModelWriter.java N maxent/lib/jaxp.license.html N maxent/lib/jaxp.readme N maxent/lib/jaxp.relnotes.html N maxent/lib/LIBNOTES N maxent/lib/ASL N maxent/lib/LGPL N maxent/lib/crimson.license-asf N maxent/lib/crimson.license-w3c.html N maxent/lib/crimson.readme N maxent/lib/gnu-regexp.license N maxent/lib/colt.license No conflicts created by this import ***** Bogus filespec: - Imported sources |
From: Gann B. <ga...@us...> - 2001-08-09 18:29:52
|
Update of /cvsroot/maxent/maxentc In directory usw-pr-cvs1:/tmp/cvs-serv30726 Modified Files: HashMap.h HashSet.h Log Message: Added lgpl license for HashMap.h and HashSet.h Index: HashMap.h =================================================================== RCS file: /cvsroot/maxent/maxentc/HashMap.h,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** HashMap.h 2001/08/09 18:09:10 1.1 --- HashMap.h 2001/08/09 18:29:49 1.2 *************** *** 1,2 **** --- 1,20 ---- + /////////////////////////////////////////////////////////////////////////////// + // Copyright (C) 2001 Electric Knowledge + // + // This library is free software; you can redistribute it and/or + // modify it under the terms of the GNU Lesser General Public + // License as published by the Free Software Foundation; either + // version 2.1 of the License, or (at your option) any later version. + // + // This library is distributed in the hope that it will be useful, + // but WITHOUT ANY WARRANTY; without even the implied warranty of + // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + // GNU General Public License for more details. + // + // You should have received a copy of the GNU Lesser General Public + // License along with this program; if not, write to the Free Software + // Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + ////////////////////////////////////////////////////////////////////////////// + //////////////////////////////////////////////////////////////////////////////////////// // File: HashMap.h Index: HashSet.h =================================================================== RCS file: /cvsroot/maxent/maxentc/HashSet.h,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** HashSet.h 2001/08/09 18:09:10 1.1 --- HashSet.h 2001/08/09 18:29:49 1.2 *************** *** 1,2 **** --- 1,19 ---- + /////////////////////////////////////////////////////////////////////////////// + // Copyright (C) 2001 Electric Knowledge + // + // This library is free software; you can redistribute it and/or + // modify it under the terms of the GNU Lesser General Public + // License as published by the Free Software Foundation; either + // version 2.1 of the License, or (at your option) any later version. + // + // This library is distributed in the hope that it will be useful, + // but WITHOUT ANY WARRANTY; without even the implied warranty of + // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + // GNU General Public License for more details. + // + // You should have received a copy of the GNU Lesser General Public + // License along with this program; if not, write to the Free Software + // Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + ////////////////////////////////////////////////////////////////////////////// #ifndef HASHSET_H #define HASHSET_H *************** *** 6,16 **** const float MAX_LOAD = .75; - - - ////////////////////////////////////////////////////////////////////////////////// - // HashSet.h: A generic HashSet template - // - // - ////////////////////////////////////////////////////////////////////////////////// #define HASH_SET_TEMPLATE template<class T, class HF> --- 23,26 ---- |
From: Gann B. <ga...@us...> - 2001-08-09 18:09:13
|
Update of /cvsroot/maxent/maxentc In directory usw-pr-cvs1:/tmp/cvs-serv26100 Added Files: HashMap.h HashSet.h Log Message: Forgot to add hash .h files. --- NEW FILE: HashMap.h --- //////////////////////////////////////////////////////////////////////////////////////// // File: HashMap.h // Author: Gann Bierner // Date: 4/20/01 // // This file implements a polymorphic map of keys to values. It essentially wraps // HashSet by combining the key and value within a single object to be added into the // set. // // Defining a HashMap of a particular type is done most easily with a typedef: // // typedef HashMap<key type, value type, functions class> // // The "key type" and "value type" are, respectively, the types (primitive or class) // of your keys and values. The "functions class" is a class of static methods that // are used for manipulating these types. This class must contain all of the methods // for HashSet (see HashSet.h) and also two more: "copyVal" and "delVal". These // functions should define how to make a copy of a value and delete a value. // If your key is of type "char*", then life is easier because you can simply extend // szHashFunctions (from HashSet.h) and add these two methods. // // See the definition of StringStringMap (a map from char* to char*) at the bottom of // this file for an example. //////////////////////////////////////////////////////////////////////////////////////// #ifndef HASH_MAP_H #define HASH_MAP_H #include "HashSet.h" #define HASH_MAP_TEMPLATE template <class K, class V, class HF> #define HASH_MAP HashMap<K, V, HF> HASH_MAP_TEMPLATE class HashMap { public: // a combined entry to insert into the HashSet struct HashMapEntry { HashMapEntry(K k, V v) { key=k; value=v; } K key; V value; }; private: // create a set of functions based on HF to be used on HashMapEntries within the HashSet class HashMapFunctions { public: static bool bComp(HashMapEntry* a, HashMapEntry* b) { return HF::bComp(a->key, b->key); } static int nHash(HashMapEntry* n, int nSize) { return HF::nHash(n->key, nSize); } static HashMapEntry* copy(HashMapEntry* n) { return new HashMapEntry(HF::copy(n->key), HF::copyVal(n->value)); } static void del(HashMapEntry* n) { HF::del(n->key); HF::delVal(n->value); delete n; } }; // create a set composed of HashMapEntry references typedef HashSet<HashMapEntry*, HashMapFunctions> HashMapSet; HashMapSet *set; public: HashMap() { set = new HashMapSet; } HashMap(int nInitialSize) { set = new HashMapSet(nInitialSize); } ~HashMap() { delete set; } void put(K key, V value) { set->put(new HashMapEntry(HF::copy(key), HF::copyVal(value)), false); } int nTotalEntries() { return set->nTotalEntries(); } bool bContains(const K key) { HashMapEntry n(key, NULL); return set->bContains(&n); } HashMapEntry* pGetNextEntry() { return set->pGetNextEntry(); } void Reset() { set->Reset(); } void Clear() { set->Clear(); } V get(K key) { HashMapEntry n(key, NULL); HashMapEntry* newn = set->get(&n); if(newn==NULL) return NULL; else return newn->value; } void remove(const K key) { HashMapEntry n(key, NULL); HashMapEntry* newn = set->get(&n); if(newn) set->remove(newn); } }; ////////////////////////////////////////////////////////////////////////////////////////////// // A String to String HashMap ////////////////////////////////////////////////////////////////////////////////////////////// class szszHashMapFunctions : public szHashFunctions { public: static char* copyVal(char* sz) { return strdup(sz); } static void delVal(char* sz) { free(sz); } }; typedef HashMap<char*, char*,
szszHashMapFunctions> StringStringMap; #endif --- NEW FILE: HashSet.h --- #ifndef HASHSET_H #define HASHSET_H #include <stdlib.h> #include <assert.h> const float MAX_LOAD = .75; ////////////////////////////////////////////////////////////////////////////////// // HashSet.h: A generic HashSet template // // ////////////////////////////////////////////////////////////////////////////////// #define HASH_SET_TEMPLATE template<class T, class HF> #define HASH_SET HashSet<T, HF> HASH_SET_TEMPLATE class HashSet { private: struct hash_entry { hash_entry(const T d) { data = d; pNext = NULL; } T data; hash_entry* pNext; }; int m_nSize; int m_nEntries; hash_entry** m_aEntries; int m_nCurRow; // when iterating through elements hash_entry* m_pCurEntry; // " public: HashSet(); HashSet(int nInitialSize); ~HashSet(); void Clear(); T put(const T e, bool bAlloc=true); void remove(const T e); int nTotalEntries() { return m_nEntries; } bool bContains(const T e); T get(T e); T pGetNextEntry(); void Reset() { m_nCurRow = -1; m_pCurEntry = NULL; } private: void init(int nInitialSize); void push(const T e, int nPos); void grow(); }; HASH_SET_TEMPLATE HASH_SET::HashSet() { init(100); } HASH_SET_TEMPLATE HASH_SET::HashSet(int nInitialSize) { init(nInitialSize); } HASH_SET_TEMPLATE HASH_SET::~HashSet() { Clear(); delete[] m_aEntries; } HASH_SET_TEMPLATE void HASH_SET::Clear() { for(int i=0; i<m_nSize; i++) { hash_entry* pEntry = m_aEntries[i]; while(pEntry) { hash_entry* temp = pEntry; pEntry = pEntry->pNext; HF::del(temp->data); delete temp; } m_aEntries[i] = NULL; } m_nEntries = 0; Reset(); } HASH_SET_TEMPLATE void HASH_SET::init(int nInitialSize) { Reset(); m_nEntries = 0; m_nSize = nInitialSize; m_aEntries = new hash_entry*[m_nSize]; for(int i=0; i<m_nSize; i++) m_aEntries[i] = NULL; } HASH_SET_TEMPLATE T HASH_SET::pGetNextEntry() { // advance to the next entry if(m_pCurEntry) m_pCurEntry = m_pCurEntry->pNext; // if the entry does not exist, scan for a non-empty row while(!m_pCurEntry && m_nCurRow < m_nSize-1) m_pCurEntry = m_aEntries[++m_nCurRow]; if(m_pCurEntry) return m_pCurEntry->data; else return NULL; } HASH_SET_TEMPLATE void HASH_SET::grow() { int nOldEntries = m_nEntries; int nOldSize = m_nSize; hash_entry** aOldEntries = m_aEntries; // create new storage twice as big as the old init(2*m_nSize); // rehash all of the entries // this can be more efficient-- right now it recreates all of the hash_entries.
for(int i=0; i<nOldSize; i++) { hash_entry* pEntry = aOldEntries[i]; while(pEntry) { put(pEntry->data, false); hash_entry* temp = pEntry; pEntry = pEntry->pNext; delete temp; } } assert(nOldEntries==m_nEntries); delete[] aOldEntries; } HASH_SET_TEMPLATE void HASH_SET::push(const T e, int nPos) { hash_entry* new_entry = new hash_entry(e); hash_entry* temp = m_aEntries[nPos]; m_aEntries[nPos] = new_entry; new_entry->pNext = temp; m_nEntries++; if( ((float)m_nEntries) / ((float)m_nSize) > MAX_LOAD ) grow(); } HASH_SET_TEMPLATE T HASH_SET::put(const T _e, bool bAlloc) { T e = _e; if(bAlloc) e = HF::copy(e); int nPos = HF::nHash(e, m_nSize); if(!m_aEntries[nPos]) push(e, nPos); else { // compare against the first entry in the row, then the rest of the chain if(HF::bComp(m_aEntries[nPos]->data, e)) return(m_aEntries[nPos]->data); hash_entry* temp = m_aEntries[nPos]; while(temp->pNext && !HF::bComp(temp->pNext->data, e)) temp=temp->pNext; if(!temp->pNext) { // insert a new entry push(e, nPos); } else return(temp->pNext->data); } return(e); } HASH_SET_TEMPLATE bool HASH_SET::bContains(const T e) { int nPos = HF::nHash(e, m_nSize); hash_entry* entry = m_aEntries[nPos]; while(entry && !HF::bComp(entry->data, e)) entry = entry->pNext; return entry!=NULL; } HASH_SET_TEMPLATE T HASH_SET::get(T e) { int nPos = HF::nHash(e, m_nSize); hash_entry* entry = m_aEntries[nPos]; while(entry && !HF::bComp(entry->data, e)) entry = entry->pNext; if(entry==NULL) return NULL; else return entry->data; } HASH_SET_TEMPLATE void HASH_SET::remove(const T e) { int nPos = HF::nHash(e, m_nSize); hash_entry* entry = m_aEntries[nPos]; if(m_aEntries[nPos]) { // it's the first one if(HF::bComp(entry->data, e)) { m_aEntries[nPos] = entry->pNext; HF::del(entry->data); delete entry; } else { while(entry->pNext && !HF::bComp(entry->pNext->data, e)) entry = entry->pNext; // we found it if(entry->pNext) { hash_entry* temp = entry->pNext; entry->pNext = entry->pNext->pNext; HF::del(temp->data); delete temp; } } } } ////////////////////////////////////////////////////////////////////////////////// // A String based HashSet ////////////////////////////////////////////////////////////////////////////////// #include <math.h> #include <string.h> class szHashFunctions { public: static bool bComp(const char* a, const char* b) { return strcmp(a, b)==0; } static int nHash(char* e, int nSize){ int nLen=strlen(e); int sum=0; for(int i=0; i<nLen; i++) sum+=e[i]; double key=sum*.54319068095; key = key - floor(key); return (int)(key*nSize); } static char* copy(const char* sz) { return strdup(sz); } static void del(char* sz) { free(sz); } }; typedef HashSet<char*, szHashFunctions> StringHashSet; //////////////////////////////////////////////////////////////////////////////////// //// An int based HashSet //////////////////////////////////////////////////////////////////////////////////// class nHashFunctions { public: static bool bComp(const int a, const int b) { return a==b; } static int nHash(int e, int nSize){ return e%nSize; } static int copy(int n) { return n; } static void del(int n) { } }; typedef HashSet<int, nHashFunctions> IntHashSet; #endif |
From: Jason B. <jas...@us...> - 2001-08-09 18:06:25
|
Update of /cvsroot/maxent/maxent In directory usw-pr-cvs1:/tmp/cvs-serv25460 Modified Files: TODO Log Message: Reduced the TODO list for tasks already completed. Index: TODO =================================================================== RCS file: /cvsroot/maxent/maxent/TODO,v retrieving revision 1.1 retrieving revision 1.2 diff -C2 -d -r1.1 -r1.2 *** TODO 2001/04/30 14:03:55 1.1 --- TODO 2001/08/09 18:06:23 1.2 *************** *** 1,17 **** ! for v1.2 ------------------------ - * use short[][] for contexts variable of GIS.java and DataIndexer. - - * try to reduce the amount of memory needed! - - write some of the data to disk before algorithm starts - - see about the trade-off with processing time vs. storing big - arrays to hold onto info - * profile the GIS algorithm - - * see if we can write all the data to one file without increasing the - storage requirements. However, see if storage space can be reduced - by storing predicates as shorts instead of list of strings - corresponding to ints. --- 1,5 ---- ! for v1.3 ------------------------ * profile the GIS algorithm |