From: <lor...@us...> - 2011-05-26 09:23:37
Revision: 2824
          http://dl-learner.svn.sourceforge.net/dl-learner/?rev=2824&view=rev
Author:   lorenz_b
Date:     2011-05-26 09:23:31 +0000 (Thu, 26 May 2011)

Log Message:
-----------
Updated models for stanford POS tagger. Added models for other frameworks.

Modified Paths:
--------------
    trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt
    trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger
    trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props
    trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger
    trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props

Added Paths:
-----------
    trunk/components-ext/src/main/resources/tbsl/models/lingpipe/
    trunk/components-ext/src/main/resources/tbsl/models/lingpipe/pos-en-general-brown.HiddenMarkovModel
    trunk/components-ext/src/main/resources/tbsl/models/opennlp/
    trunk/components-ext/src/main/resources/tbsl/models/stanford/

Modified: trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt
===================================================================
--- trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt 2011-05-26 09:11:48 UTC (rev 2823)
+++ trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt 2011-05-26 09:23:31 UTC (rev 2824)
@@ -1,5 +1,5 @@
-Stanford POS Tagger, v. 2.0 - 23 Dec 2009.
-Copyright (c) 2002-2009 The Board of Trustees of
+Stanford POS Tagger, v. 3.0.2 - 2011-05-15.
+Copyright (c) 2002-2011 The Board of Trustees of
 The Leland Stanford Junior University. All Rights Reserved.
 
 This document contains (some) information about the models included in
@@ -54,17 +54,25 @@
 
 Arabic tagger
 ---------------------------
-arabic.tagger
-Trained on the train part of the ATB p1-3 split done for the 2005 JHU
-Summer Workshop (Diab split), using (augmented) Bies tags.
-(Augmented) Bies mapping of Penn Arabic Treebank tags
+arabic-accurate.tagger
+Trained on the *entire* ATB p1-3.
+When trained on the train part of the ATB p1-3 split done for the 2005
+JHU Summer Workshop (Diab split), using (augmented) Bies tags, it gets
+the following performance:
 Performance:
-96.42% on dev portion according to Diab split
-(80.45% on unknown words)
+96.50% on dev portion according to Diab split
+(80.59% on unknown words)
 
+arabic-fast.tagger
+4x speed improvement over "accurate".
+Performance:
+96.34% on dev portion according to Diab split
+(80.28% on unknown words)
+
 German tagger
 ---------------------------
+german-accurate.tagger
 Trained on the first 80% of the Negra corpus, which uses the STTS tagset.
 The Stuttgart-Tübingen Tagset (STTS) is a set of 54 tags for annotating
 German text corpora with part-of-speech labels, which was jointly
@@ -73,5 +81,10 @@
 University of Tübingen.
 See: http://www.ims.uni-stuttgart.de/projekte/CQPDemos/Bundestag/help-tagset.html
 Performance:
-96.91% on the first half of the remaining 20% of the Negra corpus (dev set)
-(90.41% on unknown words)
+96.90% on the first half of the remaining 20% of the Negra corpus (dev set)
+(90.33% on unknown words)
+
+german-fast.tagger
+8x speed improvement over "accurate".
+Performance:
+96.61% overall / 86.72% unknown.

Modified: trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger
===================================================================
(Binary files differ)

Modified: trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props
===================================================================
--- trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props 2011-05-26 09:11:48 UTC (rev 2823)
+++ trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props 2011-05-26 09:23:31 UTC (rev 2824)
@@ -1,35 +1,33 @@
-## tagger training invoked at Mon Dec 21 23:21:02 PST 2009 with arguments:
- model = bidirectional-distsim-wsj-0-18.tagger
- arch = bidirectional5words,naacl2003unknowns,wordshapes(-1,1),distsim(/u/nlp/data/pos_tags_are_useless/egw.bnc.200.pruned,-1,1),distsimconjunction(/u/nlp/data/pos_tags_are_useless/egw.bnc.200.pruned,-1,1)
- trainFile = /u/nlp/data/pos-tagger/train-wsj-0-18
- closedClassTags =
- closedClassTagThreshold = 40
- curWordMinFeatureThresh = 2
- debug = false
- debugPrefix =
- tagSeparator = _
- encoding = UTF-8
- initFromTrees = false
- iterations = 100
- lang = english
- learnClosedClassTags = false
- minFeatureThresh = 2
- openClassTags =
-rareWordMinFeatureThresh = 5
- rareWordThresh = 5
- search = owlqn
- sgml = false
- sigmaSquared = 0.5
- regL1 = 0.75
- tagInside =
- tokenize = true
- tokenizerFactory =
- treeRange =
- treeNormalizer =
- treeTransformer =
- verbose = false
- veryCommonWordThresh = 250
- xmlInput =
- outputFile =
- outputFormat = slashTags
- outputFormatOptions =
+## tagger training invoked at Fri Apr 15 01:00:51 PDT 2011 with arguments:
+ model = bidirectional-distsim-wsj-0-18.tagger
+ arch = bidirectional5words,naacl2003unknowns,wordshapes(-1,1),distsim(../data/pos-tagger/training/english/egw.bnc.200.pruned,-1,1),distsimconjunction(../data/pos-tagger/training/english/egw.bnc.200.pruned,-1,1)
+ trainFile = ../data/pos-tagger/training/english/train-wsj-0-18
+ closedClassTags =
+ closedClassTagThreshold = 40
+ curWordMinFeatureThresh = 2
+ debug = false
+ debugPrefix =
+ tagSeparator = _
+ encoding = UTF-8
+ iterations = 100
+ lang = english
+ learnClosedClassTags = false
+ minFeatureThresh = 2
+ openClassTags =
+rareWordMinFeatureThresh = 5
+ rareWordThresh = 5
+ search = owlqn
+ sgml = false
+ sigmaSquared = 0.5
+ regL1 = 0.75
+ tagInside =
+ tokenize = true
+ tokenizerFactory =
+ tokenizerOptions =
+ verbose = false
+ verboseResults = true
+ veryCommonWordThresh = 250
+ xmlInput =
+ outputFile =
+ outputFormat = slashTags
+ outputFormatOptions =

Modified: trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger
===================================================================
(Binary files differ)

Modified: trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props
===================================================================
--- trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props 2011-05-26 09:11:48 UTC (rev 2823)
+++ trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props 2011-05-26 09:23:31 UTC (rev 2824)
@@ -1,35 +1,33 @@
-## tagger training invoked at Mon Dec 21 23:25:07 PST 2009 with arguments:
- model = left3words-wsj-0-18.tagger
- arch = left3words,naacl2003unknowns,wordshapes(-1,1)
- trainFile = /u/nlp/data/pos-tagger/train-wsj-0-18
- closedClassTags =
- closedClassTagThreshold = 40
- curWordMinFeatureThresh = 2
- debug = false
- debugPrefix =
- tagSeparator = _
- encoding = UTF-8
- initFromTrees = false
- iterations = 100
- lang = english
- learnClosedClassTags = false
- minFeatureThresh = 2
- openClassTags =
-rareWordMinFeatureThresh = 10
- rareWordThresh = 5
- search = owlqn
- sgml = false
- sigmaSquared = 0.0
- regL1 = 0.75
- tagInside =
- tokenize = true
- tokenizerFactory =
- treeRange =
- treeNormalizer =
- treeTransformer =
- verbose = false
- veryCommonWordThresh = 250
- xmlInput =
- outputFile =
- outputFormat = slashTags
- outputFormatOptions =
+## tagger training invoked at Fri Apr 15 07:35:09 PDT 2011 with arguments:
+ model = left3words-wsj-0-18.tagger
+ arch = left3words,naacl2003unknowns,wordshapes(-1,1)
+ trainFile = ../data/pos-tagger/training/english/train-wsj-0-18
+ closedClassTags =
+ closedClassTagThreshold = 40
+ curWordMinFeatureThresh = 2
+ debug = false
+ debugPrefix =
+ tagSeparator = _
+ encoding = UTF-8
+ iterations = 100
+ lang = english
+ learnClosedClassTags = false
+ minFeatureThresh = 2
+ openClassTags =
+rareWordMinFeatureThresh = 10
+ rareWordThresh = 5
+ search = owlqn
+ sgml = false
+ sigmaSquared = 0.0
+ regL1 = 0.75
+ tagInside =
+ tokenize = true
+ tokenizerFactory =
+ tokenizerOptions =
+ verbose = false
+ verboseResults = true
+ veryCommonWordThresh = 250
+ xmlInput =
+ outputFile =
+ outputFormat = slashTags
+ outputFormatOptions =

Added: trunk/components-ext/src/main/resources/tbsl/models/lingpipe/pos-en-general-brown.HiddenMarkovModel
===================================================================
(Binary files differ)

Property changes on: trunk/components-ext/src/main/resources/tbsl/models/lingpipe/pos-en-general-brown.HiddenMarkovModel
___________________________________________________________________
Added: svn:mime-type
   + application/octet-stream

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.