From: <lor...@us...> - 2011-05-26 09:23:37
Revision: 2824
          http://dl-learner.svn.sourceforge.net/dl-learner/?rev=2824&view=rev
Author:   lorenz_b
Date:     2011-05-26 09:23:31 +0000 (Thu, 26 May 2011)

Log Message:
-----------
Updated models for stanford POS tagger. Added models for other frameworks.

Modified Paths:
--------------
    trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt
    trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger
    trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props
    trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger
    trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props

Added Paths:
-----------
    trunk/components-ext/src/main/resources/tbsl/models/lingpipe/
    trunk/components-ext/src/main/resources/tbsl/models/lingpipe/pos-en-general-brown.HiddenMarkovModel
    trunk/components-ext/src/main/resources/tbsl/models/opennlp/
    trunk/components-ext/src/main/resources/tbsl/models/stanford/

Modified: trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt
===================================================================
--- trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt 2011-05-26 09:11:48 UTC (rev 2823)
+++ trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt 2011-05-26 09:23:31 UTC (rev 2824)
@@ -1,5 +1,5 @@
-Stanford POS Tagger, v. 2.0 - 23 Dec 2009.
-Copyright (c) 2002-2009 The Board of Trustees of
+Stanford POS Tagger, v. 3.0.2 - 2011-05-15.
+Copyright (c) 2002-2011 The Board of Trustees of
 The Leland Stanford Junior University. All Rights Reserved.
 
 This document contains (some) information about the models included in
@@ -54,17 +54,25 @@
 
 Arabic tagger
 ---------------------------
-arabic.tagger
-Trained on the train part of the ATB p1-3 split done for the 2005 JHU
-Summer Workshop (Diab split), using (augmented) Bies tags.
-(Augmented) Bies mapping of Penn Arabic Treebank tags
+arabic-accurate.tagger
+Trained on the *entire* ATB p1-3.
+When trained on the train part of the ATB p1-3 split done for the 2005
+JHU Summer Workshop (Diab split), using (augmented) Bies tags, it gets
+the following performance:
 Performance:
-96.42% on dev portion according to Diab split
-(80.45% on unknown words)
+96.50% on dev portion according to Diab split
+(80.59% on unknown words)
 
+arabic-fast.tagger
+4x speed improvement over "accurate".
+Performance:
+96.34% on dev portion according to Diab split
+(80.28% on unknown words)
+
 
 German tagger
 ---------------------------
+german-accurate.tagger
 Trained on the first 80% of the Negra corpus, which uses the STTS tagset.
 The Stuttgart-Tübingen Tagset (STTS) is a set of 54 tags for annotating
 German text corpora with part-of-speech labels, which was jointly
@@ -73,5 +81,10 @@
 University of Tübingen. See:
 http://www.ims.uni-stuttgart.de/projekte/CQPDemos/Bundestag/help-tagset.html
 Performance:
-96.91% on the first half of the remaining 20% of the Negra corpus (dev set)
-(90.41% on unknown words)
+96.90% on the first half of the remaining 20% of the Negra corpus (dev set)
+(90.33% on unknown words)
+
+german-fast.tagger
+8x speed improvement over "accurate".
+Performance:
+96.61% overall / 86.72% unknown.

Modified: trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger
===================================================================
(Binary files differ)

Modified: trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props
===================================================================
--- trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props 2011-05-26 09:11:48 UTC (rev 2823)
+++ trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props 2011-05-26 09:23:31 UTC (rev 2824)
@@ -1,35 +1,33 @@
-## tagger training invoked at Mon Dec 21 23:21:02 PST 2009 with arguments:
- model = bidirectional-distsim-wsj-0-18.tagger
- arch = bidirectional5words,naacl2003unknowns,wordshapes(-1,1),distsim(/u/nlp/data/pos_tags_are_useless/egw.bnc.200.pruned,-1,1),distsimconjunction(/u/nlp/data/pos_tags_are_useless/egw.bnc.200.pruned,-1,1)
- trainFile = /u/nlp/data/pos-tagger/train-wsj-0-18
- closedClassTags =
- closedClassTagThreshold = 40
- curWordMinFeatureThresh = 2
- debug = false
- debugPrefix =
- tagSeparator = _
- encoding = UTF-8
- initFromTrees = false
- iterations = 100
- lang = english
- learnClosedClassTags = false
- minFeatureThresh = 2
- openClassTags =
-rareWordMinFeatureThresh = 5
- rareWordThresh = 5
- search = owlqn
- sgml = false
- sigmaSquared = 0.5
- regL1 = 0.75
- tagInside =
- tokenize = true
- tokenizerFactory =
- treeRange =
- treeNormalizer =
- treeTransformer =
- verbose = false
- veryCommonWordThresh = 250
- xmlInput =
- outputFile =
- outputFormat = slashTags
- outputFormatOptions =
+## tagger training invoked at Fri Apr 15 01:00:51 PDT 2011 with arguments:
+ model = bidirectional-distsim-wsj-0-18.tagger
+ arch = bidirectional5words,naacl2003unknowns,wordshapes(-1,1),distsim(../data/pos-tagger/training/english/egw.bnc.200.pruned,-1,1),distsimconjunction(../data/pos-tagger/training/english/egw.bnc.200.pruned,-1,1)
+ trainFile = ../data/pos-tagger/training/english/train-wsj-0-18
+ closedClassTags =
+ closedClassTagThreshold = 40
+ curWordMinFeatureThresh = 2
+ debug = false
+ debugPrefix =
+ tagSeparator = _
+ encoding = UTF-8
+ iterations = 100
+ lang = english
+ learnClosedClassTags = false
+ minFeatureThresh = 2
+ openClassTags =
+rareWordMinFeatureThresh = 5
+ rareWordThresh = 5
+ search = owlqn
+ sgml = false
+ sigmaSquared = 0.5
+ regL1 = 0.75
+ tagInside =
+ tokenize = true
+ tokenizerFactory =
+ tokenizerOptions =
+ verbose = false
+ verboseResults = true
+ veryCommonWordThresh = 250
+ xmlInput =
+ outputFile =
+ outputFormat = slashTags
+ outputFormatOptions =

Modified: trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger
===================================================================
(Binary files differ)

Modified: trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props
===================================================================
--- trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props 2011-05-26 09:11:48 UTC (rev 2823)
+++ trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props 2011-05-26 09:23:31 UTC (rev 2824)
@@ -1,35 +1,33 @@
-## tagger training invoked at Mon Dec 21 23:25:07 PST 2009 with arguments:
- model = left3words-wsj-0-18.tagger
- arch = left3words,naacl2003unknowns,wordshapes(-1,1)
- trainFile = /u/nlp/data/pos-tagger/train-wsj-0-18
- closedClassTags =
- closedClassTagThreshold = 40
- curWordMinFeatureThresh = 2
- debug = false
- debugPrefix =
- tagSeparator = _
- encoding = UTF-8
- initFromTrees = false
- iterations = 100
- lang = english
- learnClosedClassTags = false
- minFeatureThresh = 2
- openClassTags =
-rareWordMinFeatureThresh = 10
- rareWordThresh = 5
- search = owlqn
- sgml = false
- sigmaSquared = 0.0
- regL1 = 0.75
- tagInside =
- tokenize = true
- tokenizerFactory =
- treeRange =
- treeNormalizer =
- treeTransformer =
- verbose = false
- veryCommonWordThresh = 250
- xmlInput =
- outputFile =
- outputFormat = slashTags
- outputFormatOptions =
+## tagger training invoked at Fri Apr 15 07:35:09 PDT 2011 with arguments:
+ model = left3words-wsj-0-18.tagger
+ arch = left3words,naacl2003unknowns,wordshapes(-1,1)
+ trainFile = ../data/pos-tagger/training/english/train-wsj-0-18
+ closedClassTags =
+ closedClassTagThreshold = 40
+ curWordMinFeatureThresh = 2
+ debug = false
+ debugPrefix =
+ tagSeparator = _
+ encoding = UTF-8
+ iterations = 100
+ lang = english
+ learnClosedClassTags = false
+ minFeatureThresh = 2
+ openClassTags =
+rareWordMinFeatureThresh = 10
+ rareWordThresh = 5
+ search = owlqn
+ sgml = false
+ sigmaSquared = 0.0
+ regL1 = 0.75
+ tagInside =
+ tokenize = true
+ tokenizerFactory =
+ tokenizerOptions =
+ verbose = false
+ verboseResults = true
+ veryCommonWordThresh = 250
+ xmlInput =
+ outputFile =
+ outputFormat = slashTags
+ outputFormatOptions =

Added: trunk/components-ext/src/main/resources/tbsl/models/lingpipe/pos-en-general-brown.HiddenMarkovModel
===================================================================
(Binary files differ)

Property changes on: trunk/components-ext/src/main/resources/tbsl/models/lingpipe/pos-en-general-brown.HiddenMarkovModel
___________________________________________________________________
Added: svn:mime-type
   + application/octet-stream

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
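
Both .tagger.props diffs in revision 2824 drop the same four training options (initFromTrees, treeRange, treeNormalizer, treeTransformer) and add two (tokenizerOptions, verboseResults). The following is a minimal Python sketch of how that comparison could be made mechanically. It assumes the props files are plain `key = value` lines with `#` comments, as they appear in the diff. The helpers `parse_props` and `diff_keys` are hypothetical names for illustration and are not part of the Stanford tagger or DL-Learner. The sample strings are an abbreviated subset of the left3words props, not the full files.

```python
def parse_props(text):
    """Parse 'key = value' lines into a dict, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def diff_keys(old, new):
    """Return (removed, added, changed) key lists between two props dicts."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return removed, added, changed


# Abbreviated subset of the rev 2823 and rev 2824 left3words props.
OLD = """\
## tagger training invoked at Mon Dec 21 23:25:07 PST 2009 with arguments:
model = left3words-wsj-0-18.tagger
trainFile = /u/nlp/data/pos-tagger/train-wsj-0-18
initFromTrees = false
tokenizerFactory =
treeRange =
treeNormalizer =
treeTransformer =
verbose = false
"""

NEW = """\
## tagger training invoked at Fri Apr 15 07:35:09 PDT 2011 with arguments:
model = left3words-wsj-0-18.tagger
trainFile = ../data/pos-tagger/training/english/train-wsj-0-18
tokenizerFactory =
tokenizerOptions =
verbose = false
verboseResults = true
"""

removed, added, changed = diff_keys(parse_props(OLD), parse_props(NEW))
print("removed:", removed)
print("added:  ", added)
print("changed:", changed)
```

On this subset the only changed value is trainFile, which matches the move from an absolute to a relative training path visible in the diff.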
From: <lor...@us...> - 2013-02-18 14:17:03
Revision: 3901
          http://dl-learner.svn.sourceforge.net/dl-learner/?rev=3901&view=rev
Author:   lorenz_b
Date:     2013-02-18 14:16:54 +0000 (Mon, 18 Feb 2013)

Log Message:
-----------
Removed Stanford models because they can be loaded via Maven now.

Removed Paths:
-------------
    trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt
    trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger
    trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props
    trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger
    trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props

Deleted: trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt
===================================================================
--- trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt 2013-02-18 14:14:10 UTC (rev 3900)
+++ trunk/components-ext/src/main/resources/tbsl/models/README-Models.txt 2013-02-18 14:16:54 UTC (rev 3901)
@@ -1,90 +0,0 @@
-Stanford POS Tagger, v. 3.0.2 - 2011-05-15.
-Copyright (c) 2002-2011 The Board of Trustees of
-The Leland Stanford Junior University. All Rights Reserved.
-
-This document contains (some) information about the models included in
-this release and that may be downloaded for the POS tagger website at
-http://nlp.stanford.edu/software/tagger.shtml . If you have downloaded
-the full tagger, all of the models mentioned in this document are in the
-downloaded package in the same directory as this readme. Otherwise,
-included in the download are two
-English taggers, and the other taggers may be downloaded from the
-website. All taggers are accompanied by the props files used to create
-them; please examine these files for more detailed information about the
-creation of the taggers.
-
-For English, the bidirectional taggers are slightly more accurate, but
-tag much more slowly; choose the appropriate tagger based on your
-speed/performance needs.
-
-English taggers
----------------------------
-bidirectional-distsim-wsj-0-18.tagger
-Trained on WSJ sections 0-18 using a bidirectional architecture and
-including word shape and distributional similarity features.
-Penn Treebank tagset.
-Performance:
-97.28% correct on WSJ 19-21
-(90.46% correct on unknown words)
-
-left3words-wsj-0-18.tagger
-Trained on WSJ sections 0-18 using the left3words architecture and
-includes word shape features. Penn tagset.
-Performance:
-96.97% correct on WSJ 19-21
-(88.85% correct on unknown words)
-
-left3words-distsim-wsj-0-18.tagger
-Trained on WSJ sections 0-18 using the left3words architecture and
-includes word shape and distributional similarity features. Penn tagset.
-Performance:
-97.01% correct on WSJ 19-21
-(89.81% correct on unknown words)
-
-
-Chinese tagger
----------------------------
-chinese.tagger
-Trained on a combination of Chinese Treebank texts from Chinese and Hong
-Kong sources.
-LDC Chinese Treebank POS tag set.
-Performance:
-94.13% on a combination of Chinese and Hong Kong texts
-(78.92% on unknown words)
-
-Arabic tagger
----------------------------
-arabic-accurate.tagger
-Trained on the *entire* ATB p1-3.
-When trained on the train part of the ATB p1-3 split done for the 2005
-JHU Summer Workshop (Diab split), using (augmented) Bies tags, it gets
-the following performance:
-Performance:
-96.50% on dev portion according to Diab split
-(80.59% on unknown words)
-
-arabic-fast.tagger
-4x speed improvement over "accurate".
-Performance:
-96.34% on dev portion according to Diab split
-(80.28% on unknown words)
-
-
-German tagger
----------------------------
-german-accurate.tagger
-Trained on the first 80% of the Negra corpus, which uses the STTS tagset.
-The Stuttgart-Tübingen Tagset (STTS) is a set of 54 tags for annotating
-German text corpora with part-of-speech labels, which was jointly
-developed by the Institut für maschinelle Sprachverarbeitung of the
-University of Stuttgart and the Seminar für Sprachwissenschaft of the
-University of Tübingen. See:
-http://www.ims.uni-stuttgart.de/projekte/CQPDemos/Bundestag/help-tagset.html
-Performance:
-96.90% on the first half of the remaining 20% of the Negra corpus (dev set)
-(90.33% on unknown words)
-
-german-fast.tagger
-8x speed improvement over "accurate".
-Performance:
-96.61% overall / 86.72% unknown.

Deleted: trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger
===================================================================
(Binary files differ)

Deleted: trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props
===================================================================
--- trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props 2013-02-18 14:14:10 UTC (rev 3900)
+++ trunk/components-ext/src/main/resources/tbsl/models/bidirectional-distsim-wsj-0-18.tagger.props 2013-02-18 14:16:54 UTC (rev 3901)
@@ -1,33 +0,0 @@
-## tagger training invoked at Fri Apr 15 01:00:51 PDT 2011 with arguments:
- model = bidirectional-distsim-wsj-0-18.tagger
- arch = bidirectional5words,naacl2003unknowns,wordshapes(-1,1),distsim(../data/pos-tagger/training/english/egw.bnc.200.pruned,-1,1),distsimconjunction(../data/pos-tagger/training/english/egw.bnc.200.pruned,-1,1)
- trainFile = ../data/pos-tagger/training/english/train-wsj-0-18
- closedClassTags =
- closedClassTagThreshold = 40
- curWordMinFeatureThresh = 2
- debug = false
- debugPrefix =
- tagSeparator = _
- encoding = UTF-8
- iterations = 100
- lang = english
- learnClosedClassTags = false
- minFeatureThresh = 2
- openClassTags =
-rareWordMinFeatureThresh = 5
- rareWordThresh = 5
- search = owlqn
- sgml = false
- sigmaSquared = 0.5
- regL1 = 0.75
- tagInside =
- tokenize = true
- tokenizerFactory =
- tokenizerOptions =
- verbose = false
- verboseResults = true
- veryCommonWordThresh = 250
- xmlInput =
- outputFile =
- outputFormat = slashTags
- outputFormatOptions =

Deleted: trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger
===================================================================
(Binary files differ)

Deleted: trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props
===================================================================
--- trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props 2013-02-18 14:14:10 UTC (rev 3900)
+++ trunk/components-ext/src/main/resources/tbsl/models/left3words-wsj-0-18.tagger.props 2013-02-18 14:16:54 UTC (rev 3901)
@@ -1,33 +0,0 @@
-## tagger training invoked at Fri Apr 15 07:35:09 PDT 2011 with arguments:
- model = left3words-wsj-0-18.tagger
- arch = left3words,naacl2003unknowns,wordshapes(-1,1)
- trainFile = ../data/pos-tagger/training/english/train-wsj-0-18
- closedClassTags =
- closedClassTagThreshold = 40
- curWordMinFeatureThresh = 2
- debug = false
- debugPrefix =
- tagSeparator = _
- encoding = UTF-8
- iterations = 100
- lang = english
- learnClosedClassTags = false
- minFeatureThresh = 2
- openClassTags =
-rareWordMinFeatureThresh = 10
- rareWordThresh = 5
- search = owlqn
- sgml = false
- sigmaSquared = 0.0
- regL1 = 0.75
- tagInside =
- tokenize = true
- tokenizerFactory =
- tokenizerOptions =
- verbose = false
- verboseResults = true
- veryCommonWordThresh = 250
- xmlInput =
- outputFile =
- outputFormat = slashTags
- outputFormatOptions =
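
The deleted README quotes a speed-up factor and a dev-set accuracy for each "fast" model next to its "accurate" counterpart. As a rough illustration of that trade-off, the Python sketch below converts the quoted accuracies into a relative increase in tagging errors. It assumes that the fast and accurate figures for each language were measured on the same dev set, which the README implies but does not state for German. The function name `relative_error_increase` is hypothetical.

```python
# Dev-set accuracy (%) and speed-up factors as quoted in README-Models.txt.
TAGGERS = {
    "arabic": {"accurate": 96.50, "fast": 96.34, "speedup": 4},
    "german": {"accurate": 96.90, "fast": 96.61, "speedup": 8},
}


def relative_error_increase(accurate_pct, fast_pct):
    """Fractional growth in token error rate when moving from accurate to fast."""
    err_accurate = 100.0 - accurate_pct
    err_fast = 100.0 - fast_pct
    return err_fast / err_accurate - 1.0


for lang, t in TAGGERS.items():
    inc = relative_error_increase(t["accurate"], t["fast"])
    print(f"{lang}: {t['speedup']}x faster, {inc:.1%} more tagging errors")
# arabic: 4x faster, 4.6% more tagging errors
# german: 8x faster, 9.4% more tagging errors
```

Read this way, the fast models give up well under a tenth of their accuracy margin in exchange for a 4x to 8x speed-up, under the assumption stated above.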