This patch adds named entity recognition (NER) for Portuguese.
The training corpus can be downloaded from: http://www.linguateca.pt/floresta/corpus.html
Specifically, use Amazonia 1.0: http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz
The entity types extracted from this corpus are: person, organization, group, place, event, artprod, abstract, thing, time, and numeric.
The evaluation results are:
Precision: 0.8005071889818507
Recall: 0.7450581122145297
F-Measure: 0.7717879983140168
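As a sanity check, the F-Measure above agrees with the usual balanced F1 definition, the harmonic mean of precision and recall. The following is only a minimal sketch of that arithmetic; the class name FMeasureCheck and its method are made up for illustration and are not part of the patch or of OpenNLP's evaluation code.

```java
public class FMeasureCheck {

    // Balanced F-measure (F1): harmonic mean of precision and recall.
    // Returns 0.0 when both inputs are 0 to avoid dividing by zero.
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        if (sum == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Precision and recall as reported in the evaluation results.
        double precision = 0.8005071889818507;
        double recall = 0.7450581122145297;
        System.out.println("F-Measure: " + fMeasure(precision, recall));
    }
}
```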
Patch description:
* portuguese_ner_20100923.txt
/opennlp/src/main/java/opennlp/tools/cmdline/namefind/TokenNameFinderConverterTool.java
- added the "ad" option. AD stands for Arvores Deitadas, the name of the syntax format used in the Portuguese corpus
/opennlp/src/main/java/opennlp/tools/formats/ADNameSampleStream.java
/opennlp/src/main/java/opennlp/tools/formats/ADNameSampleStreamFactory.java
- the NameSampleStream and the factory to read the AD format
/opennlp/src/main/java/opennlp/tools/formats/ADParagraphStream.java
- an auxiliary class to parse the AD corpus. This class can be shared
when we implement other streams using the AD corpus.
/opennlp/src/main/java/opennlp/tools/lang/portuguese/ContractionUtility.java
- a utility class to handle Portuguese contractions.
/opennlp/src/test/java/opennlp/tools/formats/ADParagraphStreamTest.java
- test for ADParagraphStream
/opennlp/src/test/java/opennlp/tools/formats/ADNameSampleStreamTest.java
- test for ADNameSampleStream
/opennlp/src/test/resources/opennlp/tools/formats/ad.sample
- AD corpus sample
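To illustrate what handling Portuguese contractions can involve, here is a rough sketch that joins a preposition and an article into their contracted form (for example "de" + "a" gives "da"). It is not the ContractionUtility from this patch: the class ContractionSketch, its contract method, and the small table of pairs are all hypothetical, and the table covers only a few common cases.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ContractionSketch {

    // Maps "preposition+article" to the contracted form. Hypothetical and incomplete.
    private static final Map<String, String> CONTRACTIONS = new HashMap<>();

    static {
        // Each preposition is paired with the stem it takes when contracted.
        String[][] prepositions = { { "de", "d" }, { "em", "n" }, { "por", "pel" } };
        String[] articles = { "o", "a", "os", "as" };
        for (String[] prep : prepositions) {
            for (String article : articles) {
                CONTRACTIONS.put(prep[0] + "+" + article, prep[1] + article);
            }
        }
    }

    // Returns the contracted form, or null when this sketch knows no contraction.
    static String contract(String preposition, String article) {
        String key = preposition.toLowerCase(Locale.ROOT) + "+" + article.toLowerCase(Locale.ROOT);
        return CONTRACTIONS.get(key);
    }

    public static void main(String[] args) {
        System.out.println(contract("de", "a"));
        System.out.println(contract("em", "o"));
        System.out.println(contract("por", "os"));
    }
}
```

A lookup table keeps the sketch easy to check by hand. A real implementation would also need accented forms and pronoun contractions, which are left out here.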
Patch (Eclipse Workspace Patch 1.0)
Thanks for the patch, it's applied now.
I created a small wiki page for this corpus; it would be nice
if you could extend the page:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Arvores_Deitadas