#16 Add NER for Portuguese

Status: closed-fixed

Owner: nobody

Labels: None

Priority: 5

Updated: 2010-09-24

Created: 2010-09-23

Creator: William Colen

Private: No

This patch adds named entity recognition (NER) for Portuguese.
The training corpus can be downloaded from: http://www.linguateca.pt/floresta/corpus.html
More specifically, use Amazonia 1.0: http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz

The entity types extracted from this corpus are: person, organization, group, place, event, artprod, abstract, thing, time and numeric.

The evaluation results are:
Precision: 0.8005071889818507
Recall: 0.7450581122145297
F-Measure: 0.7717879983140168
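As a quick sanity check on the figures above, the F-measure should be the harmonic mean of precision and recall. The sketch below recomputes it from the two reported values; the class and method names are hypothetical and are not part of the patch.

```java
// Hypothetical helper, not part of the patch: recomputes the F-measure
// from the precision and recall reported in this ticket.
class FMeasureCheck {

    // Harmonic mean of precision and recall (F1); returns 0 when both are 0.
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        if (sum == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        double precision = 0.8005071889818507; // value reported above
        double recall = 0.7450581122145297;    // value reported above
        System.out.println("F-Measure: " + fMeasure(precision, recall));
    }
}
```

The result should agree with the reported F-Measure to within floating-point rounding.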

Patch description:
* portuguese_ner_20100923.txt
/opennlp/src/main/java/opennlp/tools/cmdline/namefind/TokenNameFinderConverterTool.java
- added the "ad" option; AD stands for Arvores Deitadas, the name of the format used by the Portuguese corpus
/opennlp/src/main/java/opennlp/tools/formats/ADNameSampleStream.java
/opennlp/src/main/java/opennlp/tools/formats/ADNameSampleStreamFactory.java
- the NameSampleStream and the factory to read the AD format
/opennlp/src/main/java/opennlp/tools/formats/ADParagraphStream.java
- an auxiliary class to parse the AD corpus. This class can be shared
when we implement other streams using the AD corpus
/opennlp/src/main/java/opennlp/tools/lang/portuguese/ContractionUtility.java
- a utility class to handle Portuguese contractions.
/opennlp/src/test/java/opennlp/tools/formats/ADParagraphStreamTest.java
- test for ADParagraphStream
/opennlp/src/test/java/opennlp/tools/formats/ADNameSampleStreamTest.java
- test for ADNameSampleStream
/opennlp/src/test/resources/opennlp/tools/formats/ad.sample
- AD corpus sample
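
To illustrate what handling Portuguese contractions can involve, here is a minimal sketch that rejoins a preposition and a following article into the contracted surface form (for example "de" + "o" becomes "do"). This is an assumption about the kind of work such a utility does; the class name, method name and lookup table below are hypothetical and do not reflect the actual ContractionUtility in the patch.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, NOT the ContractionUtility class from the patch.
// It only covers a handful of common preposition + article contractions.
class ContractionSketch {

    // Key is "left+right" in lower case; value is the contracted form.
    private static final Map<String, String> CONTRACTIONS = Map.of(
        "de+o", "do",
        "de+a", "da",
        "de+os", "dos",
        "de+as", "das",
        "em+o", "no",
        "em+a", "na",
        "a+o", "ao",
        "por+o", "pelo"
    );

    // Returns the contraction if the pair is known, otherwise the two
    // tokens joined by a single space.
    static String join(String left, String right) {
        String key = left.toLowerCase(Locale.ROOT) + "+" + right.toLowerCase(Locale.ROOT);
        return CONTRACTIONS.getOrDefault(key, left + " " + right);
    }

    public static void main(String[] args) {
        System.out.println(join("de", "o"));
        System.out.println(join("em", "a"));
        System.out.println(join("casa", "grande"));
    }
}
```

A real implementation would need a much larger table (pronouns, demonstratives, capitalization), so treat this only as an outline of the idea.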

Discussion

William Colen - 2010-09-23

Patch (Eclipse Workspace Patch 1.0)

portuguese_ner_20100923.txt

Joern Kottmann - 2010-09-24

Thanks for the patch, it's applied now.

I created a small wiki page for this corpus; it would be nice
if you could extend the page:
https://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Arvores_Deitadas

Joern Kottmann - 2010-09-24

status: open --> closed-fixed