#16 Add NER for Portuguese

Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2010-09-24
Created: 2010-09-23
Private: No

This patch adds named entity recognition (NER) for Portuguese.
The training corpus can be downloaded from http://www.linguateca.pt/floresta/corpus.html
More specifically, use Amazonia 1.0: http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz

The entity types extracted from this corpus are: person, organization, group, place, event, artprod, abstract, thing, time, and numeric.

The evaluation results are:
Precision: 0.8005071889818507
Recall: 0.7450581122145297
F-Measure: 0.7717879983140168
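
As a consistency check, the reported F-Measure agrees with the harmonic mean of the precision and recall above, assuming the balanced F1 definition F = 2PR / (P + R). A minimal sketch in plain Java; the class and method names are illustrative and are not part of the patch or of the OpenNLP API:

```java
// Illustrative check only: FMeasureCheck is a hypothetical name, not an OpenNLP class.
class FMeasureCheck {

    // Balanced F-measure (F1): harmonic mean of precision and recall.
    // Returns 0.0 when both inputs are 0 to avoid dividing by zero.
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        if (sum == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        double precision = 0.8005071889818507; // value reported in this ticket
        double recall = 0.7450581122145297;    // value reported in this ticket
        System.out.println("F-Measure: " + fMeasure(precision, recall));
    }
}
```

The computed value matches the reported 0.7717879983140168 up to floating-point rounding.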

Patch description:
* portuguese_ner_20100923.txt
/opennlp/src/main/java/opennlp/tools/cmdline/namefind/TokenNameFinderConverterTool.java
- added the "ad" option. AD stands for Árvores Deitadas, the name of the format used by the Portuguese corpus
/opennlp/src/main/java/opennlp/tools/formats/ADNameSampleStream.java
/opennlp/src/main/java/opennlp/tools/formats/ADNameSampleStreamFactory.java
- the NameSampleStream implementation and its factory, used to read the AD format
/opennlp/src/main/java/opennlp/tools/formats/ADParagraphStream.java
- an auxiliary class to parse the AD corpus. This class can be shared when we implement other streams that use the AD corpus
/opennlp/src/main/java/opennlp/tools/lang/portuguese/ContractionUtility.java
- a utility class to handle Portuguese contractions.
/opennlp/src/test/java/opennlp/tools/formats/ADParagraphStreamTest.java
- test for ADParagraphStream
/opennlp/src/test/java/opennlp/tools/formats/ADNameSampleStreamTest.java
- test for ADNameSampleStream
/opennlp/src/test/resources/opennlp/tools/formats/ad.sample
- AD corpus sample
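
To illustrate the kind of work a contraction utility does, here is a minimal sketch that merges a Portuguese preposition and a following article into their contracted form (for example "de" + "o" becomes "do"). This is an assumption-laden illustration: the class name, method name, and lookup table are hypothetical and do not reproduce the ContractionUtility in the patch, whose actual API is not shown in this ticket.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the ContractionUtility shipped in the patch.
class ContractionSketch {

    // Small, deliberately incomplete table of preposition + article contractions.
    private static final Map<String, String> CONTRACTIONS = new HashMap<>();

    static {
        CONTRACTIONS.put("de+o", "do");
        CONTRACTIONS.put("de+a", "da");
        CONTRACTIONS.put("de+os", "dos");
        CONTRACTIONS.put("de+as", "das");
        CONTRACTIONS.put("em+o", "no");
        CONTRACTIONS.put("em+a", "na");
        CONTRACTIONS.put("por+o", "pelo");
        CONTRACTIONS.put("por+a", "pela");
        CONTRACTIONS.put("a+o", "ao");
    }

    // Returns the contracted form if the pair is in the table,
    // otherwise the two tokens joined by a single space.
    static String contract(String left, String right) {
        String key = left.toLowerCase(Locale.ROOT) + "+" + right.toLowerCase(Locale.ROOT);
        String merged = CONTRACTIONS.get(key);
        return merged != null ? merged : left + " " + right;
    }

    public static void main(String[] args) {
        System.out.println(contract("de", "o"));
        System.out.println(contract("em", "a"));
        System.out.println(contract("com", "o"));
    }
}
```

A table lookup keeps the sketch easy to verify; a real utility would need a fuller table and would have to preserve capitalization.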

Discussion

  • Joern Kottmann

    Joern Kottmann - 2010-09-24
    • status: open --> closed-fixed
     
