#16 Add NER for Portuguese


This patch adds named entity recognition (NER) for Portuguese.
The training corpus can be downloaded from: http://www.linguateca.pt/floresta/corpus.html
More specifically, use Amazonia 1.0: http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz

The entity types extracted from this corpus are: person, organization, group, place, event, artprod, abstract, thing, time, and numeric.

The evaluation results are:
Precision: 0.8005071889818507
Recall: 0.7450581122145297
F-Measure: 0.7717879983140168
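
As a sanity check on these figures, here is a minimal sketch that recomputes the F-measure from the reported precision and recall. It assumes the reported value is the balanced F1 (the harmonic mean). The patch does not say how the number was produced, and the function name is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        # Avoid division by zero when nothing was found or expected.
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Values copied from the evaluation results above.
precision = 0.8005071889818507
recall = 0.7450581122145297

f1 = f_measure(precision, recall)
print(f"F-Measure: {f1:.4f}")
```

With the values above, this agrees with the reported F-Measure to within floating-point rounding.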

Patch description:
* portuguese_ner_20100923.txt
- added the "ad" option; AD stands for Árvores Deitadas, the name of the syntax used in the Portuguese corpus
- a NameSampleStream and its factory for reading the AD format
- an auxiliary class to parse the AD corpus; it can be shared when we implement other streams over the AD corpus
- a utility class to handle Portuguese contractions
- a test for ADParagraphStream
- a test for ADNameSampleStream
- an AD corpus sample
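
To illustrate what handling Portuguese contractions can involve, here is a minimal Python sketch. It is not the patch's utility class. It assumes the corpus stores a contraction as two separate tokens (for example "de" + "o") that must be merged back into the surface form ("do"). The table, function names, and the unconditional merging are hypothetical simplifications.

```python
from typing import List, Optional

# Hypothetical, deliberately small table of preposition + article pairs.
# A real utility would cover many more combinations.
CONTRACTIONS = {
    ("de", "o"): "do",
    ("de", "a"): "da",
    ("em", "o"): "no",
    ("em", "a"): "na",
    ("por", "o"): "pelo",
    ("a", "o"): "ao",
}


def to_contraction(left: str, right: str) -> Optional[str]:
    """Return the contracted form of two tokens, or None if they do not contract."""
    merged = CONTRACTIONS.get((left.lower(), right.lower()))
    if merged is not None and left[:1].isupper():
        # Keep sentence-initial capitalization ("Em" + "a" -> "Na").
        merged = merged.capitalize()
    return merged


def merge_contractions(tokens: List[str]) -> List[str]:
    """Merge adjacent token pairs that form a known contraction.

    Assumption: every matching adjacent pair should be merged. A reader
    for a real corpus would rely on the corpus's own contraction marks.
    """
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            contraction = to_contraction(tokens[i], tokens[i + 1])
            if contraction is not None:
                merged.append(contraction)
                i += 2  # both halves consumed
                continue
        merged.append(tokens[i])
        i += 1
    return merged


print(merge_contractions(["casa", "de", "o", "rei"]))
```

Keeping the pair lookup separate from the token loop means the same table could be reused by other streams built on the corpus.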


    Joern Kottmann - 2010-09-24
    • status: open --> closed-fixed
