
#10 CoNLL 03 English/German Support

closed
nobody
None
5
2010-12-14
2010-10-06
James Kosin
No

Adding support to convert the CoNLL 03 Reuters data to the NameFinder format. And maybe more, since the data has POS tags as well.

Discussion

  • James Kosin

    James Kosin - 2010-10-06

    Jorn,

    I got the CoNLL 03 data converters in place and working. Wow, so much easier to expand this... anyway, I hope you can get the data as well. I'd like some verification that the converted output is fully correct.

    The CoNLL 03 data also has POS tags for the sentences. Would it be useful to also create a parser for the POS engine?

    James

     
  • James Kosin

    James Kosin - 2010-10-06

    Well, after training, I attempted evaluation and got these numbers. Are they any good?
    [code]
    Loading Token Name Finder model ... done (2.106s)
    current: 176.1 sent/s avg: 176.1 sent/s total: 185 sent
    current: 616.2 sent/s avg: 384.7 sent/s total: 774 sent
    current: 439.5 sent/s avg: 401.9 sent/s total: 1210 sent
    current: 604.2 sent/s avg: 452.2 sent/s total: 1813 sent
    current: 760.5 sent/s avg: 513.9 sent/s total: 2573 sent
    current: 505.5 sent/s avg: 512.5 sent/s total: 3078 sent

    Average: 510.8 sent/s
    Total: 3251 sent
    Runtime: 6.365s

    Precision: 0.9373834886817577
    Recall: 0.6596091205211726
    F-Measure: 0.7743388353801384
    [/code]
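As a quick sanity check on the numbers above, here is a minimal sketch that recomputes the F-measure from the reported precision and recall. It assumes the evaluator reports the balanced F1 score (the harmonic mean of the two); the function name `f_measure` is illustrative and not taken from the OpenNLP code.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation output above.
precision = 0.9373834886817577
recall = 0.6596091205211726

f1 = f_measure(precision, recall)
print(round(f1, 4))  # 0.7743
```

The result agrees with the reported F-measure, so the low overall score comes from recall rather than precision.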

     
  • James Kosin

    James Kosin - 2010-10-08

    Okay, verified that the model is working. I'm still having large problems with the detectors on the sample I sent Jorn.

    Anyway, I seem to have good data, and will assume so until I get some outside verification.

    I'll also look into the POS parser and see if maybe I can just use the ConllxPOS... parsers if they are the same. I almost felt bad just using Conll03... for the current since it doesn't differ by much from the older Conll02... set.

    James

     
  • James Kosin

    James Kosin - 2010-10-10

    I found a bug with the code I submitted... I did a little reading and found out the 'B-' prefix is being used. I also found an instance in the training set.

    I've fixed the bug and marked a todo item for the Conll 03 parser. If we want to train for multiple types in a single model, there is a problem when entities of different types come next to each other, since the 'B-' prefix is only used between adjacent entities of the same type.
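To make the tagging issue concrete, here is a minimal sketch of how name spans could be recovered from CoNLL 03 style tags. It is not the parser submitted for this ticket; the function name `iob1_to_spans` and the sample tag sequence are made up for illustration. It assumes 'B-' appears only when an entity directly follows another entity of the same type, so a change of type on an 'I-' tag also has to be treated as a boundary.

```python
def iob1_to_spans(tags):
    """Convert a sequence of tags such as I-PER, B-PER, O into
    (start, end, type) spans, with end exclusive."""
    spans = []
    start = None
    current = None  # type of the entity currently open, if any
    for i, tag in enumerate(tags):
        if tag == "O":
            prefix, etype = "O", None
        else:
            prefix, etype = tag.split("-", 1)
        # A new entity starts on an explicit B- tag, or when the type
        # differs from the open entity (including when none is open).
        starts_new = etype is not None and (prefix == "B" or etype != current)
        if current is not None and (etype is None or starts_new):
            spans.append((start, i, current))
            current = None
        if starts_new:
            start, current = i, etype
    if current is not None:
        spans.append((start, len(tags), current))
    return spans


# Hypothetical tag sequence: two adjacent PER entities, then a LOC entity.
tags = ["I-ORG", "O", "I-PER", "I-PER", "B-PER", "I-LOC", "O"]
print(iob1_to_spans(tags))
# [(0, 1, 'ORG'), (2, 4, 'PER'), (4, 5, 'PER'), (5, 6, 'LOC')]
```

A converter that only splits on 'B-' would merge the last PER token and the LOC token into one span, which is the multi-type case described above.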

     
  • Joern Kottmann

    Joern Kottmann - 2010-10-13

    I reviewed your wiki page. Would it be possible to add your evaluation results to it? In my opinion this would be really helpful for others, because they can then compare their results (maybe after modifying the code) to yours.

    In the results you reported, the recall seems very low. Maybe we can compare them against other results reported for Conll03 to see where we stand.

    Jörn

     
  • James Kosin

    James Kosin - 2010-10-20
    • summary: CoNLL 03 English Support --> CoNLL 03 English/German Support
     
  • James Kosin

    James Kosin - 2010-10-20

    I'm adding the logic for the German data format. I'll leave testing it to someone who has the German corpus and can validate the model.
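For readers without the corpus, here is a rough sketch of the difference between the two layouts. It assumes the published CoNLL 03 column order: word, POS tag, chunk tag, entity tag for English, with an extra lemma column in second position for German. The function name `parse_conll03_line`, the language codes, and the German sample line are hypothetical and not taken from the converter code.

```python
def parse_conll03_line(line, language="en"):
    """Split one token line into named columns.

    Assumed layouts:
      English: word POS chunk NER
      German:  word lemma POS chunk NER
    """
    fields = line.split()
    expected = 5 if language == "de" else 4
    if len(fields) != expected:
        raise ValueError(
            f"expected {expected} columns for {language!r}, got {len(fields)}"
        )
    if language == "de":
        word, lemma, pos, chunk, ner = fields
    else:
        word, pos, chunk, ner = fields
        lemma = None  # the English data has no lemma column
    return {"word": word, "lemma": lemma, "pos": pos, "chunk": chunk, "ner": ner}


print(parse_conll03_line("U.N. NNP I-NP I-ORG"))
# Made-up German line, for illustration only.
print(parse_conll03_line("Berlin Berlin NE I-NC I-LOC", language="de"))
```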

     
  • Joern Kottmann

    Joern Kottmann - 2010-12-14
     
  • Joern Kottmann

    Joern Kottmann - 2010-12-14
    • status: open --> closed
     
