Jorn,
I got the CoNLL 03 data converters in place and working. Wow, this is so much easier to extend now... anyway, I hope you can get the data as well. I'd like some outside verification that the converter's output is fully correct.
The CoNLL 03 data also includes POS tags for the sentences. Would it be useful to create a parser for the POS engine as well?
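For reference, each non-blank line in the English CoNLL 03 files carries four whitespace-separated columns: token, POS tag, syntactic chunk tag, and named-entity tag. A minimal sketch of reading one sentence (the function name and sample lines are mine, not the converter code discussed here):

```python
def parse_conll03_sentence(lines):
    """Parse one sentence's worth of English CoNLL 03 lines into
    (token, pos, chunk, ne) tuples. A line looks like:
    'U.N. NNP I-NP I-ORG'."""
    rows = []
    for line in lines:
        line = line.strip()
        # Blank lines separate sentences; -DOCSTART- marks document boundaries.
        if not line or line.startswith("-DOCSTART-"):
            continue
        token, pos, chunk, ne = line.split()
        rows.append((token, pos, chunk, ne))
    return rows

sample = [
    "U.N. NNP I-NP I-ORG",
    "official NN I-NP O",
    "Ekeus NNP I-NP I-PER",
]
for tok, pos, chunk, ne in parse_conll03_sentence(sample):
    print(tok, ne)
```

A POS converter could reuse the same column split and simply emit the token/POS pair instead of the token/NE pair.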
James
Well, after training, I attempted evaluation and got these numbers. Are they any good?
[code]
Loading Token Name Finder model ... done (2.106s)
current: 176.1 sent/s avg: 176.1 sent/s total: 185 sent
current: 616.2 sent/s avg: 384.7 sent/s total: 774 sent
current: 439.5 sent/s avg: 401.9 sent/s total: 1210 sent
current: 604.2 sent/s avg: 452.2 sent/s total: 1813 sent
current: 760.5 sent/s avg: 513.9 sent/s total: 2573 sent
current: 505.5 sent/s avg: 512.5 sent/s total: 3078 sent
Average: 510.8 sent/s
Total: 3251 sent
Runtime: 6.365s
Precision: 0.9373834886817577
Recall: 0.6596091205211726
F-Measure: 0.7743388353801384
[/code]
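As a sanity check on the numbers above, the reported F-Measure is just the harmonic mean of the reported precision and recall:

```python
# Values taken from the evaluation log above.
precision = 0.9373834886817577
recall = 0.6596091205211726

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # prints about 0.7743, matching the reported F-Measure
```

So the three numbers are internally consistent; the low F-Measure is driven almost entirely by the low recall.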
Okay, I've verified that the model is working. I'm still having significant problems with the detectors on the sample I sent Jörn, though.
Anyway, the data seems good, and I'll assume so until I get some outside verification.
I'll also look into the POS parser and see whether I can just reuse the ConllxPOS... parsers if the formats are the same. I almost felt bad writing Conll03... for the current set, since it doesn't differ much from the older Conll02... set.
James
I found a bug in the code I submitted... after a little reading I found out the 'B-' prefix is actually used, and I found an instance of it in the training set.
I've fixed the bug and marked a TODO item for the Conll 03 parser. If we want to train for multiple types in a single model, there is a problem with entities of different types appearing next to each other, since the 'B-' prefix is only used when adjacent entities share the same type.
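To illustrate the issue: in the original CoNLL 03 (IOB1) tagging, 'B-' only appears when an entity directly follows another entity of the same type; adjacent entities of different types both start with 'I-', so the type change itself has to mark the boundary. A sketch (function name is mine) of turning IOB1 tags into spans under that convention:

```python
def iob1_spans(tags):
    """Extract (start, end, type) entity spans from IOB1 tags.
    A span ends on 'O', on a 'B-' prefix, or on a type change."""
    spans, start, cur = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        prefix, _, etype = tag.partition("-")
        if cur is not None and (prefix in ("O", "B") or etype != cur):
            spans.append((start, i, cur))
            start, cur = None, None
        if prefix in ("B", "I") and cur is None:
            start, cur = i, etype
    return spans

# Same-type adjacency: the 'B-' prefix marks the boundary between two PER spans.
print(iob1_spans(["I-PER", "B-PER"]))  # two PER spans
# Different-type adjacency: the type change alone marks the boundary.
print(iob1_spans(["I-PER", "I-ORG"]))  # one PER span, then one ORG span
```

This is why a single-type model can get away with treating 'I-' as a plain continuation, while a multi-type model has to handle the type-change case explicitly.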
I reviewed your wiki page; would it be possible to add your evaluation results to it? In my opinion this would be really helpful for others, because they could then compare their own results (maybe after modifying the code) to yours.
In the results you reported, the recall seems very low. Maybe we can compare against the other results reported for Conll03 to see where we stand.
Jörn
I'm adding the logic for the German data format. I'll leave testing to someone who has that corpus and can validate the model.
Issue is now migrated to ASF:
https://issues.apache.org/jira/browse/OPENNLP-15