#2 SentenceDetectorME.java improvement

Status: closed-accepted
Owner: nobody
Labels: None
Priority: 5
Updated: 2010-06-11
Created: 2009-08-20
Creator: James Kosin
Private: No

I've added a patch that may help complete what appears to have been started.
The patch makes both the -lang and -encoding options a requirement. Usually, enclosing an option in [] means the option is optional; that is no longer the case here.
The next part of the patch checks that both lang and encoding are non-null before continuing.
The last part actually uses the lang value when calling the train function.

Simple enough.
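
As a rough illustration of the check described above (this is not the attached patch), the following sketch parses the two options and refuses to continue unless both are present. The class name TrainerArgs, its fields, and the error messages are hypothetical; only the -lang and -encoding option names come from this ticket.

```java
import java.util.Arrays;

/** Hypothetical sketch of the required-option check; not the actual patch. */
final class TrainerArgs {
    final String lang;
    final String encoding;
    final String[] rest; // remaining positional arguments, e.g. input and output files

    private TrainerArgs(String lang, String encoding, String[] rest) {
        this.lang = lang;
        this.encoding = encoding;
        this.rest = rest;
    }

    static TrainerArgs parse(String[] args) {
        // Both start as null on purpose: no default is picked for either option.
        String lang = null;
        String encoding = null;
        int i = 0;
        while (i < args.length && args[i].startsWith("-")) {
            if (args[i].equals("-lang") && i + 1 < args.length) {
                lang = args[i + 1];
                i += 2;
            } else if (args[i].equals("-encoding") && i + 1 < args.length) {
                encoding = args[i + 1];
                i += 2;
            } else {
                throw new IllegalArgumentException(
                        "Unknown or incomplete option: " + args[i]);
            }
        }
        // Both options are required, so stop here if either is missing.
        if (lang == null || encoding == null) {
            throw new IllegalArgumentException(
                    "Both -lang and -encoding must be specified");
        }
        return new TrainerArgs(lang, encoding, Arrays.copyOfRange(args, i, args.length));
    }
}
```

With this shape, making the options optional again would only mean replacing the two null initial values with defaults.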

Other useful additions might be a way to list the valid encoding names and the supported lang values, e.g. "en", "es", etc.
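
For the encoding names, one possible approach is to ask the JVM, which exposes its charsets through java.nio.charset.Charset. The sketch below is only an assumption about how such a helper could look; the class EncodingNames and its method names are hypothetical, and it does not address the list of supported lang values.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for listing and checking encoding names. */
final class EncodingNames {

    /** Canonical names of all charsets available in the running JVM. */
    static List<String> available() {
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    /** True if the name is a legal charset name that this JVM supports. */
    static boolean isValid(String name) {
        try {
            return name != null && Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Charset.isSupported throws IllegalCharsetNameException
            // (a subclass of IllegalArgumentException) for malformed names.
            return false;
        }
    }
}
```

A command-line tool could print EncodingNames.available() in its usage output, or call isValid on the -encoding value before opening the input file.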

Discussion

  • James Kosin
    2009-08-20

    Patch for SentenceDetectorME.java on TRUNK

     
  • Joern Kottmann
    2009-08-20

    Thanks for the patch. It's applied now.

     
  • James Kosin
    2009-08-20

    Thanks for taking it.
    If you want to make -lang and -encoding optional again, you only have to change the null assignments at the top of the main routine. I didn't want to show bias by picking "en" and "US-ASCII" as the defaults.

     
  • Joern Kottmann
    2009-08-21

    Usually it's a good idea to use the platform default encoding as the default. Did training a sentence model work for you? We now also have an evaluator to measure the performance of the sentence detector; if there is no test data, it has built-in support for cross-validation.

     
  • James Kosin
    2009-08-22

    If users are working in their native encoding, then I may agree with you.
    However, this parameter really describes the encoding of the input file, which may or may not be in the native format.
    Maybe, if we kept a standard default encoding for each supported language, we could look up the default encoding based on the specified language. But we would have to be careful not to override an encoding the user specifies on the command line.

    The model trained OK for me. I only had the small sample set with the source to test with.
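
    The lookup idea above could be sketched roughly as follows, with the platform default as the last fallback. Everything here is an assumption for illustration: the class EncodingResolver, the resolve method, and the two table entries are hypothetical and not taken from the code base.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: pick an input encoding without overriding the user's choice. */
final class EncodingResolver {

    // Illustrative per-language defaults only; a real table would need review.
    private static final Map<String, String> DEFAULTS = new HashMap<>();
    static {
        DEFAULTS.put("en", "US-ASCII");
        DEFAULTS.put("es", "ISO-8859-1");
    }

    /**
     * @param lang        language code from -lang
     * @param cliEncoding value of -encoding, or null if it was not given
     */
    static Charset resolve(String lang, String cliEncoding) {
        // 1. An encoding given on the command line always wins.
        if (cliEncoding != null) {
            return Charset.forName(cliEncoding);
        }
        // 2. Otherwise use the per-language default, if one is known.
        String byLang = DEFAULTS.get(lang);
        if (byLang != null) {
            return Charset.forName(byLang);
        }
        // 3. Fall back to the platform default encoding.
        return Charset.defaultCharset();
    }
}
```

    The order of the three checks is the design choice that matters: the language table is consulted only when no -encoding value was supplied.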

     
  • Joern Kottmann
    2010-06-11

    Closed it because the patch is already applied.

     
  • Joern Kottmann
    2010-06-11

    • status: open --> closed-accepted