
#9 GISModel Breakage

Status: open-fixed
Owner: nobody
Labels: None
Priority: 9
Updated: 2010-08-11
Created: 2010-08-06
Creator: James Kosin
Private: No

The latest updates in CVS break the models and training.

The issue is with line 174...

The patch below reverts the modification that breaks the models.

Discussion

  • James Kosin

    James Kosin - 2010-08-06

    GISModel patch

     
  • James Kosin

    James Kosin - 2010-08-06
    • priority: 5 --> 9
     
  • James Kosin

    James Kosin - 2010-08-06

    The old line:
    prior[oid] = Math.exp(prior[oid]*model.getConstantInverse());

    the new line in CVS that breaks:
    prior[oid] = Math.exp(prior[oid]);

    The comment says we are using the constant inverse when we are not using the correction constant... However, this constant inverse helps scale the training; without it the models quickly diverge and training stops having done essentially nothing, producing unusable models in most cases.
    With the OpenNLP tools the models almost always diverge within 2-3 iterations or end up with NaN values for the log-likelihood.

    Can we get an explanation as to why using getConstantInverse() when getCorrectionParam() is 0 is not expected?

     
  • Joern Kottmann

    Joern Kottmann - 2010-08-07

    The reason for the change was a fix in the code path that is executed when a gaussian prior is used during training, because that path was broken before.

    So we have to wait until Jason can look into it.

    Jörn

     
  • James Kosin

    James Kosin - 2010-08-09

    I'm sorry for coming across as harsh... and I do know there is a risk of things breaking when using CVS code.

    I'm willing to discuss the issue; however, for now I'm returning to the broken method that seems to work... until we get a clear patch and an explanation of any changes we may need to make to the OpenNLP code to accommodate the changes.

    What appears to be happening now is that the train() function originally calls with 0 for the constant parameter and the inverse of the count for the constant inverse parameter. This second parameter seems to serve two purposes.

    The current result, however, seems to be that the models are trained in only one or two passes. This causes the models to behave erratically at best.

    Maybe we should at least come up with a good model for testing Maxent in the future, to prevent model breakage like this from affecting the outcome of the models.

    Thanks.

     
  • Jason Baldridge

    Jason Baldridge - 2010-08-10

    I've reviewed the code, and what appears to have happened is that the use of the correction constant has been moved from the parameter update portion (in GISTrainer) into the probability of the classes for each event (in GISModel). I frankly don't understand why that happened (the 1.2 version was correct). It should be used for the update:

    alpha_j^(n+1) = alpha_j^n (empirical_expectation_of_feature_j / model_expectation_of_feature_j )^(1/C)

    Where alpha_j is the parameter for feature_j and C is the correction constant. Raising to the power 1/C makes for smaller steps. It seems that the incorrect placement in GISModel was having a somewhat similar effect of making for less exuberant predictions for any given class; this appears to have roughly worked in terms of still finding good models (which I'm a bit surprised by, actually). However, I just put the correction constant back to where it is supposed to be, which meant making the following change in GISTrainer:

    params[pi].updateParameter(aoi,(Math.log(observed[aoi]) - Math.log(model[aoi])));

    becomes:

    params[pi].updateParameter(aoi,((Math.log(observed[aoi]) - Math.log(model[aoi]))/evalParams.getCorrectionConstant()));

    For the datasets I am working with, models trained with this fix perform more accurately. Note that you had divergence probably because it was taking overly aggressive steps for parameter updates and it shot into a lower-likelihood region.
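    To make the role of C concrete, here is a minimal sketch of one GIS update step (not the actual GISTrainer code; the variable names and sample numbers are purely illustrative):

    // One GIS parameter update step, written additively in log space:
    //   alpha_j <- alpha_j * (E~[f_j] / E_p[f_j])^(1/C)
    double correctionConstant = 10.0;                // C, derived from the maximum feature count per event
    double[] observed = {120.0, 45.0, 30.0};         // empirical feature expectations E~[f_j]
    double[] modelExpectation = {100.0, 60.0, 25.0}; // model feature expectations E_p[f_j]
    double[] params = new double[3];                 // current parameters, log(alpha_j)

    for (int j = 0; j < params.length; j++) {
        double step = (Math.log(observed[j]) - Math.log(modelExpectation[j])) / correctionConstant;
        params[j] += step; // dividing by C shrinks the step, which is what keeps training from diverging
    }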

    The gaussian update correctly uses the correction constant, which is why it was *not* working appropriately with the GISModel code that said:

    prior[oid] = Math.exp(prior[oid]*model.getConstantInverse());

    (You can actually use massive values for the gaussian sigma and it will find an okay model, but not the desired one.)

    I'm going to commit the updated files -- can you try training and see what happens with model convergence and performance on the datasets you are using?

    FWIW, I recommend using the gaussian update -- it means not having to fiddle with feature cutoffs or number of iterations. You do need to choose a reasonable sigma, but this is usually a much less sensitive choice.

    Jason

     
  • Jason Baldridge

    Jason Baldridge - 2010-08-10
    • status: open --> pending-fixed
     
  • Joern Kottmann

    Joern Kottmann - 2010-08-10

    After reviewing the referenced papers and our implementation I came to the same conclusion as Jason.

    If we use GIS without smoothing, do we then need the correction feature or not? It looks to me like it was used at some point and then disabled. Do you know more about that?

    I will do testing with the training data I have and report here later.

    Thanks,
    Jörn

     
  • Joern Kottmann

    Joern Kottmann - 2010-08-10

    After testing a little with the Name Finder, I could not see a difference in the tagged data between the versions before and after your fix, as long as the model is re-trained. Sadly, I think retraining the Thai and Spanish models will not be possible in time for the 1.5 release.

    We either drop Spanish and Thai support or keep backward compatibility with already-trained models. If we choose to keep backward compatibility, it might be tricky to integrate your fix for the gaussian smoothing case.

    Jörn

     
  • Joern Kottmann

    Joern Kottmann - 2010-08-10

    After a short discussion with Tom I think we should not remove the correction constant from the calculations
    in the GISModel.eval method because it breaks backward compatibility with already trained models.

    New models trained with 3.0 can just write 1 for this constant into the model.

    The calculation performed would be:
    prior[oid] = Math.exp(prior[oid]) * 1/C; // where C is 1 for new models

    We then have to update the training code slightly so that it does not store the correction constant
    used during training in the model. It's the wrong place anyway in my opinion; it could be an
    instance variable in GISTrainer or just passed to the static method doing the calculations.

    Any opinions?

    Jörn

     
  • James Kosin

    James Kosin - 2010-08-11
    • status: pending-fixed --> open-fixed
     
  • James Kosin

    James Kosin - 2010-08-11

    Jason & Jorn,

    Thanks.
    I was also able to train with the new models, so we can mark this as fixed. My work isn't as sensitive to using the old models anyway; I've been training models as I require them for now.

    I'm not sure something in the training would break the past models, unless something was modified in the output or evaluation of the model at run time. I haven't reviewed the OpenNLP code closely enough yet to understand all the consequences...

    Jason, thanks for the explanation and proper fix for this issue. It caught me off guard. I'm in the habit (or at least my IDE is) of running the tests every time I rebuild the project, so you could say that the project caught the problem in several tests, with very undesirable output for the models.

     
  • James Kosin

    James Kosin - 2010-08-11

    [OT] Jorn, we could also require that 1.5 use maxent 2.5.2 and not the most current version.

    James

     
  • Joern Kottmann

    Joern Kottmann - 2010-08-11

    Jason, can you please share your opinion about setting the correction constant to one for new models and rolling back GISModel.java to CVS version 1.2?

    I made the change (not checked in yet) and old models could be loaded again.

    Jörn

     
  • Nobody/Anonymous

    There are two things going on:

    - there is the question of whether to use the correction *feature* -- Curran and Clark (2003) showed that to be unnecessary (http://www.cl.cam.ac.uk/~sc609/pubs/eacl03.ps), hence Tom's updated implementation that allows that to be ignored

    - there is also the use of the correction *constant*, which is used to make smaller steps in each update, even when you are not using the correction feature (e.g., see eqn (7) in the above paper). It should not be set to one, but should be used as the code currently calculates it.

    For actual model use, the correction constant should have no role in the determination of the predicted class probabilities (e.g. see eqn (1) in the Curran and Clark paper), so as far as I can tell, the new code *should* be compatible with the old models. Those models just have somewhat poorer parameter estimates because of the bug.

    Note that the code had it as:

    prior[oid] = Math.exp(prior[oid] * 1/C);

    Whereas your suggested fix is to do:

    prior[oid] = Math.exp(prior[oid]) * 1/C;

    This cancels out of the probabilities because you are dividing all classes by the same amount, and that is also factored into the normalization constant Z. With the previous (incorrect) calculation, what happened is that the individual feature sums were being considerably dampened---which explains an odd behavior I noted previously in which the predicted class probabilities in a binary labeling task were often close to .5 when I would have expected lots of predictions in the .9-and-upward range. The following code in R demonstrates this:

    > correctionConstant = 10
    > fsums = c(2.0, .3, 4.0)
    > true_probs = exp(fsums)/sum(exp(fsums))
    > dampened_probs = exp(fsums/correctionConstant)/sum(exp(fsums/correctionConstant))
    > true_probs
    [1] 0.11666243 0.02131230 0.86202526
    > dampened_probs
    [1] 0.3262571 0.2752516 0.3984913

    So, here the third class would still be the most probable, but by a considerably smaller margin. This really matters when you want to use the model output with confidence bounds to separate predictions the model is quite sure about from those it isn't. It also affected the training, as I mentioned, because that eval was used to compute class probabilities in training, giving dampened boosts to the various class labels.
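    The flip side (why exp(score) * 1/C is harmless while exp(score * 1/C) is not) can be shown with the same numbers; this is just an illustrative sketch, not library code:

    // Dividing every exp(score) by the same C cancels in the normalization;
    // dividing the scores before the exp pushes the distribution toward uniform.
    public class CorrectionConstantDemo {
        public static void main(String[] args) {
            double C = 10.0;
            double[] fsums = {2.0, 0.3, 4.0};
            double zTrue = 0, zScaled = 0, zDampened = 0;
            for (double s : fsums) {
                zTrue += Math.exp(s);
                zScaled += Math.exp(s) / C;   // exp(s) * 1/C
                zDampened += Math.exp(s / C); // exp(s * 1/C)
            }
            for (double s : fsums) {
                System.out.printf("%.7f %.7f %.7f%n",
                    Math.exp(s) / zTrue,           // true probability
                    (Math.exp(s) / C) / zScaled,   // identical to the true probability: the 1/C cancels
                    Math.exp(s / C) / zDampened);  // dampened, as in the R output above
            }
        }
    }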

    Note that I have taken the correction constant out of that calculation for the case where the correction feature is used, so that should have the right behavior too.

    It is worth noting that with lots of training data, it isn't surprising that these things don't make too much difference. Questions of smoothing become less important when you have better estimates. I'm working with some much smaller training sets, and there you do see much larger effects due to the wrong placement of the correction constant.

    Hope this helps -- let me know if I need to clarify anything. Basically, I *think* you should be able to use the old models with the new code -- maybe try it and see?

    Jason

     
  • Joern Kottmann

    Joern Kottmann - 2010-08-11

    First, I meant
    prior[oid] = Math.exp(prior[oid] * 1/C);
    and not
    prior[oid] = Math.exp(prior[oid]) * 1/C;
    Sorry for that mistake.

    "For actual model use, the correction constant should have no role in the
    determination of the predicted class probabilities (e.g. see eqn (1) in the
    Curran and Clark paper), so as far as I can tell, the new code *should* be
    compatible with the old models. Those models just have somewhat poorer
    parameter estimates because of the bug. "

    I tested an old model with the updated GISModel.eval method,
    where we have:
    prior[oid] = Math.exp(prior[oid]);
    and the results with the Name Finder were different
    (if you think that cannot be correct, I can test again).

    That's why I suggested that we roll back GISModel.java to CVS version 1.2
    (that's the version before your change).
    There we use the correction constant:
    prior[oid] = Math.exp(prior[oid]*model.getConstantInverse());

    If from now on we always write one as the correction constant into the model, the calculation will be:
    prior[oid] = Math.exp(prior[oid]*1);
    which is effectively the same as
    prior[oid] = Math.exp(prior[oid]);

    Old models still use a correction constant different from one and continue to work
    exactly as they always did.

    During the GIS computation we of course use the real correction constant.
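    Sketched out (illustrative only, not the actual GISModel.eval code; the values below are made up), the eval calculation would look roughly like this:

    // New models store 1 as the correction constant, so the factor is a no-op for them;
    // old models keep their real C and behave exactly as they always did.
    double storedCorrectionConstant = 1.0;                  // value read from the model file
    double constantInverse = 1.0 / storedCorrectionConstant;
    double[] prior = {2.0, 0.3, 4.0};                       // per-outcome feature sums for one event

    for (int oid = 0; oid < prior.length; oid++) {
        prior[oid] = Math.exp(prior[oid] * constantInverse); // old models: exp(score/C); new models: plain exp(score)
    }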

    Jörn

     
  • Nobody/Anonymous

    It looks like the parameters are sensitive to that division by the correction constant, so I guess it makes sense that you need to retain the bug in order for those models to still work. I'm fine with you making the changes you suggest here for the upcoming release, but I think we should break compatibility with the older trained models after that. Otherwise, this creates potential future messes for the maxent package, all in the name of preserving models for a few languages that we can hopefully retrain anyway.

     
  • Joern Kottmann

    Joern Kottmann - 2010-08-11

    Ok, then I will commit my changes.

    Yes, you are right: when we do a bigger refactoring it makes sense to break backward compatibility. Then we can just drop the correction constant and correction feature from the model format.

    Jörn

     
