Hi Tom,
With alphanumeric optimization on, the tokenizer will not attempt to break up any whitespace-delimited token that is all alphanumeric. But if the token contains any non-alphanumeric character, then every character in it becomes a potential split point, alphanumeric or not.
This can result in some strange behavior. For example (a real-life case), "cofactors" is left as is, since it is all alphanumeric, but "cofactors." (with a period) becomes three tokens: "cof", "actors", and ".".
Shouldn't it be the case that, with alphanumeric optimization on, the only potential split points are the non-alphanumeric characters? (And likewise when accumulating the events during training, I guess.) Is there any reason why "cofactors" should have any more of a chance of being split up just because it happens to occur next to a period in one case?
I suppose there might be some weirdness in the training data causing this particular case to split into "cof" and "actors", but it still seems like those splits shouldn't even be considered.
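To make this concrete, here is a rough sketch of the two behaviors in plain Java (this is just my own illustration, not the actual tokenizer code; the method names are made up):

import java.util.ArrayList;
import java.util.List;

public class SplitPointSketch {

    // Current behavior as I understand it: an all-alphanumeric token is left
    // alone, but any other token gets a candidate split at EVERY interior
    // position, alphanumeric or not.
    static List<Integer> currentSplitPoints(String token, boolean alphaNumericOptimization) {
        List<Integer> candidates = new ArrayList<>();
        if (alphaNumericOptimization && token.matches("[A-Za-z0-9]+")) {
            return candidates; // "cofactors" -> no candidates, never split
        }
        for (int i = 1; i < token.length(); i++) {
            candidates.add(i); // "cofactors." -> every position, so "cof|actors|." is possible
        }
        return candidates;
    }

    // What I would expect instead: only positions next to a non-alphanumeric
    // character are candidates, so "cofactors." can only become "cofactors" + "."
    static List<Integer> proposedSplitPoints(String token) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 1; i < token.length(); i++) {
            if (!Character.isLetterOrDigit(token.charAt(i - 1))
                    || !Character.isLetterOrDigit(token.charAt(i))) {
                candidates.add(i);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(currentSplitPoints("cofactors.", true)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
        System.out.println(proposedSplitPoints("cofactors."));      // [9]
    }
}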
thanks,
Seth
Hi Seth,
The behavior of splitting anywhere is needed for things like "won't", where the split is between the "wo" and the "n't". The model that was initially distributed has some issues and has been replaced with a new, less problematic model which handles "cofactors" as you'd expect. Please re-download, and see the links below for more details.
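For a quick sanity check after re-downloading, something like the following should show the expected splits (a rough sketch only: the model path and the TokenizerModel/TokenizerME classes used to load it vary across OpenNLP releases, so adjust for whatever version you have):

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenizerCheck {
    public static void main(String[] args) throws Exception {
        // Load a trained tokenizer model from disk (the path is just an example).
        try (InputStream in = new FileInputStream("en-token.bin")) {
            TokenizerModel model = new TokenizerModel(in);
            TokenizerME tokenizer = new TokenizerME(model);
            // With a well-behaved model you should see something like:
            //   "won't"      -> wo | n't
            //   "cofactors." -> cofactors | .
            for (String s : new String[] { "I won't do it.", "These are cofactors." }) {
                System.out.println(String.join(" | ", tokenizer.tokenize(s)));
            }
        }
    }
}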
https://sourceforge.net/forum/forum.php?thread_id=1423261&forum_id=9942
https://sourceforge.net/forum/forum.php?thread_id=1474550&forum_id=9943
Hope this helps...Tom
Interesting point about "won't". I hadn't thought about those cases.
We are not using the model distributed with opennlp, but rather one trained only on the bio data. Can you tell me what you did to resolve this problem for the model you are working with? Is it just a matter of changing this line in TokenizerME:
GISModel tokModel = opennlp.maxent.GIS.trainModel(100,new TwoPassDataIndexer(evc, 5),true);
to use false instead of true, to turn off the smoothing?
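That is, keeping everything else the same and just flipping the last argument (assuming that last boolean is in fact the smoothing flag):

GISModel tokModel = opennlp.maxent.GIS.trainModel(100, new TwoPassDataIndexer(evc, 5), false); // false = no smoothing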
thanks,
Seth
Hi,
Yeah, while smoothing helps performance on the training data, it causes odd behavior on cases it hasn't seen. The new model has smoothing turned off, and performance on the examples people have submitted appears more consistent. Thanks...Tom
Hi again Tom,
okay, just to confirm then, I should retrain with false instead of true in the trainModel call, as I wrote in my earlier posting?
Someone in the bio group objected to the reasoning about needing to split up words containing a non-alphanumeric character at any possible point, as with "won't", saying that there are just a limited number of such cases and they can be handled by a list. That's true, but then of course you need a list for every such model, and the bias here is toward not handling these things with a rule plus a list, but rather relying on the model to get them right. I guess such a list could be incorporated as a feature, although I'm not sure if it fits cleanly into the way the model features are set up.
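Something like the sketch below is what I have in mind, purely hypothetically (the CONTRACTIONS set and the feature strings are made up, and I haven't checked how this would actually hook into the context generator):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ContractionFeature {

    // Hypothetical list of known contraction pieces; in practice this would be
    // larger and probably loaded from a file.
    private static final Set<String> CONTRACTIONS =
            new HashSet<>(Arrays.asList("won't", "can't", "don't", "isn't", "n't", "'s", "'re"));

    // Emit an extra feature for a candidate token, so the model can learn how
    // much weight to give list membership instead of a hard rule forcing a split.
    static String contractionFeature(String token) {
        return CONTRACTIONS.contains(token.toLowerCase()) ? "contraction=true" : "contraction=false";
    }

    public static void main(String[] args) {
        System.out.println(contractionFeature("won't"));     // contraction=true
        System.out.println(contractionFeature("cofactors")); // contraction=false
    }
}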
> okay, just to confirm then, I should retrain with false instead of true in the trainModel call, as I wrote in my earlier posting?
Yes, that's correct.
I saw some of the bio discussion. The rules-as-features idea is reasonable, although I think you'll get good performance with this fix.
The approach as is has worked reasonably well for Spanish and Thai, where I can't construct a reasonable set of rules, so I don't plan on changing it drastically in the future.
Thanks...Tom