
#95 find_similar_ambiguity_class finds the largest subclass - surely we want the smallest superclass?!

Status: open
Owner: nobody
Labels: None
Updated: 2016-04-03
Created: 2016-03-27
Private: No

You can experiment with find_similar_ambiguity_class using trace-tagger-model from https://github.com/frankier/apertiumhmm2dot . To me it makes no sense to pick a less general ambiguity class: a more general class must contain the actual part of speech, whereas a subclass might not.

I can go ahead and fix this unless someone has a reason why it's this way.
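The contrast between the two strategies can be sketched as follows. This is a minimal illustration in Python (the actual Apertium code is C++, and the function names here are illustrative, not Apertium's real API), using the example classes discussed later in the attached email:

```python
# Illustrative sketch of the two candidate-selection strategies discussed
# in this ticket. Names are hypothetical, not Apertium's actual API.

def largest_subset(model_classes, target):
    """Behaviour as currently observed: the biggest known class
    contained in `target`."""
    subsets = [c for c in model_classes if c <= target]
    return max(subsets, key=len, default=None)

def smallest_superset(model_classes, target):
    """Proposed behaviour: the smallest known class containing `target`."""
    supersets = [c for c in model_classes if c >= target]
    return min(supersets, key=len, default=None)

# Ambiguity classes assumed to be in the trained model:
model = [
    frozenset({"VERB"}),
    frozenset({"ADJ"}),
    frozenset({"NOUN"}),
    frozenset({"VERB", "ADJ", "NOUN"}),
    frozenset({"VERB", "ADJ", "ADV"}),
]
# Class seen in the input but absent from the model:
target = frozenset({"VERB", "ADJ"})

print(sorted(largest_subset(model, target)))     # ['VERB']
print(sorted(smallest_superset(model, target)))  # ['ADJ', 'NOUN', 'VERB']
```

Note that the subset strategy drops one of the two genuinely possible tags, while the superset strategy keeps both (at the cost of admitting an extra one); ties between equal-sized candidates are broken arbitrarily here by model order.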

Discussion

  • Kevin Brubeck Unhammer

    The authors might not be subscribed to this ticket tracker – could you try emailing them if you haven't already? (I don't know if this is fsanchez' code or some later addition.)

     
  • Frankie Robertson

    Attached is the email exchange:

    Dear Sergio Ortiz-Rojas,

    I understand you are the original author of find_similar_ambiguity_class.

    Just to jog your memory, this function is used in the HMM tagger and the
    LSWPoST (where it has been copy-pasted) during tagging to find a
    substitute ambiguity class for an ambiguity class in the input stream
    which is not in the tagger model. As I understand it, this can
    currently happen for two reasons:
    1) Lexemes with new ambiguity classes have been added to the
    dictionary after the tagger model has been trained (this is less of a
    problem -- people should retrain their taggers).
    2) CG output has been piped to apertium-tagger, creating new ambiguity
    classes not in the dictionary (quite a few language pairs seem to do
    this, and I would hazard a guess that it does create new ambiguity
    classes).

    Currently find_similar_ambiguity_class appears to pick the largest
    subset of the desired ambiguity class (this is what I concluded from
    reading the code, and to be sure I made a tool, trace-tagger-model, to
    confirm it, which is available in this repo:
    https://github.com/frankier/apertiumhmm2dot ).

    So for example, if the model contains {VERB}, {ADJ}, {NOUN}, {VERB,
    ADJ, NOUN} & {VERB, ADJ, ADV} and the input stream contains {VERB,
    ADJ} then find_similar_ambiguity_class will pick either {VERB} or
    {ADJ}.

    To my eyes, this doesn't seem like the best choice, since an arbitrary
    constraint is being added which has a chance of excluding the real POS
    when the model might otherwise have been able to make a good guess
    from a more general class. That is to say, I think
    find_similar_ambiguity_class should pick the smallest superset, i.e.
    in the example either {VERB, ADJ, NOUN} or {VERB, ADJ, ADV}. (Later,
    where there are multiple candidates of the same size, it might be
    possible to add a heuristic to pick the "best" by some metric.)

    Can you recall whether there is a reason it's the way it is now, or
    would you be happy for me to modify the function as I have outlined
    above (without the heuristic for now)?

    SF ticket: https://sourceforge.net/p/apertium/tickets/95/

    Regards,
    Frankie

    Sergio Ortiz Rojas:

    Hello Frankie,

    Firstly, I want to point out that I only made minor changes to this piece of code, whose authorship belongs more to Felipe than to me. As for your problem, if I am not mistaken, the standard way to resolve this kind of conflict is to include all classes in the tagger setup and retrain. Obviously the default behaviour can be improved (it was intended as a fallback), but retraining with all classes should produce the best results in this case. Ideally, a clean retraining against a fixed, unmodified morphological dictionary would be preferable.

    Best

    Sergio Ortiz

    Felipe Sánchez Martínez:

    Hi,

    Quick answer from the author. If you go for a superset of the ambiguity class, it may happen that a tag in that superset but not in the original class is chosen. In that case, the current implementation of the tagger will crash.

    Cheers

    Felipe
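Felipe's concern above can be made concrete with a small sketch. Continuing the {VERB, ADJ} example from the exchange (illustrative Python, not the tagger's real C++ API; the `clamp` helper and the toy scores are hypothetical, one possible mitigation rather than anything in Apertium):

```python
# The failure mode Felipe describes: a superset fallback lets the decoder
# choose a tag the input word cannot actually take. Names are illustrative.

observed = frozenset({"VERB", "ADJ"})          # class seen in the input
fallback = frozenset({"VERB", "ADJ", "NOUN"})  # smallest superset in the model

extra_tags = fallback - observed
print(sorted(extra_tags))  # ['NOUN'] -- choosable by the decoder, but invalid

# One hypothetical mitigation: after decoding with the superset, clamp the
# chosen tag back to the observed class.
def clamp(chosen_tag, observed, score):
    """If the decoder picked a tag outside `observed`, fall back to the
    highest-scoring tag that is actually in the observed class."""
    if chosen_tag in observed:
        return chosen_tag
    return max(observed, key=score)

# Toy scores standing in for the model's probabilities:
score = {"VERB": 0.5, "ADJ": 0.3, "NOUN": 0.6}.get
print(clamp("NOUN", observed, score))  # VERB
```

This is only a sketch of one way the crash could be avoided; whether it fits the tagger's actual decoding loop is a separate question.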

     
