
#95 find_similar_ambiguity_class finds the largest subclass - surely we want the smallest superclass?!

Status: open
Owner: nobody
Labels: None
Updated: 2016-04-03
Created: 2016-03-27
Private: No

You can experiment with find_similar_ambiguity_class using trace-tagger-model from https://github.com/frankier/apertiumhmm2dot . To me it makes no sense to pick a less general ambiguity class: a more general class must contain the actual part of speech, whereas a subclass might not.

I can go ahead and fix this unless someone has a reason why it's this way.
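The contrast between the two strategies can be sketched as follows. This is a minimal illustration in Python (the actual Apertium code is C++, and the function names here are illustrative, not Apertium's real API), using the example classes discussed later in the attached email:

```python
# Illustrative sketch of the two candidate-selection strategies discussed
# in this ticket. Names are hypothetical, not Apertium's actual API.

def largest_subset(model_classes, target):
    """Behaviour as currently observed: the biggest known class
    contained in `target`."""
    subsets = [c for c in model_classes if c <= target]
    return max(subsets, key=len, default=None)

def smallest_superset(model_classes, target):
    """Proposed behaviour: the smallest known class containing `target`."""
    supersets = [c for c in model_classes if c >= target]
    return min(supersets, key=len, default=None)

# Ambiguity classes assumed to be in the trained model:
model = [
    frozenset({"VERB"}),
    frozenset({"ADJ"}),
    frozenset({"NOUN"}),
    frozenset({"VERB", "ADJ", "NOUN"}),
    frozenset({"VERB", "ADJ", "ADV"}),
]
# Class seen in the input but absent from the model:
target = frozenset({"VERB", "ADJ"})

print(sorted(largest_subset(model, target)))     # ['VERB']
print(sorted(smallest_superset(model, target)))  # ['ADJ', 'NOUN', 'VERB']
```

Note that the subset strategy drops one of the two genuinely possible tags, while the superset strategy keeps both (at the cost of admitting an extra one); ties between equal-sized candidates are broken arbitrarily here by model order.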

Discussion

  • Kevin Brubeck Unhammer

    The authors might not be subscribed to this ticket tracker – could you try emailing them if you haven't already? (I don't know if this is fsanchez' code or some later addition.)

     
  • Frankie Robertson

    Attached is the email exchange:

    Dear Sergio Ortiz-Rojas,

    I understand you are the original author of find_similar_ambiguity_class.

    Just to jog your memory, this function is used in the HMM tagger and the
    LSWPoST (where it has been copy-pasted) during tagging to find a
    substitute ambiguity class for an ambiguity class in the input stream
    which is not in the tagger model. As I understand it, this can
    currently happen for two reasons:
    1) Lexemes with new ambiguity classes have been added to the
    dictionary after the tagger model has been trained (this is less of a
    problem -- people should retrain their taggers).
    2) CG output has been piped to apertium-tagger, creating new ambiguity
    classes not in the dictionary (quite a few language pairs seem to do
    this, and I would hazard a guess that it does create new ambiguity
    classes).

    Currently find_similar_ambiguity_class appears to pick the largest
    subset of the desired ambiguity class (this is what I concluded from
    reading the code, and to be sure I made a tool, trace-tagger-model, to
    confirm it, which is available in this repo:
    https://github.com/frankier/apertiumhmm2dot ).

    So for example, if the model contains {VERB}, {ADJ}, {NOUN}, {VERB,
    ADJ, NOUN} & {VERB, ADJ, ADV} and the input stream contains {VERB,
    ADJ} then find_similar_ambiguity_class will pick either {VERB} or
    {ADJ}.

    To my eyes, this doesn't seem like the best choice, since an arbitrary
    constraint is being added which has a chance of excluding the real POS
    when the model might otherwise have been able to make a good guess
    from a more general class. That is to say, I think
    find_similar_ambiguity_class should pick the smallest superset, i.e.
    in the example either {VERB, ADJ, NOUN} or {VERB, ADJ, ADV}. (Later,
    where there are multiple candidates of the same size, it might be
    possible to add a heuristic to pick the "best" by some metric.)

    Can you recall whether there is a reason it's the way it is now, or
    would you be happy for me to modify the function as I have outlined
    above (without the heuristic for now)?

    SF ticket: https://sourceforge.net/p/apertium/tickets/95/

    Regards,
    Frankie

    Sergio Ortiz Rojas:

    Hello Frankie,

    Firstly, I want to point out that I only made minor changes to this piece of code, whose authorship belongs more to Felipe than to me. As for your problem, if I am not mistaken, the standard way to resolve this kind of conflict is to include all classes in the tagger setup and retrain. Obviously the default behaviour can be improved (it was intended as a fallback), but retraining with all classes should produce the best results in this case. Ideally, a clean retraining against a fixed, unmodified morphological dictionary would be preferable.

    Best

    Sergio Ortiz

    Felipe Sánchez Martínez:

    Hi,

    Quick answer from the author. If you go for a superset of the ambiguity class, it may happen that a tag in that superset but not in the original class is chosen. In that case, the current implementation of the tagger will crash.

    Cheers

    Felipe
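Felipe's concern above can be made concrete with a small sketch. Continuing the {VERB, ADJ} example from the exchange (illustrative Python, not the tagger's real C++ API; the `clamp` helper and the toy scores are hypothetical, one possible mitigation rather than anything in Apertium):

```python
# The failure mode Felipe describes: a superset fallback lets the decoder
# choose a tag the input word cannot actually take. Names are illustrative.

observed = frozenset({"VERB", "ADJ"})          # class seen in the input
fallback = frozenset({"VERB", "ADJ", "NOUN"})  # smallest superset in the model

extra_tags = fallback - observed
print(sorted(extra_tags))  # ['NOUN'] -- choosable by the decoder, but invalid

# One hypothetical mitigation: after decoding with the superset, clamp the
# chosen tag back to the observed class.
def clamp(chosen_tag, observed, score):
    """If the decoder picked a tag outside `observed`, fall back to the
    highest-scoring tag that is actually in the observed class."""
    if chosen_tag in observed:
        return chosen_tag
    return max(observed, key=score)

# Toy scores standing in for the model's probabilities:
score = {"VERB": 0.5, "ADJ": 0.3, "NOUN": 0.6}.get
print(clamp("NOUN", observed, score))  # VERB
```

This is only a sketch of one way the crash could be avoided; whether it fits the tagger's actual decoding loop is a separate question.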

     
