From: Marcin M. <lis...@wp...> - 2010-11-07 14:19:26
Hi Dominique,

> Hi Marcin
>
> I still have problems with this way of skipping tokens (adverbs
> for example).
>
> 1/ One problem is that it not only skips adverbs, but it also
> skips nouns and adjectives in the above example because the
> regexp has to list not only the postag "A" to skip, but also all
> the possible postags of the next token. So skipping can be
> too greedy. What if I want to skip only adverbs?

Well, it seems that this isn't possible with skipping as it is right now. We would have to rewrite the pattern matching a bit, for example to include your idea of repetition (ranging from 0 up to, at most, the end of the sentence minus the number of tokens in the pattern). This could be specified to occur at the end of the pattern, so that no closing token of the skipping would be included.

Feel free to propose a patch to the XML Schema and Java files: the classes you need to look at are AbstractPatternRule (especially testAllReadings()) and PatternRule (especially match()).

Frankly, I'm not sure how to implement it; probably you could change only a few bits of the code. Right now it simply calculates the possible range of skipped tokens and checks, for each of them, whether it matches an exception. You could use the same range but check for a positive condition, yet I'm not sure that would work easily. The code is pretty short but the concepts involved are tricky...

I guess it should go around line 172 of AbstractPatternRule, adding an OR condition after prevElement.isMatchedByScopeNextException, something like:

    || !prevElement.matchesAPositiveCondition()

where matchesAPositiveCondition would be computed as AND rather than as OR (all positive conditions have to be met). I'm not sure if that would make any difference; I have no time to play with this.

> 2/ It does not work well if a token can have multiple tags.
> We saw for example that if postag to skip can have both "A"
> (adverb) and "SENT_END", the regex has to list all those
> cases ("A|SENT_END|..."). But it's impossible to know what
> are all the other possible tags besides "A" and SENT_END.
> Some adverb words can also be verbs, or nouns, etc.
> The disambiguator cannot always give a single tag to each
> token. Such adverbs would not be skipped.

Same as above...

> 3/ It does not seem to allow me to skip with a token which
> is either an adverb (postag) or a specific token value. For
> example, I don't see how to skip a token which is either an
> adverb (postag "A") or specific word "foo".

There is a quick hack possible: just assign a new special POS tag to the word "foo" in the disambiguator, and you will be able to use that tag in the rules. Note that using the disambiguator this way may simplify your rules a lot, and it allows a certain "cascade" of rules.

Regards
Marcin
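
For illustration only, the "positive condition" idea from point 1 could be sketched roughly as follows. This is not LanguageTool code: the class PositiveSkipSketch, the Token type, the ADVERB_OR_FOO predicate and findAfterSkipping() are all hypothetical names made up for this sketch, and the real change would have to live in AbstractPatternRule and PatternRule. The sketch assumes a token may be skipped if any of its readings is an adverb ("A") or if its surface form is "foo", which also touches on points 2 and 3.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names exist in LanguageTool.
final class PositiveSkipSketch {

    // A token with its surface form and every POS tag still attached to it.
    static final class Token {
        final String text;
        final Set<String> tags;

        Token(String text, Set<String> tags) {
            this.text = text;
            this.tags = tags;
        }
    }

    // Assumed positive skip condition: at least one reading is an adverb ("A"),
    // or the surface form is the specific word "foo".
    static final Predicate<Token> ADVERB_OR_FOO =
            t -> t.tags.contains("A") || t.text.equals("foo");

    // Scans from 'start' and returns the index of the first token that
    // satisfies 'next'. Every token passed over on the way must satisfy
    // 'skippable'; otherwise the scan gives up and returns -1.
    static int findAfterSkipping(List<Token> sentence, int start,
                                 Predicate<Token> skippable,
                                 Predicate<Token> next) {
        for (int i = start; i < sentence.size(); i++) {
            Token t = sentence.get(i);
            if (next.test(t)) {
                return i;   // closing element of the pattern found
            }
            if (!skippable.test(t)) {
                return -1;  // reached a token that may not be skipped
            }
        }
        return -1;          // ran out of sentence without a match
    }
}
```

Because the closing element is tested before the skip condition, a token that is both skippable and a valid closing element ends the skip, so the scan is not greedy. An ambiguous token such as one tagged both "A" and "N" is still skipped, since one adverb reading is enough.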