More mixup documentation

Help
mass1tot
2011-01-15
2013-04-26
  • mass1tot

    mass1tot - 2011-01-15

    Hello,

    I am in the process of learning mixup and looking for all of the resources I possibly can find.  In addition to the resources included with the minorthird dist, I have downloaded the lectures and sample code from William's website.  So essentially I am learning mixup mostly by staring at a bunch of example code (which is fine, mostly).  Still, large questions remain in my mind such as:

    What is the difference between 'defTokenProp' and 'defSpanProp'?  Why use one over the other?  Is this a good (and responsive) forum for asking these types of questions?

    Thanks

    Jay

     
  • Frank Lin

    Frank Lin - 2011-01-15

    This is a good forum (as responsive as it's gonna get, I think) for asking questions about m3rd or mixup. The difference between the two is related to the difference between a token and a span in m3rd. You can think of a token as a sequence of characters as determined by the tokenizer; this is typically a word. For example, the sentence "I ate Suzie's pizza." can be tokenized into 6 tokens:

    "I" "ate" "Suzie" "'s" "pizza" "."

    A span is defined as a sequence of tokens. It can be as long as an entire document (a document span) or as short as a single token. For example, in the sentence above, "Suzie's pizza" can be a noun phrase span, and the whole sentence could be a span too.
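
To make the token/span distinction above concrete, here is a minimal Python sketch. It is not minorthird code: the `tokenize` function and its regular expression are assumptions chosen only so that the example sentence splits into the same six tokens, and a span is modelled simply as a (start, end) pair of token indexes.

```python
import re

def tokenize(text):
    # Toy tokenizer (an assumption, not minorthird's own): split off "'s",
    # keep runs of word characters together, and make each remaining
    # punctuation mark its own token.
    return re.findall(r"'s|\w+|[^\w\s]", text)

tokens = tokenize("I ate Suzie's pizza.")
print(tokens)

# A span is a contiguous run of tokens, written here as (start, end) indexes.
noun_phrase = (2, 5)            # "Suzie" "'s" "pizza"
sentence = (0, len(tokens))     # the whole sentence is also a span

print(tokens[noun_phrase[0]:noun_phrase[1]])
print(tokens[sentence[0]:sentence[1]])
```

In this picture a token property is a label attached to one token, while a span type names a whole (start, end) range.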

     
  • mass1tot

    mass1tot - 2011-01-16

    Thanks for your quick response.  I understand the difference between tokens and spans (I have some background in text-mining and have been successfully using minorthird for a few months for probabilistic text extraction tasks).  I am actually trying to use mixup for post-processing extracted spans from the CRFAnnotator.  I could write some custom Java code to accomplish my task, but it appears mixup should be able to do what I need and I suspect I would find quite a bit of other utility from mixup as well.

    Here are a few different morphologies I am trying to capture in my current problem:

    "Three hundred and ten patients"
    "66 premature infants"
    "839 healthy adult volunteers"
    "Three hundred twenty-nine patients"

    Ideally, my mixup output would be phrases such as these.  Note that I've got a dictionary of "patients", "volunteers", "infants", etc. in my mixup.  I also have a dictionary of number words in my mixup (e.g. one, five, twelve, hundred, thousand, etc.).  I am trying to figure out how to capture "Three hundred twenty-nine", or "342" (I realize this specific part is done with a regex, i.e. re('^\\d+$')), in relation to the (patients|infants|volunteers|etc.) part.  Also note that I am extracting these snippets from blobs of text that tend to be around 20 tokens long.

    So, back to my original question: I am trying to understand how I should be using defSpanProp and defTokenProp to solve my problem.  Trial-and-error experimentation hasn't yielded a path to solving the problem yet.

    Thanks for your help!

    Jay

     
  • Frank Lin

    Frank Lin - 2011-01-17

    Hi Jay,

    There are many ways to specify mixup patterns to extract certain kinds of noun phrases; below is a basic example on which you can build your mixup extractor:

    // ==========================================================
    // A toy example of extracting "count phrases" using Mixup;
    // for example, "twenty-two tutus" or "365 days".
    //
    // Frank Lin
    // ==========================================================
    // define a dictionary of "number words" - not exhaustive, for demonstration
    defDict numword = one, two, three, four, five, six, seven, eight, nine, ten, twenty, thirty, forty, fifty, hundred, thousand;
    // a dictionary of "auxiliary number words"
    defDict numaux = - , and;
    // specify a token property marking which words can be "inside" a number phrase
    defTokenProp num:in =: ... [a(numword)] ... || ... [a(numaux)] ...;
    // define a number phrase as a series of tokens that either a) is a single number word
    // or b) starts and ends with a number word, sandwiching zero or more tokens that can be
    // inside a number phrase
    defSpanType number =: ...[a(numword)] ... || ... [a(numword) num:in* a(numword)] ...;
    // pattern for arabic numerals
    defSpanType numeral =: ... [re('[0-9]+')] ...;
    // pattern for plural noun words
    defSpanType plural =: ... [re('.+s$')] ...;
    // finally, pattern for count phrases
    defSpanType countp =: ... [@number @plural] ... || ... [@numeral @plural] ...;
    

    This can be used to extract counting phrases like these below:

    I have thirty-five dogs, seven chickens, and three thousand three hundred and fifty wives.
    I bet 5000 chips on these 3 horses.
    
     
  • mass1tot

    mass1tot - 2011-01-17

    Thanks a lot, Frank!  This is very helpful.  I've modified the above code to suit my needs a bit better, but it is working well.  One remaining question (and I think this is linked to the defSpanType tag) is that the code will fire multiple times on the same span.  I noticed this behavior when I was messing around with Mixup before I solicited your help.  For example, the phrase "One hundred and forty-one children" will *fire* multiple times and output the following pattern (I inserted the '-'):

    One hundred and forty-one children - hundred and forty-one children - forty-one children - one children

    Is there an easy way to have it only fire once, on the full phrase?

    cheers,

    Jay

     
  • Frank Lin

    Frank Lin - 2011-01-18

    You can modify that countp span type so that the word before the extracted span is NOT a num:in token.
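
The effect of that suggestion can be sketched outside Mixup. The Python below is only an illustration of the logic, not Mixup syntax and not minorthird's API: the tokenizer, the `NOUNS` dictionary (a hypothetical stand-in for Jay's own noun list), and the `count_phrases` function are all assumptions. With `maximal_only=False` it reproduces the overlapping matches Jay reported; with `maximal_only=True` it skips any match whose preceding token could itself sit inside a number phrase.

```python
import re

NUMWORD = {"one", "two", "three", "four", "five", "six", "seven", "eight",
           "nine", "ten", "twenty", "thirty", "forty", "fifty", "hundred",
           "thousand"}
NUMAUX = {"-", "and"}
# Hypothetical noun dictionary standing in for Jay's own list.
NOUNS = {"patients", "infants", "volunteers", "children"}

def tokenize(text):
    """Return (token, start, end) triples; punctuation is its own token."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def is_numword(tok):
    return tok.lower() in NUMWORD

def is_num_in(tok):
    # Mirrors the num:in token property: number words plus "-" and "and".
    return is_numword(tok) or tok.lower() in NUMAUX

def count_phrases(text, maximal_only=False):
    toks = tokenize(text)
    words = [t[0] for t in toks]
    found = []
    for i, word in enumerate(words):
        if not is_numword(word):
            continue  # a number phrase must start with a number word
        if maximal_only and i > 0 and is_num_in(words[i - 1]):
            continue  # the preceding token is num:in, so this is a sub-phrase
        j = i
        while j < len(words) and is_num_in(words[j]):
            j += 1
            # The phrase must end on a number word and be followed by a noun.
            if (is_numword(words[j - 1]) and j < len(words)
                    and words[j].lower() in NOUNS):
                found.append(text[toks[i][1]:toks[j][2]])
    return found

text = "One hundred and forty-one children"
print(count_phrases(text))
print(count_phrases(text, maximal_only=True))
```

One caveat of this rule, at least in the sketch: a phrase that directly follows a stray "and" or "-" is skipped entirely, because those tokens are themselves in the auxiliary dictionary.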

     
