MinorThird / Discussion / Help: More mixup documentation

mass1tot - 2011-01-15

Hello,

I am in the process of learning mixup and looking for all of the resources I possibly can find. In addition to the resources included with the minorthird dist, I have downloaded the lectures and sample code from William's website. So essentially I am learning mixup mostly by staring at a bunch of example code (which is fine, mostly). Still, large questions remain in my mind such as:

What is the difference between 'defTokenProp' and 'defSpanProp'? Why use one over the other? Is this a good (and responsive) forum for asking these types of questions?

Thanks

Jay

Frank Lin - 2011-01-15

This is a good forum (as responsive as it's gonna get, I think) for asking questions about m3rd or mixup. The difference between the two is related to the difference between a token and a span in m3rd. You can think of tokens as sequences of characters as determined by the tokenizer; these are typically words. For example, the sentence "I ate Suzie's pizza." can be tokenized into 6 tokens:

"I" "ate" "Suzie" "'s" "pizza" "."

A span is defined as a sequence of tokens. It can be as long as an entire document (a document span) or as short as a single token. For example, in the sentence above, "Suzie's pizza" can be a noun phrase span, and the whole sentence could be a span too.
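To make the token/span distinction concrete, here is a minimal Python sketch (not Mixup and not the MinorThird API). The regex tokenizer and the half-open (start, end) index pairs are assumptions chosen only to reproduce the six tokens listed above; the real m3rd tokenizer may split text differently.

```python
import re


def tokenize(text):
    # Assumed tokenizer: the clitic "'s", runs of word characters,
    # and single punctuation marks each become one token.
    return re.findall(r"'s|\w+|[^\w\s]", text)


def span_tokens(tokens, span):
    # A span is modeled as a half-open (start, end) pair of token indices.
    start, end = span
    return tokens[start:end]


tokens = tokenize("I ate Suzie's pizza.")

noun_phrase = (2, 5)            # "Suzie" "'s" "pizza"
sentence = (0, len(tokens))     # the whole sentence is also a span
single_token = (1, 2)           # a span can be as short as one token

print(tokens)
print(span_tokens(tokens, noun_phrase))
```

Seen this way, a token is one element of the list, while a span is any contiguous slice of it.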

mass1tot - 2011-01-16

Thanks for your quick response. I understand the difference between tokens and spans (I have some background in text-mining and have been successfully using minorthird for a few months for probabilistic text extraction tasks). I am actually trying to use mixup for post-processing extracted spans from the CRFAnnotator. I could write some custom Java code to accomplish my task, but it appears mixup should be able to do what I need and I suspect I would find quite a bit of other utility from mixup as well.

Here are a few different morphologies I am trying to capture in my current problem:

"Three hundred and ten patients"
"66 premature infants"
"839 healthy adult volunteers"
"Three hundred twenty-nine patients"

Ideally, my mixup output would be phrases such as these. Note that I've got a dictionary of "patients", "volunteers", "infants", etc. in my mixup. I also have a dictionary of number words in my mixup (e.g. one, five, twelve, hundred, thousand, etc.). I am trying to figure out how to capture "Three hundred twenty-nine", or "342" (I realize this specific part is done with a regex, i.e. re('^\\d+$')), in relation to the (patients|infants|volunteers|etc.) part. Also note that I am extracting these snippets from blobs of text that tend to be around 20 tokens or so.

So back to my original question, I am trying to understand how I should be using defSpanProp and defTokenProp to solve my problem. Trial-and-error experimentation hasn't yielded a path to solving the problem yet.

Thanks for your help!

Jay

Hi Jay,

There are many ways to specify mixup patterns to extract certain kinds of noun phrases; below is a basic example on which you can build your mixup extractor:

// ==========================================================
// A toy example of extracting "count phrases" using Mixup;
// for example, "twenty-two tutus" or "365 days".
//
// Frank Lin
// ==========================================================
// define a dictionary of "number words" - not exhaustive, for demonstration
defDict numword = one, two, three, four, five, six, seven, eight, nine, ten, twenty, thirty, forty, fifty, hundred, thousand;
// a dictionary of "auxiliary number words"
defDict numaux = - , and;
// specify a token property marking which words can be "inside" a number phrase
defTokenProp num:in =: ... [a(numword)] ... || ... [a(numaux)] ...;
// define a number phrase as a series of tokens that either a) is a single number word
// or b) starts and ends with a number word, sandwiching zero or more tokens that can be
// inside a number phrase
defSpanType number =: ... [a(numword)] ... || ... [a(numword) num:in* a(numword)] ...;
// pattern for arabic numerals
defSpanType numeral =: ... [re('[0-9]+')] ...;
// pattern for plural noun words
defSpanType plural =: ... [re('.+s$')] ...;
// finally, pattern for count phrases
defSpanType countp =: ... [@number @plural] ... || ... [@numeral @plural] ...;

This can be used to extract count phrases like the ones below:

I have thirty-five dogs, seven chickens, and three thousand three hundred and fifty wives.
I bet 5000 chips on these 3 horses.

mass1tot - 2011-01-17

Thanks a lot, Frank! This is very helpful. I've modified the above code to suit my needs a bit better, but it is working well. One remaining question (and I think this is linked to the defSpanType tag) is that the code will fire multiple times on the same span. I noticed this behavior when I was messing around with Mixup before I solicited your help. For example, the phrase "One hundred and forty-one children" will *fire* multiple times and output the following pattern (I inserted the '-'):

One hundred and forty-one children - hundred and forty-one children - forty-one children - one children

Is there an easy way to have it only fire once, on the full phrase?

cheers,

Jay

Frank Lin - 2011-01-18

You can modify that countp span type so that the word before the extracted span is NOT a num:in token.
