[Text-similarity-developers] Re: tagging and compoundifying

Hi Jason,

This is really interesting. I hadn't thought of this before, but
I see exactly what you are referring to.

I'd suggest the following - my experience has been that compounds
are very often nouns (not always, but more often than they are
verbs). So if we can't do any better, I'd suggest assuming that
a compound is a noun.
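
As a rough sketch of that fallback (in Python, purely for illustration):
the COMPOUND_POS table below is a made-up stand-in for a WordNet lookup,
and compoundify() is a hypothetical helper, not anything in our code.
It joins adjacent words that form a known compound, keeps the POS when
the compound has only one, and otherwise assumes noun.

```python
# Hypothetical mini-lexicon: compound -> set of WordNet-style POS letters.
# A real system would ask WordNet; these entries are only for illustration.
COMPOUND_POS = {
    "machine_gun": {"n", "v"},
    "goose_step": {"n", "v"},
    "ice_cream": {"n"},
}


def compoundify(tagged, lexicon=COMPOUND_POS, default_pos="n"):
    """Join adjacent (word, tag) pairs that form a known compound.

    `tagged` is a list of (word, penn_tag) tuples. Compounds come out as
    "word_word#pos"; everything else is passed through as "word/TAG".
    """
    out = []
    i = 0
    while i < len(tagged):
        if i + 1 < len(tagged):
            candidate = tagged[i][0].lower() + "_" + tagged[i + 1][0].lower()
            pos_set = lexicon.get(candidate)
            if pos_set:
                # One POS: use it. Several: fall back to the default (noun).
                pos = next(iter(pos_set)) if len(pos_set) == 1 else default_pos
                out.append(f"{candidate}#{pos}")
                i += 2
                continue
        word, tag = tagged[i]
        out.append(f"{word}/{tag}")
        i += 1
    return out
```

For example, compoundify([("goose", "NN"), ("step", "NN")]) would give
["goose_step#n"] under this assumption, even though goose_step can also
be a verb.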

Actually, now that I think of it - I wonder if there are many compounds
that are both nouns and verbs (at least those known to WordNet)? I would
doubt it. Would that be a useful fact?
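
One way to check would be to list every compound that carries both
parts of speech. A minimal sketch in Python, assuming we already have a
mapping from each word to its set of WordNet POS letters (SAMPLE_LEXICON
here is invented for illustration, and noun_verb_compounds() is a
hypothetical name):

```python
# Hypothetical sample data: word -> set of WordNet-style POS letters.
# In practice this mapping would be built from the WordNet index.
SAMPLE_LEXICON = {
    "machine_gun": {"n", "v"},
    "goose_step": {"n", "v"},
    "ice_cream": {"n"},
    "step": {"n", "v"},
}


def noun_verb_compounds(lexicon):
    """Return the compounds (words containing "_") that are both noun and verb."""
    return sorted(
        word
        for word, pos in lexicon.items()
        if "_" in word and {"n", "v"} <= pos
    )
```

If that list turns out to be short for the full WordNet, the noun
default would rarely be wrong.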

I'll think about this some more...

Thanks!
Ted

On Fri, 12 Nov 2004, Jason Michelizzi wrote:

> I've come across a slight difficulty in working with compoundifying
> and converting POS tags from the Penn Treebank format to WN format.
> If we do compoundification on tagged words, it seems that we have to
> discard the POS tags.  The problem is that there are compound words
> that belong to more than one part of speech, such as machine_gun and
> goose_step (both of them can be either nouns or verbs).
>
> So if we came across text such as "goose/NN step/NN" or "machine/NN
> gun/NN", we could only turn that into "goose_step" and "machine_gun",
> but not "goose_step#n" or "machine_gun#v".  (The fact that step is
> tagged as a noun isn't much help: the Brill tagger always seemed to
> tag it as a noun in the few experiments I tried, except when I used
> "stepped" or "stepping" instead.)
>
> Jason
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse