From: Roger P. Menezes <rogerm@ya...> - 2006-09-18 07:15:23
We are working on a page segmentation task. To be more specific, it
deals with resume segmentation. The idea is to segment a given resume
into different sections like PERSONAL_DETAILS, EXPERIENCE, EDUCATION,
SKILLS, etc. We are using a naive model in which each label has a single
state in the CRF. The labels correspond to our sections (experience,
education, skills, etc.), and each line in the document is an observation.
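
To make the setup concrete, here is a rough sketch in Python of how a resume becomes a sequence of observations with word features. This is not our actual code: the function names, the tokenizer and the sample resume are made up for illustration, and the CRF training itself is left out.

```python
import re

# One CRF state per label (section); the list here is only a sample.
LABELS = ["PERSONAL_DETAILS", "EXPERIENCE", "EDUCATION", "SKILLS"]

def line_features(line):
    """Return binary word features for one line (one observation)."""
    # Hypothetical tokenizer: lowercase runs of letters, digits, '+' and '#'.
    words = re.findall(r"[a-z0-9+#]+", line.lower())
    return {"word=" + w for w in words}

def document_to_sequence(lines):
    """Turn a resume into a sequence of per-line feature sets, skipping blank lines."""
    return [line_features(line) for line in lines if line.strip()]

# Made-up resume, one string per line.
resume = [
    "John Doe, john@example.com",
    "Experience",
    "Worked at Acme Corp as a software engineer",
    "Education",
    "B.Sc. in Computer Science",
]
sequence = document_to_sequence(resume)
```

Each element of `sequence` would then be paired with one of the labels during training.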
We trained the CRF for this task and tried analysing the weights learned
for the WordFeatures. Everything works fine, except that certain highly
indicative features (words, in this case) have large negative weights such
as -5 or -3. Words that occur 90-95% of the time within a specific LABEL
have negative weights for that very label. We don't know why this happens.
In fact, it helps to neutralize these weights (reduce them to
insignificant values) before running our tests.
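
The neutralizing step amounts to something like the sketch below. It assumes the learned weights have been exported into a plain dictionary keyed by (word, label); the `neutralize` helper, the `scale` parameter and the weight values are hypothetical and only illustrate the idea.

```python
def neutralize(weights, suspects, scale=0.0):
    """Return a copy of `weights` with the suspect negative weights shrunk.

    weights:  dict mapping (word, label) -> learned weight
    suspects: iterable of (word, label) pairs to neutralize
    scale:    0.0 zeroes the weight; a small value such as 0.1 only shrinks it
    """
    adjusted = dict(weights)  # leave the trained weights untouched
    for key in suspects:
        if key in adjusted and adjusted[key] < 0:
            adjusted[key] = adjusted[key] * scale
    return adjusted

# Made-up weights for illustration only.
weights = {
    ("university", "EDUCATION"): -5.0,
    ("university", "SKILLS"): -0.4,
    ("java", "SKILLS"): 2.1,
}
adjusted = neutralize(weights, [("university", "EDUCATION")])
```

Only the listed pairs are changed, so the rest of the model stays as trained.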
I'm unable to identify any pattern in which words end up with such
negative weights. Not all indicative (frequent) words suffer from this.
Has anybody seen this kind of thing happening? I will provide more
details as and when I start identifying things.
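
One way to look for a pattern is to list every (word, label) pair where the word is concentrated in the label but the weight is negative, together with its counts. The sketch below assumes the training lines are available as (words, label) pairs and the weights as a dictionary keyed by (word, label); the helper names, thresholds and data are all made up for illustration.

```python
from collections import Counter, defaultdict

def label_concentration(labeled_lines):
    """For each word, count the number of lines of each label it occurs in."""
    counts = defaultdict(Counter)
    for words, label in labeled_lines:
        for w in set(words):  # count a word once per line
            counts[w][label] += 1
    return counts

def suspicious_pairs(counts, weights, min_share=0.9, min_count=1):
    """List (word, label, share, total, weight) rows where the word is
    concentrated in the label but its learned weight is negative."""
    rows = []
    for w, per_label in counts.items():
        total = sum(per_label.values())
        if total < min_count:
            continue
        for label, n in per_label.items():
            share = n / total
            wt = weights.get((w, label))
            if wt is not None and wt < 0 and share >= min_share:
                rows.append((w, label, share, total, wt))
    return sorted(rows, key=lambda r: r[4])  # most negative weight first

# Made-up training lines and weights.
labeled_lines = [
    (["university", "of", "mumbai"], "EDUCATION"),
    (["b", "sc", "university"], "EDUCATION"),
    (["java", "python"], "SKILLS"),
]
weights = {
    ("university", "EDUCATION"): -5.0,
    ("of", "EDUCATION"): -0.2,
    ("java", "SKILLS"): 2.1,
}
rows = suspicious_pairs(label_concentration(labeled_lines), weights)
```

Sorting the rows by total count or by share instead of by weight might show whether the affected words are, say, the rare ones or the ones shared by neighbouring sections.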