I'm trying to use the Maxent package for text classification, but I'm confused
about how to write the training data. I use unigram features in my experiment, with their
presence as the value. The value is 1 if the feature occurs in the document and 0
if it doesn't.
I tried two different feature representations:
First, if the feature occurs in the document I write 1_featurelabel to record its presence, and if it doesn't I write 0_featurelabel to record its absence.
example --> 1_a 1_b 0_c 0_d 1_e topic1
Second, if the feature occurs in the document I write 1_featurelabel, and if it doesn't I write nothing.
example --> 1_a 1_b 1_e topic1
Which of these representations is correct?
Thanks
Hi,
The first approach is the most typical as the lack of presence is implicitly modeled as getting zero weight. The model will automatically assign the feature a value of 1 so you don't need to encode that in your features (not that it will hurt anything the way it is).
Hope this helps...Tom
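To illustrate the "zero weight" point, here is a minimal sketch of a generic log-linear (maximum entropy) score. It is not the Maxent package's own code, and the function name and the weights are made up for illustration. It only shows that a feature whose value is 0 adds nothing to any class score, so listing it or leaving it out gives the same probabilities.

```python
import math

def class_probabilities(feature_values, weights, labels):
    """Generic log-linear model: score(label) = sum of weight * feature value."""
    scores = {
        label: sum(weights.get((name, label), 0.0) * value
                   for name, value in feature_values.items())
        for label in labels
    }
    z = sum(math.exp(s) for s in scores.values())  # normalizer over all labels
    return {label: math.exp(s) / z for label, s in scores.items()}

# Hypothetical weights, for illustration only (not learned from real data).
weights = {("a", "topic1"): 1.2, ("b", "topic1"): 0.4, ("c", "topic2"): 0.9}
labels = ["topic1", "topic2"]

# Same document, once with absent features listed as 0 and once without them.
with_zeros = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 1}
present_only = {"a": 1, "b": 1, "e": 1}

probs_with_zeros = class_probabilities(with_zeros, weights, labels)
probs_present_only = class_probabilities(present_only, weights, labels)
```

Note that this treats absence as a value of 0. Writing 0_c as its own feature string is a slightly different thing, since the string itself is then a feature that is present.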
From your answer, I understand that I don't need to encode the lack of presence in my data. So the second
approach is the one I should choose. Is that right? Just to clarify.
Thanks for your help.
Correct. I think I referred to them incorrectly in my last post...Tom
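For reference, a minimal sketch of producing a training line in the second representation. It assumes the layout used in the examples in this thread (space-separated features followed by the outcome label), and to_training_line is a made-up helper name, not something from the Maxent package.

```python
def to_training_line(tokens, label):
    """Write only the unigrams that occur, each once, followed by the label."""
    unique_tokens = dict.fromkeys(tokens)  # drops repeats, keeps first-seen order
    features = [f"1_{token}" for token in unique_tokens]
    return " ".join(features + [label])

# A document containing a, b and e (with a repeated), labeled topic1.
line = to_training_line(["a", "b", "e", "a"], "topic1")
```

Unigrams that do not occur in the document are simply left out, so no 0_featurelabel entries are written.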
Hi,
Ok. I got it. It's really helpful. Thanks.