Hello, this is my problem:
In my software there are 4 categories: SPORT, RELIGION, POLITIC, MOTOR
For each category I have a training set of 70 files, and a test set of 30 files.
I create 4 models, one for each category.
1)
To create the model for category X, I balance the training set with the 70 files of category X and 23 files from each of the other categories.
So I have a balanced training set with roughly 50% YES (70 files) and 50% NO (69 files from the other three categories).
Then I balance the test set with the same procedure.
2)
I select the 1000 most significant features for the YES category using the chi-square method.
3)
I build the file.dat for the training set (used to create the model) as follows:
Each row holds the feature (ngram) values of one training document:
FOR EACH FEATURE "ngram" OF THE SELECTED SET,
IF THE DOCUMENT CONTAINS THE FEATURE: ngram = 1.0
ELSE: ngram = 0.0
Then, at the end of the row, I write "YES" if the document belongs to category X, otherwise "NO".
ex.: ngram1 = 1.0 ngram2 = 0.0 ngram3 = 1.0 ……… …… ngramN = 0.0 yes
.
.
.
ngram1 = 0.0 ngram2 = 0.0 ngram3 = 0.0 ……… …… ngramN = 1.0 no
4) I create the file.test for the test set in the same way.
5)
The file to predict is:
ngram1 = 1.0 ngram2 = 0.0 ngram3 = 1.0 ……… …… ngramN = 0.0 ?
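Steps 1 to 3 above can be sketched roughly as follows. This is only an illustration under assumptions: each document is taken to be a Python set of its ngrams, all function names are hypothetical, and the chi-square score is the usual 2x2 statistic, which ranks features by strength of association without telling YES-leaning features from NO-leaning ones. The row layout simply mirrors the example in this post; it is not a claim about the input format opennlp.maxent expects.

```python
import random


def balance(docs_by_category, target, per_other=23, seed=0):
    """Return (doc, label) pairs: every target doc as YES, a sample of the rest as NO."""
    rng = random.Random(seed)
    pairs = [(doc, "YES") for doc in docs_by_category[target]]
    for category, docs in docs_by_category.items():
        if category != target:
            pairs += [(doc, "NO") for doc in rng.sample(docs, per_other)]
    return pairs


def chi_square(a, b, c, d):
    """Chi-square score of one feature from a 2x2 table of document counts.

    a: YES docs with the feature, b: NO docs with the feature,
    c: YES docs without it,       d: NO docs without it.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denominator == 0 else n * (a * d - b * c) ** 2 / denominator


def select_features(pairs, k=1000):
    """Keep the k features with the highest chi-square score (ties broken by name)."""
    yes_total = sum(1 for _, label in pairs if label == "YES")
    no_total = len(pairs) - yes_total
    yes_with, no_with = {}, {}
    for doc, label in pairs:
        counts = yes_with if label == "YES" else no_with
        for feature in doc:
            counts[feature] = counts.get(feature, 0) + 1
    scores = {}
    for feature in set(yes_with) | set(no_with):
        a = yes_with.get(feature, 0)
        b = no_with.get(feature, 0)
        scores[feature] = chi_square(a, b, yes_total - a, no_total - b)
    return sorted(scores, key=lambda f: (-scores[f], f))[:k]


def to_row(doc, features, label):
    """One line of the data file: 1.0 if the document has the feature, else 0.0."""
    values = [f"{f} = {1.0 if f in doc else 0.0}" for f in features]
    return " ".join(values + [label])
```

With 70 documents per category, `balance(docs, "SPORT")` gives 70 YES and 69 NO pairs, and `to_row(doc, features, "?")` would produce the line for a document to predict.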
But when I run opennlp.maxent with a model created from files of this type, the prediction for the document is always NO.
This is the output:
Model Diverging: loglikelihood decreased
Model Diverging: loglikelihood decreased
Model Diverging: loglikelihood decreased
RELIGION EVALUATION:
Precision 0.48979592
Recall 0.48979592
F-Measure 0.48979592
RELIGION prediction:
For context:
YES NO
MOTOR EVALUATION:
Precision 0.48979592
Recall 0.48979592
F-Measure 0.48979592
MOTOR prediction:
For context:
NO YES
SPORT EVALUATION:
Precision 0.48979592
Recall 0.48979592
F-Measure 0.48979592
SPORT prediction:
For context:
NO YES
POLITIC EVALUATION:
Precision 0.48979592
Recall 0.48979592
F-Measure 0.48979592
POLITIC prediction:
For context:
NO YES
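For comparison with the figures above, here is a minimal sketch of the standard per-label definitions of precision, recall and F-measure. It is not necessarily how the evaluation in this post arrives at its numbers; the function name and the choice of "YES" as the positive label are assumptions made for illustration.

```python
def precision_recall_f1(gold, predicted, positive="YES"):
    """Precision, recall and F-measure for one label, from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against empty denominators, e.g. when the label is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under these definitions, a classifier that always answers NO gets 0.0 on all three measures for the YES label.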
What am I doing wrong?
Can anyone help me?
Thanks.
Hi, the project has moved to Apache. Please repost your question on the user mailing list;
see our new website for details on how to subscribe to the mailing list:
incubator.apache.org/opennlp
Thanks,
Jörn