First off, I want to thank the creators of wikipediaminer. I did some investigating and realized that indexing and processing a Wikipedia dump with my own software, or with some of the other software packages out there, is a daunting task indeed! The plug-and-play feel of the Java API is amazing!
The problem is that the javadoc is not very easy to read. In my research I would like to access term statistics, i.e., a tf (term-frequency) vector for individual Wikipedia pages and for Wikipedia as a whole. Where should I start looking for this? Should I read up on Berkeley DB and go that route?
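To make the request concrete, this is the kind of statistic I am after; the sketch below computes a tf vector from a plain article string, since I have not found the right wikipedia-miner accessors yet (the class name `TfVector` and the whitespace tokenization are just my own illustration, not anything from the library):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a term-frequency (tf) vector built from raw article
// text, i.e. a map from each term to its number of occurrences.
public class TfVector {
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        // Naive tokenization: lowercase, then split on non-word characters.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum); // increment the count
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = termFrequencies("Wikipedia mines Wikipedia");
        System.out.println(tf.get("wikipedia")); // 2
    }
}
```

For whole-of-Wikipedia statistics I would run the same counting over every page and merge the maps, which is why I am asking whether the Berkeley DB store already holds something like this.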
I am not very interested in the high-level functions of wikipedia-miner, because I am doing research and need control over which algorithms are used.
Best regards,
Erik