From: Mauer, D. <dm...@mi...> - 2008-07-02 13:07:54
I'm trying to figure out whether JBoost is workable for a research project, and I was wondering if anyone could help me out.

What I'm doing is (more or less) a text categorization task, where my feature space consists of unigrams and bigrams (1-2 word terms), each of which either does or does not appear in each document in the collection. There are around 5,000,000 distinct n-grams. As you'd expect, though, the document-by-n-gram truth matrix is extremely sparse, since most documents in the collection are on the order of 200 words apiece.

I know boosting can work on this set; I've written my own AdaBoost implementation which, while it requires a boatload of RAM (5GB or so), runs quite nicely. The results aren't great, though, and I'm sure that's due to noise. So I wanted to try BrownBoost.

But unless I'm reading the specs wrong, the input format for JBoost requires that each example in the data set be given an explicit value for every feature. Clearly, this is infeasible for this set.

So I was wondering if there is a way to specify the data set by explicitly giving only the TRUE-valued features for each example, with all other features implicitly set to FALSE, or something similar. Or perhaps it's a feature that I could work on adding to JBoost, if it wouldn't require any major retooling (I only have funding for a week or so of work on this).

Thanks,
-Dan

______________________________
Daniel Mauer
Software Systems Engineer, Sr.
E547 Software Engineering
The MITRE Corporation
______________________________
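
To make the request concrete, here is a minimal sketch of the "list only the TRUE-valued features" idea, in Python. It is not JBoost's input format or API, and it is not the AdaBoost implementation mentioned above. The function names (`build_inverted_index`, `stump_weighted_error`), the toy documents, and the weights are all hypothetical. Each document is stored as the set of n-grams it contains, and every n-gram not listed is treated as FALSE.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each n-gram to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, present in enumerate(docs):
        for ngram in present:
            index[ngram].add(doc_id)
    return index

def stump_weighted_error(index, ngram, labels, weights):
    """Weighted error of the stump: predict +1 if ngram is present, else -1.

    Only the documents that actually contain the n-gram are visited;
    every other document is implicitly FALSE for this feature.
    """
    # Error of predicting -1 everywhere: total weight of the positive examples.
    err = sum(w for y, w in zip(labels, weights) if y == +1)
    # Documents containing the n-gram flip to a +1 prediction.
    for d in index.get(ngram, ()):
        if labels[d] == +1:
            err -= weights[d]  # previously wrong, now correct
        else:
            err += weights[d]  # previously correct, now wrong
    return err

# Toy data (hypothetical): each document lists only its TRUE features.
docs = [
    {"boosting", "boosting works"},
    {"sparse", "sparse data"},
    {"boosting", "sparse"},
]
labels = [+1, -1, +1]
weights = [0.25, 0.25, 0.5]

index = build_inverted_index(docs)
```

With this layout, memory grows with the number of (document, n-gram) pairs that are actually present, roughly a few hundred per document here, rather than with the full 5,000,000-column matrix. In practice the baseline sum would be computed once per boosting round rather than once per feature.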