[Jboost-users] very-high-dimensional feature space?

So I'm trying to figure out whether JBoost is workable for a research
project, and I was wondering if anyone could help me out.

What I'm doing is (more-or-less) a text categorization task, where my
feature space consists of unigrams and bigrams (1-2 word terms) which
either do or do not appear in each document in the collection.  There
are around 5,000,000 different n-grams.  As you'd expect, though, the
document-by-n-gram truth matrix is extremely sparse: most documents
in the collection are only on the order of 200 words apiece.

I know boosting can work on this set; I've written my own AdaBoost
implementation which, while it requires a boatload of RAM (5GB or so),
runs quite nicely.  The results aren't great, though, which I'm sure is
due to noise.  So I wanted to try BrownBoost.  But unless I'm reading the
specs wrong, the input format for JBoost requires that each example in
the data set be given an explicit value for every feature.  Clearly,
this is infeasible for this set.  So, I was wondering if there is a way
to specify the data set by explicitly giving only the TRUE-valued
features for each example, with all other features being set to FALSE.
Or something similar.  Or perhaps it's a feature that I could work on
adding to JBoost, if it wouldn't require any major retooling (I only
have funding for a week or so of work on this).
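
To make the request concrete, here is a rough sketch (in Python, for
brevity) of the kind of sparse input I have in mind.  The line format
(label, a tab, then the TRUE-valued n-grams separated by spaces) and all
of the function names are made up for illustration; none of this is an
existing JBoost format or API.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Example = Tuple[str, FrozenSet[str]]


def parse_sparse_line(line: str) -> Example:
    """Parse a hypothetical 'label<TAB>feat1 feat2 ...' line.

    Only the TRUE-valued features are listed; everything else is
    implicitly FALSE.
    """
    label, _, rest = line.rstrip("\n").partition("\t")
    return label, frozenset(rest.split())


def feature_value(active: FrozenSet[str], feature: str) -> bool:
    """A feature is TRUE only if it was listed for the example."""
    return feature in active


def build_inverted_index(examples: Iterable[Example]) -> Dict[str, List[int]]:
    """Map each feature to the indices of the examples where it is TRUE.

    A learner choosing a split on a binary feature only needs these
    index lists, never the full dense matrix.
    """
    index: Dict[str, List[int]] = {}
    for i, (_, active) in enumerate(examples):
        for feature in active:
            index.setdefault(feature, []).append(i)
    return index
```

With documents of about 200 words, each example would list at most a few
hundred n-grams instead of carrying 5,000,000 explicit values.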

Thanks,

-Dan

______________________________

Daniel Mauer

Software Systems Engineer, Sr.

E547 Software Engineering

The MITRE Corporation

______________________________