Re: [Jboost-users] very-high-dimensional feature space?


Hi Dan,

You should be able to use the "text" feature of JBoost and pass in the 
text itself.  JBoost has parsers for n-gram models.  Code for n-grams 
with n > 1 exists but is currently disabled; unigrams (n == 1) are 
enabled in the CVS repository.  'ngramtype' (fixed, full, or sparse) 
and 'ngramsize' (1, 2, 3, ...) are mostly implemented, but not yet 
finished or integrated into JBoost (for unknown reasons).

I'll look into quick ways to tie the code together.  In the meantime, 
you can look at the following:
jboost/src/jboost/examples/TextDescription.java
jboost/src/jboost/examples/ngram/*.java
jboost/src/jboost/examples/ngram/SparseNgram.java

You can also look at the current hack for handling text features as 
1-grams:
http://jboost.cvs.sourceforge.net/jboost/jboost/src/jboost/examples/TextDescription.java?r1=1.1&r2=1.2

You may be able to extend this hack to bigrams with little further 
pain/suffering.  While this is certainly not an ideal solution, it may be 
the much faster route.  If you get stuck during code edits, feel free to 
post specific questions about the code base to jboost-devel.

Also, if you have an original document with n words, you can create a 
document with n + (n-1) tokens that works with n==1 grams.  Keep each of 
the n words, and in addition join every pair of adjacent words into a 
single token (n-1 of them), so each bigram looks like one word.  This 
should in turn be stored fairly sparsely (let me know if it isn't).
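
A minimal sketch of that preprocessing step, in case it helps.  This is 
not JBoost code: the class name BigramAugmenter, the method augment, and 
the "_" joiner are all made up for illustration, and it assumes the text 
feature splits on whitespace and that "_" does not already occur inside 
your words (pick another joiner if it does).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of JBoost: rewrites a document so that
// every adjacent word pair also appears as a single "word".
final class BigramAugmenter {

    // Assumed separator; must not occur inside ordinary words,
    // otherwise a joined bigram could collide with a real unigram.
    static final String JOINER = "_";

    // Returns the n original words followed by the n-1 joined bigrams,
    // separated by single spaces (n + (n-1) tokens in total).
    static String augment(String document) {
        String trimmed = document.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] words = trimmed.split("\\s+");
        List<String> tokens = new ArrayList<>(2 * words.length);
        // The unigrams, in their original order.
        for (String word : words) {
            tokens.add(word);
        }
        // One joined token per pair of adjacent words.
        for (int i = 0; i + 1 < words.length; i++) {
            tokens.add(words[i] + JOINER + words[i + 1]);
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // prints: the quick brown fox the_quick quick_brown brown_fox
        System.out.println(augment("the quick brown fox"));
    }
}
```

You would run each document through something like this before writing 
the data file, then treat the result as ordinary 1-gram text.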

I'll get back to you later this week with a further update on text 
processing options.  If you don't hear back by Monday, just send a 
reminder email.

Aaron

On Wed, 2 Jul 2008, Mauer, Dan wrote:

> So I'm trying to figure out whether JBoost is workable for a research
> project, and I was wondering if anyone could help me out.
>
> What I'm doing is (more-or-less) a text categorization task, where my
> feature space consists of unigrams and bigrams (1-2 word terms) which
> either do or do not appear in each document in the collection.  There
> are around 5,000,000 different n-grams.  As you'd expect, though, the
> truth matrix of document-to-n-gram is extremely sparse - most documents
> in the collection are on the order of 200 words apiece.
>
> I know boosting can work on this set; I've written my own AdaBoost
> implementation which, while it requires a boatload of RAM (5GB or so),
> runs quite nicely.  The results, though, aren't great, I'm sure due to
> noise.  So I wanted to try BrownBoost.  But unless I'm reading the
> specs wrong, the input format for JBoost requires that each example in
> the data set be given an explicit value for every feature.  Clearly,
> this is infeasible for this set.  So, I was wondering if there is a way
> to specify the data set by explicitly giving only the TRUE-valued
> features for each example, with all other features being set to FALSE.
> Or something similar.  Or perhaps it's a feature that I could work on
> adding to JBoost, if it wouldn't require any major retooling (I only
> have funding for a week or so of work on this).
>
> Thanks,
>
> -Dan
>
>
>
> ______________________________
>
> Daniel Mauer
>
> Software Systems Engineer, Sr.
>
> E547 Software Engineering
>
> The MITRE Corporation
>
> ______________________________
>
>