Re: [Jboost-users] very-high-dimensional feature space?

I'm not sure if JBoost can squeeze everything into 8GB of RAM.  Try it out
on 20% of the dataset and see if it fits in 2GB.  Java has a fixed overhead
of about 200-400MB and JBoost adds less than 50MB, so only the remaining
~1.5GB should grow with the data: if 20% fits in 2GB, 100% should come to
roughly 8GB, which will be tight but may fit.  However, depending on your
use of Perl, you may have had much more overhead than 450MB.
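
To make that arithmetic concrete, here is a rough sketch in Python.  It is
not part of JBoost; the function name is made up for illustration, and it
assumes memory is a fixed overhead plus an amount that grows linearly with
the fraction of the data loaded, which may not hold exactly in practice.

```python
def projected_peak_mb(sample_peak_mb, sample_percent, fixed_overhead_mb):
    """Extrapolate full-dataset peak memory from a run on a sample.

    Assumption (not measured): memory = fixed overhead + data-dependent part,
    and the data-dependent part scales linearly with the sample size.
    """
    data_mb = sample_peak_mb - fixed_overhead_mb
    return fixed_overhead_mb + data_mb * 100 / sample_percent


# 20% sample peaking at 2000MB, with the low and high overhead estimates
# quoted above (about 250MB and 450MB including JBoost itself).
for overhead_mb in (250, 450):
    full_mb = projected_peak_mb(2000, 20, overhead_mb)
    print(f"overhead {overhead_mb}MB -> projected full run {full_mb:.0f}MB")
```

Under these assumptions, a 20% run that peaks at exactly 2GB projects to a
little over 8GB for the full set, so the further the test run stays under
2GB, the better.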

Give it a shot and let me know if it doesn't fit.

You'll also want to edit the file 'jboost', which is just a shell script,
so that Java gives JBoost a larger maximum heap (the -Xmx value).

exec java -Xmx3000M jboost.controller.Controller  $@

exec java -Xmx7500M jboost.controller.Controller  $@

Aaron

On Wed, 2 Jul 2008, Mauer, Dan wrote:

> Well, I've already built an efficient AdaBoost implementation in Perl,
> and as part of that effort I created a full "n-gram presence" table,
> where each n-gram has an integer ID, and each document is represented
> as a list of the integers which appear in that original document.  So I
> can easily treat the whole thing as unigrams.  My main concern is that
> the system handles the data storage very efficiently, ideally holding
> the entirety of the data in RAM (I'm using a 64-bit machine with 8GB of
> RAM, so there's room for it)... would you expect JBoost to handle this
> well?
>
> Thanks!
> -d
>
> -----Original Message-----
> From: Aaron Arvey [mailto:aa...@cs...]
> Sent: Wednesday, July 02, 2008 12:01 PM
> To: Mauer, Dan
> Cc: jbo...@li...
> Subject: Re: [Jboost-users] very-high-dimensional feature space?
>
> Hi Dan,
>
> You should be able to use the "text" feature of JBoost and pass in the
> text itself.  JBoost then has parsers for n-gram models.  While there
> exists code for n > 1 grams, it is currently disabled.  However, n==1
> grams are enabled in the CVS repository.  'ngramtype' (fixed, full, or
> sparse) and 'ngramsize' (1,2,3...) are mostly implemented, but not fully
> finished or integrated into JBoost (for unknown reasons).
>
> I'll look into some details on quick ways to tie the code together.  In
> the meantime, you can look at the following:
> jboost/src/jboost/examples/TextDescription.java
> jboost/src/jboost/examples/ngram/*.java
> jboost/src/jboost/examples/ngram/SparseNgram.java
>
> You can also look at the current hack for handling text features as
> 1-grams:
> http://jboost.cvs.sourceforge.net/jboost/jboost/src/jboost/examples/TextDescription.java?r1=1.1&r2=1.2
>
> You may be able to extend this hack to bigrams with little further
> pain/suffering.  While this is certainly not an ideal solution, it may
> be much faster.  If you get stuck during code edits, feel free to post
> specific questions about the code base to jboost-devel.
>
> Also, if you have an original document with n words, you can create a
> document with n + (n-1) words that could use n==1 grams.  Just put every
> 2 adjacent words together as one word, in addition to their one-word
> parts.  I didn't say that very elegantly, but I'm sure you get the idea.
> This should in turn be stored fairly sparsely (let me know if it isn't).
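
A minimal sketch of the n + (n-1) trick described above, in Python.  This
is not JBoost code: the function name add_bigrams and the "_" joiner are
made up for illustration, and it assumes the words are already split on
whitespace and that the joiner character does not occur inside any word.

```python
def add_bigrams(words, joiner="_"):
    """Return the original words plus one joined token per adjacent pair.

    A document of n words becomes n + (n - 1) tokens, so a unigram-only
    parser sees each bigram as if it were a single word.
    """
    bigrams = [joiner.join(pair) for pair in zip(words, words[1:])]
    return list(words) + bigrams


doc = "the quick brown fox".split()
tokens = add_bigrams(doc)
print(" ".join(tokens))
```

The joined tokens are appended after the original words rather than
interleaved with them; for presence/absence features the order should not
matter, but that depends on how the text is parsed downstream.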
>
> I'll get back to you later this week with a further update on text
> processing options.  If you don't hear back by Monday, just send a
> reminder email.
>
> Aaron
>
> On Wed, 2 Jul 2008, Mauer, Dan wrote:
>
>> So I'm trying to figure out whether JBoost is workable for a research
>> project, and I was wondering if anyone could help me out.
>>
>> What I'm doing is (more-or-less) a text categorization task, where my
>> feature space consists of unigrams and bigrams (1-2 word terms) which
>> either do or do not appear in each document in the collection.  There
>> are around 5,000,000 different n-grams.  As you'd expect, though, the
>> truth matrix of document-to-n-gram is extremely sparse - most documents
>> in the collection are on the order of 200 words apiece.
>>
>> I know boosting can work on this set; I've written my own AdaBoost
>> implementation which, while it requires a boatload of RAM (5GB or so),
>> runs quite nicely.  The results, though, aren't great, I'm sure due to
>> noise.  So I wanted to try BrownBoost.  But unless I'm reading the
>> specs wrong, the input format for JBoost requires that each example in
>> the data set be given an explicit value for every feature.  Clearly,
>> this is infeasible for this set.  So, I was wondering if there is a way
>> to specify the data set by explicitly giving only the TRUE-valued
>> features for each example, with all other features being set to FALSE.
>> Or something similar.  Or perhaps it's a feature that I could work on
>> adding to JBoost, if it wouldn't require any major retooling (I only
>> have funding for a week or so of work on this).
>>
>> Thanks,
>>
>> -Dan
>>
>> ______________________________
>> Daniel Mauer
>> Software Systems Engineer, Sr.
>> E547 Software Engineering
>> The MITRE Corporation
>> ______________________________