From: Aaron A. <aa...@cs...> - 2008-07-02 17:02:26

I'm not sure whether JBoost can squeeze everything into 8GB of RAM. Try it
out on 20% of the dataset and see if it fits in 2GB. Java has an overhead
of about 200-400MB and JBoost has an overhead of less than 50MB, and that
overhead is a fixed cost that does not grow with the data. So if 20% fits
in 2GB, you should be able to squeeze 100% into 8GB. However, depending on
your use of Perl, you may have had much more overhead than 450MB. Give it
a shot and let me know if it doesn't fit.

You'll also want to edit the file 'jboost', which is just a shell script,
so that JBoost is given more memory by Java. That is, change the line

    exec java -Xmx3000M jboost.controller.Controller $@

to

    exec java -Xmx7500M jboost.controller.Controller $@

Aaron

On Wed, 2 Jul 2008, Mauer, Dan wrote:

> Well, I've already built an efficient AdaBoost implementation in Perl,
> and as part of that effort I created a full "n-gram presence" table,
> where each n-gram has an integer ID, and each document is represented
> as a list of the integers which appear in that original document. So I
> can easily treat the whole thing as unigrams. My main concern is that
> the system handles the data storage very efficiently, and ideally
> holding the entirety of the data in RAM (I'm using a 64-bit machine
> with 8GB ram, so there's room for it)... would you expect Jboost would
> handle this well?
>
> Thanks!
> -d
>
> -----Original Message-----
> From: Aaron Arvey [mailto:aa...@cs...]
> Sent: Wednesday, July 02, 2008 12:01 PM
> To: Mauer, Dan
> Cc: jbo...@li...
> Subject: Re: [Jboost-users] very-high-dimensional feature space?
>
> Hi Dan,
>
> You should be able to use the "text" feature of JBoost and pass in the
> text itself. JBoost then has parsers for n-gram models. While there
> exists code for n > 1 grams, it is currently disabled. However, n==1
> grams are enabled in the CVS repository. 'ngramtype' (fixed, full, or
> sparse) and 'ngramsize' (1,2,3...) are mostly implemented, but not
> fully finished or integrated into JBoost (for unknown reasons).
>
> I'll look into some details on quick ways to tie the code together. In
> the mean time, you can look at the following:
> jboost/src/jboost/examples/TextDescription.java
> jboost/src/jboost/examples/ngram/*.java
> jboost/src/jboost/examples/ngram/SparseNgram.java
>
> You can also look at the current hack for handling text features as
> 1-grams:
> http://jboost.cvs.sourceforge.net/jboost/jboost/src/jboost/examples/TextDescription.java?r1=1.1&r2=1.2
>
> You may be able to extend this hack to bigrams with little further
> pain/suffering. While this is certainly not an ideal solution, it may
> be much faster. If you get stuck during code edits, feel free to post
> specific questions about the code base to jboost-devel.
>
> Also, if you have an original document with n words, you can create a
> document with n + (n-1) words that could use n==1 grams. Just put all
> 2 words together as one word in addition to their one word parts. I
> didn't say that very elegantly, but I'm sure you get the idea. This
> should in turn be stored fairly sparsely (let me know if it isn't).
>
> I'll get back to you later this week with a further update on text
> processing options. If you don't hear back by Monday, just send a
> reminder email.
>
> Aaron
>
> On Wed, 2 Jul 2008, Mauer, Dan wrote:
>
>> So I'm trying to figure out whether JBoost is workable for a research
>> project, and I was wondering if anyone could help me out.
>>
>> What I'm doing is (more-or-less) a text categorization task, where my
>> feature space consists of unigrams and bigrams (1-2 word terms) which
>> either do or do not appear in each document in the collection. There
>> are around 5,000,000 different n-grams. As you'd expect, though, the
>> truth matrix of document-to-n-gram is extremely sparse - most
>> documents in the collection are on the order of 200 words apiece.
>>
>> I know boosting can work on this set; I've written my own AdaBoost
>> implementation which, while it requires a boatload of RAM (5GB or
>> so), runs quite nicely. The results, though, aren't great, I'm sure
>> due to noise. So I wanted to try BrownBoost. But unless I'm reading
>> the specs wrong, the input format for JBoost requires that each
>> example in the data set be given an explicit value for every feature.
>> Clearly, this is infeasible for this set. So, I was wondering if
>> there is a way to specify the data set by explicitly giving only the
>> TRUE-valued features for each example, with all other features being
>> set to FALSE. Or something similar. Or perhaps it's a feature that I
>> could work on adding to JBoost, if it wouldn't require any major
>> retooling (I only have funding for a week or so of work on this).
>>
>> Thanks,
>>
>> -Dan
>>
>> ______________________________
>> Daniel Mauer
>> Software Systems Engineer, Sr.
>> E547 Software Engineering
>> The MITRE Corporation
>> ______________________________
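
For reference, the trick suggested in the quoted reply (turning a document
of n words into n + (n-1) tokens so that a unigram-only text parser also
sees the bigrams) might look roughly like the Python sketch below. The
function name add_bigram_tokens and the underscore joiner are assumptions
made for illustration; neither comes from JBoost, and plain whitespace
tokenization is assumed.

```python
def add_bigram_tokens(text, joiner="_"):
    """Return the words of `text` followed by each adjacent pair fused into one token.

    Hypothetical helper, not part of JBoost. A document with n words yields
    n + (n - 1) tokens, so a unigram-only parser also sees the bigrams.
    """
    words = text.split()  # assumes simple whitespace tokenization
    # Pair each word with its successor; zip stops at the shorter sequence,
    # so n words give n - 1 pairs (and 0 or 1 words give none).
    bigrams = [first + joiner + second for first, second in zip(words, words[1:])]
    return " ".join(words + bigrams)


if __name__ == "__main__":
    print(add_bigram_tokens("the quick brown fox"))
    # prints: the quick brown fox the_quick quick_brown brown_fox
```

The joiner should be a character that cannot occur inside a word;
otherwise a fused bigram could collide with a genuine unigram.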