From: Aaron A. <aa...@cs...> - 2008-07-02 16:00:55
Hi Dan,

You should be able to use the "text" feature of JBoost and pass in the text itself. JBoost then has parsers for n-gram models. While there exists code for n > 1 grams, it is currently disabled. However, n == 1 grams are enabled in the CVS repository. 'ngramtype' (fixed, full, or sparse) and 'ngramsize' (1, 2, 3, ...) are mostly implemented, but not fully finished or integrated into JBoost (for unknown reasons).

I'll look into some details on quick ways to tie the code together. In the meantime, you can look at the following:

  jboost/src/jboost/examples/TextDescription.java
  jboost/src/jboost/examples/ngram/*.java
  jboost/src/jboost/examples/ngram/SparseNgram.java

You can also look at the current hack for handling text features as 1-grams:

  http://jboost.cvs.sourceforge.net/jboost/jboost/src/jboost/examples/TextDescription.java?r1=1.1&r2=1.2

You may be able to extend this hack to bigrams with little further pain/suffering. While this is certainly not an ideal solution, it may be much faster. If you get stuck during code edits, feel free to post specific questions about the code base to jboost-devel.

Also, if you have an original document with n words, you can create a document with n + (n-1) words that can be handled with n == 1 grams: in addition to the individual words, add each pair of adjacent words joined together as a single word. I didn't say that very elegantly, but I'm sure you get the idea. The result should in turn be stored fairly sparsely (let me know if it isn't).

I'll get back to you later this week with a further update on text processing options. If you don't hear back by Monday, just send a reminder email.

Aaron

On Wed, 2 Jul 2008, Mauer, Dan wrote:

> So I'm trying to figure out whether JBoost is workable for a research
> project, and I was wondering if anyone could help me out.
>
> What I'm doing is (more-or-less) a text categorization task, where my
> feature space consists of unigrams and bigrams (1-2 word terms) which
> either do or do not appear in each document in the collection. There
> are around 5,000,000 different n-grams. As you'd expect, though, the
> truth matrix of document-to-n-gram is extremely sparse - most documents
> in the collection are on the order of 200 words apiece.
>
> I know boosting can work on this set; I've written my own AdaBoost
> implementation which, while it requires a boatload of RAM (5GB or so),
> runs quite nicely. The results, though, aren't great, I'm sure due to
> noise. So I wanted to try BrownBoost. But unless I'm reading the
> specs wrong, the input format for JBoost requires that each example in
> the data set be given an explicit value for every feature. Clearly,
> this is infeasible for this set. So, I was wondering if there is a way
> to specify the data set by explicitly giving only the TRUE-valued
> features for each example, with all other features being set to FALSE.
> Or something similar. Or perhaps it's a feature that I could work on
> adding to JBoost, if it wouldn't require any major retooling (I only
> have funding for a week or so of work on this).
>
> Thanks,
>
> -Dan
>
> ______________________________
> Daniel Mauer
> Software Systems Engineer, Sr.
> E547 Software Engineering
> The MITRE Corporation
> ______________________________
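
For reference, the preprocessing workaround suggested in the reply (turning an n-word document into one with n + (n-1) tokens, so that the 1-gram text feature also sees bigrams) could be sketched roughly as below. This is a standalone illustration, not JBoost code: the class name BigramAugmenter, the whitespace tokenization, and the "_" joiner are all assumptions, and whether JBoost's text parser keeps a joined token such as "quick_brown" intact would need to be checked against TextDescription.java.

```java
// Sketch only: whitespace tokenization and the "_" joiner are assumptions,
// not JBoost behavior.
final class BigramAugmenter {

    // Hypothetical joiner; pick a string that cannot occur inside a word
    // and that the text parser will not split on.
    static final String JOINER = "_";

    /** Returns the original words followed by each adjacent pair joined into one token. */
    static String augment(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] words = trimmed.split("\\s+");
        // Start with the n original words, normalized to single spaces.
        StringBuilder out = new StringBuilder(String.join(" ", words));
        // Append the n-1 adjacent pairs, each as a single token.
        for (int i = 0; i + 1 < words.length; i++) {
            out.append(' ').append(words[i]).append(JOINER).append(words[i + 1]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 4 words in, 4 + 3 = 7 tokens out.
        System.out.println(augment("the quick brown fox"));
    }
}
```

Running each document through something like augment() before writing the data file keeps the per-document token count small (about 2n - 1 for an n-word document), so no explicit value is needed for the millions of n-grams that do not occur in it.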