[Jboost-users] high dimensional high sparse tf-idf feature space


Hi,

I've been using Jboost throughout my 2010 summer internship. I really loved it, and I actually contributed to the jboost source code (some cost-sensitive learning algorithm).

Now that I have started my new job, I want to continue the success I had with jboost on a new problem. The new problem is classification on a database with an extremely high-dimensional (around 16000 features), highly sparse (each instance generally has no more than 10 features present) feature space. The database is also huge, millions of....

The data set is generated by computing tf-idf features over a text corpus. If I treat each tf-idf term as a separate feature in the jboost input, the data file becomes prohibitively large to generate and to load (countless commas in every instance). jboost cannot even finish loading all the data into memory... I tried -Xmx4G and it still failed.

I am wondering if there is a way to do something smart like the svm-light or libsvm data format, i.e., you ONLY need to specify the features that are present, and missing features can be left out of the data file. This way the size of the data set would shrink significantly, and hopefully jboost could process it more efficiently.
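
To make the comparison concrete, here is a minimal Python sketch of the two layouts. The sparse line follows the svm-light/libsvm convention of `label index:value ...` with only non-zero features listed. The dense line is just a generic comma-separated row used for comparison, not jboost's exact input format, and the function names and example values are made up for illustration.

```python
def to_sparse_line(label, features):
    """Format one instance svm-light style: 'label idx:val idx:val ...'.

    `features` maps a 1-based feature index to its non-zero tf-idf value.
    Indices are written in increasing order, as svm-light expects.
    """
    pairs = " ".join(f"{idx}:{val:g}" for idx, val in sorted(features.items()))
    return f"{label} {pairs}" if pairs else str(label)


def to_dense_line(label, features, n_features):
    """Format the same instance as a generic dense comma-separated row.

    Assumption: one column per feature, zeros written explicitly,
    label in the last column. This is only for size comparison.
    """
    values = [f"{features.get(i, 0):g}" for i in range(1, n_features + 1)]
    return ",".join(values + [str(label)])


# Hypothetical instance: 3 non-zero tf-idf features out of 16000.
example = {12: 0.31, 4071: 0.08, 15990: 0.52}

sparse = to_sparse_line("+1", example)
dense = to_dense_line("+1", example, 16000)

print(sparse)
print("sparse length:", len(sparse), "dense length:", len(dense))
```

With about 16000 columns, every dense row carries roughly 16000 commas regardless of how few features are present, whereas the sparse row grows only with the number of non-zero features.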

Sheng