jboost-users Mailing List for JBoost


Hi,

I used JBoost throughout my 2010 summer internship. I really loved it, and I actually contributed to the jboost source code (a cost-sensitive learning algorithm).

Now that I have started my new job, I want to continue the success of jboost on a new problem. The problem is classification on a database with an extremely high-dimensional (around 16000 features), highly sparse feature space (every instance generally has no more than 10 features present). The volume of the database is also huge, millions of.... 

The data set is generated by computing tf-idf features over a text corpus. If I treat each tf-idf term as a separate feature in the jboost input, the data file becomes prohibitively large to generate and to load (countless commas in every instance). jboost cannot even finish loading all the data into memory... I tried -Xmx4G and it still failed.

I am wondering if there is a way to do something smart like the svm-light or libsvm data format, i.e., you ONLY need to specify the features that are present, and the missing features can be left out of the data file. That way the data set would shrink significantly, and hopefully jboost could process it more efficiently.
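
To make the idea concrete, here is a minimal Python sketch of the kind of sparse line I have in mind, in the svm-light / libsvm style of "label index:value ...". The function names and the toy instance are made up for illustration, the dense layout is only a rough stand-in for a comma-separated line (not the exact jboost data syntax), and I am not assuming jboost can read the sparse form today.

```python
def to_sparse_line(label, values):
    """Format one instance as 'label index:value ...', skipping zeros.

    Indices are 1-based, following the svm-light / libsvm convention.
    """
    pairs = [f"{i}:{v:g}" for i, v in enumerate(values, start=1) if v != 0]
    return " ".join([str(label)] + pairs)


def to_dense_line(label, values):
    """Hypothetical dense layout: every feature written out, comma-separated.

    Used here only to compare line lengths; it is an assumed layout,
    not the exact jboost input format.
    """
    return ",".join(f"{v:g}" for v in values) + "," + str(label)


# Toy instance: 16000 features, only 3 of them non-zero.
values = [0.0] * 16000
values[4] = 0.31
values[120] = 1.5
values[15999] = 0.07

sparse = to_sparse_line("+1", values)
dense = to_dense_line("+1", values)

print(sparse)
print("dense chars:", len(dense), "sparse chars:", len(sparse))
```

With around 10 non-zero features out of 16000, the sparse line only grows with the number of features actually present, not with the dimensionality.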

Sheng



jboost-users — Users can ask questions and get answers from other users and developers