Hi there,
I have an interesting problem I can't figure out how to tackle with waffles:
I have a dataset containing 54,000 unclassified instances.
I have another training dataset containing 13,000 instances classified into
contaminants vs. non-contaminants.
I am interested in running Isomap on the training dataset and applying the
resulting transformation to the unknown dataset.
Would you happen to know if this is possible?
JavR
Right, I forgot to mention: I am doing this on Linux.
JavR
Let me restate, to see if I understand the problem. Is this right?
You have some data which contains 67,000 rows by 'n' columns. There is also an
additional label associated with 13,000 of the rows. The remaining 54,000 have
an unknown label that you wish to predict. 'n' is very large, such that if you
try to use a decision tree, a neural network, or an SVM to solve this problem,
it will take a very long time to train. You wish to use Isomap to reduce 'n'
to a much smaller value, and then train a decision tree (or some other
supervised model) to predict the label. If this is correct, then you probably
want to do something like this:
1- Remove the label column from both data sets. (I will assume 'n' is 2000)
waffles_transform dropcolumn train.arff 2000 > train_no_labels.arff
waffles_transform dropcolumn test.arff 2000 > test_no_labels.arff
2- Merge the two datasets to form one dataset with 67,000 rows:
waffles_transform mergevert train_no_labels.arff test_no_labels.arff > big.arff
3- Reduce 'n' to 12 with Isomap (using 14 neighbors):
waffles_transform isomap big.arff kdtree 14 12 > reduced.arff
4- Separate the reduced data back into a training and test set
waffles_transform split reduced.arff 13000 a.arff b.arff
5- Add the label columns back to the reduced datasets (labels_a.arff and
labels_b.arff hold just the label column from the original training and test files):
waffles_transform mergehoriz a.arff labels_a.arff > new_train.arff
waffles_transform mergehoriz b.arff labels_b.arff > new_test.arff
6- Predict labels for the new test data
waffles_learn transduce new_train.arff new_test.arff decisiontree
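If it helps to prototype the same pipeline outside the waffles command-line tools, here is a rough sketch of steps 1-6 in Python with pandas and scikit-learn. This is only an illustration under my own assumptions (CSV files named train.csv and test.csv, label in the last column of the training file); it is not the waffles workflow itself, though it keeps the same 14 neighbors and 12 output dimensions:

# Illustrative Python sketch of steps 1-6 (scikit-learn, not waffles).
# Assumes train.csv holds 13,000 labeled rows with the label in the last
# column, and test.csv holds 54,000 unlabeled rows with the same features.
import numpy as np
import pandas as pd
from sklearn.manifold import Isomap
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# 1- Separate the label column from the feature columns
y_train = train.iloc[:, -1]
X_train = train.iloc[:, :-1].to_numpy()
X_test = test.to_numpy()

# 2- Merge the two feature sets so they are embedded in the same space
X_all = np.vstack([X_train, X_test])

# 3- Reduce to 12 dimensions with Isomap using 14 neighbors
# (note: this will be very slow on 67,000 rows; see the discussion below)
reduced = Isomap(n_neighbors=14, n_components=12).fit_transform(X_all)

# 4- Split the reduced data back into training and test portions
r_train, r_test = reduced[:len(X_train)], reduced[len(X_train):]

# 5/6- Train a supervised model on the reduced features and predict labels
model = DecisionTreeClassifier().fit(r_train, y_train)
predictions = model.predict(r_test)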
Let me know if any of these steps are confusing. In my experience, reducing
the dimensionality of the feature columns rarely improves predictive accuracy
(except with k-nn, which is highly vulnerable to problems with irrelevant
features), but it often significantly improves training time. Also, Isomap's
computational complexity makes it poorly suited for datasets with more than
about 4,000 rows. (It may take days or weeks to process 67,000 rows.) You might
try sub-sampling the rows so you can get some results faster. Or, you might
try "breadthfirstunfolding" instead. It is an NLDR algorithm that scales
linearly with the number of rows. (It doesn't always handle noise very well,
though.) LLE would also be a good choice, but unfortunately my implementation
doesn't utilize sparse matrices yet.
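To make the sub-sampling suggestion concrete, here is one way it could look in Python with scikit-learn (my tooling assumption, not a waffles command): fit Isomap on a random few thousand rows, then use its out-of-sample transform to map the remaining rows into the learned embedding. The 4,000-row budget just echoes the rough limit mentioned above, and big_features.npy is a hypothetical file holding all 67,000 label-free rows:

# Sub-sampling sketch (scikit-learn, not waffles): fit the embedding on a
# random subset, then project every row into it.
import numpy as np
from sklearn.manifold import Isomap

X_all = np.load("big_features.npy")            # hypothetical: 67,000 rows, no labels
rng = np.random.default_rng(0)
subset = rng.choice(len(X_all), size=4000, replace=False)

iso = Isomap(n_neighbors=14, n_components=12).fit(X_all[subset])
reduced = iso.transform(X_all)                 # out-of-sample extension to all rows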
Interesting... OK, well, I'm going to try this anyway. I'll let you know how it
works out.
Just a question though:
You mentioned: "In my experience, reducing the dimensionality of the feature
columns rarely improves predictive accuracy but it often significantly
improves training time."
I have a feeling there are aspects of my dataset that are non-linear in
nature. PCA was not able to separate the data into two distinct groups,
although it almost worked. Is there no chance that Isomap will help me
improve the separation?
JavR
I think I spoke too strongly. I ran a bunch of tests with some variations of
this method on about 60 datasets from the UCI repository and found little
improvement. It is always possible that your implementation will be better
than mine, and I think it is quite likely that many interesting problems are
not represented well by the UCI datasets. NLDR seems to be particularly
helpful with images. I'm pretty sure that the best score yet obtained on the
MNIST handwritten-digits dataset was achieved using some sort of semi-supervised
dimensionality reduction. If it turns out that a lossy
transformation, like NLDR, does have beneficial effects in some cases, and
those cases can be identified programmatically, then it seems that this should
be built into our learning algorithms.
Another issue you might encounter with Isomap is that it expects the
neighborhoods to be large enough that they form a connected graph. If
your data is separated into clusters, it might take large neighborhoods to
ensure there are no disconnected clusters.
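One cheap way to test for this before committing to a long Isomap run is to build the k-nearest-neighbor graph and count its connected components. Here is a sketch using scikit-learn and scipy (my tooling choice, not part of waffles; big_features.npy is the same hypothetical file as above):

# Count connected components in the 14-neighbor graph; if there is more than
# one, Isomap's geodesic distances are undefined between the pieces, and a
# larger neighborhood (or separate runs per cluster) may be needed.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

X = np.load("big_features.npy")                # hypothetical feature matrix
graph = kneighbors_graph(X, n_neighbors=14, include_self=False)
n_components, labels = connected_components(graph, directed=False)
print("connected components in the 14-NN graph:", n_components)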
I think you might be interested in algorithms designed specifically for
transduction (http://en.wikipedia.org/wiki/Transduction_%28machine_learning%29),
such as agglomerativetransducer and graphcuttransducer. These algorithms are
particularly well-suited for separating classes that lie on manifolds.
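For readers who want to experiment outside waffles, scikit-learn's LabelSpreading is a graph-based semi-supervised method in a similar transductive spirit (it is not the same algorithm as either transducer above): it propagates the 13,000 known labels over a neighborhood graph to the unlabeled rows. A sketch with hypothetical file names:

# Graph-based transductive labeling with LabelSpreading (scikit-learn, not
# the waffles transducers). Unlabeled rows are marked with -1.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X_all = np.load("big_features.npy")            # hypothetical: all 67,000 rows
y_all = np.load("labels_with_unknowns.npy")    # hypothetical: -1 for the 54,000 unknown rows

model = LabelSpreading(kernel="knn", n_neighbors=14)
model.fit(X_all, y_all)

predicted = model.transduction_[y_all == -1]   # labels inferred for the unknown rows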