
Apply isomap transformation to other datasets

  • JavR

    JavR - 2010-09-13

    Hi there,

    I have an interesting problem I can't figure out how to tackle with waffles:

    I have a dataset containing 54000 unclassified features.

    I have another training dataset containing 13000 features classified into
    contaminants vs non-contaminants.

    I am interested in using an isomap on the training dataset and applying the
    resulting transformation to the unknown dataset.

    Would you happen to know if this is a possibility?


  • JavR

    JavR - 2010-09-13

    Right, I forgot to mention: I am doing this on Linux.


  • Nobody/Anonymous

    Let me restate, to see if I understand the problem. Is this right?

    You have some data which contains 67,000 rows by 'n' columns. There is also an
    additional label associated with 13,000 of the rows. The remaining 54,000 have
    an unknown label that you wish to predict. 'n' is very large, such that if you
    try to use a decision tree, a neural network, or an SVM to solve this problem,
    it will take a very long time to train. You wish to use Isomap to reduce 'n'
    to a much smaller value, then you will train a decision tree (or some other
    supervised model) to predict the label. If this is correct, then you probably
    want to do something like this:

    1- Remove the label column from both data sets. (I will assume 'n' is 2000)

    waffles_transform dropcolumn train.arff 2000 > train_no_labels.arff

    waffles_transform dropcolumn test.arff 2000 > test_no_labels.arff

    2- Merge the two datasets to form one dataset with 67,000 rows:

    waffles_transform mergevert train_no_labels.arff test_no_labels.arff >

    3- Reduce 'n' to 12 with Isomap (using 14 neighbors):

    waffles_transform isomap big.arff kdtree 14 12 > reduced.arff

    4- Separate the reduced data back into a training and test set

    waffles_transform split reduced.arff 13000 a.arff b.arff

    5- Add the label columns to the reduced datasets:

    waffles_transform mergehoriz a.arff labels_a.arff > new_train.arff

    waffles_transform mergehoriz b.arff labels_b.arff > new_test.arff

    6- Predict labels for the new test data

    waffles_learn transduce new_train.arff new_test.arff decisiontree
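
    The row bookkeeping behind steps 2, 4 and 5 can be sketched in Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the data is a toy example, and the "reduction" is a stand-in that keeps the first column, not Isomap. The point is that the first 13,000 rows after the merge are still the training rows after the reduction, so splitting at that index and re-attaching the labels keeps everything aligned.

```python
def split_rows(rows, n_first):
    """Split a list of rows into the first n_first rows and the rest."""
    return rows[:n_first], rows[n_first:]


def merge_horiz(features, labels):
    """Append one label to each feature row; both lists must have equal length."""
    if len(features) != len(labels):
        raise ValueError("row counts differ: %d vs %d" % (len(features), len(labels)))
    return [row + [label] for row, label in zip(features, labels)]


# Toy stand-ins for the label-free training and test sets (assumed data).
train_x = [[0.0, 1.0], [0.1, 0.9]]
test_x = [[5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
train_y = ["contaminant", "non-contaminant"]

merged = train_x + test_x                    # step 2: stack the rows vertically
reduced = [row[:1] for row in merged]        # placeholder for step 3, NOT Isomap
a, b = split_rows(reduced, len(train_x))     # step 4: split at the training row count
new_train = merge_horiz(a, train_y)          # step 5: re-attach the known labels
```

    Here `a` pairs with the training labels and `b` holds the reduced rows whose labels are to be predicted.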

    Let me know if any of these steps are confusing. In my experience, reducing
    the dimensionality of the feature columns rarely improves predictive accuracy
    (except with k-nn, which is highly vulnerable to problems with irrelevant
    features), but it often significantly improves training time. Also, the
    computational complexity of Isomap is poorly suited for datasets with more
    than about 4000 rows. (It may take days or weeks to do 67000 rows.) You might
    try sub-sampling the rows so you can get some results faster. Or, you might
    try "breadthfirstunfolding" instead. It is an NLDR algorithm that scales
    linearly with the number of rows. (It doesn't always handle noise very well,
    though.) LLE would also be a good choice, but unfortunately my implementation
    doesn't utilize sparse matrices yet.
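
    A minimal Python sketch of the sub-sampling suggestion above, not a waffles command. The function names are hypothetical, and the memory estimate assumes a dense matrix of 8-byte pairwise distances, which may not match how waffles stores them internally.

```python
import random


def subsample_rows(rows, k, seed=0):
    """Pick k rows uniformly at random without replacement.

    Returns the chosen original indices (sorted) and the matching rows,
    so results can be traced back to the full dataset.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    indices = sorted(rng.sample(range(len(rows)), k))
    return indices, [rows[i] for i in indices]


def dense_matrix_gib(n_rows, bytes_per_entry=8):
    """Size in GiB of an n_rows x n_rows matrix (assumed dense storage)."""
    return n_rows * n_rows * bytes_per_entry / 2**30


rows = [[float(i), float(i % 7)] for i in range(67000)]  # toy stand-in data
indices, sample = subsample_rows(rows, 4000)
print(len(sample), "rows kept")
print("full:   %.1f GiB" % dense_matrix_gib(len(rows)))
print("sample: %.2f GiB" % dense_matrix_gib(len(sample)))
```

    Under that dense-storage assumption, the pairwise distances for all 67,000 rows need tens of gigabytes, while a 4,000-row sample needs well under one.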

  • JavR

    JavR - 2010-09-14

    Interesting... OK, well, I'm going to try this anyway. I'll let you know how it
    works out.

    Just a question though:

    You mentioned: "In my experience, reducing the dimensionality of the feature
    columns rarely improves predictive accuracy but it often significantly
    improves training time."

    I have a feeling there are aspects of my dataset that are non-linear in
    nature. A PCA was not able to separate the data into two distinct groups,
    although it almost worked. Is there no chance that an isomap will help me
    improve the separation?


  • Nobody/Anonymous

    I think I spoke too strongly. I ran a bunch of tests with some variations of
    this method on about 60 datasets from the UCI repository and found little
    improvement. It is always possible that your implementation will be better
    than mine, and I think it is quite likely that many interesting problems are
    not represented well by the UCI datasets. NLDR seems to be particularly
    helpful with images. I'm pretty sure that the best score yet obtained with the
    MNIST handwritten digits dataset was obtained using some sort of
    semi-supervised dimensionality reduction. If it turns out that a lossy
    transformation, like NLDR, does have beneficial effects in some cases, and
    those cases can be identified programmatically, then it seems that this should
    be built into our learning algorithms.

    Another issue you might encounter with Isomap is that it expects the
    neighborhoods to be sufficiently large that it forms a connected graph. If
    your data is separated into clusters, it might take large neighborhoods to
    ensure there are no disconnected clusters.
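
    The connectivity concern can be checked before running Isomap. The sketch below is plain Python, not part of waffles: it builds a brute-force k-nearest-neighbor graph (quadratic in the number of rows, so only suitable for a sub-sample) and counts its connected components. The function names and the toy points are assumptions for illustration.

```python
import math
from collections import deque


def knn_graph(points, k):
    """Brute-force k-nearest-neighbor graph, stored as undirected adjacency sets."""
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i].add(j)
            adj[j].add(i)  # treat each neighbor link as an undirected edge
    return adj


def count_components(adj):
    """Count connected components with a breadth-first search."""
    seen = [False] * len(adj)
    components = 0
    for start in range(len(adj)):
        if seen[start]:
            continue
        components += 1
        seen[start] = True
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    queue.append(v)
    return components


# Two well-separated toy clusters of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(count_components(knn_graph(points, 2)))  # 2: neighbors stay inside each cluster
print(count_components(knn_graph(points, 3)))  # 1: the third neighbor bridges the gap
```

    If the count is greater than one for the chosen neighborhood size, the graph is disconnected, and a larger neighborhood (or per-cluster processing) would be needed.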

    I think you might be interested in algorithms designed specifically for
    transduction, such as agglomerativetransducer and graphcuttransducer. These
    algorithms are particularly well-suited for separating classes that lie on
    manifolds.


