> I find that ensemble methods are causing my computer to max out on memory. Is there a workaround?

Sorry for writing a book in response to a simple question, but...
The "waffles_transform samplerows" tool can be used to randomly sub-sample a dataset. I recommend that you start by doing cross-validation over an exponentially increasing range of data sizes, and plotting the accuracy on a logarithmic scale. This will tell you whether using so much data is really giving you returns, or whether it is just wasting your time.
If you are using an ensemble of trees, you might consider limiting the depth of those trees. This often improves generalization accuracy as well as limiting memory usage. The GDecisionTree class supports two ways to do this: GDecisionTree::setLeafThresh and GDecisionTree::setMaxLevels. I prefer the former method, but it's probably better to let the data choose by testing them both.
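A minimal sketch of what that might look like when building a bagging ensemble follows. The two setters are the ones named above; the constructor signatures, `GBag::addLearner`, `GMatrix::loadArff`, and `train(features, labels)` are assumptions about the API, so check GDecisionTree.h and GEnsemble.h for the exact signatures in your version.

```cpp
// Sketch: cap tree growth before training a bagging ensemble of decision trees.
// setLeafThresh/setMaxLevels are the two limits discussed above; the other
// calls are assumptions about the surrounding API.
#include <GClasses/GDecisionTree.h>
#include <GClasses/GEnsemble.h>
#include <GClasses/GMatrix.h>
using namespace GClasses;

int main()
{
	GMatrix features, labels;
	features.loadArff("features.arff");	// placeholder file names
	labels.loadArff("labels.arff");

	GBag ensemble;
	for(size_t i = 0; i < 30; i++)
	{
		GDecisionTree* pTree = new GDecisionTree();
		// Either of these limits tree growth (and therefore memory).
		// Try both and let cross-validation pick the winner.
		pTree->setLeafThresh(16);	// stop splitting nodes with fewer than 16 samples
		//pTree->setMaxLevels(12);	// ...or cap the depth instead
		ensemble.addLearner(pTree);	// the bag takes ownership of the tree
	}
	ensemble.train(features, labels);
	return 0;
}
```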
Assuming that using all of your data and fully building out your trees really is the right thing to do, and that getting more memory is not an option, I suppose the right answer is to swap out to disk. Since virtual memory is supposed to do that automatically, and all major operating systems implement virtual memory (some better than others), I am not very confident that you will gain much by doing it manually. For this reason, I have not yet implemented any disk-swapping features.
However, if you really want to implement disk-swapping, Waffles provides all the pieces you might need. To do this, you will have to hack the code in GEnsemble.cpp. All of my learning algorithms implement a "serialize" method that marshals the model into a JSON DOM, and a JSON DOM can be written to and read from a file. To load the model again, each learner also implements a constructor that unmarshals from a JSON DOM. (Alternatively, to write more general code, you can use the GLearnerLoader class, which dynamically determines the type being loaded.) If you try this, and it turns out to actually be faster than just using virtual memory, please let me know so I can work it into my code.
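A minimal sketch of that swap-to-disk idea, assuming the serialize/GLearnerLoader route described above: marshal a trained learner into a JSON DOM, save it to a file, free the in-memory model, and reload it later. The `GDom::saveJson`/`loadJson` calls and the `loadLearner` signature reflect my reading of the API, but treat them as assumptions and check GDom.h and GLearner.h for your version.

```cpp
// Sketch: swap a trained learner out to disk and back in again.
// The exact Waffles signatures used here are assumptions.
#include <GClasses/GDom.h>
#include <GClasses/GLearner.h>
using namespace GClasses;

// Write a trained learner to disk, then release it from memory.
void swapOut(GSupervisedLearner* pModel, const char* filename)
{
	GDom doc;
	doc.setRoot(pModel->serialize(&doc));	// marshal the model into a JSON DOM
	doc.saveJson(filename);			// persist the DOM as a JSON file
	delete(pModel);				// free the in-memory model
}

// Load the learner back when it is needed again.
GSupervisedLearner* swapIn(const char* filename)
{
	GDom doc;
	doc.loadJson(filename);
	GLearnerLoader ll;			// determines the concrete type dynamically
	return ll.loadLearner(doc.root());
}
```

In GEnsemble.cpp you would then call something like swapOut on each model as soon as it finishes training, and swapIn right before it is needed for prediction.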
On the theory side of things, several researchers have reported that deep neural networks trained to mimic large ensembles perform just as well as the large ensembles. It therefore seems plausible that one could alternate: train an ensemble on sub-sampled data, use it to refine a deep net for a while, and repeat until all of the data has been presented several times. However, more research is still needed to determine whether the ensemble even plays an important role in such a training process. Another theoretical possibility is to use this approach to replace just groves of trees within the ensemble, rather than the whole thing. In the lab at my alma mater, someone was working on a model called a "Decision DAG", which compressed a grove of trees into a much smaller graph. I didn't really follow how that turned out.