Hi,

I originally posted this to the RM forums, but Sebastian invited me to join the R extension SIG and so I’m resending it here.

I'm interested in using RapidMiner to find optimal hyperparameter values for an R model.  In particular, I'd like to use EvolutionaryOptimization to do so.  But I've run into several issues I can't quite figure out myself.

I've got a simple test case that demonstrates what I want to do.  An R script builds a model using the "penalized" function from the R package "penalized", and takes a parameter lambda2 that controls how severe a penalty is applied.  The goal of the process is to optimize the value of lambda2.  I use 10-fold cross-validation to estimate the generalization performance for each penalty factor tried.  The example works with grid optimization, selecting 100 as the best parameter on the list.  But I can't get it to run using evolutionary parameter optimization, primarily because I can't seem to construct and pass a numeric parameter into the R script.
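In plain R terms, the logic of the process is roughly the following. This is only a sketch on made-up data: ridge_fit() is a hypothetical stand-in that applies a closed-form L2 penalty, not the penalized() call from the attached process (which also sets lambda1=10), so the numbers it produces say nothing about the real test case.

```r
# Sketch of the process logic: 10-fold CV over a grid of lambda2 values.
# Assumption: synthetic data and a closed-form ridge fit stand in for
# Generate Data and for penalized().
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("att", 1:5)))
y <- as.numeric(X %*% c(1, -2, 0.5, 0, 3) + rnorm(n))

# Hypothetical helper: ridge coefficients with an unpenalized intercept.
ridge_fit <- function(X, y, lambda2) {
  Xc <- cbind(1, X)                        # add intercept column
  P <- diag(c(0, rep(lambda2, ncol(X))))   # penalty matrix, 0 for intercept
  solve(crossprod(Xc) + P, crossprod(Xc, y))
}

grid <- c(10, 100, 1000)                     # same list as the grid operator
folds <- sample(rep(1:10, length.out = n))   # shuffled fold assignment

cv_rmse <- sapply(grid, function(lambda2) {
  fold_rmse <- sapply(1:10, function(k) {
    test <- folds == k
    beta <- ridge_fit(X[!test, , drop = FALSE], y[!test], lambda2)
    pred <- cbind(1, X[test, , drop = FALSE]) %*% beta
    sqrt(mean((y[test] - pred)^2))           # RMSE on the held-out fold
  })
  mean(fold_rmse)                            # average over the 10 folds
})
names(cv_rmse) <- grid

best_lambda2 <- grid[which.min(cv_rmse)]
print(cv_rmse)
```

What I'm after is having RM's evolutionary optimizer propose the lambda2 values instead of looping over a fixed grid like this.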

Questions:

1) How can I specify a numeric parameter to be used inside the R code?  The grid optimization uses a list of values to set the value of a macro definition "lambda2" inside the validation.  I can then use the macro inside the R code to vary the penalty.  But if I try to replace the grid optimization with evolutionary optimization, I am not permitted to specify a range, because the macro value could be a string rather than a numeric.  I couldn't see another way to pass a parameter value into R code other than the macro approach.

2) In cross-validating, the R script "Build Training Model" returns an R object, not a model, so I couldn't directly connect the port to pass to the testing side.  I got around this by storing the R object in the repository, and retrieving it on the testing side.  This seems awkward, but I couldn't figure out how to pass an R object around otherwise.  Then in order to get RM to accept the process, I had to connect the R object to the model port on the training side, even though it complains that they aren't compatible objects.  Is there a better way to do this?

3) There doesn't appear to be a way within a process to delete an object from the repository.  I'm temporarily storing an R object in the repository during cross-validation, and wanted to remove it when completed, but the only two operators are Store and Retrieve.  If I could solve 2) without using the repository, this concern would go away for now, although I can see the functionality being pretty important.  Did I miss something obvious? [Note: I've since learned to use Remember and Recall instead of Store and Retrieve. :-)]

4) Because RM doesn't seem to know about applying R models, I manually constructed a performance vector from the testing label and the R-generated predictions within another R script to calculate performance.  Seems to work, although RM complains about metadata being unspecified when I connect the constructed example set to the label port on the Performance operator.  Not a big deal, but thought it worth mentioning in case there's a cleaner way to do this.

5) The "results.label <- column_name" trick for setting roles on an R data frame when it is converted back to an RM data table worked for label, but not for prediction, which is why the "Change role" operator is in the process.
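On question 1, my understanding is that the macro is pasted into the script as text before R ever runs, so on the R side I could at least coerce it explicitly. The sketch below imitates that in plain R; the sub() call is only my stand-in for RM's macro expansion, and the quoted as.numeric() form is an idea I'd like to try, not something that is in the attached process (which uses lambda2=%{lambda2} directly).

```r
# Hypothetical stand-in for RapidMiner's macro expansion: the literal text
# %{lambda2} in the script is replaced by the macro's value before R runs.
script_template <- 'lambda2 <- as.numeric("%{lambda2}")'
macro_value <- "100"   # assumption: macro values arrive as plain text
script <- sub("%{lambda2}", macro_value, script_template, fixed = TRUE)

# After substitution R only sees: lambda2 <- as.numeric("100")
eval(parse(text = script))

# Fail early if the macro held something that is not a number.
if (is.na(lambda2)) stop("lambda2 macro is not numeric")
print(lambda2)
```

This handles the R side only. It doesn't help with the actual problem, which is that the evolutionary operator won't let me give a range for the macro value in the first place.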

Note that you'll need R package "penalized" to be installed in order for this test case to work.

Any suggestions would be welcomed.  I want to use RM to do a lot of this kind of parameter tuning, since I find similar capabilities in R somewhat lacking.  Thanks for any help.

Keith

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.001" expanded="true" name="Process">
    <process expanded="true" height="295" width="681">
      <operator activated="true" class="generate_data" compatibility="5.1.001" expanded="true" height="60" name="Generate Data" width="90" x="112" y="165">
        <parameter key="target_function" value="non linear"/>
      </operator>
      <operator activated="true" class="optimize_parameters_grid" compatibility="5.1.001" expanded="true" height="94" name="Optimize Parameters (Grid)" width="90" x="380" y="120">
        <list key="parameters">
          <parameter key="Set lambda2.value" value="10,100,1000"/>
        </list>
        <process expanded="true" height="313" width="1005">
          <operator activated="true" class="x_validation" compatibility="5.1.001" expanded="true" height="112" name="Validation" width="90" x="447" y="38">
            <parameter key="sampling_type" value="shuffled sampling"/>
            <process expanded="true" height="313" width="477">
              <operator activated="true" class="set_macro" compatibility="5.1.001" expanded="true" height="76" name="Set lambda2" width="90" x="45" y="30">
                <parameter key="macro" value="lambda2"/>
                <parameter key="value" value="1000"/>
              </operator>
              <operator activated="true" class="r:execute_script_r" compatibility="5.1.000" expanded="true" height="76" name="Build Training Model" width="90" x="180" y="30">
                <parameter key="script" value="library(penalized)&#10;library(e1071)&#10;&#10;print(paste(&quot;lambda2 is:&quot;,%{lambda2}))&#10;&#10;mod.penalized &lt;- penalized(&#10;&#9;&#9;&#9; label ~ att1 + att2 + att3 + att4 + att5&#10;&#9;&#9;&#9;, data=my.data&#10;&#9;&#9;&#9;, standardize=TRUE&#10;&#9;&#9;&#9;, lambda1=10&#10;&#9;&#9;&#9;, lambda2=%{lambda2}&#10;&#9;&#9;&#9;)&#10;&#10;"/>
                <enumeration key="inputs">
                  <parameter key="name_of_variable" value="my.data"/>
                </enumeration>
                <list key="results">
                  <parameter key="mod.penalized" value="Generic R Result"/>
                </list>
              </operator>
              <operator activated="true" class="store" compatibility="5.1.001" expanded="true" height="60" name="Store" width="90" x="328" y="30">
                <parameter key="repository_entry" value="PenalizedModel_temp"/>
              </operator>
              <connect from_port="training" to_op="Set lambda2" to_port="through 1"/>
              <connect from_op="Set lambda2" from_port="through 1" to_op="Build Training Model" to_port="input 1"/>
              <connect from_op="Build Training Model" from_port="output 1" to_op="Store" to_port="input"/>
              <connect from_op="Store" from_port="through" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true" height="313" width="496">
              <operator activated="true" class="retrieve" compatibility="5.1.001" expanded="true" height="60" name="Retrieve" width="90" x="45" y="30">
                <parameter key="repository_entry" value="PenalizedModel_temp"/>
              </operator>
              <operator activated="true" class="r:execute_script_r" compatibility="5.1.000" expanded="true" height="112" name="Evaluate vs test data" width="90" x="180" y="30">
                <parameter key="script" value="results &lt;- cbind.data.frame(&#10;&#9;&#9;&#9;  actual    = my.data$label&#10;&#9;&#9;&#9;, predicted = predict(model, data=my.data)[,1]&#10;&#9;&#9;)&#10;results.prediction &lt;- &quot;predicted&quot;&#10;results.label      &lt;- &quot;actual&quot;&#10;"/>
                <enumeration key="inputs">
                  <parameter key="name_of_variable" value="my.data"/>
                  <parameter key="name_of_variable" value="model"/>
                  <parameter key="name_of_variable" value="ignore_me"/>
                </enumeration>
                <list key="results">
                  <parameter key="results" value="Data Table"/>
                </list>
              </operator>
              <operator activated="true" class="set_role" compatibility="5.1.001" expanded="true" height="76" name="Change role of prediction to prediction" width="90" x="315" y="30">
                <parameter key="name" value="predicted"/>
                <parameter key="target_role" value="prediction"/>
                <list key="set_additional_roles"/>
              </operator>
              <operator activated="true" class="performance_regression" compatibility="5.1.001" expanded="true" height="76" name="Performance" width="90" x="396" y="30"/>
              <connect from_port="model" to_op="Evaluate vs test data" to_port="input 3"/>
              <connect from_port="test set" to_op="Evaluate vs test data" to_port="input 1"/>
              <connect from_op="Retrieve" from_port="output" to_op="Evaluate vs test data" to_port="input 2"/>
              <connect from_op="Evaluate vs test data" from_port="output 1" to_op="Change role of prediction to prediction" to_port="example set input"/>
              <connect from_op="Change role of prediction to prediction" from_port="example set output" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data" from_port="output" to_op="Optimize Parameters (Grid)" to_port="input 1"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="performance" to_port="result 2"/>
      <connect from_op="Optimize Parameters (Grid)" from_port="parameter" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>