Recent changes to DMAB

DMAB modified by Alberto Tonda

Alberto Tonda — Mon, 08 Jun 2015 14:42:13 -0000

--- v3
+++ v4
@@ -1,9 +1,11 @@
 Choosing operators using DMAB
 -----------------------------

+µGP exploits a Dynamic Multi-Armed Bandit logic to self-adapt the probability of activating a certain genetic operator at each generation.
+
 The general goal of the Multi Armed Bandit problem is to get the highest reward out of a cash machine with multiple arms, each of which has its own unknown reward probability distribution. Good MAB algorithms find a right balance between trying each arm to discover their probability distribution (at the risk of activating multiple times an arm that gives no benefit) and exploiting the arm with the highest perceived reward probability (at the risk of missing another arm with a higher reward). The problem becomes dynamic when the reward probabilities of the arms vary over time.

-In the context of Evolutionary Computation, MAB assimilates each genetic operator to an arm of the bandit. The goal of the MAB algorithm in MicroGP is to apply the operators that yield the best increase of fitness among the generated offspring. We use the dynamic (DMAB) variant because operators can be more or less relevant at different stages of the evolution.  The DMAB implementation in MicroGP is inspired by the 2008 paper from DaCosta et al.  "Adaptive Operator Selection with Dynamic Multi Armed Bandit" (1), with a few modifications to accommodate with MicroGP's generation flow.
+In the context of Evolutionary Computation, MAB assimilates each genetic operator to an arm of the bandit. The goal of the MAB algorithm in MicroGP is to apply the operators that yield the best increase of fitness among the generated offspring. We use the dynamic (DMAB) variant because operators can be more or less relevant at different stages of the evolution.  The DMAB implementation in MicroGP is inspired by the 2008 paper from DaCosta et al.  "Adaptive Operator Selection with Dynamic Multi Armed Bandit" (1), with a few modifications to accommodate with MicroGP's generation flow (2).

 ### Main steps of the DMAB algorithm

@@ -67,4 +69,5 @@

 ### References

-1.  DaCosta, Luis, et al. "Adaptive operator selection with dynamic multi-armed bandits." Proceedings of the 10th annual conference on Genetic and evolutionary computation. ACM, 2008. 
+1.  DaCosta, Luis, et al. "Adaptive operator selection with dynamic multi-armed bandits." Proceedings of the 10th annual conference on Genetic and evolutionary computation (GECCO). ACM, 2008. 
+2.  Belluz J., Gaudesi M., Squillero G., Tonda A., "Operator Selection using Improved Dynamic Multi-Armed Bandit." Proceedings of the 17th annual conference on Genetic and evolutionary computation (GECCO). ACM, 2015.

DMAB modified by Jany Belluz

Jany Belluz — Thu, 27 Nov 2014 15:33:54 -0000

--- v2
+++ v3
@@ -61,11 +61,9 @@

 The other problem of the DMAB implementation is that we have to select an operator several times in a row without being able to properly update the indicators between each selection. Indeed, MicroGP can compute the reward of an operator only when the children have been evaluated, at the end of the generation. During one generation MicroGP can apply any arbitrary number of operators, any number of times. 

-Without any modification we would be applying only one operator per generation, the one with the best MAB score from the last indicator update. This situation would not allow any exploration during the generation, only exploitation. To prevent that, we count the number of non-rewarded-yet executions of the operator, and we add it to the UCB1 formula in order to lower its score, so eventually another operator can take over during the generation and be given a chance. Here is the modified formula:
+Without any modification we would be applying only one operator per generation, the one with the best MAB score from the last indicator update. This situation would not allow any exploration during the generation, only exploitation. To prevent that, we use a Russian roulette selection based on the MAB scores instead of taking always the best. This randomized selection still favors the best operator and keeps exploring the others. However, its performance against standard DMAB has not yet been measured precisely. 

-`(a * p_i + b) + sqrt(2 * log(sum of n_k for all k)) / (n_i + nNonRewardedYet_i))`
-
-We apply the same trick when there are operators that have never been used yet.  In order to choose each of them in turn, we affect a very high score (1000) to such an unused operator, minus its number of non rewarded executions.
+Finally, in order to call at least once every operator, we compute a special score for operators that have never been used yet. We affect a very high score to such an unused operator, so that its chances of being selected are overwhelming compared to those of the other operators. After it has been selected, we reduce its "high score" value, which allows the Russian Roulette to switch to the next unused operator.

 ### References

DMAB modified by Jany Belluz

Jany Belluz — Sat, 11 Oct 2014 08:55:13 -0000

--- v1
+++ v2
@@ -1,4 +1,3 @@
-
 Choosing operators using DMAB
 -----------------------------

@@ -22,10 +21,9 @@
 3.  Choose the next arm to use according to the Upper Confidence Bound algorithm.
     * If some arm has not been used yet (`n_i == 0`), choose it.
     * If all arms have already been used, choose the arm `i` that maximizes the following quantity (UCB1 formula):
+        `(a * p_i + b) + sqrt(2 * log(sum of n_k for all k)) / n_i)`

-      `(a * p_i + b) + sqrt(2 * log(sum of n_k for all k)) / n_i)`
-
-      where `(a * p_i + b)`, the scaled mean reward, favors exploitation of good arms, and the second term favors exploration of arms that have not been used since some time.
+        where `(a * p_i + b)`, the scaled mean reward, favors exploitation of good arms, and the second term favors exploration of arms that have not been used since some time.

 4.  Use the chosen arm, measure the obtained reward `r`, and update the arm's variables as follows:
     * `p_i := (n_i * p_i + r) / (n_i + 1)`

DMAB modified by Giovanni Squillero

Giovanni Squillero — Fri, 10 Oct 2014 14:09:30 -0000

Choosing operators using DMAB

The general goal of the Multi Armed Bandit problem is to get the highest reward out of a cash machine with multiple arms, each of which has its own unknown reward probability distribution. Good MAB algorithms find a right balance between trying each arm to discover their probability distribution (at the risk of activating multiple times an arm that gives no benefit) and exploiting the arm with the highest perceived reward probability (at the risk of missing another arm with a higher reward). The problem becomes dynamic when the reward probabilities of the arms vary over time.

In the context of Evolutionary Computation, MAB assimilates each genetic operator to an arm of the bandit. The goal of the MAB algorithm in MicroGP is to apply the operators that yield the best increase of fitness among the generated offspring. We use the dynamic (DMAB) variant because operators can be more or less relevant at different stages of the evolution. The DMAB implementation in MicroGP is inspired by the 2008 paper from DaCosta et al. "Adaptive Operator Selection with Dynamic Multi Armed Bandit" (1), with a few modifications to accommodate with MicroGP's generation flow.

Main steps of the DMAB algorithm

Initialize the following variables for each arm i:
- p_i := 0 (mean observed reward)
- n_i := 0 (number of uses of this arm)
- m_i := 0 (average deviation of observed rewards)
- M_i := 0 (maximum deviation of observed rewards)
Determine scaling parameters a and b for the mean rewards, by solving the following equations:
- sum of (a * p_i + b) for all i == 1
- max of (a * p_i + b) for all i == CA_scale where CA_scale is a constant, usually 1.
Choose the next arm to use according to the Upper Confidence Bound algorithm.
- If some arm has not been used yet (n_i == 0), choose it.
- If all arms have already been used, choose the arm i that maximizes the following quantity (UCB1 formula):
(a * p_i + b) + sqrt(2 * log(sum of n_k for all k)) / n_i)

where (a * p_i + b), the scaled mean reward, favors exploitation of good arms, and the second term favors exploration of arms that have not been used since some time.
Use the chosen arm, measure the obtained reward r, and update the arm's variables as follows:
- p_i := (n_i * p_i + r) / (n_i + 1)
- n_i := n_i + 1
- m_i := m_i + (p_i - r + delta) where delta is a constant.
- M_i := max(M_i, m_i)
Test for changes in the arm's probability distribution:
- If M_i - m_i > lambda, where lambda is a constant, go back to step one to restart the bandit.
- Else, continue using the bandit by going back to step two.

Refer to the paper the work of DaCosta et al. (1) to know about the meaning of the various indicators and constants.

Implementation in MicroGP

In MicroGP, an arm is a genetic operator, and the reward is determined by the quality of the generated offspring. To store indicators about the available operators, we use the Statistics class. Each population has a Statistics instance. Inside this class, each operator has a Data instance that contains the four DMAB indicators.

During a generation, MicroGP applies several operators before evaluating the individuals and measuring the rewards. Moreover, the execution of an operator generate several individuals or groups, which will all get a different evaluation. The reward of for this call to the operator must aggregate in some way the various evaluations of all generated offspring. These two constraints impose the introduction of another data structure, and small modifications to the algorithm.

Aggregating evaluation results into a reward value

In order to link the evaluation of an individual to the reward of an operator, we create an instance of CallData after each call. We the call returns new offspring, we put a reference to the CallData instance in the Lineage of the offspring.

Later on, when the offspring has been evaluated, we loop over each new individual or group and build inside the CallData object and histogram of the performance of the offspring. That is, if a child is better than all other individuals of the previous generation, we increment the VeryGood performance class of the operator. If the child is better than its parents, we count it as Good. Conversely, if the child is worse than its parents or worse than all the previous individuals, it is counted respectively as Bad or VeryBad. If the child has a fitness between that of its parents, it is counted as Normal.

When all newly evaluated children have been aggregated into the CallData object, we use each of those histograms to reward the operator. We use the "boolean reward" strategy, which means the reward for a call can only take the values 0 or 1. We thus give a positive reward to the operator if it has produced at least one Good or VeryGood individual.

The CallData instances are destroyed at the end of each generation.

Performing several operator selections without reward updates

The other problem of the DMAB implementation is that we have to select an operator several times in a row without being able to properly update the indicators between each selection. Indeed, MicroGP can compute the reward of an operator only when the children have been evaluated, at the end of the generation. During one generation MicroGP can apply any arbitrary number of operators, any number of times.

Without any modification we would be applying only one operator per generation, the one with the best MAB score from the last indicator update. This situation would not allow any exploration during the generation, only exploitation. To prevent that, we count the number of non-rewarded-yet executions of the operator, and we add it to the UCB1 formula in order to lower its score, so eventually another operator can take over during the generation and be given a chance. Here is the modified formula:

(a * p_i + b) + sqrt(2 * log(sum of n_k for all k)) / (n_i + nNonRewardedYet_i))

We apply the same trick when there are operators that have never been used yet. In order to choose each of them in turn, we affect a very high score (1000) to such an unused operator, minus its number of non rewarded executions.

References

DaCosta, Luis, et al. "Adaptive operator selection with dynamic multi-armed bandits." Proceedings of the 10th annual conference on Genetic and evolutionary computation. ACM, 2008.