Why Reinforcement Learning?
Reinforcement learning (RL) is both simple and powerful.
Power: Reinforcement learning has proven its ability to solve complex problems. It can beat world champions in ancient games such as Go, or control the motion of industrial robots.
Simplicity: Adapting simulations for reinforcement learning with Pathmind is intuitive and fast, even on complex problems. All pattern matching, analysis and optimization is performed by a policy trained to achieve your goals.
This tutorial outlines one of the key steps in adapting your simulation for RL: creating the reward function.
What Is a Reward Function?
A reward function mixes reward variables into a single output value. This output provides feedback for a policy so that it can learn desirable behavior. Adjusting the reward function to output a positive reward for desirable behavior and a negative reward for undesirable behavior is the process of “reward shaping.”
The goal of this tutorial is to teach you how to write and shape your first reward function. This introduction will use the syntax detailed in the reward function syntax tutorial.
Figure 1: What is a reward function? The reward function combines reward variables (blue, orange and black squares on the left) with different importance weights and outputs a single reward value (right). This output is the main feedback for training an intelligent policy. If the reward variables define your values, then the reward function expresses your priorities.
Example: Supply Chain
Consider the supply chain model, in which a retailer, warehouse and factory exchange goods to meet incoming demand. The objective of this model is to reduce the cost incurred by each member of the supply chain in the face of variable demand. Desirable behavior drives down costs. Therefore, it is natural to make a totalCost reward variable. In addition to totalCost, other influences on the cost incurred, such as totalWaitTime, should also be defined as reward variables.
Once the reward variables are defined to align with “metrics that matter” (i.e., the goals you seek to achieve), it is time to bundle them together in a reward function.
The two reward variables, totalCost and totalWaitTime, accumulate throughout a simulation run or “episode.” The Pathmind policy will alter totalCost and totalWaitTime as it explores and trains for the best set of actions. The reward function accounts for how the policy has changed these reward variables.
The syntax shown below in the Pathmind web app tracks how the reward variables change at each step due to the policy. The “before” value stores totalCost before the action is taken, and the “after” value is the totalCost afterward. The difference between the two, after - before, represents how much additional cost has resulted from the action. If that cost is large, then the policy should be penalized.
Figure 2: Example of a single-variable reward function. At each step, the total cost changes as a result of the policy’s actions. The change is the difference between the resulting daily cost (blue) and the previous daily cost (red). The total reward is decremented, or penalized, by the change in daily cost at each step (purple).
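To make the arithmetic concrete, here is a minimal Java sketch of the single-variable reward, written to run outside Pathmind. Only the after.totalCost - before.totalCost expression comes from this tutorial; the Snapshot class, the stepReward helper and the cost values are hypothetical and exist purely for illustration.

```java
// Hypothetical stand-alone sketch; Pathmind supplies "before" and "after" for you.
final class SingleVariableReward {

    // Assumed stand-in for the view of reward variables at one point in time.
    static final class Snapshot {
        final double totalCost;

        Snapshot(double totalCost) {
            this.totalCost = totalCost;
        }
    }

    // Mirrors the tutorial line: reward -= after.totalCost - before.totalCost;
    static double stepReward(Snapshot before, Snapshot after) {
        double reward = 0.0;
        reward -= after.totalCost - before.totalCost;
        return reward;
    }

    public static void main(String[] args) {
        // Made-up cumulative costs observed at successive steps of one episode.
        double[] cumulativeCost = {0.0, 120.0, 150.0, 400.0};
        double episodeReward = 0.0;
        for (int i = 1; i < cumulativeCost.length; i++) {
            Snapshot before = new Snapshot(cumulativeCost[i - 1]);
            Snapshot after = new Snapshot(cumulativeCost[i]);
            double r = stepReward(before, after);
            System.out.println("step " + i + " reward = " + r);
            episodeReward += r;
        }
        System.out.println("episode reward = " + episodeReward);
    }
}
```

With these made-up numbers, the step where the cost jumps from 150 to 400 receives the largest penalty, which is the signal the policy learns to avoid.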
General Reward Function Strategy
The reward function described above has a structure that can be used in a wide variety of scenarios. In order to outline a general strategy, let’s first take a closer look at the after - before expressions in Figure 2, above. This syntax includes everything that the policy needs in order to learn: how actions affect the reward variables (“metrics that matter”); whether the effect is desirable or not (“reward vs. penalty”); and the relative importance of the reward variables.
Figure 3: Reward shaping strategy. Reward variables (black blocks) are first defined to accumulate during an episode (step 1), then they are normalized (step 2). Next, isolated component reward functions, shown as green, orange and red rectangles, are individually tested in the Pathmind production web app (step 3). Finally, the component reward functions are stacked together, and their weights, which denote relative importance, are tuned (step 4). The reward signal produced by the function is represented by the gold square.
Step 1: Define Aggregate Reward Variables
The first step in writing your reward function is to clearly define reward variables that accumulate during an episode. Variables such as totalDownTime are hypothetical examples. These variables are represented by the black blocks on the left of Figure 3.
Step 2: Normalize the Reward Variables
The reward function combines one or more reward variables together. This combination is simplified if the reward variables referenced in AnyLogic all have roughly the same magnitude. We recommend normalizing reward variables to ~1 in AnyLogic by:
(1) running the simulation model for one episode on random actions;
(2) recording the maximum values of each aggregate reward variable;
(3) dividing each reward variable by its maximum value so that it approaches ~1 as the episode progresses; and
(4) multiplying by the number of Pathmind trigger steps in each episode so that each reward variable increments by ~1 each step.
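The four steps above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not Pathmind or AnyLogic API: RewardNormalizer, maxOf and normalize are hypothetical names, and in practice you would apply the same arithmetic wherever your model updates its reward variables.

```java
// Hypothetical helper illustrating the normalization arithmetic.
final class RewardNormalizer {

    // Steps (1)-(2): the largest value an aggregate reward variable reached
    // during one episode run on random actions.
    static double maxOf(double[] recordedValues) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : recordedValues) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Step (3): divide by the recorded maximum so the variable approaches ~1
    // by the end of an episode.
    // Step (4): multiply by the number of trigger steps so the variable grows
    // by ~1 per step on average.
    static double normalize(double raw, double maxObserved, int stepsPerEpisode) {
        if (maxObserved <= 0.0) {
            throw new IllegalArgumentException("maxObserved must be positive");
        }
        return raw / maxObserved * stepsPerEpisode;
    }

    public static void main(String[] args) {
        // Made-up recordings of totalCost from a random-action episode.
        double[] recordedTotalCost = {50.0, 200.0, 512.0};
        double max = maxOf(recordedTotalCost);
        int stepsPerEpisode = 128; // assumed number of trigger steps
        System.out.println(normalize(4.0, max, stepsPerEpisode));
        System.out.println(normalize(512.0, max, stepsPerEpisode));
    }
}
```

The maximum comes from a single random-action run, so treat it as a rough scale rather than a hard bound: a trained policy may produce values above or below it.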
Note: You can also create separate reward variables that are not normalized. These will be used for tracking the true values of “metrics that matter” in the simulation metrics panel of the Pathmind Production web app.
Step 3: Run Isolated Reward Function Experiments
Reward functions are modular. In this step, you will build isolated components of a reward function, so that you can stack them together in the next step. Each component is a reward function that has just one reward variable. In the case of the supply chain, the components would read:
reward -= after.totalCost - before.totalCost;
reward -= after.totalWaitTime - before.totalWaitTime;
Run each of these component functions, then evaluate their outputs using the simulation metrics panel. For the first component, the totalCost should be low, and in the second, the totalWaitTime should be low.
Step 4: Combine Component Reward Functions and Tune Importance
The middle panel in Figure 3 represents how component reward functions can be combined. Each colored line shows a different individual reward function either as a penalty (-=) or a reward (+=), with a multiplicative factor adjusting its relative importance.
Run several combinations of reward functions built from pretested component functions as described in the previous step. Evaluate the performance of each experiment using the simulation metrics panel, then adjust the importance factor on each line to shape your reward.
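As a rough illustration of stacking components, the Java sketch below combines the two supply chain penalties with importance factors. It assumes both reward variables have already been normalized as in Step 2. The Snapshot class, the stepReward helper and the weight values are hypothetical; the weights are placeholders to tune, not recommendations.

```java
// Hypothetical stand-alone sketch of a two-component reward function.
final class CombinedReward {

    // Assumed stand-in for the "before"/"after" views of the reward variables.
    static final class Snapshot {
        final double totalCost;
        final double totalWaitTime;

        Snapshot(double totalCost, double totalWaitTime) {
            this.totalCost = totalCost;
            this.totalWaitTime = totalWaitTime;
        }
    }

    // Each line is one pretested component, scaled by its importance factor.
    static double stepReward(Snapshot before, Snapshot after,
                             double costWeight, double waitWeight) {
        double reward = 0.0;
        reward -= costWeight * (after.totalCost - before.totalCost);
        reward -= waitWeight * (after.totalWaitTime - before.totalWaitTime);
        return reward;
    }

    public static void main(String[] args) {
        Snapshot before = new Snapshot(10.0, 4.0);
        Snapshot after = new Snapshot(12.0, 5.0);
        // Placeholder weights: cost counts twice as much as wait time here.
        System.out.println(stepReward(before, after, 1.0, 0.5));
    }
}
```

Changing only the weights between experiments keeps each run comparable, so a shift in the simulation metrics can be attributed to the change in priorities.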
Step 5: Iterate to Build Your Reward-Shaping Skills
Reward shaping is an iterative process made simple by the Pathmind interface. Repeat steps 1 through 4 until you have a policy that meets your goals for the simulation metrics. Once you have trained a policy that meets your expectations for simulation metric values, export that policy for testing in AnyLogic, where you can compare it with other heuristics.