Once you properly frame your action space, the next challenge is shaping your reward function. This is more of an art than a science, but we recommend following the process outlined below.

Step 1: Normalize your metrics.

Ensure that all of your reward variables are normalized to the range -1 to 1. This gives each reward equal weight, making it much easier to understand the relative impact of each term.
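
As a minimal sketch, here is how a metric could be scaled inside your simulation before it is exposed as a reward variable. The throughput variable and its assumed maximum are hypothetical, not part of Pathmind itself:

    // Hypothetical raw metric from the simulation and its assumed upper bound
    double throughput = 42.0;
    double maxThroughput = 100.0;
    // Scale into the range [-1, 1] so it carries the same weight as other rewards
    double normalizedThroughput = 2.0 * (throughput / maxThroughput) - 1.0;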

Step 2: Start with the most basic reward formulation.

After uploading your AnyLogic simulation to Pathmind, you will be asked to specify a goal.

This populates your reward function with the most basic reward formulation as a starting point: maximize or minimize a single metric.

after = This term represents the value of your reward variables at time t.

before = This term represents the value of your reward variables at time t - 1.

The policy compares these two values at each time step and learns to maximize or minimize the difference.
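
A minimal sketch of this formulation, assuming a hypothetical reward variable named goalMetric:

    // Reward the change in the metric between time t - 1 and time t
    reward += after.goalMetric - before.goalMetric;  // flip the sign to minimize instead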

Step 3: Train a policy using each reward term individually to confirm that it achieves the desired behavior.

We recommend testing each reward variable independently to observe what it is teaching the policy. For example, if a reward variable tracks utilization, train a policy using the utilization metric alone to see whether it actually influences utilization.

Example Using One Reward
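
A minimal sketch of such a single-reward test, assuming a hypothetical utilization reward variable normalized between -1 and 1:

    // Train on utilization alone to verify that the policy actually raises utilization
    reward += after.utilization - before.utilization;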

Once you establish that each reward does what you need, you can mix and match to get your desired behavior.

Example Using Multiple Rewards
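
A minimal sketch combining two hypothetical reward variables, utilization and cost:

    // Encourage higher utilization while discouraging higher cost
    reward += after.utilization - before.utilization;
    reward -= after.cost - before.cost;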

Also note that you can multiply individual reward terms by a factor. This tells the policy that particular rewards are more important than others.
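
For instance, assuming the same hypothetical variables, weighting cost twice as heavily as utilization could look like:

    reward += 1.0 * (after.utilization - before.utilization);
    reward -= 2.0 * (after.cost - before.cost);  // cost matters twice as much here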

Step 4: Advanced reward functions

If the above fails to achieve your desired outcome, or you want more control over the policy's behavior, you may leverage Java's Math class to manipulate your reward "topography".

Below are a couple of example tactics that you could try.

  1. Parabolic Rewards - reward -= Math.pow(1 - after.yourReward, 2);

  2. Conditional Statements - reward += after.yourReward > 10 ? 1 : 0;
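
As a sketch, reusing the snippets above with the same hypothetical yourReward variable, both tactics could be combined in a single reward function:

    // Quadratic penalty: the further yourReward falls from 1, the steeper the penalty
    reward -= Math.pow(1 - after.yourReward, 2);
    // Conditional bonus: grant a flat bonus once yourReward exceeds a threshold of 10
    reward += after.yourReward > 10 ? 1 : 0;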
