Introduction

Intuitively, a reward function teaches an agent how it should behave. It does this by assigning a numerical value to each action, representing how good or bad that action was.

Below are several (but not all) of the ways you can shape rewards in Pathmind.

Building Blocks

Once you start a new experiment, you will be prompted to write a reward function. In this view, you'll be able to leverage several tools to nudge your agent towards your business objectives.

1. Before and After Variables

The most fundamental approach is to measure the value of each reward variable before and after an action is taken.

after = The value of a reward variable at time t (current action or current step)

before = The value of a reward variable at time t - 1 (previous action or previous step)

Using this mechanism, you can shape the reward function so that the agent checks whether it is doing better or worse after each action.

For example:

reward += after.reward0 - before.reward0; // Maximize reward0 at each step
reward -= after.reward1 - before.reward1; // Minimize reward1 at each step

2. Java Math Class

In many situations, you will want to manipulate the relative magnitude of each reward to tell the agent how important an outcome should be.

A simple example is to multiply the reward by 10.

reward += (after.reward0 - before.reward0) * 10; // Multiply by 10
reward -= after.reward1 - before.reward1;

This tells the agent that reward0 is ten times more important than reward1, because its contribution to the total reward is ten times larger.
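
Scaling also matters when reward variables live on very different numeric ranges. As a rough sketch (assuming, hypothetically, that reward1 is measured in the thousands while reward0 is in single digits), you can rescale each term so the weights reflect true relative importance:

reward += (after.reward0 - before.reward0) * 10; // High-priority objective
reward -= (after.reward1 - before.reward1) * 0.001; // Rescale a large-magnitude variable onto a comparable range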

However, simple arithmetic may not capture exactly what you want the agent to learn. In that case, you can use any of the methods in Java's Math class in your reward function.

reward += Math.log(after.reward0 - before.reward0); // Log of the difference
reward -= Math.abs(after.reward1 - before.reward1); // Absolute value of the difference
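
Note that some Math methods only behave sensibly for certain input ranges; Math.log, for instance, returns NaN when the difference is negative. One possible guard (a sketch, not the only approach) is to clamp the difference to a small positive value first:

reward += Math.log(Math.max(after.reward0 - before.reward0, 1e-6)); // Clamp before taking the log to avoid NaN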

3. Conditional Operators

Another trick is to use conditional operators to only assign a reward if certain conditions are met.

Let's say you want to reward an agent only if after.reward0 is greater than or equal to 1.

if (after.reward0 >= 1) {
    reward += 10;
}

This can also be written in shorthand.

reward += (after.reward0 >= 1) ? 10 : 0;
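
Conditions can also be combined with standard boolean operators. For example, here is a sketch that only rewards the agent when reward0 improves and reward1 falls in the same step:

reward += (after.reward0 > before.reward0 && after.reward1 < before.reward1) ? 1 : 0; // Reward only when both objectives move in the right direction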

4. In-Built Pathmind Variables

Finally, Pathmind exposes a few in-built variables that you can bake into your reward function.

isDone(-1) - A boolean (true or false) that returns true once an episode is complete.

getObservation() - The observations array at the current step.

action - The action taken by the agent at the current step.

Using the above, you can create more granular conditions in which to reward an agent.

reward += isDone(-1) ? after.reward0 : 0; // Reward at the end of an episode

This reward function will assign a reward equal to after.reward0 at the end of an episode. For example, in the case of a factory, it may be easier for the agent to learn to maximize total production at the end of each episode (e.g. a workday) rather than after each action.
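
The action and getObservation() variables can be used in the same way. As a hypothetical sketch (assuming action is an integer index and getObservation() returns a numeric array; check your own model for the actual types and meanings), you could discourage a particular action whenever the first observation is already high:

reward -= (action == 2 && getObservation()[0] > 100) ? 5 : 0; // Hypothetical: penalize action 2 when observation 0 exceeds 100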

Conclusion

Reward shaping is more of an art than a science. There is no golden rule, so you will need to run many experiments to craft a reward function that best meets your business objectives.
