Once training is complete, you will want to examine the learning progress graphs to determine how well the policy has learned.
Iteration - The interval at which the weights and biases of a neural network are updated. With each iteration, the policy moves closer to an optimal set of parameters.
Mean Reward Score Over All Episodes - During training, Pathmind will replay the same "episode" (i.e. simulation run) as many times as possible. At the end of each episode, the final reward is captured and averaged across all episodes in a single training iteration.
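The metric above can be sketched in a few lines. This is not Pathmind code; the nested list of per-episode final rewards is a hypothetical structure used purely to illustrate how the per-iteration average is formed.

```python
# Hypothetical data: for each training iteration, one final reward per
# replayed episode (Pathmind replays the same episode many times).
episode_rewards_by_iteration = [
    [1.2, 0.8, 1.5],   # iteration 1
    [2.1, 1.9, 2.4],   # iteration 2
    [3.0, 2.8, 3.3],   # iteration 3
]

# Mean reward score over all episodes, one value per iteration.
mean_rewards = [sum(rewards) / len(rewards) for rewards in episode_rewards_by_iteration]
```

Plotting `mean_rewards` against the iteration index reproduces the training curve discussed in the rest of this section.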
A well formulated reward function will yield a graph similar to the one below. This usually means that the policy has "converged", meaning it has achieved the best possible outcome.
You may notice training curves that appear to learn well but are not fully converged. In this scenario, simply training the policy longer is usually all that is needed.
Not Enough Signal
A poorly performing reward function will yield a noisy graph with no discernible pattern.
To improve learning, you will want to investigate three things:
Additional observations, to determine if the policy is missing information it needs to make a well-informed decision.
Different rewards, to see if other metrics provide a stronger signal. It is crucial that the rewards correspond with a meaningful change to the state of the environment.
A different action space, as yours may not be well suited to your use case. (Reach out to us if you need help with this.)
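The second point above can be made concrete: a reward tied to a *change* in the environment's state gives the policy a signal its actions can actually influence, whereas a near-constant quantity provides little to learn from. The sketch below uses a hypothetical `throughput` metric purely for illustration.

```python
# Sketch with hypothetical state dictionaries: reward the improvement in a
# metric since the last step, rather than its absolute value, so that
# actions which change the state are what get credited.
def step_reward(prev_state, state):
    return state["throughput"] - prev_state["throughput"]

prev_state = {"throughput": 40.0}
state = {"throughput": 42.5}
reward = step_reward(prev_state, state)  # positive only if the action helped
```

An absolute-value reward (e.g. returning `state["throughput"]` directly) can stay almost flat regardless of what the policy does, which is one common cause of the noisy, patternless curves described above.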