A reinforcement learning policy is a machine learning model. It represents what your reinforcement learning agent has learned during training. You can think of it as a “digital brain” that you have trained to perform a specific task. The output of a Pathmind experiment is a reinforcement learning policy.
The policy is contained in a file. Querying the file, or asking it to make decisions based on simulated or historical data, is how you evaluate what the AI has learned. The policy can be deployed in your AnyLogic or Python simulations, or in real-world applications.
Training a Pathmind Policy
During training, Pathmind steps through your simulation tens of thousands of times; each complete run through the simulation is called an “episode” in RL terminology. After each episode, the policy is updated based on what was learned.
What the agent learns is guided by your reward function, which defines the agent’s goals. Those goals also shape how the policy is updated. An RL policy is a neural network, and a neural network is, at its core, a large collection of numbers organized into multidimensional arrays known as tensors. Those numbers are referred to as parameters, and they are updated as the agent learns during its episodic experiences in the simulation.
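The loop described above can be illustrated with a deliberately tiny sketch. This is not Pathmind’s actual training algorithm; it is a hypothetical one-parameter “policy” whose parameter is nudged after each episode in whichever direction improves the reward defined by the reward function:

```python
def reward_function(action, target=3.0):
    # The reward function encodes the goal: produce an action close to `target`.
    return -(action - target) ** 2

def run_episode(w, obs=1.0):
    # The "policy" maps an observation to an action via its single parameter w.
    action = w * obs
    return reward_function(action)

w = 0.0                # the policy's lone trainable parameter
lr, eps = 0.1, 1e-4    # learning rate and finite-difference step
for episode in range(200):
    # Estimate how the reward changes as the parameter changes (finite difference),
    # then update the parameter in the direction of higher reward.
    grad = (run_episode(w + eps) - run_episode(w - eps)) / (2 * eps)
    w += lr * grad
# After training, w has converged close to the value that maximizes the reward.
```

A real policy network has millions of parameters and is trained by an RL algorithm rather than this finite-difference update, but the principle is the same: episodes generate rewards, and rewards drive parameter updates.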
Pathmind Policy Contents
Once training is complete, Pathmind outputs a policy as a TensorFlow SavedModel. This is a standalone artifact (meaning it can be run anywhere) that contains the trained parameters learned during Pathmind training.
The inputs to a policy file are observations from the simulation, and the outputs of the policy are predictions about the best action to take. You can query predictions directly from the policy file within your own application, outside of simulation software like AnyLogic, for example by feeding it real data from a live system.
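In practice you would load the SavedModel with TensorFlow and call its serving signature. The framework-free sketch below only illustrates the query pattern the paragraph describes: an observation vector goes in, a score per action comes out, and the predicted best action is the highest-scoring one. The shapes and parameter matrix are illustrative stand-ins, not Pathmind output:

```python
import numpy as np

# Stand-in for trained parameters: 4 observation features -> 3 possible actions.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))

def query_policy(observations):
    # Forward pass: observations in, one score per action out.
    scores = observations @ W
    # The prediction is the index of the best-scoring action.
    return int(np.argmax(scores))

obs = np.array([0.2, 1.5, 0.0, 3.7])  # example observation from one simulation step
action = query_policy(obs)
```

The same pattern applies whether the observations come from a simulation step or from real operational data.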
Using a Pathmind Policy
Typically, there are two ways to use a Pathmind policy:
1. Load your Pathmind policy in AnyLogic to inspect the results. Within your simulation environment, you can audit the policy predictions to identify learned behavior and use those learnings to improve your business processes.
2. Deploy the trained Pathmind policy in real-world applications using Pathmind Policy Serving, or by querying it manually. You can read more at: https://help.pathmind.com/en/articles/5395946-deploying-trained-policies-in-real-world-applications
Re-Training a Pathmind Policy
The performance of a policy can “drift,” or degrade, as the distribution of incoming data shifts away from the data the policy was originally trained on.
To deal with policy and data “drift,” track a KPI that indicates whether the policy’s predictions remain high quality. If that KPI degrades past a chosen threshold, say 10%, that signals your Pathmind policy needs to be retrained and updated.
For more information on machine-learning operations (MLOps) and ensuring good policy performance over time, please contact us.