Introduction

The starting point for all reinforcement learning experiments is your AnyLogic simulation. It must be framed in a way that is conducive to reinforcement learning.

Core Components

Episodes

"Episodes" are defined as a full simulation run from start to end as represented by the arrow in the box above. An episode should capture the real-life dynamics of the use case that you are trying to solve.

The most important consideration when defining an episode is its endpoint. Every simulation must end at some point, and this endpoint can impact the quality of training. There are two common types of endpoint:

  1. Model Time - A simulation that ends at a predefined end time or end date. For example, this could be one work day (e.g. 9am to 5pm) in a factory or a particular date range (e.g. January to April).
  2. Goal Based - A simulation that ends when an objective is reached. For example, the simulation could end once production volume exceeds a certain threshold (see the sketch after this list).
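
In AnyLogic, a Model Time endpoint can usually be configured directly in the experiment properties (a fixed stop time or stop date), while a Goal Based endpoint is typically implemented with a condition-triggered Event that ends the run. Below is a minimal sketch of the goal-based case; totalProduced and productionTarget are placeholder variables standing in for whatever your model tracks.

```java
// AnyLogic Event with trigger type "Condition" (variable names are placeholders).
//
// Condition field:
//   totalProduced >= productionTarget
//
// Action field: end the simulation run, and with it the episode.
finishSimulation();
```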

Defining your episode correctly is the first step toward obtaining a well-trained policy.

Steps

In the reinforcement learning domain, a "step" defines when a reinforcement learning agent should interact with its environment.

This interaction is known as an "action", and an action occurs only when you trigger one in Pathmind. An example of an action could be telling a machine what to do next after it completes a task.

There are two common ways to trigger steps (i.e. actions), both of which are sketched in code after the descriptions below:

1. Cyclic Steps

A cyclic step occurs at a fixed interval of model time, say one step per second; e.g. scheduling production once per day.

2. Conditional Steps

A conditional step occurs only when a condition is met; e.g. when you arrive home, the next step is to check the mail or open the door.
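
In AnyLogic, both trigger styles map naturally onto Events. Here is a minimal sketch, assuming the Pathmind Helper is present in the model and exposes a function that triggers the next action; the function name pathmindHelper.triggerNextAction(), the recurrence value, and the machine.isIdle() condition are illustrative assumptions, so check the helper's actual interface in your model.

```java
// 1. Cyclic step: AnyLogic Event, trigger type "Timeout", mode "Cyclic",
//    recurrence = 1 day. Action field:
pathmindHelper.triggerNextAction();  // e.g. decide the production schedule once per day

// 2. Conditional step: AnyLogic Event, trigger type "Condition".
//    Condition field ("machine" is a placeholder for an agent in your model):
//      machine.isIdle()
//    Action field:
pathmindHelper.triggerNextAction();  // decide what the idle machine should do next
```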

Impact on Training Quality

In order to learn, a reinforcement learning algorithm replays the "episode" you defined above as many times as possible. The more often it can experience an episode, the easier it is for the algorithm to learn, similar to how a human improves with more practice.

Pathmind will attempt to fit as many episodes as possible into each experiment as shown in the diagram below. The small grey boxes represent your AnyLogic simulation.
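
The loop below is a toy, self-contained illustration of this episode-replay structure, not Pathmind's actual trainer; every name and number in it is invented for the example.

```java
import java.util.Random;

// Generic sketch: episodes are replayed within each training iteration,
// and the policy learns from the accumulated experience.
public class TrainingSketch {
    public static void main(String[] args) {
        Random rng = new Random(0);
        int iterations = 3;            // outer training iterations
        int episodesPerIteration = 4;  // the trainer packs in as many episodes as it can
        for (int iter = 0; iter < iterations; iter++) {
            double totalReward = 0;
            for (int ep = 0; ep < episodesPerIteration; ep++) {
                boolean done = false;  // one episode: run the simulation from start to endpoint
                int step = 0;
                while (!done) {
                    int action = rng.nextInt(2); // the policy picks an action at each step
                    totalReward += action;       // the environment returns a reward (stubbed)
                    done = ++step >= 10;         // the episode's endpoint (fixed length here)
                }
            }
            System.out.printf("iteration %d: episodes=%d, total reward=%.1f%n",
                    iter, episodesPerIteration, totalReward);
        }
    }
}
```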

To audit how many episodes Pathmind is using for training, simply mouse over the training graph. A box will appear that displays the episode count at that particular training iteration. The larger the episode count, the more experience the model accumulated in that particular training iteration.

Key Takeaways

To obtain a successfully trained policy, there are a couple of factors you can control in your AnyLogic model.

  1. Shorter "episodes" are better (while preserving the integrity of the model), since they make training faster and allow Pathmind to fit more episodes into each training run. Try to capture the real-life dynamics of your use case as concisely as possible.
  2. Ensure each "step" results in a meaningful change in your simulation's environment. If every step (i.e. action) has nearly the same impact, it will be difficult for the reinforcement learning algorithm to identify the levers that drive results.