  1. Sign up for Pathmind
  2. Download and install the Pathmind Helper
  3. Download AnyLogic
  4. Download the tutorial files

Simulation Overview

Managing inventory levels in even a basic supply chain can be challenging. Factors such as inventory holding costs and backlogged orders must be balanced while keeping profit margins high and customer wait times low. Simulation modeling is well suited to exploring supply chain uncertainties because candidate solutions can be tested without real-world trials. Adding reinforcement learning to those models opens up even more possibilities, especially where demand changes over time. Unlike heuristics or optimizers, which produce static parameters, a reinforcement learning policy can adjust its decisions as conditions change and so prepare for unexpected stockout events. That adaptability is its key advantage.

The model used in this tutorial is inspired by the supply chain model publicly available on AnyLogic Cloud (please read it for context). It was retrofitted so that a reinforcement learning policy can control the fixed and variable inventory levels, and the results were then compared with those of OptQuest, the built-in optimizer.

The objective of the new model is to determine the inventory levels necessary to minimize holding costs while also keeping customer wait times low. 


Step 1 - Perform a run with random actions to check Pathmind setup.

Go through the steps of the Check Pathmind Helper Setup guide to make sure that everything is working without error. Completing this step will also demonstrate how the model performs using random actions instead of an optimizer or policy.

Step 2 - Examine the reinforcement learning elements.

Observations - This simulation contains 12 observations for the retailer, the wholesaler, and the factory. These observations provide a comprehensive snapshot of the current state of the system.

Reward Variables - There are two reward variables used as ingredients for the reward function in this model: 

  • Average waiting time (in days) to satisfy demand.
  • Average total cost of managing inventory levels.

Actions - The action sets both the fixed (big "S") and variable (small "s") inventory levels. There are two things to note:

  1. We have discretized the action space into increments of 10. This means the policy can select inventory levels of 10, 20, ..., 200 (n = 20). Reducing the number of actions makes policy training more efficient without compromising the integrity of the model.
  2. There are 6 action outputs (size = 6). These correspond to 2 inventory levels (fixed and variable) for 3 agents (retailer, wholesaler, and factory) which means 2 x 3 = 6 action outputs.

Finally, the doAction function accepts and executes the actions from the policy.
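The mapping from discrete actions to inventory levels described above can be sketched as follows. This is a minimal illustration, not the tutorial model's actual doAction code: the class name ActionDecoder, the zero-based action index, and the order of the 6 outputs are all assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of the tutorial model): decodes discrete actions. */
final class ActionDecoder {
    static final int STEP = 10;     // increment between selectable inventory levels
    static final int CHOICES = 20;  // n = 20 choices per output: 10, 20, ..., 200
    static final int OUTPUTS = 6;   // 2 inventory levels x 3 agents

    /** Converts one action index (assumed zero-based, 0..19) to a level (10..200). */
    static int toInventoryLevel(int actionIndex) {
        if (actionIndex < 0 || actionIndex >= CHOICES) {
            throw new IllegalArgumentException("action index out of range: " + actionIndex);
        }
        return (actionIndex + 1) * STEP;
    }

    /** Decodes all 6 action outputs into inventory levels. */
    static int[] decode(int[] actions) {
        if (actions.length != OUTPUTS) {
            throw new IllegalArgumentException("expected " + OUTPUTS + " action outputs");
        }
        int[] levels = new int[OUTPUTS];
        for (int i = 0; i < OUTPUTS; i++) {
            levels[i] = toInventoryLevel(actions[i]);
        }
        return levels;
    }

    public static void main(String[] args) {
        // Assumed output order: one pair of levels each for retailer, wholesaler, factory.
        int[] actions = {0, 4, 9, 14, 1, 19};
        System.out.println(Arrays.toString(decode(actions))); // [10, 50, 100, 150, 20, 200]
    }
}
```

With 20 choices per output, each of the 6 outputs is a small discrete decision, which is what keeps the action space manageable for training.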

Done - Each simulation run is done after one calendar year.

Event Trigger - An action is triggered at the end of each day. To do so, a checkOrders event fires triggerNextAction() at 11pm each day.
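To see what a daily trigger combined with a one-year run implies for episode length, the sketch below counts the 11pm trigger times that fall within one calendar year. It uses only java.time and does not call any AnyLogic or Pathmind API; the class name and the start dates are assumptions chosen for illustration.

```java
import java.time.LocalDateTime;
import java.time.LocalTime;

/** Hypothetical sketch: counts daily 11pm decision points in a one-year run. */
final class DailyTriggerSketch {
    /** Returns the number of 23:00 trigger times in [start, start + 1 year). */
    static int countTriggers(LocalDateTime start) {
        LocalDateTime end = start.plusYears(1);
        // First candidate trigger: 23:00 on the start date.
        LocalDateTime next = start.toLocalDate().atTime(LocalTime.of(23, 0));
        if (next.isBefore(start)) {
            next = next.plusDays(1); // the run started after 23:00, so skip to the next day
        }
        int count = 0;
        while (next.isBefore(end)) {
            count++;
            next = next.plusDays(1);
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed start date; the tutorial model's actual start date may differ.
        LocalDateTime start = LocalDateTime.of(2021, 1, 1, 0, 0);
        System.out.println(countTriggers(start)); // 365
    }
}
```

Under these assumptions the policy makes one decision per simulated day, so an episode consists of roughly 365 actions (366 when the year includes a leap day).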

Step 3 - Export model and get Pathmind policy.

Complete the steps in the Exporting Models and Training guide to export your model, complete training, and download the Pathmind policy.

Reward Function - The reward function for this model will be:

reward -= after.meanDailyCost - before.meanDailyCost; // Minimize cost

Notice that this is exactly the optimizer's objective function.
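The following sketch works through the reward line on a few sample values. It assumes that before and after hold the reward variables at the previous and current action, and that reward starts at 0 for each step; the Snapshot class and the cost figures are hypothetical and exist only for this example.

```java
/** Hypothetical sketch of how the reward line behaves on sample values. */
final class RewardSketch {
    /** Stand-in for a snapshot of the reward variables (assumed shape). */
    static final class Snapshot {
        final double meanDailyCost;
        Snapshot(double meanDailyCost) { this.meanDailyCost = meanDailyCost; }
    }

    /** Applies the tutorial's reward line to one step, starting from 0. */
    static double reward(Snapshot before, Snapshot after) {
        double reward = 0.0;
        reward -= after.meanDailyCost - before.meanDailyCost; // Minimize cost
        return reward;
    }

    /** Sums the per-step rewards over a sequence of observed mean daily costs. */
    static double episodeReward(double[] costs) {
        double total = 0.0;
        for (int i = 1; i < costs.length; i++) {
            total += reward(new Snapshot(costs[i - 1]), new Snapshot(costs[i]));
        }
        return total;
    }

    public static void main(String[] args) {
        // Illustrative values only, not results from the tutorial model.
        System.out.println(reward(new Snapshot(120.0), new Snapshot(125.0))); // -5.0 (cost rose)
        System.out.println(reward(new Snapshot(120.0), new Snapshot(110.0))); // 10.0 (cost fell)
        System.out.println(episodeReward(new double[] {0.0, 120.0, 125.0, 110.0})); // -110.0
    }
}
```

Because each step's reward is the negative change in mean daily cost, the rewards summed over an episode telescope to the negative of the total change in cost. Maximizing the episode reward is therefore equivalent to minimizing the final mean daily cost.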

Step 4 - Run training.

Click Start Training. You will receive a message notifying you that training has begun, and an email when it completes. Once that happens, you’ll be able to export the Pathmind policy. 

Step 5 - Validate the reinforcement learning policy against the built-in optimizer.

Back in AnyLogic, open the Pathmind Helper properties. Change the Mode field to Use Policy, click Browse, and select the new policy file.

Next, run the Optimization Experiment to obtain the best parameter values. Once the Optimization Experiment is complete, the best parameter values are automatically saved.

Exit the Optimization Experiment, open the Monte Carlo Experiment, and take note of the drop-down options.

For each drop-down option, run the Monte Carlo Experiment.*

  • OptQuest Optimizer - Validate the results of the optimization experiment.
  • Reinforcement Learning Policy - Validate the results of the policy trained with Pathmind. 

*Do not close the Monte Carlo experiment window, because doing so clears the results. Instead, click the stop button and re-run with another selection.

Once complete, results similar to those below will be presented. 


From the Monte Carlo results above, the reinforcement learning policy is able to outperform the optimizer by over 20%.
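A comparison of this kind can be computed as sketched below, assuming the metric being compared is a cost averaged over Monte Carlo replications, where lower is better. The class name and the numbers are placeholders chosen so the arithmetic is easy to follow; they are not the tutorial's results.

```java
/** Hypothetical sketch: relative improvement between two sets of replications. */
final class MonteCarloComparison {
    /** Arithmetic mean of the per-replication values. */
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no replications");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Percent by which the candidate's mean cost is lower than the baseline's. */
    static double percentReduction(double[] baseline, double[] candidate) {
        double base = mean(baseline);
        return 100.0 * (base - mean(candidate)) / base;
    }

    public static void main(String[] args) {
        // Placeholder per-replication costs, NOT results from the tutorial.
        double[] optimizerCosts = {100.0, 110.0, 90.0};
        double[] policyCosts = {80.0, 70.0, 75.0};
        System.out.println(percentReduction(optimizerCosts, policyCosts)); // 25.0
    }
}
```

A positive value means the candidate (here, the policy) achieved a lower mean cost than the baseline (here, the optimizer).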

Why is this the case? 

Optimizers work extremely well in static environments. However, optimizers tend to overfit, deriving parameters that work well in one specific situation but not in others. In contrast, reinforcement learning adapts well to stochastic environments.

Adapting to new data can be important in many real-life settings, such as those subject to shifting demand, weather, or supply shocks. For companies operating in predictable, relatively static environments, optimizers may be the best choice.

Companies operating in dynamic and unpredictable environments, however, are likely to find that reinforcement learning gets better results by responding immediately to unexpected changes.
