Prerequisites

  1. Complete the Get Started tutorial
  2. Download the tutorial files

Simulation Overview

Managing inventory levels in even a basic supply chain can be a challenging task. Factors such as inventory holding costs and backlogged orders must be considered while keeping profit margins high and customer wait times low. Simulation modeling is an ideal tool for finding solutions to supply chain uncertainties since no real-world testing is needed. Adding reinforcement learning to those models opens up even more possibilities, especially where changing demand comes into play. Unlike heuristics or optimizers, which are static, reinforcement learning can proactively prepare for unexpected stock-out events; that adaptability is its key advantage.

The model used in this tutorial is inspired by the supply chain model publicly available on AnyLogic Cloud (please review it for context). It was retrofitted so that a reinforcement learning policy controls the fixed and variable inventory levels, and the results were then compared against OptQuest, AnyLogic's built-in optimizer.

The objective of the new model is to determine the inventory levels necessary to minimize holding costs while also keeping customer wait times low. 

Tutorial

Step 1 - Run the simulation to check Pathmind Helper setup.

Go through the steps of the Check Pathmind Helper Setup guide to make sure that everything is working without error. Completing this step will also demonstrate how the model performs using random actions instead of an optimizer or policy.

Step 2 - Examine the Pathmind Properties.

Observations - This simulation contains 12 observations for the retailer, the wholesaler, and the factory. These observations provide a comprehensive snapshot of the current state of the system.
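As a rough, non-authoritative sketch of what gathering 12 observations might look like, four values per echelon could be collected as shown below. The field names (inventory, backlog, orderedAmount, expectedDemand) are hypothetical placeholders, and the exact function signature the Pathmind Helper expects may differ.

// Illustrative sketch only - field names are hypothetical placeholders.
// Four values for each of the three echelons = 12 observations in total.
double[] obs = new double[] {
    retailer.inventory,   retailer.backlog,   retailer.orderedAmount,   retailer.expectedDemand,
    wholesaler.inventory, wholesaler.backlog, wholesaler.orderedAmount, wholesaler.expectedDemand,
    factory.inventory,    factory.backlog,    factory.orderedAmount,    factory.expectedDemand
};
return obs;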

Metrics - There are three metrics that can be used as ingredients for the reward function (a placeholder sketch follows the list): 

  • Average waiting time (in days) to satisfy demand.
  • Average total cost of managing inventory levels.
  • Total inventory levels for the retailer, wholesaler, and factory.
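As a rough sketch only, these metrics could be computed along the following lines. Every variable name below is a placeholder, and the way metrics are actually declared in the Pathmind Helper may differ.

// Illustrative sketch only - all variable names are placeholders.
double meanWaitTime   = totalWaitTimeDays / Math.max(1, ordersFulfilled);               // avg. days to satisfy demand
double meanDailyCost  = totalInventoryCost / Math.max(1, daysElapsed);                  // avg. total cost of managing inventory
double totalInventory = retailer.inventory + wholesaler.inventory + factory.inventory;  // total stock across all echelons

The cost metric is the one the reward function refers to as before.meanDailyCost and after.meanDailyCost, i.e., its value immediately before and after each action.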

Actions - The action sets both the fixed (big "S") and variable (small "s") inventory levels. There are two things to note:

  1. We have discretized the action space into increments of 10. This means the policy can select inventory levels of 10, 20, ..., 200 (n = 20). Reducing the number of actions makes policy training more efficient without compromising the integrity of the model.
  2. There are 6 action outputs (size = 6). These correspond to 2 inventory levels (fixed and variable) for each of 3 agents (retailer, wholesaler, and factory), which means 2 x 3 = 6 action outputs.

Finally, the doAction function accepts and executes the actions from the policy.
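To make the mapping concrete, here is a hedged sketch of what doAction could look like. The parameter names (retailer.S, retailer.s, and so on) and the exact signature are assumptions for illustration, not necessarily the model's actual code.

// Illustrative sketch only - parameter names and signature are assumptions.
// Each of the 6 outputs is an index in 0..19, so (index + 1) * 10 yields a level between 10 and 200.
void doAction(int[] action) {
    retailer.S   = (action[0] + 1) * 10;  // fixed inventory level (big "S")
    retailer.s   = (action[1] + 1) * 10;  // variable inventory level (small "s")
    wholesaler.S = (action[2] + 1) * 10;
    wholesaler.s = (action[3] + 1) * 10;
    factory.S    = (action[4] + 1) * 10;
    factory.s    = (action[5] + 1) * 10;
}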

Event Trigger - An action is triggered at the end of each day. To accomplish this, a checkOrders event calls triggerNextAction() at 11 PM each day.
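The body of that event only needs a single call; the helper instance name below (pathmindHelper) is an assumption.

// checkOrders event, set to recur daily at 11 PM (instance name is an assumption)
pathmindHelper.triggerNextAction();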

Done - The simulation is set to complete after one calendar year.

Step 3 - Export model and get Pathmind policy.

Complete the steps in the Exporting Models and Training guide to export your model, complete training, and download the Pathmind policy.

Reward Function - The reward function for this model is:

reward -= after.meanDailyCost - before.meanDailyCost; // Minimize cost

Notice that this is essentially the optimizer's objective function, which allows for an apples-to-apples comparison. The reward decreases whenever the mean daily cost increases from one step to the next, so the policy learns to keep costs as low as possible.

A policy will be generated after training completes. A trained policy file is included in the tutorial folder.

Step 4 - Validate the reinforcement learning policy against the in-built optimizer.

Back in AnyLogic, open the Pathmind Helper properties. Change the Mode field to Use Policy, click Browse, and locate the policy file.

Next, run the Optimization Experiment to obtain the best parameter values. Once the Optimization Experiment is complete, the best parameter values are automatically saved.

Exit the Optimization Experiment and run the Monte Carlo Experiment to compare the Pathmind policy with the OptQuest Optimizer.

Conclusions

As the Monte Carlo results above show, the reinforcement learning policy is able to outperform the optimizer by over 20%.

Why is this the case? 

Optimizers work extremely well in static environments. The flip side is that they tend to overfit, deriving parameters that work well in one specific situation but not in others. Reinforcement learning, in contrast, adapts well to stochastic environments.

Adapting to new data can be important in many real-life use cases, such as shifting demand dynamics, weather, or supply shocks. For companies operating in predictable, relatively static environments, optimizers may be the best choice.

Companies operating in dynamic and unpredictable environments, however, are likely to find that reinforcement learning gets better results by responding immediately to unexpected changes.
