  1. Complete the Get Started tutorial

  2. Download the tutorial files

Simulation Overview

This model simulates a multi-echelon product delivery scenario with 2 manufacturing centers, 4 distributors, and 8 retailers. Consumer demand fluctuates randomly and the time it takes for new inventory to arrive depends on which manufacturing center or distributor is fulfilling the order.

The objective is to maximize profit by ensuring all entities have sufficient inventory to fulfill demand at any given moment. If there is not enough inventory on hand, sales are lost. However, excess inventory incurs storage costs, so it is crucial to stock just enough product to satisfy demand.

To evaluate the performance of the reinforcement learning policy, we compare it to a hybrid optimizer/heuristic baseline. The optimizer finds the optimal minimum and maximum restock levels for each distributor and retailer, while the heuristic directs restock requests to the closest fulfillment partner.
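The baseline's min/max rule can be sketched as a classic order-up-to policy: whenever stock falls below the minimum level, order enough to reach the maximum level. The class and method names below are hypothetical; in the tutorial, the actual levels are the values the optimizer tunes per entity.

```java
// Sketch of the baseline min/max restock rule (hypothetical names; the
// tutorial's optimizer searches for minLevel and maxLevel per entity).
public class MinMaxRestock {

    // Returns how many units to order: order up to maxLevel whenever
    // stock falls below minLevel, otherwise order nothing.
    public static int restockQuantity(int stock, int minLevel, int maxLevel) {
        if (stock < minLevel) {
            return maxLevel - stock; // top back up to the maximum level
        }
        return 0; // above the reorder point: no order placed
    }
}
```

Because the two thresholds are fixed for the whole run, this rule cannot react to day-to-day demand swings, which is the weakness the policy exploits later in the tutorial.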

In this example, the reinforcement learning policy achieves 34.3% greater profit than our baseline.


Step 1 - Run the simulation to check Pathmind Helper setup.

Go through the steps in the Pathmind Helper Setup guide to make sure that everything is working correctly. Completing this step will also demonstrate how the model performs using random actions instead of following a trained policy.

Step 2 - Examine the Pathmind properties.

Observations - We use 42 observations to train the policy. These include:

  1. Manufacturing Center stock and orders backlog.

  2. Distributor stock, orders backlog, and expected deliveries.

  3. Retailer stock, orders backlog, and expected deliveries.

  4. Time, which includes the day of the week and the day of the year.
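One plausible way the count reaches 42 (an assumption based on the entity counts above, not code from the model): 2 manufacturing centers × 2 values, 4 distributors × 3 values, 8 retailers × 3 values, plus 2 time values, i.e. 4 + 12 + 24 + 2 = 42. A minimal sketch of assembling such a vector, with hypothetical field layout:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of assembling a 42-element observation vector (hypothetical
// layout; the real model reads these values from AnyLogic agents).
public class ObservationSketch {

    public static double[] build(double[][] mcStockBacklog,        // 2 x 2
                                 double[][] distStockBacklogDeliv, // 4 x 3
                                 double[][] retStockBacklogDeliv,  // 8 x 3
                                 double dayOfWeek, double dayOfYear) {
        List<Double> obs = new ArrayList<>();
        for (double[] mc : mcStockBacklog)       for (double v : mc) obs.add(v);
        for (double[] d : distStockBacklogDeliv) for (double v : d) obs.add(v);
        for (double[] r : retStockBacklogDeliv)  for (double v : r) obs.add(v);
        obs.add(dayOfWeek);
        obs.add(dayOfYear);
        double[] out = new double[obs.size()];
        for (int i = 0; i < out.length; i++) out[i] = obs.get(i);
        return out; // 2*2 + 4*3 + 8*3 + 2 = 42 values
    }
}
```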

Metrics - 15 metrics are tracked in this simulation. However, as you will see later, we only use total profit, distributor queue, and retailer queue to train the policy.

Actions - The action space in this model is a mixed tuple, meaning the policy makes several decisions at once. In short, at each step the policy decides how much inventory every distributor and retailer will order.
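With 4 distributors and 8 retailers, a tuple action plausibly carries 12 order quantities per step (an assumption about the layout, not code from the model). A sketch of splitting such a tuple into per-entity orders:

```java
// Sketch of decoding a tuple action into per-entity order quantities
// (hypothetical layout: 4 distributor orders followed by 8 retailer orders).
public class ActionSketch {

    // Splits a 12-element action into {distributorOrders, retailerOrders}.
    public static int[][] split(int[] action) {
        int[] distributorOrders = new int[4];
        int[] retailerOrders = new int[8];
        System.arraycopy(action, 0, distributorOrders, 0, 4);
        System.arraycopy(action, 4, retailerOrders, 0, 8);
        return new int[][] { distributorOrders, retailerOrders };
    }
}
```

Because all twelve quantities come out of a single policy step, the distributors and retailers effectively coordinate their reorders rather than acting independently.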

Event Trigger - Inside the generateDemand event, you will see that Pathmind is triggered once per day. In other words, distributors and retailers will decide how much inventory to replenish at the end of each day.

Done - This simulation is set to run for one year.

However, there is a slight nuance. Inside the takeStep function, we end the simulation early if the backlog grows too large. This trick hides "bad" data from the policy, since such runs can confuse training. We only want the policy to learn from good simulation runs.
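The early-stop check can be sketched as follows; the threshold and names here are hypothetical, since the actual cutoff lives inside the model's takeStep function:

```java
// Sketch of the early-termination check described above (hypothetical
// threshold and names; the real logic lives in the model's takeStep).
public class EarlyStop {

    static final double MAX_BACKLOG = 50000; // assumed cutoff value

    // True when the run should be cut short and flagged (see isCut in
    // the reward function below, which penalizes cut runs).
    public static boolean shouldEnd(double totalBacklog) {
        return totalBacklog > MAX_BACKLOG;
    }
}
```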

Step 3 - Export model and get Pathmind policy.

Complete the steps in the Exporting Models and Training guide to export your model, complete training, and download the Pathmind policy.

Reward Function

// Maximize total profit. Divide by 5,000,000 to normalize.
reward += (after.total_profit - before.total_profit) / 5000000;
// Minimize distributor queue. Divide by 60,000 to normalize.
reward -= (after.distributor_queue - before.distributor_queue) / 60000;
// Minimize retailer queue. Divide by 80,000 to normalize.
reward -= (after.retailer_queue - before.retailer_queue) / 80000;

// If the simulation is ended early, slap the policy with a large penalty.
if (after.isCut) {
    reward -= 10;
}

A policy will be generated after training completes. A trained policy file is included in the tutorial folder.

Step 4 - Run the simulation with the Pathmind policy.

Back in AnyLogic, open the pathmindHelper properties and locate the policy file.

Now run the included Monte Carlo experiment to validate the results. As you can see below, the policy is able to achieve a much higher total profit. How did the policy achieve these results?

From the Monte Carlo, you may notice two things:

  1. The policy minimizes lost demand (i.e. lost sales) because it almost always reorders enough inventory to satisfy demand.

  2. The policy greatly reduces inventory costs because it is penalized (i.e. lower profit) for requesting too much inventory.

You may wonder why the optimizer is unable to achieve the same result. The Monte Carlo results below reveal the answer. Compared to an optimizer (i.e. finding min/max restock thresholds) which is static in nature, the policy can dynamically change reorder quantities depending on the situation.

As a result, it places orders more frequently and in smaller quantities, enabling it to fine-tune restock requests to expected demand. This results in much higher profit due to lower costs and fewer lost sales.


This tutorial demonstrates the benefits of reinforcement learning over an optimizer. Compared to an optimizer, which produces static outputs, reinforcement learning can dynamically adjust itself in real time, enabling it to achieve 34.3% greater profit in this multi-echelon supply chain example.
