1. Complete the Getting Started tutorial.
2. Download the tutorial files.

Simulation Overview

Manufacturers optimize their delivery tactics to maximize profit and minimize product wait times.

When a network of manufacturing facilities and distributors is spread across a large area, managing deliveries becomes difficult. Manufacturers use simulation models to explore cost-effective ways to adjust deliveries, and within those simulations, reinforcement learning can quickly surface effective routes and decisions.

The base simulation for the model used in this tutorial is publicly available on AnyLogic Cloud. It has been modified by the Accenture Applied Intelligence team to showcase reinforcement learning using Pathmind.

In the model, cities are shown in a region of Europe. The model uses AnyLogic’s GIS elements to place agents at locations read from an Excel spreadsheet. The GIS elements also allow the delivery trucks to move along real roads.

The model includes three manufacturing centers and fifteen distributors in various locations. Each manufacturing center houses a fleet of three delivery trucks.

Within the distributor agent, a `generateDemand` Event creates orders. Each order is for a random quantity of between 500 and 1,000 units, and orders occur at random intervals of 1 to 2 days. Once an order is received, the reinforcement learning agent determines which manufacturer can fulfill the order most quickly.
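The demand process above can be sketched in plain Java. This is a minimal, hypothetical sketch, not the model's code: it assumes both the order size and the gap between orders are drawn uniformly, and it uses `java.util.Random` in place of AnyLogic's built-in distribution functions. The class and method names (`DemandSketch`, `nextOrderUnits`, `nextGapDays`) are illustrative only.

```java
import java.util.Random;

// Hypothetical sketch of a distributor's generateDemand Event.
// Assumes uniform distributions; not the tutorial model's actual code.
class DemandSketch {
    static final int MIN_UNITS = 500;
    static final int MAX_UNITS = 1000;
    static final double MIN_GAP_DAYS = 1.0;
    static final double MAX_GAP_DAYS = 2.0;

    // Order size: uniform integer in [500, 1000]; nextInt's bound is exclusive, hence + 1.
    static int nextOrderUnits(Random rng) {
        return MIN_UNITS + rng.nextInt(MAX_UNITS - MIN_UNITS + 1);
    }

    // Time until the Event fires again: uniform in [1, 2) days.
    static double nextGapDays(Random rng) {
        return MIN_GAP_DAYS + rng.nextDouble() * (MAX_GAP_DAYS - MIN_GAP_DAYS);
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed so runs are reproducible
        double clockDays = 0.0;
        for (int i = 0; i < 5; i++) {
            clockDays += nextGapDays(rng);   // wait for the next firing
            int units = nextOrderUnits(rng); // then create an order
            System.out.printf("day %.2f: order of %d units%n", clockDays, units);
        }
    }
}
```

In AnyLogic itself, this logic would live in the Event's recurrence settings and action code rather than in a standalone loop.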

Determining which manufacturer should handle an order depends on several time-dependent factors. For example, a manufacturer will not ship an order if it does not have enough inventory in stock to fulfill it. The extra time needed to produce more inventory is determined by the production processing diagram and the order processing flowchart within the ManufacturingCenter agent. This production delay is the key bottleneck that reinforcement learning addresses.

Another important factor that impacts delivery time is the distance between the distributor and the manufacturer. Sending a truck from a manufacturer that can fulfill an order immediately from stock may not be the fastest option if that manufacturer is many kilometers away from the distributor.

The model weighs these factors to select the manufacturer that should fulfill each order, minimizing both wait times and the distance driven. In the best-case scenario, the manufacturer nearest to the ordering distributor has enough inventory in stock to complete the order, which minimizes the time spent on both production and travel.
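To make the trade-off between inventory and distance concrete, here is a hypothetical back-of-the-envelope estimate. It is not how the model computes delays (the model uses its process diagrams); it assumes a fixed production rate (`UNITS_PER_DAY`) and a fixed truck speed (`KM_PER_DAY`), ignores truck availability, and compares a distance-only "nearest center" choice with a choice based on estimated wait. All names and numbers are illustrative assumptions.

```java
// Hypothetical estimate of how long an order would wait at each center.
// The rates below are assumptions for illustration, not model parameters.
class ManufacturerChoiceSketch {
    static final double UNITS_PER_DAY = 500.0; // assumed production rate per center
    static final double KM_PER_DAY = 600.0;    // assumed truck travel per day

    static final class Center {
        final int stockUnits;
        final double distanceKm; // road distance to the ordering distributor

        Center(int stockUnits, double distanceKm) {
            this.stockUnits = stockUnits;
            this.distanceKm = distanceKm;
        }
    }

    // Production time for any shortfall plus travel time; ignores truck availability.
    static double estimatedWaitDays(Center c, int orderUnits) {
        int shortfall = Math.max(0, orderUnits - c.stockUnits);
        return shortfall / UNITS_PER_DAY + c.distanceKm / KM_PER_DAY;
    }

    // Distance-only choice, like a "nearest manufacturing center" heuristic.
    static int nearestCenter(Center[] centers) {
        int best = 0;
        for (int i = 1; i < centers.length; i++) {
            if (centers[i].distanceKm < centers[best].distanceKm) best = i;
        }
        return best;
    }

    // Choice that accounts for both missing inventory and distance.
    static int fastestCenter(Center[] centers, int orderUnits) {
        int best = 0;
        for (int i = 1; i < centers.length; i++) {
            if (estimatedWaitDays(centers[i], orderUnits)
                    < estimatedWaitDays(centers[best], orderUnits)) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Center[] centers = {
            new Center(200, 120.0), // index 0: closest, but short on stock
            new Center(900, 300.0), // index 1: farther, fully stocked
            new Center(0, 450.0),   // index 2: farthest, empty
        };
        int orderUnits = 800;
        System.out.println("nearest: " + nearestCenter(centers));             // nearest: 0
        System.out.println("fastest: " + fastestCenter(centers, orderUnits)); // fastest: 1
    }
}
```

Under these assumptions, center 0 needs about 1.4 days (1.2 days of production plus 0.2 days of travel), while center 1 needs about 0.5 days, so the nearest center is not the fastest one.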


Step 1 - Perform a run with random actions to check the Pathmind Helper setup.

Go through the steps in the Pathmind Helper Setup guide to make sure that everything is working correctly. Completing this step will also demonstrate how the model performs using random actions instead of following a trained policy.

Step 2 - Examine the reinforcement learning elements.

Observations - Each of the three manufacturing centers is assigned an index: 0, 1, and 2. The observations function loops over these indices and makes the same observations for all three centers. These observations include stock levels, the total number of trucks, available trucks, and order amounts for each distributor.
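One way to picture the observations is as a flat array of numbers. The sketch below assumes a hypothetical layout (three values per center in index order, then one order amount per distributor); the real observations function may order or group these values differently, and `ObservationSketch` and its parameters are illustrative names.

```java
import java.util.Arrays;

// Hypothetical flattening of the observations; the layout is an assumption.
class ObservationSketch {
    static double[] observe(int[] stock, int[] totalTrucks,
                            int[] availableTrucks, int[] orderAmounts) {
        int centers = stock.length; // 3 in this model: indices 0, 1, 2
        double[] obs = new double[centers * 3 + orderAmounts.length];
        int k = 0;
        // The same three observations for every center, in index order.
        for (int i = 0; i < centers; i++) {
            obs[k++] = stock[i];
            obs[k++] = totalTrucks[i];
            obs[k++] = availableTrucks[i];
        }
        // Then the current order amount for each distributor (0 if none).
        for (int amount : orderAmounts) {
            obs[k++] = amount;
        }
        return obs;
    }

    public static void main(String[] args) {
        int[] stock = {1200, 300, 800};
        int[] totalTrucks = {3, 3, 3};
        int[] availableTrucks = {2, 0, 3};
        int[] orders = new int[15];
        orders[4] = 750; // distributor 4 has an open order
        double[] obs = observe(stock, totalTrucks, availableTrucks, orders);
        System.out.println(obs.length + " values: " + Arrays.toString(obs));
    }
}
```

With three centers and 15 distributors, this layout yields 3 × 3 + 15 = 24 values per observation.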

Reward Variables - The reward variables are defined in the Reward Variables field. Because the goal of the model is to minimize both wait times and distance traveled, the reward variables track average wait time and average kilometers traveled.
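The two averages can be maintained as running totals. This is a minimal sketch under stated assumptions: it assumes the averages are taken per completed order and updated by a hypothetical `recordDelivery` hook. The method names mirror the `avgWaitTime` and `avgDistanceKM` variables used in the reward function, but the class itself is illustrative.

```java
// Hypothetical accumulator behind the two reward variables.
// Assumes one update per completed order; not the model's actual code.
class RewardVariablesSketch {
    private int deliveries = 0;
    private double totalWaitTime = 0.0;
    private double totalDistanceKM = 0.0;

    // Hypothetical hook, called once per completed order.
    void recordDelivery(double waitTime, double distanceKM) {
        deliveries++;
        totalWaitTime += waitTime;
        totalDistanceKM += distanceKM;
    }

    // Averages over all completed orders so far; 0 before the first delivery.
    double avgWaitTime() {
        return deliveries == 0 ? 0.0 : totalWaitTime / deliveries;
    }

    double avgDistanceKM() {
        return deliveries == 0 ? 0.0 : totalDistanceKM / deliveries;
    }

    public static void main(String[] args) {
        RewardVariablesSketch vars = new RewardVariablesSketch();
        vars.recordDelivery(2.0, 100.0);
        vars.recordDelivery(4.0, 300.0);
        System.out.println(vars.avgWaitTime() + " " + vars.avgDistanceKM()); // 3.0 200.0
    }
}
```

Because these are cumulative averages, the reward function compares their values before and after each action rather than using the raw values directly.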

Actions - This model contains 15 decision points (one for each of the 15 distributors that order products), each with 3 possible actions (which of the 3 manufacturing centers fulfills the order).

The actions are executed in doAction() and tell the model which manufacturer to select when an order needs to be fulfilled. Since there are three manufacturers (0, 1, and 2), each distributor has three possible actions.
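The action structure can be sketched as an array with one center index per distributor. This is a hypothetical illustration: the real doAction() signature and how the model stores the selection may differ, and `assignedCenter`, `centerFor`, and `jointActionCount` are illustrative names.

```java
// Hypothetical doAction: one entry per distributor, each a center index 0..2.
class ActionSketch {
    static final int NUM_DISTRIBUTORS = 15;
    static final int NUM_CENTERS = 3;

    // Which center will fulfill each distributor's next order.
    private final int[] assignedCenter = new int[NUM_DISTRIBUTORS];

    void doAction(int[] action) {
        if (action.length != NUM_DISTRIBUTORS) {
            throw new IllegalArgumentException("expected " + NUM_DISTRIBUTORS + " actions");
        }
        // Validate everything first so a bad action leaves the state unchanged.
        for (int center : action) {
            if (center < 0 || center >= NUM_CENTERS) {
                throw new IllegalArgumentException("invalid center index " + center);
            }
        }
        System.arraycopy(action, 0, assignedCenter, 0, NUM_DISTRIBUTORS);
    }

    int centerFor(int distributor) {
        return assignedCenter[distributor];
    }

    // Size of the joint action space: NUM_CENTERS ^ NUM_DISTRIBUTORS.
    static long jointActionCount() {
        long count = 1;
        for (int d = 0; d < NUM_DISTRIBUTORS; d++) {
            count *= NUM_CENTERS;
        }
        return count;
    }

    public static void main(String[] args) {
        ActionSketch model = new ActionSketch();
        int[] action = new int[NUM_DISTRIBUTORS]; // all zeros: center 0 everywhere
        action[4] = 2;                            // distributor 4 uses center 2
        model.doAction(action);
        System.out.println(model.centerFor(4));   // 2
        System.out.println(jointActionCount());   // 14348907
    }
}
```

Taken together, 15 three-way choices give 3^15 = 14,348,907 possible combinations, which is why testing these decisions by hand is impractical.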

Done - This simulation is set to run for four months, as specified in the Simulation:Main properties.

Event Trigger - The pathmindTrigger event within the Main agent serves as the event trigger in this model. Actions are triggered once per day.
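The interaction between the daily trigger and the four-month "done" condition can be sketched as a simple loop. This is a hypothetical illustration: it assumes four months is approximately 120 days and that the first trigger fires at time 0. The real model's calendar and first firing time may differ, so the count is only approximate.

```java
// Hypothetical episode loop: one decision per simulated day until "done".
class TriggerSketch {
    static final double TRIGGER_INTERVAL_DAYS = 1.0; // trigger fires once per day
    static final double HORIZON_DAYS = 120.0;        // assumption: four months approx. 120 days

    static boolean isDone(double timeDays) {
        return timeDays >= HORIZON_DAYS;
    }

    // Counts how many times the trigger requests actions in one episode,
    // assuming the first firing happens at time 0.
    static int countDecisions() {
        int decisions = 0;
        double timeDays = 0.0;
        while (!isDone(timeDays)) {
            decisions++; // observe, request an action, apply it via doAction
            timeDays += TRIGGER_INTERVAL_DAYS;
        }
        return decisions;
    }

    public static void main(String[] args) {
        System.out.println(countDecisions()); // 120
    }
}
```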

Step 3 - Export model and get Pathmind policy.

Complete the steps in the Exporting Models and Training guide to export your model, complete training, and download the Pathmind policy.

Reward Function

reward -= after.avgWaitTime - before.avgWaitTime; // Minimize wait times
reward -= (after.avgDistanceKM - before.avgDistanceKM) * 20; // Minimize travel distances, scaled up by 20

Note: The change in average kilometers traveled is multiplied by 20 to put it on the same scale as the "minimize wait times" term. Otherwise, the reinforcement learning algorithm would treat distance as less important, because its changes would be much smaller numbers than the changes in average wait time.
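The effect of the scaling factor can be checked with a worked example. The sketch below wraps the tutorial's two reward lines in a standalone method with the scale as a parameter. The before/after values are made-up numbers chosen only to show the arithmetic, and the wait-time unit is left unspecified.

```java
// The tutorial's reward lines as a standalone method; values below are illustrative.
class RewardSketch {
    static double reward(double beforeWait, double afterWait,
                         double beforeKm, double afterKm, double distanceScale) {
        double reward = 0.0;
        reward -= afterWait - beforeWait;               // Minimize wait times
        reward -= (afterKm - beforeKm) * distanceScale; // Minimize travel distances
        return reward;
    }

    public static void main(String[] args) {
        // Assumed step: average wait rises by 0.5, average distance by 0.03125 km.
        double unscaled = reward(2.0, 2.5, 100.0, 100.03125, 1.0);
        double scaled = reward(2.0, 2.5, 100.0, 100.03125, 20.0);
        System.out.println(unscaled); // -0.53125: the distance term is only -0.03125
        System.out.println(scaled);   // -1.125: the distance term is now -0.625
    }
}
```

Without scaling, the distance term contributes roughly 6% of the penalty in this example; with a factor of 20, both terms are of similar size.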

A trained policy file is included in the tutorial folder.

Step 4 - Run the simulation with the Pathmind policy.

Back in AnyLogic, open the pathmindHelper properties and select the downloaded policy file.

Now run the included Monte Carlo experiment to validate the results, and observe how the model behaves when driven by the policy. Wait times are dramatically lower than under the "nearest manufacturing center" heuristic.
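A Monte Carlo experiment runs many replications with different random seeds and summarizes a metric across them. The sketch below shows only that summarizing step, using a toy stand-in for a model run (not the tutorial model). It makes no claim about how AnyLogic's Monte Carlo experiment is implemented, and `summarize` is a hypothetical helper.

```java
import java.util.Random;
import java.util.function.LongToDoubleFunction;

// Hypothetical Monte Carlo summary: one replication per seed, then mean and spread.
class MonteCarloSketch {
    // Runs replications with seeds 0..n-1 and returns {mean, sample standard deviation}.
    static double[] summarize(LongToDoubleFunction replication, int n) {
        double[] values = new double[n];
        double sum = 0.0;
        for (int seed = 0; seed < n; seed++) {
            values[seed] = replication.applyAsDouble(seed);
            sum += values[seed];
        }
        double mean = sum / n;
        double squares = 0.0;
        for (double v : values) {
            squares += (v - mean) * (v - mean);
        }
        double sd = n > 1 ? Math.sqrt(squares / (n - 1)) : 0.0;
        return new double[] {mean, sd};
    }

    public static void main(String[] args) {
        // Toy stand-in for one model run's average wait time; NOT the tutorial model.
        LongToDoubleFunction toyAvgWait = seed -> 2.0 + new Random(seed).nextDouble();
        double[] stats = summarize(toyAvgWait, 100);
        System.out.printf("mean %.3f, sd %.3f%n", stats[0], stats[1]);
    }
}
```

To compare the policy with the heuristic fairly, run both over the same set of seeds so that each pair of runs sees the same random demand.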

Step 5 - Try more reward functions.

To demonstrate the importance of reward shaping, try teaching the policy to care only about distances traveled:

reward -= (after.avgDistanceKM - before.avgDistanceKM) * 20; // Minimize travel distances only

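The shape of the reward graph indicates that the policy has learned well, but the Monte Carlo results are dramatically worse. A good reward curve only shows that the policy is optimizing the reward it was given, and this reward ignores wait times entirely. This tells us that wait times are a stronger signal than travel distances, so craft your reward function carefully.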


Adding reinforcement learning to this model demonstrates how manufacturers can rely on Pathmind to solve problems without spending countless hours on trial and error, unpacking data, and testing actions by hand. The decisions surfaced by reinforcement learning point the way toward greater efficiency.

Beyond the scope of the model used here, reinforcement learning and Pathmind can be applied to larger networks of supply and demand. Reinforcement learning algorithms can adapt to dynamic environments, and their decisions can substantially improve efficiency for businesses operating complex distribution networks.
