1. Complete the Get Started tutorial

2. Download the tutorial files

Simulation Overview

Manufacturers optimize their delivery tactics to maximize profit and minimize product wait times.

When networks of manufacturing facilities and distributors are spread across a large area, they sometimes struggle to manage deliveries. They use simulation models to explore cost-effective ways to adjust deliveries. In those simulations, reinforcement learning can surface the best possible routes and decisions quickly.

The base simulation for the model used in this tutorial is publicly available on AnyLogic Cloud. It has been modified by the Accenture Applied Intelligence team to showcase reinforcement learning using Pathmind.

In the model, cities are shown in a region of Europe. The model uses AnyLogic’s GIS elements to place agents in the correct locations, which are provided by an Excel spreadsheet. These features also allow the delivery trucks to move along real roads.

The model includes three manufacturing centers and fifteen distributors in various locations. Each manufacturing center houses a fleet of three delivery trucks.

Within the distributor agent, a generateDemand event creates orders. Each order is for a random quantity of between 500 and 1000 units, and orders arrive at random intervals of 1 to 2 days. Once an order is received, the reinforcement learning agent determines which manufacturer can fulfill it most quickly.
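
The demand pattern described above can be sketched in plain Java, outside AnyLogic. This is only an illustration: the class and method names are hypothetical, and the uniform distribution is an assumption, since the tutorial says only that quantities and intervals are random. The actual model uses the generateDemand event rather than this code.

```java
import java.util.Random;

/** Hypothetical stand-in for the distributor's generateDemand logic. */
class DemandSketch {
    static final int MIN_UNITS = 500;
    static final int MAX_UNITS = 1000;
    static final double MIN_DAYS = 1.0;
    static final double MAX_DAYS = 2.0;

    /** Order size in units, drawn uniformly from 500..1000 inclusive (uniformity is an assumption). */
    static int nextOrderQuantity(Random rng) {
        return MIN_UNITS + rng.nextInt(MAX_UNITS - MIN_UNITS + 1);
    }

    /** Days until the next order, drawn uniformly from [1, 2) (uniformity is an assumption). */
    static double nextOrderDelayDays(Random rng) {
        return MIN_DAYS + rng.nextDouble() * (MAX_DAYS - MIN_DAYS);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double day = 0.0;
        // Print the first five orders of one distributor.
        for (int i = 0; i < 5; i++) {
            day += nextOrderDelayDays(rng);
            System.out.printf("day %.2f: order of %d units%n", day, nextOrderQuantity(rng));
        }
    }
}
```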

Determining which manufacturer should handle an order depends on several time-dependent factors. The manufacturer will not send out an order, for example, if it does not have enough inventory in stock to fulfill it. The added time needed to produce more inventory is determined by the production processing diagram and order processing flow chart within the ManufacturingCenter agent. This is the key bottleneck solved by reinforcement learning.

Another important factor that impacts delivery time is the distance between the distributor and manufacturer. Sending a truck from a manufacturer that has the inventory to immediately fulfill an order may not be the fastest solution if it is many kilometers away from the distributor.

The model considers these factors and seeks to select the manufacturer that should fulfill an order while minimizing wait times and the distance driven. The best case scenario: the manufacturer nearest to the ordering distributor would have enough inventory in stock to complete the order, since that would result in minimal wait times for both production and travel.
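
To make the trade-off between production wait and travel time concrete, here is a minimal sketch that estimates delivery time for each manufacturer and picks the smallest. It is not the model's logic and not what the trained policy computes. The class, the method names, and the parameters unitsPerDay and kmPerDay are hypothetical simplifications of the production and routing behavior described above.

```java
/** Hypothetical trade-off sketch: production wait versus travel time. */
class FulfillmentEstimate {

    /** Estimated days until delivery from one manufacturer (all parameters are illustrative). */
    static double estimateDays(int stock, int orderUnits, double unitsPerDay,
                               double distanceKm, double kmPerDay) {
        int shortfall = Math.max(0, orderUnits - stock);  // units that still have to be produced
        double productionDays = shortfall / unitsPerDay;  // zero when stock already covers the order
        double travelDays = distanceKm / kmPerDay;
        return productionDays + travelDays;
    }

    /** Index of the manufacturer with the smallest estimated delivery time. */
    static int fastest(int[] stock, double[] distanceKm, int orderUnits,
                       double unitsPerDay, double kmPerDay) {
        int best = 0;
        double bestDays = Double.POSITIVE_INFINITY;
        for (int i = 0; i < stock.length; i++) {
            double days = estimateDays(stock[i], orderUnits, unitsPerDay, distanceKm[i], kmPerDay);
            if (days < bestDays) {
                bestDays = days;
                best = i;
            }
        }
        return best;
    }
}
```

With these illustrative inputs, a nearby manufacturer with no stock can lose to a more distant one that can ship immediately, which is the situation the paragraphs above describe.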


Step 1 - Run the simulation to check Pathmind Helper setup.

Go through the steps in the Pathmind Helper Setup guide to make sure that everything is working correctly. Completing this step will also demonstrate how the model performs using random actions instead of following a trained policy.

Step 2 - Examine the Pathmind properties.

Observations - Each of the three manufacturing centers is assigned an index: 0, 1, or 2. The observations function loops over those indices to make the same observations at all three locations. Those observations include stock levels, available trucks, and order amounts for each distributor.
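
As a rough sketch of what such an observations function might assemble, the code below flattens per-manufacturer values and per-distributor order amounts into one array. The layout, the class name, and the parameter names are assumptions made for illustration. The model's Observations field in the pathmindHelper properties is the authoritative definition.

```java
/** Hypothetical observation layout: [stock0, trucks0, stock1, trucks1, stock2, trucks2, order0..orderN]. */
class ObservationSketch {

    static double[] observations(int[] stock, int[] freeTrucks, int[] orderAmounts) {
        if (stock.length != freeTrucks.length) {
            throw new IllegalArgumentException("one stock level and one truck count per manufacturer");
        }
        double[] obs = new double[stock.length * 2 + orderAmounts.length];
        int k = 0;
        // The same two observations are taken at each manufacturing center (indices 0, 1, 2).
        for (int i = 0; i < stock.length; i++) {
            obs[k++] = stock[i];
            obs[k++] = freeTrucks[i];
        }
        // One open order amount per distributor (0 when nothing is outstanding).
        for (int amount : orderAmounts) {
            obs[k++] = amount;
        }
        return obs;
    }
}
```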

Metrics - The metrics are defined in the Metrics field. Since the goal of the model is to minimize delivery delays, the metrics track average wait times and average kilometers traveled.

Actions - This model contains 15 decision points (each of the 15 distributors orders products) with 3 possible actions each (which of the 3 manufacturing centers fulfills the order).

The actions are executed in doAction(). The actions tell the model which of the manufacturers to select when an order needs to be fulfilled. Since there are three total manufacturers (0, 1, and 2), the model has three total possible actions for each distributor.
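
The action structure can be sketched as one choice per distributor. The tutorial text does not show the signature of doAction(), so the example uses a hypothetical assign(int[] actions) helper rather than guessing at the real method. It also counts the joint combinations of 15 independent choices among 3 manufacturers, which hints at why hand-testing every option is impractical.

```java
/** Hypothetical sketch of the action space: one manufacturer index per distributor. */
class ActionSketch {
    static final int NUM_DISTRIBUTORS = 15;
    static final int NUM_MANUFACTURERS = 3;

    /** Validates one choice per distributor and returns the distributor-to-manufacturer assignment. */
    static int[] assign(int[] actions) {
        if (actions.length != NUM_DISTRIBUTORS) {
            throw new IllegalArgumentException("expected " + NUM_DISTRIBUTORS + " actions");
        }
        int[] supplier = new int[NUM_DISTRIBUTORS];
        for (int d = 0; d < NUM_DISTRIBUTORS; d++) {
            if (actions[d] < 0 || actions[d] >= NUM_MANUFACTURERS) {
                throw new IllegalArgumentException("action out of range for distributor " + d);
            }
            supplier[d] = actions[d];  // distributor d is served by manufacturer actions[d]
        }
        return supplier;
    }

    /** Number of joint assignments: 3 choices for each of 15 distributors, that is 3^15. */
    static long jointActionCount() {
        long count = 1;
        for (int d = 0; d < NUM_DISTRIBUTORS; d++) {
            count *= NUM_MANUFACTURERS;
        }
        return count;
    }
}
```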

Event Trigger - The pathmindTrigger event within the Main agent serves as the event trigger in this model. Actions are triggered once per day.

Done - This simulation is set to run for four months, as specified in the Simulation:Main properties.

Step 3 - Export model and get Pathmind policy.

Complete the steps in the Exporting Models and Training guide to export your model, complete training, and download the Pathmind policy.

Reward Function

reward -= after.avgWaitTime - before.avgWaitTime; // Minimize wait times
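
To see how this reward behaves, the following self-contained sketch wraps the same expression in a small class. The Metrics class here is a hypothetical stand-in for the before and after snapshots. Only the field name avgWaitTime and the reward expression come from the tutorial.

```java
/** Hypothetical harness around the tutorial's wait-time reward expression. */
class WaitTimeReward {

    /** Snapshot of the metric at one decision point (stand-in for before/after). */
    static final class Metrics {
        final double avgWaitTime;

        Metrics(double avgWaitTime) {
            this.avgWaitTime = avgWaitTime;
        }
    }

    static double reward(Metrics before, Metrics after) {
        double reward = 0.0;
        reward -= after.avgWaitTime - before.avgWaitTime; // Minimize wait times
        return reward;
    }

    public static void main(String[] args) {
        // Average wait fell between two decision points: the reward is positive.
        System.out.println(reward(new Metrics(2.0), new Metrics(1.5)));
        // Average wait rose: the reward is negative.
        System.out.println(reward(new Metrics(2.0), new Metrics(2.5)));
    }
}
```

Because the reward is the negated change in average wait time, the policy is rewarded whenever an action is followed by a drop in the average and penalized when the average rises.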

A policy will be generated after training completes. A trained policy file is included in the tutorial folder.

Step 4 - Run the simulation with the Pathmind policy.

Back in AnyLogic, open the pathmindHelper properties and locate the policy file.

Now run the included Monte Carlo experiment to validate the results, and observe how the model behaves when the policy is in use. Wait times are dramatically lower than with the "nearest manufacturing center" heuristic.
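
If you copy the per-run wait times out of the experiment, a small helper like the one below can summarize them for comparison between the policy and the heuristic. This is a hypothetical sketch in plain Java. The class and method names are illustrative and are not part of AnyLogic or Pathmind.

```java
/** Hypothetical helper for summarizing one metric across Monte Carlo replications. */
class ReplicationSummary {

    /** Arithmetic mean of the per-replication values. */
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no replications");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Sample standard deviation (n - 1 denominator); 0 when there is only one replication. */
    static double sampleStdDev(double[] values) {
        double m = mean(values);  // also rejects an empty array
        if (values.length < 2) {
            return 0.0;
        }
        double squared = 0.0;
        for (double v : values) {
            squared += (v - m) * (v - m);
        }
        return Math.sqrt(squared / (values.length - 1));
    }
}
```

Comparing the means alongside their spread gives a better sense of whether a difference between two strategies is consistent across runs or driven by a few replications.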

Step 5 - Try more reward functions.

To demonstrate the importance of reward shaping, try teaching the policy to only care about distances traveled.

reward -= (after.avgDistanceKM - before.avgDistanceKM); // Minimize travel distance

The shape of the reward graph indicates that the policy has learned well, but the Monte Carlo results are much worse. At best, the policy learned the "nearest manufacturing center" heuristic. This tells us that the distance reward does not capture enough "signal" to teach the policy where it can extract new efficiencies in the system.
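
One further variation you might experiment with is a weighted blend of the two terms. The sketch below is untested against this model: the weights are arbitrary placeholders, the class and parameter names are hypothetical, and nothing here implies that a blended reward will outperform the wait-time reward.

```java
/** Hypothetical blended reward: weighted sum of the wait-time and distance terms. */
class CombinedReward {

    static double reward(double waitBefore, double waitAfter,
                         double kmBefore, double kmAfter,
                         double waitWeight, double distanceWeight) {
        double reward = 0.0;
        reward -= waitWeight * (waitAfter - waitBefore);   // penalize growth in average wait time
        reward -= distanceWeight * (kmAfter - kmBefore);   // penalize growth in average distance
        return reward;
    }
}
```

Because wait times and kilometers are on different scales, the weights decide which term dominates. Setting distanceWeight to 0 reduces this to the wait-time reward.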


Adding reinforcement learning to this model demonstrates how manufacturers can rely on Pathmind to solve problems without spending countless hours on trial and error, unpacking data and testing actions by hand. The decisions surfaced by reinforcement learning point the way toward greater efficiency.

Beyond the scope of the model used here, reinforcement learning and Pathmind can be applied to larger networks of supply and demand. Reinforcement learning algorithms are able to understand and respond to dynamic environments, and their decisions can be game-changing for businesses operating complex distribution networks.
