1. Complete the Get Started tutorial.

2. Download the tutorial files.


This tutorial walks you through a reinforcement learning implementation for an automated guided vehicle (AGV) fleet management problem, which incorporates Multi-Controller, Shared Policy reinforcement learning technology. The goal of the tutorial is to introduce you to Multi-Controller, Shared Policy in a fleet management problem and to showcase three advantages that reinforcement learning has over heuristics: (1) its implementation is straightforward; (2) it yields a policy that responds quickly to variable conditions; and (3) it yields emergent behavior that is difficult to capture with even the most complex heuristics.

Simulation Overview

A fleet of automated guided vehicles (AGVs) optimizes its dispatching routes to maximize product throughput in a manufacturing center. When component parts arrive to be processed, they must be brought to the appropriate machine according to a specific processing sequence. In theory, AGVs can increase the delivery rate at correct locations, maximizing the output rate of finished products. In practice, however, coordinating AGV tasks in the face of variable processing times, maintenance shutdowns and supply arrivals can be difficult. Inefficiency in AGV fleet management can result in lost time and under-utilization of resources. In this example, an AnyLogic simulation of the processing warehouse demonstrates how reinforcement learning can be used to outperform a shortest queue heuristic by 79%.

The AnyLogic simulation, shown above, consists of a fleet of four AGVs that pick up and drop off objects at different processing machines throughout the warehouse. Each AGV can carry at most one product (indicated by blue, green and pink squares) or one raw material (indicated by yellow circles). Precursor products and raw materials arrive on the left of the figure above, then are moved by the AGVs in stages to the right. At each stage, a given product must be combined with one raw material to pass through a processing machine. The raw material is consumed in the process, and the product is then ready to be moved to the next stage. Products must be processed in a specific sequence (stage 1, 2, 3) before exiting the system. Each processing stage has three machines with variable processing times.

Introduction to the Multi-Controller, Shared Policy Method

Reinforcement learning offers a simple, scalable solution to this fleet management problem. The included policy is trained using Pathmind’s Multi-Controller, Shared Policy functionality. The logic behind Multi-Controller, Shared Policy, which is schematized in the figure below, is straightforward. Each AGV has its own “controller,” which can be triggered individually. When triggered, a controller gathers observations specific to its own state and then performs an action based on the policy it shares with the controllers of all other AGVs. That individual action changes the environment in either a beneficial or a detrimental way, which is reflected in its metrics.
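The controller logic described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name SharedPolicySketch, the Controller type, and the toy policy are hypothetical and are not part of Pathmind Helper or the tutorial model. The point it shows is that every controller holds a reference to the same policy object but feeds it its own observations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical sketch: several controllers, one shared policy.
class SharedPolicySketch {

    /** One controller per AGV; every controller references the same policy. */
    static class Controller {
        final int agvId;
        final ToIntFunction<double[]> sharedPolicy;

        Controller(int agvId, ToIntFunction<double[]> sharedPolicy) {
            this.agvId = agvId;
            this.sharedPolicy = sharedPolicy;
        }

        /** Triggered individually: pass this AGV's own observations to the shared policy. */
        int chooseAction(double[] ownObservations) {
            return sharedPolicy.applyAsInt(ownObservations);
        }
    }

    /** Builds a fleet in which all controllers share a single policy instance. */
    static List<Controller> buildFleet(int fleetSize, ToIntFunction<double[]> policy) {
        List<Controller> fleet = new ArrayList<>();
        for (int id = 0; id < fleetSize; id++) {
            fleet.add(new Controller(id, policy));
        }
        return fleet;
    }

    public static void main(String[] args) {
        // Toy stand-in for a trained policy (assumption, for illustration only).
        ToIntFunction<double[]> toyPolicy = obs -> obs[0] > 0.5 ? 1 : 0;
        List<Controller> fleet = buildFleet(4, toyPolicy);

        // Same policy, different observations, possibly different actions.
        System.out.println(fleet.get(0).chooseAction(new double[] {0.9}));
        System.out.println(fleet.get(3).chooseAction(new double[] {0.1}));
    }
}
```

Because the policy is shared, adding a fifth AGV in this sketch only means adding another controller; nothing about the policy object itself changes.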

Notably, the metrics (i.e. rewards) and observations in Multi-Controller, Shared Policy can include individual metrics and observations as well as shared, collective metrics and observations. For example, if one AGV drops off a product at a machine that is shut down, that individual AGV should be penalized. This individual metric is defined as “deliveryAtShutdown”. On the other hand, if a product passes successfully and quickly through the full processing sequence, then all AGVs should be rewarded. This collective metric is “totalThroughput”. Examples of observations and metrics that mix individual and collective information are shown below. Blue, green and red boxes represent observations, rewards and penalties, respectively. Collective information is outlined with a solid black line, and individual information is outlined with a dashed black line.
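As a rough illustration of how one collective metric and one individual metric might be combined per AGV, consider the sketch below. The metric names totalThroughput and deliveryAtShutdown come from the text above; the class name, the method, and both weights are assumptions chosen for readability and are not the values used in the tutorial model.

```java
import java.util.Arrays;

// Hypothetical sketch: one reward signal per AGV, mixing a collective
// metric (shared by the fleet) with an individual metric (per AGV).
class SharedPolicyMetricsSketch {

    // Illustrative weights only; not taken from the tutorial model.
    static final double THROUGHPUT_WEIGHT = 1.0;
    static final double SHUTDOWN_PENALTY = 1.0;

    /**
     * @param totalThroughput    collective metric, identical for every AGV
     * @param deliveryAtShutdown individual metric, one counter per AGV
     * @return one reward signal per AGV controller
     */
    static double[] rewardsPerAgv(double totalThroughput, int[] deliveryAtShutdown) {
        double[] rewards = new double[deliveryAtShutdown.length];
        for (int agv = 0; agv < rewards.length; agv++) {
            rewards[agv] = THROUGHPUT_WEIGHT * totalThroughput
                    - SHUTDOWN_PENALTY * deliveryAtShutdown[agv];
        }
        return rewards;
    }

    public static void main(String[] args) {
        // Four AGVs; only the AGV at index 2 delivered to a machine that was shut down.
        double[] rewards = rewardsPerAgv(5.0, new int[] {0, 0, 1, 0});
        System.out.println(Arrays.toString(rewards));
    }
}
```

In this sketch every AGV benefits equally from fleet throughput, while only the AGV that made the bad delivery sees its own signal reduced.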


Step 1 - Perform a run with random actions to check Pathmind Helper setup.

Go through the steps of the Check Pathmind Helper Setup guide to make sure that everything is working correctly. Completing this step will also demonstrate how the model performs using random actions instead of a policy.

Step 2 - Examine the reinforcement learning elements.

Observations – Before choosing an action, each AGV observes its environment. The observations can be grouped into two categories: (1) the position of the AGV that is currently observing; and (2) the status of each line in the manufacturing center, including where products are available for pickup, what the queue sizes are, and how much processing time remains for each product being processed.

Actions – Each AGV independently chooses the origin and destination of its next trip. The AGV picks up a product or raw material at the chosen origin and drops it at the chosen destination. There are 11 possible origins (9 lines and 2 raw material source lines) and there are 9 possible destinations (3 stages x 3 machines). The Actions, shown below, are defined by 3 different elements: (A) the origin, an integer with 11 possible values; (B) the destination stage, an integer with 3 possible values; and (C) the destination machine, an integer with 3 possible values.
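The three action elements described above imply 11 x 3 x 3 = 99 distinct origin-destination choices. The sketch below shows one way to represent such an action and to convert it to and from a single index. It is not the Actions class from the model: the class name, the zero-based numbering, and the index layout are all assumptions made for illustration.

```java
// Hypothetical sketch of the three-part action: origin, destination stage,
// destination machine. All values are zero-based here (an assumption).
class AgvActionSketch {
    static final int NUM_ORIGINS = 11;  // 9 lines + 2 raw material source lines
    static final int NUM_STAGES = 3;
    static final int NUM_MACHINES = 3;

    final int origin;
    final int stage;
    final int machine;

    AgvActionSketch(int origin, int stage, int machine) {
        if (origin < 0 || origin >= NUM_ORIGINS) {
            throw new IllegalArgumentException("origin out of range: " + origin);
        }
        if (stage < 0 || stage >= NUM_STAGES) {
            throw new IllegalArgumentException("stage out of range: " + stage);
        }
        if (machine < 0 || machine >= NUM_MACHINES) {
            throw new IllegalArgumentException("machine out of range: " + machine);
        }
        this.origin = origin;
        this.stage = stage;
        this.machine = machine;
    }

    /** Number of distinct (origin, stage, machine) combinations. */
    static int actionSpaceSize() {
        return NUM_ORIGINS * NUM_STAGES * NUM_MACHINES;
    }

    /** Flattens the three elements into one index in [0, actionSpaceSize()). */
    int toIndex() {
        return (origin * NUM_STAGES + stage) * NUM_MACHINES + machine;
    }

    /** Inverse of toIndex(). */
    static AgvActionSketch fromIndex(int index) {
        int machine = index % NUM_MACHINES;
        int stage = (index / NUM_MACHINES) % NUM_STAGES;
        int origin = index / (NUM_MACHINES * NUM_STAGES);
        return new AgvActionSketch(origin, stage, machine);
    }
}
```

Keeping the three elements separate, as the tutorial does, makes each choice easy to read; the flattened index is shown only to make the size of the action space concrete.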

Note that the Actions class has a function doIt(), which passes the origin and destination values to a doAction() function that is defined in Main. Below are the details on how doAction() instructs each individual AGV to visit the appropriate origin and destination.

Because the raw materials are consumed at process lines, they can only be picked up from the raw material source line. The above logic allows for raw materials to be taken from the source line to any process. Products, however, are automatically brought to their next processing stage (stage 1 -> stage 2 -> stage 3), with the specific machine chosen by the output of Actions.

For more information on the mapping between Actions and origin-destination pairs, please examine the bodies of the sendTo_sourceLine and sendTo_processLine functions.
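The routing rule described above (raw materials may go to any process, products always advance to their next stage) can be summarized in a short sketch. This is not the body of doAction(), sendTo_sourceLine, or sendTo_processLine; the class, enum, and method names are hypothetical, and the numbering (stages 1 to 3, machines 0 to 2, completed stage 0 for a product that has not been processed yet) is an assumption.

```java
// Hypothetical sketch of the destination rule; not the model's doAction().
class DestinationRuleSketch {
    enum Cargo { RAW_MATERIAL, PRODUCT }

    static final int LAST_STAGE = 3;

    /**
     * @param cargo          what the AGV picked up at the origin
     * @param completedStage last stage the product finished (0 if none); ignored for raw material
     * @param chosenStage    destination stage from the action (1..3)
     * @param chosenMachine  destination machine from the action (0..2)
     * @return {stage, machine} the AGV should drive to
     */
    static int[] resolveDestination(Cargo cargo, int completedStage,
                                    int chosenStage, int chosenMachine) {
        if (cargo == Cargo.RAW_MATERIAL) {
            // Raw material can be delivered to any process: use the action as given.
            return new int[] {chosenStage, chosenMachine};
        }
        if (completedStage >= LAST_STAGE) {
            throw new IllegalStateException("product has already finished all stages");
        }
        // Products follow the fixed sequence; only the machine comes from the action.
        return new int[] {completedStage + 1, chosenMachine};
    }
}
```

For example, under these assumptions a product that has completed stage 1 is sent to stage 2 regardless of the stage value in the action, while a raw material is sent exactly where the action says.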

Metrics – As discussed in the “Introduction to Multi-Controller, Shared Policy” section, each AGV is rewarded individually for the competence of its previous action, and the population of AGVs is rewarded collectively for the overall throughput. The field below shows the specific reward variables that define both individual and collective rewards.

Step 3 – Evaluate the Pathmind policy

Included with the model is a pre-trained Pathmind Policy that was trained in the 4 AGV scenario. First, reference the trained policy in the PathmindHelper “Policy File” field:

Next, run the Monte Carlo experiment for 100 episodes and take note of the performance. Now repeat this Monte Carlo experiment with the heuristic. Compare the totalThroughput of the policy to that of the heuristic. A representative Monte Carlo comparison of the policy to the heuristic is shown below.

The mean throughput is ~43 products when AGVs follow the heuristic and ~77 products when they are controlled by the Pathmind policy. The Pathmind policy outperforms the heuristic by roughly 79%.
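The 79% figure follows from the two approximate means quoted above: (77 - 43) / 43 is roughly 0.79. The sketch below shows the arithmetic; the class and method names are hypothetical, and the mean() helper is included only as a convenience for averaging your own per-episode throughput values from the Monte Carlo experiment.

```java
// Hypothetical helper for comparing Monte Carlo results.
class ThroughputComparisonSketch {

    /** Arithmetic mean of per-episode throughput values. */
    static double mean(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no episodes to average");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Improvement of candidate over baseline, as a percentage of the baseline. */
    static double relativeImprovementPercent(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Approximate means quoted in the text: heuristic ~43, policy ~77.
        double improvement = relativeImprovementPercent(43.0, 77.0);
        System.out.printf("Improvement over heuristic: %.1f%%%n", improvement);
    }
}
```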

Step 4 – Explore how Pathmind outperforms the heuristic

Showing that the intelligent policy can substantially outperform a simple routing heuristic is only half the battle. The second half is understanding, interpreting and harnessing the results. In this section, you will interpret the policy results shown above. To do so, start by running the simulation using the policy.

The first thing you may notice is that the policy essentially ignores machines 3, 7, and 8. The policy appears to have learned that ignoring these three machines reduces travel times and thus greatly improves throughput.

Second, the AGVs appear to be working together. If you focus on the top left corner, one AGV seems to concentrate on supplying raw material while another concentrates on picking up products.

The Pathmind-controlled AGVs are showing emergent behavior: they have learned to ignore certain machines and to tag-team on others, which greatly improves throughput. This type of behavior would be incredibly difficult to encode in a heuristic.

Step 5 – Train an intelligent Pathmind policy

To train your own version of the intelligent policy, refer to the Exporting Models and Training guide. This guide walks you through exporting your model, completing training, and downloading the Pathmind policy.

Reward function

reward += after.totalThroughput * 0.01;
reward += (after.machineUtil - before.machineUtil) * 100;
reward += (after.essentialDelivery - before.essentialDelivery) / 60 * 0.5;
reward -= after.fullQueue - before.fullQueue;
reward -= after.fullConveyor - before.fullConveyor;
reward -= after.emptyOrigins - before.emptyOrigins;
reward -= after.tripDuration / 140;

The reward function used to train the policy included in this tutorial is shown above. Notice that the reward function combines collective rewards such as totalThroughput and machineUtil with individual rewards such as essentialDelivery (wherein an AGV delivers a raw material to a product line where it is needed).
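To experiment with the reward terms outside of the model, the same expression can be wrapped in a small standalone method, as sketched below. The Metrics class is a hypothetical stand-in for the before and after snapshots; its fields reuse the variable names from the reward function, but typing them all as double is an assumption, and this is not Pathmind's API.

```java
// Hypothetical standalone version of the reward expression for experimentation.
class RewardFunctionSketch {

    /** Stand-in for a snapshot of the reward variables (field types are assumed). */
    static class Metrics {
        double totalThroughput;
        double machineUtil;
        double essentialDelivery;
        double fullQueue;
        double fullConveyor;
        double emptyOrigins;
        double tripDuration;
    }

    /** Computes the reward for one action from snapshots taken before and after it. */
    static double reward(Metrics before, Metrics after) {
        double reward = 0.0;
        // Collective terms: fleet throughput and change in machine utilization.
        reward += after.totalThroughput * 0.01;
        reward += (after.machineUtil - before.machineUtil) * 100;
        // Individual term: raw material delivered where it was needed.
        reward += (after.essentialDelivery - before.essentialDelivery) / 60 * 0.5;
        // Penalties: full queues, full conveyors, empty origins, long trips.
        reward -= after.fullQueue - before.fullQueue;
        reward -= after.fullConveyor - before.fullConveyor;
        reward -= after.emptyOrigins - before.emptyOrigins;
        reward -= after.tripDuration / 140;
        return reward;
    }
}
```

Note that most terms use the difference between the after and before snapshots, so they reward or penalize what changed as a result of the action, whereas totalThroughput and tripDuration use the after value directly.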

Try influencing AGVs to adopt different behavior by adding, deleting or modifying the reward terms in the reward function and running new experiments.


Pathmind was used to adapt an AGV fleet management problem for reinforcement learning. When trained on a 4 AGV scenario, the resulting policy is able to outperform a simple heuristic by 79%. This behavior showcases one of the strengths of reinforcement learning over traditional, static methods: complex, emergent behavior.
