Note: This tutorial can be completed with AnyLogic PLE. If using AnyLogic Professional, also see the Getting Started with Simple Stochastic Model tutorial.


  1. Sign up for Pathmind

  2. Have AnyLogic installed

  3. Download and install Pathmind Helper

  4. Download the Getting Started with Product Delivery tutorial folder


Pathmind helps simulation modelers use reinforcement learning to optimize simulations. With Pathmind, new decision paths can be found that improve performance -- without spending hours on manual testing. Pathmind policies also tend to handle variable inputs in dynamic environments better than other optimizers.

This Getting Started guide will show you how a trained Pathmind AI agent performs in an AnyLogic model. Everything you need, including a model with reinforcement learning functions already written, is in the tutorial folder listed in the prerequisites above.

You'll begin by uploading the model to the Pathmind Training web application to see how a policy is trained. A trained policy file is also included in the tutorial folder, so you won't need to wait for training to complete to go through the tutorial. Finally, we'll show you how the Pathmind policy performs in AnyLogic and review how we set the model up for reinforcement learning success.

Model Background

The AnyLogic model will be used to optimize product deliveries in a network of manufacturers and distribution centers. In a real-world setting, manufacturers relying on those networks seek to maximize profit and minimize wait times. When the networks are spread across a large area, making deliveries quickly and efficiently can be hard.

Simulation allows you to explore cost-effective ways to adjust delivery tactics to increase speed and profit. The example Pathmind policy provided with this tutorial demonstrates how reinforcement learning can surface better possible routes and decisions needed for optimal performance in distribution networks and other applications.

The base simulation for the model used in this tutorial is publicly available on AnyLogic Cloud. It has been modified by the Accenture Applied Intelligence team to showcase reinforcement learning using Pathmind.

Before you begin, you may want to take a high-level look at the model. Ignore the reinforcement learning elements for now and just focus on how the model operates.

Cities are shown in a region of Europe. The model uses AnyLogic’s GIS elements to place agents in the correct locations, which are provided by an Excel spreadsheet. These features also allow the delivery trucks to move along real roads.

The model includes three manufacturing centers and fifteen distributors in various locations. Each manufacturing center houses a fleet of three delivery trucks.

Within the distributor agent, a 'generateDemand' Event creates orders. Each order is for a random quantity of between 500 and 1,000 units, and orders arrive at random intervals of 1 to 2 days. Once an order is received, the reinforcement learning agent will determine which manufacturer can fulfill the order most quickly.
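To make the demand pattern concrete, here is a minimal standalone sketch of how orders like these could be generated. It is not the code inside the model's 'generateDemand' Event. The class and method names are hypothetical, and the uniform distribution is an assumption, since the tutorial only says the values are random.

```java
import java.util.Random;

// Hypothetical sketch of the demand pattern described above; not the model's own code.
class DemandSketch {
    // Bounds taken from the tutorial text.
    static final int MIN_UNITS = 500;
    static final int MAX_UNITS = 1000;
    static final double MIN_DAYS = 1.0;
    static final double MAX_DAYS = 2.0;

    /** Draws an order quantity from 500 to 1000 units inclusive (uniform is an assumption). */
    static int nextOrderQuantity(Random rng) {
        return MIN_UNITS + rng.nextInt(MAX_UNITS - MIN_UNITS + 1);
    }

    /** Draws the delay until the next order, at least 1 and less than 2 days (uniform is an assumption). */
    static double nextOrderDelayDays(Random rng) {
        return MIN_DAYS + rng.nextDouble() * (MAX_DAYS - MIN_DAYS);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double day = 0.0;
        // Print the first five orders for a single distributor.
        for (int i = 0; i < 5; i++) {
            day += nextOrderDelayDays(rng);
            System.out.printf("day %.2f: order for %d units%n", day, nextOrderQuantity(rng));
        }
    }
}
```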

Determining which manufacturer should handle an order depends on several time-dependent factors. The manufacturer will not send out an order, for example, if it does not have enough inventory in stock to fulfill it. The added time needed to produce more inventory is determined by the production processing diagram and order processing flow chart within the ManufacturingCenter agent. This is the key bottleneck solved by reinforcement learning.

Another important factor that impacts delivery time is the distance between the distributor and manufacturer. Sending a truck from a manufacturer that has the inventory to immediately fulfill an order may not be the fastest solution if it is many kilometers away from the distributor.

The model considers all of these factors and seeks to select the manufacturer that should fulfill an order while minimizing wait times and the distance driven. The best case scenario: the manufacturer nearest to the ordering distributor would have enough inventory in stock to complete the order, since that would result in minimal wait times for both production and travel.
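The trade-off between stock and distance can be illustrated with a little arithmetic. The sketch below is not taken from the model: the truck speed, the production rate, and the class and method names are all made-up values chosen only to show why the nearest manufacturer is not always the fastest choice.

```java
// Hypothetical illustration of the stock-versus-distance trade-off; all numbers are invented.
class FulfillmentEstimate {
    // Assumed parameters, not taken from the model.
    static final double TRUCK_SPEED_KM_PER_HOUR = 50.0;
    static final double PRODUCTION_UNITS_PER_HOUR = 25.0;

    /** Estimated hours until delivery: time to produce any shortfall plus travel time. */
    static double estimatedHours(int stock, int orderUnits, double distanceKm) {
        int shortfall = Math.max(0, orderUnits - stock);
        double productionHours = shortfall / PRODUCTION_UNITS_PER_HOUR;
        double travelHours = distanceKm / TRUCK_SPEED_KM_PER_HOUR;
        return productionHours + travelHours;
    }

    public static void main(String[] args) {
        int orderUnits = 800;
        // Near center, short on stock: 300 units must be produced before the truck leaves.
        double near = estimatedHours(500, orderUnits, 100.0);
        // Far center, fully stocked: no production wait, but a longer drive.
        double far = estimatedHours(1000, orderUnits, 400.0);
        System.out.println("near center: " + near + " h, far center: " + far + " h");
    }
}
```

With these invented numbers, the far but fully stocked center delivers sooner (8 hours) than the near center that has to produce 300 units first (14 hours).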

Uploading to Pathmind Training

Pathmind Training is an easy-to-use web application where AI training takes place in the cloud. Note the hierarchy within the application:

Projects > Models > Experiments

Projects are at the highest level and hold all of the models being used to solve a problem. Multiple models can be added to a Project, so that all of the various model versions being edited and updated for a single Project are in one place. For each of those models, multiple experiments can be run, each with different reward functions and observations.

Once the reinforcement learning functions have been added to a model, it can be uploaded to Pathmind Training. The model provided with this tutorial already has those functions in place, so we will start with uploading.

To begin, open Pathmind Training and click '+ New Project.' You will be prompted to create a name for the project. Enter something like 'Getting Started - Product Delivery' and click 'Create Project.'

Now you will need to upload a model. Normally, you would need to export your model from AnyLogic as a standalone Java application, but we've completed that step for this tutorial. Click 'Upload as Zip' beneath the 'Upload Exported Folder' button and select the 'ProductDeliveryExport' zip file included in the tutorial download.

Next, you will have the option of adding your model's ALP file to help with version tracking. For now, skip this step and click 'Next.' On the following screen, you may add any notes about the model you are uploading. Click 'Next' again.

You will see that Pathmind Training has automatically pulled the reward variables from the model. This feature is possible because of the reward function that was written in the model itself before exporting. Later, you will see the full reward function as it appears in the model.

In some cases, you may wish to add a goal to help determine success, which can be added at this stage. For now, just note that we are tracking Average Wait Time and Average Distance in Kilometers and click 'Next.'

The next step is writing the reward function. This is a crucial moment since a good reward function teaches the AI agent to accomplish your goals, and it will help get the best results in training. Starting simple is generally best, but the function will depend on the model and which metrics are being optimized.

Copy and paste the following reward function into the 'Reward Function' field.

reward -= after.avgWaitTime - before.avgWaitTime; // Minimize wait times
reward -= (after.avgDistanceKM - before.avgDistanceKM) * 20; // Minimize travel distances, scaled up by 20

Note: Kilometers traveled is multiplied by 20 so that it returns values on the scale of the "minimize wait times" reward. Otherwise, the reinforcement learning algorithm would think that kilometers traveled is less important because it would be a much smaller number than the average wait time values.
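To see how the two terms combine, here is a small standalone sketch that evaluates the same reward arithmetic for one step. The RewardVariables holder class and the sample numbers are hypothetical. In Pathmind Training, 'reward', 'before' and 'after' are provided for you, so only the two lines above go into the 'Reward Function' field.

```java
// Standalone sketch of the reward arithmetic; the holder class and sample values are hypothetical.
class RewardSketch {
    /** Hypothetical holder for the two reward variables tracked in this tutorial. */
    static final class RewardVariables {
        final double avgWaitTime;
        final double avgDistanceKM;

        RewardVariables(double avgWaitTime, double avgDistanceKM) {
            this.avgWaitTime = avgWaitTime;
            this.avgDistanceKM = avgDistanceKM;
        }
    }

    /** Applies the same two lines as the tutorial's reward function. */
    static double reward(RewardVariables before, RewardVariables after) {
        double reward = 0.0;
        reward -= after.avgWaitTime - before.avgWaitTime;            // penalize longer waits
        reward -= (after.avgDistanceKM - before.avgDistanceKM) * 20; // penalize longer trips, scaled by 20
        return reward;
    }

    public static void main(String[] args) {
        RewardVariables before = new RewardVariables(10.0, 0.5);
        RewardVariables after = new RewardVariables(12.0, 0.625);
        // Wait time rose by 2.0; distance rose by 0.125, which counts as 2.5 after scaling.
        System.out.println("reward = " + reward(before, after)); // prints "reward = -4.5"
    }
}
```

Without the factor of 20, the distance term in this example would contribute only 0.125 against a wait-time term of 2.0.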

Next, you may choose which Observations to use. These are also defined in the model itself. For this tutorial, make sure that all options are selected.

Click 'Train Policy.' Training will begin and early results will be displayed after a few moments.

Once training is complete, a policy file will be generated and you will receive an email update. Training time will vary depending on the model. Since a trained policy file is included in the tutorial download, we can move to the next step without waiting.


Head back to AnyLogic and the Product Delivery model. Open the Pathmind Helper properties panel.

Make sure that Pathmind is enabled and Mode is set to usePolicy. Click Browse and select the 'ProductDeliveryPolicy' zip file included in the tutorial folder.

We have included a Monte Carlo experiment to demonstrate how the Pathmind policy performs against random actions, as well as against the heuristic of always using the nearest manufacturing center. Select and run the Monte Carlo experiment.

Observe that the Pathmind policy results in dramatically lower wait times than either random actions or the heuristic. (Note that the y axes below are not on the same scale.)
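For reference, the two baselines in this comparison can be sketched in a few lines. This is not the code used by the Monte Carlo experiment. The class name, method names, and sample distances are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of the two baseline strategies; not the model's own implementation.
class BaselinePolicies {
    /** Heuristic baseline: index of the manufacturing center closest to the distributor. */
    static int nearestCenter(double[] distancesKm) {
        int best = 0;
        for (int i = 1; i < distancesKm.length; i++) {
            if (distancesKm[i] < distancesKm[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Random baseline: every center index is equally likely. */
    static int randomCenter(int centerCount, Random rng) {
        return rng.nextInt(centerCount);
    }

    public static void main(String[] args) {
        // Made-up distances from one distributor to centers 0, 1 and 2.
        double[] distancesKm = {320.0, 95.5, 410.0};
        System.out.println("nearest center: " + nearestCenter(distancesKm)); // prints "nearest center: 1"
        System.out.println("random center: " + randomCenter(3, new Random()));
    }
}
```

Note that the nearest-center heuristic never looks at stock levels or truck availability, which is the information the trained policy can take into account.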

Reinforcement Learning Functions

Now that you've seen how a Pathmind policy is trained and performs in a model, you might want to take a closer look at how the simulation was set up to work with Pathmind.

Open the Pathmind Helper properties panel and examine the various reinforcement learning values.

Observations - Observations are used to determine the state of the environment in which your reinforcement learning agent operates (the simulation is essentially the environment, and observations are measurements we take from it). More simply, they tell a simulation where things stand for the elements that we define as important. In this model, we have defined those observations as stock levels, total number of trucks, total number of available trucks, and order amounts for each distributor.

Since this model includes multiple manufacturing centers, each one is assigned an index: 0, 1, and 2. The observations function works through each of those to check through our list of observations at all three locations.
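As a rough sketch of what such an observations function can look like, the code below loops over the three manufacturing centers and then appends the order amounts for the fifteen distributors. The class name, method name, parameter names, and the ordering of values in the array are assumptions made for illustration. Check the Pathmind Helper properties panel for the actual code used in the model.

```java
// Hypothetical sketch of an observations function; names and array layout are assumptions.
class ObservationSketch {
    static final int CENTERS = 3;       // manufacturing centers, indexed 0, 1 and 2
    static final int DISTRIBUTORS = 15; // distributors placing orders

    /** Flattens the observed values into a single array: 3 values per center, then 1 per distributor. */
    static double[] buildObservations(double[] stock, double[] totalTrucks,
                                      double[] availableTrucks, double[] orderAmounts) {
        double[] obs = new double[CENTERS * 3 + DISTRIBUTORS];
        int k = 0;
        for (int i = 0; i < CENTERS; i++) {
            obs[k++] = stock[i];           // stock level at center i
            obs[k++] = totalTrucks[i];     // fleet size at center i
            obs[k++] = availableTrucks[i]; // trucks currently free at center i
        }
        for (int d = 0; d < DISTRIBUTORS; d++) {
            obs[k++] = orderAmounts[d];    // outstanding order amount for distributor d
        }
        return obs;
    }

    public static void main(String[] args) {
        double[] stock = {1200, 300, 800};
        double[] totalTrucks = {3, 3, 3};
        double[] availableTrucks = {2, 0, 3};
        double[] orderAmounts = new double[DISTRIBUTORS];
        orderAmounts[4] = 750; // distributor 4 has an open order
        double[] obs = buildObservations(stock, totalTrucks, availableTrucks, orderAmounts);
        System.out.println("observation count: " + obs.length); // prints "observation count: 24"
    }
}
```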

Reward Variables - Reward variables are used to determine if an action taken was good or bad. In most cases, reward variables are the items we want to optimize, such as speed or profit. As you saw in Pathmind Training, reward variables work with the reward function we wrote to help the algorithm understand which actions lead to the best results. This setup works because points are given when an agent takes an action. Reinforcement learning will work to earn as many of those points as possible, adopting new patterns of decision and action as it does so.

Since the goal of the model is to optimize speed and minimize distance, the reward variables track average wait times and average kilometers traveled.

Actions - Actions determine what an agent can do. This model contains 15 decision points (each of the 15 distributors orders products) with 3 possible actions (which of the 3 manufacturing centers fulfills the order).

The actions are executed in doAction(). The actions tell the model which of the manufacturers to select when an order needs to be fulfilled. Since there are three total manufacturers (0, 1, and 2), the model has three possible actions for each distributor.
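The idea can be sketched as follows. The real doAction() in the model is not reproduced in this tutorial, so the signature, parameter names, and return value below are hypothetical. The sketch assumes the policy returns one action per distributor, each being a manufacturer index.

```java
// Hypothetical sketch of routing orders by action; not the model's actual doAction().
class ActionSketch {
    static final int CENTERS = 3; // valid actions are 0, 1 and 2

    /**
     * Routes each distributor's pending order to the center chosen for it
     * and returns the total units assigned to each center.
     */
    static int[] doAction(int[] actions, int[] pendingOrderUnits) {
        if (actions.length != pendingOrderUnits.length) {
            throw new IllegalArgumentException("one action is required per distributor");
        }
        int[] unitsPerCenter = new int[CENTERS];
        for (int d = 0; d < actions.length; d++) {
            int center = actions[d];
            if (center < 0 || center >= CENTERS) {
                throw new IllegalArgumentException("invalid action " + center);
            }
            // Distributors without a pending order (0 units) add nothing.
            unitsPerCenter[center] += pendingOrderUnits[d];
        }
        return unitsPerCenter;
    }

    public static void main(String[] args) {
        int distributors = 15;
        int[] actions = new int[distributors];
        int[] pending = new int[distributors];
        for (int d = 0; d < distributors; d++) {
            actions[d] = d % CENTERS;            // made-up policy output
            pending[d] = (d % 2 == 0) ? 600 : 0; // every other distributor has an order
        }
        int[] units = doAction(actions, pending);
        System.out.println(units[0] + ", " + units[1] + ", " + units[2]); // prints "1800, 1200, 1800"
    }
}
```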

Done - Done tells a simulation when to end, whether it be after a specified amount of time or when a certain condition becomes true. This simulation is set to run for four months, as specified in the Simulation:Main properties.

Event Trigger - The event trigger tells Pathmind when to trigger the next action. The pathmindTrigger event within the Main agent serves as the event trigger in this model. Actions are triggered once per day.
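Done and the event trigger work together: the trigger fires once per simulated day until the done condition becomes true. The sketch below shows that loop in isolation. The names are hypothetical, and the 120-day run length is an assumption standing in for "four months"; the model itself takes its stop time from the Simulation:Main properties.

```java
// Hypothetical sketch of a daily trigger loop with a done condition; not the model's own code.
class EpisodeLoopSketch {
    // Assumption: "four months" approximated as 120 days for this illustration.
    static final int RUN_LENGTH_DAYS = 120;

    /** Done condition: true once the simulated day reaches the run length. */
    static boolean isDone(int day) {
        return day >= RUN_LENGTH_DAYS;
    }

    /** Steps one day at a time and counts how often the action trigger fires. */
    static int countTriggers() {
        int triggers = 0;
        for (int day = 0; !isDone(day); day++) {
            triggers++; // the daily trigger would request the next action here
        }
        return triggers;
    }

    public static void main(String[] args) {
        System.out.println("triggers in one run: " + countTriggers()); // prints "triggers in one run: 120"
    }
}
```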

The Importance of the Reward Function

To demonstrate the importance of crafting the right reward function, try using a different function for this model in Pathmind Training.

Start a new experiment for the project and model we created earlier in this tutorial. When prompted to add the reward function, copy and paste the following:

reward -= (after.avgDistanceKM - before.avgDistanceKM) * 20; // Minimize travel distances only

This reward function focuses only on minimizing distance traveled and ignores wait times.

Click Start Training and wait for training to complete. Once you receive the notification email, download the policy and return to AnyLogic.

Now run the Monte Carlo experiment again using the new policy. Notice that even though the policy trained well, as shown in the reward graph in Pathmind Training, the final results aren't as good as those we got with the original reward function. This difference tells us that wait times are a stronger signal than travel distance.
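A small worked example shows how the two reward functions can disagree about the same decision. The numbers below are invented for illustration and are not results from the model. The class and method names are hypothetical.

```java
// Hypothetical comparison of the two reward functions on invented step outcomes.
class RewardComparison {
    /** Original reward: penalizes changes in both wait time and distance (scaled by 20). */
    static double combined(double waitDelta, double distanceDeltaKm) {
        double reward = 0.0;
        reward -= waitDelta;
        reward -= distanceDeltaKm * 20;
        return reward;
    }

    /** Alternative reward: penalizes changes in distance only. */
    static double distanceOnly(double distanceDeltaKm) {
        double reward = 0.0;
        reward -= distanceDeltaKm * 20;
        return reward;
    }

    public static void main(String[] args) {
        // Option A: nearest center, but out of stock (short trip, long wait).
        double waitA = 3.0, kmA = 0.125;
        // Option B: farther center with stock on hand (longer trip, short wait).
        double waitB = 0.5, kmB = 0.1875;

        System.out.println("combined:      A = " + combined(waitA, kmA) + ", B = " + combined(waitB, kmB));
        System.out.println("distance only: A = " + distanceOnly(kmA) + ", B = " + distanceOnly(kmB));
    }
}
```

With these numbers, the distance-only reward scores option A higher (-2.5 versus -3.75), while the original reward scores option B higher (-4.25 versus -5.5). A policy trained on distance alone has no reason to avoid the long wait in option A.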

Craft your reward functions carefully!


Adding reinforcement learning to this model demonstrates how manufacturers can use Pathmind to solve problems without spending weeks unpacking data and testing actions by manual trial-and-error. The decisions surfaced by reinforcement learning point the way toward greater efficiency.

Beyond the scope of the model used here, reinforcement learning and Pathmind can be applied to larger networks of supply and demand. Reinforcement learning algorithms are able to understand and respond to dynamic environments, and their decisions can be game-changing for businesses operating complex distribution networks.
