General Reinforcement Learning

Agent - The entity making decisions in your AnyLogic simulation. For example, a traffic light is the "agent" in a traffic light simulation.

Environment - All possible factors that could influence the agent. In the case of a traffic light, this could be weather, a concert, or anything else that could influence how the traffic light should behave. In the AnyLogic world, this is your entire simulation.

State - A snapshot of the environment at a specific point in time. It is the "state of the environment" when you pause your simulation: the value of every aspect of your environment at that instant.

Action - Literally, the action that an agent performs. For a traffic light, this would be turning green or red. (For a mouse in a maze, it would be moving up, down, right or left.) An action usually changes some aspect of the environment.

Reward - A numerical value representing the goodness or badness of an action. In the case of the traffic light, actions that produce shorter wait times at intersections are good, and actions that produce longer wait times are bad. The objective of reinforcement learning is to accumulate as much "reward" as possible. So the traffic light agent would seek to minimize wait times for all vehicles at the intersection.

Policy - You can think of a policy as an agent's instruction manual or guidebook. It tells the agent what it should do in any given situation. Reinforcement learning's job is to write this guidebook, based on what you have defined as good or bad in the reward function.
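
To make the terms above concrete, here is a minimal sketch in Python of a policy written as a lookup table, using the traffic light example. The state names, action names and the reward formula are assumptions made up for illustration; a trained policy is normally a neural network rather than a hand-written table.

```python
# Hypothetical states and actions for a traffic light agent.
# A real policy is learned; this table only illustrates the idea of a
# "guidebook" that maps each situation to an action.
POLICY = {
    "cars_waiting_north_south": "green_north_south",
    "cars_waiting_east_west": "green_east_west",
}


def choose_action(state: str) -> str:
    """Look up the action the policy prescribes for the given state."""
    return POLICY[state]


def reward(wait_time_seconds: float) -> float:
    """Shorter waits are better, so the reward is the negative wait time."""
    return -wait_time_seconds
```

With this reward, an action that leaves vehicles waiting 10 seconds scores higher than one that leaves them waiting 30 seconds.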

Your AnyLogic Simulation

Observations - A numerical representation of the "state" of your simulation environment's different aspects. Observations contain information that is crucial for an agent to make an informed decision. In a complex traffic-light simulation, you might observe conditions such as the weather, since ice, snow and rain can affect the flow of vehicles and their ability to stop and go.
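
As a rough sketch of what "numerical representation" can mean, the Python snippet below turns a traffic light state into a list of numbers. The function name, the queue inputs and the weather codes are all assumptions for illustration and are not part of AnyLogic.

```python
# Hypothetical encoding of weather conditions as numbers.
WEATHER_CODES = {"clear": 0.0, "rain": 1.0, "snow": 2.0, "ice": 3.0}


def get_observations(queue_north_south: int, queue_east_west: int, weather: str) -> list:
    """Return the observations as a flat list of floats."""
    return [
        float(queue_north_south),  # vehicles waiting on the north-south road
        float(queue_east_west),    # vehicles waiting on the east-west road
        WEATHER_CODES[weather],    # weather as a numeric code
    ]
```

For example, four cars waiting north-south, one car waiting east-west and rainy weather would be observed as three numbers: 4, 1 and 1.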

Reward Variables - The building blocks for the reward function. Reward variables can embody important simulation metrics such as revenue and cost. These metrics are probably what you are seeking to optimize simultaneously, and the reward function combines them and assigns them their relative importance. For example, in factory operations, you may care about output and safety. Therefore, you will want to know how quickly you can produce goods without causing accidents. Those would be two different reward metrics to monitor to judge the factory's overall performance.
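
The factory example could be sketched in Python as follows. The variable names and the weights are assumptions chosen for illustration; the right weights depend on your own business objectives.

```python
from dataclasses import dataclass


@dataclass
class RewardVariables:
    """Hypothetical reward variables for a factory simulation."""
    units_produced: float
    accidents: float


def reward_function(variables: RewardVariables,
                    output_weight: float = 1.0,
                    accident_weight: float = 50.0) -> float:
    """Combine the reward variables, weighting each by its relative importance."""
    return (output_weight * variables.units_produced
            - accident_weight * variables.accidents)
```

Here a single accident cancels out the reward from 50 units of output, which is one way of telling the agent that safety matters far more than a small gain in speed.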

Actions - List of possible actions that your agent is allowed to perform.

Step - A change in the state of the environment after an agent performs an action. A step is only counted once you trigger an action.

Episode - One full AnyLogic simulation run from start to finish.

Done - A condition that defines when your simulation is complete. When this condition is satisfied, an episode is finished.
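
The relationship between step, episode and done can be sketched with a toy simulation in Python. The class below is a hypothetical stand-in written for this example, not an AnyLogic API, and its queue dynamics are made up.

```python
class ToySimulation:
    """Hypothetical stand-in for a traffic light simulation."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps
        self.steps_taken = 0
        self.queue_length = 3

    def observe(self) -> list:
        return [float(self.queue_length)]

    def step(self, action: str) -> float:
        """Apply an action, update the state and count one step."""
        if action == "green":
            self.queue_length = max(0, self.queue_length - 1)
        else:
            self.queue_length += 1
        self.steps_taken += 1
        return -float(self.queue_length)  # shorter queues earn more reward

    def done(self) -> bool:
        """The episode ends once the step limit is reached."""
        return self.steps_taken >= self.max_steps


def run_episode(sim: ToySimulation, policy) -> float:
    """Run one episode from start to finish and return the total reward."""
    total_reward = 0.0
    while not sim.done():
        action = policy(sim.observe())
        total_reward += sim.step(action)
    return total_reward
```

Calling run_episode(ToySimulation(), lambda obs: "green") runs exactly five steps, because a step is counted only when an action is triggered and the done condition checks the step count.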


Reward Function - A set of rules defining the agent's incentives, what it should seek to achieve.

Reward Shaping - The process of adjusting your reward function to encourage certain behavior, so that the agent's behavior aligns with your business objectives.
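
One simple form of reward shaping is changing how heavily a reward variable is weighted. The Python sketch below uses made-up factory outcomes and weights to show how the same two behaviors rank differently once accidents are penalized more strongly. All names and numbers are assumptions for illustration.

```python
def reward(units_produced: float, accidents: float, accident_weight: float) -> float:
    """Output counts in favor of the agent, accidents count against it."""
    return units_produced - accident_weight * accidents


# Two hypothetical outcomes the agent could produce.
FAST_BUT_RISKY = {"units_produced": 120.0, "accidents": 2.0}
SLOW_BUT_SAFE = {"units_produced": 100.0, "accidents": 0.0}


def preferred_behavior(accident_weight: float) -> str:
    """Return the name of the outcome that earns more reward."""
    risky = reward(accident_weight=accident_weight, **FAST_BUT_RISKY)
    safe = reward(accident_weight=accident_weight, **SLOW_BUT_SAFE)
    return "fast_but_risky" if risky > safe else "slow_but_safe"
```

With an accident weight of 1, the risky behavior earns more reward (118 versus 100). Raising the weight to 50 reverses the ranking (20 versus 100), so the agent is pushed toward the safe behavior.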

Reward Mean - The total reward accumulated in an episode, averaged over all episodes.
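
As a small worked example in Python, assuming you have already recorded the total reward of each episode (the numbers below are made up):

```python
from statistics import mean


def reward_mean(episode_rewards: list) -> float:
    """Average the total reward per episode over all episodes."""
    return mean(episode_rewards)


# Hypothetical total rewards from three episodes.
episode_rewards = [10.0, 20.0, 30.0]
average = reward_mean(episode_rewards)
```

A reward mean that rises over the course of training suggests the policy is improving with respect to your reward function.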

Training Iteration - A "unit" of training that defines how often a policy should be updated. An update is essentially a change in the weights and biases of the neural network that forms part of your reinforcement learning model. The update nudges the policy toward behavior that aligns with your reward function.
