MARL Plots#

Multi-Agent Reinforcement Learning (MARL) Plotting Module#

This module provides visualization tools for analyzing the performance of multi-agent reinforcement learning (MARL) algorithms. It includes functions for plotting policies, rewards, and positions of agents over time.

Dependencies:#

  • InflGame.utils

  • InflGame.MARL

Usage:#

The policy_histogram function visualizes the Q-table as a policy heatmap, while the reward_plot and pos_plot functions plot the rewards and positions of agents over time, respectively. The policy_deterministically_to_actions function simulates deterministic actions for agents based on their policies.

Example:#

import numpy as np
import torch
from InflGame.MARL.async_game import influencer_env_async
from InflGame.MARL.MARL_plots import policy_histogram, reward_plot, pos_plot, policy_deterministically_to_actions

# Define environment configuration
env_config = {
    "num_agents": 3,
    "initial_position": [0.2, 0.5, 0.8],
    "bin_points": np.linspace(0, 1, 100),
    "resource_distribution": np.random.rand(100),
    "step_size": 0.01,
    "domain_type": "1d",
    "domain_bounds": [0, 1],
    "infl_configs": {"infl_type": "gaussian"},
    "parameters": [0.1, 0.1, 0.1],
    "fixed_pa": 0,
    "NUM_ITERS": 100
}

# Initialize the environment
env = influencer_env_async(config=env_config)

# Simulate deterministic actions
q_tensor = torch.rand((3, 100, 3))  # Example Q-tensor
pos_matrix, reward_matrix = policy_deterministically_to_actions(env=env, q_tensor=q_tensor, num_step=50)

# Plot policy heatmap for agent 0
policy_fig = policy_histogram(q_tensor=q_tensor, agent_id=0)
policy_fig.show()

# Plot rewards over time
reward_fig = reward_plot(reward_matrix=reward_matrix, possible_agents=env.possible_agents)
reward_fig.show()

# Plot positions over time
pos_fig = pos_plot(pos_matrix=pos_matrix, possible_agents=env.possible_agents, domain_bounds=env_config["domain_bounds"])
pos_fig.show()

Functions

InflGame.MARL.MARL_plots.policy_deterministically_to_actions(env, q_table=None, q_tensor=None, initial_position=array([0, 1]), num_step=10, temperature=1)#

Simulates deterministic actions for agents based on their policies. The procedure is as follows:

1. The Q-table is converted to a policy using a softmax function, i.e.

\[P(a|s) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}\]
where:
  • \(a\) is the action

  • \(s\) is the current state

  • \(a'\) ranges over the possible actions in state \(s\)

  • \(T\) is the temperature parameter

  • \(P(a|s)\) is the probability of taking action \(a\) in state \(s\)

  • \(Q(s,a)\) is the Q-value for action \(a\) in state \(s\)

2. The greedy (maximum-probability) action is selected for each state, as sketched below.

3. The environment is stepped through the selected actions for the specified number of steps.

4. The positions and rewards are recorded at each step.
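
A minimal sketch of steps 1–2 (for illustration only; not the module's internal implementation), assuming a Q-tensor shaped (num_agents, num_states, num_actions) as in the example above:

import torch

# Hypothetical Q-tensor: 3 agents, 100 discretized states, 3 actions per state.
q_tensor = torch.rand((3, 100, 3))
temperature = 1.0

# Step 1: softmax over the action dimension gives P(a|s) for every agent and state.
policy = torch.softmax(q_tensor / temperature, dim=-1)

# Step 2: pick the greedy (highest-probability) action in each state.
greedy_actions = policy.argmax(dim=-1)  # shape: (num_agents, num_states)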

Parameters:
  • env (influencer_env_async) – The environment object.

  • q_table (dict, optional) – Q-table in dictionary format. Defaults to None.

  • q_tensor (torch.Tensor, optional) – Q-table as a torch.Tensor. Defaults to None.

  • initial_position (np.ndarray) – Initial position of players. Defaults to np.array([0, 1]).

  • num_step (int) – Number of steps to simulate. Defaults to 10.

  • temperature (float) – A smoothness factor for the softmax function. Defaults to 1.

Returns:

Position matrix and reward matrix as torch.Tensors.

Return type:

tuple[torch.Tensor, torch.Tensor]

InflGame.MARL.MARL_plots.policy_histogram(q_table=None, q_tensor=None, agent_id=0, temperature=1)#

Visualizes the Q-table as a policy using a softmax function and plots it as a heatmap.

\[P(a|s) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}\]
where:
  • \(a\) is the action

  • \(s\) is the current state

  • \(a'\) ranges over the possible actions in state \(s\)

  • \(T\) is the temperature parameter

  • \(P(a|s)\) is the probability of taking action \(a\) in state \(s\)

  • \(Q(s,a)\) is the Q-value for action \(a\) in state \(s\)

Parameters:
  • q_table (dict, optional) – Q-table in dictionary format. Defaults to None.

  • q_tensor (torch.Tensor, optional) – Q-table as a torch.Tensor. Defaults to None.

  • agent_id (int) – Agent’s ID number. Defaults to 0.

  • temperature (float) – A smoothness factor for the softmax function. Defaults to 1.

Returns:

Figure representing the policy as a heatmap.

Return type:

matplotlib.figure.Figure
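
A short usage sketch (assuming the Q-tensor layout from the example above); the returned object is a standard matplotlib figure, so it can be shown or saved as usual:

import torch
from InflGame.MARL.MARL_plots import policy_histogram

# Hypothetical Q-tensor with 3 agents, 100 states, and 3 actions.
q_tensor = torch.rand((3, 100, 3))

# Heatmap of agent 1's policy; a lower temperature sharpens the distribution.
fig = policy_histogram(q_tensor=q_tensor, agent_id=1, temperature=0.5)
fig.savefig("policy_agent1.png", dpi=150)  # or fig.show()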

InflGame.MARL.MARL_plots.pos_plot(pos_matrix, possible_agents, domain_bounds)#

Plots the positions of all players over time.

Parameters:
  • pos_matrix (torch.Tensor) – Matrix containing positions for each player at each step.

  • possible_agents (dict) – Dictionary of possible agents in the environment.

  • domain_bounds (list) – List containing the lower and upper bounds of the domain.

Returns:

A figure of the agent positions over time under the greedy policy derived from the Q-values.

Return type:

matplotlib.figure.Figure

InflGame.MARL.MARL_plots.reward_plot(reward_matrix, possible_agents)#

Plots the rewards for all players over time.

Parameters:
  • reward_matrix (torch.Tensor) – Matrix containing rewards for each player at each step.

  • possible_agents (dict) – Dictionary of possible agents in the environment.

Returns:

A figure of the agent rewards over time under the greedy policy derived from the Q-values.

Return type:

matplotlib.figure.Figure
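
For intuition only, a hand-rolled sketch of the kind of figure pos_plot produces, assuming pos_matrix is shaped (num_steps, num_agents) and that possible_agents can be iterated as agent names; this mimics, but is not, the module's implementation:

import torch
import matplotlib.pyplot as plt

# Hypothetical trajectories: 50 steps, 3 agents, positions in [0, 1].
pos_matrix = torch.rand((50, 3))
possible_agents = ["agent_0", "agent_1", "agent_2"]
domain_bounds = [0, 1]

fig, ax = plt.subplots()
for idx, agent in enumerate(possible_agents):
    ax.plot(pos_matrix[:, idx].numpy(), label=agent)  # one trajectory per agent
ax.set_xlabel("Step")
ax.set_ylabel("Position")
ax.set_ylim(domain_bounds)
ax.legend()
fig.savefig("positions.png", dpi=150)

A reward_plot analogue is the same loop over a reward_matrix, with "Reward" on the y-axis and no fixed axis limits.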