Synchronous IQL#

Independent Q-Learning with Synchronized Updates Module#

This module implements the Independent Q-Learning (IQL) algorithm with synchronized updates for multi-agent reinforcement learning. Each agent learns its own Q-table independently while all agents act simultaneously in a shared environment. The module supports epsilon-greedy and softmax policies for action selection and provides utilities for training agents over multiple epochs and episodes.

Mathematical Definitions:#

The Q-learning update rule for an agent \(i\) is defined as:

\[Q_i(s, a) \leftarrow (1 - \alpha) Q_i(s, a) + \alpha \left( r + \gamma \max_{a'} Q_i(s', a') \right)\]
where:
  • \(s\) is the current state

  • \(a\) is the action taken

  • \(r\) is the reward received

  • \(s'\) is the next state

  • \(\alpha\) is the learning rate

  • \(\gamma\) is the discount factor
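
As a minimal illustration, and assuming a hypothetical dictionary of per-agent (num_states, num_actions) NumPy arrays, the update for a single agent can be sketched as follows (the names q_update and q_table are illustrative, not part of the module's API):

import numpy as np

# Hypothetical tabular update for one agent. q_table[agent] is assumed to be a
# (num_states, num_actions) array; alpha and gamma match the definitions above.
def q_update(q_table, agent, state, action, reward, next_state, alpha=0.01, gamma=0.9):
    target = reward + gamma * np.max(q_table[agent][next_state])
    q_table[agent][state, action] = (
        (1 - alpha) * q_table[agent][state, action] + alpha * target
    )
    return q_table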

This implementation supports both epsilon-greedy and softmax action selection policies.

Dependencies:#

  • numpy

  • torch

  • random

  • InflGame.MARL.utils.IQL_utils

Usage:#

The IQL_sync class provides an implementation of the IQL algorithm with synchronized updates. It supports custom configurations for learning rate, discount factor, epsilon decay, and more.

Example:#

import numpy as np
from InflGame.MARL.async_game import influencer_env_async
from InflGame.MARL.IQL_sync import IQL_sync

# Define environment configuration
env_config = {
    "num_agents": 3,
    "initial_position": [0.2, 0.5, 0.8],
    "bin_points": np.linspace(0, 1, 100),
    "resource_distribution": np.random.rand(100),
    "step_size": 0.01,
    "domain_type": "1d",
    "domain_bounds": [0, 1],
    "infl_configs": {"infl_type": "gaussian"},
    "parameters": [0.1, 0.1, 0.1],
    "fixed_pa": 0,
    "NUM_ITERS": 100
}

# Initialize the environment
env = influencer_env_async(config=env_config)

# Define IQL configuration
iql_config = {
    "random_seed": 42,
    "env": env,
    "epsilon_configs": {"TYPE": "cosine_annealing", "epsilon_max": 1.0, "epsilon_min": 0.1},
    "gamma": 0.9,
    "alpha": 0.01,
    "epochs": 100,
    "random_initialize": True,
    "soft_max": False,
    "episode_configs": {"TYPE": "fixed", "episode_max": 10}
}

# Initialize and train the IQL_sync agent
iql_agent = IQL_sync(config=iql_config)
final_q_table = iql_agent.train()

print("Training completed. Final Q-table:", final_q_table)

Classes

class InflGame.MARL.IQL_sync.IQL_sync(config=None)#

Implements Independent Q-Learning (IQL) with synchronized updates for multi-agent reinforcement learning.

Methods

Q_max_action(agent)

Chooses the action with the maximum Q-value for the given agent.

Q_soft_max_action(agent_id)

Chooses an action for the given agent using a softmax policy.

Q_step(episode)

Performs a single step of Q-learning for all agents in the environment.

Q_table_initiation()

Initializes the Q-table for all agents.

action_choice(episode)

Chooses actions for all agents based on the epsilon-greedy policy.

observation_initialized()

Resets the environment and initializes observations for all agents by sampling from their observation spaces.

train([checkpoints, save_positions, ...])

Trains the agents using the IQL algorithm over multiple epochs and episodes.

Q_max_action(agent)#

Chooses the action with the maximum Q-value for the given agent.

\[a^* = \arg\max_a Q(s,a)\]
where:
  • \(s\) is the current state

  • \(a^*\) is the action with the highest Q-value

  • \(Q(s,a)\) is the Q-value for action \(a\) in state \(s\)

Parameters:

agent (int) – The ID of the agent.

Returns:

The action with the highest Q-value.

Return type:

int
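
A minimal sketch of this selection, assuming the agent's Q-values for its current observation are held in a NumPy vector (the helper name greedy_action is illustrative):

import numpy as np

def greedy_action(q_row):
    # Index of the largest Q-value in the current state's row of the Q-table.
    return int(np.argmax(q_row))

# Example: greedy_action(np.array([0.1, 0.4, 0.2])) returns 1.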

Q_soft_max_action(agent_id)#

Chooses an action for the given agent using a softmax policy.

\[P(a|s) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}\]

where:
  • \(T\) is the temperature parameter

  • the sum in the denominator runs over all possible actions \(a'\)

  • \(P(a|s)\) is the probability of taking action \(a\) in state \(s\)

  • \(Q(s,a)\) is the Q-value for action \(a\) in state \(s\)

The temperature \(T\) is adjusted based on the observation and the configuration settings via InflGame.MARL.utils.IQL_utils.adjusted_temperature.

Parameters:

agent_id (int) – The ID of the agent.

Returns:

The chosen action.

Return type:

int
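
A minimal sketch of a temperature-scaled softmax draw, assuming a NumPy vector of Q-values for the current state; the temperature computed by adjusted_temperature is replaced here by a fixed illustrative value (the names below are hypothetical):

import numpy as np

def softmax_action(q_row, temperature=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Subtract the maximum Q-value for numerical stability before exponentiating.
    logits = (q_row - np.max(q_row)) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    # Sample an action index according to the softmax probabilities.
    return int(rng.choice(len(q_row), p=probs))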

Q_step(episode)#

Performs a single step of Q-learning for all agents in the environment. Actions are chosen simultaneously, and the Q-values are updated based on the received rewards.

\[Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)\]
Parameters:

episode (int) – The current episode number.

Returns:

The updated Q-table.

Return type:

dict
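
Conceptually, one synchronized step has roughly the shape sketched below. It assumes a PettingZoo-style parallel environment whose step method takes a dictionary of actions and returns per-agent observations and rewards; the helper names are illustrative, not the module's actual code:

def synchronized_q_step(env, q_table, observations, actions, alpha, gamma):
    # All agents act simultaneously; the environment returns per-agent results.
    next_observations, rewards, terminations, truncations, infos = env.step(actions)
    for agent, action in actions.items():
        s, s_next, r = observations[agent], next_observations[agent], rewards[agent]
        # Apply the synchronized Q-update from the formula above.
        target = r + gamma * q_table[agent][s_next].max()
        q_table[agent][s, action] = (1 - alpha) * q_table[agent][s, action] + alpha * target
    return q_table, next_observations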

Q_table_initiation()#

Initializes the Q-table for all agents.

If random_initialize is True, the Q-values are initialized randomly; otherwise, they are initialized to zero.

\[\begin{split}Q(s, a) = \begin{cases} \text{random value in } [0, 0.3] & \text{if random initialization is enabled} \\ 0 & \text{otherwise} \end{cases}\end{split}\]
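
A minimal sketch of this initialization, assuming a dictionary of per-agent (num_states, num_actions) NumPy arrays (the function and argument names are illustrative):

import numpy as np

def init_q_tables(agents, num_states, num_actions, random_initialize=True, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    q_table = {}
    for agent in agents:
        if random_initialize:
            # Random values in [0, 0.3], matching the case above.
            q_table[agent] = rng.uniform(0.0, 0.3, size=(num_states, num_actions))
        else:
            q_table[agent] = np.zeros((num_states, num_actions))
    return q_table
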
action_choice(episode)#

Chooses actions for all agents based on the epsilon-greedy policy.

The \(\epsilon\) value is adjusted based on the current episode via InflGame.MARL.utils.IQL_utils.adjusted_epsilon.

If a random value is less than \(\epsilon\), a random action is chosen. Otherwise, the action is selected based on the Q-values using either a softmax or max policy.

Parameters:

episode (int) – The current episode number.

Returns:

A dictionary mapping each agent to its chosen action.

Return type:

dict
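
A sketch of this epsilon-greedy branching across all agents, with a fixed epsilon standing in for the schedule computed by adjusted_epsilon (the names below are illustrative, and softmax_action refers to the sketch above):

import numpy as np

def choose_actions(q_table, observations, epsilon, soft_max=False, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    actions = {}
    for agent, q in q_table.items():
        q_row = q[observations[agent]]
        if rng.random() < epsilon:
            # Explore: pick a uniformly random action.
            actions[agent] = int(rng.integers(len(q_row)))
        elif soft_max:
            # Exploit via the softmax policy (see the softmax sketch above).
            actions[agent] = softmax_action(q_row)
        else:
            # Exploit greedily: the highest Q-value wins.
            actions[agent] = int(np.argmax(q_row))
    return actions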

observation_initialized()#

Resets the environment and initializes observations for all agents by sampling from their observation spaces.
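
A minimal sketch of this reset-and-sample step, assuming a PettingZoo-style environment that exposes per-agent observation spaces (an assumption about the interface, not the module's actual code):

def initialize_observations(env):
    env.reset()
    # Sample a fresh starting observation for every agent from its observation space.
    return {agent: env.observation_space(agent).sample() for agent in env.agents}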

train(checkpoints=False, save_positions=False, data_parameters=None, name_ads=None)#

Trains the agents using the IQL algorithm over multiple epochs and episodes.

The number of episodes is adjusted based on the current epoch via InflGame.MARL.utils.IQL_utils.adjusted_episodes.

At the end of each epoch, after all of its episodes have run, the environment is reset and the observations are reinitialized randomly.

Returns:

The final Q-table after training.

Return type:

dict
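
Putting the sketches above together, the training loop this method describes has roughly the following shape (the per-epoch reset and the fixed episode count are assumptions drawn from the description, not the actual implementation):

def train_outline(env, q_table, epochs, episodes_per_epoch, alpha, gamma, epsilon):
    for epoch in range(epochs):
        # Reset the environment and re-sample starting observations for each epoch.
        observations = initialize_observations(env)
        for episode in range(episodes_per_epoch):
            # All agents choose actions, then update their Q-tables in sync.
            actions = choose_actions(q_table, observations, epsilon)
            q_table, observations = synchronized_q_step(
                env, q_table, observations, actions, alpha, gamma
            )
    return q_table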