Synchronous IQL#
Independent Q-Learning with Synchronized Updates Module#
This module implements the Independent Q-Learning (IQL) algorithm with synchronized updates for multi-agent reinforcement learning. The IQL algorithm allows agents to learn independently while interacting in a shared environment. It supports epsilon-greedy and softmax policies for action selection and provides utilities for training agents over multiple epochs and episodes.
Mathematical Definitions:#
The Q-learning update rule for an agent \(i\) is defined as:
\[Q_i(s, a) \leftarrow (1 - \alpha) Q_i(s, a) + \alpha \left( r + \gamma \max_{a'} Q_i(s', a') \right)\]
- where:
\(s\) is the current state
\(a\) is the action taken
\(r\) is the reward received
\(s'\) is the next state
\(\alpha\) is the learning rate
\(\gamma\) is the discount factor
This implementation supports both epsilon-greedy and softmax action selection policies.
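For orientation, here is a minimal sketch of the tabular update above in NumPy. The table shape, state discretization, and variable names are illustrative assumptions, not the module's internal representation.
import numpy as np
alpha, gamma = 0.01, 0.9    # learning rate and discount factor
Q = np.zeros((100, 3))      # Q[s, a] over 100 discretized states and 3 actions (illustrative shape)
def q_update(Q, s, a, r, s_next):
    # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max_a' Q(s', a'))
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
    return Q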
Dependencies:#
numpy
torch
random
InflGame.MARL.utils.IQL_utils
Usage:#
The IQL_sync class provides an implementation of the IQL algorithm with synchronized updates. It supports custom configurations for learning rate, discount factor, epsilon decay, and more.
Example:#
import numpy as np
from InflGame.MARL.async_game import influencer_env_async
from InflGame.MARL.IQL_sync import IQL_sync
# Define environment configuration
env_config = {
"num_agents": 3,
"initial_position": [0.2, 0.5, 0.8],
"bin_points": np.linspace(0, 1, 100),
"resource_distribution": np.random.rand(100),
"step_size": 0.01,
"domain_type": "1d",
"domain_bounds": [0, 1],
"infl_configs": {"infl_type": "gaussian"},
"parameters": [0.1, 0.1, 0.1],
"fixed_pa": 0,
"NUM_ITERS": 100
}
# Initialize the environment
env = influencer_env_async(config=env_config)
# Define IQL configuration
iql_config = {
"random_seed": 42,
"env": env,
"epsilon_configs": {"TYPE": "cosine_annealing", "epsilon_max": 1.0, "epsilon_min": 0.1},
"gamma": 0.9,
"alpha": 0.01,
"epochs": 100,
"random_initialize": True,
"soft_max": False,
"episode_configs": {"TYPE": "fixed", "episode_max": 10}
}
# Initialize and train the IQL_sync agent
iql_agent = IQL_sync(config=iql_config)
final_q_table = iql_agent.train()
print("Training completed. Final Q-table:", final_q_table)
Classes
- class InflGame.MARL.IQL_sync.IQL_sync(config=None)#
Implements Independent Q-Learning (IQL) with synchronized updates for multi-agent reinforcement learning.
Methods
Q_max_action(agent): Chooses the action with the maximum Q-value for the given agent.
Q_soft_max_action(agent_id): Chooses an action for the given agent using a softmax policy.
Q_step(episode): Performs a single step of Q-learning for all agents in the environment.
Q_table_initiation(): Initializes the Q-table for all agents.
action_choice(episode): Chooses actions for all agents based on the epsilon-greedy policy.
observation_initialized(): Resets the environment and initializes observations for all agents by sampling from their observation spaces.
train([checkpoints, save_positions, ...]): Trains the agents using the IQL algorithm over multiple epochs and episodes.
- Q_max_action(agent)#
Chooses the action with the maximum Q-value for the given agent.
\[a^* = \arg\max_a Q(s, a)\]
- where:
\(s\) is the current state
\(a^*\) is the action with the highest Q-value
\(Q(s,a)\) is the Q-value for action \(a\) in state \(s\)
- Parameters:
agent (int) – The ID of the agent.
- Returns:
The action with the highest Q-value.
- Return type:
int
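A minimal sketch of greedy selection over one agent's Q-row; the random tie-breaking and flat NumPy layout are assumptions for illustration, not necessarily how IQL_sync resolves ties internally.
import numpy as np
def greedy_action(q_values, rng=np.random.default_rng()):
    # a* = argmax_a Q(s, a); ties broken uniformly at random (an assumption made here)
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))
greedy_action(np.array([0.1, 0.4, 0.4]))  # returns 1 or 2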
- Q_soft_max_action(agent_id)#
Chooses an action for the given agent using a softmax policy.
\[P(a|s) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}\]
The temperature is adjusted based on the observation and the configuration settings via InflGame.MARL.utils.IQL_utils.adjusted_temperature.
- where:
\(T\) is the temperature parameter
\(a'\) ranges over all possible actions
\(P(a|s)\) is the probability of taking action \(a\) in state \(s\)
\(Q(s,a)\) is the Q-value for action \(a\) in state \(s\)
- Parameters:
agent_id (int) – The ID of the agent.
- Returns:
The chosen action.
- Return type:
int
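A sketch of temperature-scaled softmax sampling over one agent's Q-row, assuming a fixed temperature rather than the adjusted_temperature schedule; names and data layout are illustrative assumptions.
import numpy as np
def softmax_action(q_values, temperature=1.0, rng=np.random.default_rng()):
    # P(a|s) = exp(Q(s,a)/T) / sum_a' exp(Q(s,a')/T), with the max subtracted for numerical stability
    z = q_values / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))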
- Q_step(episode)#
Performs a single step of Q-learning for all agents in the environment. Actions are chosen simultaneously, and the Q-values are updated based on the received rewards.
\[Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)\]
- Parameters:
episode (int) – The current episode number.
- Returns:
The updated Q-table.
- Return type:
dict
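A sketch of what one synchronized step can look like, assuming a PettingZoo-parallel-style env.step(actions) return and per-agent NumPy Q-tables indexed by discrete state; these interface details are assumptions, not the module's actual internals.
def q_step_sketch(env, Q, obs, actions, alpha=0.01, gamma=0.9):
    # All agents act simultaneously; each agent then updates its own table independently.
    next_obs, rewards, terminations, truncations, infos = env.step(actions)  # assumed parallel-env return
    for agent, a in actions.items():
        s, s_next, r = obs[agent], next_obs[agent], rewards[agent]
        Q[agent][s, a] = (1 - alpha) * Q[agent][s, a] + alpha * (r + gamma * Q[agent][s_next].max())
    return Q, next_obs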
- Q_table_initiation()#
Initializes the Q-table for all agents.
If random_initialize is True, the Q-values are initialized randomly; otherwise, they are initialized to zero.
\[\begin{split}Q(s, a) = \begin{cases} \text{random value in } [0, 0.3] & \text{if random initialization is enabled} \\ 0 & \text{otherwise} \end{cases}\end{split}\]
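A sketch of this initialization rule over per-agent NumPy tables; the dictionary layout and argument names are assumptions for illustration.
import numpy as np
def init_q_tables(agents, n_states, n_actions, random_initialize=True, rng=np.random.default_rng()):
    # Random values in [0, 0.3] when random initialization is enabled, all zeros otherwise
    if random_initialize:
        return {agent: rng.uniform(0.0, 0.3, size=(n_states, n_actions)) for agent in agents}
    return {agent: np.zeros((n_states, n_actions)) for agent in agents}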
- action_choice(episode)#
Chooses actions for all agents based on the epsilon-greedy policy.
The \(\epsilon\) value is adjusted based on the current episode via the function InflGame.MARL.utils.IQL_utils.adjusted_epsilon.
If a random value is less than \(\epsilon\), a random action is chosen. Otherwise, the action is selected based on the Q-values using either a softmax or max policy.
- Parameters:
episode (int) – The current episode number.
- Returns:
A dictionary mapping each agent to its chosen action.
- Return type:
dict
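A sketch of epsilon-greedy selection with an optional softmax exploit branch, using a fixed epsilon and temperature instead of the adjusted_epsilon schedule; the per-agent dictionary layout is an assumption.
import numpy as np
def action_choice_sketch(Q, obs, epsilon, soft_max=False, temperature=1.0, rng=np.random.default_rng()):
    # One action per agent: explore with probability epsilon, otherwise exploit the Q-table.
    actions = {}
    for agent, q_table in Q.items():
        q_row = q_table[obs[agent]]
        if rng.random() < epsilon:
            actions[agent] = int(rng.integers(len(q_row)))            # random exploration
        elif soft_max:
            z = q_row / temperature
            probs = np.exp(z - z.max())
            probs /= probs.sum()
            actions[agent] = int(rng.choice(len(q_row), p=probs))     # softmax policy
        else:
            actions[agent] = int(np.argmax(q_row))                    # greedy policy
    return actions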
- observation_initialized()#
Resets the environment and initializes observations for all agents by sampling from their observation spaces.
- train(checkpoints=False, save_positions=False, data_parameters=None, name_ads=None)#
Trains the agents using the IQL algorithm over multiple epochs and episodes.
The number of episodes is adjusted based on the epoch using the function InflGame.MARL.utils.IQL_utils.adjusted_episodes.
At the end of all episodes in an epoch, the environment is reset and the observations are reinitialized randomly.
- Returns:
The final Q-table after training.
- Return type:
dict
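For orientation, a self-contained sketch of the overall epoch/episode loop, assuming a PettingZoo-parallel-style environment (env.reset(), env.agents, env.observation_space(agent).sample(), env.step(actions)) and a simple linear epsilon decay in place of the configured schedules; none of these names are guaranteed to match IQL_sync's internals.
import numpy as np
def train_sketch(env, Q, epochs=100, episodes=10, alpha=0.01, gamma=0.9,
                 epsilon_max=1.0, epsilon_min=0.1, rng=np.random.default_rng()):
    for epoch in range(epochs):
        env.reset()
        # reinitialize observations by sampling each agent's observation space
        # (assumed API; samples are assumed to be discrete state indices)
        obs = {agent: env.observation_space(agent).sample() for agent in env.agents}
        for episode in range(episodes):
            # linear decay stands in for the epsilon schedule in IQL_utils.adjusted_epsilon
            eps = epsilon_max - (epsilon_max - epsilon_min) * episode / max(episodes - 1, 1)
            actions = {}
            for agent in env.agents:
                q_row = Q[agent][obs[agent]]
                if rng.random() < eps:
                    actions[agent] = int(rng.integers(len(q_row)))    # explore
                else:
                    actions[agent] = int(np.argmax(q_row))            # exploit
            next_obs, rewards, _, _, _ = env.step(actions)            # assumed parallel-env return
            for agent, a in actions.items():
                s, s_next, r = obs[agent], next_obs[agent], rewards[agent]
                Q[agent][s, a] = (1 - alpha) * Q[agent][s, a] + alpha * (r + gamma * Q[agent][s_next].max())
            obs = next_obs
    return Q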