Asynchronous IQL#
Independent Q-Learning with Asynchronous Updates Module#
This module implements the Independent Q-Learning (IQL) algorithm in an asynchronous multi-agent reinforcement learning setting. The IQL algorithm allows agents to learn independently while interacting in a shared environment. It supports epsilon-greedy and softmax policies for action selection and provides utilities for training agents over multiple epochs and episodes.
Mathematical Definitions:#
The Q-learning update rule for an agent \(i\) is defined as:
\[Q_i(s, a) \leftarrow (1 - \alpha) Q_i(s, a) + \alpha \left( r_i + \gamma \max_{a'} Q_i(s', a') \right)\]
- where:
\(s\) is the current state
\(a\) is the action taken
\(r\) is the reward received
\(s'\) is the next state
\(\alpha\) is the learning rate
\(\gamma\) is the discount factor
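As a minimal numeric sketch of this update (hypothetical variable names, not code from the module), a single tabular update for one agent can be written as:
import numpy as np

# Hypothetical dict-of-arrays Q-table keyed by discrete state: 10 states, 5 actions.
alpha, gamma = 0.01, 0.9                      # learning rate and discount factor
Q = {s: np.zeros(5) for s in range(10)}       # Q-values initialized to zero

s, a, r, s_next = 3, 2, 1.5, 4                # one example transition
td_target = r + gamma * np.max(Q[s_next])     # r + gamma * max_a' Q(s', a')
Q[s][a] = (1 - alpha) * Q[s][a] + alpha * td_target   # convex-combination form of the update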
This implementation supports both epsilon-greedy and softmax action selection policies.
Dependencies:#
InflGame.MARL
Usage:#
The IQL_async class provides an implementation of the IQL algorithm with asynchronous updates. It supports custom configurations for learning rate, discount factor, epsilon decay, and more.
Example:#
import numpy as np
from InflGame.MARL.async_game import influencer_env_async
from InflGame.MARL.IQL_async import IQL_async
# Define environment configuration
env_config = {
    "num_agents": 3,
    "initial_position": [0.2, 0.5, 0.8],
    "bin_points": np.linspace(0, 1, 100),
    "resource_distribution": np.random.rand(100),
    "step_size": 0.01,
    "domain_type": "1d",
    "domain_bounds": [0, 1],
    "infl_configs": {"infl_type": "gaussian"},
    "parameters": [0.1, 0.1, 0.1],
    "fixed_pa": 0,
    "NUM_ITERS": 100
}
# Initialize the environment
env = influencer_env_async(config=env_config)
# Define IQL configuration
iql_config = {
    "random_seed": 42,
    "env": env,
    "epsilon_configs": {"TYPE": "cosine_annealing", "epsilon_max": 1.0, "epsilon_min": 0.1},
    "gamma": 0.9,
    "alpha": 0.01,
    "epochs": 100,
    "random_initialize": True,
    "soft_max": False,
    "episode_configs": {"TYPE": "fixed", "episode_max": 10}
}
# Initialize and train the IQL_async agent
iql_agent = IQL_async(config=iql_config)
final_q_table = iql_agent.train()
print("Training completed. Final Q-table:", final_q_table)
Classes
- class InflGame.MARL.IQL_async.IQL_async(config=None)#
Implements Independent Q-Learning (IQL) with asynchronous updates for multi-agent reinforcement learning.
Methods
Q_max_action(agent) – Chooses the action with the maximum Q-value for the given player.
Q_soft_max_action(agent) – Chooses an action for the given player using a softmax policy based on the Q-values.
Q_step(episode) – Performs a single step of Q-learning for all agents in the environment asynchronously, where each agent updates its Q-table independently.
Q_table_initiation() – Initializes the Q-table for all agents.
action_choice(episode, agent) – Chooses an action for the given player based on the epsilon-greedy policy.
observation_initialized() – Resets the environment and initializes random observations for all agents.
train() – Trains the agents using the IQL algorithm over multiple epochs and episodes.
- Q_max_action(agent)#
Chooses the action with the maximum Q-value for the given player.
\[a^* = \arg\max_a Q(s,a)\]
- where:
\(s\) is the current state
\(a^*\) is the action with the highest Q-value
\(Q(s,a)\) is the Q-value for action \(a\) in state \(s\)
- Parameters:
agent (str) – The player for whom the action is being chosen.
- Returns:
The chosen action.
- Return type:
int
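A minimal sketch of this selection rule (illustrative only, not the module's own implementation) could look like:
import numpy as np

def q_max_action_sketch(q_values: np.ndarray) -> int:
    """Hypothetical illustration of the argmax rule above: pick the largest
    Q-value, breaking ties uniformly at random."""
    best = np.flatnonzero(q_values == q_values.max())  # indices of all tied maxima
    return int(np.random.choice(best))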
- Q_soft_max_action(agent)#
Chooses an action for the given player using a softmax policy based on the Q-values.
\[P(a|s) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}\]
The temperature is adjusted based on the observation and the configuration settings via the function InflGame.MARL.utils.IQL_utils.adjusted_temperature.
- where:
\(T\) is the temperature parameter
\(a'\) ranges over the set of all possible actions
\(P(a|s)\) is the probability of taking action \(a\) in state \(s\)
\(Q(s,a)\) is the Q-value for action \(a\) in state \(s\)
- Parameters:
agent (str) – The player for whom the action is being chosen.
- Returns:
The chosen action.
- Return type:
int
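A sketch of temperature-scaled softmax selection, assuming a fixed temperature T (the module instead adjusts the temperature via InflGame.MARL.utils.IQL_utils.adjusted_temperature):
import numpy as np

def q_softmax_action_sketch(q_values: np.ndarray, T: float = 1.0) -> int:
    """Hypothetical softmax selector: sample an action with probability
    proportional to exp(Q(s, a) / T); lower T approaches the greedy choice."""
    logits = q_values / T
    logits = logits - logits.max()                    # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))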
- Q_step(episode)#
Performs a single step of Q-learning for all agents in the environment asynchronously, where each agent updates its Q-table independently.
The Q-learning update step is done using the following formula:
\[Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)\]
Actions are chosen independently for each agent using the action_choice method, and the Q-values are then updated based on the received rewards and the next state, in which the next agent chooses its action. As a loop, this looks like:
- for each agent in agent_order:
    action = action_choice(episode, agent)
    observations, rewards, terminateds, _, _ = env.step(action)
    update the agent's Q-table from the reward and next observation
After this loop, agent_order is shuffled and the loop is repeated until all episodes for the epoch have been completed. A rough runnable sketch of this pass is given after this entry.
- Parameters:
episode (int) – The current episode number.
- Returns:
The updated Q-table.
- Return type:
dict
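The sketch below illustrates the flow described above under assumed conventions: hypothetical names such as q_table, obs, and agents stand in for the class's internal state, and the environment's step return values are assumed to be dicts keyed by agent, as in the pseudocode.
import random
import numpy as np

def q_step_sketch(env, q_table, obs, agents, episode, alpha=0.01, gamma=0.9):
    """Hypothetical one-episode pass: agents act one at a time in a random
    order and each updates only its own Q-table entry (asynchronous IQL)."""
    agent_order = list(agents)
    random.shuffle(agent_order)                    # fresh random order for this pass
    for agent in agent_order:
        s = obs[agent]
        a = int(np.argmax(q_table[agent][s]))      # stand-in for action_choice(episode, agent)
        obs, rewards, terminateds, _, _ = env.step(a)
        s_next, r = obs[agent], rewards[agent]
        q_table[agent][s][a] = (1 - alpha) * q_table[agent][s][a] + alpha * (
            r + gamma * np.max(q_table[agent][s_next])
        )
    return q_table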
- Q_table_initiation()#
Initializes the Q-table for all agents.
If random_initialize is True, the Q-values are initialized randomly; otherwise, they are initialized to zero.
\[\begin{split}Q(s, a) = \begin{cases} \text{random value in } [0, 0.3] & \text{if random initialization is enabled} \\ 0 & \text{otherwise} \end{cases}\end{split}\]
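A sketch of this initialization rule, assuming a dict-of-arrays Q-table keyed by agent (the container layout here is an assumption, not taken from the module):
import numpy as np

def q_table_initiation_sketch(agents, num_states, num_actions, random_initialize=True):
    """Hypothetical initialization matching the rule above: uniform random
    values in [0, 0.3] per (state, action) if enabled, zeros otherwise."""
    if random_initialize:
        return {agent: np.random.uniform(0.0, 0.3, size=(num_states, num_actions))
                for agent in agents}
    return {agent: np.zeros((num_states, num_actions)) for agent in agents}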
- action_choice(episode, agent)#
Chooses an action for the given player based on the epsilon-greedy policy.
\(\epsilon\) is adjusted based on the episode number and the number of players in the environment via the function InflGame.MARL.utils.IQL_utils.adjusted_epsilon.
If a random value is less than \(\epsilon\), a random action is chosen. Otherwise, the action is selected based on the Q-values using either a softmax or max policy. A short illustrative sketch follows this entry.
- Parameters:
episode (int) – The current episode number.
agent (str) – The player for whom the action is being chosen.
- Returns:
The chosen action.
- Return type:
int
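The sketch below uses a fixed epsilon and temperature for simplicity; the module instead computes epsilon per episode via InflGame.MARL.utils.IQL_utils.adjusted_epsilon.
import numpy as np

def action_choice_sketch(q_values: np.ndarray, epsilon: float,
                         soft_max: bool = False, T: float = 1.0) -> int:
    """Hypothetical epsilon-greedy selector: explore with probability epsilon,
    otherwise exploit via the softmax or greedy-max rule described above."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(len(q_values)))     # explore uniformly
    if soft_max:
        logits = q_values / T - (q_values / T).max()     # temperature-scaled, stabilized
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(np.random.choice(len(q_values), p=probs))
    return int(np.argmax(q_values))                      # exploit greedily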
- observation_initialized()#
Resets the environment and initializes random observations for all agents.
- train()#
Trains the agents using the IQL algorithm over multiple epochs and episodes.
The number of episodes in an epoch is adjusted based on the configuration settings via the function InflGame.MARL.utils.IQL_utils.adjusted_episodes.
At the end of each epoch, the environment is reset and the observations are reinitialized randomly. A high-level outline of this loop is sketched after this entry.
- Returns:
The final Q-table after training.
- Return type:
dict
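The outline below is a hypothetical sketch of the training flow described above, with an assumed fixed episode count per epoch; the real method is InflGame.MARL.IQL_async.IQL_async.train().
def train_sketch(iql_agent, epochs: int, episodes_per_epoch: int):
    q_table = None
    for epoch in range(epochs):
        iql_agent.observation_initialized()        # reset env, reinitialize observations
        for episode in range(episodes_per_epoch):  # the module adjusts this count itself
            q_table = iql_agent.Q_step(episode)    # one asynchronous pass over all agents
    return q_table                                 # final Q-table after training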