IQL Utilities#
IQL Utilities Module#
This module provides utility functions for Independent Q-Learning (IQL) in multi-agent reinforcement learning (MARL). The functions include methods for converting Q-tables to tensors, generating policies from Q-tensors, and adjusting parameters such as epsilon, temperature, and the number of episodes based on various scheduling strategies.
The module supports:
- Conversion of Q-tables to tensor representations for efficient computation.
- Policy generation using softmax over Q-values.
- Dynamic adjustment of epsilon, temperature, and episodes for exploration-exploitation trade-offs.
- Scheduling strategies such as cosine annealing and reverse cosine annealing.
Dependencies:#
`InflGame.utils`
Usage:#
The functions in this module can be used to preprocess Q-tables, generate policies, and dynamically adjust parameters for MARL algorithms.
Example:#
from InflGame.MARL.utils.IQL_utils import Q_table_to_tensor, Q_tensor_to_policy, adjusted_epsilon
# Convert a Q-table to a tensor
Q_table = {
    0: {0: {0: 1.0, 1: 2.0}, 1: {0: 3.0, 1: 4.0}},
    1: {0: {0: 5.0, 1: 6.0}, 1: {0: 7.0, 1: 8.0}}
}
Q_tensor = Q_table_to_tensor(Q_table)
# Generate a policy from the Q-tensor
policy = Q_tensor_to_policy(Q_tensor, temperature=0.5, agent_id=0)
# Adjust epsilon using cosine annealing
configs = {
    'TYPE': 'cosine_annealing',
    'epsilon_min': 0.1,
    'epsilon_max': 1.0
}
epsilon = adjusted_epsilon(configs, num_agents=2, episode=10, episodes=100)
print(f"Adjusted epsilon: {epsilon}")
Functions
- InflGame.MARL.utils.IQL_utils.Q_table_to_tensor(Q_table)#
Converts a Q-table (nested dictionary) into a tensor representation.
\[Q_{\text{torch}}[i, j, k] = Q_{\text{dict}}[p_i, s_j, a_k]\]
- where:
\(p_i\) is the \(i\)-th player, \(s_j\) is the \(j\)-th state, and \(a_k\) is the \(k\)-th action, so \(Q_{\text{dict}}[p_i, s_j, a_k]\) is the Q-value of player \(p_i\) for state \(s_j\) and action \(a_k\).
- Parameters:
Q_table (dict) – Nested dictionary representing the Q-table.
- Returns:
Tensor representation of the Q-table.
- Return type:
torch.Tensor
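A minimal sketch of this conversion, assuming the nested dictionary is keyed by contiguous integer indices as in the module example above ({player: {state: {action: value}}}); the helper name is illustrative, not the library implementation:

import torch

def q_table_to_tensor_sketch(Q_table):
    # Assumes integer keys 0..N-1 at every level of the nested dict.
    num_players = len(Q_table)
    num_states = len(Q_table[0])
    num_actions = len(Q_table[0][0])
    Q_tensor = torch.zeros(num_players, num_states, num_actions)
    for p, states in Q_table.items():
        for s, actions in states.items():
            for a, value in actions.items():
                Q_tensor[p, s, a] = value
    return Q_tensor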
- InflGame.MARL.utils.IQL_utils.Q_tensor_to_policy(q_tensor, temperature=0.5, agent_id=0)#
Converts a Q-tensor into a policy tensor using a softmax function.
\[\pi(a|s) = \frac{\exp(Q(s, a) / T)}{\sum_{a'} \exp(Q(s, a') / T)}\]
- where:
\(a\) is the action.
\(a'\) ranges over the set of all possible actions.
\(s\) is the state.
\(Q(s, a)\) is the Q-value for state \(s\) and action \(a\).
\(T\) is the temperature parameter.
\(\pi(a|s)\) is the policy for action \(a\) given state \(s\).
- param torch.Tensor q_tensor:
Q-tensor for all players.
- param float temperature:
Temperature parameter for softmax.
- param int agent_id:
ID of the player for which to compute the policy.
- return:
Policy tensor for the specified player.
- rtype:
torch.Tensor
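A minimal sketch of the softmax conversion above, assuming q_tensor[agent_id] has shape (num_states, num_actions); the helper name is illustrative:

import torch

def q_tensor_to_policy_sketch(q_tensor, temperature=0.5, agent_id=0):
    # Softmax over the action dimension yields one distribution per state.
    return torch.softmax(q_tensor[agent_id] / temperature, dim=-1)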
- InflGame.MARL.utils.IQL_utils.adjusted_episodes(configs, epoch, epochs)#
Adjusts the number of episodes based on the specified configuration and epoch progress.
- Parameters:
configs (dict) – Configuration dictionary containing the episode adjustment parameters: TYPE (str): schedule type; episodes_min (int): minimum number of episodes; episodes_max (int, optional): maximum number of episodes.
epoch (int) – Current epoch number.
epochs (int) – Total number of epochs.
- Returns:
Adjusted number of episodes.
- Return type:
int
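For illustration, a hedged sketch of a cosine-annealing episode schedule using the configuration keys listed above (TYPE, episodes_min, episodes_max); the supported TYPE values and the fallback behaviour are assumptions, not the library's documented logic:

import math

def adjusted_episodes_sketch(configs, epoch, epochs):
    e_min = configs['episodes_min']
    e_max = configs.get('episodes_max', e_min)
    if configs.get('TYPE') == 'cosine_annealing':
        # Anneal from episodes_max down to episodes_min as epoch approaches epochs.
        return int(e_min + 0.5 * (e_max - e_min) * (1 + math.cos(math.pi * epoch / epochs)))
    return e_min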
- InflGame.MARL.utils.IQL_utils.adjusted_epsilon(configs, num_agents, episode, episodes)#
Adjusts the epsilon value based on the specified configuration and episode progress. For cosine annealing:
\[\epsilon = \epsilon_{\text{min}} + \frac{1}{2} (\epsilon_{\text{max}} - \epsilon_{\text{min}}) \left(1 + \cos\left(\frac{\pi \cdot \text{episode}}{\text{episodes}}\right)\right)\]
- param dict configs:
Configuration dictionary containing epsilon adjustment parameters.
- param int num_agents:
Number of players in the game.
- param int episode:
Current episode number.
- param int episodes:
Total number of episodes.
- return:
Adjusted epsilon value.
- rtype:
float
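A sketch of the cosine-annealing branch of this schedule, matching the formula above and the configuration keys from the module example (TYPE, epsilon_min, epsilon_max); num_agents is omitted here for brevity, so this is a simplification rather than the library's implementation:

import math

def adjusted_epsilon_sketch(configs, episode, episodes):
    eps_min, eps_max = configs['epsilon_min'], configs['epsilon_max']
    if configs.get('TYPE') == 'cosine_annealing':
        return eps_min + 0.5 * (eps_max - eps_min) * (1 + math.cos(math.pi * episode / episodes))
    return eps_min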
- InflGame.MARL.utils.IQL_utils.adjusted_temperature(configs, observation, observation_space_size)#
Adjusts the temperature value based on the specified configuration and observation.
- Parameters:
configs (dict) – Configuration dictionary containing temperature adjustment parameters.
observation (int) – Current observation value.
observation_space_size (int) – Size of the observation space.
- Returns:
Adjusted temperature value.
- Return type:
float
- InflGame.MARL.utils.IQL_utils.cosine_annealing_distance_dependent(value_max, value_min, time, time_crit, time_max, max_distance=None)#
Computes a value using cosine annealing based on the distance from a critical time.
\[v(t) = v_{\text{min}} + \frac{1}{2} (v_{\text{max}} - v_{\text{min}}) \left(1 + \cos\left(\frac{\pi \cdot |t - t_{\text{crit}}|}{d_{\text{max}}}\right)\right)\]
- where:
\(v(t)\) is the computed value at time \(t\).
\(v_{\text{max}}\) is the maximum value.
\(v_{\text{min}}\) is the minimum value.
\(t_{\text{crit}}\) is the critical time step.
\(d_{\text{max}}\) is the maximum distance from the critical time, defaulting to half of \(t_{\text{max}}\).
This function adjusts the value smoothly using a cosine function, depending on the distance from a critical time step.
- Parameters:
value_max (float) – Maximum value.
value_min (float) – Minimum value.
time (int) – Current time step.
time_crit (int) – Critical time step.
time_max (int) – Maximum time step.
max_distance (int) – Maximum distance from the critical time. Defaults to half of time_max.
- Returns:
Computed value based on cosine annealing.
- Return type:
float
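A direct transcription of the formula above; clamping the distance at max_distance (so the cosine argument stays within [0, π]) is an assumption rather than documented behaviour:

import math

def cosine_annealing_distance_dependent_sketch(value_max, value_min, time, time_crit, time_max, max_distance=None):
    if max_distance is None:
        max_distance = time_max / 2  # default stated in the parameter description
    distance = min(abs(time - time_crit), max_distance)  # assumed clamp
    return value_min + 0.5 * (value_max - value_min) * (1 + math.cos(math.pi * distance / max_distance))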
- InflGame.MARL.utils.IQL_utils.q_tables_to_q_tensors(num_runs, q_tables)#
Converts multiple Q-tables into a stacked tensor representation.
- Parameters:
num_runs (int) – Number of runs (Q-tables).
q_tables (dict) – Dictionary containing Q-tables for each run.
- Returns:
Stacked tensor representation of all Q-tables.
- Return type:
torch.Tensor
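A minimal sketch of the stacking step, assuming q_tables is keyed by run index 0..num_runs-1 and reusing Q_table_to_tensor from this module:

import torch
from InflGame.MARL.utils.IQL_utils import Q_table_to_tensor

def q_tables_to_q_tensors_sketch(num_runs, q_tables):
    # Stack one per-run Q-tensor along a new leading dimension.
    return torch.stack([Q_table_to_tensor(q_tables[run]) for run in range(num_runs)])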
- InflGame.MARL.utils.IQL_utils.reverse_cosine_annealing(value_max, value_min, time, time_max)#
Reverse cosine annealing adjusts a value smoothly over time, starting from the minimum value and increasing toward the maximum value, following a cosine curve.
\[v(t) = v_{\text{min}} + v_{\text{max}} - \left\lfloor v_{\text{min}} + \frac{1}{2} (v_{\text{max}} - v_{\text{min}}) \left(1 + \cos\left(\frac{\pi \cdot t}{t_{\text{max}}}\right)\right) \right\rfloor\]
- where:
\(v(t)\) is the computed value at time \(t\).
\(v_{\text{max}}\) is the maximum value.
\(v_{\text{min}}\) is the minimum value.
\(t_{\text{max}}\) is the maximum time step.
- param float value_max:
Maximum value.
- param float value_min:
Minimum value.
- param int time:
Current time step.
- param int time_max:
Maximum time step.
- return:
Computed value based on reverse cosine annealing.
- rtype:
float
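A direct transcription of the reverse cosine annealing formula above, for illustration only:

import math

def reverse_cosine_annealing_sketch(value_max, value_min, time, time_max):
    # Starts near value_min at time 0 and rises toward value_max at time_max.
    cosine_term = value_min + 0.5 * (value_max - value_min) * (1 + math.cos(math.pi * time / time_max))
    return value_min + value_max - math.floor(cosine_term)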