Double DQN For Lunar Lander With PyTorch

Deep Q-Networks (DQN) have revolutionized the field of reinforcement learning, achieving human-level performance in various environments.

However, DQNs often suffer from overestimation bias, which can degrade their performance. Double DQN is an improved version designed to mitigate this issue by separating action selection from action evaluation. In this blog post, we will implement Double DQN using PyTorch to solve the Lunar Lander environment from OpenAI Gym.

Understanding Double DQN

In a standard DQN, the same network is used for both selecting and evaluating actions. This can lead to overoptimistic value estimates. Double DQN addresses this by using two separate networks.

  • Online Network: Used to select actions.
  • Target Network: Used to evaluate the selected actions.

By decoupling these roles, Double DQN reduces overestimation and stabilizes training.
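
Concretely, the two variants differ only in how the temporal-difference (TD) target is formed. A minimal pseudo-Python sketch, where Q_online and Q_target stand for the online and target networks:

# Standard DQN target: one network both selects and evaluates the next action
# y = reward + gamma * Q_target(next_state).max()

# Double DQN target: the online network selects, the target network evaluates
# best_action = Q_online(next_state).argmax()
# y = reward + gamma * Q_target(next_state)[best_action]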

Implementing Double DQN in PyTorch

Basic Installation

!pip install gym==0.25.2
!pip install swig
!pip install gym[box2d]

Import

import os
import gym
import torch
import random
import copy
from collections import deque
import numpy as np
import matplotlib.pyplot as plt

DQN Class

The ‘DQN’ class encapsulates the functionality of a Deep Q-Network, including the network architecture, prediction, updating, replay and target network management.

‘__init__’ Method

def __init__(self, n_state, n_action, n_hidden=50, lr=0.05, device='cpu'):
    self.device = device
    self.criterion = torch.nn.MSELoss()
    self.model = torch.nn.Sequential(
        torch.nn.Linear(n_state, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, n_action)
    ).to(self.device)

    self.model_target = copy.deepcopy(self.model).to(self.device)
    self.optimizer = torch.optim.Adam(self.model.parameters(), lr)
  • Parameters:
    • ‘n_state’: The number of state inputs (dimensionality of the state space).
    • ‘n_action’: The number of possible actions.
    • ‘n_hidden’: The number of hidden units in the hidden layers.
    • ‘lr’: Learning rate for the optimizer.
    • ‘device’: The device to run the model on (‘cpu’ or ‘cuda’).
  • Attributes:
    • ‘device’: Device to run the model (‘cpu’ or ‘cuda’).
    • ‘criterion’: Loss function (Mean Squared Error Loss).
    • ‘model’: The main Q-network.
    • ‘model_target’: The target Q-network (a copy of the main network).
    • ‘optimizer’: Optimizer for the main network (Adam).
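
As a quick illustration, here is a hypothetical standalone instantiation for LunarLander-v2, which has an 8-dimensional state and 4 discrete actions (the name ‘agent’ and the hidden size are arbitrary choices for this example):

agent = DQN(n_state=8, n_action=4, n_hidden=64, lr=1e-3, device='cpu')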

‘update’ Method

def update(self, s, y):
    s = torch.Tensor(np.array(s)).to(self.device)
    y = torch.Tensor(np.array(y)).to(self.device)
    y_pred = self.model(s)
    loss = self.criterion(y_pred, y)
    self.optimizer.zero_grad()
    loss.backward()
    self.optimizer.step()
  • Parameters:
    • ‘s’: Batch of states.
    • ‘y’: Corresponding target Q-values.
  • Functionality:
    • Converts states and targets to tensors.
    • Predicts Q-values using the current network.
    • Computes the loss between predicted and target Q-values.
    • Performs backpropagation and updates the network weights.
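
As a quick shape check, here is a hypothetical call that reuses the ‘agent’ instance from above with a dummy batch of 32 LunarLander states:

dummy_states = np.zeros((32, 8), dtype=np.float32)   # 32 states with 8 features each
dummy_targets = np.zeros((32, 4), dtype=np.float32)  # one target Q-value per action
agent.update(dummy_states, dummy_targets)            # y_pred and the targets both have shape (32, 4)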

‘predict’ Method

def predict(self, s):
    with torch.no_grad():
        return self.model(torch.Tensor(np.array(s)).to(self.device))
  • Parameters:
    • ‘s’: State(s) for which to predict Q-values.
  • Functionality:
    • Predicts Q-values using the current network without updating the weights (no gradient calculation).

‘target_predict’ Method

def target_predict(self, s):
    with torch.no_grad():
        return self.model_target(torch.Tensor(np.array(s)).to(self.device))
  • Parameters:
    • ‘s’: State(s) for which to predict Q-values.
  • Functionality:
    • Predicts Q-values using the target network without updating the weights (no gradient calculation).

‘replay’ Method

def replay(self, memory, replay_size, gamma):
    if len(memory) >= replay_size:
        replay_data = random.sample(memory, replay_size)
        states = []
        td_targets = []
        for state, action, next_state, reward, is_done in replay_data:
            states.append(state)
            q_values = self.predict(state).tolist()
            if is_done:
                q_values[action] = reward
            else:
                next_action = torch.argmax(self.predict(next_state)).item()
                q_values_next = self.target_predict(next_state).detach()
                q_values[action] = reward + gamma * q_values_next[next_action].item()
            td_targets.append(q_values)
        self.update(states, td_targets)
  • Parameters:
    • ‘memory’: Replay buffer containing experience tuples.
    • ‘replay_size’: Number of samples to draw from the replay buffer.
    • ‘gamma’: Discount factor for future rewards.
  • Functionality:
    • Samples a batch of experiences from the replay buffer.
    • For each experience:
      • Predicts Q-values for the current state.
      • Updates the Q-value for the action taken: the online network selects the best next action and the target network evaluates it (a small numeric example follows this list).
    • Updates the network with the computed target Q-values.
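
Here is a small numeric example of that target computation for a single transition (toy Q-values and gamma = 0.99, purely for illustration):

reward, gamma = 1.0, 0.99
q_online_next = [1.0, 3.0, 2.0, 0.5]   # online network's Q-values for the next state
q_target_next = [0.8, 2.5, 2.9, 0.4]   # target network's Q-values for the same state
next_action = q_online_next.index(max(q_online_next))    # online network selects action 1
td_target = reward + gamma * q_target_next[next_action]  # 1.0 + 0.99 * 2.5 = 3.475
# A vanilla DQN would instead use max(q_target_next) = 2.9, giving a higher (over)estimate.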

‘copy_target’ Method

def copy_target(self):
    self.model_target.load_state_dict(self.model.state_dict())
  • Functionality:
    • Copies the weights from the main network to the target network to keep them in sync.
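
A small optional sanity check that the copy worked, reusing the hypothetical ‘agent’ instance from above:

agent.copy_target()
for p_online, p_target in zip(agent.model.parameters(), agent.model_target.parameters()):
    assert torch.equal(p_online, p_target)  # every weight tensor matches after the copy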

‘save_model’ Method

def save_model(self, file_path):
    torch.save(self.model.state_dict(), file_path)
  • Parameters:
    • ‘file_path’: Path to save model.
  • Functionality:
    • Saves the state dictionary (weights) of the main network to a file.

‘load_model’ Method

def load_model(self, file_path):
    self.model.load_state_dict(torch.load(file_path, map_location=self.device))
    self.model_target = copy.deepcopy(self.model)
  • Parameters:
    • ‘file_path’: Path to load the model from.
  • Functionality:
    • Loads the state dictionary (weights) from a file into the main network.
    • Copies the weights to the target network to keep them in sync.

Exploration and Exploitation

def gen_epsilon_greedy_policy(estimator, epsilon, n_action):
    def policy_function(state):
        if random.random() < epsilon:
            return random.randint(0, n_action - 1)
        else:
            q_values = estimator.predict(state)
            return torch.argmax(q_values).item()
    return policy_function

Breakdown of the ‘gen_epsilon_greedy_policy’ function

def gen_epsilon_greedy_policy(estimator, epsilon, n_action):
  • Parameters:
    • ‘estimator’: An instance of the DQN class (or any model that has ‘predict’ method) used to predict Q-values.
    • ‘epsilon’: The probability of choosing a random action (exploration). It controls the trade-off between exploration and exploitation.
    • ‘n_action’: The number of possible actions.
  • Returns:
    • A policy function that takes a state as input and returns an action.

Inner Policy Function:

def policy_function(state):
    if random.random() < epsilon:
        return random.randint(0, n_action - 1)
    else:
        q_values = estimator.predict(state)
        return torch.argmax(q_values).item()
  • Parameters:
    • ‘state’: The current state for which an action needs to be decided.
  • Return:
    • An action based on the epsilon-greedy policy.

Exploration:

if random.random() < epsilon:
    return random.randint(0, n_action - 1)
  • ‘random.random()’: generates a random float between 0 and 1.
  • If this value is less than ‘epsilon’, a random action is chosen from the action space using ‘random.randint(0, n_action - 1)’.

Exploitation:

else:
    q_values = estimator.predict(state)
    return torch.argmax(q_values).item()
  • If the random value is greater than or equal to ‘epsilon’, the policy chooses the action with the highest predicted Q-value.
  • ‘estimator.predict(state)’ returns the Q-values for all actions in the given state.
  • ‘torch.argmax(q_values).item()’ selects the action with the highest Q-value.

Return of the Policy Function:

return policy_function
  • The ‘gen_epsilon_greedy_policy’ function returns the ‘policy_function’ which can be used to decide actions based on the current state.
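
A short hypothetical usage example, reusing the ‘agent’ instance from above and an all-zeros state just for illustration:

policy = gen_epsilon_greedy_policy(agent, epsilon=0.1, n_action=4)
action = policy(np.zeros(8, dtype=np.float32))  # an integer in {0, 1, 2, 3}; random ~10% of the time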

Environment

env = gym.envs.make("LunarLander-v2")

Hyperparameters

Since ‘n_state’ and ‘n_action’ are read from the environment, the environment is created first.

n_episode = 1000
n_state = env.observation_space.shape[0]
n_action = env.action_space.n
n_hidden = 62
target_update = 10
replay_size = 142
lr = 0.001
epsilon = 0.1
epsilon_decay = 0.99
gamma = 1.0
memory = deque(maxlen=10000)

Path to save

PATH = '/DDQN/Models'
os.makedirs(PATH, exist_ok=True)  # create the folder if it does not exist yet
os.chdir(PATH)

Device

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Instantiate the model and try to load a saved one

dqn = DQN(n_state, n_action, n_hidden, lr, device)
model_path = "double_dqn_lunar_lander.pth"
if os.path.exists(model_path):
    dqn.load_model(model_path)
    print("Loaded existing model.")
else:
    print("No existing model found. Starting training from scratch.")

Training

Save rewards of each episode

total_reward_episode = [0] * n_episode

Initialization and Setup

for episode in range(n_episode):

Purpose: This line starts a loop that runs for a total of ‘n_episode’ episodes. Each iteration represents a single episode of the agent interacting with the environment.

Target Network Update

if episode % target_update == 0:
    dqn.copy_target()

Purpose: Every target_update episodes, the target network is updated to have the same weights as the online network. This periodic update helps stabilize training by ensuring that the target values used in training are not rapidly changing.

Policy Generation

policy = gen_epsilon_greedy_policy(dqn, epsilon, n_action)

Purpose: An epsilon-greedy policy is generated for the current episode. This policy balances exploration (choosing a random action) and exploitation (choosing the best action based on current knowledge).

Episode Initialization

state = env.reset()
is_done = False

Purpose: The environment is reset to its initial state, and the is_done flag is set to False to indicate that the episode is not yet finished.

Main Loop (Within an Episode)

while not is_done:

Purpose: This loop runs until the episode is finished.

Action Selection and Environment Step

action = policy(state)
next_state, reward, is_done, _ = env.step(action)

Purpose: An action is selected using the policy, and the environment performs this action. The environment returns the next state, the reward, and whether the episode is done.

Reward Accumulation

total_reward_episode[episode] += reward

Purpose: The reward received from the environment is added to the total reward for the current episode.

Memory Update

memory.append((state, action, next_state, reward, is_done))

Purpose: The experience (current state, action, next state, reward, and done flag) is added to the replay memory for future training.
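
Because ‘memory’ is a deque created with maxlen=10000, the oldest experiences are discarded automatically once the buffer is full. A tiny illustration with a buffer of size 3:

buf = deque(maxlen=3)
for i in range(5):
    buf.append(i)
print(list(buf))  # [2, 3, 4] -- the two oldest entries were dropped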

Episode Termination Check

if is_done:
    break

Purpose: If the episode is finished (i.e., the agent reached a terminal state), the loop breaks, and the episode ends.

Model Training

dqn.replay(memory, replay_size, gamma)

Purpose: The agent samples a batch of experiences from the replay memory and uses them to update the online network’s weights.

State Update

state = next_state

Purpose: The current state is updated to the next state for the next iteration of the loop.

Episode Completion and Logging

print(f"Episode: {episode} Reward: {total_reward_episode[episode]}")
epsilon = max(epsilon * epsilon_decay, 0.01)

Purpose: After the episode ends, the total reward for the episode is printed. The epsilon value is decayed to reduce the exploration rate over time, but it is kept above a minimum threshold (0.01).
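
With epsilon = 0.1 and epsilon_decay = 0.99, the exploration rate hits the 0.01 floor after roughly 230 episodes; a quick check:

import math
episodes_to_floor = math.log(0.01 / 0.1) / math.log(0.99)
print(round(episodes_to_floor))  # ~229 episodes until epsilon bottoms out at 0.01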

Model Saving

if (episode + 1) % 50 == 0:
    dqn.save_model(model_path)
    print(f"Model saved after episode {episode + 1}")

Purpose: Every 50 episodes, the current state of the model is saved to a specified path. This allows the training to be resumed later if interrupted and preserves the progress.

The Entire Code For Training


for episode in range(n_episode):
    if episode % target_update == 0:
        dqn.copy_target()
    policy = gen_epsilon_greedy_policy(dqn, epsilon, n_action)
    state = env.reset()
    is_done = False

    while not is_done:
        action = policy(state)
        next_state, reward, is_done, _ = env.step(action)
        total_reward_episode[episode] += reward
        memory.append((state, action, next_state, reward, is_done))

        if is_done:
            break

        dqn.replay(memory, replay_size, gamma)
        state = next_state

    print(f"Episode: {episode} Reward: {total_reward_episode[episode]}")
    epsilon = max(epsilon * epsilon_decay, 0.01)

    # Save the model every 50 episodes
    if (episode + 1) % 50 == 0:
        dqn.save_model(model_path)
        print(f"Model saved after episode {episode + 1}")

Plotting The Progress

# Plot the rewards
plt.plot(total_reward_episode)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Double DQN on Lunar Lander')
plt.show()
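
The per-episode curve is usually noisy, so an optional smoothing pass can make the trend easier to read; here is a sketch using a 20-episode moving average (the window size is an arbitrary choice):

window = 20  # arbitrary smoothing window
rewards = np.array(total_reward_episode)
smoothed = np.convolve(rewards, np.ones(window) / window, mode='valid')

plt.plot(rewards, alpha=0.3, label='Per-episode reward')
plt.plot(range(window - 1, len(rewards)), smoothed, label=f'{window}-episode moving average')
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Double DQN on Lunar Lander (smoothed)')
plt.legend()
plt.show()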

Summarizing Double Deep Q-Network

  • Purpose:
    • Improve stability and accuracy of Q-Learning by addressing overestimation bias in standard DQN.
  • Network Architecture:
    • Online Network: Used for selecting actions.
    • Target Network: Used for evaluating the value of the selected actions.
  • Experience Replay:
    • Stores experiences (state, action, next state, reward, done) to break correlations between consecutive samples.
    • Randomly samples from this memory to update the network, ensuring diverse and stable learning.
  • Training Process:
    • Action Selection: The online network selects the action.
    • Action Evaluation: The target network evaluates the Q-values of the next state to compute the target for the Bellman equation.
    • Update: The online network is updated with these targets, while the target network is periodically synchronized with the online network.
  • Epsilon-Greedy Policy:
    • Balances exploration (random actions) and exploitation (best known actions) to improve learning efficiency.
  • Periodic Saving:
    • The model is saved periodically to ensure progress is not lost and training can be resumed.

DDQN mitigates the issue of overestimation by decoupling action selection and action evaluation, leading to more robust learning and better performance in reinforcement learning tasks.

Link to GitHub