Bipedal Walker with PPO: A Step-by-Step Guide to Solving the RL Challenge
In this tutorial we will learn how to master the Bipedal Walker environment with PPO (Proximal Policy Optimization). The tutorial begins with a general diagram of the PPO training flow, followed by details of each method that is pivotal in building an agent to solve the given problem.
You can find the code at the GitHub link.
You can also find a video of our agent in action at the YouTube link.
PPO Training Flow Diagram
Below is an outline of how the training happens and how the different methods in the PPO class interact:
+----------------------+
| train() |
+----------+-----------+
|
v
+-----------------------------------+
| run_episode() |
+-----------------------------------+
| |
| |
v v
+---------------------+ +-----------------------+
| Take action using | | Collect transitions |
| policy_net.forward() | | (state, action, reward)|
+---------------------+ +-----------------------+
|
v
+-------------------------------+
| Compute GAE and returns |
| compute_gae(), |
| compute_discounted_return() |
+-------------------------------+
|
v
+--------------------------------------------+
| ppo_update(): Update policy & value network|
+--------------------------------------------+
|
+-----------------+---------------------+
| |
v v
+---------------------------+ +--------------------------+
| Compute loss using | | Compute loss using value |
| policy_net.evaluate() | | network for critic |
+---------------------------+ +--------------------------+
| |
v v
+---------------------------+ +--------------------------+
| Optimize policy using | | Optimize critic using |
| opt_policy.step() | | opt_value.step() |
+---------------------------+ +--------------------------+
|
v
+-----------------------------+
| Store metrics: |
| episode_rewards, policy_loss |
+-----------------------------+
|
v
+--------------------------------------------+
| save_metrics(): Save model periodically |
+--------------------------------------------+
Detailed Interaction:
- train() Method:
- Calls run_episode() to interact with the environment and gather experience (states, actions, rewards, etc.).
- Calls ppo_update() to update the policy and value networks using the collected experience.
- run_episode() Method:
- Resets the environment and iterates through the episode steps.
- Calls the policy network (policy_net.forward()) to sample actions based on the current state.
- Collects experience (state, action, reward, value, log-probabilities).
- Computes returns and advantages using Generalized Advantage Estimation (GAE).
- ppo_update() Method:
- Samples mini-batches of data from the collected experience.
- Computes the policy and value loss:
- Policy loss: Compares the old and new log-probabilities to ensure the update is within a trust region (using a clip to the ratio of new/old probabilities).
- Value loss: Updates the critic based on the prediction error of the value function.
- Optimizes the policy and value networks using Adam optimizers.
- save_metrics():
- Saves the policy network, value network, and tracked metrics (rewards, losses, entropy) to a file for checkpointing.
- compute_discounted_return() and compute_gae():
- These methods compute the discounted returns and advantages, which are crucial for training the agent.
This interaction allows the PPO agent to optimize both the policy and the value function iteratively by collecting experience, calculating the necessary loss terms, and updating the networks.
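To make this interaction concrete, here is a condensed sketch of what a train() method along these lines can look like. It assumes a PPO class exposing the run_episode(), ppo_update(), and save_metrics() methods described above; the episode count, checkpoint interval, metric attribute names, and the exact ppo_update() argument list are illustrative assumptions, not the repository's exact code.

```python
def train(self, n_episodes=2000, save_every=50):
    """Outer training loop: collect, update, log, checkpoint (illustrative sketch)."""
    for episode in range(n_episodes):
        # 1. Gather one episode of on-policy experience with the current policy.
        states, actions, log_probs, values, returns, advantages = self.run_episode()

        # 2. Run the clipped PPO update on that rollout.
        policy_loss, value_loss, entropy = self.ppo_update(
            states, actions, log_probs, values, returns, advantages
        )

        # 3. Track metrics and checkpoint periodically.
        self.policy_losses.append(policy_loss)
        self.value_losses.append(value_loss)
        self.entropies.append(entropy)
        if (episode + 1) % save_every == 0:
            self.save_metrics()
```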
Run Episode
Here’s a detailed breakdown of how the run_episode() method functions, with its interaction flow. This will help to visualize how the different steps in the method interact with each other.
+----------------------------+
| run_episode() |
+----------------------------+
|
v
+-----------------------------------+
| Environment reset: env.reset() |
+-----------------------------------+
|
v
+---------------------------------------------+
| For each step in the episode: |
+--------------------+------------------------+
|
v
+-------------------------------------+
| Convert state to tensor |
+-------------------------------------+
|
v
+------------------------------------------+
| Call policy_net(state): |
| Get action and log probability |
+------------------------------------------+
|
v
+-------------------------------------+
| Call value_net(state): |
| Get value estimate for current state|
+-------------------------------------+
|
v
+---------------------------------------------+
| Environment step: env.step(action) |
| Get new state, reward, done flag |
+---------------------------------------------+
|
v
+---------------------------------------------+
| Store experience: |
| (state, action, reward, value, log_prob) |
+---------------------------------------------+
|
v
+----------------------+
| Check 'done' flag |
+----------+-----------+
|
+------------+-------------+
| |
v v
+--------------+ +--------------------+
| If done: | | If not done: |
| Break loop | | Continue loop |
+--------------+ +--------------------+
|
v
+-----------------------------------------------+
| Compute last value (value_net(last_state)) |
+-----------------------------------------------+
|
v
+---------------------------------------------------+
| Compute GAE and returns: |
| compute_gae() or compute_discounted_return() |
+---------------------------------------------------+
|
v
+---------------------------------------------------+
| Return experience: |
| (states, actions, log_probs, values, returns, advs)|
+---------------------------------------------------+
Detailed Interaction
- Environment Reset:
- The environment is reset to its initial state at the start of the episode.
- For Each Step in the Episode:
- Convert State: The current state is converted into a PyTorch tensor to be passed to the neural networks.
- Policy Network Call:
- The policy network (policy_net) is called to determine the action and log probability of that action based on the current state.
- Value Network Call:
- The value network (value_net) is called to estimate the value of the current state, which is important for the advantage estimation.
- Environment Step:
- The agent performs the action in the environment, which returns the next state, the reward, whether the episode is done, and other possible info.
- Store Experience:
- The current state, action, reward, value, and log probability are stored for later training (to be used in PPO updates).
- Check done Flag:
- If the done flag is True (i.e., the episode has ended), the loop breaks. Otherwise, it continues until the maximum steps are reached or the episode ends.
- Last Value:
- Once the episode is over, the value of the final state is estimated using the value network. This helps with calculating the returns and advantages for training.
- Compute GAE/Returns:
- After the episode ends, Generalized Advantage Estimation (GAE) or discounted returns are computed to help calculate the policy gradients and value loss during training.
- Return Experience:
- The method returns all the collected experience, including states, actions, log probabilities, value estimates, returns, and advantages, to be used in the PPO update.
This interaction flow ensures that the PPO agent collects the required data from the environment and prepares it for the training step (ppo_update()).
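A condensed sketch of run_episode() following this flow is shown below. It assumes the agent holds self.env, self.policy_net, self.value_net, and a compute_gae() method (covered in the next section), and that the environment follows the Gymnasium API (reset() returning (state, info) and step() returning a 5-tuple); max_steps and the exact return format are assumptions.

```python
import numpy as np
import torch

def run_episode(self, max_steps=1600):
    """Collect one episode of experience (sketch; Gymnasium-style env API assumed)."""
    states, actions, rewards, values, log_probs = [], [], [], [], []
    state, _ = self.env.reset()

    for _ in range(max_steps):
        # Convert the state to a tensor with a batch dimension.
        state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            action, log_prob = self.policy_net(state_t)   # sample action + its log-prob
            value = self.value_net(state_t)               # critic estimate V(s)

        next_state, reward, terminated, truncated, _ = self.env.step(
            action.squeeze(0).numpy()
        )

        # Store the transition for the PPO update.
        states.append(state)
        actions.append(action.squeeze(0).numpy())
        rewards.append(reward)
        values.append(value.item())
        log_probs.append(log_prob.item())

        state = next_state
        if terminated or truncated:   # 'done' check
            break

    # Value of the state after the last transition, used to bootstrap GAE.
    with torch.no_grad():
        last_value = self.value_net(
            torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        ).item()

    advantages, returns = self.compute_gae(rewards, values, last_value)
    return (np.array(states), np.array(actions), np.array(log_probs),
            np.array(values), returns, advantages)
```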
Computing Generalized Advantage Estimation (GAE)
Here’s a detailed flow diagram and breakdown for how the compute_gae() method works and how it interacts with other parts of the PPO system.
+----------------------------------------+
| compute_gae() |
+----------------------------------------+
|
v
+------------------------------------------------+
| Initialize: advs = zeros_like(rewards) |
| last_gae_lam = 0 |
+------------------------------------------------+
|
v
+------------------------------------------+
| Iterate over timesteps in reverse order: |
+-----------------------+------------------+
|
v
+----------------------------------------------+
| If last step: |
| next_value = last_value |
| Else: |
| next_value = value of the next step |
+----------------------------------------------+
|
v
+---------------------------------------------+
| Compute TD error (delta): |
| delta = reward + gamma * next_value - value |
+---------------------------------------------+
|
v
+--------------------------------------------------------+
| Compute GAE lambda (generalized advantage estimation): |
| advs[t] = delta + gamma * lambda * last_gae_lam |
| last_gae_lam = advs[t] |
+--------------------------------------------------------+
|
v
+--------------------------------------------+
| Return final advantages: advs + values |
+--------------------------------------------+
Detailed Interaction Flow:
- Initialization:
- advs: A zero array is initialized with the same shape as the rewards. This will store the advantage estimates for each timestep.
- last_gae_lam: A variable initialized to store the accumulated GAE value over the loop. It helps propagate advantages backward through time.
- Reverse Loop Over Timesteps:
- The method loops over the timesteps in reverse order, starting from the end of the episode and working backward. This reverse iteration is important for propagating future rewards and value estimates backward through time, as GAE incorporates both past and future information.
- Next Value Estimation:
- If it’s the last step of the episode, the next value is set to the final value predicted by the value network at the end of the episode (last_value).
- For earlier steps, the next value is simply the value of the next step in the stored values.
- Compute Temporal Difference (TD) Error (delta):
- The TD error measures how much better or worse the actual reward is compared to the predicted value:
- delta = reward + gamma * next_value - value
- This term represents the error in the value estimate for a particular step and is crucial in calculating the advantage.
- Compute GAE for Each Timestep:
- The GAE advantage is computed using:
- advs[t] = delta + gamma * lambda * last_gae_lam
- last_gae_lam = advs[t]
- Generalized Advantage Estimation (GAE) smooths the estimate by using a decaying sum of future TD errors. The parameter lambda controls the trade-off between bias and variance (higher lambda values mean longer-term rewards are considered).
- The calculated advs[t] is then stored in the advs array for that particular timestep.
- Return Final Advantages:
- Once all timesteps are processed in reverse order, the method returns the sum of the computed advantages (advs) and the original value estimates (values). These combined values are the final returns that are used during the PPO update.
The GAE method iterates over the rewards and value estimates from the end of the episode to the start. It calculates the advantage for each timestep by incorporating future rewards and the temporal difference error. The resulting advantage values, when added to the original value estimates, provide the final returns used in the PPO update. This process helps to reduce the variance in the advantage estimates while keeping the bias under control, providing more stable training for the PPO agent.
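Below is a minimal, self-contained sketch of this computation. It assumes the rollout is a single episode (so no per-step done masking is needed) and that last_value is the critic's estimate for the state after the final transition; the gamma and lam defaults are typical choices rather than the repository's exact values, and for clarity the sketch returns the advantages and the returns separately instead of only their sum.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE over a single episode (sketch; gamma/lam are typical defaults)."""
    rewards = np.asarray(rewards, dtype=np.float32)
    values = np.asarray(values, dtype=np.float32)
    advs = np.zeros_like(rewards)
    last_gae_lam = 0.0

    # Walk backward so each advantage folds in the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        # TD error: one-step lookahead vs. the critic's current estimate.
        delta = rewards[t] + gamma * next_value - values[t]
        # Decaying sum of future TD errors; lam trades bias against variance.
        last_gae_lam = delta + gamma * lam * last_gae_lam
        advs[t] = last_gae_lam

    # Adding the value estimates back gives the returns used as critic targets.
    returns = advs + values
    return advs, returns
```

For example, compute_gae([1.0, 0.5, -1.0], [0.2, 0.1, 0.0], last_value=0.0) yields the per-step advantages and the corresponding critic targets for a three-step rollout.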
The Update Method
Here’s a flow diagram and breakdown of how the ppo_update() method works and how it interacts with the rest of the PPO system:
+---------------------------------------------------+
| ppo_update() |
+---------------------------------------------------+
|
v
+----------------------------------------------------------+
| Convert rollout data (states, actions, etc.) to tensors |
+----------------------------------------------------------+
|
v
+---------------------------------------------------------------+
| Calculate episode length and mini-batch sample size (optional) |
+---------------------------------------------------------------+
|
v
+------------------------------+---------------------------------+
| Shuffle indices for mini-batch| Start multi-epoch update loop |
| training | (loop over `n_epoch`) |
+------------------------------+---------------------------------+
|
v
+------------------------------------------------+
| Select a mini-batch of states, actions, |
| returns, advantages, etc., from the rollout |
+------------------------------------------------+
|
v
+---------------------------------------------+-----------------------------+
| Calculate log probabilities (policy) | Compute values (value_net) |
| and entropy for actions using policy_net | |
+---------------------------------------------+-----------------------------+
|
v
+------------------------------------------------------------+
| Compute value loss (Critic loss): |
| - Use Clipped and Unclipped value loss (clip value targets)|
+------------------------------------------------------------+
|
v
+----------------------------------------------------------------+
| Compute policy loss (Actor loss): |
| - Calculate ratio of new and old policy probabilities |
| - Compute clipped and unclipped policy loss (trust region) |
| - Subtract entropy-weighted term to encourage exploration |
+----------------------------------------------------------------+
|
v
+------------------------------------------------+
| Backpropagate and optimize both policy and |
| value networks using Adam optimizers |
+------------------------------------------------+
|
v
+------------------------------------------------+
| Repeat for each mini-batch and epoch iteration |
+------------------------------------------------+
|
v
+---------------------------------------------+
| Return average policy loss, value loss, and |
| entropy for logging |
+---------------------------------------------+
Detailed Steps of ppo_update():
- Convert Rollout Data to Tensors:
- Convert the numpy arrays (states, actions, returns, advantages, log probabilities, etc.) from the rollout data into PyTorch tensors, which are then moved to the specified device (CPU or GPU). This step ensures that the data is ready for use in the training process.
- Calculate Episode Length and Mini-Batch Size:
- The method calculates the total length of the episode (number of timesteps in the rollout).
- It then determines the mini-batch size and the number of mini-batches to use for training, ensuring that the mini-batches fit within the total episode length. If the mini-batch size exceeds the episode length, the entire episode is treated as one mini-batch.
- Shuffle Indices for Mini-Batch Training:
- A set of indices is generated and shuffled to create random mini-batches for each epoch of the PPO update. This helps in decorrelating the training samples, leading to more stable learning.
- Start Multi-Epoch Update Loop:
- The PPO algorithm performs multiple epochs of gradient updates on the policy and value networks. For each epoch, it processes several mini-batches of data.
- Select a Mini-Batch:
- For each mini-batch, a subset of the states, actions, returns, advantages, and log probabilities from the rollout data is selected. This is done using the shuffled indices generated earlier.
- Policy and Value Network Evaluation:
- The policy network is used to:
- Calculate the log probabilities of the selected actions for the mini-batch states.
- Compute the entropy of the action distribution, which encourages exploration.
- The value network is used to compute the predicted values for the selected mini-batch states.
- Compute Value Loss (Critic Loss):
- The value loss is calculated using the difference between the predicted values and the returns.
- A clipped version of the value predictions is also computed to ensure that large updates are avoided.
- The value loss is the maximum of the squared differences between the clipped and unclipped values and the returns, ensuring stability.
- Compute Policy Loss (Actor Loss):
- The policy loss is calculated using the ratio of the new and old policy probabilities:
- ratio = exp(new_log_prob - old_log_prob)
- Two versions of the policy loss are calculated:
- The first version directly uses the advantage-weighted ratio (adv * ratio).
- The second version clips the ratio to stay within the trust region (1 ± clip_val), preventing drastic changes in the policy.
- The maximum of the two (negated) losses is taken, which corresponds to the pessimistic, clipped surrogate objective.
- An entropy term, weighted by the entropy coefficient (ent_weight), is subtracted from the loss to encourage exploration (see the code sketch after this list).
- Backpropagation and Optimization:
- Gradients are computed using backpropagation on both the policy and value networks.
- The gradients are clipped to avoid excessive updates, and the networks are updated using their respective Adam optimizers (opt_policy for the policy and opt_value for the value network).
- Repeat for Each Mini-Batch and Epoch:
- This process is repeated for each mini-batch within the current epoch. Once all mini-batches have been processed, the method moves on to the next epoch, shuffling the indices and repeating the mini-batch update process.
- Return Loss and Entropy:
- After the training is complete, the method returns the average policy loss, value loss, and entropy for logging and evaluation purposes.
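The sketch below condenses the loss computation and optimization for a single mini-batch (steps 6 through 9 above). It assumes the PPO class holds policy_net, value_net, opt_policy, and opt_value as described earlier; the clip_val, ent_weight, and max_grad_norm defaults are illustrative assumptions rather than the repository's exact hyperparameters.

```python
import torch

def update_minibatch(self, mb_states, mb_actions, mb_old_log_probs,
                     mb_old_values, mb_returns, mb_advs,
                     clip_val=0.2, ent_weight=0.01, max_grad_norm=0.5):
    """One mini-batch update: clipped policy loss, clipped value loss, Adam steps (sketch)."""
    # Re-evaluate the mini-batch under the current networks.
    new_log_probs, entropy = self.policy_net.evaluate(mb_states, mb_actions)
    values = self.value_net(mb_states)

    # Critic loss: take the worse (larger) of the clipped and unclipped squared errors.
    values_clipped = mb_old_values + torch.clamp(values - mb_old_values, -clip_val, clip_val)
    value_loss = torch.max((values - mb_returns) ** 2,
                           (values_clipped - mb_returns) ** 2).mean()

    # Actor loss: clipped surrogate objective.
    ratio = torch.exp(new_log_probs - mb_old_log_probs)   # pi_new / pi_old
    unclipped = mb_advs * ratio
    clipped = mb_advs * torch.clamp(ratio, 1.0 - clip_val, 1.0 + clip_val)
    # Max of the negated terms == minimizing the pessimistic (clipped) surrogate;
    # the entropy bonus is subtracted to encourage exploration.
    policy_loss = torch.max(-unclipped, -clipped).mean() - ent_weight * entropy.mean()

    # Backpropagate, clip gradients, and step each Adam optimizer.
    self.opt_policy.zero_grad()
    policy_loss.backward()
    torch.nn.utils.clip_grad_norm_(self.policy_net.parameters(), max_grad_norm)
    self.opt_policy.step()

    self.opt_value.zero_grad()
    value_loss.backward()
    torch.nn.utils.clip_grad_norm_(self.value_net.parameters(), max_grad_norm)
    self.opt_value.step()

    return policy_loss.item(), value_loss.item(), entropy.mean().item()
```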
Summary of the Update Method
- The ppo_update() method performs the core PPO training step.
- It updates the policy network by computing and minimizing the policy loss, which involves balancing between maximizing rewards (through the ratio) and ensuring stability (through clipping).
- It updates the value network by minimizing the value loss, which measures how accurate the value predictions are compared to the actual returns.
- The process is repeated for multiple mini-batches and epochs, ensuring robust and stable learning for both networks.
Policy Network
The PolicyNet implementation is a basic neural network designed to output actions for a continuous action space, where the actions are modeled using a multivariate normal distribution. This is a common approach in reinforcement learning when dealing with continuous action spaces, as it allows the agent to sample actions based on the learned mean and standard deviation. A code sketch of the class follows the breakdown below.
The flow diagram of the policy network is shown below.
+---------------------+
| Input State |
+----------+----------+
|
v
+---------------------+
| Main Network | (3 Layers: Linear -> ReLU)
+----------+----------+
|
v
+---------------------+ +---------------------+
| Extracted Features |---------->| Mean (fc_mean) |
+---------------------+ +---------------------+
| |
| v
| +---------------------+
| | LogStd (logstd) |
| +---------------------+
| |
| exponentiate
| |
v v
+---------------------+ +---------------------+
| Mean and Std |---------->| Normal Dist |
+---------------------+ +---------------------+
|
+-------------------------------+-------------------------------+
| | |
+----------------------+ +------------------------+ +------------------------+
| Sample Action | | Compute LogProb | | Compute Entropy |
+----------------------+ +------------------------+ +------------------------+
^ ^ ^
| | |
+--------------------+ +---------------------+ +---------------------+
| forward() Method |<-----| evaluate() Method | | choose_action() |
+--------------------+ +---------------------+ +---------------------+
Key Components:
- self.main:
- This is the core part of the neural network that processes the input state to extract useful features. It consists of three fully connected (nn.Linear) layers with ReLU activation functions in between.
- self.fc_mean:
- After processing the state, this layer maps the features to the mean of the action distribution for each action dimension.
- self.logstd:
- This is a learnable parameter, initialized as zeros, which represents the log of the standard deviation of the action distribution. It is learned over time, allowing the model to adjust its confidence (variance) over the actions.
- distribution:
- A normal distribution (Gaussian) is created using the mean and the standard deviation (which is derived from the learned logstd). This is used to sample actions during training and evaluation.
Detailed Breakdown:
- forward() Method:
- Purpose: Computes an action (either stochastically or deterministically) and the corresponding log probability.
- Flow:
- Takes the state as input and passes it through the main network to extract features.
- Computes the mean from the features and the std by exponentiating logstd.
- Creates a normal distribution (dist.Normal(mean, std)).
- Either returns the mean (deterministic) or samples an action from the distribution (stochastic).
- Computes and returns the log_prob of the sampled action.
- choose_action() Method:
- Purpose: Only returns the action, either deterministically or stochastically, without computing the log probability.
- Flow:
- Takes the state as input and passes it through the main network to extract features.
- Computes the mean and std as in forward().
- Creates the normal distribution.
- Returns either the mean (deterministic) or a sampled action from the distribution (stochastic).
- evaluate() Method:
- Purpose: Evaluates a given action by returning its log probability and entropy.
- Flow:
- Takes the state and action as inputs.
- Passes the state through the main network to extract features.
- Computes the mean and std as in the previous methods.
- Creates a normal distribution.
- Computes and returns the log_prob of the given action and the entropy of the distribution.
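A minimal sketch of a PolicyNet along these lines is shown below. The hidden width of 128, the deterministic flag, the shared _dist() helper, and the summing of log-probabilities over action dimensions are assumptions made for brevity; the repository's exact layer sizes and signatures may differ.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class PolicyNet(nn.Module):
    """Gaussian policy for a continuous action space (sketch; sizes are assumptions)."""

    def __init__(self, s_dim, a_dim, hidden=128):
        super().__init__()
        # self.main: three fully connected layers with ReLU activations in between.
        self.main = nn.Sequential(
            nn.Linear(s_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.fc_mean = nn.Linear(hidden, a_dim)          # mean of the action distribution
        self.logstd = nn.Parameter(torch.zeros(a_dim))   # learnable log standard deviation

    def _dist(self, state):
        # Build the Normal distribution from the extracted features.
        features = self.main(state)
        mean = self.fc_mean(features)
        std = self.logstd.exp().expand_as(mean)
        return Normal(mean, std)

    def forward(self, state, deterministic=False):
        # Returns an action and its log probability.
        d = self._dist(state)
        action = d.mean if deterministic else d.sample()
        log_prob = d.log_prob(action).sum(dim=-1)        # joint log-prob over action dims
        return action, log_prob

    def choose_action(self, state, deterministic=False):
        # Returns only the action, without the log probability.
        d = self._dist(state)
        return d.mean if deterministic else d.sample()

    def evaluate(self, state, action):
        # Returns the log probability of a given action and the distribution entropy.
        d = self._dist(state)
        return d.log_prob(action).sum(dim=-1), d.entropy().sum(dim=-1)
```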
Value Network
The ValueNet class represents a neural network for approximating the value function in reinforcement learning. The input is the state (state), and the output is the value of that state (a scalar). The network stacks fully connected linear layers with ReLU activations in between, followed by a final linear layer that outputs a single value, as shown in the diagram and the sketch that follows it.
+---------------------+
| Input State |
+----------+----------+
|
v
+---------------------+
| Linear Layer (s_dim -> 128) |
+----------+----------+
|
v
+---------------------+
| ReLU Activation |
+----------+----------+
|
v
+---------------------+
| Linear Layer (128 -> 128) |
+----------+----------+
|
v
+---------------------+
| ReLU Activation |
+----------+----------+
|
v
+---------------------+
| Linear Layer (128 -> 1) |
+----------+----------+
|
v
+---------------------+
| Output: State Value |
+---------------------+
With all these steps we are able to create an agent that masters the Bipedal Walker with PPO.
For PPO in environments with discrete actions, check out:
- PPO Implementation in PyTorch
- Master Snake Game AI with PPO: Step-by-Step Guide (Part I)
- Master Snake Game AI with PPO: Step-by-Step Guide (Part II)