Proximal Policy Optimization (PPO)
Overview
PPO is one of the most popular deep reinforcement learning (DRL) algorithms. It runs reasonably fast by leveraging vectorized (parallel) environments and works naturally with different action spaces, so it supports a variety of games. It also has good sample efficiency compared to algorithms such as DQN.
Original paper: Proximal Policy Optimization Algorithms (Schulman et al., 2017)
Reference resources:
- Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO
- What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study
- ⭐ The 37 Implementation Details of Proximal Policy Optimization
All our PPO implementations below are augmented with the same code-level optimizations presented in openai/baselines's PPO. To see how we matched these implementation details, refer to our blog post The 37 Implementation Details of Proximal Policy Optimization.
Implemented Variants
Variants Implemented | Description |
---|---|
ppo.py, docs | For classic control tasks like CartPole-v1. |
ppo_atari.py, docs | For playing Atari games. It uses convolutional layers and common atari-based pre-processing techniques. |
ppo_continuous_action.py, docs | For continuous action space. Also implemented Mujoco-specific code-level optimizations |
Below are our single-file implementations of PPO:
ppo.py
The ppo.py has the following features:
- Works with the Box observation space of low-level features
- Works with the Discrete action space
- Works with envs like CartPole-v1
Usage
poetry install
python cleanrl/ppo.py --help
python cleanrl/ppo.py --env-id CartPole-v1
Implementation details
ppo.py is based on the "13 core implementation details" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:
- Vectorized architecture ( common/cmd_util.py#L22)
- Orthogonal Initialization of Weights and Constant Initialization of biases ( a2c/utils.py#L58)
- The Adam Optimizer's Epsilon Parameter ( ppo2/model.py#L100)
- Adam Learning Rate Annealing ( ppo2/ppo2.py#L133-L135)
- Generalized Advantage Estimation ( ppo2/runner.py#L56-L65)
- Mini-batch Updates ( ppo2/ppo2.py#L157-L166)
- Normalization of Advantages ( ppo2/model.py#L139)
- Clipped surrogate objective ( ppo2/model.py#L81-L86)
- Value Function Loss Clipping ( ppo2/model.py#L68-L75)
- Overall Loss and Entropy Bonus ( ppo2/model.py#L91)
- Global Gradient Clipping ( ppo2/model.py#L102-L108)
- Debug variables ( ppo2/model.py#L115-L116)
- Separate MLP networks for policy and value functions ( common/policies.py#L156-L160, baselines/common/models.py#L75-L103)
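Several of the details listed above (learning rate annealing, Generalized Advantage Estimation, advantage normalization, the clipped surrogate objective, value function loss clipping, and the overall loss with entropy bonus) can be sketched in a few lines of plain Python. This is a minimal illustration and not the code in ppo.py or openai/baselines: the function names, the `dones[t]` convention, the use of the population standard deviation, and the default coefficients are all assumptions made for the example.

```python
import math


def anneal_lr(initial_lr, update, num_updates):
    """Linearly decay the learning rate toward 0; `update` is 1-indexed."""
    frac = 1.0 - (update - 1.0) / num_updates
    return frac * initial_lr


def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over one rollout.

    Assumed convention: dones[t] is 1.0 if the episode ended at step t, and
    last_value is the value estimate of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        last_gae = delta + gamma * gae_lambda * not_done * last_gae
        advantages[t] = last_gae
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale a mini-batch of advantages to zero mean and unit std."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    return [(a - mean) / (math.sqrt(var) + eps) for a in advantages]


def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def ppo_loss(new_logprobs, old_logprobs, advantages, new_values, old_values,
             returns, entropy, clip_coef=0.2, ent_coef=0.01, vf_coef=0.5):
    """Return (total loss, policy loss, value loss) for one mini-batch."""
    n = len(advantages)
    pg_loss = 0.0
    v_loss = 0.0
    for i in range(n):
        ratio = math.exp(new_logprobs[i] - old_logprobs[i])
        # Pessimistic choice between the unclipped and clipped negated objectives.
        pg_loss += max(-advantages[i] * ratio,
                       -advantages[i] * clip(ratio, 1 - clip_coef, 1 + clip_coef))
        # Keep the new value prediction within clip_coef of the old one.
        v_clipped = old_values[i] + clip(new_values[i] - old_values[i], -clip_coef, clip_coef)
        v_loss += 0.5 * max((new_values[i] - returns[i]) ** 2,
                            (v_clipped - returns[i]) ** 2)
    pg_loss /= n
    v_loss /= n
    # The entropy bonus is subtracted because the total is minimized.
    return pg_loss - ent_coef * entropy + vf_coef * v_loss, pg_loss, v_loss
```

In an actual training loop these quantities would be computed on tensors so that gradients can flow through the loss; the scalar version here only shows the arithmetic.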
Experiment results
To run benchmark experiments, see benchmark/ppo.sh. Specifically, execute the following command:
Below are the average episodic returns for ppo.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.
Environment | ppo.py | openai/baselines' PPO (Huang et al., 2022) [1] |
---|---|---|
CartPole-v1 | 492.40 ± 13.05 | 497.54 ± 4.02 |
Acrobot-v1 | -89.93 ± 6.34 | -81.82 ± 5.58 |
MountainCar-v0 | -200.00 ± 0.00 | -200.00 ± 0.00 |
Learning curves:
Tracked experiments and game play videos:
Video tutorial
If you'd like to learn ppo.py in depth, consider checking out the following video tutorial:
ppo_atari.py
The ppo_atari.py has the following features:
- For playing Atari games. It uses convolutional layers and common atari-based pre-processing techniques.
- Works with the Atari's pixel Box observation space of shape (210, 160, 3)
- Works with the Discrete action space
Usage
poetry install -E atari
python cleanrl/ppo_atari.py --help
python cleanrl/ppo_atari.py --env-id BreakoutNoFrameskip-v4
Implementation details
ppo_atari.py is based on the "9 Atari implementation details" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:
- The Use of NoopResetEnv ( common/atari_wrappers.py#L12)
- The Use of MaxAndSkipEnv ( common/atari_wrappers.py#L97)
- The Use of EpisodicLifeEnv ( common/atari_wrappers.py#L61)
- The Use of FireResetEnv ( common/atari_wrappers.py#L41)
- The Use of WarpFrame (Image transformation) ( common/atari_wrappers.py#L134)
- The Use of ClipRewardEnv ( common/atari_wrappers.py#L125)
- The Use of FrameStack ( common/atari_wrappers.py#L188)
- Shared Nature-CNN network for the policy and value functions ( common/policies.py#L157, common/models.py#L15-L26)
- Scaling the Images to Range [0, 1] ( common/models.py#L19)
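To make a few of the pre-processing steps above concrete, the sketch below mimics the core operation behind MaxAndSkipEnv (pixel-wise max over two consecutive frames), ClipRewardEnv (reward binned by sign), FrameStack (keep the last k frames), and the scaling of images to [0, 1]. It is an illustration only: frames are represented as flat Python lists of pixel values rather than image arrays, and the helper names are hypothetical, not the wrapper classes used by ppo_atari.py.

```python
from collections import deque


def max_pool_frames(frame_a, frame_b):
    """Pixel-wise max of two consecutive frames, which removes sprite flicker."""
    return [max(a, b) for a, b in zip(frame_a, frame_b)]


def clip_reward(reward):
    """Bin the reward to -1.0, 0.0, or +1.0 according to its sign."""
    return float((reward > 0) - (reward < 0))


def scale_pixels(frame):
    """Map uint8 pixel values in [0, 255] to floats in [0, 1]."""
    return [p / 255.0 for p in frame]


class FrameStacker:
    """Keep the k most recent frames so the agent can infer motion."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # At the start of an episode the stack is filled with the first frame.
        for _ in range(self.k):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return list(self.frames)


stacker = FrameStacker(k=4)
obs = stacker.reset([0, 0])
obs = stacker.step(max_pool_frames([10, 200], [30, 100]))
network_input = [scale_pixels(f) for f in obs]
```

The remaining wrappers (NoopResetEnv, EpisodicLifeEnv, FireResetEnv, WarpFrame) change how episodes start and end or how frames are resized, and are not covered by this sketch.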
Experiment results
To run benchmark experiments, see benchmark/ppo.sh. Specifically, execute the following command:
Below are the average episodic returns for ppo_atari.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.
Environment | ppo_atari.py | openai/baselines' PPO |
---|---|---|
BreakoutNoFrameskip-v4 | 416.31 ± 43.92 | 406.57 ± 31.554 |
PongNoFrameskip-v4 | 20.59 ± 0.35 | 20.512 ± 0.50 |
BeamRiderNoFrameskip-v4 | 2445.38 ± 528.91 | 2642.97 ± 670.37 |
Learning curves:
Tracked experiments and game play videos:
Video tutorial
If you'd like to learn ppo_atari.py in depth, consider checking out the following video tutorial:
ppo_continuous_action.py
The ppo_continuous_action.py has the following features:
- For continuous action space. Also implemented Mujoco-specific code-level optimizations
- Works with the Box observation space of low-level features
- Works with the Box (continuous) action space
Usage
poetry install -E atari
python cleanrl/ppo_continuous_action.py --help
python cleanrl/ppo_continuous_action.py --env-id Hopper-v2
Implementation details
ppo_continuous_action.py is based on the "9 details for continuous action domains (e.g. Mujoco)" in The 37 Implementation Details of Proximal Policy Optimization, which are as follows:
- Continuous actions via normal distributions ( common/distributions.py#L103-L104)
- State-independent log standard deviation ( common/distributions.py#L104)
- Independent action components ( common/distributions.py#L238-L246)
- Separate MLP networks for policy and value functions ( common/policies.py#L160, baselines/common/models.py#L75-L103)
- Handling of action clipping to valid range and storage ( common/cmd_util.py#L99-L100)
- Normalization of Observation ( common/vec_env/vec_normalize.py#L4)
- Observation Clipping ( common/vec_env/vec_normalize.py#L39)
- Reward Scaling ( common/vec_env/vec_normalize.py#L28)
- Reward Clipping ( common/vec_env/vec_normalize.py#L32)
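The continuous-action details above can be sketched without any deep learning framework. The example below shows a diagonal Gaussian log-probability (independent action components summed in log space, with a log standard deviation that does not depend on the state), a running mean/variance tracker used for observation normalization and reward scaling, and action clipping. All names, the scalar (rather than per-dimension) statistics, and the clip value of 10.0 are assumptions made for the illustration, not a copy of ppo_continuous_action.py.

```python
import math


def gaussian_logprob(action, mean, log_std):
    """Log-density of a diagonal Gaussian; summing the per-dimension terms
    treats the action components as independent."""
    total = 0.0
    for a, m, ls in zip(action, mean, log_std):
        total += (-((a - m) ** 2) / (2.0 * math.exp(2.0 * ls))
                  - ls - 0.5 * math.log(2.0 * math.pi))
    return total


class RunningMeanStd:
    """Running mean and variance of a scalar stream, updated batch by batch."""

    def __init__(self, epsilon=1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, epsilon

    def update(self, batch):
        n = len(batch)
        batch_mean = sum(batch) / n
        batch_var = sum((x - batch_mean) ** 2 for x in batch) / n
        delta = batch_mean - self.mean
        total = self.count + n
        # Combine the old and new sums of squared deviations.
        m2 = self.var * self.count + batch_var * n + delta ** 2 * self.count * n / total
        self.mean += delta * n / total
        self.var = m2 / total
        self.count = total


def normalize_obs(x, obs_rms, clip=10.0, eps=1e-8):
    """Standardize an observation with running statistics, then clip it."""
    z = (x - obs_rms.mean) / math.sqrt(obs_rms.var + eps)
    return max(-clip, min(clip, z))


def scale_reward(reward, return_rms, clip=10.0, eps=1e-8):
    """Divide the reward by the running std of discounted returns
    (no mean subtraction), then clip it."""
    return max(-clip, min(clip, reward / math.sqrt(return_rms.var + eps)))


def clip_action(action, low, high):
    """Clip the action sent to the environment; the raw sample is what is stored."""
    return [max(l, min(h, a)) for a, l, h in zip(action, low, high)]
```

In this sketch `return_rms` is expected to be updated with a running discounted sum of rewards rather than with the raw rewards, which is why the reward is only divided by a standard deviation and not centered.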
Experiment results
To run benchmark experiments, see benchmark/ppo.sh. Specifically, execute the following command:
Below are the average episodic returns for ppo_continuous_action.py. To ensure the quality of the implementation, we compared the results against openai/baselines' PPO.
Environment | ppo_continuous_action.py | openai/baselines' PPO |
---|---|---|
Hopper-v2 | 2231.12 ± 656.72 | 2518.95 ± 850.46 |
Walker2d-v2 | 3050.09 ± 1136.21 | 3208.08 ± 1264.37 |
HalfCheetah-v2 | 1822.82 ± 928.11 | 2152.26 ± 1159.84 |
Learning curves:
Tracked experiments and game play videos:
Video tutorial
If you'd like to learn ppo_continuous_action.py in depth, consider checking out the following video tutorial:
[1] Huang, Shengyi; Dossa, Rousslan Fernand Julien; Raffin, Antonin; Kanervisto, Anssi; Wang, Weixun (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR 2022 Blog Track. https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/