Implementation of PPO (Proximal Policy Optimization) using PyTorch.
The red line represents the goal (reward threshold) of the environment, as specified by OpenAI Gym.
Note that not all of these goals are reached, but the implementation achieves results similar to Figure 3 of the original paper,
and better than the results reported in Benchmarks for Spinning Up Implementations.
Note that no seed is specified, so you may get different results;
however, in my experience this code achieves similar results across different seeds,
so you should get a reasonable result after trying a few seeds (or even without specifying one).
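
For reference, the heart of PPO is the clipped surrogate objective. Below is a minimal PyTorch sketch of that loss; the function name, tensor layout, and the clip/value coefficients are illustrative assumptions, not necessarily how main.py computes it.

```python
# A minimal sketch of the PPO clipped surrogate loss (Schulman et al., 2017).
# Names and coefficients are assumptions for illustration, not main.py's exact code.
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             clip_eps=0.2, vf_coef=0.5):
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s).
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Clipped surrogate objective: take the pessimistic (minimum) of the two terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Value-function loss: mean squared error against the empirical returns.
    value_loss = (values - returns).pow(2).mean()
    return policy_loss + vf_coef * value_loss
```
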
python main.py --env-name "Pendulum-v0" --learning-rate 0.0003 --learn-interval 1000 --batch-size 200 --total-steps 300000 --num-process 3
*Plots: reward and running reward | multiple running rewards*
python main.py --env-name "HalfCheetah-v3" --total-steps 5000000 --learn-interval 2000 --learning-rate 0.0007 --batch-size 2000
*Plots: reward and running reward | multiple running rewards*
python main.py --env-name "Swimmer-v3" --total-steps 1000000 --learn-interval 2000 --learning-rate 0.0005 --batch-size 1000 --std-decay
*Plots: reward and running reward | multiple running rewards*
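
The `--std-decay` flag used here (and for Hopper-v3 and Walker2d-v3 below) suggests the policy's exploration noise is annealed over training. One possible reading is sketched below, with an assumed linear schedule and assumed initial/final values; main.py may implement it differently.

```python
# A possible interpretation of --std-decay: linearly anneal the Gaussian policy's
# action standard deviation over training. Schedule and bounds are assumptions.
import torch
from torch.distributions import Normal

def decayed_std(step, total_steps, init_std=1.0, final_std=0.1):
    """Linearly interpolate the action std from init_std down to final_std."""
    frac = min(step / total_steps, 1.0)
    return init_std + frac * (final_std - init_std)

# Usage: sample an action with the current (decayed) exploration noise.
mean = torch.zeros(2)                                   # placeholder policy mean
std = decayed_std(step=500_000, total_steps=1_000_000)  # -> 0.55
action = Normal(mean, std).sample()
```
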
python main.py --env-name "Hopper-v3" --total-steps 5000000 --learn-interval 2000 --learning-rate 0.0005 --batch-size 1000 --std-decay
*Plots: reward and running reward | multiple running rewards*
python main.py --env-name "Walker2d-v3" --total-steps 5000000 --learn-interval 2000 --learning-rate 0.0005 --batch-size 1000 --std-decay
*Plots: reward and running reward | multiple running rewards*
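
For completeness, the flags used in the commands above could be wired up roughly as follows. This is only a sketch of a plausible argument parser; main.py's actual defaults and help strings may differ.

```python
# A rough sketch of an argument parser covering the flags used in the commands above.
# Defaults are copied from the Pendulum-v0 example; main.py's real defaults may differ.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="PPO training (PyTorch)")
    parser.add_argument("--env-name", type=str, default="Pendulum-v0")
    parser.add_argument("--learning-rate", type=float, default=3e-4)
    parser.add_argument("--learn-interval", type=int, default=1000,
                        help="environment steps collected between PPO updates (assumed meaning)")
    parser.add_argument("--batch-size", type=int, default=200)
    parser.add_argument("--total-steps", type=int, default=300_000)
    parser.add_argument("--num-process", type=int, default=1)
    parser.add_argument("--std-decay", action="store_true",
                        help="decay the policy's action std over training (assumed meaning)")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())
```
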
- discrete action