Got Population Based Training (PBT) with PPO running in RLlib. Rewards seem to have maxed out, asymptotically approaching 0.74.
PPO isn't great for this, but let's see if we can replay the result with the GUI afterwards.
I asked for these mutations:
hyperparam_mutations={
    "lambda": lambda: random.uniform(0.9, 1.0),
    "clip_param": lambda: random.uniform(0.01, 0.5),
    "lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
    # Not mutated in this run:
    # "num_sgd_iter": lambda: random.randint(1, 30),
    # "sgd_minibatch_size": lambda: random.randint(128, 16384),
    # "train_batch_size": lambda: random.randint(2000, 160000),
})
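A note on why the perturbed configs logged in pbt_global.txt for this run contain lambda values of 0.76 and 1.14, even though the requested range is 0.9 to 1.0. As far as I understand Tune's PBT explore step, the lambdas are only called when a value is resampled. Otherwise a continuous value is multiplied by 0.8 or 1.2, and a list value moves to a neighbouring entry. The sketch below is a simplified re-implementation of that behaviour under those assumptions, not Ray's actual code; the `perturb` function is hypothetical.

```python
import random

def perturb(config, mutations, resample_probability=0.25, rng=random):
    """Simplified, hypothetical version of a PBT explore step."""
    new_config = dict(config)
    for key, spec in mutations.items():
        if isinstance(spec, list):
            # Categorical: resample, or step to a neighbouring list entry.
            if rng.random() < resample_probability or config[key] not in spec:
                new_config[key] = rng.choice(spec)
            else:
                i = spec.index(config[key]) + rng.choice([-1, 1])
                new_config[key] = spec[min(max(i, 0), len(spec) - 1)]
        else:
            # Continuous: resample from the callable, or scale by 0.8 / 1.2.
            if rng.random() < resample_probability:
                new_config[key] = spec()
            else:
                new_config[key] = config[key] * rng.choice([0.8, 1.2])
    return new_config

mutations = {
    "lambda": lambda: random.uniform(0.9, 1.0),
    "clip_param": lambda: random.uniform(0.01, 0.5),
    "lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
}
start = {"lambda": 0.95, "clip_param": 0.2, "lr": 1e-4}

# With resampling disabled, lambda can only become 0.95 * 0.8 or 0.95 * 1.2.
seen = {
    round(perturb(start, mutations, resample_probability=0.0)["lambda"], 2)
    for _ in range(200)
}
print(sorted(seen))
```

The two values this produces are the same 0.76 and 1.14 that appear in the log, so the requested range only bounds resampled values, not scaled ones.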
cat pbt_global.txt
["5", "7", 17, 18, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.95, "clip_param": 0.2, "lr": 0.0001, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.76, "clip_param": 0.16000000000000003, "lr": 5e-05, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}]
["3", "1", 35, 32, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.95, "clip_param": 0.2, "lr": 0.0001, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 1.14, "clip_param": 0.1096797541550122, "lr": 5e-05, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}]
["3", "7", 35, 36, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.95, "clip_param": 0.2, "lr": 0.0001, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.76, "clip_param": 0.24, "lr": 0.001, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}]
["5", "6", 37, 35, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.95, "clip_param": 0.2, "lr": 0.0001, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 1.14, "clip_param": 0.16000000000000003, "lr": 5e-05, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}]
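Each line of pbt_global.txt appears to be a JSON array of the form [target trial, cloned trial, target iteration, cloned iteration, old config, new config]. That reading is inferred from the [exploit] log line in this run (trial 5 taking weights from trial 6, at iterations 37 and 35), so treat the field names below as assumptions. A small sketch that reports only the keys that changed:

```python
import json

def changed_keys(line):
    """Return (target, source, {key: (old, new)}) for one pbt_global.txt line."""
    target, source, _target_iter, _source_iter, old, new = json.loads(line)
    diff = {k: (old.get(k), new[k]) for k in new if old.get(k) != new[k]}
    return target, source, diff

# Shortened copy of the last line above (most unchanged keys dropped for brevity).
sample = (
    '["5", "6", 37, 35, '
    '{"lambda": 0.95, "clip_param": 0.2, "lr": 0.0001, "num_sgd_iter": 20}, '
    '{"lambda": 1.14, "clip_param": 0.16000000000000003, "lr": 5e-05, "num_sgd_iter": 20}]'
)

target, source, diff = changed_keys(sample)
print(f"trial {target} <- trial {source}")
for key, (old_value, new_value) in diff.items():
    print(f"  {key}: {old_value} -> {new_value}")
```

To run it over the real file, call `changed_keys` on each line read from pbt_global.txt.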
== Status ==
Memory usage on this node: 2.7/3.8 GiB
PopulationBasedTraining: 28 checkpoints, 3 perturbs
Resources requested: 3/4 CPUs, 0/0 GPUs, 0.0/0.93 GiB heap, 0.0/0.29 GiB objects
Result logdir: /root/ray_results/PBT_ROBOTABLE
Number of trials: 8 (7 PAUSED, 1 RUNNING)
+---------------------------------+----------+-----------------------+--------+------------------+--------+----------+
| Trial name | status | loc | iter | total time (s) | ts | reward |
|---------------------------------+----------+-----------------------+--------+------------------+--------+----------|
| PPO_RobotableEnv-v0_c67a8_00000 | PAUSED | | 36 | 1069.1 | 360000 | 0.735323 |
| PPO_RobotableEnv-v0_c67a8_00001 | PAUSED | | 36 | 1096.3 | 360000 | 0.736305 |
| PPO_RobotableEnv-v0_c67a8_00002 | PAUSED | | 33 | 987.687 | 330000 | 0.735262 |
| PPO_RobotableEnv-v0_c67a8_00003 | PAUSED | | 36 | 1096.22 | 360000 | 0.731993 |
| PPO_RobotableEnv-v0_c67a8_00004 | PAUSED | | 37 | 1103.48 | 370000 | 0.739188 |
| PPO_RobotableEnv-v0_c67a8_00005 | RUNNING | 192.168.101.127:14690 | 37 | 1101.5 | 370000 | 0.727506 |
| PPO_RobotableEnv-v0_c67a8_00006 | PAUSED | | 35 | 1067.26 | 350000 | 0.739985 |
| PPO_RobotableEnv-v0_c67a8_00007 | PAUSED | | 36 | 1085.05 | 360000 | 0.739295 |
+---------------------------------+----------+-----------------------+--------+------------------+--------+----------+
2020-07-19 17:27:53,966 INFO pbt.py:78 -- [explore] perturbed config from {'env': 'RobotableEnv-v0', 'kl_coeff': 1.0, 'num_workers': 2, 'num_gpus': 0, 'model': {'free_log_std': True}, 'lambda': 0.95, 'clip_param': 0.2, 'lr': 0.0001, 'num_sgd_iter': 20, 'sgd_minibatch_size': 500, 'train_batch_size': 10000} -> {'env': 'RobotableEnv-v0', 'kl_coeff': 1.0, 'num_workers': 2, 'num_gpus': 0, 'model': {'free_log_std': True}, 'lambda': 1.14, 'clip_param': 0.16000000000000003, 'lr': 5e-05, 'num_sgd_iter': 20, 'sgd_minibatch_size': 500, 'train_batch_size': 10000}
2020-07-19 17:27:53,966 INFO pbt.py:316 -- [exploit] transferring weights from trial PPO_RobotableEnv-v0_c67a8_00006 (score 0.7399848299949074) -> PPO_RobotableEnv-v0_c67a8_00005 (score 0.7241841897925536)
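The [explore] line above sets lambda to 1.14, which is outside the [0, 1] range that GAE's lambda is normally defined on. One possible guard, assuming the scheduler is Tune's PopulationBasedTraining, is its `custom_explore_fn` argument, which is called with the perturbed config and returns the config to use. The `clamp_config` helper and its `BOUNDS` table are my own illustration, with the bounds copied from the ranges requested in hyperparam_mutations:

```python
# Hypothetical bounds, copied from the ranges requested in hyperparam_mutations.
BOUNDS = {"lambda": (0.9, 1.0), "clip_param": (0.01, 0.5)}

def clamp_config(config):
    """Clip perturbed hyperparameters back into their intended ranges."""
    for key, (low, high) in BOUNDS.items():
        if key in config:
            config[key] = min(max(config[key], low), high)
    return config

# Values taken from the perturbed config in the log line above.
perturbed = {"lambda": 1.14, "clip_param": 0.16000000000000003, "lr": 5e-05}
clamped = clamp_config(dict(perturbed))
print(clamped)

# Intended use (untested here): PopulationBasedTraining(..., custom_explore_fn=clamp_config)
```

Only lambda is changed for this config; clip_param and lr are already inside their ranges and pass through untouched.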
Result for PPO_RobotableEnv-v0_c67a8_00005:
custom_metrics: {}
date: 2020-07-19_17-27-53
done: false
episode_len_mean: 114.58
episode_reward_max: 0.7808001167724908
episode_reward_mean: 0.7241841897925536
episode_reward_min: 0.6627154081217708
episodes_this_iter: 88
episodes_total: 2500
experiment_id: e3408f32ed2a433d8c7edb87d33609ba
experiment_tag: 5@perturbed[clip_param=0.16,lambda=1.14,lr=5e-05]
hostname: chrx
info:
learner:
default_policy:
cur_kl_coeff: 0.0625
cur_lr: 4.999999873689376e-05
entropy: 5.101933479309082
entropy_coeff: 0.0
kl: 0.004210006445646286
model: {}
policy_loss: -0.0077978381887078285
total_loss: -0.007088268641382456
vf_explained_var: 0.9757658243179321
vf_loss: 0.0004464423400349915
num_steps_sampled: 380000
num_steps_trained: 380000
iterations_since_restore: 5
node_ip: 192.168.101.127
num_healthy_workers: 2
off_policy_estimator: {}
perf:
cpu_util_percent: 66.7095238095238
ram_util_percent: 72.5452380952381
pid: 14690
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
mean_env_wait_ms: 1.5935033550679747
mean_inference_ms: 1.8385610163959398
mean_processing_ms: 1.195529456155168
time_since_restore: 147.82027745246887
time_this_iter_s: 29.546902656555176
time_total_s: 1131.04909491539
timers:
learn_throughput: 1880.23
learn_time_ms: 5318.497
load_throughput: 350730.091
load_time_ms: 28.512
sample_throughput: 414.501
sample_time_ms: 24125.418
update_time_ms: 4.191
timestamp: 1595179673
timesteps_since_restore: 0
timesteps_total: 380000
training_iteration: 38
trial_id: c67a8_00005
2020-07-19 17:27:54,989 WARNING util.py:137 -- The `experiment_checkpoint` operation took 0.8819785118103027 seconds to complete, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 2.6/3.8 GiB
PopulationBasedTraining: 28 checkpoints, 4 perturbs
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/0.93 GiB heap, 0.0/0.29 GiB objects
Result logdir: /root/ray_results/PBT_ROBOTABLE
Number of trials: 8 (8 PAUSED)
+---------------------------------+----------+-------+--------+------------------+--------+----------+
| Trial name | status | loc | iter | total time (s) | ts | reward |
|---------------------------------+----------+-------+--------+------------------+--------+----------|
| PPO_RobotableEnv-v0_c67a8_00000 | PAUSED | | 36 | 1069.1 | 360000 | 0.735323 |
| PPO_RobotableEnv-v0_c67a8_00001 | PAUSED | | 36 | 1096.3 | 360000 | 0.736305 |
| PPO_RobotableEnv-v0_c67a8_00002 | PAUSED | | 33 | 987.687 | 330000 | 0.735262 |
| PPO_RobotableEnv-v0_c67a8_00003 | PAUSED | | 36 | 1096.22 | 360000 | 0.731993 |
| PPO_RobotableEnv-v0_c67a8_00004 | PAUSED | | 37 | 1103.48 | 370000 | 0.739188 |
| PPO_RobotableEnv-v0_c67a8_00005 | PAUSED | | 38 | 1131.05 | 380000 | 0.724184 |
| PPO_RobotableEnv-v0_c67a8_00006 | PAUSED | | 35 | 1067.26 | 350000 | 0.739985 |
| PPO_RobotableEnv-v0_c67a8_00007 | PAUSED | | 36 | 1085.05 | 360000 | 0.739295 |
+---------------------------------+----------+-------+--------+------------------+--------+----------+
(pid=14800) 2020-07-19 17:27:58,611 INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=14800) 2020-07-19 17:27:58,611 INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=14800) pybullet build time: Mar 17 2020 17:46:41
(pid=14800) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=14800) warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
(pid=14913) pybullet build time: Mar 17 2020 17:46:41
(pid=14913) 2020-07-19 17:28:00,118 INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=14913) 2020-07-19 17:28:00,118 INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=14913) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=14913) warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
(pid=14992) pybullet build time: Mar 17 2020 17:46:41
(pid=14993) pybullet build time: Mar 17 2020 17:46:41
(pid=14992) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=14992) warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
(pid=14800) 2020-07-19 17:28:10,106 INFO trainable.py:181 -- _setup took 11.510 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=14993) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=14993) warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
(pid=14800) 2020-07-19 17:28:10,126 WARNING util.py:37 -- Install gputil for GPU system monitoring.
(pid=14800) 2020-07-19 17:28:10,717 INFO trainable.py:423 -- Restored on 192.168.101.127 from checkpoint: /root/ray_results/PBT_ROBOTABLE/PPO_RobotableEnv-v0_5_2020-07-19_15-00-03bbqeih3t/tmpf1h5txefrestore_from_object/checkpoint-35
(pid=14800) 2020-07-19 17:28:10,717 INFO trainable.py:430 -- Current state after restoring: {'_iteration': 35, '_timesteps_total': None, '_time_total': 1067.2641203403473, '_episodes_total': 2289}
(pid=14913) 2020-07-19 17:28:12,388 INFO trainable.py:181 -- _setup took 12.284 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=14913) 2020-07-19 17:28:12,388 WARNING util.py:37 -- Install gputil for GPU system monitoring.
(pid=14913) 2020-07-19 17:28:12,760 INFO trainable.py:423 -- Restored on 192.168.101.127 from checkpoint: /root/ray_results/PBT_ROBOTABLE/PPO_RobotableEnv-v0_2_2020-07-19_14-52-33cutk2k27/tmplqac6svyrestore_from_object/checkpoint-33
(pid=14913) 2020-07-19 17:28:12,760 INFO trainable.py:430 -- Current state after restoring: {'_iteration': 33, '_timesteps_total': None, '_time_total': 987.687007188797, '_episodes_total': 2059}
(pid=15001) pybullet build time: Mar 17 2020 17:46:41
(pid=15001) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=15001) warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
(pid=15088) pybullet build time: Mar 17 2020 17:46:41
(pid=15088) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=15088) warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Result for PPO_RobotableEnv-v0_c67a8_00002:
custom_metrics: {}
date: 2020-07-19_17-28-54
done: false
episode_len_mean: 110.78888888888889
episode_reward_max: 0.8009732276880979
episode_reward_mean: 0.7387077080695522
episode_reward_min: 0.6640543988817607
episodes_this_iter: 90
episodes_total: 2149
experiment_id: edcd859a3ae34d668bb9be1899dde41a
experiment_tag: '2'
hostname: chrx
info:
learner:
default_policy:
cur_kl_coeff: 1.0
cur_lr: 9.999999747378752e-05
entropy: 5.111008644104004
entropy_coeff: 0.0
kl: 0.0031687873415648937
model: {}
policy_loss: -0.012367220595479012
total_loss: -0.008663905784487724
vf_explained_var: 0.9726411700248718
vf_loss: 0.0005345290992408991
num_steps_sampled: 340000
num_steps_trained: 340000
iterations_since_restore: 1
node_ip: 192.168.101.127
num_healthy_workers: 2
off_policy_estimator: {}
perf:
cpu_util_percent: 68.11833333333333
ram_util_percent: 71.13666666666667
pid: 14913
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf:
mean_env_wait_ms: 1.6718764134441182
mean_inference_ms: 1.9752634594235934
mean_processing_ms: 1.2958259778937158
time_since_restore: 41.650487661361694
time_this_iter_s: 41.650487661361694
time_total_s: 1029.3374948501587
timers:
learn_throughput: 1680.106
learn_time_ms: 5952.007
load_throughput: 74973.795
load_time_ms: 133.38
sample_throughput: 285.094
sample_time_ms: 35076.171
update_time_ms: 4.517
timestamp: 1595179734
timesteps_since_restore: 0
timesteps_total: 340000
training_iteration: 34
trial_id: c67a8_00002
2020-07-19 17:28:55,042 WARNING util.py:137 -- The `experiment_checkpoint` operation took 0.5836038589477539 seconds to complete, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 2.7/3.8 GiB
PopulationBasedTraining: 28 checkpoints, 4 perturbs
Resources requested: 3/4 CPUs, 0/0 GPUs, 0.0/0.93 GiB heap, 0.0/0.29 GiB objects
Result logdir: /root/ray_results/PBT_ROBOTABLE
Number of trials: 8 (7 PAUSED, 1 RUNNING)
+---------------------------------+----------+-----------------------+--------+------------------+--------+----------+
| Trial name | status | loc | iter | total time (s) | ts | reward |
|---------------------------------+----------+-----------------------+--------+------------------+--------+----------|
| PPO_RobotableEnv-v0_c67a8_00000 | PAUSED | | 36 | 1069.1 | 360000 | 0.735323 |
| PPO_RobotableEnv-v0_c67a8_00001 | PAUSED | | 36 | 1096.3 | 360000 | 0.736305 |
| PPO_RobotableEnv-v0_c67a8_00002 | RUNNING | 192.168.101.127:14913 | 34 | 1029.34 | 340000 | 0.738708 |
| PPO_RobotableEnv-v0_c67a8_00003 | PAUSED | | 36 | 1096.22 | 360000 | 0.731993 |
| PPO_RobotableEnv-v0_c67a8_00004 | PAUSED | | 37 | 1103.48 | 370000 | 0.739188 |
| PPO_RobotableEnv-v0_c67a8_00005 | PAUSED | | 38 | 1131.05 | 380000 | 0.724184 |
| PPO_RobotableEnv-v0_c67a8_00006 | PAUSED | | 35 | 1067.26 | 350000 | 0.739985 |
| PPO_RobotableEnv-v0_c67a8_00007 | PAUSED | | 36 | 1085.05 | 360000 | 0.739295 |
+---------------------------------+----------+-----------------------+--------+------------------+--------+----------+