
Ray / RLlib PBT & PPO

Got Population Based Training with PPO running in RLlib. It seems to have maxed out its reward, asymptotically approaching 0.74.

PPO isn't great for this task, but let's see if we can replay the result in the GUI after this.

I asked for these hyperparameter mutations:

hyperparam_mutations={
    "lambda": lambda: random.uniform(0.9, 1.0),
    "clip_param": lambda: random.uniform(0.01, 0.5),
    "lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
    # "num_sgd_iter": lambda: random.randint(1, 30),
    # "sgd_minibatch_size": lambda: random.randint(128, 16384),
    # "train_batch_size": lambda: random.randint(2000, 160000),
})
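For context, roughly how that dict plugs into Ray Tune's PopulationBasedTraining scheduler. Only the mutations dict below is live code; the scheduler wiring is sketched in comments because the exact signature varies by Ray version (this is the circa-0.8 API, and the names here are illustrative, not copied from my actual script):

```python
import random

# Mutation spec handed to PBT: a callable is a resampling distribution,
# a list is a discrete set of choices.
hyperparam_mutations = {
    "lambda": lambda: random.uniform(0.9, 1.0),
    "clip_param": lambda: random.uniform(0.01, 0.5),
    "lr": [1e-3, 5e-4, 1e-4, 5e-5, 1e-5],
}

# Sketch of the scheduler wiring (check your Ray version's docs):
#
# from ray import tune
# from ray.tune.schedulers import PopulationBasedTraining
#
# pbt = PopulationBasedTraining(
#     time_attr="training_iteration",
#     metric="episode_reward_mean",
#     mode="max",
#     perturbation_interval=5,
#     hyperparam_mutations=hyperparam_mutations,
# )
# tune.run("PPO", name="PBT_ROBOTABLE", scheduler=pbt, num_samples=8,
#          config={"env": "RobotableEnv-v0", "num_workers": 2})
```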

cat pbt_global.txt
["5", "7", 17, 18, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.95, "clip_param": 0.2, "lr": 0.0001, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.76, "clip_param": 0.16000000000000003, "lr": 5e-05, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}]

["3", "1", 35, 32, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.95, "clip_param": 0.2, "lr": 0.0001, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 1.14, "clip_param": 0.1096797541550122, "lr": 5e-05, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}]

["3", "7", 35, 36, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.95, "clip_param": 0.2, "lr": 0.0001, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.76, "clip_param": 0.24, "lr": 0.001, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}]

["5", "6", 37, 35, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 0.95, "clip_param": 0.2, "lr": 0.0001, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}, {"env": "RobotableEnv-v0", "kl_coeff": 1.0, "num_workers": 2, "num_gpus": 0, "model": {"free_log_std": true}, "lambda": 1.14, "clip_param": 0.16000000000000003, "lr": 5e-05, "num_sgd_iter": 20, "sgd_minibatch_size": 500, "train_batch_size": 10000}]




== Status ==
Memory usage on this node: 2.7/3.8 GiB
PopulationBasedTraining: 28 checkpoints, 3 perturbs
Resources requested: 3/4 CPUs, 0/0 GPUs, 0.0/0.93 GiB heap, 0.0/0.29 GiB objects
Result logdir: /root/ray_results/PBT_ROBOTABLE
Number of trials: 8 (7 PAUSED, 1 RUNNING)
+---------------------------------+----------+-----------------------+--------+------------------+--------+----------+
| Trial name                      | status   | loc                   |   iter |   total time (s) |     ts |   reward |
|---------------------------------+----------+-----------------------+--------+------------------+--------+----------|
| PPO_RobotableEnv-v0_c67a8_00000 | PAUSED   |                       |     36 |         1069.1   | 360000 | 0.735323 |
| PPO_RobotableEnv-v0_c67a8_00001 | PAUSED   |                       |     36 |         1096.3   | 360000 | 0.736305 |
| PPO_RobotableEnv-v0_c67a8_00002 | PAUSED   |                       |     33 |          987.687 | 330000 | 0.735262 |
| PPO_RobotableEnv-v0_c67a8_00003 | PAUSED   |                       |     36 |         1096.22  | 360000 | 0.731993 |
| PPO_RobotableEnv-v0_c67a8_00004 | PAUSED   |                       |     37 |         1103.48  | 370000 | 0.739188 |
| PPO_RobotableEnv-v0_c67a8_00005 | RUNNING  | 192.168.101.127:14690 |     37 |         1101.5   | 370000 | 0.727506 |
| PPO_RobotableEnv-v0_c67a8_00006 | PAUSED   |                       |     35 |         1067.26  | 350000 | 0.739985 |
| PPO_RobotableEnv-v0_c67a8_00007 | PAUSED   |                       |     36 |         1085.05  | 360000 | 0.739295 |
+---------------------------------+----------+-----------------------+--------+------------------+--------+----------+


2020-07-19 17:27:53,966	INFO pbt.py:78 -- [explore] perturbed config from {'env': 'RobotableEnv-v0', 'kl_coeff': 1.0, 'num_workers': 2, 'num_gpus': 0, 'model': {'free_log_std': True}, 'lambda': 0.95, 'clip_param': 0.2, 'lr': 0.0001, 'num_sgd_iter': 20, 'sgd_minibatch_size': 500, 'train_batch_size': 10000} -> {'env': 'RobotableEnv-v0', 'kl_coeff': 1.0, 'num_workers': 2, 'num_gpus': 0, 'model': {'free_log_std': True}, 'lambda': 1.14, 'clip_param': 0.16000000000000003, 'lr': 5e-05, 'num_sgd_iter': 20, 'sgd_minibatch_size': 500, 'train_batch_size': 10000}
2020-07-19 17:27:53,966	INFO pbt.py:316 -- [exploit] transferring weights from trial PPO_RobotableEnv-v0_c67a8_00006 (score 0.7399848299949074) -> PPO_RobotableEnv-v0_c67a8_00005 (score 0.7241841897925536)
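The perturbed values in that [explore] line match PBT's explore rule: with some probability a hyperparameter is resampled from its distribution, otherwise it's multiplied by 0.8 or 1.2. That's why clip_param goes 0.2 → 0.16, and how lambda can land at 1.14 (0.95 × 1.2), outside its [0.9, 1.0] sampling range. A simplified sketch of that rule for a single continuous parameter (not Ray's actual pbt.py, which also handles lists and quantization):

```python
import random

RESAMPLE_PROB = 0.25  # Ray's default resample_probability

def explore(value, distribution, rng=random):
    """Perturb one continuous hyperparameter, PBT-style (simplified)."""
    if rng.random() < RESAMPLE_PROB:
        return distribution()      # resample from scratch
    factor = 1.2 if rng.random() > 0.5 else 0.8
    return value * factor          # nudge up or down by 20%
```

So explore(0.2, lambda: random.uniform(0.01, 0.5)) comes back as 0.16, 0.24, or a fresh sample, which is exactly the spread of clip_param values in the pbt_global.txt entries above.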
Result for PPO_RobotableEnv-v0_c67a8_00005:
  custom_metrics: {}
  date: 2020-07-19_17-27-53
  done: false
  episode_len_mean: 114.58
  episode_reward_max: 0.7808001167724908
  episode_reward_mean: 0.7241841897925536
  episode_reward_min: 0.6627154081217708
  episodes_this_iter: 88
  episodes_total: 2500
  experiment_id: e3408f32ed2a433d8c7edb87d33609ba
  experiment_tag: 5@perturbed[clip_param=0.16,lambda=1.14,lr=5e-05]
  hostname: chrx
  info:
    learner:
      default_policy:
        cur_kl_coeff: 0.0625
        cur_lr: 4.999999873689376e-05
        entropy: 5.101933479309082
        entropy_coeff: 0.0
        kl: 0.004210006445646286
        model: {}
        policy_loss: -0.0077978381887078285
        total_loss: -0.007088268641382456
        vf_explained_var: 0.9757658243179321
        vf_loss: 0.0004464423400349915
    num_steps_sampled: 380000
    num_steps_trained: 380000
  iterations_since_restore: 5
  node_ip: 192.168.101.127
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 66.7095238095238
    ram_util_percent: 72.5452380952381
  pid: 14690
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_env_wait_ms: 1.5935033550679747
    mean_inference_ms: 1.8385610163959398
    mean_processing_ms: 1.195529456155168
  time_since_restore: 147.82027745246887
  time_this_iter_s: 29.546902656555176
  time_total_s: 1131.04909491539
  timers:
    learn_throughput: 1880.23
    learn_time_ms: 5318.497
    load_throughput: 350730.091
    load_time_ms: 28.512
    sample_throughput: 414.501
    sample_time_ms: 24125.418
    update_time_ms: 4.191
  timestamp: 1595179673
  timesteps_since_restore: 0
  timesteps_total: 380000
  training_iteration: 38
  trial_id: c67a8_00005
  
2020-07-19 17:27:54,989	WARNING util.py:137 -- The `experiment_checkpoint` operation took 0.8819785118103027 seconds to complete, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 2.6/3.8 GiB
PopulationBasedTraining: 28 checkpoints, 4 perturbs
Resources requested: 0/4 CPUs, 0/0 GPUs, 0.0/0.93 GiB heap, 0.0/0.29 GiB objects
Result logdir: /root/ray_results/PBT_ROBOTABLE
Number of trials: 8 (8 PAUSED)
+---------------------------------+----------+-------+--------+------------------+--------+----------+
| Trial name                      | status   | loc   |   iter |   total time (s) |     ts |   reward |
|---------------------------------+----------+-------+--------+------------------+--------+----------|
| PPO_RobotableEnv-v0_c67a8_00000 | PAUSED   |       |     36 |         1069.1   | 360000 | 0.735323 |
| PPO_RobotableEnv-v0_c67a8_00001 | PAUSED   |       |     36 |         1096.3   | 360000 | 0.736305 |
| PPO_RobotableEnv-v0_c67a8_00002 | PAUSED   |       |     33 |          987.687 | 330000 | 0.735262 |
| PPO_RobotableEnv-v0_c67a8_00003 | PAUSED   |       |     36 |         1096.22  | 360000 | 0.731993 |
| PPO_RobotableEnv-v0_c67a8_00004 | PAUSED   |       |     37 |         1103.48  | 370000 | 0.739188 |
| PPO_RobotableEnv-v0_c67a8_00005 | PAUSED   |       |     38 |         1131.05  | 380000 | 0.724184 |
| PPO_RobotableEnv-v0_c67a8_00006 | PAUSED   |       |     35 |         1067.26  | 350000 | 0.739985 |
| PPO_RobotableEnv-v0_c67a8_00007 | PAUSED   |       |     36 |         1085.05  | 360000 | 0.739295 |
+---------------------------------+----------+-------+--------+------------------+--------+----------+


(pid=14800) 2020-07-19 17:27:58,611	INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=14800) 2020-07-19 17:27:58,611	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=14800) pybullet build time: Mar 17 2020 17:46:41
(pid=14800) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=14800)   warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
(pid=14913) pybullet build time: Mar 17 2020 17:46:41
(pid=14913) 2020-07-19 17:28:00,118	INFO trainer.py:585 -- Tip: set framework=tfe or the --eager flag to enable TensorFlow eager execution
(pid=14913) 2020-07-19 17:28:00,118	INFO trainer.py:612 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=14913) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=14913)   warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
(pid=14992) pybullet build time: Mar 17 2020 17:46:41
(pid=14993) pybullet build time: Mar 17 2020 17:46:41
(pid=14992) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=14992)   warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
(pid=14800) 2020-07-19 17:28:10,106	INFO trainable.py:181 -- _setup took 11.510 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=14993) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=14993)   warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
(pid=14800) 2020-07-19 17:28:10,126	WARNING util.py:37 -- Install gputil for GPU system monitoring.
(pid=14800) 2020-07-19 17:28:10,717	INFO trainable.py:423 -- Restored on 192.168.101.127 from checkpoint: /root/ray_results/PBT_ROBOTABLE/PPO_RobotableEnv-v0_5_2020-07-19_15-00-03bbqeih3t/tmpf1h5txefrestore_from_object/checkpoint-35
(pid=14800) 2020-07-19 17:28:10,717	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 35, '_timesteps_total': None, '_time_total': 1067.2641203403473, '_episodes_total': 2289}
(pid=14913) 2020-07-19 17:28:12,388	INFO trainable.py:181 -- _setup took 12.284 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
(pid=14913) 2020-07-19 17:28:12,388	WARNING util.py:37 -- Install gputil for GPU system monitoring.
(pid=14913) 2020-07-19 17:28:12,760	INFO trainable.py:423 -- Restored on 192.168.101.127 from checkpoint: /root/ray_results/PBT_ROBOTABLE/PPO_RobotableEnv-v0_2_2020-07-19_14-52-33cutk2k27/tmplqac6svyrestore_from_object/checkpoint-33
(pid=14913) 2020-07-19 17:28:12,760	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 33, '_timesteps_total': None, '_time_total': 987.687007188797, '_episodes_total': 2059}
(pid=15001) pybullet build time: Mar 17 2020 17:46:41
(pid=15001) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=15001)   warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
(pid=15088) pybullet build time: Mar 17 2020 17:46:41
(pid=15088) /usr/local/lib/python3.6/dist-packages/gym/logger.py:30: UserWarning: WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
(pid=15088)   warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
Result for PPO_RobotableEnv-v0_c67a8_00002:
  custom_metrics: {}
  date: 2020-07-19_17-28-54
  done: false
  episode_len_mean: 110.78888888888889
  episode_reward_max: 0.8009732276880979
  episode_reward_mean: 0.7387077080695522
  episode_reward_min: 0.6640543988817607
  episodes_this_iter: 90
  episodes_total: 2149
  experiment_id: edcd859a3ae34d668bb9be1899dde41a
  experiment_tag: '2'
  hostname: chrx
  info:
    learner:
      default_policy:
        cur_kl_coeff: 1.0
        cur_lr: 9.999999747378752e-05
        entropy: 5.111008644104004
        entropy_coeff: 0.0
        kl: 0.0031687873415648937
        model: {}
        policy_loss: -0.012367220595479012
        total_loss: -0.008663905784487724
        vf_explained_var: 0.9726411700248718
        vf_loss: 0.0005345290992408991
    num_steps_sampled: 340000
    num_steps_trained: 340000
  iterations_since_restore: 1
  node_ip: 192.168.101.127
  num_healthy_workers: 2
  off_policy_estimator: {}
  perf:
    cpu_util_percent: 68.11833333333333
    ram_util_percent: 71.13666666666667
  pid: 14913
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf:
    mean_env_wait_ms: 1.6718764134441182
    mean_inference_ms: 1.9752634594235934
    mean_processing_ms: 1.2958259778937158
  time_since_restore: 41.650487661361694
  time_this_iter_s: 41.650487661361694
  time_total_s: 1029.3374948501587
  timers:
    learn_throughput: 1680.106
    learn_time_ms: 5952.007
    load_throughput: 74973.795
    load_time_ms: 133.38
    sample_throughput: 285.094
    sample_time_ms: 35076.171
    update_time_ms: 4.517
  timestamp: 1595179734
  timesteps_since_restore: 0
  timesteps_total: 340000
  training_iteration: 34
  trial_id: c67a8_00002
  
2020-07-19 17:28:55,042	WARNING util.py:137 -- The `experiment_checkpoint` operation took 0.5836038589477539 seconds to complete, which may be a performance bottleneck.
== Status ==
Memory usage on this node: 2.7/3.8 GiB
PopulationBasedTraining: 28 checkpoints, 4 perturbs
Resources requested: 3/4 CPUs, 0/0 GPUs, 0.0/0.93 GiB heap, 0.0/0.29 GiB objects
Result logdir: /root/ray_results/PBT_ROBOTABLE
Number of trials: 8 (7 PAUSED, 1 RUNNING)
+---------------------------------+----------+-----------------------+--------+------------------+--------+----------+
| Trial name                      | status   | loc                   |   iter |   total time (s) |     ts |   reward |
|---------------------------------+----------+-----------------------+--------+------------------+--------+----------|
| PPO_RobotableEnv-v0_c67a8_00000 | PAUSED   |                       |     36 |          1069.1  | 360000 | 0.735323 |
| PPO_RobotableEnv-v0_c67a8_00001 | PAUSED   |                       |     36 |          1096.3  | 360000 | 0.736305 |
| PPO_RobotableEnv-v0_c67a8_00002 | RUNNING  | 192.168.101.127:14913 |     34 |          1029.34 | 340000 | 0.738708 |
| PPO_RobotableEnv-v0_c67a8_00003 | PAUSED   |                       |     36 |          1096.22 | 360000 | 0.731993 |
| PPO_RobotableEnv-v0_c67a8_00004 | PAUSED   |                       |     37 |          1103.48 | 370000 | 0.739188 |
| PPO_RobotableEnv-v0_c67a8_00005 | PAUSED   |                       |     38 |          1131.05 | 380000 | 0.724184 |
| PPO_RobotableEnv-v0_c67a8_00006 | PAUSED   |                       |     35 |          1067.26 | 350000 | 0.739985 |
| PPO_RobotableEnv-v0_c67a8_00007 | PAUSED   |                       |     36 |          1085.05 | 360000 | 0.739295 |
+---------------------------------+----------+-----------------------+--------+------------------+--------+----------+
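For the replay-with-GUI idea, the rllib rollout CLI should do it (Ray ~0.8). The checkpoint path below is illustrative, pieced together from the log dir above; the env also has to actually render, e.g. by constructing the PyBullet client in GUI mode when the env is registered:

```shell
rllib rollout \
  /root/ray_results/PBT_ROBOTABLE/PPO_RobotableEnv-v0_c67a8_00006/checkpoint_35/checkpoint-35 \
  --run PPO --env RobotableEnv-v0 --steps 10000
```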