Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. Paper notes

2020-06-13

1. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. >>

A. 0 >> Abstract

1. reinforcement learning often compromises the autonomy of the learning process >>

a) Deep reinforcement learning: DRL alleviates this >>

(1) sample complexity: limited by high sample complexity >>

i) simple tasks >>

ii) simulated settings >>

2. this paper >>

a) deep Q-functions >>

(1) pool their policy updates asynchronously >>

B. 1 >> I. INTRODUCTION

1. deep Q-functions >>

a) without user-provided demonstrations >>

2. challenges >>

high sample-complexity

a) Deep Deterministic Policy Gradient algorithm (DDPG) >>

b) Normalized Advantage Function algorithm (NAF) >>

c) parallelizing the algorithm >>

3. main contribution >>

a) a demonstration of asynchronous deep reinforcement learning using our parallel NAF algorithm across a cluster of robots >>

b) a simple and effective safety mechanism for constraining exploration at training time >>

C. 2 >> II. RELATED WORK

1. used low-dimensional policy representations >>

Many of the RL

a) high-dimensional systems >>

recently

2. this work >>

a) model-free RL >>

b) function approximation methods >>

c) learning complex tasks >>

d) parallelized learning >>

(1) our work instead seeks to minimize the training time when training on real physical robots >>

i) experience is expensive >>

ii) neural network backward passes are comparatively cheap >>

iii) retain the use of a replay buffer >>

iv) focus on asynchronous execution and neural network training >>

D. 3 >> III. BACKGROUND
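
For reference, the NAF parameterization that the later sections extend can be written as below. This is a sketch in the notation commonly used for NAF; the symbols x (state), u (action), \mu, P and L do not appear in these notes and are introduced here only for illustration.

```latex
Q(x, u \mid \theta^Q) = V(x \mid \theta^V) + A(x, u \mid \theta^A)

A(x, u \mid \theta^A) = -\tfrac{1}{2}\,\bigl(u - \mu(x \mid \theta^\mu)\bigr)^{\top} P(x \mid \theta^P)\,\bigl(u - \mu(x \mid \theta^\mu)\bigr)

P(x \mid \theta^P) = L(x \mid \theta^P)\, L(x \mid \theta^P)^{\top}
```

Because P is positive definite, the advantage term is at most zero and reaches zero at u = \mu(x), so the greedy action is given directly by \mu(x) without a separate maximization over actions.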

E. 4 >> IV. ASYNCHRONOUS TRAINING OF NORMALIZED ADVANTAGE FUNCTIONS

1. an extension of NAF >>

a) describes how online training of the Q-function estimator can be performed asynchronously, with one thread training the network and one or more worker threads executing the current policy on one or more robots >>

b) learning time is often constrained by the real-time data collection rate, rather than by network training speed >>

c) safety constraint >>

2. A. Asynchronous Learning >>

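a) The learner thread uses the replay buffer to perform asynchronous updates to the deep neural network Q-function approximator. >>

(1) >>

i) This thread runs on a central server and dispatches updated policy parameters to each worker thread. The experience-collection worker threads run on the individual robots and send the observation, action, and reward at each time step to the central server, where they are appended to the replay buffer. This decoupling between the training thread and the collector threads lets the controller on each robot run in real time, without delays caused by the computational cost of network backpropagation. It also makes it easy to parallelize experience collection across multiple robots simply by adding more worker threads. We use only one thread to train the network, but the gradient computation could also be distributed within our framework in the same way as in [29]. While the trainer thread keeps training from the centralized replay buffer, the collector >>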
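
The decoupling described above can be sketched with plain Python threads. This is a toy illustration, not the paper's implementation: the names `ReplayBuffer`, `ParameterServer`, `collector`, `learner`, `run`, and the `sync_every` interval are hypothetical, the "policy" is a single scalar, and the learner's update is a stand-in regression step rather than a NAF Q-function update.

```python
import random
import threading
import time
from collections import deque


class ReplayBuffer:
    """Thread-safe buffer shared by the collector threads and the learner."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def add(self, transition):
        with self._lock:
            self._data.append(transition)

    def sample(self, batch_size):
        with self._lock:
            if len(self._data) < batch_size:
                return None  # not enough experience yet
            return random.sample(list(self._data), batch_size)

    def __len__(self):
        with self._lock:
            return len(self._data)


class ParameterServer:
    """Holds the newest policy parameters published by the learner."""

    def __init__(self, params):
        self._params, self._version = params, 0
        self._lock = threading.Lock()

    def publish(self, params):
        with self._lock:
            self._params, self._version = params, self._version + 1

    def fetch(self):
        with self._lock:
            return self._params, self._version


def collector(robot_id, server, buffer, stop, sync_every=10):
    """Stand-in for one robot: acts with a local policy copy, logs transitions."""
    rng = random.Random(robot_id)
    params, _ = server.fetch()
    step = 0
    while not stop.is_set():
        if step % sync_every == 0:
            params, _ = server.fetch()  # pull the latest parameters
        obs = rng.random()
        action = params * obs + rng.gauss(0.0, 0.1)  # toy policy plus noise
        buffer.add((obs, action, -abs(action - obs)))
        step += 1
        time.sleep(0.001)  # control period; never waits on training


def learner(server, buffer, stop, num_updates, batch_size=8):
    """Single training thread: samples the buffer and publishes new parameters."""
    params, _ = server.fetch()
    updates = 0
    while updates < num_updates:
        batch = buffer.sample(batch_size)
        if batch is None:
            time.sleep(0.001)
            continue
        # Placeholder update (assumption): fit params so that params * obs ~ obs.
        grad = sum((params * o - o) * o for o, _, _ in batch) / len(batch)
        params -= 0.1 * grad
        server.publish(params)
        updates += 1
    stop.set()  # tell the collectors to finish


def run(num_robots=2, num_updates=50):
    buffer, server = ReplayBuffer(10_000), ParameterServer(0.0)
    stop = threading.Event()
    threads = [threading.Thread(target=collector, args=(i, server, buffer, stop))
               for i in range(num_robots)]
    threads.append(threading.Thread(target=learner,
                                    args=(server, buffer, stop, num_updates)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    params, version = server.fetch()
    return params, version, len(buffer)
```

Adding robots only means starting more `collector` threads; the learner loop is unchanged, which mirrors the point that data collection and training proceed at independent rates.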

3. B. Safety Constraints >>

a) For all experiments, we set a maximum commanded velocity allowed per joint, as well as strict position limits for each joint. >>

b) with no contacts >>

(1) sufficient >>

c) experiments with contacts >>

(1) additional heuristics >>
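
A minimal sketch of the per-joint limits follows. The notes state only that a maximum commanded velocity and strict position limits are set per joint; the specific rule used here (cap the velocity so that one control step of length `dt` cannot leave the allowed range) and the function name `safe_velocity_command` are assumptions for illustration. The additional heuristics for contact tasks are not covered.

```python
def safe_velocity_command(q, dq_cmd, q_min, q_max, dq_max, dt):
    """Return joint velocities that respect position and speed limits.

    q       current joint positions
    dq_cmd  velocities requested by the policy
    q_min, q_max  per-joint position limits
    dq_max  per-joint maximum absolute velocity
    dt      control period in seconds (assumed fixed)
    """
    safe = []
    for qi, vi, lo, hi, vmax in zip(q, dq_cmd, q_min, q_max, dq_max):
        # Position limit: keep qi + v * dt inside [lo, hi].
        v = max((lo - qi) / dt, min((hi - qi) / dt, vi))
        # Speed limit, applied last so it always holds.
        v = max(-vmax, min(vmax, v))
        safe.append(v)
    return safe
```

For example, a joint at 0.98 with an upper limit of 1.0 and `dt = 0.1` has its velocity capped near 0.2 regardless of how large the policy's command is, while a joint far from its limits is only subject to the speed cap.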

4. C. Network Architectures >>
