1. Motion Planning Among Dynamic, Decision-Making Agents with Deep Reinforcement Learning >>
A. 0 >> Abstract
1. An assumption made by other methods >> key assumptions about other agents’ behavior that deviate from reality as the number of agents in the environment increases.
2. This paper does not assume other agents follow any particular behavior rules >>
3. LSTM: the policy can use observations of an arbitrary number of other agents >>
B. 1 >> I. INTRODUCTION
1. behaviors >>
a) cooperation or obliviousness >>
(1) only partially observe >>
2. what the agent knows and assumes about other agents’ belief states, policies, and intents >>
3. Without communication >>
a) not directly measurable >>
b) can be inferred >>
4. Simplest assumption >> treat other agents as static, and re-plan quickly enough to avoid collisions
5. Another assumption >> other agents are dynamic obstacles
a) constant velocity >>
6. Alternative when other agents cannot be predicted exactly >> reinforcement learning
7. This work >> without assuming that other agents follow any particular behavior model
8. Another challenge >> the number of agents in the environment varies
9. Problem with existing models >> Existing strategies define a maximum number of agents that the network can observe, because feedforward networks require a fixed-dimension input
10. Solution >> long short-term memory (LSTM)
a) This enables the algorithm to make decisions based on an arbitrary number of other agents in the robot’s vicinity. >>
11. 本文贡献 >> main contributions
a) 1 >> does not assume the behavior of other decision-making agents
b) 2 >> use observations of an arbitrary number of other agents
c) 3 >> demonstrating the benefits
d) 4 >> demonstration
C. 2 >> II. BACKGROUND
1. A. Related Work >>
a) non-communicating, dynamic agents >>
decentralized collision avoidance
(1) trajectory-based >>
i) compute plans on a longer timescale to produce smoother paths >>
ii) Drawbacks >> require knowledge of unobservable states; computationally expensive
(2) reaction-based >>
i) one-step interaction rules based on geometry or physics >>
ii) Drawback >> short-sighted
(3) many of these approaches are non-learning-based >>
b) Contribution of the previous papers >> deep reinforcement learning
(1) The expensive operation of modeling the complex interactions is learned in an offline training step, whereas the learned policy can be queried quickly online, combining the advantages of both classes of methods. >>
(2) Cooperation is embedded in the learned value function >>
i) and the algorithm compares possible actions by querying the value of future states after an arbitrary forward propagation of other agents (see the sketch below) >>
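A minimal sketch of this value-query step, to make the mechanism concrete. It is not the authors' code: the function and argument names, the constant-velocity propagation, the simplified discount gamma**dt, and the omission of the collision-reward term are all illustrative assumptions.

```python
import numpy as np

def cadrl_select_action(ego_pos, others, candidate_actions, value_fn, dt=1.0, gamma=0.97):
    """Illustrative value-query loop (simplified; not the authors' implementation).

    ego_pos           : np.array of shape (2,), the robot's position
    others            : list of (position, velocity) pairs for observed agents
    candidate_actions : list of np.array(2) velocity commands to evaluate
    value_fn          : learned value function mapping (ego_pos, other_positions) -> float
    """
    best_u, best_val = None, -np.inf
    for u in candidate_actions:
        # Propagate the robot under action u, and the other agents under an
        # assumed model (here constant velocity), forward by dt seconds.
        ego_next = ego_pos + dt * u
        others_next = [p + dt * v for (p, v) in others]
        # Compare actions by the discounted value of the propagated joint state.
        val = gamma ** dt * value_fn(ego_next, others_next)
        if val > best_val:
            best_u, best_val = u, val
    return best_u

# Toy usage: a value function that simply rewards being close to a goal at (5, 0).
goal = np.array([5.0, 0.0])
toy_value = lambda p, others: -np.linalg.norm(goal - p)
u = cadrl_select_action(np.zeros(2),
                        [(np.array([2.0, 1.0]), np.array([-0.5, 0.0]))],
                        [np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                        toy_value)
print(u)  # -> [1. 0.] for this toy value function
```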
c) Other deep RL approaches >>
(1) end-to-end training >>
i) Drawback >> lack an agent-level understanding of the world
d) challenge of a variable number of agents in the environment >>
(1) Approach 1 >> define a maximum number of agents
i) Drawback >> limited by the increased number of network parameters (and therefore training time) as more agents’ states are added
(2) Approach 2 >> use raw sensor inputs, which maintain a fixed-size input
i) Same limitation as above: the maximum number of agents is constrained by the growth in network parameters (and therefore training time) as more agents’ states are added >>
e) For dynamic environments >>
(1) as in multi-sensor semantic labeling, applied on an agent-by-agent basis >>
2. B. Collision Avoidance with Deep RL (CADRL) >>
a) state vector (see the sketch after this list) >>
(1) observable >>
i) agent’s position, velocity, and radius >>
(2) unobservable >>
i) goal position, preferred speed, and orientation >>
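A minimal sketch of the observable/unobservable split of the state vector; the class and field names are illustrative, not the paper's notation.

```python
from dataclasses import dataclass

@dataclass
class ObservableState:
    # Quantities other agents can sense directly.
    px: float
    py: float
    vx: float
    vy: float
    radius: float

@dataclass
class HiddenState:
    # Quantities known only to the agent itself.
    goal_x: float
    goal_y: float
    pref_speed: float
    heading: float

@dataclass
class AgentState:
    observable: ObservableState
    hidden: HiddenState
```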
b) A policy, π : (s_t, s̃^o_t) ↦ u_t, is developed with the objective of minimizing expected time to goal E[t_g] while avoiding collision with other agents >>
(1) argmin_{π(s, s̃^o)} E[t_g | s_t, s̃^o_t, π]
    s.t. ||p_t − p̃_t||₂ ≥ r + r̃ for all t (collision avoidance); p_{t_g} = p_g (reach goal); p_t = p_{t−1} + ∆t · π(s_{t−1}, s̃^o_{t−1}) (kinematics) >>
c) ∆t = 1 >>
d) The challenge of choosing ∆t motivated the use of a different RL framework >>
3. C. Policy-Based Learning >>
a) RL frameworks >> execute without any arbitrary assumptions about state transition dynamics
b) The state-of-the-art actor-critic algorithm A3C [14] uses a single DNN to approximate both the value (critic) and policy (actor) functions, and is trained with two loss terms (see the sketch below) >>
(1) f_v = (R_t − V(s_t; θ_v))²   (value loss)
    f_π = log π(u_t | s_t; θ) (R_t − V(s_t; θ_v)) + β · H(π(s_t; θ))   (policy loss) >>
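A numpy sketch of the two loss terms above, written for clarity rather than training efficiency; the function name, the batched mean, and the default β are illustrative assumptions.

```python
import numpy as np

def a3c_losses(log_pi_a, values, returns, entropy, beta=1e-2):
    """Sketch of the A3C value and policy loss terms (illustrative helper).

    log_pi_a : log π(u_t | s_t) of the actions actually taken, shape (T,)
    values   : critic estimates V(s_t), shape (T,)
    returns  : n-step discounted returns R_t, shape (T,)
    entropy  : policy entropy H(π(s_t)), shape (T,)
    """
    advantage = returns - values
    # Critic (value) loss: squared error between the n-step return and the value estimate.
    f_v = np.mean(advantage ** 2)
    # Actor (policy) objective: policy-gradient term plus an entropy bonus weighted by beta.
    # The advantage is treated as a constant here (no gradient through the critic).
    f_pi = np.mean(log_pi_a * advantage + beta * entropy)
    return f_v, f_pi
```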
D. 3 >> III. APPROACH
1. A. GA3C-CADRL >>
2. B. Handling a Variable Number of Agents >>
a) Long short-term memory (LSTM) [10] is a recurrent architecture with advantageous properties for training. >>
b) This paper leverages the LSTM’s ability to encode a sequence of information that is not time-dependent >>
c) Given a sufficiently large hidden state vector, there is enough space to encode a large number of agents’ states without the LSTM having to forget anything relevant. When there are many agent states, to mitigate the impact of forgetting the earliest inputs, the states are fed in reverse order of distance to the agent, meaning the closest agents (fed last) have the biggest effect on the final hidden state, h_n (see the sketch below). >>
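A minimal PyTorch sketch of this ordering trick, assuming (for illustration only) that the first two entries of each agent's state vector are its (x, y) position in the robot's frame; the sizes and names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

def encode_other_agents(agent_states, ego_position, lstm):
    """Encode a variable number of agent states into a fixed-size vector (illustrative).

    agent_states : list of 1-D tensors, one observable state vector per nearby agent
    ego_position : tensor of shape (2,), the robot's own (x, y) position
    lstm         : nn.LSTM with input_size == len(each agent state), batch_first=True
    """
    # Sort by distance to the robot, farthest first, so the closest agents are fed
    # last and have the largest influence on the final hidden state h_n.
    dists = [torch.norm(s[:2] - ego_position) for s in agent_states]
    order = sorted(range(len(agent_states)), key=lambda i: dists[i], reverse=True)
    seq = torch.stack([agent_states[i] for i in order]).unsqueeze(0)  # (1, n, d)
    _, (h_n, _) = lstm(seq)
    return h_n.squeeze(0).squeeze(0)  # fixed-size encoding, independent of n

# Example usage with illustrative sizes: 5 nearby agents, 7-dimensional states.
lstm = nn.LSTM(input_size=7, hidden_size=64, batch_first=True)
others = [torch.randn(7) for _ in range(5)]
h = encode_other_agents(others, torch.zeros(2), lstm)  # shape: (64,)
```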
3. C. Training the Policy >>
E. 4 >> IV. RESULTS
1. A. Computational Details >>
a) i7-7700K CPU >> 0.4-0.5 ms per query; a GPU is not required for fast execution of a trained model.
b) GTX1060 >> 12 hours of training