Reinforcement Learning in Robotics: A Survey
Author: Jens Kober
Created: Jan 01, 2020 8:54 AM
Status: Reading
Tags: Reinforcement Learning
Type: Paper
Introduction
Brief summary
RL gives two things?
A framework and a set of tools for the design of sophisticated and hard-to-engineer behaviors.
Paper’s Focus?
1. Model-based and model-free methods; value-function-based and policy-search methods.
How does it work, in detail?
Through trial and error the robot discovers behavior autonomously; the designer provides feedback in terms of a scalar objective function that measures the one-step performance.
What is called a “policy”?
A function π that generates the motor commands based on the current situation.
RL’s aim?
To find a policy that optimizes the long-term sum of rewards (in practice, the algorithm finds a near-optimal policy).
RL in ML
Supervised Learning & Imitation Learning?
- Supervised learning: learns from a sequence of independent examples.
- Imitation learning: learns what to do in GIVEN situations from demonstrations.
What is a reduction algorithm?
Converting a problem with a proven solution in one area into another problem is a key idea in ML.
A CRUCIAL fact in learning applications?
Existing learning techniques often reduce harder problems to simpler classification problems; making such reductions work is an active research area.
What information is provided to RL?
Only feedback on the chosen action, not on what might have been.
The problem compared with supervised learning?
In RL, any mistakes modify the future observations the learner receives.
Same and different compared with imitation learning?
Same: both are interactive, sequential prediction problems.
Different: RL has complex reward structures (with only bandit-style feedback on the actions actually chosen), which makes it theoretically and computationally hard.
Same between RL and classical optimal control?
- Both aim to find an optimal policy (also called the controller or control policy).
- Both rely on a notion of a model that describes the transitions between states.
Different between RL and classical optimal control?
classical: breaks down because of inaccuracies in the model and in the computational approximations.
RL: operates on measured data and rewards from interaction with the environment; it uses approximations and data-driven techniques.
RL in Robotics
- 2 interesting points in robot learning systems?
- They often use policy search.
- Many are model-based, though not every approach is.
Techniques
Brief summary
Basic work process?
Maximize the accumulated reward; the task is restarted after the end of each episode. If the task does NOT have a clear beginning and end, optimize over the whole life-time or the discounted return.
Reward R is based on?
A function of the state and observation.
What is the GOAL of RL?
To find a mapping from states to actions (a policy π).
2 forms of policy π? In detail?
Deterministic or probabilistic. A deterministic policy uses exactly the same action for a given state; a probabilistic policy draws a sample from a distribution.
Why does RL need Exploration? Where can it be found?
Because the relations between states, actions, and rewards are initially unknown, the system has to explore. Exploration can be:
- directly embedded in the policy
- handled separately and only as part of the learning process.
Basic method in RL?
Markov Decision Process (MDP).
S: State;
A: Action;
R: Reward;
T: Transition probabilities (or densities in the continuous-state case)
Transition probability
The transition probability is an important concept in Markov chains. If a Markov chain consists of m states, the historical data are converted into a sequence composed of these m states. Starting from any state, after any single transition one of the states 1, 2, ..., m must appear; this transition between states is described by the transition probability.
The meaning of T(s′, a, s) = P(s′ | s, a)?
The next state s′ and the reward depend only on the previous state s and the action a (the Markov property).
Two different kinds of reward-function dependence?
- only on the current state and action: R = R(s, a)
- on the transition: R = R(s′, a, s)
(A toy numerical sketch of the (S, A, T, R) tuple follows below.)
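To make the (S, A, T, R) tuple concrete, here is a minimal toy sketch in Python. The two states, two actions, and all numbers are invented for illustration and are not taken from the survey.

```python
import numpy as np

# A tiny, made-up MDP with 2 states and 2 actions.
n_states, n_actions = 2, 2

# T[s, a, s'] = P(s' | s, a): transition probabilities (each row over s' sums to 1).
T = np.array([
    [[0.9, 0.1],   # state 0, action 0
     [0.2, 0.8]],  # state 0, action 1
    [[0.0, 1.0],   # state 1, action 0
     [0.5, 0.5]],  # state 1, action 1
])

# R[s, a]: a reward that depends only on the current state and action, R = R(s, a).
R = np.array([
    [0.0, 1.0],
    [2.0, 0.0],
])

rng = np.random.default_rng(0)

def step(s, a):
    """Sample one transition: next state s' ~ T(.|s, a) and reward R(s, a)."""
    s_next = rng.choice(n_states, p=T[s, a])
    return s_next, R[s, a]
```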
Goals of RL
to discover an optimal policy π∗ that maps states (or observations) to actions so as to maximize the expected return J.
finite-horizon model:
maximize the expected reward over the next H steps.
a discount factor γ:
- affects how much the future is taken into account
- is tuned manually
γ could lead to some problems. Characteristics:
- myopic, greedy behavior
- poor performance and an unstable optimal control law (for a low discount factor).
conclusion:
the resulting control law can be inadmissible.
average reward criterion:
γ is replaced by 1/H (the reward is averaged over the horizon).
problems:
- it cannot distinguish between policies that initially gain high rewards and those that initially gain low rewards.
bias optimal (Lewis and Puterman):
- definition: a policy with both an optimal prefix and optimal long-term behavior.
- prefix = transient phase
a characteristic in RL:
different policies can attain the same long-term reward but differ in their transient behavior.
important shortcoming & most relevant model:
important shortcoming:
- of the discounted formulation (not of the average reward one)
- reason: in robotics, stable behavior usually matters more than a good transient.
Robotic tasks are often episodic: the control task runs for H time-steps, is reset, and started over.
With a finite horizon H, the expected reward is guaranteed to converge.
So, finite-horizon models are often the most relevant.
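For reference, the three return models discussed above can be written out as follows; this is standard notation, assumed here rather than quoted verbatim from the survey.

```latex
% Finite-horizon model: expected reward accumulated over the next H steps
J = \mathbb{E}\!\left[ \sum_{h=0}^{H} R_h \right]

% Discounted model: discount factor \gamma \in [0, 1)
J = \mathbb{E}\!\left[ \sum_{h=0}^{\infty} \gamma^{h} R_h \right]

% Average reward criterion: \gamma replaced by 1/H, with the horizon taken to infinity
J = \lim_{H \to \infty} \mathbb{E}\!\left[ \frac{1}{H} \sum_{h=0}^{H} R_h \right]
```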
basic goals (2):
- optimal strategy
- maximize the reward
“exploration-exploitation trade-off”:
- whether to play it safe and stick to well-known actions with high rewards
- or to dare trying new things in order to discover new strategies with an even higher reward (see the ε-greedy sketch below).
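A common, simple way to handle this trade-off is ε-greedy action selection. The sketch below is a hypothetical illustration (the Q-table, its size, and the value of ε are made up); the survey itself covers a broader range of exploration strategies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Q-table: estimated return for each (state, action) pair.
Q = np.zeros((2, 2))

def epsilon_greedy(state, epsilon=0.1):
    """With probability epsilon try a random action (explore), otherwise take the best-known one (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # dare trying something new
    return int(np.argmax(Q[state]))            # play it safe: stick to the known best action
```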
“curse of dimensionality”:
in continuous spaces, the number of states and actions scales exponentially with the number of state variables.
off-policy & on-policy:
off-policy:
samples are gathered by a behavior policy different from the desired final policy.
on-policy:
the current policy itself is used to gather information about the environment.
great implication of using probability distributions:
stochastic policies can be optimal stationary policies for selected problems (and may help break the curse of dimensionality).
RL in the Average Reward Setting (ARS)
2 situations that make the ARS more suitable:
- one does not have to choose a discount factor
- one does not have to explicitly specify a time horizon
Policy π’s features:
stationary & memoryless
RL’s aim:
maximize the average return with respect to the policy π and its parameters θ
policy search & value-function-based approaches, defined:
policy search:
optimizing in the primal formulation.
value-function-based:
searching in the dual formulation.
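In the average-reward setting, the objective that policy search optimizes directly (the primal formulation) can be written, in standard notation assumed here rather than quoted from the survey, as the expected reward under the stationary state distribution μ^π induced by the policy:

```latex
J(\pi) = \sum_{s \in S} \sum_{a \in A} \mu^{\pi}(s)\, \pi(a \mid s)\, R(s, a)
```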
Value Function Approaches
Karush-Kuhn-Tucker (KKT) conditions:
- It means there are as many equations as the number of states multiplied by the number of actions.
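For intuition on where that count comes from: in the discounted setting (the survey itself works in the average-reward setting), the value function solves a linear program with one constraint per state-action pair, so the KKT / complementary-slackness conditions involve |S| · |A| (in)equalities. A standard sketch, not quoted from the survey:

```latex
\min_{V} \; \sum_{s \in S} V(s)
\quad \text{s.t.} \quad
V(s) \;\ge\; R(s, a) + \gamma \sum_{s'} T(s', a, s)\, V(s')
\qquad \forall\, s \in S,\; a \in A
```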
Bellman Principle of Optimality:
Definition:
whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
Method:
perform an optimal action a*, then follow the optimal policy π* to achieve the global optimum.
Conclusion:
the optimal value function V* corresponds to the long-term additional reward gained by starting in state s while taking optimal actions a*.
traditional RL approaches:
identify solutions to this equation (the value function).
steps:
1. approximate the Lagrangian multipliers, i.e., the optimal value function V*.
2. reconstruct the optimal policy from it.
ALERT: the action a* is determined by the policy π.
Qπ can be used instead of Vπ; the difference is that Q explicitly shows the effect of a particular action.
conclusion:
choosing a* reconstructs an optimal, deterministic policy π* that achieves the highest V*.
If V and T are known and the spaces are discrete, the optimal policy can be obtained by exhaustive search.
!!!: in continuous spaces, the policy and value function can no longer be represented as a table over discrete states and actions; if the space is too big, its dimensionality has to be reduced (if possible).
- Using Q avoids needing the transition function T (see the equations below).
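Written out in the common discounted notation (assumed here for clarity; the survey states the average-reward version), the Bellman optimality equations for V* and Q*, and the policy reconstruction, are:

```latex
V^{*}(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} T(s', a, s)\, V^{*}(s') \Big]

Q^{*}(s, a) = R(s, a) + \gamma \sum_{s'} T(s', a, s)\, \max_{a'} Q^{*}(s', a')

\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
\quad \text{(no transition model } T \text{ is needed once } Q^{*} \text{ is known)}
```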
Dynamic Programming-Based Methods (model-based)
- use the transition model T and rewards R to compute the value function
- T and R are NOT necessarily predetermined; they can be learned from data, potentially incrementally.
TYPICAL METHODS:
Policy iteration
two phases:
policy evaluation:
determines the value function for the current policy
policy improvement:
greedily selects the best action in every state
Value iteration
- combines policy evaluation and policy improvement
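A minimal value-iteration sketch for a small discrete MDP, reusing the made-up T[s, a, s'] and R[s, a] arrays from the toy MDP above (γ and the tolerance are arbitrary illustration values). The max over actions inside the loop plays the role of the greedy policy-improvement step, while the Bellman backup itself is the evaluation step:

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Compute V* and a greedy policy for a discrete MDP with T[s, a, s'] and R[s, a]."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Bellman backup: Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)            # improvement: act greedily in every state
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)            # reconstruct the deterministic policy pi*
    return V, policy
```

With the toy T and R from the MDP sketch above, `value_iteration(T, R)` returns both the value of each state and the greedy action to take in it.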
Monte Carlo Methods
Temporal Difference Methods
Challenges
- Curse of Dimensionality
- Curse of Real-World Samples
- Curse of Under-Modeling and Model Uncertainty
- Curse of Goal Specification
Approaches
Tractability Through Representation
- Smart State-Action Discretization
- Value Function Approximation
- Pre-structured Policies
Tractability Through Prior Knowledge
- Prior Knowledge Through Demonstration
- Prior Knowledge Through Task Structuring
Tractability Through Models
- Core Issues and General Techniques in Mental Rehearsal
- Successful Learning Approaches with Forward Model