BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//talks.cam.ac.uk//v3//EN
BEGIN:VTIMEZONE
TZID:Europe/London
BEGIN:DAYLIGHT
TZOFFSETFROM:+0000
TZOFFSETTO:+0100
TZNAME:BST
DTSTART:19700329T010000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0100
TZOFFSETTO:+0000
TZNAME:GMT
DTSTART:19701025T020000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
CATEGORIES:Signal Processing and Communications Lab Seminars
SUMMARY:Simple Reinforcement Learning Algorithms for Conti
nuous State and Action Space Systems - Prof. Rahul
Jain\, University of Southern California
DTSTART;TZID=Europe/London:20190617T120000
DTEND;TZID=Europe/London:20190617T130000
UID:TALK126373AThttp://talks.cam.ac.uk
URL:http://talks.cam.ac.uk/talk/index/126373
DESCRIPTION:Reinforcement Learning (RL) problems for continuou
s state and action space systems are quite challen
ging. Recently\, deep reinforcement learning metho
ds have been shown to be quite effective for certa
in RL problems in settings of very large/continuou
s state and action spaces. But such methods requir
e extensive hyper-parameter tuning\, huge amounts
of data\, and come with no performance guarantees.
We note that such methods are mostly trained ‘offl
ine’ on experience replay buffers. \n\nIn this tal
k\, I will describe a series of simple reinforceme
nt learning schemes for various settings. Our prem
ise is that we have access to a generative model t
hat can give us simulated samples of the next stat
e. We will start with finite state and action spac
e MDPs. An ‘empirical value learning’ (EVL) algori
thm can be derived quite simply by replacing the e
xpectation in the Bellman operator with an empiric
al estimate. We note that the EVL algorithm has r
emarkably good numerical performance for practical
purposes. We next extend this to continuous state
spaces by considering randomized function approxi
mation on a reproducing kernel Hilbert space (RKH
S). This allows for arbitrarily good approximation
with high probability for any problem due to its
universal function approximation property. Next\,
we consider continuous action spaces. In each iter
ation of EVL\, we sample actions from the continuo
us action space\, and take a supremum over the sam
pled actions. Under mild assumptions on the MDP\,
we show that this performs quite well numerically\
, with provable performance guarantees. Finally\,
we consider the ‘Online-EVL’ algorithm that learns
from a trajectory of state-action-reward tuples.
Under mild mixing conditions on the trajectory\,
we can provide performance bounds and also show t
hat it has competitive (and in fact marginally bet
ter) performance compared to the Deep Q-Network
algorithm on a benchmark RL problem. I will concl
ude with a brief overview of the framework of probab
ilistic contraction analysis of iterated random op
erators that underpins the theoretical analysis. \
n\nThis talk is based on work with a number of peo
ple including Vivek Borkar (IIT Bombay)\, Peter G
lynn (Stanford)\, Abhishek Gupta (Ohio State)\, Wi
lliam Haskell (Purdue)\, Dileep Kalathil (Texas A&
M)\, and Hiteshi Sharma (USC).
LOCATION:LR3A\, Inglis Building\, CUED
CONTACT:Dr Ramji Venkataramanan
END:VEVENT
END:VCALENDAR