Deep RL Bootcamp Lecture 1: Motivation + Overview + Exact Solution Methods

[0-10] Starts off with a review of MDPs from Sutton and Barto, and then sets up a simple grid world with an example of a deterministic policy over an infinite horizon.

[10-20] OMG, the talk gets a bit derailed by a series of rando questions that don't seem to clarify anyone's thinking (most are answerable from the definition of a deterministic policy). Maybe too much coffee?

[20-30] Nice introduction to value iteration, showing how, in a stochastic model, it is safer to stay put when a move is unlikely to pay off. There was also a nice setup for why value iteration converges: once the horizon is long enough for the optimal path to reach an exit condition, additional time steps no longer change the values.
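
To make the "stay put" point concrete for myself, here is a minimal value iteration sketch in Python. The three-state MDP, its rewards (+1 for the exit, -10 for the pit), and the names `make_risky_mdp` and `value_iteration` are all my own assumptions for illustration, not the lecture's grid world.

```python
import numpy as np

ACTIONS = ("stay", "move")

def make_risky_mdp(p_success):
    """Hypothetical toy MDP: state 0 = start, 1 = exit, 2 = pit."""
    P = np.zeros((3, 2, 3))  # P[s, a, s2] = transition probability
    R = np.zeros((3, 2, 3))  # R[s, a, s2] = reward for that transition
    P[0, 0, 0] = 1.0                  # "stay": remain in the start state, reward 0
    P[0, 1, 1] = p_success            # "move": reach the exit...
    P[0, 1, 2] = 1.0 - p_success      # ...or slip into the pit
    R[0, 1, 1] = 1.0                  # assumed exit reward
    R[0, 1, 2] = -10.0                # assumed pit penalty
    P[1, :, 1] = 1.0                  # exit is absorbing, no further reward
    P[2, :, 2] = 1.0                  # pit is absorbing, no further reward
    return P, R

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        # Bellman backup: Q[s, a] = sum_s2 P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:  # extra sweeps no longer change the values
            break
    return V, Q.argmax(axis=1)

for p in (0.5, 0.95):
    V, policy = value_iteration(*make_risky_mdp(p))
    print(p, ACTIONS[policy[0]], round(float(V[0]), 3))
```

With these assumed numbers, a coin-flip move is not worth the risk of the pit, so the greedy policy stays; at 95% success the expected payoff of moving wins out.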

[30-40] This helped me solidify why a low gamma leads us to prioritize immediate rewards, while a high gamma leads us to hold out for larger rewards in future steps. He mentioned that Q* implicitly encodes the best policy, so it is a combo of the V* and \pi* from value iteration (because Q*(s, a) is defined as taking action a in state s "and thereafter act optimally", which requires the best policy).
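
A small sketch of both ideas, assuming a made-up four-state chain of my own (not from the lecture): in state 0 you can cash out for a reward of 1 right away, or wait two steps for a reward of 5. Q-value iteration gives Q*, and taking the argmax over actions recovers \pi* without storing the policy separately. The names `make_patience_mdp` and `q_value_iteration` are hypothetical.

```python
import numpy as np

def make_patience_mdp():
    """Hypothetical chain: 0 = start, 1 and 2 = waiting, 3 = terminal."""
    P = np.zeros((4, 2, 4))  # P[s, a, s2] = transition probability
    R = np.zeros((4, 2, 4))  # R[s, a, s2] = reward for that transition
    P[0, 0, 3] = 1.0; R[0, 0, 3] = 1.0   # action 0: cash out now for +1
    P[0, 1, 1] = 1.0                      # action 1: wait, no reward yet
    P[1, :, 2] = 1.0                      # keep waiting
    P[2, :, 3] = 1.0; R[2, :, 3] = 5.0   # delayed payoff of +5
    P[3, :, 3] = 1.0                      # terminal state is absorbing
    return P, R

def q_value_iteration(P, R, gamma, n_iters=200):
    Q = np.zeros(P.shape[:2])
    for _ in range(n_iters):
        V = Q.max(axis=1)  # "thereafter act optimally"
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
    return Q

P, R = make_patience_mdp()
for gamma in (0.3, 0.9):
    Q = q_value_iteration(P, R, gamma)
    greedy = Q.argmax(axis=1)  # \pi*(s) = argmax_a Q*(s, a)
    print(gamma, Q[0], greedy[0])
```

In this chain, waiting is worth gamma^2 * 5 from state 0, so the choice flips once gamma^2 * 5 exceeds 1.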

[40-50] Nice introduction to policy evaluation, where we see that \pi(s) gives us the next action as defined by the policy, so the max over actions drops out and we're now working with a linear system. We can add an argmax over a to the evaluated values to get an improvement step, as this picks the best a we could select now, with all future values coming from the policy we just evaluated. And finally, we can evaluate iteratively or solve the system of linear equations directly (woot woot linear algebra). Oh, and a nice example of the geometric series (with the gamma coefficient).
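
Here is a sketch of both evaluation routes plus one improvement step, assuming a two-state MDP I made up for the purpose (self-loop rewards of 1 and 3, gamma = 0.5). The function names are hypothetical. A policy that collects reward 1 forever is worth 1 + 0.5 + 0.25 + ... = 1 / (1 - 0.5) = 2, which is the geometric series showing up.

```python
import numpy as np

def policy_matrices(P, R, policy):
    """Fix the action in each state to policy[s], leaving a Markov chain."""
    idx = np.arange(P.shape[0])
    P_pi = P[idx, policy]                        # P_pi[s, s2]
    r_pi = (P_pi * R[idx, policy]).sum(axis=1)   # expected one-step reward
    return P_pi, r_pi

def evaluate_exact(P, R, policy, gamma):
    # Solve the linear system (I - gamma * P_pi) V = r_pi
    P_pi, r_pi = policy_matrices(P, R, policy)
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

def evaluate_iterative(P, R, policy, gamma, n_iters=500):
    # Repeated backups without a max: V <- r_pi + gamma * P_pi V
    P_pi, r_pi = policy_matrices(P, R, policy)
    V = np.zeros(len(r_pi))
    for _ in range(n_iters):
        V = r_pi + gamma * P_pi @ V
    return V

def improve(P, R, V, gamma):
    # One-step lookahead against V, then pick the argmax action per state
    Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
    return Q.argmax(axis=1)

# Hypothetical MDP: in state 0, action 0 loops for +1, action 1 jumps to state 1;
# state 1 loops for +3 under either action.
P = np.zeros((2, 2, 2)); R = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0; R[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0; R[1, :, 1] = 3.0

gamma = 0.5
policy = np.array([0, 0])
V_exact = evaluate_exact(P, R, policy, gamma)
V_iter = evaluate_iterative(P, R, policy, gamma)
better = improve(P, R, V_exact, gamma)
print(V_exact, V_iter, better)
```

The improvement step switches state 0 to the jump action, because 0.5 * V(1) beats 1 + 0.5 * V(0) under the evaluated values.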

[50-60] A proof sketch for why policy iteration converges to the optimal policy.
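
Not the proof itself, but a numeric sanity check I find helpful: a minimal policy iteration loop that records the value function after every evaluation, so you can see that values never decrease from one policy to the next and that the loop stops once the greedy policy stops changing. The three-state MDP, gamma = 0.9, and the name `policy_iteration` are my own assumptions.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=1000):
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)  # start with action 0 everywhere
    history = []
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly
        P_pi = P[idx, policy]
        r_pi = (P_pi * R[idx, policy]).sum(axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        history.append(V)
        # Policy improvement: greedy one-step lookahead against V
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break  # greedy policy is unchanged, so we are done
        policy = new_policy
    return policy, V, history

# Hypothetical chain: action 0 loops in place (rewards 1, 2, 4 in states 0, 1, 2),
# action 1 moves one state to the right for no reward; state 2 always loops.
P = np.zeros((3, 2, 3)); R = np.zeros((3, 2, 3))
P[0, 0, 0] = 1.0; R[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0; R[1, 0, 1] = 2.0
P[1, 1, 2] = 1.0
P[2, :, 2] = 1.0; R[2, :, 2] = 4.0

policy, V, history = policy_iteration(P, R, gamma=0.9)
monotone = all(np.all(b >= a - 1e-12) for a, b in zip(history, history[1:]))
print(policy, V, len(history), monotone)
```

Since there are only finitely many deterministic policies and each improvement step cannot make any state worse, a loop like this has to stop; the sketch only illustrates that on one small example.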