[0-10] Vlad Mnih frames the problem nicely about the moving target we get when when we use neural networks to generate our Q-function. Essentially, the generalization of states in neural networks makes the targets unstable. He proposes a clever solution, which is to create an experience replay buffer, which will, in a sense, create a steady data set that can be sampled from. This will break correlations we would have got in our online gradient updates, delivering us a "steadier learning signal". In general, in approaching their 2015 work on the Atari control problem they tried to see how they could frame the problem as a supervised learning problem.
[10-20] After a series of questions that I could not hear the questions to, Vlad switched gears to extol the virtues of target networks, sharing insight into how they care important or stability. This is particularly important when we might project our increased Q-value from a previous state onto an unsuspecting state that looks similar, yet over time we may be over-increasing.
[20-30] A walk through the DQN algorithm, which I've pasted below, and some really memorable words about optimization. "Optimization really matters in RL, because how you update your neural network determines which actions you will take, which determines which data you will see. Optimization algorithms affect the dynamics of your agent." To this end, he gave a shout out to RMSProp and Adam, which he and his colleagues have found to be preferable to SGD in many cases.
[30-40] Introduces the 49 Atari games they trained on, where they were mapping pixels to Q-values (leading to actions). Convolutional neural networks provided the mapping. In answering a question, here shared this is not a MDP, because we would have to define all the states, and that is not what's happening. I'm including a frame of the architecture below. Scores were best with experience replay and target network stabilization.
[40-50] Visual exploration of their simulations, with DQN working best on the reactive games. Particular attention placed to the ability to sacrifice immediate rewards for long-term gain.
[50-60] DQN is an adaptation of neural fitted Q iteration (NFQ), where we just train one CNN (as they are expensive to train) and we simulate the fixed network with the experience replay (that we sample from) and the target network. Vlad also shares the improved algorithms that have come out since their paper, including Double DQN, which balances performance across two networks (online and target network), as well as Prioritized Experience Replay, which replays transitions in proportion to the absolute Bellman error -- essentially prioritizing "states in which we're currently more wrong". And then finally, dueling DQN. This requires changing the architecture of the CNN to track advantage (q values minus state values), which makes it easier to "separate the values of separate actions".
Vlad also mentioned that you can increase exploration by adding noise to the parameters, which makes intuitive sense.