Deep RL Bootcamp: 3 by Siobhán Cronin

Deep RL Bootcamp Lecture 3: Deep Q-Networks

[0-10] Vlad Mnih frames the problem nicely: when we use neural networks to represent our Q-function, we get a moving target. Essentially, the generalization across states in neural networks makes the targets unstable. He proposes a clever solution, which is to create an experience replay buffer, which will, in a sense, create a steady dataset that can be sampled from. This breaks the correlations we would have gotten in our online gradient updates, delivering a "steadier learning signal". In general, in approaching their 2015 work on the Atari control problem, they tried to see how they could frame the problem as a supervised learning problem. 
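A minimal sketch of such a buffer (the structure and names here are my own, not from the lecture):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

# Toy usage: capacity 3, so only the last 3 of 5 transitions survive
buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add((t, 0, 1.0, t + 1, False))
batch = buf.sample(2)
```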

[10-20] After a series of questions that I could not hear, Vlad switched gears to extol the virtues of target networks, sharing insight into how they are important for stability. This is particularly important because we might project our increased Q-value from a previous state onto an unsuspecting state that looks similar, and over time we may be over-increasing. 
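A sketch of how the target network enters the update (my own toy framing, with plain lists of Q-values standing in for network outputs):

```python
def td_target(reward, next_q_target, gamma=0.99, done=False):
    # Bootstrap from the *target* network's estimate, not the online one,
    # so the regression target stays fixed between syncs
    bootstrap = 0.0 if done else max(next_q_target)
    return reward + gamma * bootstrap

def sync_target(online_params):
    # Every C steps, copy the online weights into the frozen target network
    return list(online_params)
```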

[20-30] A walk through the DQN algorithm, which I've pasted below, and some really memorable words about optimization. "Optimization really matters in RL, because how you update your neural network determines which actions you will take, which determines which data you will see. Optimization algorithms affect the dynamics of your agent." To this end, he gave a shout out to RMSProp and Adam, which he and his colleagues have found to be preferable to SGD in many cases. 
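In case the pasted image doesn't survive every format, here is the algorithm reconstructed from memory as rough pseudocode (details like the sync period C vary by implementation):

```
initialize replay buffer D, online network Q_theta, target network Q_theta_minus = Q_theta
for each environment step:
    with probability epsilon take a random action, else a = argmax_a Q_theta(s, a)
    execute a; observe r, s'; store (s, a, r, s', done) in D
    sample a random minibatch of transitions from D
    y = r + gamma * max_a' Q_theta_minus(s', a')    # y = r at terminal states
    take a gradient step on (y - Q_theta(s, a))^2 w.r.t. theta   # e.g. RMSProp or Adam
    every C steps: theta_minus = theta
```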


[30-40] Introduces the 49 Atari games they trained on, where they mapped pixels to Q-values (leading to actions). Convolutional neural networks provided the mapping. In answering a question, he shared that this is not an MDP, because we would have to define all the states, and that is not what's happening. I'm including a frame of the architecture below. Scores were best with experience replay and target network stabilization.


[40-50] Visual exploration of their simulations, with DQN working best on the reactive games. Particular attention was placed on the ability to sacrifice immediate rewards for long-term gain. 

[50-60] DQN is an adaptation of neural fitted Q iteration (NFQ), where we train just one CNN (as they are expensive to train) and simulate the fixed network with the experience replay (that we sample from) and the target network. Vlad also shared the improved algorithms that have come out since their paper, including Double DQN, which splits the work across two networks (the online and target networks), as well as Prioritized Experience Replay, which replays transitions in proportion to the absolute Bellman error -- essentially prioritizing "states in which we're currently more wrong". And then finally, dueling DQN, which requires changing the architecture of the CNN to track advantage (Q-values minus state values), making it easier to "separate the values of separate actions". 
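To make the Double DQN change concrete, here is a toy comparison of the two targets (lists of Q-values are my stand-in for network outputs):

```python
def dqn_target(r, next_q_target, gamma=0.99):
    # Standard DQN: the target net both selects and evaluates the next action,
    # which tends to overestimate Q-values
    return r + gamma * max(next_q_target)

def double_dqn_target(r, next_q_online, next_q_target, gamma=0.99):
    # Double DQN: the online net selects the action...
    a_star = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...and the target net evaluates it
    return r + gamma * next_q_target[a_star]
```

With disagreeing networks the two targets differ: if the online net prefers action 0 but the target net scores action 1 higher, standard DQN takes the optimistic value while Double DQN takes the evaluated one.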

Vlad also mentioned that you can increase exploration by adding noise to the parameters, which makes intuitive sense. 

[REVIEW] Grid Long Short-Term Memory by Siobhán Cronin


Grid Long Short-Term Memory by Nal Kalchbrenner, Ivo Danihelka, & Alex Graves. ICLR 2016. 

I learned about this paper from this blog post by Christopher Boulez, which I found when trying to better understand how an LSTM might be used to find the parity of a string (generalized XOR).

The paper gives an excellent overview of how LSTM cells can be arranged in a multidimensional grid that can be applied to vector spaces. I'll return to that in a later post, but I wanted to focus on the appendix, which dives into the parity problem. The paper reports that a 1-LSTM network can "learn to compute parity for up to 250 input bits". To achieve this, "the k-bit string is given to the neural network as a whole through a single projection; considering one bit at a time and remembering the previous partial result in a recurrent or multi-step architecture". This shifts the problem from learning k-bit parity to simply learning 2-bit parity (phew!). 
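The sequential reduction is just iterated XOR; a tiny sketch (my own, not from the paper):

```python
from functools import reduce

def parity(bits):
    # k-bit parity reduces to repeated 2-bit parity (XOR):
    # carry the partial result forward one bit at a time
    return reduce(lambda acc, b: acc ^ b, bits, 0)
```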

A list of hyperparameters is given, which I'll use to implement it, so stay tuned!

Deep RL Bootcamp: 2 by Siobhán Cronin

Deep RL Bootcamp Lecture 2: Sampling-based Approximations and Function Fitting

[0-10] So...with Q-learning we have made two big assumptions: 1) we know the underlying dynamics that give us P(s' | s, a), and 2) the state and action spaces can't be too large, because we need to store the probabilities to be able to look them up. We may not always be able to meet these assumptions, so enter sampling-based approximation (starting with tabular Q-learning), where we sample transitions and keep a running average as our Q-value for a given (s, a).
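A minimal sketch of the tabular update (names and structure are mine, not the lecture's):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Move Q(s, a) toward the sampled target r + gamma * max_a' Q(s', a'),
    # keeping a running average instead of needing P(s' | s, a)
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]

# Toy two-state table
Q = {"s": {"a": 0.0}, "s2": {"a": 1.0}}
new_q = q_update(Q, "s", "a", r=1.0, s_next="s2", alpha=0.5, gamma=1.0)
```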

[10-20] Diving into model-free methods. I LOVED that he used the term "temperature" to describe the shift in weighting that we see in something like a simulated annealing algorithm. I'm starting to see that a lot in RL (especially with the discount function), and I like this somatic layer of temperature. In this case, we are seeking to balance explore-exploit with an \epsilon-greedy algorithm.
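An \epsilon-greedy selector is only a few lines; here's a sketch (my own naming):

```python
import random

def epsilon_greedy(q_values, epsilon):
    # Explore with probability epsilon, otherwise exploit the greedy action
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Annealing the "temperature" then amounts to decaying epsilon over time, shifting weight from exploration to exploitation.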

[20-30] A brief introduction to on-policy learning (where sampling is done based on a policy, and those samples are used to update that policy) vs. off-policy learning, where we sample based on one policy but are ultimately updating (learning about) another policy. The requirements for Q-learning to converge are also given: all states and actions are visited infinitely often, the sum of the learning rates diverges, and the sum of the squared learning rates is finite. Exploration is a core RL challenge, and there may be situations where even epsilon-greedy doesn't converge in a reasonable amount of time. 

[30-40] As long as there is an optimal path pursued towards a rewarded state, we won't need to propagate negative rewards back in time (this was shown in a gridworld example). At some point we converge to the Q-values of each state, so additional iterations won't change their values. 

[40-50] Briefly touched upon Temporal Difference Learning as a way to take samples, and not just expected values, from the policy evaluation. But the main purpose of this section was to drive the story towards deep RL, where we will use neural networks to approximate Q-values without needing to hand-engineer features. To step in this direction, we saw how tabular Q-learning is a special case of approximate Q-learning. 

Deep RL Bootcamp: 1 by Siobhán Cronin

Deep RL Bootcamp Lecture 1: Motivation + Overview + Exact Solution Methods

[0 - 10] Starts off with a review of MDPs from Barto, and then sets up a simple grid world with an example of a deterministic policy with infinite horizon. 

[10-20] OMG, talk gets a bit derailed by series of rando questions that don't seem to be clarifying thinking (answerable by the definition of a deterministic policy). Maybe too much coffee? 

[20-30] Nice introduction to value iteration by showing how, in a stochastic model, it is safer to stay put if it is unlikely movement will yield a successful return. Also, there was a nice set up for the value iteration converging, as additional time will not help an optimal path towards an exit condition once it's been reached. 

[30-40] This helped me solidify why a low gamma would lead us to prioritize immediate rewards, while a high gamma leads us to hold out for larger rewards in future steps. He mentioned how Q* implicitly encodes the best policy, so it is a combo of the V* and \pi* from value iteration (because Q is defined as "and thereafter act optimally", which will require the best policy). 

[40-50] Nice introduction to policy evaluation, where we see that \pi(a) gives us the next action as defined by the policy, so we're now working in a linear system. We can add an argmax(a) to the evaluation to get an improvement, as this will be the best a we could select where all future values will be generated optimally. And finally, we can solve iteratively or solve the system of linear equations (woot woot linear algebra). Oh, and a nice example of the geometric series (the gamma coefficient).

[50-60] A proof sketch for why optimal convergence is reached with policy iteration. 


Embodying MDPs by Siobhán Cronin

When I had my cognitive revival in 2016, I thought I was closing the door on the period of my life where my ideation would be rooted in somatic understanding of abstract concepts. While leading me to depths of empathy and artistic practice, the primacy of my embodied mode of thinking had prevented me from accessing other forms of abstraction and meaning-making. And then, all of a sudden, I had access to number, and could relate to the immaterial concepts of set theory, graph theory, network theory, computer science, calculus, and linear algebra. And yet...

My mind processes abstract concepts in concert with the nerves that run through my muscles, this rich estuary of embodied experience. Is my mind looking for data from my lived history to validate the abstract concept? Lately I have noticed this most when falling asleep, my mind flashing with snippets of the theorems and concepts I'm studying, and my body twitching in a wild fireworks display. On evenings of days when I've studied many new ideas, the twitching is so active my partner eventually gives up and moves to the couch. 

The twitching continues throughout the night, as my dreams unfold possibilities and dead ends. Sometimes the dead ends are benign, and sometimes they jolt me awake, such as last night when a figure in my dream leaned in and whispered 'Java is a prime number'. To a mind still forming around what is possible within computer science and math, such a concept equates to a nightmare.

But last night was dominated by a single thread of inquiry, as I begin to understand the RL and personal implications of Hindsight Experience Replay (HER), and that is how does my mind overlay reward functions and policies in the continuous time-space of my life? How does my body understand Markov Decision Processes (MDPs)? When have I seen individual layers, or layer clusters, exposed in meditation? When did policies that served my early development die off?

Perhaps the biggest gift of all I was given in my exile and in my long apprenticeship in the somatic research tribe is the ability to ask questions as research prompts, not as urgent requests for answers. And, so I will continue to ask. 

REVIEW: Reinforcement Learning by Siobhán Cronin

Reinforcement Learning: An Introduction (1998). Richard Sutton & Andrew Barto. 

At the onset, I'm curious to know how much has changed since this book's publication 20 years ago. That being said, as these are two leaders in the field, I'm interested in gaining a sense of their perspective on the history/origin of this subfield, and acclimating to some of the core concepts/constructs. 

Chapter 1

One of the key takeaways from this chapter was the distinction between the value function and the reward function in an RL problem. The value function is ultimately what we are optimizing for, as this is the total cumulative reward, or rather the reward associated with solving the problem at the highest level. The reward function, by contrast, evaluates rewards state by state, and can be thought of as the short-term payoff for entering any given state. I like this distinction, because it opens up the idea of scoping the problem, and ensuring you are indeed building a system that achieves what you set out to achieve. That sounds straightforward enough, but given the nuances of building a policy to reinforce even the simplest of goals, the balance of these functions will undoubtedly become a focus of my work. These are the functions we are learning as we build our RL policy. 

There is a treatment of the agent and the environment, as well as a policy and model of the environment. The agent learns while interacting with the environment, which the authors point out is different from evolutionary methods. I'm curious about this distinction, as it is pointed out that genetic algorithms (and perhaps swarm algorithms?) can be used to solve RL problems. I'm putting a map together of how these various approaches relate. 

The distinction between greedy and exploratory moves, and this will come up again in our examination of explore/exploit tradeoffs. This touches upon the balance of reward vs. value function optimization, and also the broader leverage of stochastic process in search problems. The term temporal-difference learning is used, which doesn't seem like much at first blush, but relates in my mind to calculus (derivative, gradient, etc.) - measuring a rate of change between two times (states in these problems).

Chapter 2

n-armed bandit problems are introduced, where the choice to exploit prior knowledge of known advantageous probabilities is pitted against the choice of probabilities that could be advantageous, yet whose probabilities are unknown or more variable. This uncertainty is important. One of the choices with greater variability could in fact be better than the greedy choice in the long term. How would we make this choice? I'm reminded of simulated annealing. Perhaps policies that take on more risk early on, and then play more conservatively later on?

Incremental implementation involves updating our estimates as we go, so we always have an up-to-now estimate to reference. 
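In code, the incremental update is just (my own sketch):

```python
def incremental_mean(q_k, k, reward):
    # New estimate = old estimate + step-size * (target - old estimate),
    # where the step size 1/(k+1) makes this an exact running average
    return q_k + (reward - q_k) / (k + 1)

# Running average of rewards [2, 4, 6] without storing them all
q = 0.0
for k, r in enumerate([2.0, 4.0, 6.0]):
    q = incremental_mean(q, k, r)
```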

  • Nonstationary problem tracking can be addressed with an exponential recency-weighted average. The formula presented takes an alpha constant, with weights on past rewards decaying exponentially as (1 - alpha)^k over k time steps. 
  • Optimistic initial values were introduced as method for encouraging exploration (the agent is continually disappointed with evaluations at each time step, as they are never as good as what was initially observed). 
  • Reinforcement comparison is raised in the context of an agent not knowing the relative size of a reward (is the reward received average? above average? below?). In this model, preference probabilities are increased by an increment calculated from the difference of the observed reward and the reference reward (often the average of previously received rewards). 
  • Pursuit methods update preference probabilities of each action so as to increase the likelihood of the greedy option (and decrease the other actions). 

A brief introduction to associative search, in which the n-armed bandit problem is presented in a context where the situation changes play to play, and the agent can learn to associate policies with different situations. 

Interval estimation methods are also introduced, which estimate a confidence interval for the value of each action. The action selected will be the one with the highest upper limit. Bayes optimization is also introduced, with updated posterior probability distributions of action values resulting from any action. I'm curious to see what has grown out of this direction, as posterior probability estimation has seen many recent advances. 


I keep finding myself circling back to this idea of where we draw the line between agent and environment. Rather than settling on a specific boundary, can we maintain multiple boundaries at once and allow these to interact (like we do in multilayer PSO, where each region is the parent region for a system of subregions)?

The crux of this chapter can be summed up by the Bellman optimality equation for optimal state-value function. When I read through it, I had the same question that this poster on Stack did. I found Sean Easter's answer helpful, and the take home intuition is that we are replacing the general case with what happens in the next time step and all steps after that. I hope to walk through this more slowly with some proofwriting in the coming weeks. I also found this article helpful. 
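From memory (so double-check against the book), the equation in the 1998 edition's notation is:

```latex
V^{*}(s) = \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma \, V^{*}(s') \right]
```

The max over actions picks the best immediate choice, and the bracketed term is exactly "one step of reward plus the discounted value of everything after".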

Beyond that, I found these statements in the summary to be helpful in ... summarizing!

  • A policy is a stochastic rule by which the agent selects actions as a function of states
  • The agent's objective is to maximize the amount of reward it receives over time
  • The discounted formulation is appropriate for continuous tasks (discount rates determine the present value of future rewards)
  • An environment satisfies the Markov property if its state signal compactly summarizes the past without degrading the ability to predict the future
  • The Bellman optimality equations are special consistency conditions that the optimal value functions must satisfy
  • RL adds to Markov Decision Processes a focus on approximation and incomplete information for realistically large problems

Chapter 4

This chapter focuses on Dynamic Programming, and finds its anchor in generalized policy iteration (GPI). With GPI, we interleave policy evaluation with policy improvement, which is a common approach for solving RL problems. Policy evaluation involves making the value function consistent with the current policy. Policy improvement involves making the policy greedy with respect to the value function. And then we iterate. 

Another key topic raised in this chapter is policy iteration vs. value iteration. For policy iteration, we use the current policy to find a better policy using V_current_policy, and then calculate V on the new policy to yield a better policy (oh how I wish Squarespace rendered Latex. I know, I know, time to actually build out a better blog). One issue with this approach is that convergence only happens in the limit. 

Value iteration, on the other hand, truncates policy evaluation to a single sweep through the space (all states backed up once). A full backup involves replacing the old value of a state with a value computed from the rewards and old values of the successor states, over the one-step transitions possible under the policy we are evaluating. All possible next states are backed up, not just a sample, which is what makes it a full backup. 
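A toy full-backup sweep might look like this (the transition encoding, as lists of (probability, next state, reward) triples, is my own assumption):

```python
def value_iteration_sweep(V, states, actions, transitions, gamma=0.9):
    """One value-iteration sweep: every state gets one full backup over
    ALL possible successors (not just a sample), with a max over actions."""
    new_V = {}
    for s in states:
        new_V[s] = max(
            sum(p * (r + gamma * V[s2]) for (p, s2, r) in transitions[s][a])
            for a in actions
        )
    return new_V

# Two-state toy MDP: s0 -> s1 with reward 1, s1 self-loops with reward 0
states = ["s0", "s1"]
actions = ["a"]
transitions = {"s0": {"a": [(1.0, "s1", 1.0)]},
               "s1": {"a": [(1.0, "s1", 0.0)]}}
V = value_iteration_sweep({"s0": 0.0, "s1": 0.0}, states, actions,
                          transitions, gamma=0.5)
```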

"Dynamic programming algorithms are obtained by turning Bellman equations ... into update rules for improving approximations of the desired value functions."

Chapter 5

This chapter dives into Monte Carlo methods for estimating value functions using only sample sequences of states, actions, and rewards. I found it useful to distinguish on-line learning from simulated learning, as this sets the stage for our building. On-line learning requires no prior knowledge of the environment, while simulated learning requires a model to generate sample transitions. Rather than use a model to compute the value of each state, they simply average many returns that start in that state. 

Monte Carlo methods are incremental episode by episode (not step by step). Value estimates and policy changes are made at the end of the episode, and it is assumed that all episodes terminate no matter what actions are taken. The book covers methods that average over complete returns. 

I lost the scent of the book when we hit on-policy and off-policy Monte Carlo, and had to go find some other resources to fill in the gaps. Essentially, off-policy has us applying evaluations from one policy to learn about another, while on-policy has us updating the original policy. 


At long last, we've arrived at Q-learning (which is where most lectures on Youtube begin). Was it worth the wait? We shall see. 

Temporal difference learning (TD) is introduced as the combination of Monte Carlo and DP, so let's see how this fusion behaves. Differences are explored. Whereas Monte Carlo must wait until the end of the episode to determine the V value (as that is when the return is known), TD can make this update at the next time step, using the observed reward and the V estimate at t+1. TD updates are sample backups, as they are referencing the value of a successive state to compute a backed-up value, and then using it to change the value of the original state. This differs from the full backups of DP, where we take into account the complete distribution for all possible successors. 
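The TD(0) sample backup in a few lines (my own sketch):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # Sample backup: bootstrap from the current estimate of the one
    # successor we actually observed, rather than all possible successors
    td_error = r + gamma * V[s_next] - V[s]
    V[s] = V[s] + alpha * td_error
    return V[s]

V = {"s": 0.0, "s2": 1.0}
v = td0_update(V, "s", r=1.0, s_next="s2", alpha=0.5, gamma=1.0)
```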


Priming the canvas by Karl Cronin

I am starting to learn more about deep reinforcement learning, and am interested in seeing how swarm intelligence might stitch up with RL policy algorithms. I began studying particle swarm optimization (PSO) last fall, and have been looking for applied contexts to test out the myriad variations I've encountered. My hunch is that tuning deep learning systems could be a fruitful area to explore, yet I'll need to poke around a bit and see.

To prime the canvas, I'm kicking things off with PSO Baselines, a repo where I'm sussing out the differences between various PSO variations by implementing some papers. My hope is that by implementing a range of PSOs, I'll start to get a sense of the various dials researchers are turning, and perhaps in time I'll get some hint of an idea as to where I might be able to contribute some fresh thinking. 

This morning I hit an interesting snag, which prompted me to stop and consider the architecture of what I'm writing. In the back of my mind, I've been hoping to implement a few multi-objective PSO papers to see if there might be some universal multi-objective class I could write and add to PySwarms. I mentioned to LJ (PySwarms' lead developer) that I would attempt to do this, and perhaps I still can. The snag is that each PSO algorithm I pick up differs from the others in ways that require different base structures, so it is not as straightforward as I originally thought to build out a generic multi-objective class.

My strategy was to start with a functional approach, but there are cases where it seems it would be easier to have objects (for instance, in local best, where we want to compare each particle to all other particles to determine neighbors, and then update each particle - this challenges the initial list-of-lists data structure in my functional approach). Since several of the algorithms build on local best, it seems I need to tackle this directly up front, rather than skirting around the issue by implementing global best variations.

Alignment for Advanced ML Systems by Siobhán Cronin

Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. Machine Intelligence Research Institute. 

Alignment in this context means making sure agents arrive at and optimize objective functions that are in the spirit of what was intended; that is, that goals are reached while making sure no one gets hurt. One of the key takeaways from this overview is that our solutions must scale with intelligence, so for any new discovery, how long will it "hold" in lockstep with advances in intelligence?

This report is divided into eight research topics:

  • Inductive ambiguity identification
  • Robust human imitation 
  • Informed oversight
  • Generalizable environmental goals 
  • Conservative concepts
  • Impact measures
  • Mild optimization
  • Averting instrumental incentives

I'm interested in robust human imitation, and how we can expand this research direction into non-human agents and systems. I question the "trusted human" as a safety benchmark, and was happy to see Evans et al cited, along with the assertion that "in reality humans are often irrational, ill-informed, incompetent, and immoral". On a planet with an uncountable number of species, shouldn't we have some other benchmark behavioral systems on the advisory council?

In a similar vein, I'm interested in generalizable environmental goals, once again, considering possibilities we may have overlooked due to our anthropic bias. This makes me think of sensory illusions and artifacts, and how these can change as sensors change. The paper suggests more elaborate sensor systems, and this is where I come back to what I'm dreaming up with ocotohand - thinking about the kinetic possibilities of eight spiraling appendages.

Also, so I don't forget, I liked the reference to Everitt & Hutter (2016) research on data being used to guide an agent toward a utility function, not to be measured for success directly. 

MLSS Cadiz: John Schulman Talks by Siobhán Cronin

Link to slides and labs

I found these talks to be super straightforward and helpful. A breath of fresh air. 

Part 1 Highlights:

A brief overview of applications, including robotics, inventory management, resource allocation (queuing), and routing problems (all sequential decision-making problems). 

Differentiating between policy optimization and dynamic programming. In particular, policy optimization includes DFO/evolutionary algorithms (derivative-free) and policy gradients (which use gradients and improve with more parameters). Dynamic programming requires discrete finite states, and so must be approximated (for instance, approximating the function with neural nets). Actor-critic methods lie at the intersection, as they use value functions. 

Deep RL is concerned with control and decision making (compared to motor and frontal cortex - perhaps basal ganglia as well?). "At any time in the problem you're solving an optimization problem with GD", using non-linear function approximators that don't make assumptions about the form of the functions themselves (whether linear or not). 

Markov Decision Processes (MDP) are defined by (S, A, P), with state space (S), action space (A), and the transition probability distribution (P(r, s' | s, a), with reward (r) and next state (s') given by current state (s) and action (a)). One might also define the initial state distribution (\mu) or a discount factor (\gamma) that indicates the degree to which you care less about stuff in the future than the present. 

One can define different settings, including episodic, where you sample an initial state from \mu and proceed until a terminal state (defined in the state space) is reached. This termination could be good (ex. taxi reaches its destination), bad (ex. robot falls over), or defined by a fixed length (episode ends after 1000 cycles). In this episodic setting we aim to maximize reward per episode. 

We can have deterministic policies (functions) or stochastic policies (probability distribution of action given state), and these are what we are optimizing. We can also have parameterized policies where we are optimizing a parameter vector associated with the policy. 

The cross-entropy method is an example of an evolutionary algorithm that, at every iteration, maintains a distribution over parameter vectors. It only views the input parameters (\theta) and their reward, and views the rest of the MDP as a black box. These work, as he said, "embarrassingly" well (I think because we are embarrassed that simple works better than complicated?). Implemented here:
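Independently of whatever implementation was linked above, here is a minimal 1-D sketch of the method (all names and constants are my own assumptions):

```python
import random
import statistics

def cross_entropy_method(f, mu=0.0, sigma=5.0, pop=50, elite_frac=0.2, iters=30):
    """Treat f as a black box over a 1-D parameter theta:
    sample a population, keep the elites, refit the sampling distribution."""
    n_elite = int(pop * elite_frac)
    for _ in range(iters):
        thetas = [random.gauss(mu, sigma) for _ in range(pop)]
        elites = sorted(thetas, key=f, reverse=True)[:n_elite]
        mu = statistics.mean(elites)
        sigma = statistics.stdev(elites) + 1e-3  # keep a little exploration
    return mu

# Toy objective with its maximum at theta = 3
best = cross_entropy_method(lambda th: -(th - 3.0) ** 2)
```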

Part 2 Highlights:

We want to calculate the gradient, but we don't know f(x), so we'll gather a lot of samples of x and use them to estimate the gradient. Two estimation strategies are explored, score function gradient estimator and importance sampling. This is a valid approach even if f(x) is discontinuous, so it's very flexible. 
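A quick sketch of the score function estimator for a Gaussian sampling distribution (my own toy: for a Gaussian, grad log p(x; mu) = (x - mu) / sigma^2):

```python
import random

def score_function_gradient(f, mu, sigma=1.0, n=100_000):
    """Estimate d/d_mu of E_{x ~ N(mu, sigma)}[f(x)] without ever
    differentiating f itself -- only log p is differentiated, so f
    may be discontinuous."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(mu, sigma)
        total += f(x) * (x - mu) / sigma ** 2
    return total / n

# For f(x) = x, the true gradient d/d_mu E[x] = 1
grad = score_function_gradient(lambda x: x, mu=0.0)
```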



Proximal Policy Optimization (PPO) by Siobhán Cronin

As I continue spelunking into the PSO cave in hopes of finding an algorithm design strain of gold I can follow, I'm getting more and more excited about the RL research at OpenAI. In particular, I perked my ears up when I read that Proximal Policy Optimization became the default RL algorithm at OpenAI last summer. Let's find out why.

First off, what is PPO?

The stage is set by considering policy gradient methods (PG) for using deep neural nets in control problems. Andrej Karpathy has a great write-up on how these came about, and I'll write a review of that separately. The key point is that we stochastically sample actions, and actions that happen to lead to good outcomes are encouraged in the future, while actions that lead to bad outcomes are discouraged. Like all stochastic algorithms, results depend on how we proceed through the space. 



Can we be neighbors? by Siobhán Cronin

When I interviewed for my recent gig with Uncountable, I was asked to talk about something I was working on. Given my now public swarm intelligence obsession, I talked about particle swarms.

I'm sure my eyes lit up as I talked about the stochastic initial positions, the velocity calculation, the difference between local and global best in the original algorithm, the position updating, and how convergence to optima compares to other stochastic search methods. But I was asked a question that stumped me that I resolved to dedicate some time to - in local best, when calculating neighbors, what would govern our choice of one distance measurement over another?

Here are four contenders for measuring the nearness of any two points in an n-dimensional real vector space with fixed cartesian coordinates:

Euclidean distance: As the name suggests, this is the square root of the sum of squares for each corresponding input pair of our points. 

Manhattan distance: The sum of the lengths of the projections of the line segment between the points onto the coordinate axes. Or, how many "city blocks" (i.e. coordinate units) lie between the two points. 

Hamming distance: The minimum number of substitutions required to change one vector into another. Simply tally up how many input position pairs differ.

Minkowski distance: This generalizes the Euclidean and Manhattan distance metrics with a tunable exponent, recovering the Manhattan distance (exponent of 1 on the differences, with a 1/1 exponent to norm the sum) or the Euclidean distance (exponent of 2 on the differences, with a 1/2 exponent to norm the sum). So you can toggle between the two to your heart's content. Would we ever want higher exponents?
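All four are only a few lines of code; a sketch (my own naming):

```python
def minkowski(u, v, p):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def hamming(u, v):
    # Count the coordinate pairs that differ
    return sum(a != b for a, b in zip(u, v))
```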

This is all well and good, but what contexts would prompt us to use one distance measurement over another? In particular, given the context here is particle swarms, which should be used when implementing PSOs?

This work suggests we consider the dimensionality of our data, and make decisions accordingly. Is this being done in practice? There are so many parameters to tune with PSOs, is there a straightforward guide to making informed decisions for each? Hmm....I feel some sketchnoting coming on.