with the world
like the world
we have both fallen
"the chair is still the oak," she murmurs.
"as the glow and the shadow are the same"
and where should we stand to see?
these fireworks
this fountain of being
My brother is a professor of political science, and during a recent holiday we got to talking about how the word "truthiness" has burst onto the scene in recent years, and how it has gummed up the works of tried-and-true fact-checking practices. Perhaps what has troubled me most about hearing this word bandied about is that it seems to ignore the rich history of statistics - the hundreds of years of methods our non-partisan ancestors developed to help us appraise the likelihood of events.
As someone who has spent years in research labs, I know firsthand that the field of statistics (like its big sister mathematics) can seem like a vast ocean one couldn't even dream of circumnavigating in a lifetime. Anyone who claims to know everything there is to know about statistics is lying to you. So proposing that we apply tools from statistics to the problem of determining the "truthiness" of a statement (and can't we just say veracity?) may seem like we are moving the conversation out of the realm of public debate and onto a chalkboard in the basement at MIT. And yet, I don't believe we have to venture too far into statistics to surface some useful frameworks for this conversation.
Typically when we want to verify a statement, we test how likely it is to be true. Certainly this could be framed as, "how likely is it that this person would say this?", but what's most relevant is, "how likely is this statement true given the evidence at hand?". We are looking for a point, or a range, on the distribution of probabilities of the statement's veracity. If the evidence stacks up showing the statement is not very true, we'll get a number close to 0%. If the evidence stacks up showing the statement is very likely to be true, our number will be closer to 100%.
And there you have it. It seems our work is done. It's true we could go back to that MIT basement and hash out approaches for defining a bespoke probability distribution for the problem at hand, but for most political fact-checking we can use one of the go-to distributions out of the box (Gaussian, binomial, geometric, Poisson) and arrive at a suitable result. Furthermore, I think you would find that a chart telling you how probable an event is is exactly what you would arrive at yourself, if left alone to ponder a problem you believe is important. The mind bends towards making meaning, and that includes charting patterns of events.
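For instance, here is a minimal Bayesian sketch of that "evidence in, probability out" idea, assuming each piece of evidence is an independent supports/contradicts observation (the evidence list below is invented for illustration):

```python
# Toy sketch: update a Beta prior on a statement's probability of being
# true, one piece of supports/contradicts evidence at a time.
def beta_update(alpha, beta, supports):
    """One Bayesian update of a Beta(alpha, beta) belief."""
    return (alpha + 1, beta) if supports else (alpha, beta + 1)

alpha, beta = 1.0, 1.0                      # Beta(1, 1): start agnostic
evidence = [True, True, False, True, True]  # hypothetical fact-checks
for e in evidence:
    alpha, beta = beta_update(alpha, beta, e)

veracity = alpha / (alpha + beta)           # posterior mean, in [0, 1]
print(round(veracity, 3))                   # 0.714
```

Four supporting observations against one contradicting one lands us around 71% - a number we can argue about, which is the point.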
So if it's not statistical sophistication that we're stuck on, where's the challenge? How do we have conversations about truth when the very concept of truth (as verified by evidence) is called into question? And what do we scientists and mathematicians do in the meantime?
She made a sound with her mouth
like a horse
or an engine throttling
I heard her through the partition serving as a wall
through the nondescript office
past the window
out into the winter air where she stood with her mother, gearing up for a race.
I lay there holding impossibly still
while coaching my throat, anxiously holding a cough in its tubular arms like a bomb
"Maybe warm it up into a new composition?" I offered. "Something you can absorb?"
Hearing it spoken aloud in my inner language inspired me to give it a try myself.
As the little girl peeled a triumphant "VRRrroooomm",
the mother laughed,
and I dissolved into mist.
Swarm Reinforcement Learning Algorithms Based on a Particle Swarm Optimization. By Hitoshi Iima and Yasuaki Kuroe. IEEE 2008.
What they did
- Q-learning for individual learning; agents then exchange information with one another, taking on the best Q-learning results (which need to be evaluated in some way). They developed three updating methods.
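A rough reconstruction of the idea, not the paper's exact update rules: each agent runs its own Q-learning, then every Q-table is nudged toward the best-scoring agent's. The environment, scores, and blend factor below are all made up.

```python
# Toy swarm Q-learning sketch: individual Q-updates plus an exchange step.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update on a dict-of-dicts Q-table."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def exchange(agents, scores, blend=0.5):
    """Pull every agent's Q-table partway toward the best agent's."""
    best = agents[scores.index(max(scores))]
    for Q in agents:
        for s in Q:
            for a in Q[s]:
                Q[s][a] += blend * (best[s][a] - Q[s][a])
    return agents

states, actions = [0, 1], [0, 1]
agents = [{s: {a: 0.0 for a in actions} for s in states} for _ in range(3)]
q_update(agents[0], s=0, a=1, r=1.0, s_next=1)   # agent 0 sees a reward
agents = exchange(agents, scores=[1.0, 0.0, 0.0])
# The laggard agents now carry part of agent 0's learned value.
```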
Autonomous Agent Response Learning by a Multi-Species Particle Swarm Optimization. By Chi-kin Chow and Hung-tat Tsui. IEEE 2004.
- Autonomous agents adapt to their environment by adjusting their response behavior, which can be represented as a vector function of observations from the environment (p = R(o), where o is the observation vector).
- Continuous representations of response functions are more relevant for real-world dynamics than weight tuning with Reinforcement Learning and Hidden Markov Models.
- Agents extract their response from a tuned award function, A(o, r), and that extraction can be framed as a multi-objective optimization problem (that's where MPSOs come in!).
What they did
- Defined their response as a Gaussian Mixture Model
- Generate a set of (O-R) samples that produce an award value greater than or equal to the one defined in training.
- A Local Award Function (LAF) is defined as A_o(r) = A(o, r). This is a decomposed award function based on the observations of the O-R samples. "By optimizing the LAF set, the response of O-R samples can be determined".
- "... the response learning algorithm can be formulated as a multi-objective optimization problem in which the optima are correlated".
- With the responses from the optimized LAF set, a response network is generated by training the samples with a support vector machine.
Adam: A Method for Stochastic Optimization (2015). Diederik P. Kingma and Jimmy Lei Ba. Conference paper at ICLR 2015.
- This method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients.
- Combines AdaGrad (which works well with sparse gradients) and RMSProp (which works well in on-line and non-stationary settings).
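A minimal sketch of the update itself (β₁, β₂, and ε are the paper's defaults; the step size and the quadratic test function are chosen for this toy, not taken from the paper):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    m = b1 * m + (1 - b1) * grad        # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2   # second-moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)           # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.0.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_step(x, 2.0 * x, m, v, t)
# x ends up near the minimum at 0.
```

The per-parameter scaling by the second-moment estimate is what gives each parameter its own effective learning rate.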
Portfolio Allocation for Bayesian Optimization (2011). Matthew Hoffman, Eric Brochu, and Nando de Freitas.
Most of the literature seems to suggest UCB (or LCB) as the optimal acquisition function in many cases, yet this research proposes the use of a portfolio of acquisition functions governed by a multi-armed bandit strategy, as opposed to only using a single acquisition function.
- The authors suggest that there may be no single acquisition function that will perform best over an entire optimization.
- "This can be treated as a hierarchical multi-armed bandit problem, in which each of the N arms is itself an infinite-armed bandit problem".
- Three strategies are suggested, but Hedge is recommended. "Hedge is an algorithm which at each time step t selects an action i with probability p_t(i) based on the cumulative rewards (gain) for that action." A gain vector is then updated from these rewards.
- You can't necessarily compare convergence rates of the portfolio method directly to the single acquisition functions, since "decisions made at iteration t affect the state of the problem and the resulting rewards at all future iterations". The authors suggest an approach (Theorem 1) for setting bounds on the cumulative regret. These bounds are generated in relation to points proposed by UCB, and the authors suggest possible refinements to this theorem to take into account bounds of other acquisition functions in the portfolio.
- Are the improvements cited enough for us to notice a difference in applied cases?
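A toy illustration of the Hedge selection rule in the full-information setting the paper works in: every round each arm (acquisition function) receives a reward, and the next choice is sampled with p_t(i) proportional to exp(eta * cumulative gain of arm i). The arm names, rewards, and eta below are made up.

```python
import math

def hedge_probs(gains, eta=0.5):
    """Hedge selection probabilities from cumulative gains."""
    m = max(gains)  # subtract the max for numerical stability
    weights = [math.exp(eta * (g - m)) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]

arms = ["EI", "PI", "UCB"]
gains = [0.0, 0.0, 0.0]
rewards = [0.2, 0.1, 0.6]   # hypothetical per-round rewards for each arm

for _ in range(20):         # full information: every arm's gain grows
    gains = [g + r for g, r in zip(gains, rewards)]

p = hedge_probs(gains)
print(arms[p.index(max(p))])  # UCB
```

Because every arm's gain is observed each round, the probability mass concentrates on the best-performing acquisition function without the exploration machinery a true bandit setting would need.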
Taking the Human Out of the Loop: A Review of Bayesian Optimization (2016). Shahriari et al. Proceedings of the IEEE
- "Mathematically we are considering the problem of finding a global maximizer (or minimizer) of an unknown objective function f, where X is some design space of interest; ..."
- "...in global optimization, X is often a compact subset of R^d but the Bayesian optimization framework can be applied to more unusual search spaces that involve categorical or conditional inputs."
- "The Bayesian posterior represents our updated beliefs - given data - on the likely objective function we are optimizing. Equipped with this probabilistic model, we can sequentially induce acquisition functions that leverage the uncertainty in the posterior to guide exploration."
- "Intuitively, the acquisition function evaluates the utility of candidate points for the next evaluation of f; therefore x_n+1 is selected by maximizing \alpha_n"
- "The kernel trick allows us to specify an intuitive similarity between pairs of points, rather than a feature map, which in practice can be hard to define."
- Common kernels (Matérn)
- "The marginal likelihood is very useful in learning the hyperparameters. As long as the kernel is differentiable with respect to its hyperparameters, the marginal likelihood can be differentiated and can therefore be optimized."
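A toy illustration of the acquisition step described above: given a posterior mean mu(x) and standard deviation sigma(x) on a grid of candidates, the next query point x_{n+1} maximizes a UCB-style alpha(x) = mu(x) + kappa * sigma(x). The posterior values below are fabricated, not computed from data.

```python
def ucb(mu, sigma, kappa=2.0):
    """UCB-style acquisition: reward high mean and high uncertainty."""
    return [m + kappa * s for m, s in zip(mu, sigma)]

xs    = [0.0, 0.25, 0.5, 0.75, 1.0]    # candidate points
mu    = [0.1, 0.40, 0.3, 0.20, 0.00]   # fabricated posterior means
sigma = [0.0, 0.05, 0.2, 0.30, 0.45]   # fabricated posterior stds

alpha = ucb(mu, sigma)
x_next = xs[alpha.index(max(alpha))]
print(x_next)  # 1.0 - the most uncertain point wins despite its low mean
```

With kappa = 2, the far-right point's uncertainty outweighs the middle points' higher means, which is exactly the explore/exploit trade the quote is describing.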
Introduction to Gaussian Processes (1998) - David MacKay
- "From a Bayesian perspective, a choice of a neural network model can be viewed as defining a prior probability distribution over non-linear functions, and the neural network's learning process can be interpreted in terms of the posterior probability distribution over the unknown function. (Some learning algorithms search for the function with maximum posterior probability, and other Monte Carlo methods draw samples from this posterior probability)."
- "The idea of Gaussian process modeling is, without parameterizing y(x), to place a prior P(y(x)) directly on the space of functions. The simplest type of prior over functions is called a Gaussian process. It can be thought of as the generalization of the Gaussian distribution over a finite vector space to a function of infinite dimension."
- "Just as a Gaussian distribution is fully specified by its mean and covariance matrix, a Gaussian process is specified by a mean [often taken to be the zero function of x] and a covariance function [which expresses the expected covariance between the value of the function y at the points x and x']."
- "The actual function y(x) in any one data modeling problem is assumed to be a single sample from this Gaussian distribution."
- "...by concentrating on the joint probability distribution of the observed data and the quantities we wish to predict, it is possible to make predictions with resources that scale as polynomial functions of N, the number of data points."
- In nonparametric methods, predictions are obtained without giving the unknown function y(x) an explicit parameterization.
- An example of a nonparametric approach to regression is the spline smoothing method. In this case, the spline priors are Gaussian processes.
Multilayer Neural Networks and Gaussian Processes
- Neal showed that the properties of a neural network with one hidden layer converge to those of a Gaussian process as the number of hidden neurons tends to infinity, if the standard 'weight decay' priors are assumed.
- The covariance function of this Gaussian process depends on the details of the priors assumed for the weights in the network and the activation functions of the hidden units.
- DIRECT: "The most obvious implementation of these equations is to evaluate the inverse of the covariance matrix exactly. This can be done using a variety of methods such as Cholesky decomposition, LU decomposition or Gauss-Jordan. Having obtained the explicit inverse, we then apply it directly to the appropriate vectors."
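A minimal pure-Python sketch of that direct approach, assuming a squared-exponential kernel and a made-up three-point 1-D dataset: factor the covariance matrix with Cholesky, solve for alpha = K^{-1} y, and form the posterior mean at a test point.

```python
import math

def k(a, b, ell=1.0):
    """Squared-exponential kernel (an assumed choice for this sketch)."""
    return math.exp(-0.5 * ((a - b) / ell) ** 2)

def cholesky(K):
    """Lower-triangular L with L L^T = K."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = (math.sqrt(K[i][i] - s) if i == j
                       else (K[i][j] - s) / L[j][j])
    return L

def solve_lower(L, b):
    """Forward substitution: solve L z = b."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][j] * z[j] for j in range(i))) / L[i][i])
    return z

def solve_upper(L, z):
    """Back substitution: solve L^T a = z."""
    n = len(z)
    a = [0.0] * n
    for i in reversed(range(n)):
        a[i] = (z[i] - sum(L[j][i] * a[j] for j in range(i + 1, n))) / L[i][i]
    return a

# Tiny made-up dataset; jitter on the diagonal keeps K positive definite.
X, y, noise = [0.0, 1.0, 2.0], [0.0, 1.0, 0.5], 1e-6
K = [[k(X[i], X[j]) + (noise if i == j else 0.0) for j in range(len(X))]
     for i in range(len(X))]

L = cholesky(K)
alpha = solve_upper(L, solve_lower(L, y))   # alpha = K^{-1} y
x_star = 1.0
mean = sum(k(x_star, X[i]) * alpha[i] for i in range(len(X)))
# With near-zero noise, the posterior mean interpolates the data at x = 1.0.
```

The two triangular solves are why factoring once is preferred over forming the explicit inverse; this is also where the O(N^3) scaling that motivates the approximation methods comes from.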
- APPROXIMATION METHODS
I have begun studying differential geometry, and have been looking for points of departure in everyday life to motivate my studies. Tonight while stirring a pot of mushroom soup, I noticed something interesting about the light's reflection in the beads of fat on the surface as I moved my head from side to side. I found that if I focused on the surrounding surface, and allowed my eyes to take in the beads less directly, I could perceive each disk as a sphere resting on the surface.
After I finished eating, I served up another bowl of soup to observe this phenomenon more closely. I found that if I imagined the lamp's reflection were actually a floating light "projected" by each of the spheres appearing on the soup's surface, my eye could stick with the trick more faithfully. It seemed that assigning the imagined spheres this agency helped my mind hold onto the mirage.
I can imagine the optics explanation of why my mind might be able to switch from seeing light reflected off a disk as though it were reflecting off spheres would rest somewhere in the special relationship circles and spheres share, with no visible edges to break the illusion of an aerial view observed from an askance perspective. However, I can't help wondering how these two scenarios relate - the light reflecting off individual beads of fat on the soup's surface, and the imagined, equally plausible scenario of suspended spheres (with or without the special light-producing agency).
Here's the soup:
How might we define the relationship from the light to the beads and compare it to the relationship from the light to the spheres? What traditions in geometry or topology might help us do that?
On a side note, I observed this evening that the reverse optic trick could be applied to the sun's reflection off the surface of the moon. A quick web search for a geometric rendering of that led me to discover that there are people who strongly believe we are circumambulating a flat earth. Wowzers!