Civilization advances by extending the number of important operations we can perform without thinking about them.
—Alfred North Whitehead
Without realizing it, you accomplish a lot more, hour by hour, than you would without the help of machine learning.
Machine learning is something new under the sun: a technology that builds itself. Ever since our remote ancestors started sharpening stones into tools, humans have been designing artifacts, whether they’re hand built or mass produced. But learning algorithms are artifacts that design other artifacts. “Computers are useless,” said Picasso. “They can only give you answers.” Computers aren’t supposed to be creative; they’re supposed to do what you tell them to. If what you tell them to do is be creative, you get machine learning. A learning algorithm is like a master craftsman: every one of its productions is different and exquisitely tailored to the customer’s needs. But instead of turning stone into masonry or gold into jewelry, learners turn data into algorithms. And the more data they have, the more intricate the algorithms can be.
Homo sapiens is the species that adapts the world to itself instead of adapting itself to the world. Machine learning is the newest chapter in this million-year saga: with it, the world senses what you want and changes accordingly, without you having to lift a finger. Like a magic forest, your surroundings—virtual today, physical tomorrow—rearrange themselves as you move through them. The path you picked out between the trees and bushes grows into a road. Signs pointing the way spring up in the places where you got lost.
How do we learn? Is there a better way? What can we predict? Can we trust what we’ve learned? Rival schools of thought within machine learning have very different answers to these questions. The main ones are five in number, and we’ll devote a chapter to each. Symbolists view learning as the inverse of deduction and take ideas from philosophy, psychology, and logic. Connectionists reverse engineer the brain and are inspired by neuroscience and physics. Evolutionaries simulate evolution on the computer and draw on genetics and evolutionary biology. Bayesians believe learning is a form of probabilistic inference and have their roots in statistics. Analogizers learn by extrapolating from similarity judgments and are influenced by psychology and mathematical optimization.
Each of the five tribes of machine learning has its own master algorithm, a general-purpose learner that you can in principle use to discover knowledge from data in any domain. The symbolists’ master algorithm is inverse deduction, the connectionists’ is backpropagation, the evolutionaries’ is genetic programming, the Bayesians’ is Bayesian inference, and the analogizers’ is the support vector machine. In practice, however, each of these algorithms is good for some things but not others. What we really want is a single algorithm combining the key features of all of them: the ultimate master algorithm.
If it exists, the Master Algorithm can derive all knowledge in the world—past, present, and future—from data. Inventing it would be one of the greatest advances in the history of science. It would speed up the progress of knowledge across the board, and change the world in ways that we can barely begin to imagine. The Master Algorithm is to machine learning what the Standard Model is to particle physics or the Central Dogma to molecular biology: a unified theory that makes sense of everything we know to date, and lays the foundation for decades or centuries of future progress.
The Machine-Learning Revolution
Every algorithm has an input and an output: the data goes into the computer, the algorithm does what it will with it, and out comes the result. Machine learning turns this around: in goes the data and the desired result and out comes the algorithm that turns one into the other. Learning algorithms—also known as learners—are algorithms that make other algorithms. With machine learning, computers write their own programs, so we don’t have to.
The power of machine learning is perhaps best explained by a low-tech analogy: farming. In an industrial society, goods are made in factories, which means that engineers have to figure out exactly how to assemble them from their parts, how to make those parts, and so on—all the way to raw materials. It’s a lot of work. Computers are the most complex goods ever invented, and designing them, the factories that make them, and the programs that run on them is a ton of work. But there’s another, much older way in which we can get some of the things we need: by letting nature make them. In farming, we plant the seeds, make sure they have enough water and nutrients, and reap the grown crops. Why can’t technology be more like this? It can, and that’s the promise of machine learning. Learning algorithms are the seeds, data is the soil, and the learned programs are the grown plants. The machine-learning expert is like a farmer, sowing the seeds, irrigating and fertilizing the soil, and keeping an eye on the health of the crop but otherwise staying out of the way.
Once we look at machine learning this way, two things immediately jump out. The first is that the more data we have, the more we can learn. No data? Nothing to learn. Big data? Lots to learn. The second thing is that machine learning is a sword with which to slay the complexity monster. Given enough data, a learning program that’s only a few hundred lines long can easily generate a program with millions of lines, and it can do this again and again for different problems. The reduction in complexity for the programmer is phenomenal.
We can think of machine learning as the inverse of programming, in the same way that the square root is the inverse of the square, or integration is the inverse of differentiation. Just as we can ask “What number squared gives 16?” or “What is the function whose derivative is x + 1?” we can ask, “What is the algorithm that produces this output?”
In the information-processing ecosystem, learners are the superpredators. Databases, crawlers, indexers, and so on are the herbivores, patiently munging on endless fields of data. Statistical algorithms, online analytical processing, and so on are the predators. Herbivores are necessary, since without them the others couldn’t exist, but superpredators have a more exciting life. A crawler is like a cow, the web is its worldwide meadow, each page is a blade of grass. When the crawler is done munging, a copy of the web is sitting on its hard disks. An indexer then makes a list of the pages where each word appears, much like the index at the end of a book. Databases, like elephants, are big and heavy and never forget. Among these patient beasts dart statistical and analytical algorithms, compacting and selecting, turning data into information. Learners eat up this information, digest it, and turn it into knowledge.
The Master Algorithm
The central hypothesis of this book: All knowledge—past, present, and future—can be derived from data by a single, universal learning algorithm. I call this learner the Master Algorithm.
Which one is the better model for the Master Algorithm: evolution or the brain? This is machine learning’s version of the nature versus nurture debate. And, just as nature and nurture combine to produce us, perhaps the true Master Algorithm contains elements of both.
Machine learning is what you get when the unreasonable effectiveness of mathematics meets the unreasonable effectiveness of data. Biology and sociology will never be as simple as physics, but the method by which we discover their truths can be.
The Master Algorithm is for induction, the process of learning, what the Turing machine is for deduction. It can learn to simulate any other algorithm by reading examples of its input-output behavior. Just as there are many models of computation equivalent to a Turing machine, there are probably many different equivalent formulations of a universal learner. The point, however, is to find the first such formulation, just as Turing found the first formulation of the general-purpose computer.
Of course, the Master Algorithm has at least as many skeptics as it has proponents. Doubt is in order when something looks like a silver bullet. The most determined resistance comes from machine learning’s perennial foe: knowledge engineering. According to its proponents, knowledge can’t be learned automatically; it must be programmed into the computer by human experts. Sure, learners can extract some things from data, but nothing you’d confuse with real knowledge.
In industry, there’s no sign that knowledge engineering will ever be able to compete with machine learning outside of a few niche areas. Why pay experts to slowly and painfully encode knowledge into a form computers can understand, when you can extract it from data at a fraction of the cost? What about all the things the experts don’t know but you can discover from data?
A related, frequently heard objection is “Data can’t replace human intuition.” In fact, it’s the other way around: human intuition can’t replace data. Intuition is what you use when you don’t know the facts, and since you often don’t, intuition is precious. But when the evidence is before you, why would you deny it? Statistical analysis beats talent scouts in baseball (as Michael Lewis memorably documented in Moneyball), it beats connoisseurs at wine tasting, and every day we see new examples of what it can do. Because of the influx of data, the boundary between evidence and intuition is shifting rapidly, and as with any revolution, entrenched ways have to be overcome.
A model of you will negotiate the world on your behalf, playing elaborate games with other people’s and entities’ models.
There’s a sense in which the microprocessor is a good analog for the Master Algorithm. A microprocessor is not the best hardware for running any particular algorithm. That would be an ASIC (application-specific integrated circuit) designed very precisely for that algorithm. Yet microprocessors are what we use for almost all applications, because their flexibility trumps their relative inefficiency. If we had to build an ASIC for every new application, the Information Revolution would never have happened. Similarly, the Master Algorithm is not the best algorithm for learning any particular piece of knowledge; that would be an algorithm that already encodes most of that knowledge (or all of it, making the data superfluous). The point, however, is to induce the knowledge from data, because it’s easier and costs less; so the more general the learning algorithm, the better.
Our search for the Master Algorithm is complicated, but also enlivened, by the rival schools of thought that exist within machine learning. The main ones are the symbolists, connectionists, evolutionaries, Bayesians, and analogizers. Each tribe has a set of core beliefs, and a particular problem that it cares most about. It has found a solution to that problem, based on ideas from its allied fields of science, and it has a master algorithm that embodies it. For symbolists, all intelligence can be reduced to manipulating symbols, in the same way that a mathematician solves equations by replacing expressions by other expressions. Symbolists understand that you can’t learn from scratch: you need some initial knowledge to go with the data. They’ve figured out how to incorporate preexisting knowledge into learning, and how to combine different pieces of knowledge on the fly in order to solve new problems. Their master algorithm is inverse deduction, which figures out what knowledge is missing in order to make a deduction go through, and then makes it as general as possible. For connectionists, learning is what the brain does, and so what we need to do is reverse engineer it. The brain learns by adjusting the strengths of connections between neurons, and the crucial problem is figuring out which connections are to blame for which errors and changing them accordingly. The connectionists’ master algorithm is backpropagation, which compares a system’s output with the desired one and then successively changes the connections in layer after layer of neurons so as to bring the output closer to what it should be. Evolutionaries believe that the mother of all learning is natural selection. If it made us, it can make anything, and all we need to do is simulate it on the computer. The key problem that evolutionaries solve is learning structure: not just adjusting parameters, like backpropagation does, but creating the brain that those adjustments can then fine-tune. The evolutionaries’ master algorithm is genetic programming, which mates and evolves computer programs in the same way that nature mates and evolves organisms. Bayesians are concerned above all with uncertainty. All learned knowledge is uncertain, and learning itself is a form of uncertain inference. The problem then becomes how to deal with noisy, incomplete, and even contradictory information without falling apart. The solution is probabilistic inference, and the master algorithm is Bayes’ theorem and its derivates. Bayes’ theorem tells us how to incorporate new evidence into our beliefs, and probabilistic inference algorithms do that as efficiently as possible. For analogizers, the key to learning is recognizing similarities between situations and thereby inferring other similarities. If two patients have similar symptoms, perhaps they have the same disease. The key problem is judging how similar two things are. The analogizers’ master algorithm is the support vector machine, which figures out which experiences to remember and how to combine them to make new predictions. Each tribe’s solution to its central problem is a brilliant, hard-won advance. But the true Master Algorithm must solve all five problems, not just one.
Hume’s Problem of Induction
Rationalists believe that the senses deceive and that logical reasoning is the only sure path to knowledge. Empiricists believe that all reasoning is fallible and that knowledge must come from observation and experimentation. The French are rationalists; the Anglo-Saxons (as the French call them) are empiricists. Pundits, lawyers, and mathematicians are rationalists; journalists, doctors, and scientists are empiricists. Murder, She Wrote is a rationalist TV crime show; CSI: Crime Scene Investigation is an empiricist one. In computer science, theorists and knowledge engineers are rationalists; hackers and machine learners are empiricists. The rationalist likes to plan everything in advance before making the first move. The empiricist prefers to try things and see how they turn out.
David Hume was the greatest of the empiricists and the greatest English-speaking philosopher of all time. Thinkers like Adam Smith and Charles Darwin count him among their key influences. You could also say he’s the patron saint of the symbolists. He also took the empiricist train of thought that Locke had started to its logical conclusion and asked a question that has since hung like a sword of Damocles over all knowledge, from the most trivial to the most advanced: How can we ever be justified in generalizing from what we’ve seen to what we haven’t? Every learning algorithm is, in a sense, an attempt to answer this question.
How about we just assume that the future will be like the past? This is certainly a risky assumption. (It didn’t work for the inductivist turkey.) On the other hand, without it all knowledge is impossible, and so is life. We’d rather stay alive, even if precariously. Unfortunately, even with that assumption we’re not out of the woods. It takes care of the “trivial” cases: If I’m a doctor and patient B has exactly the same symptoms as patient A, I assume that the diagnosis is the same. But if patient B’s symptoms don’t exactly match anyone else’s, I’m still in the dark. This is the machine-learning problem: generalizing to cases that we haven’t seen before.
Two hundred and fifty years after Hume set off his bombshell, it was given elegant mathematical form by David Wolpert, a physicist turned machine learner. His result, known as the “no free lunch” theorem, sets a limit on how good a learner can be. The limit is pretty low: no learner can be better than random guessing! Don’t give up on machine learning or the Master Algorithm just yet, though. We don’t care about all possible worlds, only the one we live in. If we know something about the world and incorporate it into our learner, it now has an advantage over random guessing. To this Hume would reply that that knowledge must itself have come from induction and is therefore fallible. That’s true, even if the knowledge was encoded into our brains by evolution, but it’s a risk we’ll have to take.
The practical consequence of the “no free lunch” theorem is that there’s no such thing as learning without knowledge. Data alone is not enough. Starting from scratch will only get you to scratch. Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump.
Whenever a learner finds a pattern in the data that is not actually true in the real world, we say that it has overfit the data. Overfitting is the central problem in machine learning.
Every powerful learner, whether symbolist, connectionist, or any other, has to worry about hallucinating patterns. The only safe way to avoid it is to severely restrict what the learner can learn, for example by requiring that it be a short conjunctive concept. Unfortunately, that throws out the baby with the bathwater, leaving the learner unable to see most of the true patterns that are visible in the data. Thus a good learner is forever walking the narrow path between blindness and hallucination.
Learning is a race between the amount of data you have and the number of hypotheses you consider. More data exponentially reduces the number of hypotheses that survive, but if you start with a lot of them, you may still have some bad ones left at the end. As a rule of thumb, if the learner only considers an exponential number of hypotheses (for example, all possible conjunctive concepts), then the data’s exponential payoff cancels it and you’re OK, provided you have plenty of examples and not too many attributes. On the other hand, if it considers a doubly exponential number (for example, all possible rule sets), then the data cancels only one of the exponentials and you’re still in trouble. You can even figure out in advance how many examples you’ll need to be pretty sure that the learner’s chosen hypothesis is very close to the true one, provided it fits all the data; in other words, for the hypothesis to be probably approximately correct.
In practice, Valiant-style analysis tends to be very pessimistic and to call for more data than you have. So how do you decide whether to believe what a learner tells you? Simple: you don’t believe anything until you’ve verified it on data that the learner didn’t see. If the patterns the learner hypothesized also hold true on new data, you can be pretty confident that they’re real. Otherwise you know the learner overfit. This is just the scientific method applied to machine learning: it’s not enough for a new theory to explain past evidence because it’s easy to concoct a theory that does that; the theory must also make new predictions, and you only accept it after they’ve been experimentally verified. (And even then only provisionally, because future evidence could still falsify it.)
But you don’t need to wait around for new data to arrive to decide whether you can trust your learner. Rather, you take the data you have and randomly divide it into a training set, which you give to the learner, and a test set, which you hide from it and use to verify its accuracy. Accuracy on held-out data is the gold standard in machine learning.
Of course, it’s not enough to be able to tell when you’re overfitting; we need to avoid it in the first place. That means stopping short of perfectly fitting the data even if we’re able to. One method is to use statistical significance tests to make sure the patterns we’re seeing are really there. For example, a rule covering three hundred positive examples versus one hundred negatives and a rule covering three positives versus one negative are both 75 percent accurate on the training data, but the first rule is almost certainly better than coin flipping, while the second isn’t, since four flips of an unbiased coin could easily result in three heads. When constructing a rule, if at some point we can’t find any conditions that significantly improve its accuracy then we just stop, even if it still covers some negative examples. This reduces the rule’s training-set accuracy, but probably makes it a more accurate generalization, which is what we really care about.
If your learner’s test-set accuracy disappoints, you need to diagnose the problem. Was it blindness or hallucination? In machine learning, the technical terms for these are bias and variance. A clock that’s always an hour late has high bias but low variance. If instead the clock alternates erratically between fast and slow but on average tells the right time, it has high variance but low bias.
You can estimate the bias and variance of a learner by comparing its predictions after learning on random variations of the training set. If it keeps making the same mistakes, the problem is bias, and you need a more flexible learner (or just a different one). If there’s no pattern to the mistakes, the problem is variance, and you want to either try a less flexible learner or get more data.
The symbolists’ core belief is that all intelligence can be reduced to manipulating symbols. A mathematician solves equations by moving symbols around and replacing symbols by other symbols according to predefined rules. The same is true of a logician carrying out deductions. According to this hypothesis, intelligence is independent of the substrate; it doesn’t matter if the symbol manipulations are done by writing on a blackboard, switching transistors on and off, firing neurons, or playing with Tinkertoys. If you have a setup with the power of a universal Turing machine, you can do anything. Software can be cleanly separated from hardware, and if your concern is figuring out how machines can learn, you (thankfully) don’t need to worry about the latter beyond buying a PC or cycles on Amazon’s cloud.
Symbolism is the shortest path to the Master Algorithm. It doesn’t require us to figure out how evolution or the brain works, and it avoids the mathematical complexities of Bayesianism. Sets of rules and decision trees are easy to understand, so we know what the learner is up to. This makes it easier to figure out what it’s doing right and wrong, fix the latter, and have confidence in the results.
Despite the popularity of decision trees, inverse deduction is the better starting point for the Master Algorithm. It has the crucial property that incorporating knowledge into it is easy—and we know Hume’s problem makes that essential. Also, sets of rules are an exponentially more compact way to represent most concepts than decision trees.
Inverse deduction is like having a superscientist systematically looking at the evidence, considering possible inductions, collating the strongest, and using those along with other evidence to construct yet further hypotheses—all at the speed of computers. It’s clean and beautiful, at least for the symbolist taste. On the other hand, it has some serious shortcomings. The number of possible inductions is vast, and unless we stay close to our initial knowledge, it’s easy to get lost in space. Inverse deduction is easily confused by noise: how do we figure out what the missing deductive steps are, if the premises or conclusions are themselves wrong? Most seriously, real concepts can seldom be concisely defined by a set of rules. They’re not black and white: there’s a large gray area between, say, spam and nonspam. They require weighing and accumulating weak evidence until a clear picture emerges. Diagnosing an illness involves giving more weight to some symptoms than others, and being OK with incomplete evidence.
Connectionists, in particular, are highly critical of symbolist learning. According to them, concepts you can define with logical rules are only the tip of the iceberg; there’s a lot going on under the surface that formal reasoning just can’t see, in the same way that most of what goes on in our minds is subconscious. You can’t just build a disembodied automated scientist and hope he’ll do something meaningful—you have to first endow him with something like a real brain, connected to real senses, growing up in the world, perhaps even stubbing his toe every now and then.
How Does Your Brain Learn?
Hebb’s rule, as it has come to be known, is the cornerstone of connectionism. Indeed, the field derives its name from the belief that knowledge is stored in the connections between neurons. It’s often paraphrased as “Neurons that fire together wire together.”
In symbolist learning, there is a one-to-one correspondence between symbols and the concepts they represent. In contrast, connectionist representations are distributed: each concept is represented by many neurons, and each neuron participates in representing many different concepts. Neurons that excite one another form what Hebb called a cell assembly. Concepts and memories are represented in the brain by cell assemblies. Each of these can include neurons from different brain regions and overlap with other assemblies. The cell assembly for “leg” includes the one for “foot,” which includes assemblies for the image of a foot and the sound of the word foot. If you ask a symbolist system where the concept “New York” is represented, it can point to the precise location in memory where it’s stored. In a connectionist system, the answer is “it’s stored a little bit everywhere.”
Another difference between symbolist and connectionist learning is that the former is sequential, while the latter is parallel. In inverse deduction, we figure out one step at a time what new rules are needed to arrive at the desired conclusion from the premises. In connectionist models, all neurons learn simultaneously according to Hebb’s rule. This mirrors the different properties of computers and brains. Computers do everything one small step at a time, like adding two numbers or flipping a switch, and as a result they need a lot of steps to accomplish anything useful; but those steps can be very fast, because transistors can switch on and off billions of times per second. In contrast, brains can perform a large number of computations in parallel, with billions of neurons working at the same time; but each of those computations is slow, because neurons can fire at best a thousand times per second.
[The S curve] is variously known as the logistic, sigmoid, or S curve. Peruse it closely, because it’s the most important curve in the world. At first the output increases slowly with the input, so slowly it seems constant. Then it starts to change faster, then very fast, then slower and slower until it becomes almost constant again. The transfer curve of a transistor, which relates its input and output voltages, is also an S curve. So both computers and the brain are filled with S curves. But it doesn’t end there. The S curve is the shape of phase transitions of all kinds: the probability of an electron flipping its spin as a function of the applied field, the magnetization of iron, the writing of a bit of memory to a hard disk, an ion channel opening in a cell, ice melting, water evaporating, the inflationary expansion of the early universe, punctuated equilibria in evolution, paradigm shifts in science, the spread of new technologies, white flight from multiethnic neighborhoods, rumors, epidemics, revolutions, the fall of empires, and much more. The Tipping Point could equally well (if less appealingly) be entitled The S Curve.
The universe is a vast symphony of phase transitions, from the cosmic to the microscopic, from the mundane to the life changing. The S curve is not just important as a model in its own right; it’s also the jack-of-all-trades of mathematics. If you zoom in on its midsection, it approximates a straight line. Many phenomena we think of as linear are in fact S curves, because nothing can grow without limit. The early part of an S curve is effectively an exponential, and near the saturation point it approximates exponential decay. When someone talks about exponential growth, ask yourself: How soon will it turn into an S curve? When will the population bomb peter out, Moore’s law lose steam, or the singularity fail to happen?
In the perceptron algorithm, the error signal is all or none: you got it either right or wrong. That’s not much to go on, particularly if you have a network of many neurons. You may know that the output neuron is wrong (oops, that wasn’t your grandmother), but what about some neuron deep inside the brain? What does it even mean for such a neuron to be right or wrong? If the neurons’ output is continuous instead of binary, the picture changes. For starters, we now know how much the output neuron is wrong by: the difference between it and the desired output. If the neuron should be firing away (“Oh hi, Grandma!”) and is firing a little, that’s better than if it’s not firing at all. More importantly, we can now propagate that error to the hidden neurons: if the output neuron should fire more and neuron A connects to it, then the more A is firing, the more we should strengthen their connection; but if A is inhibited by another neuron B, then B should fire less, and so on. Based on the feedback from all the neurons it’s connected to, each neuron decides how much more or less to fire. Based on that and the activity of its input neurons, it strengthens or weakens its connections to them. I need to fire more, and neuron B is inhibiting me? Lower its weight. And neuron C is firing away, but its connection to me is weak? Strengthen it. My “customer” neurons, downstream in the network, will tell me how well I’m doing in the next round. Whenever the learner’s “retina” sees a new image, that signal propagates forward through the network until it produces an output. Comparing this output with the desired one yields an error signal, which then propagates back through the layers until it reaches the retina. Based on this returning signal and on the inputs it had received during the forward pass, each neuron adjusts its weights. As the network sees more and more images of your grandmother and other people, the weights gradually converge to values that let it discriminate between the two. Backpropagation, as this algorithm is known, is phenomenally more powerful than the perceptron algorithm. A single neuron could only learn straight lines. Given enough hidden neurons, a multilayer perceptron, as it’s called, can represent arbitrarily convoluted frontiers. This makes backpropagation—or simply backprop—the connectionists’ master algorithm.
Backprop is an instance of a strategy that is very common in both nature and technology: if you’re in a hurry to get to the top of the mountain, climb the steepest slope you can find. The technical term for this is gradient ascent (if you want to get to the top) or gradient descent (if you’re looking for the valley bottom). Bacteria can find food by swimming up the concentration gradient of, say, glucose molecules, and they can flee from poisons by swimming down their gradient. All sorts of things, from aircraft wings to antenna arrays, can be optimized by gradient descent. Backprop is an efficient way to do it in a multilayer perceptron: keep tweaking the weights so as to lower the error, and stop when all tweaks fail. With backprop, you don’t have to figure out how to tweak each neuron’s weights from scratch, which would be too slow; you can do it layer by layer, tweaking each neuron based on how you tweaked the neurons it connects to. If you had to throw out your entire machine-learning toolkit in an emergency save for one tool, gradient descent is probably the one you’d want to hold on to.
One of the protagonists of this latest twist in the connectionist roller coaster is an unassuming little device called an autoencoder. An autoencoder is a multilayer perceptron whose output is the same as its input. In goes a picture of your grandmother and out comes—the same picture of your grandmother. At first this seems like a silly idea: What use could such a contraption possibly be? The key is to make the hidden layer much smaller than the input and output layers, so the network can’t just learn to copy the input to the hidden layer and the hidden layer to the output, in which case we may as well throw the whole thing out. But if the hidden layer is small, something interesting happens: the network is forced to encode the input in fewer bits, so it can be represented in the hidden layer, and then decode those bits back to full size.
Even if we had a complete map of the brain, we would still be at a loss to figure out what it does. The nervous system of the C. elegans worm consists of only 302 neurons and was completely mapped in 1986, but we still have only a fragmentary understanding of what it does. We need higher-level concepts to make sense of the morass of low-level details, weeding out the ones that are specific to wetware or just quirks of evolution. We don’t build airplanes by reverse engineering feathers, and airplanes don’t flap their wings. Rather, airplane designs are based on the principles of aerodynamics, which all flying objects must obey. We still do not understand those analogous principles of thought.
Perhaps connectomics is overkill. Some connectionists have been overheard claiming that backprop is the Master Algorithm and we just need to scale it up. But symbolists pour scorn on this notion. They point to a long list of things that humans can do but neural networks can’t. Take commonsense reasoning. It involves combining pieces of information that may have never been seen together before. Did Mary eat a shoe for lunch? No, because Mary is a person, people only eat edible things, and shoes are not edible. Symbolic systems have no trouble with this—they just chain the relevant rules—but multilayer perceptrons can’t do it; once they’re done learning, they just compute the same fixed function over and over again. Neural networks are not compositional, and compositionality is a big part of human cognition. Another big issue is that humans—and symbolic models like sets of rules and decision trees—can explain their reasoning, while neural networks are big piles of numbers that no one can understand.
Evolution: Nature’s Learning Algorithm
The key input to a genetic algorithm is a fitness function. Given a candidate program and some purpose it is meant to fill, the fitness function assigns the program a numeric score reflecting how well it fits the purpose. In natural selection, it’s questionable whether fitness can be interpreted this way: while the fitness of a wing for flight makes intuitive sense, evolution as a whole has no known purpose. Nevertheless, in machine learning having something like a fitness function is a no-brainer. If we need a program that can diagnose a patient, one that correctly diagnoses 60 percent of the patients in our database is better than one that only gets it right 55 percent of the time, and thus a possible fitness function is the fraction of correctly diagnosed cases. In this regard, genetic algorithms are a lot like selective breeding.
Notice how much genetic algorithms differ from multilayer perceptrons. Backprop entertains a single hypothesis at any given time, and the hypothesis changes gradually until it settles into a local optimum. Genetic algorithms consider an entire population of hypotheses at each step, and these can make big jumps from one generation to the next, thanks to crossover. Backprop proceeds deterministically after setting the initial weights to small random values. Genetic algorithms, in contrast, are full of random choices: which hypotheses to keep alive and cross over (with fitter hypotheses being more likely candidates), where to cross two strings, which bits to mutate. Backprop learns weights for a predefined network architecture; denser networks are more flexible but also harder to learn. Genetic algorithms make no a priori assumptions about the structures they will learn, other than their general form. Because of all this, genetic algorithms are much less likely than backprop to get stuck in a local optimum and in principle better able to come up with something truly new. But they are also much more difficult to analyze. How do we know a genetic algorithm will get somewhere meaningful instead of randomly walking around like the proverbial drunkard? The key is to think in terms of building blocks. Every subset of a string’s bits potentially encodes a useful building block, and when we cross over two strings, those building blocks come together into a larger one, which in turn becomes grist for the mill.
Evolutionaries and connectionists have something important in common: they both design learning algorithms inspired by nature. But then they part ways. Evolutionaries focus on learning structure; to them, fine-tuning an evolved structure by optimizing parameters is of secondary importance. In contrast, connectionists prefer to take a simple, hand-coded structure with lots of connections and let weight learning do all the work. This is machine learning’s version of the nature versus nurture controversy, and there are good arguments on both sides. On the one hand, evolution has produced many amazing things, none more amazing than you. With or without crossover, evolving structure is an essential part of the Master Algorithm. The brain can learn anything, but it can’t evolve a brain. If we thoroughly understood its architecture, we could just implement it in hardware, but we’re very far from that; getting an assist from computer-simulated evolution is a no-brainer. What’s more, we also want to evolve the brains of robots, systems with arbitrary sensors, and super-AIs. There’s no reason to stick with the design of the human brain if there are better ones for those tasks. On the other hand, evolution is excruciatingly slow. The entire life of an organism yields only one piece of information about its genome: its fitness, reflected in the organism’s number of offspring. That’s a colossal waste of information, which neural learning avoids by acquiring the information at the point of use (so to speak).
As in the nature versus nurture debate, neither side has the whole answer; the key is figuring out how to combine the two. The Master Algorithm is neither genetic programming nor backprop, but it has to include the key elements of both: structure learning and weight learning. Evolution searches for good structures, and neural learning fills them in: this combination is the easiest of the steps we’ll take toward the Master Algorithm.
In the Church of the Reverend Bayes
At heart, Bayes’ theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down.
Suppose you awake in the middle of the night on a strange planet. Even though all you can see is the starry sky, you have reason to believe that the sun will rise at some point, since most planets revolve around themselves and their sun. So your estimate of the corresponding probability should be greater than one-half (two-thirds, say). We call this the prior probability that the sun will rise, since it’s prior to seeing any evidence. It’s not based on counting the number of times the sun has risen on this planet in the past, because you weren’t there to see it; rather, it reflects your a priori beliefs about what will happen, based on your general knowledge of the universe. But now the stars start to fade, so your confidence that the sun does rise on this planet goes up, based on your experience on Earth. Your confidence is now a posterior probability, since it’s after seeing some evidence. The sky begins to lighten, and the posterior probability takes another leap. The crucial question is exactly how the posterior probability should evolve as you see more evidence. The answer is Bayes’ theorem. We can think of it in terms of cause and effect. Sunrise causes the stars to fade and the sky to lighten, but the latter is stronger evidence of daybreak, since the stars could fade in the middle of the night due to, say, fog rolling in. So the probability of sunrise should increase more after seeing the sky lighten than after seeing the stars fade.
Humans, it turns out, are not very good at Bayesian inference, at least when verbal reasoning is involved. The problem is that we tend to neglect the cause’s prior probability.
Bayes’ theorem is useful because what we usually know is the probability of the effects given the causes, but what we want to know is the probability of the causes given the effects. For example, we know what percentage of flu patients have a fever, but what we really want to know is how likely a patient with a fever is to have the flu. Bayes’ theorem lets us go from one to the other. Its significance extends far beyond that, however. For Bayesians, this innocent-looking formula is the F = ma of machine learning, the foundation from which a vast number of results and applications flow. And whatever the Master Algorithm is, it must be “just” a computational implementation of Bayes’ theorem. I put just in quotes because implementing Bayes’ theorem on a computer turns out to be fiendishly hard for all but the simplest problems.
Bayes’ theorem as a foundation for statistics and machine learning is bedeviled not just by computational difficulty but also by extreme controversy. You might be forgiven for wondering why: Isn’t it a straightforward consequence of the notion of conditional probability, as we saw in the flu example? Indeed, no one has a problem with the formula itself. The controversy is in how Bayesians obtain the probabilities that go into it and what those probabilities mean. For most statisticians, the only legitimate way to estimate probabilities is by counting how often the corresponding events occur. For example, the probability of fever is 0.2 because twenty out of one hundred observed patients had it. This is the “frequentist” interpretation of probability, and the dominant school of thought in statistics takes its name from it. But notice that in the sunrise example, and in Laplace’s principle of indifference, we did something different: we pulled a probability out of thin air. What exactly justifies assuming a priori that the probability the sun will rise is one-half, or two-thirds, or whatever? Bayesians’ answer is that a probability is not a frequency but a subjective degree of belief. Therefore it’s up to you what you make it, and all that Bayesian inference lets you do is update your prior beliefs with new evidence to obtain your posterior beliefs (also known as “turning the Bayesian crank”). Bayesians’ devotion to this idea is near religious, enough to withstand two hundred years of attacks and counting. And with the appearance on the stage of computers powerful enough to do Bayesian inference, and the massive data sets to go with it, they’re beginning to gain the upper hand.
Machine learning is the art of making false assumptions and getting away with it. As the statistician George Box famously put it: “All models are wrong, but some are useful.” An oversimplified model that you have enough data to estimate is better than a perfect one that you don’t. It’s astonishing how simultaneously very wrong and very useful some models can be. The economist Milton Friedman even argued in a highly influential essay that the best theories are the most oversimplified, provided their predictions are accurate, because they explain the most with the least.
Bayesian networks are for Bayesians what logic is for symbolists: a lingua franca that allows us to elegantly encode a dizzying variety of situations and devise algorithms that work uniformly in all of them. We can think of a Bayesian network as a “generative model,” a recipe for probabilistically generating a state of the world: first decide independently whether there’s a burglary and/or an earthquake, then based on that decide whether the alarm goes off, and then based on that whether Bob and Claire call. A Bayesian network tells a story: A happened, and it led to B; at the same time, C also happened, and B and C together caused D. To compute the probability of a particular story, we just multiply the probabilities of all of its different strands.
For Bayesians learning is just another kind of probabilistic inference. All you have to do is apply Bayes’ theorem with the hypotheses as the possible causes and the data as the observed effect.
For a Bayesian, in fact, there is no such thing as the truth; you have a prior distribution over hypotheses, after seeing the data it becomes the posterior distribution, as given by Bayes’ theorem, and that’s all. This is a radical departure from the way science is usually done. It’s like saying, “Actually, neither Copernicus nor Ptolemy was right; let’s just predict the planets’ future trajectories assuming Earth goes round the sun and vice versa and average the results.” Of course, it’s a weighted average, the weight of a hypothesis being its posterior probability, so a hypothesis that explains the data better will count for more. Still, as the joke goes, being Bayesian means never having to say you’re certain.
Despite the convergence of frequentist and Bayesian thinking on some issues, there remains the philosophical difference about the meaning of probability. Viewing it as subjective makes many scientists queasy, but it also enables many otherwise-forbidden uses. If you’re a frequentist, you can only estimate probabilities of events that can occur more than once. So a question like “What is the probability that Hillary Clinton will beat Jeb Bush in the next presidential election?” is unanswerable, because there’s never been an election pitting them against each other. But for a Bayesian, a probability is a subjective degree of belief, so he’s free to make an educated guess, and the inference calculus keeps all his guesses consistent.
You Are What You Resemble
Analogical reasoning has a distinguished intellectual pedigree. Aristotle expressed it in his law of similarity: if two things are similar, the thought of one will tend to trigger the thought of the other. Empiricists like Locke and Hume followed suit. Truth, said Nietzche, is a mobile army of metaphors. Kant was also a fan. William James believed that “this sense of sameness is the very keel and backbone of our thinking.” Some contemporary psychologists even argue that human cognition in its entirety is a fabric of analogies. We rely on it to find our way around a new town and to understand expressions like “see the light” and “stand tall.” Given all this, it’s not surprising that analogy plays a prominent role in machine learning. It got off to a slow start, though, and was initially overshadowed by neural networks. Its first algorithmic incarnation appeared in an obscure technical report written in 1951 by two Berkeley statisticians, Evelyn Fix and Joe Hodges, and was not published in a mainstream journal until decades later. But in the meantime, other papers on Fix and Hodges’s algorithm started to appear and then to multiply until it was one of the most researched in all of computer science. The nearest-neighbor algorithm, as it’s called, is the first stop on our tour of analogy-based learning. The second is support vector machines, an idea that took machine learning by storm around the turn of the millennium and was only recently overshadowed by deep learning. The third and last is full-blown analogical reasoning, which has been a staple of psychology and AI for several decades, and a background theme in machine learning for nearly as long.
Imagine trying to define what a face is with a decision tree. You could say it has two eyes, a nose, and a mouth, but what is an eye and how do you find it in an image? What if the person’s eyes are closed? Reliably defining a face all the way down to individual pixels is extremely difficult, particularly given all the different expressions, poses, contexts, and lighting conditions a face could appear in. Instead, nearest-neighbor takes a shortcut: if the image in its database most similar to the one Jane just uploaded is of a face, then so is Jane’s. For this to work, the database needs to contain an image that’s similar enough to the new one—for example, a face with similar pose, lighting, and so on—so the bigger the database, the better. For a simple two-dimensional problem like guessing the border between two countries, a tiny database suffices. For a very hard problem like identifying faces, where the color of each pixel is a dimension of variation, we need a huge database. But these days we have them.
There’s a serpent in this Eden, of course. It’s called the curse of dimensionality, and while it affects all learners to a greater or lesser degree, it’s particularly bad for nearest-neighbor. In low dimensions (like two or three), nearest-neighbor usually works quite well. But as the number of dimensions goes up, things fall apart pretty quickly. It’s not uncommon today to have thousands or even millions of attributes to learn from. With enough irrelevant attributes, accidental similarity in the irrelevant dimensions swamps out meaningful similarity in the important ones, and nearest-neighbor becomes no better than random guessing. A bigger problem is that, surprisingly, having more attributes can be harmful even when they’re all relevant. In fact, no learner is immune to the curse of dimensionality. It’s the second worst problem in machine learning, after overfitting.
Superficially, a support vector machine, or SVM for short, looks a lot like weighted k-nearest-neighbor: the frontier between the positive and negative classes is defined by a set of examples and their weights, together with a similarity measure. A test example belongs to the positive class if, on average, it looks more like the positive examples than the negative ones. The average is weighted, and the SVM remembers only the key examples required to pin down the frontier. These examples are called support vectors because they’re the vectors that “hold up” the frontier: remove one, and a section of the frontier slides to a different place. To learn an SVM, we need to choose the support vectors and their weights. The similarity measure, which in SVM-land is called the kernel, is usually chosen a priori.
Constrained optimization is the problem of maximizing or minimizing a function subject to constraints. The universe maximizes entropy subject to keeping energy constant. Problems of this type are widespread in business and technology. For example, we may want to maximize the number of widgets a factory produces, subject to the number of machine tools available, the widgets’ specs, and so on. With SVMs, constrained optimization became crucial for machine learning as well. Unconstrained optimization is getting to the top of the mountain, and that’s what gradient descent (or, in this case, ascent) does. Constrained optimization is going as high as you can while staying on the road. If the road goes up to the very top, the constrained and unconstrained problems have the same solution. More often, though, the road zigzags up the mountain and then back down without ever reaching the top. You know you’ve reached the highest point on the road when you can’t go any higher without driving off the road; in other words, when the path to the top is at right angles to the road. If the road and the path to the top form an oblique angle, you can always get higher by driving farther along the road, even if that doesn’t get you higher as quickly as aiming straight for the top of the mountain. So the way to solve a constrained optimization problem is to follow not the gradient but the part of it that’s parallel to the constraint surface—in this case the road—and stop when that part is zero.
In general, we have to deal with many constraints at once (one per example, in the case of SVMs). Suppose you wanted to get as close as possible to the North Pole but couldn’t leave your room. Each of the room’s four walls is a constraint, and the solution is to follow the compass until you bump into the corner where the northeast and northwest walls meet. We say that these two walls are the active constraints because they’re what prevents you from reaching the optimum, namely the North Pole. If your room has a wall facing exactly north, that’s the sole active constraint, and the solution is a point in the middle of it.
Practical successes aside, SVMs also turned a lot of machine-learning conventional wisdom on its head. For example, they gave the lie to the notion, sometimes misidentified with Occam’s razor, that simpler models are more accurate. On the contrary, an SVM can have an infinite number of parameters and still not overfit, provided it has a large enough margin.
Two things are similar if they agree with one another in some respects. If they agree in some respects, they will probably also agree in others. This is the essence of analogy. It also points to the two main subproblems in analogical reasoning: figuring out how similar two things are and deciding what else to infer from their similarities.
The most important question in any analogical learner is how to measure similarity. It could be as simple as Euclidean distance between data points, or as complex as a whole program with multiple levels of subroutines whose final output is a similarity value. Either way, the similarity function controls how the learner generalizes from known examples to new ones. It’s where we insert our knowledge of the problem domain into the learner, making it the analogizers’ answer to Hume’s question. We can apply analogical learning to all kinds of objects, not just vectors of attributes, provided we have a way of measuring the similarity between them.
The second part of analogical reasoning is figuring out what we can infer about the new object based on similar ones we’ve found. This can be very simple or very complex. In nearest-neighbor or SVMs, it just consists of predicting the new object’s class based on the classes of the nearest neighbors or support vectors. But in case-based reasoning, another type of analogical learning, the output can be a complex structure formed by composing parts of the retrieved objects. Suppose your HP printer is spewing out gibberish, and you call up their help desk. Chances are they’ve seen your problem many times before, so a good strategy is to find those records and piece together a potential solution for your problem from them. This is not just a matter of finding complaints with many similar attributes to yours: for example, whether you’re using your printer with Windows or Mac OS X may cause very different settings of the system and the printer to become relevant. And once you’ve found the most relevant cases, the sequence of steps needed to solve your problem may be a combination of steps from different cases, with some further tweaks specific to yours. Help desks are currently the most popular application of case-based reasoning.
Just as outsourcing keeps climbing the skills ladder, so does analogical learning. The first robo-lawyers that argue for a particular verdict based on precedents have already been built. One such system correctly predicted the outcomes of over 90 percent of the trade secret cases it examined.
Analogizers’ neatest trick, however, is learning across problem domains. Humans do it all the time: an executive can move from, say, a media company to a consumer-products one without starting from scratch because many of the same management skills still apply. Wall Street hires lots of physicists because physical and financial problems, although superficially very different, often have a similar mathematical structure. Yet all the learners we’ve seen so far would fall flat if we, say, trained them to predict Brownian motion and then asked them to predict the stock market. Stock prices and the velocities of particles suspended in a fluid are just different variables, so the learner wouldn’t even know where to start. But analogizers can do this using structure mapping, an algorithm invented by Dedre Gentner, a psychologist at Northwestern University. Structure mapping takes two descriptions, finds a coherent correspondence between some of their parts and relations, and then, based on that correspondence, transfers further properties from one structure to the other. For example, if the structures are the solar system and the atom, we can map planets to electrons and the sun to the nucleus and conclude, as Bohr did, that electrons revolve around the nucleus. The truth is more subtle, of course, and we often need to refine analogies after we make them. But being able to learn from a single example like this is surely a key attribute of a universal learner. If the Master Algorithm is not analogy, it must surely be something like it.
Instance-based learning, as it’s sometimes called, is supposedly better for modeling how we remember specific episodes in our lives; rules are the putative choice for reasoning with abstract concepts like “work” and “love.” But when I was a graduate student, it struck me that these two are really just points on a continuum, and we should be able to learn across all of it. Rules are in effect generalized instances where we’ve “forgotten” some attributes because they didn’t matter. Conversely, instances are very specific rules, with a condition on every attribute. As we go through life, similar episodes gradually become abstracted into rule-based structures, like “eating at a restaurant.” You know that going to a restaurant involves ordering from a menu and leaving a tip, and you follow those “rules of conduct” every time you eat out, but you probably don’t remember the specific restaurants where you first became aware of them.
Learning Without a Teacher
Even though children certainly get plenty of help from their parents, they learn mostly on their own, without supervision, and that’s what seems most miraculous. None of the algorithms we’ve seen so far can do it, but we’re about to see several that can—bringing us one step closer to the Master Algorithm.
Organizing the world into objects and categories is second nature to an adult but not to an infant. What we need is an algorithm that will spontaneously group together similar objects, or different images of the same object. This is the problem of clustering, and it’s one of the most intensively studied in machine learning. A cluster is a set of similar entities, or at a minimum, a set of entities that are more similar to each other than to members of other clusters. It’s human nature to cluster things, and it’s often the first step on the road to knowledge.
Confronted with big data—a very large number of objects—our first recourse is to group them into a more manageable number of clusters. A whole market is too coarse, and individual customers are too fine, so marketers divide markets into segments, which is their word for clusters. Anything described by numeric attributes—say, people’s heights, weights, girths, shoe sizes, hair lengths, and so on—makes it easy to compute the average member: his height is the average height of all the cluster members, his weight the average of all the weights, and so on. For categorical attributes, like gender, hair color, zip code, or favorite sport, the “average” is simply the most frequent value. The average member described by this set of attributes may or may not be a real person, but either way it’s a useful reference to have: if you’re brainstorming how to market a new product, picturing Peoria as the town where you’re launching it or Bob Burns as your target customer beats thinking of abstract entities like “the market” or “the consumer.”
As useful as such averages are, we can do even better; indeed the whole point of big data and machine learning is to avoid thinking at such a coarse level. Our clusters can be very specialized sets of people or even different aspects of the same person: Alice buying books for work, for leisure, or as Christmas presents; Alice in a good mood versus Alice with the blues. Amazon would like to distinguish the books Alice buys for herself from the ones she buys for her boyfriend, as this would allow it to make appropriate recommendations at appropriate times.
Suppose we decide that letting Robby roam around in the real world is too slow and cumbersome a way to learn. Instead, like a would-be pilot learning in a flight simulator, we’ll have him look at computer-generated images. We know what clusters the images come from, but we’re not telling Robby. Instead, we create each image by first choosing a cluster at random (toys, say) and then synthesizing an example of that cluster (small, fluffy, brown teddy bear with big black eyes, round ears, and a bow tie). We also choose the properties of the example at random: the size comes from a normal distribution with a mean of ten inches, the fur is brown with 80 percent probability and white otherwise, and so on. After Robby has seen lots of images generated in this way, he should have learned to cluster them into people, furniture, toys, and so on, because people are more like people than furniture and so on. But the interesting question is: If we look at it from Robby’s point of view, what’s the best algorithm to discover the clusters? The answer is surprising: Naïve Bayes, which we first met as an algorithm for supervised learning. The difference is that now Robby doesn’t know the classes, so he’ll have to guess them!
Robby is faced with a chicken-and-egg problem: if he knew the objects’ classes, he could learn the classes’ models by counting, and if he knew the models, he could infer the objects’ classes. We seem to be stuck again, but far from it: just start by guessing a class for each object any way you want—even at random—and you’re off to the races. From those classes and the data, you can learn the class models; based on these models you can reinfer the classes and so on. At first sight this looks like a crazy scheme: it may never finish, circling forever between inferring the classes from the models and the models from the classes, and even if it does finish, there’s no reason to believe it will settle on meaningful clusters. But in 1977 a trio of Harvard statisticians (Arthur Dempster, Nan Laird, and Donald Rubin) showed that the crazy scheme actually works: every time we go around the loop, the cluster model gets better, and the loop ends when the model is a local maximum of the likelihood. They called this scheme the EM algorithm, where the E stands for expectation (inferring the expected probabilities) and the M for maximization (estimating the maximum-likelihood parameters). They also showed that many previous algorithms were special cases of EM. Whenever we want to learn a statistical model but are missing some crucial information (e.g., the classes of the examples), we can use EM. This makes it one of the most popular algorithms in all of machine learning.
Grouping a large number of entities into a smaller number of clusters is only half the battle. The other half is shortening the description of each entity. The very first picture of Mom that Robby sees comprises perhaps a million pixels, each with its own color, but you hardly need a million variables to describe a face. Likewise, each thing you click on at Amazon provides an atom of information about you, but what Amazon would really like to know is your likes and dislikes, not your clicks. The former, which are fairly stable, are somehow immanent in the latter, which grow without limit as you use the site. Little by little, all those clicks should add up to a picture of your taste, in the same way that all those pixels add up to a picture of your face. The question is how to do the adding.
Machine learners call this process dimensionality reduction because it reduces a large number of visible dimensions (the pixels) to a few implicit ones (expression, facial features). Dimensionality reduction is essential for coping with big data—like the data coming in through your senses every second. A picture may be worth a thousand words, but it’s also a million times more costly to process and remember. Yet somehow your visual cortex does a pretty good job of whittling it down to a manageable amount of information, enough to navigate the world, recognize people and things, and remember what you saw. It’s one of the great miracles of cognition and so natural you’re not even conscious of doing it.
Principal-component analysis (PCA), as this process is known, is one of the key tools in the scientist’s toolkit. You could say PCA is to unsupervised learning what linear regression is to the supervised variety. The famous hockey-stick curve of global warming, for example, is the result of finding the principal component of various temperature-related data series (tree rings, ice cores, etc.) and assuming it’s the temperature. Biologists use PCA to summarize the expression levels of thousands of different genes into a few pathways. Psychologists have found that personality boils down to five dimensions—extroversion, agreeableness, conscientiousness, neuroticism, and openness to experience—which they can infer from your tweets and blog posts.
Parents don’t teach their children to crawl, walk, or run; they figure it out on their own. But none of the learning algorithms we’ve seen so far can do this. Humans do have one constant guide: their emotions. We seek pleasure and avoid pain. When you touch a hot stove, you instinctively recoil. That’s the easy part. The hard part is learning not to touch the stove in the first place. That requires moving to avoid a sharp pain that you have not yet felt. Your brain does this by associating the pain not just with the moment you touch the stove, but with the actions leading up to it. Edward Thorndike called this the law of effect: actions that lead to pleasure are more likely to be repeated in the future; actions that lead to pain, less so. Pleasure travels back through time, so to speak, and actions can eventually become associated with effects that are quite remote from them. Humans can do this kind of long-range reward seeking better than any other animal, and it’s crucial to our success.
There’s a whole subfield of machine learning dedicated to algorithms that explore on their own, flail, hit on rewards, and figure out how to get them again in the future, much like babies crawling around and putting things in their mouths. It’s called reinforcement learning, and your first housebot will probably use it a lot.
The notion that not all states have rewards (positive or negative) but every state has a value is central to reinforcement learning. In board games, only final positions have a reward (1, 0, or −1 for a win, tie, or loss, say). Other positions give no immediate reward, but they have value in that they can lead to rewards later. A chess position from which you can force checkmate in some number of moves is practically as good as a win and therefore has high value. We can propagate this kind of reasoning all the way to good and bad opening moves, even if at that distance the connection is far from obvious. In video games, the rewards are usually points, and the value of a state is the number of points you can accumulate starting from that state. In real life, a reward now is better than a reward later, so future rewards can be discounted by some rate of return, like investments. Of course, the rewards depend on what actions you choose, and the goal of reinforcement learning is to always choose the action that leads to the greatest rewards.
Somehow, you have to abstract over all the infinite paths the future could take and make a decision now. Reinforcement learning does that by estimating the value of each state—the sum total of the rewards you can expect to get starting from that state—and choosing the actions that maximize it. In effect, reinforcement learning is a kind of speeded-up evolution—trying, discarding, and refining actions within a single lifetime instead of over generations—and by that standard it’s extremely efficient.
Reinforcement learning has also influenced psychology and neuroscience. The brain does it, using the neurotransmitter dopamine to propagate differences between expected and actual rewards. Reinforcement learning explains Pavlovian conditioning, but unlike behaviorism, it allows animals to have internal mental states. Foraging bees use it, as do mice finding cheese in mazes. Your daily life is a stream of little-noticed miracles made possible in part by reinforcement learning. You get up, get dressed, eat breakfast, and drive to work, all the while thinking about something else. Below the surface, reinforcement learning continually orchestrates and fine-tunes this prodigious symphony of motion. Snippets of reinforcement learning, also known as habits, make up most of what you do.
We perceive and remember things in chunks, and we can only hold so many chunks in short-term memory at any given time (seven plus or minus two, according to the classic paper by George Miller). Crucially, grouping things into chunks allows us to process much more information than we otherwise could. That’s why telephone numbers have hyphens: 1–723–458–3897 is much easier to remember than 17234583897. Herbert Simon, Newell’s longtime collaborator and AI cofounder, had earlier found that the main difference between novice and expert chess players is that novices perceive chess positions one piece at a time while experts see larger patterns involving multiple pieces. Getting better at chess mainly involves acquiring more and larger such chunks. Newell and Rosenbloom hypothesized that a similar process is at work in all skill acquisition, not just chess. Chunking remains a preeminent example of a learning algorithm inspired by psychology, and the true Master Algorithm, whatever it turns out to be, must surely share its ability to improve with practice.
Chunking and reinforcement learning are not as widely used in business as supervised learning, clustering, or dimensionality reduction, but a simpler type of learning by interacting with the environment is: learning the effects of your actions (and acting accordingly). If the background color of your e-commerce site’s home page is currently blue and you’re wondering whether making it red would increase sales, try it out on a hundred thousand randomly chosen customers and compare the results with those of the regular site. This technique, called A/B testing, was at first used mainly in drug trials but has since spread to many fields where data can be gathered on demand, from marketing to foreign aid. It can also be generalized to try many combinations of changes at once, without losing track of which changes lead to which gains (or losses). Companies like Amazon and Google swear by it; you’ve probably participated in thousands of A/B tests without realizing it. A/B testing gives the lie to the oft-heard criticism that big data is only good for finding correlations, not causation.
The best way to understand an entity—whether it’s a person, an animal, a web page, or a molecule—is to understand how it relates to other entities. This requires a new kind of learning that doesn’t treat the data as a random sample of unrelated objects but as a glimpse into a complex network. Nodes in the network interact; what you do to one affects the others and comes back to affect you. Relational learners, as they’re called, may not quite have social intelligence, but they’re the next best thing. In traditional statistical learning, every man is an island, entire of itself. In relational learning, every man is a piece of the continent, a part of the main.
The neatest trick a relational learner can do is to turn a sporadic teacher into an assiduous one. For an ordinary classifier, examples without classes are useless. If I’m given a patient’s symptoms, but not the diagnosis, that doesn’t help me learn to diagnose. But if I know that some of the patient’s friends have the flu, that’s indirect evidence that he may have the flu as well. Diagnosing a few people in a network and then propagating those diagnoses to their friends, and their friends’ friends, is the next best thing to diagnosing everyone. The inferred diagnoses may be noisy, but the overall statistics of how symptoms correlate with the flu will probably be a lot more accurate and complete than if I had only a handful of isolated diagnoses to draw on. Children are very good at making the most of the sporadic supervision they get (provided they don’t choose to ignore it). Relational learners share some of that ability.
The Pieces of the Puzzle Fall into Place
Machine learning is both a science and a technology, and both characteristics give us hints on how to unify it. On the science side, unifying theories often begin with a deceptively simple observation. Two seemingly unrelated phenomena turn out to be just two faces of the same coin, and like the first domino to fall, that realization sets off a cascade of others. An apple falling to the ground, the moon hanging in the sky: both are caused by gravity, and—apocryphal story or not—once Newton figured out how, gravity turned out to also account for the tides, the precession of the equinoxes, the trajectories of comets, and much else. In everyday experience, electricity and magnetism are never seen together: a lightning spark here, a rock that attracts iron objects there, both quite rare. But once Maxwell figured out how a changing electric field gives rise to magnetism and vice versa, it became clear that light itself is an intimate marriage of the two, and today we know that, far from rare, electromagnetism pervades all matter.
In each of these cases, a bewildering variety of observations turned out to have a common cause, and once scientists identified it, they could in turn use it to predict many new phenomena. Similarly, even though the learners we’ve met in this book seem quite disparate—some based on the brain, some on evolution, some on abstract mathematical principles—they in fact have much in common, and the resulting theory of learning yields many new insights.
Although it is less well known, many of the most important technologies in the world are the result of inventing a unifier, a single mechanism that does what previously required many. The Internet, as the name implies, is a network that interconnects networks. Without it, every type of network would need a different protocol to talk to every other, much like we need a different dictionary for every pair of languages in the world. The Internet’s protocols are an Esperanto that gives each computer the illusion of talking directly to any other and that allows e-mail and the web to ignore the details of the physical infrastructure they flow over.
The Master Algorithm is the unifier of machine learning: it lets any application use any learner, by abstracting the learners into a common form that is all the applications need to know. Our first step toward the Master Algorithm will be surprisingly simple. As it turns out, it’s not hard to combine many different learners into one, using what is known as metalearning.
Think of each learner as an expert on a committee. Each looks carefully at the instance to be classified—what is the diagnosis for this patient?—and confidently makes its prediction. You’re not an expert yourself, but you’re the chair of the committee, and your job is to combine their recommendations into a final decision. What you have on your hands is in fact a new classification problem, where instead of the patient’s symptoms, the input is the experts’ opinions. But you can apply machine learning to this problem in the same way the experts applied it to the original one. We call this metalearning because it’s learning about the learners. The metalearner can itself be any learner, from a decision tree to a simple weighted vote. To learn the weights, or the decision tree, we replace the attributes of each original example by the learners’ predictions. Learners that often predict the correct class will get high weights, and inaccurate ones will tend to be ignored. This type of metalearning is called stacking.
An even simpler metalearner is bagging. Bagging generates random variations of the training set by resampling, applies the same learner to each one, and combines the results by voting. The reason to do this is that it reduces variance: the combined model is much less sensitive to the vagaries of the data than any single one, making this a remarkably easy way to improve accuracy.
One of the cleverest metalearners is boosting. Instead of combining different learners, boosting repeatedly applies the same classifier to the data, using each new model to correct the previous ones’ mistakes. It does this by assigning weights to the training examples; the weight of each misclassified example is increased after each round of learning, causing later rounds to focus more on it. The name boosting comes from the notion that this process can boost a classifier that’s only slightly better than random guessing, but consistently so, into one that’s almost perfect.
P = ew • n / Z
P is a probability, w is a vector of weights (notice it’s in boldface), n is a vector of numbers, and their dot product • is exponentiated and divided by Z, the sum of all products.
The equation gives the probability of an image (or whatever) according to a Markov network. But it’s more general than that because it’s not just the equation of a Markov network; rather, it’s the equation of a Markov logic network, as we call it. In a Markov logic network, or MLN for short, the numbers in n don’t have to be just zero or one, and they don’t refer to features—they refer to logical formulas.
On the probabilistic side, an MLN reduces to a Markov network when all the formulas talk about a single object. So Markov logic includes both logic and Markov networks as special cases, and it’s the unification we were looking for.
The unified learner we’ve arrived at uses MLNs as the representation, posterior probability as the evaluation function, and genetic search coupled with gradient descent as the optimizer. If we want, we can easily replace the posterior by some other accuracy measure, or genetic search by hill climbing. We’ve ascended a high peak, and now we can enjoy the view. I wouldn’t be so rash as to call this learner the Master Algorithm, however. For one, the proof of the pudding is in the eating, and although over the last decade this algorithm (or variations of it) has been successfully applied in many areas, there are many more to which it hasn’t, and so it’s not yet clear just how general purpose it is. Second, there are some important problems that it doesn’t solve. We christened it Alchemy to remind ourselves that, despite all its successes, machine learning is still in the alchemy stage of science.
Alchemy addresses Hume’s original question by having another input besides the data: your initial knowledge, in the form of a set of logical formulas, with or without weights. The formulas can be inconsistent, incomplete, or even just plain wrong; the learning and probabilistic reasoning will take care of that. The key point is that Alchemy doesn’t have to learn from scratch.
Alchemy is like an inductive Turing machine, which we can program to behave as a very powerful or a very restricted learner; it’s up to us. Alchemy provides a unifier for machine learning in the same way that the Internet provides one for computer networks, the relational model for databases, or the graphical user interface for everyday applications.
Alchemy addresses the problems that each of the five tribes of machine learning has worked on for so long.
Despite its successes, Alchemy has some significant shortcomings. It does not yet scale to truly big data, and someone without a PhD in machine learning will find it hard to use. Because of these problems, it’s not yet ready for prime time. But let’s see what we can do about them.
The world is not a random jumble of interactions; it has a hierarchical structure: galaxies, planets, continents, countries, cities, neighborhoods, your house, you, your head, your nose, a cell on its tip, the organelles in it, molecules, atoms, subatomic particles. The way to model it, then, is with an MLN that also has a hierarchical structure. This is an example of the assumption that the learner and its environment are alike. The MLN doesn’t have to know a priori which parts the world is composed of; all Alchemy has to do is assume that the world has parts and look for them, rather like a newly made bookshelf assumes that there are books but doesn’t yet know which ones will be placed on it. Hierarchical structure helps make inference tractable because subparts of the world interact mostly with other subparts of the same part: neighbors talk more to each other than to people in another country, molecules produced in one cell react mostly with other molecules in that cell, and so on.
Another property of the world that makes learning and inference easier is that the entities in it don’t come in arbitrary forms. Rather, they fall into classes and subclasses, with members of the same class being more alike than members of different ones. Alive or inanimate, animal or plant, bird or mammal, human or not: if we know all the distinctions relevant to the question at hand, we can lump together all the entities that lack them and that can save a lot of time. As before, the MLN doesn’t have to know a priori what the classes in the world are; it can learn them from data by hierarchical clustering.
Step-by-step, we can learn larger and larger MLNs, until we’re doing what a friend of mine at Google calls “planetary-scale machine learning”: modeling everyone in the world at once, with data continually streaming in and answers streaming out.
This Is the World on Machine Learning
Your digital future begins with a realization: every time you interact with a computer—whether it’s your smart phone or a server thousands of miles away—you do so on two levels. The first one is getting what you want there and then: an answer to a question, a product you want to buy, a new credit card. The second level, and in the long run the most important one, is teaching the computer about you. The more you teach it, the better it can serve you—or manipulate you. Life is a game between you and the learners that surround you. You can refuse to play, but then you’ll have to live a twentieth-century life in the twenty-first. Or you can play to win. What model of you do you want the computer to have? And what data can you give it that will produce that model? Those two questions should always be in the back of your mind whenever you interact with a learning algorithm—as they are when you interact with other people.
If a system keeps recommending the wrong things to you, try teaching it by finding and clicking on a bunch of the right ones and come back later to see if it did.
Suppose you took all your data and gave it to the—real, future—Master Algorithm, already seeded with everything we could teach it about human life. It would learn a model of you, and you could carry that model in a thumb drive in your pocket, inspect it at will, and use it for everything you pleased. It would surely be a wonderful tool for introspection, like looking at yourself in the mirror, but it would be a digital mirror that showed not just your looks but all things observable about you—a mirror that could come alive and converse with you. What would you ask it? Some of the answers you might not like, but that would be all the more reason to ponder them. And some would give you new ideas, new directions. The Master Algorithm’s model of you might even help you become a better person.
Self-improvement aside, probably the first thing you’d want your model to do is negotiate the world on your behalf: let it loose in cyberspace, looking for all sorts of things for you. From all the world’s books, it would suggest a dozen you might want to read next, with more insight than Amazon could dream of. Likewise for movies, music, games, clothes, electronics—you name it. It would keep your refrigerator stocked at all times, natch. It would filter your e-mail, voice mail, Facebook news, and Twitter feed and, when appropriate, reply on your behalf. It would take care of all the little annoyances of modern life for you, like checking credit-card bills, disputing improper charges, making arrangements, renewing subscriptions, and filling out tax returns. It would find a remedy for your ailment, run it by your doctor, and order it from Walgreens. It would bring interesting job opportunities to your attention, propose vacation spots, suggest which candidates to vote for on the ballot, and screen potential dates. And, after the match was made, it would team up with your date’s model to pick some restaurants you might both like. Which is where things start to get really interesting.
In this rapidly approaching future, you’re not going to be the only one with a “digital half” doing your bidding twenty-four hours a day. Everyone will have a detailed model of him- or herself, and these models will talk to each other all the time. If you’re looking for a job and company X is looking to hire, its model will interview your model. It will be a lot like a real, flesh-and-blood interview—your model will still be well advised to not volunteer negative information about you, and so on—but it will take only a fraction of a second. You’ll click on “Find Job” in your future LinkedIn account, and you’ll immediately interview for every job in the universe that remotely fits your parameters (profession, location, pay, etc.). LinkedIn will respond on the spot with a ranked list of the best prospects, and out of those, you’ll pick the first company that you want to have a chat with. Same with dating: your model will go on millions of dates so you don’t have to, and come Saturday, you’ll meet your top prospects at an OkCupid-organized party, knowing that you’re also one of their top prospects—and knowing, of course, that their other top prospects are also in the room. It’s sure to be an interesting night. In the world of the Master Algorithm, “my people will call your people” becomes “my program will call your program.” Everyone has an entourage of bots, smoothing his or her way through the world. Deals get pitched, terms negotiated, arrangements made, all before you lift a finger.
The process doesn’t end when you find a car, a house, a doctor, a date, or a job. Your digital half is continually learning from its experiences, just as you would. It figures out what works and doesn’t, whether it’s in job interviews, dating, or real-estate hunting. It learns about the people and organizations it interacts with on your behalf and then (even more important) from your real-world interactions with them. It predicted Alice would be a great date for you, but you had an awkward time, so it hypothesizes possible reasons, which it will test on your next round of dating. It shares its most important findings with you. (“You believe you like X, but in reality you tend to go for Y.”)
Your digital half has a model of the world: not just of the world in general but of the world as it relates to you. At the same time, of course, everyone else also has a continually evolving model of his or her world. Every party to an interaction learns from it and applies what it’s learned to its next interactions. You have your model of every person and organization you interact with, and they each have their model of you. As the models improve, their interactions become more and more like the ones you would have in the real world—except millions of times faster and in silicon. Tomorrow’s cyberspace will be a vast parallel world that selects only the most promising things to try out in the real one. It will be like a new, global subconscious, the collective id of the human race.
But who should you share data with? That’s perhaps the most important question of the twenty-first century. No one has anything even approaching a complete picture of you. That’s both good and bad. Good because if someone did, they’d have far too much power. Bad because as long as that’s the case there can be no 360-degree model of you. What you really want is a digital you that you’re the sole owner of and that others can access only on your terms.
All sorts of questions about society and policy can potentially be answered by learning from the data we generate in our daily lives. Social science is entering a golden age, where it finally has data commensurate with the complexity of the phenomena it studies, and the benefits to all of us could be enormous—provided the data is accessible to researchers, policy makers, and citizens. This does not mean letting others peek into your private life; it means letting them see the learned models, which should contain only statistical information. So between you and them there needs to be an honest data broker that guarantees your data won’t be misused, but also that no free riders share the benefits without sharing the data. In
These problems all have a common solution: a new type of company that is to your data what your bank is to your money. Banks don’t steal your money (with rare exceptions). They’re supposed to invest it wisely, and your deposits are FDIC-insured. Many companies today offer to consolidate your data somewhere in the cloud, but they’re still a far cry from your personal data bank. If they’re cloud providers, they try to lock you in—a big no-no. (Imagine depositing your money with Bank of America and not knowing if you’ll be able to transfer it to Wells Fargo somewhere down the line.) Some startups offer to hoard your data and then mete it out to advertisers in return for discounts, but to me that misses the point. Sometimes you want to give information to advertisers for free because it’s in your interests, sometimes you don’t want to give it at all, and what to share when is a problem that only a good model of you can solve.
The kind of company I’m envisaging would do several things in return for a subscription fee. It would anonymize your online interactions, routing them through its servers and aggregating them with its other users’. It would store all the data from all your life in one place—down to your 24/7 Google Glass video stream, if you ever get one. It would learn a complete model of you and your world and continually update it. And it would use the model on your behalf, always doing exactly what you would, to the best of the model’s ability. The company’s basic commitment to you is that your data and your model will never be used against your interests. Such a guarantee can never be foolproof—you yourself are not guaranteed to never do anything against your interests, after all. But the company’s life would depend on it as much as a bank’s depends on the guarantee that it won’t lose your money, so you should be able to trust it as much as you trust your bank.
One possible showstopper is that the government may subpoena your data or even preventively jail you, Minority Report–style, if your model looks like a criminal’s. To forestall that, your data company can keep everything encrypted, with the key in your possession. (These days you can even compute over encrypted data without ever decrypting it.) Or you can keep it all in your hard disk at home, and the company just rents you the software. If you don’t like the idea of a profit-making entity holding the keys to your kingdom, you can join a data union instead. A data union lets its members bargain on equal terms with companies about the use of their data.
Companies that host the digital you and data unions are what a mature future of data in society looks like to me. Whether we’ll get there is an open question. Today, most people are unaware of both how much data about them is being gathered and what the potential costs and benefits are. Companies seem content to continue doing it under the radar, terrified of a blowup. But sooner or later a blowup will happen, and in the ensuing fracas, draconian laws will be passed that in the end will serve no one. Better to foster awareness now and let everyone make their individual choices about what to share, what not, and how and where.
The best way to not lose your job is to automate it yourself. Then you’ll have time for all the parts of it that you didn’t before and that a computer won’t be able to do any time soon. (If there aren’t any, stay ahead of the curve and get a new job now.) If a computer has learned to do your job, don’t try to compete with it; harness it. Think of big data as an extension of your senses and learning algorithms as an extension of your brain. It’s not man versus machine; it’s man with machine versus man without. Data and intuition are like horse and rider, and you don’t try to outrun a horse; you ride it.
As the boundary between automatable and nonautomatable jobs advances across the economic landscape, what we’ll likely see is unemployment creeping up, downward pressure on the wages of more and more professions, and increasing rewards for the fewer and fewer that can’t yet be automated. When the unemployment rate rises above 50 percent, or even before, attitudes about redistribution will radically change. The newly unemployed majority will vote for generous lifetime unemployment benefits and the sky-high taxes needed to fund them. These won’t break the bank because machines will do the necessary production. Liberals and conservatives will still fight about the tax rate, but the goalposts will have permanently moved. With the total value of labor greatly reduced, the wealthiest nations will be those with the highest ratio of natural resources to population. (Move to Canada now.) For those of us not working, life will not be meaningless, any more than life on a tropical island where nature’s bounty meets all needs is meaningless. A gift economy will develop, of which the open-source software movement is a preview. People will seek meaning in human relationships, self-actualization, and spirituality, much as they do now. The need to earn a living will be a distant memory, another piece of humanity’s barbaric past that we rose above.
It’s natural to worry about intelligent machines taking over because the only intelligent entities we know are humans and other animals, and they definitely have a will of their own. But there is no necessary connection between intelligence and autonomous will; or rather, intelligence and will may not inhabit the same body, provided there is a line of control between them. In The Extended Phenotype, Richard Dawkins shows how nature is replete with examples of an animal’s genes controlling more than its own body, from cuckoo eggs to beaver dams. Technology is the extended phenotype of man. This means we can continue to control even if it becomes far more complex than we can understand.
The next step in the insidious progression of AI control is letting them make all the decisions because they’re, well, so much smarter. Beware. They may be smarter, but they’re in the service of whoever designed their score functions. This is the “Wizard of Oz” problem. Your job in a world of intelligent machines is to keep making sure they do what you want, both at the input (setting the goals) and at the output (checking that you got what you asked for). If you don’t, somebody else will. Machines can help us figure out collectively what we want, but if you don’t participate, you lose out—just like democracy, only more so. Contrary to what we like to believe today, humans quite easily fall into obeying others, and any sufficiently advanced AI is indistinguishable from God. People won’t necessarily mind taking their marching orders from some vast oracular computer; the question is who oversees the overseer. Is AI the road to a more perfect democracy or to a more insidious dictatorship? The eternal vigil has just begun.
The trajectory we’re on is not a singularity but a phase transition. Its critical point—the Turing point—will come when machine learning overtakes the natural variety. Natural learning itself has gone through three phases: evolution, the brain, and culture. Each is a product of the previous one, and each learns faster. Machine learning is the logical next stage of this progression. Computer programs are the fastest replicators on Earth: copying them takes only a fraction of a second. But creating them is slow, if it has to be done by humans. Machine learning removes that bottleneck, leaving a final one: the speed at which humans can absorb change. This too will eventually be removed, but not because we’ll decide to hand things off to our “mind children,” as Hans Moravec calls them, and go gently into the good night. Humans are not a dying twig on the tree of life. On the contrary, we’re about to start branching. In the same way that culture coevolved with larger brains, we will coevolve with our creations. We always have: humans would be physically different if we had not invented fire or spears. We are Homo technicus as much as Homo sapiens.