Algorithms to Live By: The Computer Science of Human Decisions - by Brian Christian and Tom Griffiths

SEARCHING: THE SECRETARY PROBLEM

In any optimal stopping problem, the crucial dilemma is not which option to pick, but how many options to even consider.

In your search for a secretary, there are two ways you can fail: stopping early and stopping late. When you stop too early, you leave the best applicant undiscovered. When you stop too late, you hold out for a better applicant who doesn't exist. The optimal strategy will clearly require finding the right balance between the two, walking the tightrope between looking too much and not enough.

The optimal solution takes the form of what we'll call the Look-Then-Leap Rule: You set a predetermined amount of time for "looking"--that is, exploring your options, gathering data--in which you categorically don't choose anyone, no matter how impressive. After that point, you enter the "leap" phase, prepared to instantly commit to anyone who outshines the best applicant you saw in the look phase.

The 37% Rule: look at the first 37% of the applicants, choosing none, then be ready to leap for anyone better than all those you've seen so far. [Just a hair under 37%, actually. To be precise, the mathematically optimal proportion of applicants to look at is 1/e]

As it turns out, following this optimal strategy ultimately gives us a 37% chance of hiring the best applicant; it's one of the problem's curious mathematical symmetries that the strategy itself and its chance of success work out to the very same number.
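
Both 37% figures are easy to check by simulation. A minimal sketch in Python, assuming applicant quality is just a random score with no ties:

import random

def simulate_secretary(n=100, look_fraction=0.37, trials=100_000):
    """How often does Look-Then-Leap hire the single best applicant?"""
    cutoff = int(n * look_fraction)
    successes = 0
    for _ in range(trials):
        applicants = [random.random() for _ in range(n)]
        best_seen = max(applicants[:cutoff])
        # Leap at the first applicant who beats the look phase;
        # if nobody does, we're stuck with the last one.
        chosen = next((a for a in applicants[cutoff:] if a > best_seen),
                      applicants[-1])
        successes += chosen == max(applicants)
    return successes / trials

print(simulate_secretary())   # hovers around 0.37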

There's a certain flexibility in the 37% Rule: it can be applied to either the number of applicants or the time over which one is searching.

If you can recall previous applicants, the optimal algorithm puts a twist on the familiar Look-Then-Leap Rule: a longer noncommittal period, and a fallback plan. For example, assume an immediate proposal is a sure thing but belated proposals are rejected half the time. Then the math says you should keep looking noncommittally until you've seen 61% of applicants, and then only leap if someone in the remaining 39% of the pool proves to be the best yet. If you're still single after considering all the possibilities--as Kepler was--then go back to the best one that got away. The symmetry between strategy and outcome holds in this case once again, with your chances of ending up with the best applicant under this second-chances-allowed scenario also being 61%.

Full information means that we don't need to look before we leap. We can instead use the Threshold Rule, where we immediately accept an applicant if she is above a certain percentile. We don't need to look at an initial group of candidates to set this threshold--but we do need to be keenly aware of how much looking remains available.
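
A sketch of what a Threshold Rule looks like in code. The exact optimal percentiles require a separate calculation that isn't reproduced here; the threshold below is an illustrative stand-in (accept when the chance that nobody remaining beats this applicant is at least one half), but it shows the structural point: the bar falls as fewer applicants remain.

def threshold(remaining):
    """Illustrative threshold (not the book's exact figures): accept percentile x
    if the chance that none of the `remaining` applicants beats it,
    x ** remaining, is at least one half."""
    return 0.5 ** (1 / remaining)

def full_information_search(percentiles):
    """Scores arrive as percentiles of the overall population (full information);
    leap as soon as one clears the threshold for the applicants still to come."""
    n = len(percentiles)
    for i, x in enumerate(percentiles):
        remaining = n - i - 1
        if remaining == 0 or x >= threshold(remaining):
            return i, x   # the last applicant is always accepted

print(full_information_search([0.62, 0.91, 0.50, 0.88]))   # leaps at the 91st-percentile applicant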

The full-information game thus offers an unexpected and somewhat bizarre takeaway. Gold digging is more likely to succeed than a quest for love. If you're evaluating your partners based on any kind of objective criterion--say, their income percentile--then you've got a lot more information at your disposal than if you're after a nebulous emotional response ("love") that might require both experience and comparison to calibrate. Of course, there's no reason that net worth--or, for that matter, typing speed--needs to be the thing that you're measuring. Any yardstick that provides full information on where an applicant stands relative to the population at large will change the solution from the Look-Then-Leap Rule to the Threshold Rule and will dramatically boost your chances of finding the single best applicant in the group.

The "endogenous" time costs of searching, which aren't usually captured by optimal stopping models, might thus provide an explanation for why human decision-making routinely diverges from the prescriptions of those models. As optimal stopping researcher Neil Bearden puts it, "After searching for a while, we humans just tend to get bored. It's not irrational to get bored, but it's hard to model that rigorously." But this doesn't make optimal stopping problems less important; it actually makes them more important, because the flow of time turns all decision-making into optimal stopping.

Viewed this way, the secretary problem's most fundamental yet most unbelievable assumption--its strict seriality, its inexorable one-way march--is revealed to be the nature of time itself. As such, the explicit premise of the optimal stopping problem is the implicit premise of what it is to be alive. It's this that forces us to decide based on possibilities we've not yet seen, this that forces us to embrace high rates of failure even when acting optimally. No choice recurs. We may get similar choices again, but never that exact one. Hesitation--inaction--is just as irrevocable as action. What the motorist, locked on the one-way road, is to space, we are to the fourth dimension: we truly pass this way but once.

EXPLORE/EXPLOIT: THE LATEST VS. THE GREATEST

Simply put, exploration is gathering information, and exploitation is using the information you have to get a known good result.

When balancing favorite experiences and new ones, nothing matters as much as the interval over which we plan to enjoy them.

A sobering property of trying new things is that the value of exploration, of finding a new favorite, can only go down over time, as the remaining opportunities to savor it dwindle. Discovering an enchanting café on your last night in town doesn't give you the opportunity to return. The flip side is that the value of exploitation can only go up over time. The loveliest café that you know about today is, by definition, at least as lovely as the loveliest café you knew about last month. (And if you've found another favorite since then, it might just be more so.) So explore when you will have time to use the resulting knowledge, exploit when you're ready to cash in. The interval makes the strategy.

Herbert Robbins proposed a solution called the Win-Stay, Lose-Shift algorithm: choose an arm at random, and keep pulling it as long as it keeps paying off. If the arm doesn't pay off after a particular pull, then switch to the other one. Although this simple strategy is far from a complete solution, Robbins proved in 1952 that it performs reliably better than chance.
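
As a two-armed-bandit sketch (the payoff probabilities are made up for illustration):

import random

def win_stay_lose_shift(payoff_probs, pulls=1000):
    """Two arms: keep pulling the current arm after every win; switch after a loss."""
    arm = random.randrange(2)
    wins = 0
    for _ in range(pulls):
        if random.random() < payoff_probs[arm]:
            wins += 1            # win: stay
        else:
            arm = 1 - arm        # lose: shift to the other arm
    return wins

print(win_stay_lose_shift([0.6, 0.4]))   # reliably better than pulling arms at random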

The Gittins index, then, provides a formal, rigorous justification for preferring the unknown, provided we have some opportunity to exploit the results of what we learn from exploring. The old adage tells us that "the grass is always greener on the other side of the fence," but the math tells us why: the unknown has a chance of being better, even if we actually expect it to be no different, or if it's just as likely to be worse. The untested rookie is worth more (early in the season, anyway) than the veteran of seemingly equal ability, precisely because we know less about him. Exploration in itself has value, since trying new things increases our chances of finding the best. So taking the future into account, rather than focusing just on the present, drives us toward novelty.

Regret and Optimism

If the Gittins index is too complicated, or if you're not in a situation well characterized by geometric discounting, then you have another option: focus on regret. When we choose what to eat, who to spend time with, or what city to live in, regret looms large--presented with a set of good options, it is easy to torture ourselves with the consequences of making the wrong choice. These regrets are often about the things we failed to do, the options we never tried.

Researchers in recent decades have set about looking for algorithms that offer the guarantee of minimal regret. Of the ones they've discovered, the most popular are known as Upper Confidence Bound algorithms.

Visual displays of statistics often include so-called error bars that extend above and below any data point, indicating uncertainty in the measurement; the error bars show the range of plausible values that the quantity being measured could actually have. This range is known as the "confidence interval," and as we gain more data about something the confidence interval will shrink, reflecting an increasingly accurate assessment. (For instance, a slot machine that has paid out once out of two pulls will have a wider confidence interval than a machine that has paid out five times on ten pulls, even though the two have the same expected value.) In a multi-armed bandit problem, an Upper Confidence Bound algorithm says, quite simply, to pick the option for which the top of the confidence interval is highest.
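
One widely used member of this family is the UCB1 rule, which adds an exploration bonus of sqrt(2 ln t / n) to each arm's observed average payout; a minimal sketch (that particular bonus formula is just one common choice):

import math, random

def ucb1(payoff_probs, pulls=10_000):
    """Always pull the arm whose upper confidence bound (average payout plus
    an exploration bonus that shrinks as the arm accumulates data) is highest."""
    arms = range(len(payoff_probs))
    counts = [0] * len(payoff_probs)
    wins = [0] * len(payoff_probs)
    for t in range(1, pulls + 1):
        if 0 in counts:                     # try every arm once to start
            arm = counts.index(0)
        else:
            arm = max(arms, key=lambda a: wins[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        counts[arm] += 1
        wins[arm] += random.random() < payoff_probs[arm]
    return counts

print(ucb1([0.55, 0.60, 0.40]))   # the 0.60 arm ends up with most of the pulls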

Upper Confidence Bound algorithms implement a principle that has been dubbed "optimism in the face of uncertainty." Optimism, they show, can be perfectly rational. By focusing on the best that an option could be, given the evidence obtained so far, these algorithms give a boost to possibilities we know less about. As a consequence, they naturally inject a dose of exploration into the decision-making process, leaping at new options with enthusiasm because any one of them could be the next big thing.

The success of Upper Confidence Bound algorithms offers a formal justification for the benefit of the doubt. Following the advice of these algorithms, you should be excited to meet new people and try new things--to assume the best about them, in the absence of evidence to the contrary. In the long run, optimism is the best prevention for regret.

In general, it seems that people tend to over-explore--to favor the new disproportionately over the best.

The standard multi-armed bandit problem assumes that the probabilities with which the arms pay off are fixed over time. But that's not necessarily true of airlines, restaurants, or other contexts in which people have to make repeated choices. If the probabilities of a payoff on the different arms change over time--what has been termed a "restless bandit"--the problem becomes much harder.

To live in a restless world requires a certain restlessness in oneself. So long as things continue to change, you must never fully cease exploring. More generally, our intuitions about rationality are too often informed by exploitation rather than exploration. When we talk about decision-making, we usually focus just on the immediate payoff of a single decision--and if you treat every decision as if it were your last, then indeed only exploitation makes sense. But over a lifetime, you're going to make a lot of decisions. And it's actually rational to emphasize exploration--the new rather than the best, the exciting rather than the safe, the random rather than the considered--for many of those choices, particularly earlier in life. What we take to be the caprice of children may be wiser than we know.

Perhaps the deepest insight that comes from thinking about later life as a chance to exploit knowledge acquired over decades is this: life should get better over time. What an explorer trades off for knowledge is pleasure. The Gittins index and the Upper Confidence Bound, as we've seen, inflate the appeal of lesser-known options beyond what we actually expect, since pleasant surprises can pay off many times over. But at the same time, this means that exploration necessarily leads to being let down on most occasions. Shifting the bulk of one's attention to one's favorite things should increase quality of life.

SORTING: MAKING ORDER

Sorting involves steep diseconomies of scale, violating our normal intuitions about the virtues of doing things in bulk. Cooking for two is typically no harder than cooking for one, and it's certainly easier than cooking for one person twice. But sorting, say, a shelf of a hundred books will take you longer than sorting two bookshelves of fifty apiece: you have twice as many things to organize, and there are twice as many places each of them could go. The more you take on, the worse it gets. This is the first and most fundamental insight of sorting theory. Scale hurts.
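
A rough way to see this is to count the comparisons Python's built-in sort performs on one shelf of a hundred versus two shelves of fifty (a sketch; the exact counts vary run to run):

import random

class Counted:
    """Wrap a value so we can count the comparisons a sort performs."""
    comparisons = 0
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

def comparisons_to_sort(items):
    Counted.comparisons = 0
    sorted(Counted(x) for x in items)
    return Counted.comparisons

books = [random.random() for _ in range(100)]
one_big_shelf = comparisons_to_sort(books)
two_small_shelves = comparisons_to_sort(books[:50]) + comparisons_to_sort(books[50:])
print(one_big_shelf, two_small_shelves)   # the single shelf of 100 costs more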

Sort Is Prophylaxis for Search

Computer science, as undergraduates are taught, is all about tradeoffs. And one of the most central tradeoffs is between sorting and searching. The basic principle is this: the effort expended on sorting materials is just a preemptive strike against the effort it'll take to search through them later. What the precise balance should be depends on the exact parameters of the situation, but thinking about sorting as valuable only to support future search tells us something surprising: Err on the side of messiness. Sorting something that you will never search is a complete waste; searching something you never sorted is merely inefficient. The question, of course, becomes how to estimate ahead of time what your future usage will be.

Computer science shows that the hazards of mess and the hazards of order are quantifiable and that their costs can be measured in the same currency: time. Leaving something unsorted might be thought of as an act of procrastination--passing the buck to one's future self, who'll have to pay off with interest what we chose not to pay up front. But the whole story is subtler than that. Sometimes mess is more than just the easy choice. It's the optimal choice.

CACHING: FORGET ABOUT IT

The idea of keeping around pieces of information that you refer to frequently is so powerful that it is used in every aspect of computation. Processors have caches. Hard drives have caches. Operating systems have caches. Web browsers have caches.

Least Recently Used (LRU): evicting the item that's gone the longest untouched (Stewart's "When was the last time I wore it or used it?").
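
A minimal LRU cache sketch, with the closet as the toy example:

from collections import OrderedDict

class LRUCache:
    """Hold at most `capacity` items; when full, evict whatever has gone
    the longest without being used."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)      # touched: now most recently used
        return self.items.get(key)
    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used

closet = LRUCache(capacity=2)
closet.put("college t-shirt", "wear sometimes")
closet.put("plaid pants", "never wear")
closet.get("college t-shirt")                # worn again recently
closet.put("rain jacket", "new")             # the plaid pants get evicted
print(list(closet.items))                    # ['college t-shirt', 'rain jacket']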

Caching on the Home Front

First, when you are deciding what to keep and what to throw away, LRU is potentially a good principle to use--much better than FIFO. You shouldn't necessarily toss that T-shirt from college if you still wear it every now and then. But the plaid pants you haven't worn in ages? Those can be somebody else's thrift-store bonanza. Second, exploit geography. Make sure things are in whatever cache is closest to the place where they're typically used. This isn't a concrete recommendation in most home-organization books, but it consistently turns up in the schemes that actual people describe as working well for them. "I keep running and exercise gear in a crate on the floor of my front coat closet," says one person quoted in Julie Morgenstern's Organizing from the Inside Out, for instance. "I like having it close to the front door."

A final insight, which hasn't yet made it into guides on closet organization, is that of the multi-level memory hierarchy. Having a cache is efficient, but having multiple levels of caches--from smallest and fastest to largest and slowest--can be even better. Where your belongings are concerned, your closet is one cache level, your basement another, and a self-storage locker a third. (These are in decreasing order of access speed, of course, so you should use the LRU principle as the basis for deciding what gets evicted from each level to the next.) But you might also be able to speed things up by adding yet another level of caching: an even smaller, faster, closer one than your closet.

Recent work by a team of psychologists and linguists led by Michael Ramscar at the University of Tübingen has suggested that what we call "cognitive decline"--lags and retrieval errors--may not be about the search process slowing or deteriorating, but (at least partly) an unavoidable consequence of the amount of information we have to navigate getting bigger and bigger. Regardless of whatever other challenges aging brings, older brains--which must manage a greater store of memories--are literally solving harder computational problems with every passing day.

No matter how good your organization scheme is, having to search through more things will inevitably take longer. It's not that we're forgetting; it's that we're remembering. We're becoming archives.

SCHEDULING: FIRST THINGS FIRST

Thus we encounter the first lesson in single-machine scheduling literally before we even begin: make your goals explicit. We can't declare some schedule a winner until we know how we're keeping score. This is something of a theme in computer science: before you can have a plan, you must first choose a metric.

If you're concerned with minimizing maximum lateness, then the best strategy is to start with the task due soonest and work your way toward the task due last. This strategy, known as Earliest Due Date, is fairly intuitive.

Minimizing the sum of completion times leads to a very simple optimal algorithm called Shortest Processing Time: always do the quickest task you can. Even if you don't have impatient clients hanging on every job, Shortest Processing Time gets things done. (Perhaps it's no surprise that it is compatible with the recommendation in Getting Things Done to immediately perform any task that takes less than two minutes.) Again, there's no way to change the total amount of time your work will take you, but Shortest Processing Time may ease your mind by shrinking the number of outstanding tasks as quickly as possible. Its sum-of-completion-times metric can be expressed another way: it's like focusing above all on reducing the length of your to-do list. If each piece of unfinished business is like a thorn in your side, then racing through the easiest items may bring some measure of relief.
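
Both strategies amount to a single sort over the task list; a sketch with made-up tasks of the form (name, hours, due day):

tasks = [("taxes", 5, 7), ("email", 1, 14), ("report", 3, 2)]

by_earliest_due_date = sorted(tasks, key=lambda t: t[2])          # minimizes maximum lateness
by_shortest_processing_time = sorted(tasks, key=lambda t: t[1])   # minimizes sum of completion times

print([name for name, _, _ in by_earliest_due_date])          # ['report', 'taxes', 'email']
print([name for name, _, _ in by_shortest_processing_time])   # ['email', 'report', 'taxes']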

A commitment to fastidiously doing the most important thing you can, if pursued in a head-down, myopic fashion, can lead to what looks for all the world like procrastination. As with a car spinning its tires, the very desire to make immediate progress is how one gets stuck. "Things which matter most must never be at the mercy of things which matter least," Goethe allegedly proclaimed; but while that has the ring of wisdom about it, sometimes it's just not true. Sometimes that which matters most cannot be done until that which matters least is finished, so there's no choice but to treat that unimportant thing as being every bit as important as whatever it's blocking.

Preemption and Uncertainty

Minimizing maximum lateness (for serving customers in a coffee shop) or the sum of completion times (for rapidly shortening your to-do list) both cross the line into intractability if some tasks can't be started until a particular time. But they return to having efficient solutions once preemption is allowed. In both cases, the classic strategies--Earliest Due Date and Shortest Processing Time, respectively--remain the best, with a fairly straightforward modification. When a task's starting time comes, compare that task to the one currently under way. If you're working by Earliest Due Date and the new task is due even sooner than the current one, switch gears; otherwise stay the course. Likewise, if you're working by Shortest Processing Time, and the new task can be finished faster than the current one, pause to take care of it first; otherwise, continue with what you were doing.

The weighted version of Shortest Processing Time is a pretty good candidate for best general-purpose scheduling strategy in the face of uncertainty. It offers a simple prescription for time management: each time a new piece of work comes in, divide its importance by the amount of time it will take to complete. If that figure is higher than for the task you're currently doing, switch to the new one; otherwise stick with the current task. This algorithm is the closest thing that scheduling theory has to a skeleton key or Swiss Army knife, the optimal strategy not just for one flavor of problem but for many.
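
In code, the prescription is a one-line comparison of "densities" (importance divided by time); the names and numbers below are made up:

def should_switch(current, incoming):
    """Weighted Shortest Processing Time: switch if the new task delivers more
    importance per hour than the one under way. Tasks are (name, importance, hours)."""
    density = lambda task: task[1] / task[2]
    return density(incoming) > density(current)

current = ("quarterly report", 8, 4)     # importance 8 over 4 hours: density 2.0
incoming = ("client bug fix", 6, 2)      # importance 6 over 2 hours: density 3.0
print(should_switch(current, incoming))  # True: switch to the bug fix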

Preemption Isn't Free: The Context Switch

Personally, we have found that both programming and writing require keeping in mind the state of the entire system, and thus carry inordinately large context-switching costs. A friend of ours who writes software says that the normal workweek isn't well suited to his workflow, since for him sixteen-hour days are more than twice as productive as eight-hour days. Brian, for his part, thinks of writing as a kind of blacksmithing, where it takes a while just to heat up the metal before it's malleable. He finds it somewhat useless to block out anything less than ninety minutes for writing, as nothing much happens in the first half hour except loading a giant block of "Now, where was I?" into his head.

If you find yourself doing a lot of context switching because you're tackling a heterogeneous collection of short tasks, you can also employ another idea from computer science: "interrupt coalescing." If you have five credit card bills, for instance, don't pay them as they arrive; take care of them all in one go when the fifth bill comes. As long as your bills are never due less than thirty-one days after they arrive, you can designate, say, the first of each month as "bill-paying day," and sit down at that point to process every bill on your desk, no matter whether it came three weeks or three hours ago. Likewise, if none of your email correspondents require you to respond in less than twenty-four hours, you can limit yourself to checking your messages once a day.

Perhaps the patron saint of the minimal-context-switching lifestyle is the legendary programmer Donald Knuth. "I do one thing at a time," he says. "This is what computer scientists call batch processing--the alternative is swapping in and out. I don't swap in and out." Knuth isn't kidding. On January 1, 2014, he embarked on "The TeX Tuneup of 2014," in which he fixed all of the bugs that had been reported in his TeX typesetting software over the previous six years. His report ends with the cheery sign-off "Stay tuned for The TeX Tuneup of 2021!" Likewise, Knuth has not had an email address since 1990. "Email is a wonderful thing for people whose role in life is to be on top of things. But not for me; my role is to be on the bottom of things. What I do takes long hours of studying and uninterruptible concentration."

BAYES'S RULE: PREDICTING THE FUTURE

For any possible drawing of w winning tickets in n attempts, the expectation is simply the number of wins plus one, divided by the number of attempts plus two: (w+1)⁄(n+2).

This incredibly simple scheme for estimating probabilities is known as Laplace's Law, and it is easy to apply in any situation where you need to assess the chances of an event based on its history. If you make ten attempts at something and five of them succeed, Laplace's Law estimates your overall chances to be 6/12, or 50%, consistent with our intuitions.

Laplace's Law offers us the first simple rule of thumb for confronting small data in the real world. Even when we've made only a few observations--or only one--it offers us practical guidance. Want to calculate the chance your bus is late? The chance your softball team will win? Count the number of times it has happened in the past plus one, then divide by the number of opportunities plus two. And the beauty of Laplace's Law is that it works equally well whether we have a single data point or millions of them.
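
The rule is a one-liner:

def laplace(successes, attempts):
    """Laplace's Law: estimated chance of success on the next try."""
    return (successes + 1) / (attempts + 2)

print(laplace(5, 10))   # 0.5  -- five wins in ten attempts
print(laplace(1, 1))    # 2/3  -- a single observation still gives usable guidance
print(laplace(0, 0))    # 0.5  -- no data at all: an even chance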

Bayes's Rule gives a remarkably straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: multiply their probabilities together. Notably, having some preexisting beliefs is crucial for this formula to work. If your friend simply approached you and said, "I flipped one coin from this bag and it came up heads. How likely do you think it is that this is a fair coin?," you would be totally unable to answer that question unless you had at least some sense of what coins were in the bag to begin with. (You can't multiply the two probabilities together when you don't have one of them.) This sense of what was "in the bag" before the coin flip--the chances for each hypothesis to have been true before you saw any data--is known as the prior probabilities, or "prior" for short. And Bayes's Rule always needs some prior from you, even if it's only a guess.
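
A minimal worked example, assuming (hypothetically) that the bag holds nine fair coins and one two-headed coin:

priors = {"fair": 0.9, "two-headed": 0.1}               # what's "in the bag"
likelihood_of_heads = {"fair": 0.5, "two-headed": 1.0}  # chance each coin produces the data

# Bayes's Rule: posterior is proportional to prior times likelihood.
unnormalized = {h: priors[h] * likelihood_of_heads[h] for h in priors}
total = sum(unnormalized.values())
posterior = {h: p / total for h, p in unnormalized.items()}
print(posterior)   # {'fair': 0.818..., 'two-headed': 0.181...}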

The Copernican Principle emerges: if we want to predict how long something will last, and have no other knowledge about it whatsoever, the best guess we can make is that it will continue just as long as it's gone on so far.

The richer the prior information we bring to Bayes's Rule, the more useful the predictions we can get out of it. Bayes's Rule tells us that when it comes to making predictions based on limited evidence, few things are as important as having good priors--that is, a sense of the distribution from which we expect that evidence to have come. Good predictions thus begin with having good instincts about when we're dealing with a normal distribution and when with a power-law distribution. As it turns out, Bayes's Rule offers us a simple but dramatically different predictive rule of thumb for each.

Examining the Copernican Principle, we saw that when Bayes's Rule is given an uninformative prior, it always predicts that the total life span of an object will be exactly double its current age. In fact, the uninformative prior, with its wildly varying possible scales--the wall that might last for months or for millennia--is a power-law distribution. And for any power-law distribution, Bayes's Rule indicates that the appropriate prediction strategy is a Multiplicative Rule: multiply the quantity observed so far by some constant factor. For an uninformative prior, that constant factor happens to be 2, hence the Copernican prediction; in other power-law cases, the multiplier will depend on the exact distribution you're working with.

When we apply Bayes's Rule with a normal distribution as a prior, on the other hand, we obtain a very different kind of guidance. Instead of a multiplicative rule, we get an Average Rule: use the distribution's "natural" average--its single, specific scale--as your guide. For instance, if somebody is younger than the average life span, then simply predict the average; as their age gets close to and then exceeds the average, predict that they'll live a few years more. Following this rule gives reasonable predictions for the 90-year-old and the 6-year-old: 94 and 77, respectively.

Something normally distributed that's gone on seemingly too long is bound to end shortly; but the longer something in a power-law distribution has gone on, the longer you can expect it to keep going.

Between those two extremes, there's actually a third category of things in life: those that are neither more nor less likely to end just because they've gone on for a while. Sometimes things are simply … invariant.

The Danish mathematician Agner Krarup Erlang, who studied such phenomena, formalized the spread of intervals between independent events into the function that now carries his name: the Erlang distribution. The Erlang distribution gives us a third kind of prediction rule, the Additive Rule: always predict that things will go on just a constant amount longer.

Distributions that yield the same prediction, no matter their history or current state, are known to statisticians as "memoryless."

These three very different patterns of optimal prediction--the Multiplicative, Average, and Additive Rules--all result directly from applying Bayes's Rule to the power-law, normal, and Erlang distributions, respectively. And given the way those predictions come out, the three distributions offer us different guidance, too, on how surprised we should be by certain events. In a power-law distribution, the longer something has gone on, the longer we expect it to continue going on. So a power-law event is more surprising the longer we've been waiting for it--and maximally surprising right before it happens. A nation, corporation, or institution only grows more venerable with each passing year, so it's always stunning when it collapses. In a normal distribution, events are surprising when they're early--since we expected them to reach the average--but not when they're late. Indeed, by that point they seem overdue to happen, so the longer we wait, the more we expect them. And in an Erlang distribution, events by definition are never any more or less surprising no matter when they occur. Any state of affairs is always equally likely to end regardless of how long it's lasted.
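
A hedged sketch of the three rules as functions. The constants below are illustrative stand-ins chosen to reproduce the numbers quoted above, not figures computed from real life tables or box-office data:

def multiplicative_rule(observed_so_far, factor=2.0):
    """Power-law prior: predict a constant multiple of what you've seen;
    a factor of 2 is the Copernican, uninformative-prior case."""
    return factor * observed_so_far

def average_rule(observed_so_far, typical_value, margin=4):
    """Normal prior: predict the distribution's natural average, or a little
    more than the observed value once it has already passed that average."""
    return max(typical_value, observed_so_far + margin)

def additive_rule(observed_so_far, constant):
    """Erlang (memoryless) prior: always predict a constant amount more."""
    return observed_so_far + constant

print(multiplicative_rule(50))             # a 50-year-old institution: guess 100
print(average_rule(6, typical_value=77))   # the 6-year-old: predict the average, 77
print(average_rule(90, typical_value=77))  # the 90-year-old: a few years more, 94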

Small Data and the Mind

Small data is big data in disguise. The reason we can often make good predictions from a small number of observations--or just a single one--is that our priors are so rich. Whether we know it or not, we appear to carry around in our heads surprisingly accurate priors about movie grosses and running times, poem lengths, and political terms of office, not to mention human life spans. We don't need to gather them explicitly; we absorb them from the world.

Priors in the Age of Mechanical Reproduction

Being a good Bayesian means representing the world in the correct proportions--having good priors, appropriately calibrated. By and large, for humans and other animals this happens naturally; as a rule, when something surprises us, it ought to surprise us, and when it doesn't, it ought not to. Everything starts to break down, however, when a species gains language. What we talk about isn't what we experience--we speak chiefly of interesting things, and those tend to be things that are uncommon. More or less by definition, events are always experienced at their proper frequencies, but this isn't at all true of language. Anyone who has experienced a snake bite or a lightning strike will tend to retell those singular stories for the rest of their lives. And those stories will be so salient that they will be picked up and retold by others. There's a curious tension, then, between communicating with others and maintaining accurate priors about the world. When people talk about what interests them--and offer stories they think their listeners will find interesting--it skews the statistics of our experience. That makes it hard to maintain appropriate prior distributions. And the challenge has only increased with the development of the printing press, the nightly news, and social media--innovations that allow our species to spread language mechanically.

Simply put, the representation of events in the media does not track their frequency in the world. If you want to be a good intuitive Bayesian--if you want to naturally make good predictions, without having to think about what kind of prediction rule is appropriate--you need to protect your priors. Counterintuitively, that might mean turning off the news.

OVERFITTING: WHEN TO THINK LESS

It is indeed true that including more factors in a model will always, by definition, make it a better fit for the data we have already. But a better fit for the available data does not necessarily mean a better prediction. This is what statisticians call overfitting. So one of the deepest truths of machine learning is that, in fact, it's not always better to use a more complex model, one that takes a greater number of factors into account. And the issue is not just that the extra factors might offer diminishing returns--performing better than a simpler model, but not enough to justify the added complexity. Rather, they might make our predictions dramatically worse.

Fundamentally, overfitting is a kind of idolatry of data, a consequence of focusing on what we've been able to measure rather than what matters.

It's incredibly difficult to come up with incentives or measurements that do not have some kind of perverse effect.

Detecting Overfitting: Cross-Validation

Simply put, Cross-Validation means assessing not only how well a model fits the data it's given, but how well it generalizes to data it hasn't seen. Paradoxically, this may involve using less data. In the marriage example, we might "hold back," say, two points at random, and fit our models only to the other eight. We'd then take those two test points and use them to gauge how well our various functions generalize beyond the eight "training" points they've been given. The two held-back points function as canaries in the coal mine: if a complex model nails the eight training points but wildly misses the two test points, it's a good bet that overfitting is at work. Aside from withholding some of the available data points, it is also useful to consider testing the model with data derived from some other form of evaluation entirely.
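
A sketch with the numbers from the example: ten points, two held back at random, and a straight line pitted against a wiggly high-degree polynomial (the data are made up; numpy does the fitting):

import random
import numpy as np

def cross_validation_error(xs, ys, degree, held_out=2):
    """Fit a polynomial of `degree` to all but `held_out` random points,
    then score it on the points it never saw."""
    indices = list(range(len(xs)))
    random.shuffle(indices)
    test, train = indices[:held_out], indices[held_out:]
    coeffs = np.polyfit(xs[train], ys[train], degree)
    return float(np.mean((np.polyval(coeffs, xs[test]) - ys[test]) ** 2))

xs = np.linspace(0, 10, 10)
ys = 2 * xs + 1 + np.random.normal(0, 1, 10)   # noisy but basically linear data
for degree in (1, 7):
    print(degree, cross_validation_error(xs, ys, degree))
# the complex fit hugs its training points but tends to miss the held-out ones badly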

How to Combat Overfitting: Penalizing Complexity

One way to choose among several competing models is the Occam's razor principle, which suggests that, all things being equal, the simplest possible hypothesis is probably the correct one. Of course, things are rarely completely equal, so it's not immediately obvious how to apply something like Occam's razor in a mathematical context. Grappling with this challenge in the 1960s, Russian mathematician Andrey Tikhonov proposed one answer: introduce an additional term to your calculations that penalizes more complex solutions. If we introduce a complexity penalty, then more complex models need to do not merely a better job but a significantly better job of explaining the data to justify their greater complexity. Computer scientists refer to this principle--using constraints that penalize models for their complexity--as Regularization.
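
A sketch of one common form of such a penalty: an L2 (ridge-style) cost on polynomial coefficients, in the spirit of Tikhonov's approach. The data and penalty weight are made up for illustration:

import numpy as np

def penalized_polynomial_fit(xs, ys, degree, penalty):
    """Least squares plus a cost on coefficient size: the larger `penalty` is,
    the more a wiggly, complex fit has to earn its keep."""
    X = np.vander(xs, degree + 1)                 # polynomial design matrix
    regularizer = penalty * np.eye(degree + 1)
    return np.linalg.solve(X.T @ X + regularizer, X.T @ ys)

xs = np.linspace(0, 1, 10)
ys = np.sin(2 * np.pi * xs) + np.random.normal(0, 0.1, 10)
print(np.round(penalized_polynomial_fit(xs, ys, 9, penalty=1e-9), 1))  # huge, wild coefficients
print(np.round(penalized_polynomial_fit(xs, ys, 9, penalty=0.1), 1))   # small, tame ones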

The Weight of History

In machine learning, the advantages of moving slowly emerge most concretely in a regularization technique known as Early Stopping. In many situations, tuning the parameters to find the best possible fit for given data is a process in and of itself. What happens if we stop that process early and simply don't allow a model the time to become too complex? Again, what might seem at first blush like being halfhearted or unthorough emerges, instead, as an important strategy in its own right.

The effectiveness of regularization in all kinds of machine-learning tasks suggests that we can make better decisions by deliberately thinking and doing less. If the factors we come up with first are likely to be the most important ones, then beyond a certain point thinking more about a problem is not only going to be a waste of time and effort--it will lead us to worse solutions. Early Stopping provides the foundation for a reasoned argument against reasoning, the thinking person's case against thought.

When to Think Less

As with all issues involving overfitting, how early to stop depends on the gap between what you can measure and what really matters. If you have all the facts, they're free of all error and uncertainty, and you can directly assess whatever is important to you, then don't stop early. Think long and hard: the complexity and effort are appropriate. But that's almost never the case. If you have high uncertainty and limited data, then do stop early by all means. If you don't have a clear read on how your work will be evaluated, and by whom, then it's not worth the extra time to make it perfect with respect to your own (or anyone else's) idiosyncratic guess at what perfection might be. The greater the uncertainty, the bigger the gap between what you can measure and what matters, the more you should watch out for overfitting--that is, the more you should prefer simplicity, and the earlier you should stop. When you're truly in the dark, the best-laid plans will be the simplest. When our expectations are uncertain and the data are noisy, the best bet is to paint with a broad brush, to think in broad strokes. The upshot of Early Stopping is that sometimes it's not a matter of choosing between being rational and going with our first instinct. Going with our first instinct can be the rational solution. The more complex, unstable, and uncertain the decision, the more rational an approach that is.

RELAXATION: LET IT SLIDE

A "constrained optimization" problem: how to find the single best arrangement of a set of variables, given particular rules and a scorekeeping measure ("the traveling salesman problem.")

Defining Difficulty

The Cobham–Edmonds thesis: an algorithm should be considered "efficient" if it runs in what's called "polynomial time"--that is, O(n²), O(n³), or in fact n to the power of any number at all. A problem, in turn, is considered "tractable" if we know how to solve it using an efficient algorithm. A problem we don't know how to solve in polynomial time, on the other hand, is considered "intractable." And at anything but the smallest scales, intractable problems are beyond the reach of solution by computers, no matter how powerful.

One of the simplest forms of relaxation in computer science is known as Constraint Relaxation. In this technique, researchers remove some of the problem's constraints and set about solving the problem they wish they had. Then, after they've made a certain amount of headway, they try to add the constraints back in. That is, they make the problem temporarily easier to handle before bringing it back to reality.

What relaxation cannot do is offer you a guaranteed shortcut to the perfect answer. But computer science can also quantify the tradeoff that relaxation offers between time and solution quality. In many cases, the ratio is dramatic, a no-brainer--for instance, an answer at least half as good as the perfect solution in a quadrillionth of the time. The message is simple but profound: if we're willing to accept solutions that are close enough, then even some of the hairiest problems around can be tamed with the right techniques.

The idea behind Lagrangian Relaxation is simple. An optimization problem has two parts: the rules and the scorekeeping. In Lagrangian Relaxation, we take some of the problem's constraints and bake them into the scoring system instead. That is, we take the impossible and downgrade it to costly. (In a wedding seating optimization, for instance, we might relax the constraint that tables each hold ten people max, allowing overfull tables but with some kind of elbow-room penalty.) When an optimization problem's constraints say "Do it, or else!," Lagrangian Relaxation replies, "Or else what?" Once we can color outside the lines--even just a little bit, and even at a steep cost--problems become tractable that weren't tractable before.
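
A sketch of the wedding-seating example: rather than rejecting any plan with an overfull table, we score it, subtracting an elbow-room penalty for every guest beyond capacity (the scoring details are made up for illustration):

def seating_score(tables, friendship, capacity=10, elbow_room_penalty=5.0):
    """Reward pairs of friends seated together; instead of forbidding overfull
    tables, charge a penalty per extra guest -- "or else what?" made concrete."""
    score = 0.0
    for table in tables:
        for i, a in enumerate(table):
            for b in table[i + 1:]:
                score += friendship.get((a, b), 0) + friendship.get((b, a), 0)
        score -= elbow_room_penalty * max(0, len(table) - capacity)
    return score

friendship = {("Ann", "Bo"): 2, ("Bo", "Cy"): 1}
print(seating_score([["Ann", "Bo", "Cy"]], friendship, capacity=2))   # 3 - 5 = -2.0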

Relaxations offer us a number of advantages. For one, they offer a bound on the quality of the true solution. If we're trying to pack our calendar, imagining that we can magically teleport across town will instantaneously make it clear that eight one-hour meetings is the most we could possibly expect to fit into a day; such a bound might be useful in setting expectations before we face the full problem. Second, relaxations are designed so that they can indeed be reconciled with reality--and this gives us bounds on the solution from the other direction. When Continuous Relaxation tells us to give out fractional vaccines, we can just immunize everyone assigned to receive half a vaccine or more, and end up with an easily calculated solution that requires at worst twice as many inoculations as in a perfect world. Maybe we can live with that.

RANDOMNESS: WHEN TO LEAVE IT TO CHANCE

Sampling: when we want to know something about a complex quantity, we can estimate its value by sampling from it (Monte Carlo Method). A close examination of random samples can be one of the most effective means of making sense of something too complex to be comprehended directly.

Hills, Valleys, and Traps

Once you've assembled a baseline itinerary, you might test some alternatives by making slight perturbations to the city sequence and seeing if that makes an improvement. For instance, if we are going first to Seattle, then to Los Angeles, we can try doing those cities in reverse order: L.A. first, then Seattle. For any given itinerary, we can make eleven such two-city flip-flops; let's say we try them all and then go with the one that gives us the best savings. From here we've got a new itinerary to work with, and we can start permuting that one, again looking for the best local improvement. This is an algorithm known as Hill Climbing--since the search through a space of solutions, some better and some worse, is commonly thought of in terms of a landscape with hills and valleys, where your goal is to reach the highest peak. Eventually you will end up with a solution that is better than all of its permutations; no matter which adjacent stops you flip, nothing beats it. It's here that the hill climbing stops. Does this mean you've definitely found the single best possible itinerary, though? Sadly, no. You may have found only a so-called "local maximum," not the global maximum of all the possibilities.
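
A sketch of Hill Climbing on an itinerary, with made-up cities and random distances (here "climbing" means finding a shorter total route, so the code minimizes):

import random

def tour_length(order, distance):
    return sum(distance[a][b] for a, b in zip(order, order[1:]))

def hill_climb(order, distance):
    """Try every adjacent two-city flip-flop, keep the best one, and repeat
    until no flip-flop improves the itinerary: a local optimum."""
    while True:
        best_length, best_order = tour_length(order, distance), None
        for i in range(len(order) - 1):
            candidate = order[:i] + [order[i + 1], order[i]] + order[i + 2:]
            if tour_length(candidate, distance) < best_length:
                best_length, best_order = tour_length(candidate, distance), candidate
        if best_order is None:
            return order           # no swap helps: the climbing stops here
        order = best_order

cities = ["Seattle", "LA", "Denver", "Chicago", "Austin"]
distance = {a: {b: random.randint(1, 100) for b in cities} for a in cities}
print(hill_climb(cities, distance))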

Out of the Local Maximum

One approach is to augment Hill Climbing with what's known as "jitter": if it looks like you're stuck, mix things up a little. Make a few random small changes (even if they are for the worse), then go back to Hill Climbing; see if you end up at a higher peak. Another approach is to completely scramble our solution when we reach a local maximum, and start Hill Climbing anew from this random new starting point. This algorithm is known, appropriately enough, as "Random-Restart Hill Climbing"--or, more colorfully, as "Shotgun Hill Climbing." It's a strategy that proves very effective when there are lots of local maxima in a problem.

But there's also a third approach: instead of turning to full-bore randomness when you're stuck, use a little bit of randomness every time you make a decision. This technique, developed by the same Los Alamos team that came up with the Monte Carlo Method, is called the Metropolis Algorithm. The Metropolis Algorithm is like Hill Climbing, trying out different small-scale tweaks on a solution, but with one important difference: at any given point, it will potentially accept bad tweaks as well as good ones.
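
The acceptance rule is the whole trick; a sketch for a problem where higher scores are better, with a "temperature" knob controlling how tolerant we are of backward steps:

import math, random

def metropolis_accept(old_score, new_score, temperature=1.0):
    """Always accept an improvement; accept a worse tweak with a probability
    that shrinks the worse the tweak is (and the lower the temperature)."""
    if new_score >= old_score:
        return True
    return random.random() < math.exp((new_score - old_score) / temperature)

print(metropolis_accept(10.0, 9.5))   # sometimes True: a small step downhill is allowed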

When it comes to stimulating creativity, a common technique is introducing a random element, such as a word that people have to form associations with. Musician Brian Eno and artist Peter Schmidt created a deck of cards known as Oblique Strategies for solving creative problems. Pick a card, any card, and you will get a random new perspective on your project. (And if that sounds like too much work, you can now download an app that will pick a card for you.)

Being randomly jittered, thrown out of the frame and focused on a larger scale, provides a way to leave what might be locally good and get back to the pursuit of what might be globally optimal.

NETWORKING: HOW WE CONNECT

In human society, we tend to adopt a policy of giving people some finite number of chances in a row, then giving up entirely. Three strikes, you're out. This pattern prevails by default in almost any situation that requires forgiveness, lenience, or perseverance. Simply put, maybe we're doing it wrong. A friend of ours recently mused about a childhood companion who had a disconcerting habit of flaking on social plans. What to do? Deciding once and for all that she'd finally had enough and giving up entirely on the relationship seemed arbitrary and severe, but continuing to persist in perpetual rescheduling seemed naïve, liable to lead to an endless amount of disappointment and wasted time. Solution: Exponential Backoff on the invitation rate. Try to reschedule in a week, then two, then four, then eight. The rate of "retransmission" goes toward zero--yet you never have to completely give up.
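
The schedule itself is trivial to write down:

def backoff_schedule(first_wait_weeks=1, attempts=6):
    """Exponential Backoff: double the wait after every failed attempt."""
    return [first_wait_weeks * 2 ** n for n in range(attempts)]

print(backoff_schedule())   # [1, 2, 4, 8, 16, 32] -- weeks between invitations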

GAME THEORY: THE MINDS OF OTHERS

Computer scientists have a term for this potentially endless journey into the hall of mirrors, minds simulating minds simulating minds: "recursion."

The Nash equilibrium offers a prediction of the stable long-term outcome of any set of rules or incentives. As such, it provides an invaluable tool for both predicting and shaping economic policy, as well as social policy in general. As Nobel laureate economist Roger Myerson puts it, the Nash equilibrium "has had a fundamental and pervasive impact in economics and the social sciences which is comparable to that of the discovery of the DNA double helix in the biological sciences."

Dominant Strategies, for Better or Worse: Even when we can reach an equilibrium, just because it's stable doesn't make it good. It may seem paradoxical, but the equilibrium strategy--where neither player is willing to change tack--is by no means necessarily the strategy that leads to the best outcomes for the players. Nowhere is that better illustrated than in game theory's most famous, provocative, and controversial two-player game: "the prisoner's dilemma."

The Tragedy of the Commons: So what can we, as players, do when we find ourselves in such a situation--either the two-party prisoner's dilemma, or the multi-party tragedy of the commons? In a sense, nothing. The very stability that these bad equilibria have, the thing that makes them equilibria, becomes damnable. By and large we cannot shift the dominant strategies from within. But this doesn't mean that bad equilibria can't be fixed. It just means that the solution is going to have to come from somewhere else.

Mechanism Design: Change the Game

While game theory asks what behavior will emerge given a set of rules, mechanism design (sometimes called "reverse game theory") works in the other direction, asking: what rules will give us the behavior we want to see? Mechanism design makes a powerful argument for the need for a designer--be it a CEO, a contract binding all parties, or a don who enforces omertà by garroted carotid. Scaling up this logic results in a potent argument for the role of government. In fact, many governments do have laws on the books mandating minimum vacations and limiting shop hours. Laws like these often stem from the colonial era and were initially religious in nature. Indeed, religion itself provides a very direct way of modifying the structure of games of this kind. In particular, a religious law such as "Remember the Sabbath day" neatly solves the problem faced by the shopkeepers, whether enforced by an all-powerful God or by the more proximate members of a religious community. And adding divine force to injunctions against other kinds of antisocial behavior, such as murder, adultery, and theft, is likewise a way to solve some of the game-theoretic problems of living in a social group. God happens to be even better than government in this respect, since omniscience and omnipotence provide a particularly strong guarantee that taking bad actions will have dire consequences.

Information cascades offer a rational theory not only of bubbles, but also of fads and herd behavior more generally. They offer an account of how it's easily possible for any market to spike and collapse, even in the absence of irrationality, malevolence, or malfeasance. The takeaways are several. For one, be wary of cases where public information seems to exceed private information, where you know more about what people are doing than why they're doing it, where you're more concerned with your judgments fitting the consensus than fitting the facts. When you're mostly looking to others to set a course, they may well be looking right back at you to do the same. Second, remember that actions are not beliefs; cascades get caused in part when we misinterpret what others think based on what they do. We should be especially hesitant to overrule our own doubts--and if we do, we might want to find some way to broadcast those doubts even as we move forward, lest others fail to distinguish the reluctance in our minds from the implied enthusiasm in our actions. Last, we should remember from the prisoner's dilemma that sometimes a game can have irredeemably lousy rules. There may be nothing we can do once we're in it, but the theory of information cascades may help us to avoid such a game in the first place. And if you're the kind of person who always does what you think is right, no matter how crazy others think it is, take heart. The bad news is that you will be wrong more often than the herd followers. The good news is that sticking to your convictions creates a positive externality, letting people make accurate inferences from your behavior. There may come a time when you will save the entire herd from disaster.

There's one auction design in particular that cuts through the burden of mental recursion like a hot knife through butter. It's called the Vickrey auction. Named for Nobel Prize–winning economist William Vickrey, the Vickrey auction, just like the first-price auction, is a "sealed bid" auction process. That is, every participant simply writes down a single number in secret, and the highest bidder wins. However, in a Vickrey auction, the winner ends up paying not the amount of their own bid, but that of the second-place bidder. That is to say, if you bid $25 and I bid $10, you win the item at my price: you only have to pay $10. To a game theorist, a Vickrey auction has a number of attractive properties. And to an algorithmic game theorist in particular, one property especially stands out: the participants are incentivized to be honest. In fact, there is no better strategy than just bidding your "true value" for the item--exactly what you think the item is worth. Bidding any more than your true value is obviously silly, as you might end up stuck buying something for more than you think it's worth. And bidding any less than your true value (i.e., shading your bid) risks losing the auction for no good reason, since it doesn't save you any money--because if you win, you'll only be paying the value of the second-highest bid, regardless of how high your own was. This makes the Vickrey auction what mechanism designers call "strategy-proof," or just "truthful." In the Vickrey auction, honesty is literally the best policy. Even better, honesty remains the best policy regardless of whether the other bidders are honest themselves. In the prisoner's dilemma, we saw how defection turned out to be the "dominant" strategy--the best move no matter whether your partner defected or cooperated. In a Vickrey auction, on the other hand, honesty is the dominant strategy. This is the mechanism designer's holy grail. You do not need to strategize or recurse.
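
The mechanism fits in a few lines:

def vickrey_auction(bids):
    """Sealed-bid, second-price: the highest bidder wins, but pays the
    second-highest bid."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0
    return winner, price

print(vickrey_auction({"you": 25, "me": 10}))   # ('you', 10)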

CONCLUSION: COMPUTATIONAL KINDNESS

Any dynamic system subject to the constraints of space and time is up against a core set of fundamental and unavoidable problems. These problems are computational in nature, which makes computers not only our tools but also our comrades. From this come three simple pieces of wisdom.

First, there are cases where computer scientists and mathematicians have identified good algorithmic approaches that can simply be transferred over to human problems. The 37% Rule, the Least Recently Used criterion for handling overflowing caches, and the Upper Confidence Bound as a guide to exploration are all examples of this.

Second, knowing that you are using an optimal algorithm should be a relief even if you don't get the results you were looking for. The 37% Rule fails 63% of the time. Maintaining your cache with LRU doesn't guarantee that you will always find what you're looking for; in fact, neither would clairvoyance. Using the Upper Confidence Bound approach to the explore/exploit tradeoff doesn't mean that you will have no regrets, just that those regrets will accumulate ever more slowly as you go through life. Even the best strategy sometimes yields bad results--which is why computer scientists take care to distinguish between "process" and "outcome." If you followed the best possible process, then you've done all you can, and you shouldn't blame yourself if things didn't go your way. Outcomes make news headlines--indeed, they make the world we live in--so it's easy to become fixated on them. But processes are what we have control over. As Bertrand Russell put it, "it would seem we must take account of probability in judging of objective rightness.… The objectively right act is the one which will probably be most fortunate. I shall define this as the wisest act." We can hope to be fortunate--but we should strive to be wise. Call it a kind of computational Stoicism.

Finally, we can draw a clear line between problems that admit straightforward solutions and problems that don't. If you wind up stuck in an intractable scenario, remember that heuristics, approximations, and strategic use of randomness can help you find workable solutions. A theme that came up again and again in our interviews with computer scientists was: sometimes "good enough" really is good enough. What's more, being aware of complexity can help us pick our problems: if we have control over which situations we confront, we should choose the ones that are tractable.

Seemingly innocuous language like "Oh, I'm flexible" or "What do you want to do tonight?" has a dark computational underbelly that should make you think twice. It has the veneer of kindness about it, but it does two deeply alarming things. First, it passes the cognitive buck: "Here's a problem, you handle it." Second, by not stating your preferences, it invites the others to simulate or imagine them. And as we have seen, the simulation of the minds of others is one of the biggest computational challenges a mind (or machine) can ever face. In such situations, computational kindness and conventional etiquette diverge. Politely withholding your preferences puts the computational problem of inferring them on the rest of the group. In contrast, politely asserting your preferences ("Personally, I'm inclined toward x. What do you think?") helps shoulder the cognitive load of moving the group toward resolution. Alternatively, you can try to reduce, rather than maximize, the number of options that you give other people--say, offering a choice between two or three restaurants rather than ten. If each person in the group eliminates their least preferred option, that makes the task easier for everyone. And if you're inviting somebody out to lunch, or scheduling a meeting, offering one or two concrete proposals that they can accept or decline is a good starting point. None of these actions is necessarily "polite," but all of them can significantly lower the computational cost of interaction.

Computational kindness isn't just a principle of behavior; it's also a design principle. One of the chief goals of design ought to be protecting people from unnecessary tension, friction, and mental labor. (This is not just an abstract concern; when mall parking becomes a source of stress, for instance, shoppers may spend less money and return less frequently.)