Well, I’m not the only one saying this (sadly, it’s behind a paywall):

The melding of science and statistics has often propelled major breakthroughs. Last year’s Nobel Prize in Physics was awarded for the discovery of the accelerating expansion of the universe. That discovery was facilitated by sophisticated statistical methods, establishing that the finding was not an artifact of imprecise measurement or miscalculations. Statistical methods also allowed the trial demonstrating that zidovudine reduces the risk of HIV transmission from infected pregnant women to their infants to be stopped early, benefiting countless children. Statistical principles have been the foundation for field trials that have improved agricultural quality and for the randomized clinical trial, the gold standard for comparing treatments and the backbone of the drug regulatory system.

I spent a little bit of time trying to present ways that scientists and laymen can engage each other. It seems that in calling for a policy change, either in raising the level of public funding or peddling statistics as a viable career choice, perhaps Science should have made these articles freely available? Otherwise, Marie Davidian and Thomas Louis, the authors of this editorial, are preaching to the choir.

***

This is as good a time as any to present my thoughts on Stephen Baker’s *The Numerati*. It is a serviceable introduction to the arenas where statistical analyses of large data sets are gaining prominence. Despite the title, the book does not really present the leading scientists and statisticians who are at the forefront of converting our analog lives into computer-friendly numbers. I would also have liked to see this book grapple more with issues such as how non-statisticians should come to terms with the way we are all being quantified and analyzed.

The book presents this numerification without judgment. It is simply a description of what is already happening. From Mr. Baker’s matter-of-fact presentation, we can surmise that behavior quantification is currently used mainly to market products to us or to track us. Politicians get to slice us into finer demographics; true believers are ignored while swing voters are targeted. Stores entice consumers to join rewards programs; the information that businesses gain is cheaply bought. The debris of our personal lives is vacuumed up by governments intent on identifying the terrorists among us. The workplace becomes more divided, first by cubicle walls and then by algorithms designed to flag malingerers.

Mr. Baker does not dwell on how power resides in those who have access to the information, although most of the researchers seem to think that their analyses will be used by laymen as much as by themselves. He presents two dissenting voices. One is a detective who uses the latest face recognition software for casinos. This expert has become an advocate for the privacy that citizens deserve; it can be uncomfortable to receive targeted ads that presume too much about one’s behavior. The other is Baker himself, but only in the narrow scope of how numerification affects his own industry. He thinks there is value in editors acting as curators of news. Otherwise, that role will fall to the reader, who may be overwhelmed by the number of news items. More likely, that reader will defer to search engines (the very things supplanting editors).

Mr. Baker does not really push this issue, but search engines do not have to be value-neutral. They can very well reflect the political biases of their owners, or the function itself might be a value-add meant to drive up revenue streams (don’t forget, Google makes money by selling ads). People tend to think of software as unbiased and objective because it is based on algorithms, machine rules and mathematical models. I think one interesting aspect of numerification is that it in no way dismisses the need for judgment. This is especially important in selecting the mathematical rules to use, the filters and gates one applies to data, and the interpretation of results. A computer can crank out numbers, but humans decide what formulas to use.

A short while ago, I was discussing this very issue with a director of analytics at a marketing firm. We got to discussing cluster analysis; we both felt that while its result is perfect for what we want to do with our data, there is a surprising amount of ambiguity involved. In MATLAB, one function used for finding groups of data points is k-means clustering. To use it, you have to specify how many clusters the function should slice the data into. The process itself is straightforward: a number of positions (one per cluster) are selected at random. The algorithm then repeatedly repositions these points so that each one sits at the center (the mean) of the group of data points closest to it, and that group forms the cluster. Everything about it works as advertised, except for the part where the user needs to tell the program how many clusters there are. Not much help if you are looking for a computational method to find the clusters “objectively.” The director and I moved on to other topics, such as formulating the machine rules and vetting them.
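To make the ambiguity concrete, here is a minimal sketch of k-means written in Python with NumPy (not the MATLAB function the director and I were talking about). The data are made up: three well-separated blobs, so we happen to know the “right” answer. The function names `kmeans` and `best_kmeans`, the restart count, and the blob positions are all my own illustrative choices.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means (Lloyd's algorithm); returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it
        # (an empty cluster keeps its old position).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Inertia: total squared distance from each point to its nearest centroid.
    inertia = float((dists.min(axis=1) ** 2).sum())
    return centroids, labels, inertia

def best_kmeans(points, k, n_restarts=10):
    """Run k-means from several random starts and keep the tightest result."""
    runs = [kmeans(points, k, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

# Made-up data: 50 points around each of three centers.
rng = np.random.default_rng(42)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in true_centers])

# The algorithm happily returns an answer for any k we hand it.
inertias = {k: best_kmeans(points, k)[2] for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

The scatter within clusters generally keeps shrinking as k grows, so the fit alone never announces “stop at three.” Someone still has to look at where the improvement levels off, or bring outside knowledge, and decide.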

Let’s leave aside the loss of dignity and individuality entailed in numerification; the subtle points not addressed in *The Numerati* are how models are built and how metrics are validated. This touches directly on the things that can go wrong with numerifying society. The most obvious example is bad data – either typos or out-of-date information – leading to misclassification. It’s not identity theft, but the result is the same: some agent attributes some notoriety to the wrong person. The victim gets stuck with a bill or, worse, is labeled a terrorist and detained by authorities. Another possible error is that the wrong metric is used, leading to even more inefficiency than if the numbers had been ignored. Simply put, are the measures used really the most relevant ones, and how likely are we to settle on the wrong formula?

Dave Berri, a sports statistician, has been a bellwether in this regard. He has spent significant space in two books, *The Wages of Wins* and *Stumbling on Wins*, as well as on his website and on the Huffington Post, documenting how even people with a vested interest in using statistics do not always come to scientifically consistent conclusions. He is able to use sports statistics to give us insight into the decision-making process. His observations and models, and frankly most models in general, have been met with two criticisms: 1) math models cannot capture something as complicated as basketball and 2) his findings deviate from existing opinion – that is, his models seem wrong. Answering these criticisms gets at some issues in data mining and correlation analysis that *The Numerati* avoided.

***

Both objections speak to the confusion people have between determinism and the predictions one can make with a model. First, there are actually few deterministic physical laws. Quantum mechanics happens to be one, but the effects can only be seen in reduced systems – at the level of single electrons. As we include more of the universe, at the scales relevant to human experience, our deterministic laws take on a more approximate character. We begin to model empirical effects rather than derive solutions from first principles (with a few important caveats). The point is that we can use Newton’s Laws, which were modeled after observation, just fine in sending our space probes to Jupiter. We do not need a unified field theory to figure out how the subatomic particles of the molecules of a spacecraft interact with the like particles making up Jupiter to help us aim.

Models based on empirical findings can only predict events prescribed within the boundaries of observation. This is even more true of statistical models based on data mining. New conditions can arise such that they invalidate the assumptions (or the previous observations) used to build the model. The worst-case scenario is when some infrequent catastrophe occurs – Nassim Nicholas Taleb’s “black swan” event.

That’s part of the art of working with models. We must understand their limitations as well as their conclusions. As the system becomes more complex, so do our models (generally). The complexity of our models may be linked both to the system and to the precision we require. For example, one can model Texas Hold’em in terms of the probability of receiving a given hand and derive optimal betting strategies. But that ignores the game theory aspect of the game: players can use information gained during the course of play, they can bluff, and they can alter their strategy out of plain ignorance. There are also emotional aspects to play that might lead players to deviate from optimal strategy or miscalculate probabilities. For models that are based on observations, their predictions pertain to the likelihood of outcomes. Over many trials, I would expect the frequency of outcomes to conform to the model, but I cannot predict what the immediate next result will be. It’s the same as knowing that throwing a 7 is the most common event when playing craps, but I can’t say whether the next throw will in fact be a 7.
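The craps example is small enough to work out in full. The sketch below, in Python, counts all 36 equally likely outcomes of two fair dice to get the exact chance of a 7, then simulates a long run of throws. The function name `simulate` and the number of throws are my own choices for illustration.

```python
import random
from collections import Counter
from fractions import Fraction
from itertools import product

# Exact model: enumerate all 36 equally likely outcomes of two fair dice.
totals = Counter(a + b for a, b in product(range(1, 7), repeat=2))
p_seven = Fraction(totals[7], 36)
most_common_total = totals.most_common(1)[0][0]

def simulate(n_throws, seed=0):
    """Fraction of simulated two-dice throws that come up 7."""
    rng = random.Random(seed)
    sevens = sum(
        1 for _ in range(n_throws)
        if rng.randint(1, 6) + rng.randint(1, 6) == 7
    )
    return sevens / n_throws

print(most_common_total)   # 7
print(p_seven)             # 1/6
print(simulate(100_000))   # close to 1/6 over a long run
```

The model is right about the long run and silent about the next throw: five times out of six, the single most likely total does not come up.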

So why build these models? Because the process allows us to make explicit our ideas. It allows us to specify things we know, things we wish we knew, and possibly to help us identify things we were ignorant of. Let us use sports as an example. Regardless of what we think about statistics and models, all of us already have one running in our heads. In the case of basketball, we can actually see this unspoken bias: general managers, sportswriters, and fans tend to name players as above average the more points per game they score. This is without consideration of other contributions, like steals, blocks, turnovers, fouls, rebounds, and shooting percentage. We know this because of empirical data: the pay scale of basketball players (controlled by GMs), MVP voting (by sportswriters) and All-Star selections (by coaches and the fans). The number of points scored best explains why someone is chosen as a top player.

The upshot is that humans have a nervous system built to extract patterns. This is great for creating heuristics – general rules of thumb. Unfortunately, we are influenced not only by the actual frequency of events but also by our emotions. Thus we do not actually have an objective model, but one modified by our perceptions. In other words, unless we take steps to count properly – that is, to create a mathematically precise model – we risk giving our subjective biases the veneer of objectivity. This is a worse situation than having no model; we would place our confidence in something that will systematically give us wrong answers, rather than realizing we simply don’t know.

There are even more subtle problems with model building. Even having quantifiable events and objective observational data does not guarantee that one will have a good model. This problem can be seen in the NFL draft; the predictors that coaches use – this time published and made explicit, such as Wonderlic scores and NFL combine observations – do not have much value in identifying players who will be average, let alone superstars. Berri has presented a lot of data on this, ranging from original research published in economics journals to more informal channels such as his books and web pieces. So how do we conclude that we have a good model?

***

Here is where it gets tricky. In the case of sports, we can identify a good output metric, such as a team’s win-loss record. If you start from scratch, you might argue that a winning team must score more points than an opponent. You would test this by performing a simple linear regression analysis, and you would find that it is in fact the case. As a matter of fact, the first model is an obvious one: score more points than your opponents and you win. So obvious that it sounds like the definition of a win. In this case, it becomes apparent that the win-loss record is a “symptom”, a reflection of the fact that for a given game, players do not make wins, but they do make points. Points-scored and points-against (point differential) become a more elemental assumption.

This isn’t too novel a finding; most sports conform to some variant of Bill James’s Pythagorean expectation (named as such because its terms resemble the Pythagorean relation a^2 + b^2 = c^2). If we start from the assumption that everything a player does to help or hurt his team comes down to scoring points, then we can begin to ask whether all points are equal and whether other factors help or prevent teams from scoring. As it happens, Berri has done a credible job of building a simple formula using basketball box scores (rebounds, personal fouls, shooting percentage, assists, turnovers, blocks, and steals). Here, we have obvious end goal measures: point differential and ultimately, win-loss.
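For readers who have not seen it, the Pythagorean expectation estimates a team’s winning fraction as points scored squared, divided by the sum of points scored squared and points allowed squared. Here is a short Python sketch. The exponent of 2 is the form that gives the formula its name; variants for other sports use other exponents, so I have left it as a parameter. The function name and the season totals are made up for illustration.

```python
def pythagorean_expectation(points_for, points_against, exponent=2.0):
    """Expected winning fraction from points scored and points allowed."""
    if points_for < 0 or points_against < 0:
        raise ValueError("point totals must be non-negative")
    if points_for + points_against == 0:
        raise ValueError("at least one point total must be positive")
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)

# Made-up season totals: a team that scores exactly what it gives up
# is expected to win half its games.
print(pythagorean_expectation(700, 700))              # 0.5
# Outscoring opponents 800 to 700 pushes the expectation above one half.
print(round(pythagorean_expectation(800, 700), 3))    # 0.566
```

Notice that the only inputs are points for and points against. Wins never enter the formula, which is the sense in which the win-loss record is a “symptom” of point differential.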

But what if there is no obvious standard by which to judge the effectiveness of our models? That is the situation encountered by modelers who try to identify terrorists or to increase worker productivity. Frankly, the outcomes are confounded by the fact that terrorists take steps to hide their guilt, and workers might work much harder at giving the appearance of productivity than at actually doing work. In this case, deciding which parameters are significant predictors is only half the job; one might need to perform an empirical investigation in order to establish the outcome. The irony is that despite the complicated circumstances in a sports contest, the system remains well-specified and amenable to analysis. Life, then, is characterized by having more parameters and variables, being less defined in outcome, and having much greater noise in its measures.

Nevertheless, some analysis can be done. Careful observation will allow us to classify the most frequent outcomes. This is most clear in the recommendations from Amazon: “Customers who purchased this also bought that.” If that linkage passes some threshold, it is to Amazon’s benefit to suggest it to the customer. Thus the parallels between basketball (and sports) statistics and the numerification of life are clear. The key is to find a standard for performance. For a retailer, it might be sales. For a biotech company, it could be the number of candidate drugs entering Phase I clinical trials. Some endpoints might be fuzzier (what would one say makes a productive office worker? The ratio of inbox to outbox?). Again, identifying a proper standard is hard, combining both art and science. This is another point ignored in Baker’s book: there are many points at which humans exert an influence in modeling.
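I have no idea how Amazon actually computes its recommendations, but the “linkage passes some threshold” idea can be sketched in a few lines of Python. This version counts how often two items land in the same basket and suggests item B to buyers of item A when the share of A-buyers who also bought B clears a cutoff. The function name, the cutoff, and the baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_purchase_rules(baskets, min_confidence=0.5):
    """Map (bought, suggested) -> share of buyers of `bought` who also bought `suggested`."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = set(basket)
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))
    rules = {}
    for (a, b), together in pair_counts.items():
        # Check the link in both directions; it is usually not symmetric.
        for bought, suggested in ((a, b), (b, a)):
            confidence = together / item_counts[bought]
            if confidence >= min_confidence:
                rules[(bought, suggested)] = confidence
    return rules

# Made-up purchase histories.
baskets = [
    ["book", "bookmark"],
    ["book", "bookmark", "lamp"],
    ["book", "lamp"],
    ["book"],
    ["lamp", "bulb"],
]
rules = co_purchase_rules(baskets, min_confidence=0.6)
for (bought, suggested), confidence in sorted(rules.items()):
    print(f"bought {bought} -> suggest {suggested} ({confidence:.2f})")
```

Even in this toy there are human choices everywhere: the cutoff of 0.6, counting baskets rather than dollars, and treating every customer alike. Change any of them and the suggestions change.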

Basketball can again serve as an illustration. The action is dynamic, fast-paced, and has many independent elements (that is, the 10 players on the court). However, just because we perceive a system to be complex does not imply that the model itself needs to be. Bill Simmons, a vocal opponent of statistics in “complicated” games like basketball, makes a big deal about “smart” statistics – like breaking down game footage into more precise descriptions of action, such as whether a shooter favors driving to one side over the other, if he has a higher shooting percentage from the corners, how far he plays off his man, and so on. In other words, Simmons would say that there is a lot of information ignored by box scores. Ergo, they cannot possibly be of use to basketball personnel. Yet as Berri and colleagues have shown, box scores do provide a fair amount of predictive value with regard to points differential.

What critics like Simmons miss is that these models most definitely describe the past, or, what the players have done, but the future is quite a bit more open-ended. These critics confuse “could” with “will.” A model’s predictive value depends not only on its correlation with the standard but also on how stable it is across time. Again, despite the rather complicated action on the court, basketball players’ performance, modeled using WP48, is fairly consistent from year to year. Armed with this information, one might reasonably propose that LeBron James, having *this* level of performance last year, might reach a similar level this year.

As any modeler realizes, that simple linear extrapolation ignores many other variables. One simple confound is injury. Another is age. Yet another is whether the coach actually uses the player. In other words, the critics tend to assume past performance equals future returns. The statistical model, even WP48, does not allow us to say, with deterministic accuracy, how a player will perform from game to game, let alone from year to year. At the same time, the model does not present a “cap” on a player’s potential. Used judiciously, it is a starting point for allowing coaches and GMs to identify the strengths and weaknesses of their players, freeing them to devise drills and game strategies that can improve player performance. Interpreted in this way, WP48 allows coaches to see whether their strategies have an impact on overall player productivity, which should lead to more points scored and fewer points given up.

How would we deal with competing models? The standard of choice in sports – the points differential – also allows us to compare Berri’s formula with other advanced statistics. Berri’s “Wins Produced Per 48 minutes” (WP48) stat correlates with point differential, and hence wins. Among many competing models, John Hollinger has presented a popular alternative, the Player Efficiency Rating (PER). PER is a proprietary algorithm and by all accounts, “complicated”. That’s fine, except Berri showed that the rankings generated by PER differ little from ranking players according to their average points scored per game. In other words, you can get the same performance as PER by simply listing a player’s points-per-game stat. Interestingly, points-per-game has a *lower* correlation to the points differential than WP48: measured against the standard, simply scoring points actually does not lead to wins. On an intuitive level, this makes sense, because you also need to play defense and keep the opponents from scoring more than you.

A shrewd reader might also have realized that there can be “equivalent” models. This was emphasized by showing that two metrics are highly correlated to each other (such as points-per-game and PER). Coupled with correlation to our standard, we now have a technique for comparing models both on how well they perform and on whether we have redundant formulas. This is useful; if we have two alternatives that tell us the exact same thing, then wouldn’t we rather use the simpler one?
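The two comparisons described above (each metric against the standard, and the metrics against each other) can be sketched with NumPy’s `corrcoef`. Everything here is synthetic: the “players” are random numbers, and the three metrics are stand-ins I invented to mimic a simple scoring stat, a complicated stat that mostly repackages scoring, and a stat that also credits defense. None of them is the real points-per-game, PER, or WP48.

```python
import numpy as np

def corr(x, y):
    """Pearson correlation between two equal-length arrays."""
    return float(np.corrcoef(x, y)[0, 1])

rng = np.random.default_rng(0)
n_players = 500

# Synthetic player traits (assumed independent for the sake of the sketch).
scoring = rng.normal(size=n_players)
defense = rng.normal(size=n_players)

# Stand-in for the standard (point differential): both traits matter.
standard = scoring + defense + 0.3 * rng.normal(size=n_players)

simple_metric = scoring                                           # scoring only
complicated_metric = scoring + 0.1 * rng.normal(size=n_players)   # scoring, dressed up
balanced_metric = scoring + defense                               # credits both traits

# 1) How well does each metric track the standard?
for name, metric in [("simple", simple_metric),
                     ("complicated", complicated_metric),
                     ("balanced", balanced_metric)]:
    print(name, "vs standard:", round(corr(metric, standard), 2))

# 2) Are any two metrics telling us the same thing?
print("simple vs complicated:", round(corr(simple_metric, complicated_metric), 2))
```

In this toy world the complicated metric is nearly interchangeable with the simple one, and both trail the balanced metric when judged against the standard. That is the pattern Berri reports for PER and points-per-game, though the numbers here prove nothing about the real statistics.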

Recently, an undergraduate student undertook a project to model PER, resulting in a linear equation that allowed for analysis of the weightings that John Hollinger most likely used. In turn, this lays bare the assumptions and biases that Hollinger used in constructing his model. An analysis of the simplified PER models suggests that PER is dominated by points scored. All the other variables in PER only give the pretense of adding information. There are underlying assumptions and factors that prove overwhelming in their effects. But this isn’t such a novel finding given the suspiciously high correlation with points-per-game (and lower correlation with point differential). In this sense, then, “good” only implies correlation with the standards the modelers used. It isn’t “good” in the sense of being compared against what we feel a good model should look like.

***

I’ve been writing essays trying to help non-scientists deal with scientific findings. When reporters filter research, much information gets trimmed. Emphasis is usually given to conclusions; the problem is that good science is a direct function of the methods. Garbage in, garbage out still holds, but bad methods will turn gold into garbage as well.

The paper I will next discuss highlights this issue: correlation and causation are two different beasts, and mistaking the two can take a very subtle form. Venet and colleagues recently published an article* in PLOS Computational Biology showing how, even when care is taken to identify the underlying mechanism of disease, the very mechanism of disease pathology may not prove to be specific enough of a metric to help clinicians diagnose the disease. They write,

Hundreds of studies in oncology have suggested the biological relevance to human of putative cancer-driving mechanisms with the following three steps: 1) characterize the mechanism in a model system, 2) derive from the model system a marker whose expression changes when the mechanism is altered, and 3) show that marker expression correlates with disease outcome in patients—the last figure of such paper is typically a Kaplan-Meier plot illustrating this correlation.

This is essentially the same method other mathematicians and modelers will use to identify target markets, demographics, terrorists, athletic performance, and what have you. In this case, one would assume that the wealth of research in breast cancer will yield many “hard” metrics by which one can identify a patient with the disease. Venet and colleagues show that this is not the case; the problem is,

… meta-analyses of several outcome signatures have shown that they have essentially equivalent prognostic performances [35], [36], and are highly correlated with proliferation [7]–[8], [37], a predictor of breast cancer outcome that has been used for decades [38]–[40].

This raises a question: are all these mechanisms major independent drivers of breast cancer progression, or is step #3 inconclusive because of a basic confounding variable problem? To take an example of complex system outside oncology, let us suppose we are trying to discover which socio-economical variables drive people’s health. We may find that the number of TV sets per household is positively correlated with longer life expectancy. This, of course, does not imply that TV sets improve health. Life expectancy and TV sets per household are both correlated with the gross national product per capita of nations, as are many other causes or byproducts of wealth such as energy consumption or education. So, is the significant association of say, a stem cell signature, with human breast cancer outcome informative about the relevance of stem cells to human breast cancer?

Scientific research is powerful because of its compare-contrast approach – explicit comparisons of a test case with a control case. We can take a sick animal or patient, identify the diseased cells, and do research on them. All the research generally revolves around taking two identical types of cells (or animals, or conditions), but with one crucial difference. For the case of cancer, one might reasonably select a cancer cell and compare it to a normal cell of the same type. In this way, we can ask how the two differ.

If the controls were not well-designed, then one might really be testing for correlation, not causation. As one can imagine, even if a few things go wrong, the effects might be masked by many disease-irrelevant processes – this is what we would call noise. Venet and colleagues looked at studies that used gene expression profiles. The idea is that a diseased cell will have some different phenotype (i.e. “characteristic”), whether it be in the genes it expresses, or the proteins that it uses, or in its responses to signals from other cells, or in its continual growth, or in its ignoring the cell-death signal, and so on. One characteristic of cancerous cells is that they grow and divide. The signature that researchers had focused on was simply the genes expressed by cancer cells, which presumably will not be expressed in non-cancer cells. Remember this point; it becomes important later.

Further, it was reasonable to hypothesize that the power of this test would grow as more and more genes from the diseased state are incorporated in the diagnostic. Whatever differed between cancer and normal cells should, in theory, be usable as either a diagnostic marker or a potential target for drug action. As Venet and colleagues point out, many genes actually play a role in the grow-and-divide cycle (“proliferation”) of normal cells. These genes may have increased expression in cancer cells, and one would hope that their elevated levels key the cells as being different from normal cells. In this case, that isn’t enough; the underlying attribute of these genes reflects an aberrant state, but only by degree. Even normal cells proliferate; it so happens that the genes involved in this process are relatively numerous. Thus there are two problems. One is that the markers are no good because they do not provide enough uniqueness or separation from the normal state. The second, related problem is that if one were to pick a number of genes at random to use as a diagnostic (in this case for breast cancer), one will likely end up with a gene related to proliferation, since these genes are so numerous. Even a random metric will show correlation to breast cancer diagnosis since chances are, a gene related to proliferation will be chosen. The problem is that the metric assumed that cancer cells have a gene expression profile that consists of genes expressed *only* in cancer cells (an on-off versus a more-less distinction).

In the words of Venet and colleagues,

Few studies using the outcome-association argument present negative controls to check whether their signature of interest is indeed more strongly related to outcome than signatures with no underlying oncological rationale. In statistical terms, these studies typically rest on [the null hypothesis] assuming a background of no association with outcome. The negative controls we present here prove this assumption wrong: a random signature is more likely to be correlated with breast cancer outcome than not. The statistical explanation for this phenomenon lies in the correlation of a large fraction of the breast transcriptome with one variable, we call it meta-PCNA, which integrates most of the prognostic information available in current breast cancer gene expression data. (emphasis mine)

The method was simple: Venet and colleagues compared previously published gene expression profiles vetted for breast cancer diagnosis with gene signatures from other biological processes (such as “social defeat in mice” and “localization of skin fibroblasts”) and also with random selections of genes from the human genome. All these metrics, regardless of oncological significance, showed “predictive” value for breast cancer. What that means is that expression of any of these gene sets, meaningful or not, tracks with breast cancer outcome. Hence the title of the paper, “Most Random Gene Expression Signatures Are Significantly Associated with Breast Cancer Outcome.”
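The logic of that negative control can be mimicked with a toy simulation in Python. To be clear about what is assumed: this is not the paper’s data or its survival analysis. I swap in a plain correlation test, invent a hidden “proliferation” factor, and let a chosen share of made-up genes track it. The function name, the gene counts, the loading of 0.5, and the large-sample cutoff of 1.96/√n for a 5% test are all illustrative choices of mine.

```python
import numpy as np

def fraction_significant(confounder_share, n_patients=200, n_genes=1000,
                         signature_size=50, n_signatures=200, seed=0):
    """Share of random gene signatures whose score correlates 'significantly' with outcome."""
    rng = np.random.default_rng(seed)
    # Hidden factor that drives the outcome (a stand-in for proliferation).
    proliferation = rng.normal(size=n_patients)
    # A share of genes track the hidden factor; the rest are pure noise.
    loadings = np.where(rng.random(n_genes) < confounder_share, 0.5, 0.0)
    expression = (np.outer(proliferation, loadings)
                  + rng.normal(size=(n_patients, n_genes)))
    outcome = proliferation + rng.normal(size=n_patients)
    # Large-sample approximation to the 5% two-sided cutoff for a correlation.
    r_crit = 1.96 / np.sqrt(n_patients)
    hits = 0
    for _ in range(n_signatures):
        # A "signature" with no rationale at all: genes drawn at random.
        genes = rng.choice(n_genes, size=signature_size, replace=False)
        score = expression[:, genes].mean(axis=1)
        r = np.corrcoef(score, outcome)[0, 1]
        hits += int(abs(r) > r_crit)
    return hits / n_signatures

print("no shared factor:", fraction_significant(0.0))
print("half the genes track the factor:", fraction_significant(0.5))
```

When no genes share a hidden factor, only a small share of random signatures clear the bar, about what a 5% test should let through by chance. When many genes track the factor, most random signatures look “predictive,” so beating the no-association baseline tells you very little. A new signature has to beat the random ones.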

***

How do we deal with this study? Does it suggest that biomarkers are a waste? No. For one, the only test presented in this paper is one where a randomized signature is compared to a breast-cancer diagnostic based on gene expression. That a specific test does no better than chance only allows us to conclude the test is deficient in some way. The point is that the existing test may be keying in on “proliferation”, except that Venet and colleagues showed that removing such genes did not worsen the performance of the randomized gene set in “diagnosing” breast cancer. It may be that the gene expression data has not been sufficiently de-noised. One can certainly try to “clean” up the model, but new tests must be shown to differ from the baseline (or, control) level of performance of a randomized gene set.

And how does this relate to the earlier points about basketball statistics? Only in that modeling effectiveness depends on how good a standard is, how well the variables are characterized, and how independent the relationships among the variables really are. Having testable hypotheses and experiments helps too (although it seems a shame that gene expression profiles may not prove to be the key factor in this specific scenario). Even leaving aside the question of whether a model is good or bad, being able to show statistical correlation between models is powerful. Before, I had written that Dave Berri showed that John Hollinger’s PER model has no significant difference from simply looking at points-per-game (in fact, the correspondence is nearly one to one). This conclusion was revealed by the same types of statistical analyses that allowed Venet and colleagues to show the equivalence between existing “breast cancer” gene signatures and a randomized one. While correlation does not imply causation, in the case of models, it can certainly help us identify equivalent models with redundant information.