Joe Posnanski has written another thoughtful piece on the divide between writers of a statistical bent and those who prefer the evidence of their eyes.  I highly recommend it; Posnanski distills the arguments into one about stories. Do statistics ruin them? His answer is no. Obviously, one should use statistics to tell other stories, if not necessarily better ones. He approached this by examining how one statistic, “Win Probability Added”, helped him look at certain games with fresh eyes.

My only comment here is that I’ve noticed, on his and other sites (such as Dave Berri’s Wages of Wins Journal), that one difficulty in getting non-statisticians to look at numbers is that they tend to desire certainty. What they usually get from statisticians, economists, and scientists is reams of ambiguity. The problem comes not when someone is able to label Michael Jordan as the greatest player of all time*; the problem comes when one is left trying to place merely great players against each other.

* Interestingly enough, it turns out the post I linked to was one where Prof. Dave Berri was defending himself against a misperception. It seems writers such as Matthew Yglesias and King Kaufman had mistaken Prof. Berri’s argument using his Wins Produced and WP48 statistics, thinking that Prof. Berri wrote other players were “more productive” than Jordan. To which Prof. Berri replied, “Did not”, but also gave some nuanced approaches in how one might look at statistics. In summary, Prof. Berri focused on the difference in performance of Jordan above that of his contemporary peers.

The article I linked to about Michael Jordan shows that, when one compares numbers directly, care should be taken to place them into context. For example, Prof. Berri writes that, in the book Wages of Wins, he devoted a chapter to “The Jordan Legend.” At one point, though, he writes that

 in 1995-96 … Jordan produced nearly 25 wins. This lofty total was eclipsed by David Robinson, a center for the San Antonio Spurs who produced 28 victories.

When we examine how many standard deviations each player is above the average at his position, we have evidence that Jordan had the better season. Robinson’s WP48 of 0.449 was 2.6 standard deviations above the average center. Jordan posted a WP48 of 0.386, but given that shooting guards have a relatively small variation in performance, MJ was actually 3.2 standard deviations better than the average player at his position. When we take into account the realities of NBA production, Jordan’s performance at guard is all the more incredible.

If one simply looked at the numbers, it does seem like a conclusive argument that Robinson, having produced more “wins” than Jordan, should be the better player. The nuance comes when Prof. Berri places that into context. Centers, working closer to the basket, ought to have more high-percentage shooting opportunities, rebounds, and blocks. His metric of choice, WP48, takes these into consideration. When one then looks at how well Robinson performed above his proper comparison group (i.e. other centers), we see that Robinson’s performance looks exceptional when compared against other positions but is less out of the ordinary when compared to other centers. However, Jordan’s performance, when compared to other guards, shows him to be in a league of his own.

That argument was accomplished by taking absolute numbers (generated for all NBA players, for all positions) and placing them into context (comparing to a specific set of averages, such as by position.)
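The comparison in the quote is a z-score: the player's WP48 minus the positional average, divided by the positional standard deviation. Here is a minimal Python sketch of that arithmetic. The WP48 values and the 2.6 and 3.2 figures come from the quote; the positional average of 0.100 is my assumption (the quote does not give positional averages), so the spreads backed out below are illustrative only.

```python
def z_score(value, mean, sd):
    """How many standard deviations `value` sits above `mean`."""
    return (value - mean) / sd

def implied_sd(value, mean, z):
    """Back out the spread that would produce a given z-score."""
    return (value - mean) / z

ASSUMED_POSITION_MEAN = 0.100  # assumption, not taken from the quoted passage

robinson_wp48, robinson_z = 0.449, 2.6  # from the quote
jordan_wp48, jordan_z = 0.386, 3.2      # from the quote

center_sd = implied_sd(robinson_wp48, ASSUMED_POSITION_MEAN, robinson_z)
guard_sd = implied_sd(jordan_wp48, ASSUMED_POSITION_MEAN, jordan_z)

print(f"implied SD, centers:         {center_sd:.3f}")
print(f"implied SD, shooting guards: {guard_sd:.3f}")
print(f"Robinson z-score: {z_score(robinson_wp48, ASSUMED_POSITION_MEAN, center_sd):.1f}")
print(f"Jordan z-score:   {z_score(jordan_wp48, ASSUMED_POSITION_MEAN, guard_sd):.1f}")
```

Under that assumed average, the guard spread comes out smaller than the center spread, which is the whole point: a smaller raw number can sit further away from its own peer group.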

This is where logic, math, and intuition can get you. I don’t think most people would have trouble understanding how Prof. Berri constructed his arguments. He tells you where his numbers came from and why they might appear to go against “conventional wisdom”; in this case, the way he structured his analysis resolved the difference (it isn’t always the case that he’ll confirm conventional wisdom – see his discussions on Kobe Bryant.)

However, I would like to focus on the fact that Prof. Berri’s difficulties came when his statistics generated larger numbers for players not named Michael Jordan. (I will refer people to a recent post listing a top-50 of NBA players on Wages of Wins Journal.*)

* May increase blood pressure.

In most people’s minds, that clearly leads to a contradiction: how can this guy, with smaller numbers, be better than the other guy? Another way of putting this is: differences in numbers always matter, and they matter in the way “intuition” tells us.

In this context, it is understandable why people give such significance to 0.300 over 0.298. One is larger than the other, and it’s a round number to boot. Over 500 at-bats, the difference between a .300-hitter and a .298-hitter translates to 1 hit. For most people who work with numbers, such a difference is non-existent. However, if one were to perform “rare-event” screening, such as for cells in the blood stream that were marked with a probe that “lights” up for cancer cells, then a difference of 1 or 2 might matter. In this case, the context is that, over a million cells, one might expect to see, by chance, 5 or so false-positives in a person without cancer. However, in a person with cancer, that number may jump to 8 or 10.
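Both halves of that paragraph can be checked with a few lines of Python. The batting arithmetic is exact. For the cell counts, I am assuming (my assumption, not a claim about any real assay) that false positives behave like a Poisson count with a mean of 5 per million cells, and asking how often 8 or 10 would show up by chance alone.

```python
import math

def hits(avg, at_bats):
    return avg * at_bats

def poisson_tail(k, lam):
    """P(X >= k) for a Poisson count with mean lam."""
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below

hit_gap = hits(0.300, 500) - hits(0.298, 500)
print(f"hit difference over 500 at-bats: {hit_gap:.1f}")

BACKGROUND = 5  # expected false positives per million cells (from the text)
for observed in (8, 10):
    chance = poisson_tail(observed, BACKGROUND)
    print(f"P(at least {observed} by chance) = {chance:.3f}")
```

The same gap of a few counts is noise in one setting and a signal worth chasing in the other; what changes is the background rate you compare against.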

For another example: try Bill Simmons’s ranking of the top 100 basketball players in his book, The Book of Basketball. Frankly, a lot of the descriptions, justifications, arguments, and yes, statistics that Simmons cites look similar. However, my point here is that, in his mind, Simmons’s ranking scheme matters. The 11th best player of all time lost something by not being in the top-10, but he is still better off than the 12th best player. Again, as someone who works with numbers, I think it might make a bit more sense to just class players into cohorts. The interpretation here is that, at some level, any group of 5 (or even 10) players ranked near one another are practically interchangeable in terms of practicing their craft. The difference between two teams of such players is only good for people forced to make predictions, like sportswriters and bettors. With that said, if one is playing GM, it is absolutely a valid criterion to put a team of these best players together based on some aesthetic consideration. It’s just as valid to simply go down a list and pick the top-5 players as ordered by some statistic.* If two people pick their teams in a similar fashion, then it is likely a crap shoot as to which will be the better team in any one-off series. Over time (like an 82-game season), such differences may become magnified. Even then, the win difference between the two teams may be 2 or 3.

* Although some statistics are better at accounting for variance than others.
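To put rough numbers on the "2 or 3 wins" remark, here is a small Python sketch. The per-game win probabilities are hypothetical, and I am treating each game as an independent coin flip with a fixed probability, which is a simplification.

```python
import math

GAMES = 82

def expected_wins(p, games=GAMES):
    return games * p

def wins_sd(p, games=GAMES):
    # Binomial standard deviation, treating games as independent.
    return math.sqrt(games * p * (1 - p))

p_a, p_b = 0.53, 0.50  # hypothetical per-game win probabilities
gap = expected_wins(p_a) - expected_wins(p_b)

print(f"expected win gap over {GAMES} games: {gap:.1f}")
print(f"one-season standard deviation for team A: {wins_sd(p_a):.1f}")
```

With these made-up inputs the expected gap is smaller than the season-to-season wobble of a single team, which is why I would rather talk about cohorts than about slot 11 versus slot 12.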

How this leads back to Posnanski is as follows. In a lot of cases, he does not just simply rank numbers; partly, he’s a writer and story teller. The numbers are not the point; the numbers illustrate. Visually, there isn’t always a glaring difference between them, especially when one looks at the top performances.

Most often, the tie-breaker comes down to the story, or, rather, what Posnanski wishes to demonstrate. He’ll find other reasons to value them. In the Posnanski post I mentioned, I don’t think the piece would make a good story, even if it highlighted his argument well, had it ended differently.


My life (I am an American male) does not revolve around sports. I do follow the Boston Bruins, but they are never must-see TV for me – even when they are in the playoffs. Sorry, I prefer reading, making sure my house is in order, and spending time with family and friends.

My interest in sports runs along mathematical lines; I am more interested in statistical analysis and model building than in the games (and especially for baseball.) That and drinking beer while watching games.

So it is strange that I read just about everything Joe Posnanski writes. He writes about baseball, and without exception I  read his pieces about living and long-dead ball players whom I have (mostly) never seen.

This piece is particularly good. The way I would describe Posnanski is that he is about nuance. Nick Hornby isn’t the first to notice that males tend to love ranking things. Bill Simmons and Chuck Klosterman have also made similar points, in their own entertaining ways. Posnanski, in addition to offering his own rankings, offers a number of observations that temper the rankings. In other words, the separation between 2 players may not be as large as the gulf implied by, for example, a “first” and “second” ranking. This is interesting and somewhat in contrast to the approach of most sports columnists.

At any rate, here’s the nuance: Ryan and Suzuki are the best at what they do, but they don’t rank among the best baseball players ever. I won’t repeat Posnanski’s arguments here, but he’s not out to trash either guy. He’s simply trying to work through and present an informed opinion and analysis. The pair, Ryan and Suzuki, can be considered exceptional players along one dimension. Ryan threw more strikeouts than anyone; Suzuki is a hit machine. But because of other inefficiencies in their games, they actually do not help their teams as much as one might think (in terms of preventing runs for Ryan and driving the offense for Suzuki.)

The greater point is this: I think Posnanski is among the best writers in explaining numbers to an audience. In all seriousness, I want that talent in describing science to non-scientists. When Posnanski gets rolling on presenting statistical arguments for baseball excellence, I applaud the effort because he is able to note all the ways in which these “binary answers” have many shades of gray. When Posnanski talks numbers, I don’t see a difference between him and a scientist who is trying to explain ideas to laymen. And of course his writing talent makes you want more. Or at least it makes me want to read more.

Although this blog is ostensibly about books, I’ve written a lot about sports, mostly dealing with how non-scientist readers perceive statistical analysis of athlete productivity. This issue fascinates me; I think how people think about sports statistics provides a microcosm of how they may respond to similar treatments in the scientific realm. Economists, mathematicians, engineers and physicists will provide a better explanation of the analysis than I can. Instead, I want to focus on the people who draw (shall we say) interesting conclusions about research.

In a recent podcast, Bill Simmons interviewed Buzz Bissinger on the BS Report (July 28, 2010). Bissinger gained some negative exposure as he had railed against the blogosphere and sports analysis. In this podcast, Bissinger was given some time to elaborate on his thoughts. He most certainly is not a raving lunatic, but he did say a few things that I find representative of how statistical analyses are often misinterpreted by non-scientists (and  even scientists.)

Bissinger took the opportunity to trash Michael Lewis’s Moneyball, mostly by pointing out how Billy Beane isn’t so smart, and that, in the end, the statistical techniques didn’t work: only Kevin Youkilis, mentioned in the book, had proven to be a success. I think that misses the point. Yes, the book documents the tension between the scouts and the stat-heads. I think Lewis chose this approach to make the book more appealing, by taking the human interest angle, rather than simply writing a technical description of Beane’s “new” approach. Perhaps Lewis overstates the case in showing how entrenched baseball GMs were in relying on eyeball and qualitative skill assessments, but the point I got from the book was this: Beane worked under money constraints. He needed a competitive edge. Most baseball organizations relied on scouts. Beane thought that to be successful, he needed to do something different (but something that presumably had some relevance to baseball success).

Beane could have used fortune tellers; I think the technique in Moneyball (i.e. statistical analysis) is beside the point. Beane found something that was different and based more of his decisions on this new evaluation method. This is a separate issue from how well the new techniques performed. The first issue is whether the new technique told him something different. As it happens (as documented in Moneyball, Bill James’s Baseball Abstracts, and by many sports writers and analysts), it did. The result is that Beane was able to leverage that difference – in this case, he valued some abilities that others did not – and signed those players to his roster. The assumption is that if his techniques couldn’t give him anything different from previous methods of evaluation, then he would have had nothing to exploit.

The second point is whether the techniques told him something that was correct. And again, the stats did provide him with a metric that has a high correlation with winning baseball games – the on-base percentage. So one thing he was able to exploit was the perception in value of batting average (BA) versus on-base percentage (OBP). He couldn’t sign power hitters: GMs – and fans – like home runs. He avoided signing hitters with high BA and instead signed those with high OBP.
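The gap between the two measures is easy to see from their standard definitions: batting average is hits per at-bat, while on-base percentage also credits walks and hit-by-pitches. A short Python sketch follows; the two stat lines are hypothetical players I made up for illustration, not anyone from Moneyball.

```python
def batting_average(hits, at_bats):
    return hits / at_bats

def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    # Times on base divided by the plate appearances that count toward OBP.
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sac_flies
    return reached / chances

# Two hypothetical hitters: one hits for average, one draws walks.
contact = dict(hits=150, walks=25, hit_by_pitch=2, at_bats=500, sac_flies=5)
patient = dict(hits=130, walks=90, hit_by_pitch=5, at_bats=500, sac_flies=5)

for name, line in (("contact hitter", contact), ("patient hitter", patient)):
    ba = batting_average(line["hits"], line["at_bats"])
    obp = on_base_percentage(**line)
    print(f"{name}: BA {ba:.3f}, OBP {obp:.3f}")
```

The made-up contact hitter wins on BA and loses on OBP. A market that pays for the first number and ignores the second is the kind of mispricing Beane could exploit.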

This led to a third point: Beane can only leverage OBP to find cheap players (and still win) so long as there were few GMs doing the same. Of course the cost of OBP will increase if others come onboard and have deep pockets (like the Yankees and the Red Sox.) So Beane – and other GMs – would have to become more sophisticated in how they draft and sign players. Especially if they work under financial constraints. As my undergraduate advisor said, “You have to squeeze the data.”

One valid point Bissinger made was that the success of the Oakland A’s coincided with the Big Three pitchers. So clearly, Bissinger attributed a significant amount of Oakland’s success to the three. That’s fine, as the question can be settled by looking at data. What annoyed me is when readers do not pay attention to the argument. I just felt that Moneyball was more about how one can find success by examining what everyone else is doing, and then doing something different. The only constraint is whether something different would bring success.

I felt that Bissinger is projecting when he assumes that using stats means the rejection of visual experience. The importance of Moneyball is in demonstrating that one can find success by simply finding out what people have overlooked. Once the herd follows, it makes sense to seek out alternative measures, or, more likely, to find out what others are ignoring. If the current trend is on high OBP and ignoring pitchers with a high win-count, then a smart GM needs to exploit what is currently undervalued. Statistics happens to be one such tool – but it isn’t the only tool.

And part of the reason I write this is, again, to highlight the fact that people usually have unvoiced assumptions about the metrics they use. The frame of reference is important. In science, we explicitly create yardsticks for every experiment we perform. We assess things as whether they differ from control. It is a powerful concept. And even if the yardstick is simply another yardstick, we can still draw conclusions based on differences (or even similarities, if one derives the same answer by independent means.)

This brings me to recent Joe Posnanski and David Berri posts. The three posts I selected all demonstrate the internal yardsticks (hidden or otherwise) that people use when they make comparisons. I am a fan of these writers. I think Posnanski has provided a valuable service in bridging the gap between analysis and understanding, facts and knowledge. Whether one agrees or disagrees with his posts, I think Posnanski is extremely thoughtful and clear about his assumptions and conclusions, which facilitates discussion. The post has a simple point: Posnanski wrote about “seasons for the ages.” A number of readers immediately wrote to him, complaining about how just about anyone who hits 50 home runs in a season would qualify. To which Posnanski coined a new term (kind of like a sniglet) – obviopiphany. He realized that most people simply associate home runs with a fantastic season for a hitter. That isn’t what Posnanski meant, and in the post he offers some correction.

The Posnanski post has a simple theme and an interesting suggestion: the outrage over steroids may be due to the fact that people assume that home run hitters are good hitters. Since steroids help power, the assumption is that steroids make hitters good – which in most cases simply means more home runs. But Posnanski – and other sabermetricians – propose that one must hit home runs in the context of getting fewer strikeouts and more walks. The liability involved in striking out more, and not walking, is too much and washes out the gains made from hitting the ball far. Thus Posnanski’s post names 5 players who are not in the Hall of Fame, and aren’t home run hitters, but who nevertheless produced at the plate – according to some advanced hitting metrics. I won’t go into this more, except to say that here, Posnanski makes his assumptions clear. He uses OBP+, wins above replacement player, and other advanced metrics to make his point. But it is telling that Posnanski had to stitch together the assumptions his readers had – that the yardstick for good hitting simply boils down to home runs.

The Berri posts describe something similar. One of them is from a guest contributor, Ben Gulker, writing about how Rajon Rondo was not going to be selected for Team USA in the world championship because he doesn’t gather enough points. The other highlights how the perception of Bob McAdoo  changed as a function of the fortunes of his team. Interestingly enough, McAdoo became a greater point getter while becoming a less efficient shooter and turning the ball over more; at the same time, his reputation was burnished by the championships his teams won.

The story has been told many times by Berri. It seems that in general, basketball writers and analysts define good players as those who score points (in the literal sense, regardless of shooting percentage) and who played on championship teams. There are several problems here. Point getting must take place in the context of a high shooting percentage. One must not turn the ball over, one must rebound, one must not commit an above average number of fouls, and one should hopefully get a few steals and blocks. I don’t think anyone would disagree that such a player is a complete player and ought to be quite desirable, regardless of how many championship rings he has or if he scores only 12 points a game. Berri has examined this issue of yardsticks, and he has found that what sports writers, coaches, and GMs think of players has an extremely high correlation with, simply, how many points they get (this is shown by what the writers write and how they vote for player awards, how often coaches play someone, and how much GMs pay players.) The verbiage written about defensive prowess and the “little things” is ignored when the awards are given and fat contracts handed out. Point getters get the most accolades and the most money.

And the other point is how easily point getters reflect the luster of championships. Never mind that no player can win alone; this again is an example of how people not only end up with unspoken yardsticks, but also choose a frame of reference without analyzing if it is the correct one. The reference point is a championship ring. As has been documented, championships are not good indicators of good teams. The regular season is. This is simply due to sample sizes. More games are played in the regular season. Teams are more likely to arrive at their “true” performance level than in a championship tourney with a variable number of games – and frankly where streaks matter. A good team might lose four games in a row, in the regular season, but they may lose only 10 for the year. In a tournament, they would be bounced out if they lose four in a series.
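The sample-size argument can be made concrete. Suppose, hypothetically, that the better team wins any single game against its opponent 60% of the time and that games are independent (both are my simplifying assumptions). The Python sketch below computes how often that team actually takes a best-of-7 series.

```python
from math import comb

def series_win_prob(p, wins_needed=4):
    """Probability that a team winning each game with probability p takes a
    first-to-`wins_needed` series (best-of-7 by default), games independent."""
    q = 1 - p
    # The team wins the clinching game after losing k of the earlier games.
    return sum(
        comb(wins_needed - 1 + k, k) * p**wins_needed * q**k
        for k in range(wins_needed)
    )

p = 0.60  # hypothetical single-game edge for the better team
better = series_win_prob(p)
print(f"better team wins the series:    {better:.3f}")
print(f"weaker team advances anyway:    {1 - better:.3f}")
```

Under those assumptions the weaker team still advances in a bit under 3 series out of 10, so a ring tells you less about who was best than a long regular season does.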

In this context, the Premier League system in soccer makes sense. The best teams compete in a regular season; the team with the best record is the champion. So people who assume that a point-getter who plays on a championship team is better than a player on a non-champion team who shoots efficiently (but with fewer points), rebounds, steals, blocks, and does not turn the ball over more than average make two errors. They selected the wrong metric twice over.

With that said, I could only have made that point because of newer metrics that provide another frame of reference. Moreover, the new metrics tend to have improved predictive abilities over simply looking at point-getting totals. Among the new metrics, there are some that show a higher correlation with the scoring difference (and thus win/loss record) of teams. It doesn’t matter what they are, but an important point is that one can derive these conclusions about which metric is better or worse.

This is the main difference between scientific (in which I include athlete productivity analysis) and lay discourse. In the former, the assumptions are laid bare and frame the discussion. A good scientific paper (and trust me, there are bad ones) makes excruciatingly detailed descriptions of controls, the points of comparisons, any algorithms/formulae, and how things are compared. In lay discourse, this isn’t the standard one would use, because communicating scientific findings to other scientists uses a stylized convention. Using such a mode of communication with friends would make one a bore and a pedant – not to mention one would become lonely real quick.

I read Bill Simmons’s The Book of Basketball. I enjoyed his book, as it is a fun survey of NBA history. The book isn’t just a numbers game or just breaking down plays. It includes enough human interest elements that it should appeal to a casual fan or indifferent parties (like me; I can count the number of basketball games I’ve seen – TV or live – on both hands.) Simmons does a fantastic job of conveying his love of basketball. For me, he really brought different basketball eras to life, inserting comments from players, coaches, and sportswriters. He also seems fairly astute in breaking down plays and describing the flow of the game.

Yes, I bought the book because I like Bill Simmons’s writing. If you enjoy his blog, you will find that same breezy conversational style here. The man has a gift for dropping pop culture references and making them germane to his arguments. But what I like most is that he is earnest in trying to understand and to make his readers appreciate the people who play a game for a living.

His segment on Elgin Baylor was moving, in showing how racism affected this one man; in some ways, it was probably more effective than if he just talked in general terms about the 1960’s. His whole book works because it stays at the personal level. Even in his discussion of teams and individual players, he takes pains to discuss how this person was and is regarded by his peers and teammates.

In this way, I think Simmons did a fantastic job of making a case that basketball can contain as much historical perspective as baseball. This is something that should not have to be argued. Baseball has a lock on “the generational game by which history can be measured” status. What seems important is that there are human elements that make it accessible between generations: things like fathers taking their sons to the games, talking about the games and players, the excitement of watching breathtaking physical acts that expand how one views the human condition, and the joy and agony of championship wins and losses. While baseball’s slow pace lends itself to the way history moves (periods where nothing seems to happen punctuated by drama), it doesn’t mean other sports happen in a vacuum. Style of play, the way the players are treated, and the composition of the player demographic all reflect the times. These games can be a reflection of society, and one can see the influence of racial injustice in something as mundane as box scores as integration occurred.

Simmons blends basketball performance, its history, and its social environment effectively; some examples can be found in his discussion of Dr. J, Russell, Baylor, Kareem, and Jordan. In discussing why there probably won’t be another Michael Jordan (or Hakeem, or Kevin McHale), he takes inventive routes. Most of his points relate to societal/basketball environment pressures. Players are drafted sooner, the high pay scale for draft picks lowers their motivation to prove their worth, and perhaps society itself would actively discourage players from behaving as competitively as Jordan did. I suppose it’s interesting, but I’m not sure if that matters so much if the player is perceived to be an excellent player. Regardless, it seems to me that Simmons has been thinking about these things for some time. And I found it fun to read his take on basketball.

And I liked this book because it gives the lie to the weird view that someone who hasn’t done something cannot make reasonable, intelligent statements about it. Simmons wasn’t a professional basketball player, but he certainly uses every resource available to absorb the history and characters populating the game. He read a fair bit, he watched and rewatched games, he talked to players, he talked to people who covered basketball and he watched some more.  And he isn’t afraid to raise issues that occur to readers; you’ll see what I mean when you read his footnotes.

The book (and his podcast) confirms my opinion of Simmons as the smart friend who’d be a blast to have (one who bleeds Celtics green, watches sports for a living, and must keep up with Hollywood gossip, gambling, and pop culture because it gives him ammunition for columns).


There are some issues with the book, mainly in how statistical analysis of basketball is portrayed. I should be upfront and say that these issues did not detract from his arguments (for reasons that will be clear later), but I wish he would reconcile eyeball and statistical information.  And because I’ve decided one focus of this blog should be how non-scientists deal with science (and scientists), I thought I should offer some thoughts on some of these issues.

I am somewhat undecided about how Simmons (and I suppose I am using him as a proxy for all “non-scientists”) actually feels about statistics. He claims that team sports like basketball and football are fundamentally different from baseball; the team component of the former increases the number of additive and subtractive interactions while the latter game is composed of individual units of performance. Thus the increase in complexity makes it difficult to model. So he discards so-called simple measures of NBA player performance like WP48, PER, and adjusted plus-minus.

His rationale is that these indicators ought to back up existing observations about NBA players. So Kobe Bryant needs to be ranked as a top-20 player of all time (WP48 ranks Bryant as a superior player – like Paul Pierce – and not a step or two behind Michael Jordan.) It seems like he wants statistics to tell him what he wants to hear, when in fact statistics help you see things you would otherwise miss.

But then that leads to my second point about Simmons: why does he need the model to back up his mental model of player performance? Put differently, why is it that he cannot accept differences in rankings calculated by some turn-the-crank-spit-out-value model? I think Simmons lacks a nuanced view of how these numbers ought to be interpreted, and that he refuses to see that a simple model can capture a great many things about a complex system. Sure, once you’ve set up your criteria (like some level of significance you are willing to accept), you align everything by it, but there is room for some judgement as to where that line is drawn.

Another way of describing a complex system is to say that there are many things going on at once, and they are all interacting in some way. There are 10 players on a basketball court. One player, with the ball, has options to pass, to shoot, or to move the ball. Within each of these options, he has a set of suboptions: which one of the other four guys do I pass to? Who’s open? Which open player has a good shot from where he is? Am I in my optimal position to shoot? Do I need to drive to the basket or kick the ball out to the perimeter? There are many more possibilities than these.


At one level, Simmons is right; it is useful to break things down into “hyperintelligent” stats – identifying the tendencies of a player (whether he likes breaking to his left or right when he starts driving from the top of the key, whether he is equally good in shooting from his left or right hand, how often he does a turnaround, fadeaway, or drives to the hoop), trying to figure out how many forced errors a defender creates, how often unforced turnovers happen (like someone dribbling off his foot), how many blocks get slapped out of bounds vs being tipped to get possession, and so on.

But isn’t it just as intelligent to find an easy way of collapsing the complex game into a simple “x + y” formula? On several occasions, Simmons uses a short quote (and praises the person who said it) that captures everything he wanted to say in 15 pages. A simple model is analogous to that short quote.

More importantly, what if we didn’t need all these hyperintelligent stats to capture the essence of the game?

I just switched the problem from one of identifying player performance and productivity to one that captures the game in broad strokes. The two ideas are of course related but still distinct and should not be confused to mean the same thing.

This gets back to the original motives of the person who does the modeling.

If it’s a scientist or economist, I’ll tell you now that he is interested in getting the most impact with the least amount of work. He probably has to teach, run a lab/research program, and write grants and publications. He doesn’t have time to break game film down. And he certainly does not have the money to hire someone to look at game film (although I am sure he’ll have no lack of applicants for the job.) He spends his money finding people to do research and teach. If his research program is into finding ways to measure worker productivity, he will probably start with existing resources. So fine; he now has a database of NBA player box scores.

He’ll want to link these simple measures of player output to wins and losses. But players score points, not wins, and thankfully the difference in points scored and points given up correlates extremely well with wins and losses.

From there, it is relatively simple to do a linear regression for all players for all teams, finding how each of the box score stats relate to the overall points scored for each team. And as noted, some metrics have a higher correlation to the point difference (I will not use the term differential to mean difference; differential belongs to diff EQ’s.) Regardless, it seems an affliction for males that they rank things; so the researchers have these numbers, and it’s trivial to list players from high to low.
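Here is the simplest version of that step as a Python sketch: a one-variable least-squares fit of season wins against average point difference. The six teams are made up by me for illustration, and this is not Berri's actual model, which relates many box score statistics at once; it only shows the mechanics of fitting a line and checking the correlation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

# Made-up teams: average point difference per game, and season wins.
point_diff = [-6.0, -3.0, -1.0, 1.5, 4.0, 7.0]
wins = [24, 33, 38, 45, 52, 60]

intercept, slope = fit_line(point_diff, wins)
r = correlation(point_diff, wins)
print(f"wins = {intercept:.1f} + {slope:.2f} * point difference")
print(f"correlation: {r:.3f}")
```

Once a fit like this exists, each box score stat can be valued by how much it moves the point difference, and ranking players from high to low is the trivial last step.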

Now, here’s another consideration. In this, and in other branches of science, the data are not “clean”. That is, we scientists (generally) assume that the phenomenon we are observing conforms to a “normal” distribution – that is, there is some true state for the thing we observe (found by taking the average of our observations) and the individual pieces of observation hover around this true state (or average). So there is variation around the mean.

In my research, for example, I can measure neural responses in the olfactory bulb. I use optical indicators of neural activity; essentially, the olfactory bulb lights up with odor stimulation. The more the neurons respond, the brighter things get. The olfactory bulb is separated into these circular structures called glomeruli. Each glomerulus receives connections from the sensory neurons situated in the nose and the output neurons of the olfactory bulb (some other cells are also present, but they aren’t important for this story.)

When a smell is detected by humans (or animals and insects), what we mean is that some chemical from the odor source has been carried, through the air, into the nose and neurons become active (they fire “action potential spikes”). And the pattern of this activity, at the olfactory bulb, is quite similar – but not exactly the same – from animal to animal.

Sometimes, we see fewer responses to the same smell. Other times, we see a few more responses. Sometimes we see a different pattern from what we expect. Sometimes, we see no responses. This might happen once every 15 animals. Not a whole lot to take away from our general, broad stroke understanding of how this part of the brain processes smell information. In most cases, some of these things might be explained technically; the animal was in poor health, or our stimulus apparatus has a leak, or the smell compound is degraded. We know this because we can improve the signal by fixing the equipment or giving the animal a drug to clear up its nose (mucus secretion – snot! – is a problem).

And as a direct analogy to this WP48 vs “hyperintelligent stats” problem, we find that a complex smell (composed of hundreds of different chemicals) may be “recreated” by using a few of these chemicals. There is good empirical evidence this is the case: prepared food manufacturers and fragrance makers can mimic smells and flavors reasonably well. This is akin to capturing the essence of the smell (or sport) with a few simple chemicals (or box scores). And generally, we don’t even need people to describe to us what they smell to figure this out (i.e. break down game film to create detailed stats). We can simply force them to answer a simple question: do these two things smell the same to you, yes or no? Thus “complex” brain processes and decision making can be boiled down into forced-choice test results. Do we lose information? Yes, but everyone realizes this is a start. As we know more, and new technology becomes available, we can do more and ask more with less effort. Then we will be able to better use the information we have. As far as I know, most statheads have access to box-scores (although there is nothing to stop them from breaking down game film aside from time and money issues.)

But that’s the broad strokes view. If we get into details (that is, as if we started working with the “hyperintelligent” stat breakdowns), we find that of course there is more going on, and that the differences we see are not only technical issues. For example, the pattern of activity we see differs slightly from animal to animal, but this is because the cells that form connections with the olfactory bulb do not hit the same spot. And if we can use a single chemical to recreate a smell, the smell itself is still different enough that humans generally can tell something is missing. So the other chemicals are in fact detected and contributing some information that the brain uses to form the sensation of smell. And we know that the way neurons respond to a single chemical differs from how they respond to a mixture, confirming that there is in fact additional information being transmitted.

The important point is that the simple model captures an important part, but not all, of the complex system. One problem that can occur with increasing the complexity of models is overfitting: the model becomes applicable to one small part of the system rather than the whole. Even game film breakdown hinders if it gives you so many options that you are back where you started. You’d probably avoid focusing on rare events and just concentrate on the things that happen often – which, again, is the point of a simple model.

The intense break down of game film to provide detailed portraits of player effectiveness could be combined with the broad strokes analysis. A metric like WP48 can tell a coach where a player is deficient. The coach can use the detailed breakdown to figure out why the player isn’t rebounding, passing, shooting well, and so on. That’s where things like defensive pressure, help defense, and positional analysis can be used for further evaluation. And I’m not sure if stat heads argued otherwise.

Deficiencies of statistical models

As in the things that models explicitly ignore.

One thing statistical models do not address is the fan’s enjoyment of a player. Actually, I suppose one might be able to simply chart the percent-capacity of stadiums when a particular player comes to town, but that’s something I don’t think Simmons would argue. There’s something to be said about how a player scores: Simmons pays tribute to Russell and Baylor, the first players to make basketball a vertical game. He cites Dr. J. as introducing the urban playground style into basketball. He loves talking about the egos of players, especially when players take MVP snubs personally and then dominate the so-called MVP in a subsequent game.

Simmons also offers a rebuttal to PER, adjusted plus/minus, and “Wages of Wins” metrics in his ranking of Allen Iverson – by saying that he doesn’t care. It’s sufficient for him that he finds Iverson a presence on the court. His emotions are acted out as basketball plays. He finds Iverson’s toughness and anger on the court fascinating to watch.

But Simmons does use metrics: the standard box scores. I would ask this: if Iverson didn’t score as much as he did, would Simmons still care? As Berri has noted, the rankings by sportswriters, the salaries given to scorers, and PER rankings all correlate highly with volume scoring (i.e. the points total, not field-goal percentage). Despite the tortured arguments writers might make, and the lip service given to building a lineup with complete players, “good” players are players who score a lot.

However, I should be clear and say that Simmons’s approach does not detract from his defense of his rankings. He uses player and coach testimonies, historical relevance, visual appeal of their playing style, sports writers, and the box scores to generate a living portrait of these players as people. Outside of the box scores, there is enough grist for the mill. I would suggest that it is these arguments that make the whole argument process fun. Even in baseball, supposedly the sport with the most statistically validated models of player performance (and Berri would argue that basketball players and their contribution to team records are even more consistent), there are enough differences of opinion concerning impact, playing styles, and relevance to confound Hall of Fame/MVP arguments (see Joe Posnanski).

Because Simmons is upfront about his criteria (even if the judgement of each might not be as “objective” as a number), it is fine for him to weigh non-statistical arguments for greatness. It’s how he defined the game. Just as Berri defined “player productivity” in terms of his WP48 metric. Because Berri publishes in peer-reviewed journals, he needs methods that are reproducible. Science, and in general the peer review process, is a different process than writing books or Hall-of-Fame arguments or historical rankings. The implicit understanding of peer-review is that the work is technically sound and reproducible. Berri cannot take the chance of publishing a Simmons-like set of criteria and having other sports economists “turn the crank” and come out with different rankings. But Berri can publish an algorithm, and proper implementation will yield the same results.

Does this mean that Berri is right? Or that a formula is better than Simmons’s criteria? Mostly no. The one time where it is “better” is when one is preparing the analysis for peer-review. In this case, it is nicer to have a formula, or a process, or a set of instructions, that yields the same result each and every time the experiment is run. In other words, we try to remove our bias as much as possible. Bias here does not mean anything pernicious; it just is a catch-all term for how we think a certain way (with our own gut feelings about the validity of ideas and research direction). Being objective simply means we try to make sure that our interpretation conforms to the data, and that the work is good enough so that other researchers come to the same general conclusions.

I think Simmons actually doesn’t need to trash statistics, nor does he need to ignore them. Once he establishes ground rules, he can emphasize or deemphasize how important box scores are in his evaluation. As it is, I found his arguments compelling. His strength, again, is to make basketball history an organic thing. He does his best to eliminate the “you had to be there” barrier and tries to place the players in the context of their time.

Now, one might ask why stats can’t be used to resolve these arguments about all time greats. Leaving aside the issue of the different eras (and frankly, this can be addressed by normalizing performance scores to the standard deviation for a given time period, as Berri does here), there is the issue of what the differences in these metrics mean. In the same article I cited, Berri reports that the standard deviation for the performance of all power forwards, defined by his WP48 metric, is about .110. His average basketball player has a WP48 of .100. Kevin Garnett, for example, has a WP48 (2002-2003) of 0.443. That translates roughly to Garnett being more than 4x as productive as an average player; measured in standard deviations, though, he sits a little more than 3 above the average.
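For anyone who wants to check the arithmetic, here is a small sketch using only the figures quoted above. Treating “normalizing to the standard deviation” as a plain z-score is my assumption; Berri’s own adjustment may differ in its details (for instance, in which average he subtracts).

```python
# Standardize a WP48 score against the average player.
# The figures are the ones quoted in the text; reading "normalizing"
# as a simple z-score is an assumption made for this sketch.

def z_score(value, mean, std_dev):
    """How many standard deviations `value` sits above (or below) `mean`."""
    return (value - mean) / std_dev

AVERAGE_WP48 = 0.100   # average player, as quoted
STD_DEV_WP48 = 0.110   # spread among power forwards, as quoted
GARNETT_WP48 = 0.443   # Kevin Garnett, 2002-2003, as quoted

ratio_to_average = GARNETT_WP48 / AVERAGE_WP48
garnett_z = z_score(GARNETT_WP48, AVERAGE_WP48, STD_DEV_WP48)

print(f"raw ratio to the average player: {ratio_to_average:.2f}x")
print(f"standard deviations above average: {garnett_z:.2f}")
```

The two numbers answer different questions. The ratio says how much more Garnett produces than an average player; the z-score says how unusual that is given how much players ordinarily vary.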

But how different from Kevin Garnett is a power forward with a WP48 of 0.343? One might interpret this to mean that Garnett is still nearly 1 standard deviation better than the other player, but it could also mean that their performances fall within 1 standard deviation of each other. Depending on the variation of each player’s performance for a given year, compared to his career mean, they could be statistically similar. That is, the difference might be accounted for by the “noise” in slight upticks/downticks in rebounds/assists/steals/turnovers/shooting percentages/blocks. If you prefer, how about the difference between a .300 hitter and a .330 hitter? Over 500 at-bats, the .300 hitter has 150 hits, and the .330 hitter has 165; the difference would be 15 hits over the course of a season. Are the two hitters really that different? The answer would depend on the variability of batting average (for the compared players) and how these numbers look with a larger sample set (i.e. over a career with over 5000 at-bats, for instance.) The context for the difference must be analyzed.
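The batting average question can be roughed out in a few lines. This is a sketch under a strong assumption of mine, not something the argument above depends on: each at-bat is treated as an independent trial with a fixed chance of a hit, so the spread of an observed average follows the binomial formula. Real hitters are streakier and face different pitchers, so take the output as a ballpark figure.

```python
import math

# Is a .300 hitter distinguishable from a .330 hitter?
# Assumption for this sketch: at-bats are independent trials with a fixed
# hit probability, so each observed average has binomial sampling noise.

def std_error_of_difference(p1, p2, at_bats):
    """Standard error of the gap between two observed batting averages."""
    var1 = p1 * (1 - p1) / at_bats
    var2 = p2 * (1 - p2) / at_bats
    return math.sqrt(var1 + var2)

def gap_in_std_errors(p1, p2, at_bats):
    """Observed gap divided by its standard error (a rough z statistic)."""
    return abs(p2 - p1) / std_error_of_difference(p1, p2, at_bats)

season = gap_in_std_errors(0.300, 0.330, 500)
career = gap_in_std_errors(0.300, 0.330, 5000)

print(f"one season (500 at-bats): gap is {season:.2f} standard errors")
print(f"career (5000 at-bats):    gap is {career:.2f} standard errors")
```

Under that assumption the one-season gap comes out at roughly one standard error, which is hard to tell apart from noise, while the same gap held over 5000 at-bats is more than three standard errors. Same 30 points of average, very different levels of confidence.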

Here’s another example: let’s assume that Simmons and Berri’s metric turned out similar listings, perhaps with different order (one difference is that Iverson would be nowhere near Berri’s top 96.) And further, let us assume that the career WP48 scores are essentially within 1.5 standard deviations of one another. How might Simmons break with the WP48 rankings?

Let us tackle how Berri would have constructed his ranking: he would simply list players from highest to lowest WP48. That’s probably because he is in peer-review article mode. And frankly, if you profess to have a metric, why would you throw it out? You might if, like Simmons, you defined the argument differently. For his Pyramid of Fame rankings, he lists a few arguments that do not encompass basketball productivity. Again, the idea of historical relevance, player/coach testimony, and the style and flair of the players enter into Simmons’s arguments. So all things being equal, and if the difference in rankings by metric is slight, there really is no reason to weigh the statistics more than any other attribute. Heck, even if the metric differences are large, it wouldn’t matter. Simmons likes his other arguments more anyway.

But if you do talk about the actions on the court, then I believe you are in fact constrained. Of the metrics I had mentioned, WP48 offers high correlation with point-difference and thus with win-loss records. Further, some of the other metrics actually correlate with points-scored by players, suggesting that there is no difference between that metric and simply looking at the aggregate point total. So there are actually models that do reasonably well in predicting and “explaining” the mechanics of how teams win and lose.

In a way, I think the power of a proper metric is not in ranking similarly “productive” players, but in identifying the surprisingly bad or good players. Iverson is an example of the former; Josh Smith (of the 2009-2010 Hawks) of the latter. It might not be as powerful a separator of players with similar scores, because their means essentially fall within 1 standard deviation of one another; in essence, they are statistically the same. In this case, it helps to have other information to aid evaluation (and this isn’t easy; as Malcolm Gladwell has written, and Steven Pinker has taken issue with, some measuring sticks are less reliable than others.)

Another example where statistics is powerful is in determining, in the aggregate, if player performance varies from year to year. Berri found that it doesn’t, suggesting that the impact of coaching and teammate changes may not be as high as one thinks. However, such a finding in no way precludes coaches and teammates from having an effect on a given player. It just means that these cases are too few to affect the mean. Or perhaps it suggests that coaches are not using information properly to make adjustments that are meaningful to player performance. Overall, I suppose, one reason Simmons hates advanced stats and rankings is that he isn’t sensitive to the importance of standard deviation and, ironically enough, he applies the mean tyrannically even though there is such a concept as statistical insignificance.

But Berri has never pushed his work as a full explanation of the game of basketball. First, he doesn’t present in-game summaries: he only looks at averages over time. There’s nothing in his stat to indicate the ups and downs (i.e. standard deviation in performance) a player experiences from game to game. Even in baseball, hitting .333 does not guarantee a hit every 3 at-bats. It just means that over time, a hitter’s hit streaks and lulls add up to some number that is a third of his at-bats. Berri’s metric (and any other work that proposes to measure player performance) certainly cannot predict what a given box score would be, for a given game, for a given player.
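A quick calculation shows how loose the link between a season average and a single game can be. The sketch below assumes (my simplification) that every at-bat is an independent trial with a 1-in-3 chance of a hit, and asks what can happen in a stretch of three at-bats.

```python
from math import comb

# A .333 hitter does not get exactly one hit every three at-bats.
# Assumption for this sketch: at-bats are independent, each with a
# fixed 1-in-3 chance of a hit.

def prob_hits(k, at_bats, p):
    """Binomial probability of exactly k hits in `at_bats` tries."""
    return comb(at_bats, k) * p**k * (1 - p) ** (at_bats - k)

P_HIT = 1 / 3
for hits in range(4):
    print(f"{hits} hits in 3 at-bats: {prob_hits(hits, 3, P_HIT):.3f}")
```

Under this model the “expected” outcome of exactly one hit turns up in only 12 of every 27 three at-bat stretches, and the hitter comes up empty in 8 of every 27. The average describes the long run, not any given night.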

Regardless, I do not see a problem with Simmons’s ranking his players. Simply, he values entertainment value as much as production. I would say he values the swings in performance just as much, if not more (more on this later). Yes, he says stats do not matter, but of course they do. It’s interesting that the scoring lines he cites in admiration all lead with a high score or score per game. And if you can’t shoot, rebound, pass, steal, or block, and you cough the ball up a lot, it wouldn’t matter how pretty you make everything look.


Joe Posnanski has pointed out that, whenever someone trashes stats, he tends to offer some other supplemental numbers that back up his point. In other words, the disagreement isn’t about statistics per se, but about the distinction between “obvious” stats and “convoluted” stats.

Even if one disagrees with basketball statistics, at least he can believe that statheads came up with a formula first and turned the crank before comparing the readout with their perceptions of players. Hence Simmons blowing up when PER or WP48 doesn’t rank his favorites highly.

Simmons approaches this from the opposite direction. He has an outcome in mind and “builds” a stat/model to fit it (like his 42-Club). But he mistakes his way of tinkering with what modelers actually do. Berri arrived at his model by performing linear regression on a particular box score and seeing whether the point-difference increased. It isn’t an arbitrary way of deriving some easy to use formulation. The regression coefficients are meaningful: what they say is that if you increase shooting percentage by this amount, the point-difference goes up by that amount. It so happens that points scored by a player did not increase the point-difference. And he built it by using all players; it’s strange to decide beforehand which players are great, and then build a metric around that. Why even bother in the first place?

And for Berri to report differently on these aggregate data because Kobe isn’t ranked any higher would actually be scientific fraud. But as I noted above, applying these WP48 rankings isn’t as hard and firm a process as Simmons thinks. There is some room for flexibility, depending on what one tries to accomplish.

In general, I agree that more breakdowns of the game would be useful, in the sense that more data is always nice. The problem, for academics, is that these stats might remain proprietary, and it becomes difficult to apply them across all teams. Even if we could get all the “hyperintelligent” stat breakdowns from a single team, it is unclear if other teams would view the breakdown in the same way. The utility for examining general questions about worker (i.e. player) productivity for academic publication becomes less clear. The database ought to help the teams – assuming they are intellectually honest enough to verify that their stats produce a better picture of player productivity and aren’t impressed by the gee-whiz-ness of it all. My guess is that they won’t be entirely successful, as Simmons still has a job trashing bad GM decisions.

Standard Deviations

Why I watch sports: it seems to be similar to the way Simmons does. He watches over a thousand hours of sports each year, waiting for the chance to see something he has never seen before. Something that stretches the imagination and the realm of human physical achievement.

I feel the same way; I am team and sport agnostic, and although I used to follow Boston Bruins hockey religiously, I left that behind in high school. Although I have lived in Boston from the age of 7 onwards, I had not been infected by the Red Sox or Celtics bug (even during their mid-80’s run). I did root for the Red Sox in 2003 and 2004, but that was because of the immense drama involved in the playoff games against the Yankees. And Bill Simmons’s blog for the season.

Perhaps I prove Simmons’s point about stat heads; I like to say that I am interested in sports in the abstract. I like the statistical analysis for the same reason Dave Berri had pointed out in his books. There is a wealth of data in there to be mined. I thought one good example of the type of research that can come from these data is finding evidence for racial bias in the way basketball referees call games.

However, what got me interested in watching professional sports was Simmons writing about it. Although I didn’t watch football, basketball, or baseball for a long time, I did watch the Olympics and, believe it or not, televised marathons. Partly it was because my wife and I were running, but mostly I saw the track and field type sports as a wonderful spectacle. So it wasn’t that much of a stretch to fall into a stereotypical male activity.

At any rate, I was amazed at Usain Bolt’s performance in the 2008 Summer Olympics. I was disappointed by Paula Radcliffe injuring herself during the Athens Olympics, and then relieved when she won the NYC marathon, setting a new speed record in the process. I rooted for Lance Armstrong to win his seventh Tour. I rooted for the Patriots to get their perfect season. And until the Colts lay down and the Saints lost a couple of weeks ago, I wanted the Colts and the Saints to meet in the Super Bowl, both sporting 18-0 records. I was glad that the Yankees won the World Series, and with that fantasy baseball lineup, I hope they continue to win. I want to see the best teams win, and win often. And yes, I wish the regular season records lined up with the championship winners for a given season. Then we wouldn’t have arguments about best regular season records and the championship winners.

This isn’t because I’m a bandwagon fan; I watch sports now for the same reason that Simmons does. To see the best of the best do great things. But not always, because they might have a competitor who wants it more, leading to the best failing at times. This drama is the power of sports.

And I can see why Simmons argues so passionately against stats. He likes the visceral impact of sports. I can say that Bolt ran a 9.69s 100 m. But it was nothing compared to seeing Bolt accelerate, distance himself from the other runners, and then slow down as he pulled into the finish line. He blew away the competition. My eyes were wide and my mouth hung open: he slowed down! And he was 2 strides ahead of everybody. And he set a new record. Even if Bolt didn’t set the record, he still made it look easy. On the field, on that particular day, he out-classed his competitors. It is watching the struggle of the competitors (like Phelps winning the 100m fly by 10 milliseconds), on that day, that matters. Over time, if one didn’t watch that particular heat, then the line World Record: Usain Bolt, 100 m, 9.69s doesn’t quite hit you the same way.

But then, there is this. What if instead of looking at the single race, you looked at the athlete performing in 8 or 20 or 50 events for a year? And at these events, the same set of athletes compete over and over?

Here are some possible outcomes: Phelps and Bolt lose all of their other meets, essentially giving us a single transcendental moment. Phelps and Bolt win half their meets. Phelps and Bolt utterly dominate the field, winning 65% or more of their meets.

For the first case, we would probably admit that the Phelps and Bolt phenomenon was a one-off. For whatever reason, the contingencies (no sports gods or stars aligning here!) lined up such that they did highly improbable feats (but not impossible. This distinction is the point of this section.) The third case proves our point; they are not perfect, but they sure are good. The second case is a bit trickier: since they are right on the borderline, we need some analysis to help us decide. One way might be to sum up our individual observations about these two. Being .500, while giving us a single breathtaking moment, might be persuasive. Or one might look at how everybody else did (Phelps and Bolt might have won 50% of the time, but if the remainder is split among their competitors, they have still dominated the field.)

But then what if Bolt and Phelps won 49% of the time, and some other competitor won 50% of the time? What then? Here, criteria are important. Most of the time, we say better meaning, well, something is better. Generally, we aren’t specific about what we mean by it.

In the book, Simmons ranks his top 96 players in a pyramid schematic. He is rather specific about what he wants in a player. And as one expects, he is specific about the types of intangibles his basketball player should have (basically, basketball sense – i.e. The Secret, if he made his teammates better, winnability, and if you choose someone based on “if your life depended on this one guy winning you a title.”) The evaluation of those intangibles, however, is not as precise as he’d like. However, the advantage here is that one might be able to answer “why” questions. In some cases, Simmons seemingly ranked two players differently while giving them the same arguments (like the consistency of Tim Duncan and John Stockton. Somehow, Stockton just rubbed Simmons the wrong way, while Duncan’s consistency makes him the seventh best player of all time.) And his emphasis on projecting Bill Russell’s game into the modern era made it seem like Russell should have ranked lower. On occasion, I was left with the feeling that the arguments did not match the ranking. From what he said about the stat inflation and how Wilt didn’t get the secret, I thought Wilt would be ranked lower than 6.

Dave Berri has the opposite problem: he has a mathematically defined metric and when he says better or worse, it’s whether this metric is higher or lower between the players being compared. He can further break down this stat to show where a player is good or deficient (whether shooting percentage, blocks, turnovers, fouls, steals, and assists are above or below the average). He can tell you the hows, with his model spitting out a number that combines these different performance stats into a metric of productivity. But he simply ranks players numerically, without talking about what the differences between the players look like (and one might not be able to see them… it could be one more missed shot or one less rebound every couple of games.)

I am amazed that Simmons cannot reconcile eyeball and statistical information. Just about every time Simmons bitches out scorers, he talks about how this player didn’t get “The Secret”. It isn’t about scoring; it’s about having a complete game. It is about making the team better with the skills you have. To top it off, Simmons then says that point getters are one dimensional. You can’t shy away from rebounds. It’s great to have a few steals/blocks. Sure, not every athlete can do it all, and certainly not be as prolific as superstars, but you can’t avoid doing those things.

I’m sure Berri is nodding his head, agreeing with Simmons. Point getting isn’t the same as being an efficient shooter (at least average field goal and free throw percentages). And you certainly can’t be below average in the other areas if you want to help your team.

But Berri generally writes about the average. Simmons focuses on the standard deviations. He doesn’t just care about the scoring line; he focuses on Achilles-wreaking-havoc-on-the-Trojans type of performances. He loves the stories of Jordan’s pathological competitiveness. In other words, Simmons lives for the outlier moments.

And I think therein lies the nutshell (and to borrow a Simmons device, I could have said this 5500 words ago and shortened this review.) Simmons views the out-of-normal performance as transcendent, as examples of players who wanted something more or had something to prove. He treats the extreme as something significant; he uses a back story to it to give the event meaning. That’s fine. It’s also fine when Berri (and stat heads) are constrained in treating outliers as noise (possibly) or irrelevant to the general scope of the model, if they desire a model of what usually happens and are not concerned with doing the job of a GM and a coach for free. Because they both defined the game they wish to play in.

I swear I never meant for this blog to focus so much on sports. But Dave Berri has a post that dovetails neatly with some thoughts I have regarding experts, expertise, and how the public should handle them. I think it can be interesting to approach science issues from the side, rather than head on. Specifically, three authors (Berri, Malcolm Gladwell, and Steven Pinker), all of whom I admire, have had a minor verbal tussle about the issue of expertise.

First, a digression. I was already going to comment on the interface between experts and laymen. The original impulse came about because I just finished reading Trust Us, We’re Experts! by Sheldon Rampton and John Stauber. Like books of this ilk, the authors spend many chapters recounting the failures of authority figures and the exploitation of these failings by people who follow the profit motive to an extreme degree. Although the title hints at a broadside against arrogance of scientists, it really is about the appropriation of the authority, rigor, and analysis of science to sell things. The targets of this book are mainly PR companies and the corporations that hire them. There are also a few choice words for scientists who become corporate flacks.

The book lacked in presentation, mostly because the authors avoided analyzing how one can tell good from bad science. The presentation leans on linkages between instances of corporate malfeasance; there is no analysis and data on how many companies engage PR firms in this. There is no analysis on the amount of research from company scientists versus independent ones. The authors focus on motives of corporate employees, but somehow ignore the possibility of bias within the academy. There is no attempt to identify if and when corporate research can be solid. In broad brush strokes, then, chemists who discover compounds with therapeutic potential are suspect; the same people working in academia (and presumably someone who will not capitalize on this finding financially) can be trusted.

This is actually a huge problem in the book; one of the techniques that Rampton and Stauber document is the use of name-calling (good old-fashioned “going negative”: ironically enough, the PR firms would simply label all opposition as junk science) in describing research and scientists who publish findings contrary to whatever corporations happen to be pushing. But by avoiding the main issue of identifying good and bad science, the two simply stitch together examples of corporate and public relations collusion. Now, the evidence they present is good; they hoist PR and corporate employees by their own petards, quoting from interviews, articles written for PR workers, and from internal memos. But the ultimate point here is that Rampton and Stauber simply tarnish corporate research because the scientists work for corporations. I believe this to be a weak argument and ultimately useless. One example I can think of is, what if two groups with different ideologies present contrary findings? Assuming that the so-called ‘profit motive’ is equally applicable to both, or not at all, then readers will have lost the major tool that Rampton and Stauber pushed in this book. But as I will show, the situation is not always as stark as, for example, corporate shills against academicians or creationists against biologists. There is enough research of varied quality, published by ‘honest actors’, to cause enough head-scratching about how solid a scientific finding was.

Let’s be clear, though. Of course the follow-the-money strategy is straightforward and, I would think more likely than not, correct. But that cannot be the only analysis one does; if the thesis is that PR firms use name-calling as a major tactic in discrediting good, rational, scientific research, it seems bad form to use funding source as a way to argue that investigators funded by corporations do bad research. It’s just another instance of name calling. I expected more analysis so that we could move away from that.

And that’s the unfortunate thing about a book like this; why wouldn’t I want a book that causes outrage? Why, in essence, am I asking for an intellectually “pure” book, one that deals with corporate strong arm tactics in a so-called more methodical, scientific way. Doesn’t this smack of the political posturing, where somehow a result matters less than the means – and no, I do not mean the ends justify the means. I am just pointing out that there might be multiple ways of doing something (like taking route A vs. B or cutting costs by choosing between vendor C and vendor D). Workplace politics might elevate these mundane differences into managerial warfare. Why should I care what the politics are, so long as it leads to a desirable end result?

One problem with a book like Trust Us is that it appeals to emotions with rhetoric, without a corresponding appeal to logic. I think including analytical rigor is important as it provides the tools for lasting impact. As it is written, the book (published in 2000) provides catchy examples of corporate malfeasance. The most basic motif is as follows: activists use studies that, for example, correlate lung cancer with smoking in order to drive legislation to decrease smoking. Corporations and interested parties attack by calling this bad science, by calling the researchers irresponsible, by calling the activists socialist control freaks who wish to moralize on an issue that is really a matter of personal choice. They have a considerable war chest for this sort of thing. Frankly, if that’s what Rampton and Stauber are worried about, then their focus should have been on the herd mentality of people, not the fact that PR firms use negative ads.

But that is only one weapon; the other weapon is the recruitment or outright purchase of favorable scientific articles. The example would be the studies published by scientists who work for tobacco companies, with the studies refuting the claims of the investigators. But Rampton and Stauber focus on simply pointing out that this favorable finding comes from researchers who are paid by Philip Morris. That’s nice, but how is this different from the name-calling Philip Morris engages in? The real issue is how one goes about identifying what bad research is.

They do throw a sop to analytical tools at the end of the book. The discussion is cursory; the focus is again on helping the reader dissociate the emotional rhetoric from the arguments (such as they are). The appeal is that the analysis is simple. Just question the motives of the spokesmen and experts. Worst of all, their discussion of the difficulties of science gives the impression that the whole enterprise is a bit of a crapshoot anyway. They point out that peer review is a recent phenomenon, that grant disbursal depends upon critiques from competing scientists, and that the statistically significant differences reported are, more often than not, mundane and not dramatic. Their discussion of p-values makes scientific conclusions sound like so much guesswork, rather than the end result of hard work. Day-to-day science isn’t as bad as the pair portray it.

It is a trick to take a broad question (“How does the brain work?”), break it down into a model (“Let us use the olfactory system as a ‘brain-network lite'”), identify a technique that can answer a specific question (“I wonder if the intensity of a smell is related to the amount of neural activity in the olfactory system? We expect to see more synaptic transmission from the primary neurons that detect ‘smells.'”), do different experiments to get at this single question, analyze the data, and write up the results.

Forget the fact that different scientists have different abilities to ask and answer scientific questions; nature doesn’t often give a clear answer. So yes, it is hard to get conclusive statements. To confound the issue further, even good research can have flaws, unclear experimental design, incorrect analysis, and distressingly minor differences between control and test conditions. Which leads us to the question, what exactly does good research look like?

I am not going to answer this now, and I can’t answer this. The blog will, eventually, attempt to deal with this very issue by presenting papers and research that I read about, in addition to book reviews. But my point here is that Rampton and Stauber didn’t address this issue either. The very end of the book is a populist appeal, one that emphasizes “common sense” over jargon and statistics. They even appeal to our civic duty, that we should become more politically active and associate with (my term, not theirs) “lay-experts”. At some point, however, even well-informed non-scientists and non-experts must have turned to experts for some original research. Rather than disregard that research, then, one must learn and gain a comfort level with parsing scientific literature.

It took a while, but we return to the Gladwell-Pinker-Berri flap. The setup is simple: Berri is a sports economist, specializing in creating models that predict athletic performance. However, he has tackled multi-player games (basketball and American football), which, presumably, would lead to complex models, or perhaps something computationally intractable. Surprisingly, he found that neither was the case. The important point this time is that he was able to show that where quarterbacks are selected in the NFL draft doesn’t fit with their performance (assessed using the Berri and Simmons QB Score metric). Gladwell wrote an essay that presented Berri and Simmons’s argument favorably. Pinker made a short comment refuting this, saying that QBs drafted high do have better performance.

Both Pinker’s and Gladwell’s review and response seemed snippy to me. But what I found interesting was that while Pinker questioned Gladwell’s ability as an analyst (while giving Gladwell the backhanded compliment that he is a rather gifted essayist, but not a researcher or analyst), Gladwell, in turn, questioned the background of Pinker’s sources. I think Gladwell’s highlighting the faults with the arguments was sufficient, as Pinker’s sources are somewhat weak. It really wasn’t necessary to impugn their background.

This is ironic, as Pinker raises some peripheral issues regarding Gladwell’s suitability in reviewing the research and observations from experts. Just as with Gladwell, I think Pinker gave a reasonable counter-argument to Gladwell’s generally gung-ho and favorable presentation of his subjects. For example, there is a flip side to imperfect predictors: while they may not be useful for predicting the most suitable candidates, they help to remove the worst ones from the pool, in a cost-effective way. That’s an interesting question, and I think one “system” that scientists can study to answer it is… sports (because of the wealth of performance data).

There really is no need to trash an expositor just because he is a better essayist than a scientist, for instance. Isn’t Gladwell in fact an expert in conveying novel research to the public (and effectively)?

In this case, I think both the “expert” and “lay person” gave a good accounting of their (intellectual) problems with the other. However, they both engaged in what amounted to look-at-the-source “analysis” (Pinker says Gladwell doesn’t know what he writes about. Gladwell trashes Pinker’s football sources for things they did, that are unrelated to football). The only thing the ad hominem attacks achieved was to raise the blood pressure of both participants.

Strangely enough, I find myself writing again about Bill Simmons. I found his latest article interesting and well thought out, with his conclusions generally supported by his arguments. So why am I writing? Simmons did a great job breaking down film and the problems with the type of statistics used. I took issue with the fact that he concludes this “proves” the lack of predictive power of statistics, when I thought he should have concluded that he used statistical and observational analysis correctly. Simmons missed a golden opportunity to show readers how to synthesize statistics and small-sample observations.

The setup: Week 10, Patriots at the Colts, with the Patriots leading 34-28. The Patriots had the ball on their 28 yard line, 2 min 3 s left to play, and it was 4th-and-2. Belichick decided to go for the first down rather than punting. There might have been some issue with the ball being spotted in the wrong place, but essentially, the Colts stopped the Patriots. Turnover on downs. The Colts scored on their series, after dragging out the clock, and won the game by a point.

First, Simmons does what I like sports writers to do: combine on-the-field observation with the context of what one usually sees from football teams, in the aggregate (i.e. some group analysis, which usually does mean statistical analysis). I happen to think his argument against not-punting, in this specific play, is stronger than, for example, Joe Posnanski’s and Gregg Easterbrook’s posts about the statistical analyses that generally supported Belichick’s decision. Simmons’s arguments were stronger because he specifically placed his observation of the game and the Patriots’ performance leading up to this last offensive call in the context of aggregate statistics. True to form, however, he followed this by trashing the statistical analysis, rather than concluding that he had properly evaluated singular performance and identified how the Patriots deviated from the aggregate.

Simmons’s argument is that most stat-heads used the wrong set of probabilities. Posnanski, Easterbrook and Simmons presented the statistical arguments that the Patriots had a greater chance of winning had they gone for the conversion, rather than punting. To be fair, the difference might have been slight; numerically, of course, one probability was higher than the other (Tim Graham of ESPN arrived at a 1.5% difference in win probability). Had Simmons focused on reconciling the statistical assumptions with how Belichick’s play calling lowered the Patriots’ chances of achieving a first down, I believe he would have provided a wonderful illustration of how one goes about reconciling statistical/probability estimates with actual events. Unfortunately, Simmons ignores the probability of winning, focuses on the probability of losing, and asserts that punting was unequivocally the correct call.

Simmons had a contrary opinion from Easterbrook and Posnanski on the punting issue, but all three of them found problems with Belichick’s coaching in the last minutes of play, preceding the 4th down conversion attempt. All three pointed out issues with game management (such as 2 timeouts that were called just to make sure the right players were on the field) and with play calling (rushing on first down, passing on the next two downs). That last sequence suggested that the decision to play out the fourth down rather than punt was a spontaneous one. Simmons broke that down nicely, suggesting that rushing on third down made more sense if one is in fact going for a 4th down conversion. Finally, the actual play on 4th down was atrocious, as the Patriots limited their options drastically, going with an empty backfield. In this formation, there was no running option, and the Colts simply jammed Brady to hurry his throw. As it happens, he connected with Kevin Faulk, but short of the first down.

I don’t think anything here contradicts the aggregate story (such as a greater than even chance of getting 2 yards). The fact is, there was much circumstantial evidence that Belichick might have flubbed the play. After all, there are no guarantees; just because the average play nets 5 yards doesn’t mean the players just stand there, waiting for the refs to spot the ball up field. You need to select a play and then execute it. As the saying goes, that’s why they play the game. The players still need to give their fullest effort.

What one should consider is how Belichick reduced the Patriots’ chance of converting by using a bad strategy. And Simmons actually did this. He noted that this play was essentially a 2-point conversion attempt, as both offense and defense were lined up to attack and defend a short field (i.e. defending the end zone with the line of scrimmage at the 2 yard line). There seemed to have been some confusion between the special teams and offense, as it wasn’t clear to the players whether they were attempting a punt or not, necessitating a time out that could have been used later to challenge the Faulk bobble (see Posnanski’s post). Simmons presented some stats showing that 2-point conversions had a lower success rate (on the road; I have issues with Simmons’s selective stat picking, but that piece wasn’t exactly a peer-reviewed article). It was unreasonable to conclude that the Colts would have rolled back down field to score with under 2 minutes to go, possessing only 1 timeout (despite the fact that the Colts did exactly that on their preceding drive. It probably was an aberration and won’t happen again. But a stat here would be nice, comparing how long in distance and time an avg NFL drive is). The Colts also had an inexperienced, young receiver corps, which might have increased the Patriots’ chances of stopping the Colts after a punt.

So, even if the average success rate on 4th down conversions is around 60%, the Patriots did not maximize the likelihood of success. Thus the stat-heads, in essence, should have altered the assumptions for their calculations, based on the on-the-field observations from the last couple of minutes of the game. Maybe the Patriots should have punted.
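
To see how much the assumptions matter, here is a minimal sketch of the kind of comparison involved. Only the roughly 60% conversion rate comes from the discussion above; every other number (the win chances after a conversion, after a failed attempt, and after a punt, plus the lowered 40% conversion rate) is a hypothetical placeholder chosen for illustration, not a figure from Simmons, Posnanski, or Easterbrook.

```python
def win_prob_go(p_convert, win_if_convert, win_if_fail):
    """Win probability when going for it on 4th down."""
    return p_convert * win_if_convert + (1 - p_convert) * win_if_fail


def break_even(win_if_convert, win_if_fail, win_if_punt):
    """Conversion rate at which going for it and punting are equally good."""
    return (win_if_punt - win_if_fail) / (win_if_convert - win_if_fail)


# Hypothetical inputs (assumptions for illustration only).
WIN_IF_CONVERT = 0.95  # first down made, clock can nearly be run out
WIN_IF_FAIL = 0.50     # opponent takes over on a short field
WIN_IF_PUNT = 0.70     # opponent needs a long drive

# Roughly the league-average conversion rate mentioned in the text.
league_average = win_prob_go(0.60, WIN_IF_CONVERT, WIN_IF_FAIL)
# A made-up lower rate standing in for a poorly chosen play.
poorly_called = win_prob_go(0.40, WIN_IF_CONVERT, WIN_IF_FAIL)
threshold = break_even(WIN_IF_CONVERT, WIN_IF_FAIL, WIN_IF_PUNT)

print(f"go for it at 60%: {league_average:.3f}")
print(f"go for it at 40%: {poorly_called:.3f}")
print(f"punt:             {WIN_IF_PUNT:.3f}")
print(f"break-even conversion rate: {threshold:.3f}")
```

With these placeholder numbers, going for it beats punting at a 60% conversion rate but loses to it at 40%. The specific values are not the point; the point is that the recommendation flips once the observed details pull the conversion rate below the break-even value.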

There are some arguments against punting. Easterbrook focused on the specific offense/defense matchups as determined by this particular game. Easterbrook wrote that, on the previous possession, the Colts drove 79 yards in 1:40, without a time out, for a touchdown. Easterbrook also noted that, to his eyes, the Patriots defense seemed a step behind the Colts offense. Also, the Patriots were playing against a weak secondary. As it happened, Brady and company rolled up 370 yards on the night. It seemed like they should have had a greater than league average chance of converting the 4th down. They might have had a slightly lower than league average chance of defending ~70 yards, had they punted, as they had just shown they could give up a long drive (although Simmons pointed out that the Patriots stopped the Colts in 5 of the last 7 defensive series in that game).

Again, the two arguments are whether the Patriots can stop Manning with under 2 minutes left and whether Brady plus Faulk, Welker, and Moss can gain 2 yards. On the field, there are probably enough game-related distractions and observations for Belichick. As Posnanski said, there might have been a lot going on in Belichick’s mind. It might have taken him until the last second to come to some conclusion about what to do on that fourth down. He probably did know, in general terms, the arguments above, but they might not have led to a clear-cut answer. He might have just decided that there was a very good chance his QB would find a way to get the 2 yards. Although I support Simmons’s argument (and only because I think the win probability is shaded just slightly more towards punting, with Simmons’s modifications taken into account), I’m not sure if punting is a clear answer with so much time left on the clock, against a quarterback like Manning.

I think both the punt and no-punt observational arguments are valid. And the whole point of statistics is to help you weigh these alternatives against some metric (i.e. the league average). Where it actually detracts from the analysis (to the non-statistician’s mind) is when the likelihoods of a positive outcome, for the considered alternatives, are rather similar.

The two points here are that 1) contrary to Simmons’s point that observations are somehow better, observations also led to two contradictory, sound conclusions about the overall strategy, and 2) with the situation as stated, punting was still not a guarantee of a win (punting would have been the better option as time left to play decreased).

The problem with the former is that we have a tendency to shoehorn these anecdotes into fitting the conclusions that we want to draw. That’s why having some statistics can provide a context for evaluating the single-sample observations. You can’t do what Simmons did, which is to say that the aggregate is wrong because of the details in this situation (wrong play selection or no strategy leading to a 4th down conversion attempt), just as you can’t argue against the punt if a punt-return touchdown happened. Because in the aggregate, these things are aberrations. Even if Simmons’s argument for punting was strong, it probably should have modified the outcome to only a greater than 50% winning probability, not the 100% win that Simmons thinks. In other words, you can’t just turn a 60% win probability into 100% just because you chose it. In the aggregate, both plays would yield a win more than 50% of the time.

Some other criticisms of Simmons’s piece: not all stats are created equal. Examples of what not to do with stats include Simmons using spurious stats, like how often there are 3 TDs scored in the 4th quarter, to bolster his point. But why limit it to the 4th quarter? Why not just look at how often 3 TDs are scored in a quarter? Or why look at only 2-point conversion plays on the road? I know Simmons made a point about how this particular play is set up like one, but the proper comparison is still against all 2 yard attempts or against all 2-point conversion plays. The problem is that he made no attempt to discuss the validity of that particular stat in general before analyzing the breakdowns. In some regards, it might be simpler to prove the general case before the specific one. And certainly it helps to present all the splits, not just the ones that support your case.

Part of the issue with probability and statistics is that people do not have the luxury of the long run or multiple trials. We only have this one trial. Which brings us to the asymmetry referred to in the title of this post. Models are one-way in that one can build them by collecting multiple observations; it is a mug’s game to apply models to predict a specific event. Something might happen, until it does; the model is probabilistic, but the outcome is binary. That is part of the difficulty in accepting statistical models.
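
A small simulation can make this asymmetry concrete. The sketch below assumes a purely hypothetical model that gives a team a 60% chance of winning a given situation; the function name, the seed, and the number of trials are all illustrative choices, not anything taken from the articles discussed here.

```python
import random


def simulate(p_win, n_games, seed=0):
    """Return a list of True/False outcomes drawn with probability p_win."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [rng.random() < p_win for _ in range(n_games)]


P_WIN = 0.60  # hypothetical model estimate, for illustration only

one_game = simulate(P_WIN, 1)     # what a fan actually sees
many_games = simulate(P_WIN, 10_000)  # what the model describes

print("single game won:", one_game[0])
print("long-run win rate:", sum(many_games) / len(many_games))
```

The single game comes out as a plain win or loss, while only the long run of repeated trials lands near the 60% the model claims. Neither result of the one game can confirm or refute the estimate by itself.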

I thought that Simmons’s piece indicated that he did not separate the overall strategy from the details of the execution. As he is so fond of arguing, the details cannot be captured by a simple measure such as “conversion”. There were many ways of getting there: is a recovered fumble an ideal way of converting a 4th down? How about a penalty against the defense? Was it a 4th and inches grind forward? Was it an 8 yd pass against a weak opponent? Did the coach rest the first string defense in the fourth quarter, with the game well in hand? However, this was in the context of a Brady plus Welker, Faulk, and Moss offense that had nearly 400 yards on the night. That is a detail that Simmons did not dwell on. The players gave the Patriots a legitimate shot at converting the 4th down. It was the playcalling from Belichick that failed the Patriots. I thought it was unfair for Simmons to trash the strategy based on the example of this particular play.

And to spread the criticism a bit, I don’t think it makes sense to never punt, as Easterbrook maintains (though he argues this from an aesthetic perspective). The contribution of that particular play to the overall win probability depends on the situation. It is the coach’s job to identify the most significant factors in terms of the aggregate (i.e. whole NFL result) and then apply them to an analysis of how his particular offensive and defensive play calling maximizes the actual performance of his players.

Simmons missed a great opportunity to show how a proper analysis should be done. He could have supported the obvious point, that, hey, to maximize on that 60% success rate, you need to treat this like a normal play in a scripted series, not like a 2 pt conversion. He even said as much; another one of his points is that Belichick did not treat the whole series like a four down set. Doing so would have enhanced the overall chance of success. Instead, he raised the metaphorical equivalent of the “blogger-in-Mom’s-basement” attack against stat-heads: that they don’t watch the games. And that watching the game would have told you what the correct strategy was. I don’t think that was the case at all, as the contrary view can be derived using Easterbrook’s assumptions.
