Smart Baseball Page 16
Calculating the Win Probability Added of a specific appearance, which could be an at bat or an inning of work, requires knowing two numbers and subtracting one from the other. Before the appearance, what were the team’s chances of winning that game, based on the score, the inning, the base-out state, and how often teams in MLB history have won games in which they faced that score in that inning and base-out state? And after the appearance, what were the team’s new chances of winning that game, given the same variables? Subtract the first number from the second number and you get the Win Probability Added, which will be positive if the team’s chances of winning the game increased (the hitter doubled, the pitcher threw two scoreless innings) and negative if their chances decreased (the hitter grounded into a double play, the pitcher walked two men and gave up a single). For a player, add up the WPA figures from all of his at bats or innings pitched over the course of a season and you’ll get a total number that shows you, in context, how much he helped or hurt his team given the timing of his production.
WPA is symmetrical for a single event: if the batter makes an out that drops his team’s chances of winning from 48 percent to 46 percent, then the pitcher gains that .02 in WPA as well. Per Fangraphs’ description of the metric, “At the end of every game, the winning team’s players will have a total WPA of +0.5 and the losing team’s players will have a total WPA of –0.5.” The stat is also park-adjusted, since the value of a run scored or prevented will vary depending on whether you’re in a high-scoring environment or a low-scoring one.
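The subtraction and the batter/pitcher symmetry described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the win-expectancy inputs would in practice come from a historical lookup table that is assumed here rather than built.

```python
def win_probability_added(we_before: float, we_after: float) -> float:
    """WPA for one appearance: win expectancy after minus win expectancy before.

    Both inputs are probabilities between 0 and 1 for the player's own team.
    """
    return we_after - we_before


def season_wpa(appearances) -> float:
    """Sum WPA over a sequence of (we_before, we_after) pairs."""
    return sum(win_probability_added(before, after) for before, after in appearances)


# The batter makes an out: his team's chances fall from 48 percent to 46 percent.
batter_wpa = win_probability_added(0.48, 0.46)

# The pitcher's team sees the mirror image: 52 percent rises to 54 percent.
pitcher_wpa = win_probability_added(1 - 0.48, 1 - 0.46)

# A winning team that starts at 50 percent and ends at 100 percent totals +0.5,
# because the per-event changes telescope to (final - initial).
winning_team_total = season_wpa([(0.50, 0.62), (0.62, 0.55), (0.55, 1.00)])
```

The last line shows why the team totals in the Fangraphs quote come out to +0.5 and -0.5: however the intermediate swings fall, the sum depends only on where the game starts and where it ends.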
While WPA is context driven, it has limitations. WPA ignores defense entirely, hanging all run prevention credit or blame on the pitcher like some of the more problematic pitching stats. For this reason, WPA is more helpful in conjunction with other stats and information rather than by itself as a measure of a player’s ability.
WPA serves two main functions. One, as I mentioned above, is that it is a useful measure of the impact of what actually happened on the field. The probabilities (chances) are based on historical data, the best way for us to estimate how much a specific play or event or series of events altered the potential outcome; if you were a bettor or an investor, this is exactly how you’d want to calculate the change in odds. The other is in understanding how much a player’s performance and the way in which he was used impacted his team’s won-lost record in a season. To rack up a strong WPA for a season, it’s not enough to play well; you have to play often and find yourself in situations where what you do makes a difference in game outcomes. That’s a function of where you hit in the lineup and who’s around you, or when your manager sends you out as a pinch hitter, or, for a relief pitcher, whether you get to pitch in high-leverage situations (later in games when the score is close or tied).
On August 8, 2016, the Cincinnati Reds—who, in case you didn’t know where this was going, had one of the worst bullpens in recent memory—took a 4–0 lead into the bottom of the ninth inning against the St. Louis Cardinals. At that point, before the Cards’ first at bat of the ninth, the Reds’ win expectancy was 98.5 percent; that is, teams up four runs with three outs to go have gone on to win their games about that frequently. After a hit and two outs, leaving the Reds still up 4–0 and a single out from victory, their win expectancy had risen to 99.6 percent. And then:
Event                     Reds’ Win Expectancy
Walk                      98.7%
HBP                       95.6%
Single, 2 runs score      91.1%
Single, 1 run scores      83.8%
Walk                      73.9%
Walk, 1 run scores        34.7%
HBP, 1 run scores         0.0%
All data via Fangraphs.com
Five of those seven events involved the hitter receiving a “free” pass to first base via a walk or a hit by pitch, but the shift in win expectancy from each event was different. The last walk, where Brandon Moss walked with the bases loaded to force in a run, resulted in the biggest shift in win expectancy (WE), a drop of 39.2 percentage points for the Reds; the batter hit by pitch to force in the winning run actually produced less of a shift in WE because you can’t get any lower than a 0 percent chance of winning (although if you could, this Reds bullpen would have found a way). We don’t talk about “clutch walks” or “clutch times hit by pitch,” but that is what happened here: the last three batters in particular drew walks or were hit by a pitch in high-leverage (clutch) situations and thus raised the Cardinals’ chances of winning substantially without swinging their bats.
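The per-event shifts in that inning can be recomputed from the win-expectancy figures quoted above. The sketch below uses only those figures (in percent, starting from the 99.6 percent the Reds held with two outs); the function and variable names are hypothetical.

```python
# Reds' win expectancy (percent) with two outs in the ninth, before the collapse.
start_we = 99.6

# (event, Reds' win expectancy after the event), as listed in the table.
events = [
    ("Walk", 98.7),
    ("HBP", 95.6),
    ("Single, 2 runs score", 91.1),
    ("Single, 1 run scores", 83.8),
    ("Walk", 73.9),
    ("Walk, 1 run scores", 34.7),
    ("HBP, 1 run scores", 0.0),
]


def we_shifts(start, sequence):
    """Return (event, change in win expectancy) for each step, in percentage points."""
    shifts = []
    previous = start
    for label, we_after in sequence:
        # Round to one decimal place to hide floating-point noise.
        shifts.append((label, round(we_after - previous, 1)))
        previous = we_after
    return shifts


shifts = we_shifts(start_we, events)

# The most damaging single event for the Reds is the one with the most negative shift.
biggest = min(shifts, key=lambda item: item[1])
```

Here `biggest` is the bases-loaded walk at -39.2 points, while the game-ending hit by pitch comes in at -34.7 because there were only 34.7 points left to lose.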
The MLB leader in WPA in 2015 was Anthony Rizzo, first baseman for the Cubs, at +7.15 wins above average, a full win above what Bryce Harper, the most productive hitter in the majors that year, provided to the Nationals. That’s a function of context, since Harper outproduced Rizzo in individual stats but didn’t have the opportunities Rizzo had to affect game outcomes and/or didn’t produce as well in those opportunities. Among pitchers, the Dodgers’ Zack Greinke, who led the majors in ERA thanks to a very low BABIP, led at +6.79.
But where WPA gets interesting is when you look at the values relievers rack up because of how they’re deployed. The MLB leader in WPA among relievers in 2015 was Pirates closer Mark Melancon at +5.39, more than all MLB starters but Greinke and NL Cy Young winner Jake Arrieta. That credits Melancon for a job well done . . . but also credits Pittsburgh manager Clint Hurdle for using his best reliever so frequently in high-leverage situations. That means that WPA is telling us two things at once: the opportunities the player had to affect the outcomes of the games in which he played, and how well or poorly he played in those situations.
As a description of what actually happened in games, WPA is useful, but as a measure of individual skill or a tool to predict the player’s performance going forward, it is useless. There’s no correlation between WPA in one season and the next independent of the player’s own underlying performance—a good hitter is a good hitter, no matter the score, the inning, or who’s on base. If you’re in a front office, trying to determine how much to pay a player, or what to trade to acquire a player, WPA shouldn’t enter into your thinking at all. Its value is limited to fans and writers who want to understand the story of a game, for which it’s better suited than most traditional statistics, like the pitcher win, the save, or the now-discredited Game-Winning RBI, a garbage stat introduced by the Elias Sports Bureau in 1980 and quietly discontinued after 1988. WPA gives us a better way to think about what happened, but doesn’t help team executives in their quest to more accurately value a player or predict his future.
13
The Black Box:
How Baseball Teams Measure Defense Today
Given how flawed many of the long-accepted fielding stats are, it shouldn’t come as a shock that, when it comes to fielding, the old numbers don’t give us much to work with. Unlike other flawed numbers that are only telling us part of the story, stats like fielding percentage are actually deceiving us, which makes them of very little use.
And so, when it comes to fielding stats, we need to start over entirely, but the question is where. If I could somehow wipe fielding percentage and errors from your brain and then ask you to tell me how we might assess a fielder’s value, what would you say?
I think a reasonable first answer might be this: does he make the plays he’s supposed to make? This is the most basic thing we expect of any player on the field. If a ball is hit to him that we would expect him to field—that is, to convert into an out, either making the play himself or beginning a play for another fielder to complete—then he’d better field it. If he can’t do that, we’d probably consider him a below-average fielder, and we’d expect a fielding statistic to dock him for the plays he should make but doesn’t. So we’d need to track all the plays our fielder should have made, see how many of them he did make, and then compare it to some baseline of players at that same position—how often they made those same plays themselves.
If you go a little deeper, you might also consider the plays the player does make that most players don’t. We tend to think of these as highlight plays, but they don’t have to look like Web Gems to actually be great defensive plays. Andruw Jones is one of the ten or so best fielders in baseball history, but many of his best plays wouldn’t make a highlight reel because his range was so good he wasn’t diving to make these highly valuable catches. So now we have to look at all of the plays that our player made that most players at his position don’t make, and see how often he made them and what happens when a fielder doesn’t make each of those particular plays.
This is the fundamental logic behind the new defensive stats you see on sites like Baseball-Reference and Fangraphs and the proprietary defensive metrics teams use themselves for their own player valuations. The play not made can be damaging, and the extra play that is made can be extremely valuable. Any attempt to value a fielder’s contributions on the field or to try to estimate his true defensive talent level must include these plays, which none of the “old” defensive metrics accurately counted.
Before the advent of play-by-play data that showed where balls in play first hit the ground or were fielded, putouts, assists, and errors were all we had, and anyone trying to come up with a better way of valuing fielding had to find a way to glean some kind of information from that data.
One of the earliest alternatives to fielding percentage was Range Factor, developed by Bill James to try to put some kind of numerical value on a player’s range by measuring the frequency of plays he made. Range Factor was simply putouts plus assists divided by games or, better, putouts plus assists times 9 divided by innings played, making it an ERA-like rate stat scaled to nine-inning games. A player who made a lot of plays would thus score higher, and we would presume that such a player had more range.
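Both versions of Range Factor described above are simple enough to write out directly. The sketch below follows the two formulas in the text; the function names and the sample season line are made up for illustration and do not belong to any real player.

```python
def range_factor_per_game(putouts: int, assists: int, games: int) -> float:
    """Original form: plays made (putouts plus assists) per game played."""
    return (putouts + assists) / games


def range_factor_per_nine(putouts: int, assists: int, innings: float) -> float:
    """Preferred form: plays made per nine innings, scaled the way ERA is."""
    return 9 * (putouts + assists) / innings


# Hypothetical shortstop season: 250 putouts, 450 assists, 1,350 innings in 155 games.
rf_game = range_factor_per_game(250, 450, 155)
rf_nine = range_factor_per_nine(250, 450, 1350.0)
```

The innings-based version is the fairer of the two because a player who enters late as a defensive replacement is charged with a full game in the first formula but only with the innings he actually played in the second.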
Palmer and Thorn tried a novel approach with Fielding Runs, which they described in a chapter of The Hidden Game of Baseball called, appropriately for its time, “Measuring the Unmeasurable.” Their statistic, a linear-weights approach to fielding data, still relied on the same raw information of putouts, assists, and errors, while also considering double plays for infielders, and then compared individual players’ results to estimates of the league average for each position. Unfortunately, because of the problems in the raw data, this fell under the old dictum of garbage in/garbage out; their results pointed in the right direction, as did Range Factor, but there was too much nonsense in their inputs (for example, they had to use games played rather than innings played, which weren’t available) to give Fielding Runs the precision they reached with batting or pitching runs.
While John Dewan was working for Stats Inc. in the 1980s, he developed Zone Rating, the earliest attempt to divide the field of play into zones of responsibility for the fielders. The concept is simple—we’d certainly expect the shortstop to field a ball hit to the place where the shortstop usually stands, and if we mark off that spot and the area right around it, we can say he’s responsible for balls hit into that zone. Dewan divided the field into zones of responsibility for each player, then looked at how many balls were hit into each zone for each fielder, and how often the fielder turned those balls into outs. The concept was correct, representing a substantial leap forward over anything based on traditional stats, but Dewan didn’t have the data required in the 1980s or ’90s to get meaningful results from Zone Rating. Dewan eventually cofounded Baseball Info Solutions, a new data-collection firm that, among other things, provided more specific details on where balls were hit into play, and even employed people to watch defensive plays and subjectively grade them as “misplays” or “good fielding plays.” Before the advent of more specific play-by-play data, showing locations for balls in play and for fielder positions, this was as precise as advanced defensive metrics could get.
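The zone-of-responsibility idea can be sketched as a simple ratio: of the balls hit into a fielder’s zones, how many did he turn into outs? The code below is a simplified illustration of that concept only, not the actual Zone Rating formula (which the text does not spell out); the zone labels, class, and function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class BallInPlay:
    zone: str               # hypothetical label for where the ball was hit
    converted_to_out: bool  # did the responsible fielder turn it into an out?


def zone_rating(balls: Iterable[BallInPlay], responsible_zones: Set[str]) -> Optional[float]:
    """Share of balls hit into a fielder's zones that he converted into outs.

    Returns None when no balls were hit into his zones, since the ratio is undefined.
    """
    in_zone = [ball for ball in balls if ball.zone in responsible_zones]
    if not in_zone:
        return None
    return sum(ball.converted_to_out for ball in in_zone) / len(in_zone)


# Made-up sample: four balls in the shortstop's zones, one outside them.
sample = [
    BallInPlay("SS-1", True),
    BallInPlay("SS-1", True),
    BallInPlay("SS-2", False),
    BallInPlay("SS-2", True),
    BallInPlay("3B-1", False),  # not the shortstop's responsibility, so ignored
]
shortstop_rating = zone_rating(sample, {"SS-1", "SS-2"})
```

With this sample the shortstop converts three of the four balls in his zones, for a rating of 0.75. The ball in the third baseman’s zone does not count against him.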
The advent of new data has changed the approach once again. As of 2016, when I’m writing these words, there are two major, publicly available defensive metrics that attempt to answer the question of how many runs a defender saved or cost his team compared to an average fielder at the same position. One is Ultimate Zone Rating (UZR), developed by Mitchel Lichtman and available on Fangraphs’ site. The other is defensive Runs Saved (dRS), developed by Baseball Info Solutions and available on Fangraphs and Baseball-Reference. While the two rarely align perfectly—which highlights one of the key difficulties in evaluating defense—they typically agree on direction; that is, if UZR says a player is well above average, dRS will probably say the same thing. Over multiple seasons, they’re more likely to agree on a player, simply because large sample sizes tend to reduce the effect of outliers in the subjective aspects of these stats.
Because these stats are publicly available, they immediately become Important Things, perhaps even Numbers You Are Not Allowed to Question. This is silly, although it’s a tenet of behavioral economics that when someone slaps a number on something, that’s what we assume that something is worth; it’s integral to commerce, where we’ll all automatically assume a product is more desirable or of higher quality just because its price is higher. And it’s important to bear in mind what UZR and dRS are actually telling us.
There’s a key difference between measuring the value of what a player did and what his actual or “true” talent might be. A career .320 hitter who goes 0-for-4 did not suddenly lose his hitting ability in that game, but the 0-for-4 is an accurate statement of what he did in those four trips to the plate. The same idea applies to defensive metrics, but in my experience is often forgotten: a player is not his UZR. His UZR (or his dRS) estimates the value of what he did in the field, with some adjustments. It does not tell you what his underlying ability* is, and without a lot of mathematical chicanery I won’t bore you with here, we can’t draw a straight line from a UZR number to that underlying ability unless we make assumptions so big they could swallow Mount Everest whole and still have time to consume K2 for dessert.
Instead, what we want to do with defensive metrics is the same thing we do with stats for offense and pitching: get the most accurate possible picture of what each player did on the field. The difference between fielding stats and other stats is that a proper accounting of defense involves something more speculative: deciding when a player didn’t do something he should have done. We really don’t do anything of the sort with hitters or pitchers; we try to isolate responsibility, but we don’t get into the less comfortable realm of “well, he swung through that hanging breaking ball, but he really should have hit it halfway to Mars.” That’s part of scouting, but isn’t and probably should not be any part of even advanced hitting statistics. With fielders, however, we have to consider what didn’t happen, because part of the difference between a good fielder and a bad one is the play the good fielder makes on a ball the bad fielder never even touches.
Lichtman describes the philosophy of Ultimate Zone Rating as trying to tell us two things: for any ball that’s hit or put into play, did any fielder who had any chance to catch that ball do so, and what were the chances of a league-average fielder at that position making that same play? It turns out that this calculation is not as easy as it seems, although improved data on balls in play in the last few years have made the results more useful.
“How often would an average center fielder have caught that particular flyball?” seems like a straightforward question, but to answer it, you need a lot of flyballs to that specific point on the field—hit the same way, at the same speed. That’s awfully limiting, even if we had lots of historical data showing the precise spot on the field where a flyball landed or was caught. Instead, analysts like Lichtman and the folks at Baseball Info Solutions (BIS) use some approximations to give them enough of a sample of past balls in play so that saying “an average center fielder catches that ball 80 percent of the time” has some actual meaning.
For UZR, that means dividing the field into a number of zones, about 10 feet by 10 feet, for where a ball hit in the air landed or was caught, and then rating each ball by how hard it was hit (soft, medium, or hard) and, for flyballs, whether it was a line drive or a true flyball. For groundballs, the system comes up with a vector for each ball hit into play, a combination of an angle (dividing the 90 degrees from the third-base line to the first-base line into eighteen five-degree wedges) and a speed estimation (soft, medium, or hard again). Then Lichtman’s algorithm can look through a historical database of ten years of balls in play to find other balls that resemble the one we’re looking at.
For example, a medium-hit line drive to a zone in the left-center gap might be caught 20 percent of the time by the center fielder, 10 percent of the time by the left fielder (some of whom are there because that’s where you put the guy who can’t field), and 70 percent of the time by nobody, falling in for a hit. A softly hit groundball to the left side, 10 degrees from the third-base line, might be fielded 40 percent of the time by the third baseman, 5 percent of the time by the pitcher, and 55 percent of the time by nobody. With that many dimensions—location, hit type, estimated velocity, and angle for groundballs—we’re getting enough precision to distinguish between different types of balls hit into play without getting so specific that we’re comparing that ball hit into play today to one catch that Jim Edmonds made eight years ago and nothing else. The larger the sample of past balls in play we get to use for comparisons, the more confident we can feel when rating the current play in question.*
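The bucketing described above can be sketched as a lookup: assign each ball in play to a bucket, then count how often each fielder (or nobody) handled past balls in the same bucket. This is a rough illustration under stated assumptions, not Lichtman’s actual algorithm; the bucket layout, fielder labels, and function names are hypothetical, and the historical sample is fabricated to match the groundball example in the text.

```python
from collections import Counter, defaultdict


def groundball_wedge(angle_from_third_base_line: float) -> int:
    """Map an angle of 0-90 degrees to one of eighteen five-degree wedges (0 to 17)."""
    if not 0 <= angle_from_third_base_line <= 90:
        raise ValueError("angle must be between 0 and 90 degrees")
    # A ball hit exactly down the first-base line (90 degrees) joins the last wedge.
    return min(int(angle_from_third_base_line // 5), 17)


def fielding_rates(history):
    """Estimate who fields balls in each bucket from past balls in play.

    history: iterable of (bucket, fielder), where fielder is a position label
    or None when the ball fell in for a hit.
    Returns {bucket: (sample_size, {fielder: share of balls handled})}.
    """
    counts = defaultdict(Counter)
    for bucket, fielder in history:
        counts[bucket][fielder] += 1
    rates = {}
    for bucket, by_fielder in counts.items():
        total = sum(by_fielder.values())
        rates[bucket] = (total, {f: n / total for f, n in by_fielder.items()})
    return rates


# Softly hit groundball, 10 degrees from the third-base line.
bucket = ("groundball", groundball_wedge(10), "soft")

# Fabricated history of 100 comparable balls, matching the split in the text.
history = [(bucket, "3B")] * 40 + [(bucket, "P")] * 5 + [(bucket, None)] * 55

sample_size, shares = fielding_rates(history)[bucket]
```

Returning the sample size alongside the shares reflects the point made in the text: an estimate built on a hundred comparable balls deserves more confidence than one built on a single catch from eight years ago.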
There are two major problems with the UZR/dRS approach beyond the lack of precision in historical data. One is that these approaches must assume that all fielders start in essentially the same positions on the field—that every center fielder is positioned in the same spot for every hitter, behind every pitcher, regardless of the score or inning. If you thought, “Well, that’s stupid,” you’re incredibly rude but not far off the mark; we can all think of situations, like the final batter of the 2001 World Series, where a manager chose to bring in the infield, or play outfielders in the so-called no-doubles defense (which is better referred to as the “more singles” defense), all with an eye toward preventing certain outcomes. If you try to grade a center fielder on a routine flyball when he was playing very shallow to try to prevent a bloop single from driving in a run, you’re going to see he failed to make a play that center fielders make nearly 100 percent of the time, and you’ll think he blew it.