It aint over ‘til it’s over
May
18
Written by:
Charles Flock
5/18/2010 8:19 PM
When is it time to admit that your favorite baseball team is not going to play in the post-season?
Casey Stengel, the famous NY Yankee manager, once said of the baseball season: no matter what you do you will win 50 games and lose 50 games; it’s what you do with the other 50 games that really matters. (This was when the season was only 154 games, today he would say 55).
This week a Yankee announcer commented that the Boston Red Sox need to get moving before they fall too far behind division leaders TampaBay and NY. Since, MLB plays 162 games, I was struck by the idea that this early in the season, there would be some sense of urgency about having to win games. The major league baseball season is a little more than 1/5th complete – let’s call it 33 games played – and I began to wonder what these early results might tell us about the play-off picture.
I thought it would be cool to see if we could use Monte Carlo in SQL Server to see if we can’t make some informed guesses about who might make the playoffs. I am going to keep this model very simple and make lots of simplifying assumptions.
Let’s start with the standings in the American League East. While it is early, you can see that the New York Yankees and the Tampa Bay Rays are in a close race for first place in their division, and Boston is already 5.5 games back. (To be fair, Boston has won its last three games).
Team
|
Wins
|
Losses
|
Pct
|
GB
|
WCGB
|
TampaBay
|
23
|
10
|
.697
|
-
|
-
|
New York
|
21
|
10
|
.677
|
1.0
|
-
|
Toronto
|
19
|
16
|
.543
|
5.0
|
4.0
|
Boston
|
18
|
16
|
.529
|
5.5
|
4.5
|
Baltimore
|
9
|
24
|
.273
|
14.0
|
13.0
|
The way that we frame the problem determines the kind of Monte Carlo simulation we will build. I am going to assume (for now) that if an American League East team wins 95 games they will make the playoffs either as division champion or as the wild card entry. Ninety-five wins equates to a .586 winning percentage.
With just this one assumption, we can build a model that will calculate the probability of a team winning 95 games given its current record, everything else being equal. We can use the SERIESINT function to randomly generate wins and losses for a specified number of games. The following statement will produce a 1 for a win and a zero for a loss.
SELECT seq
,CASE
WHEN SERIESVALUE < 587 THEN 1
ELSE 0
END as wins
FROM wct.SERIESINT(1,1000,1,33,'R')
We have specified in the SERIESINT table-valued function that we want to randomly generate 33 integer values between 1 and 1,000. In the SELECT statement we have specified that if the randomly-generated number is less than 587 (which is based on our .586 winning percentage for 162 games), return a 1, otherwise return a 0.
This will return 33 rows of 0 or 1. What we really want to know is how many wins. Moreover, we do not just want to run one simulation, we want to run 10,000 (or more).
This statement will record the results for 10,000 simulations of the first 33 games of the season:
SELECT num
,SUM(n.wins)as record
FROM(
SELECT wct.TRUNC((seq-1)/33, 0)as num
,CASE
WHEN SERIESVALUE < 587 THEN 1
ELSE 0
END as wins
FROM wct.SERIESINT(1,1000,1,330000,'R')
) n
GROUPBY num
This statement uses a derived table, n, which has two columns, num and wins. The num column is a way of grouping the wins together for a particular trial and is calculated using the TRUNC function such that every 33 rows are grouped together. Running this statement returns 10,000 rows, so we are going to summarize the data further.
What we are really interested in is calculating the number of times a trial returns a specified number of wins. The following statement will do that (your results will vary because of the random numbers).
SELECT record
,COUNT(*)as [Number of Trials]
FROM(
SELECT num
,SUM(n.wins)as record
FROM(
SELECT wct.TRUNC((seq-1)/33, 0)as num
,CASE
WHEN SERIESVALUE < 587 THEN 1
ELSE 0
END as wins
FROM wct.SERIESINT(1,1000,1,330000,'R')
) n
GROUP BY num
) m
GROUP BY record
ORDER BY record
The results will look something like this.
So, this looks nice, especially since it looks like the results are normally-distributed, but what exactly is it telling us? A team that has 9 wins after 33 games (so they have a 9 – 24 record), won exactly 95 games 1 time out 10,000 trials. A team that has 10 wins after 33 games won exactly 95 games 13 times out of 10,000 trials. However, a team that has 28 wins after 33 games won exactly 95 games 10 times out of 10,000 trials. So the probability for a team with a 28 – 5 record winning 95 games is less than the probability of a team that is 10 – 23?
What we have calculated is the probability of winning exactly 95 games. This is called the probability density function. The team with a 28 – 5 record will probably win more than 95 games. So, what we really want to calculate is the probability of a team winning at least 95 games. This is called the cumulative distribution function.
To calculate the cumulative distribution function, all we need to do is add up the probability density functions that come before it. This requires us to convert the number of wins to a probability, which we do simply by dividing by the number of trials, and then using the SUM function do the addition. To keep the SQL simple, I am going to insert my interim results into a temporary table and calculate the cumulative distribution from that.
SELECT record
,COUNT(*)as [Number of Trials]
INTO #w
FROM(
SELECT num
,SUM(n.wins)as record
FROM(
SELECT wct.TRUNC((seq-1)/33, 0)as num
,CASE
WHEN SERIESVALUE < 587 THEN 1
ELSE 0
END as wins
FROM wct.SERIESINT(1,1000,1,330000,'R')
) n
GROUP BY num
) m
GROUP BY record
SELECT w1.record
,sum(w2.[Number of Trials])/cast(10000 as float) as [Cumulative Distribution]
FROM #w w1, #w w2
WHERE w2.record <= w1.record
GROUP BY w1.record,w1.[Number of Trials]
ORDER BY 1
As you can see, I inserted my results into a temporary table, #w, which I then used in the second SELECT to calculate the cumulative distribution function. This returned the following results.
Now we can see that a team with a 9 – 24 record (the Baltimore Orioles) has a probability of .0001 (.01%) of winning at least 95 games, whereas a team with a 23 – 10 record (the Tampa Bay Rays) has a probability of .9327 (93.27%) of winning at least 95 games, all things remaining the same.
How about the Yankees? They have played 31 games, not 33 games, so we need to change the number of games in our SELECT. We will take this opportunity to create some variables so that it is easier to run this SELECT for any number of games and any number of iterations.
DECLARE @num_iterations as float,
@num_games as float,
@season_wins as float,
@record as float
SET @num_iterations = 10000
SET @num_games = 31
SET @season_wins = 96
SET @record =Round(@season_wins/cast(162 as float), 3)* 1000 + 1
SELECT record
,COUNT(*)as [Number of Trials]
INTO #w
FROM(
SELECT num
,SUM(n.wins)as record
FROM(
SELECT wct.TRUNC((seq-1)/@num_games, 0) as num
,CASE
WHEN SERIESVALUE < @record THEN 1
ELSE 0
END as wins
FROM wct.SERIESINT(1,1000,1,@num_games * @num_iterations,'R')
) n
GROUP BY num
) m
GROUP BY record
SELECT w1.record
,sum(w2.[Number of Trials])/cast(@num_iterations as float) as [Cumulative Distribution]
FROM #w w1, #w w2
WHERE w2.record <= w1.record
GROUP BY w1.record,w1.[Number of Trials]
ORDER BY 1
When I run this simulation, the Yankees have a better than .87 probability of winning 96 games based on their 21 – 10 record. The Red Sox, with an 18 – 16 record, have just a little less than a .28 probability of winning 96 games. In other words, the Yankees are 3 times as likely to win 96 games as the Red Sox and the Rays are 3.33 times as likely as the Red Sox to win 96 games, all things remaining the same.
The key phrase here, of course, is: all things remaining the same. The Orioles, with a probability of .0001 want to make some changes. Mostly, they just want their injured players to get healthy. The Rays, with a probability of .9327 want to keep everything the same. This being baseball, that is unlikely. Players get injured, traded, and return to health. Weather affects games and schedules. Lots of things don’t remain equal.
That’s what makes the game interesting. It is also what gives every die-hard fan and season ticket holders of lagging teams HOPE. . .that this will be the year the magic happens.
UPDATING THE MODEL
The following script will create a table-valued function that will allow you to run this analysis any time, adjusting the variables as you see fit, during the course of the season. If you are new to SQL Server, you can get SQL Server Express for free and XLeratorDB Express for free and do this analysis for yourself.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author: Charles Flock
-- Create date: 05/12/2010
-- Description: Monte Carlo simulation for winning records in baseball
-- =============================================
CREATE FUNCTION [dbo].[fn_simWL]
(
-- Add the parameters for the function here
@num_iterations as float,
@num_games as float,
@season_wins as float
)
RETURNS @wlCDF
TABLE
(
-- Add the column definitions for the TABLE variable here
wins float,
losses float,
cdf float
)
AS
BEGIN
-- Add the SELECT statement with parameter references here
DECLARE @record as float
DECLARE @w TABLE(
record float,
[Number of trials] float
)
SET @record =Round(@season_wins/cast(162 as float), 3)* 1000 + 1
INSERT INTO @w
SELECT record
,COUNT(*) as [Number of Trials]
FROM(
SELECT num
,SUM(n.wins) as record
FROM(
SELECT wct.TRUNC((seq-1)/@num_games, 0)as num
,CASE
WHEN SERIESVALUE < @record THEN 1
ELSE 0
END as wins
FROM wct.SERIESINT(1,1000,1,@num_games * @num_iterations,'R')
) n
GROUP BY num
) m
GROUP BY record
INSERT INTO @wlCDF
SELECT w1.record
,@num_games - w1.record
,sum(w2.[Number of Trials])/cast(@num_iterations as float)
FROM @w w1, @w w2
WHERE w2.record <= w1.record
GROU PBY w1.record,w1.[Number of Trials]
RETURN
END
And now, I can just enter the following statement to run the Monte Carlo simulation 100,000 times:
SELECT *
FROM dbo.fn_simWL(100000,33,96)
ORDER BY wins