Login     Register

        Contact Us     Search

Calculating percentiles in SQL Server

Jul 15

Written by: Charles Flock
7/15/2011 5:21 PM  RssIcon

Using the XLeratorDB/statistics 2008 PERCENTILE and PERCENTRANK functions in SQL Server 2008.

UPDATE: XLeratorDB has released a new table-valued PERCENTILE function that offers significant performance improvements over the aggregate function.

Read the new PERCENTILE article here.

If you go to Wikipedia and look up the definition of PERCENTILE, the definition starts “There is no standard definition of percentile, however all definitions yield similar results when the number of observations is very large.” If you are an EXCEL 2010 user, you might have already guessed this, since the PERCENTILE function has morphed into 2 functions; PERCENTILE.INC and PERCENTILE.EXC. If you are not an EXCEL 2010 user, you may have been scratching your head over different results produced by different packages, say EXCEL and Matlab, and wondering which is right.
Considering how often percentiles are used (height and weight charts, test scoring, performance evaluations, figuring wages, etc.) it’s important to understand the different ways to calculate percentile and what a percentile value actually means.
Generally, percentile calculations can be broken down into two functions. The first function evaluates a score (k) of some type (something like height, weight, test scores, length, etc.) within a set of scores and returns a value between 0 and 1 (p) indicating a relative ranking of that score in relation to all the other scores in the set. This ranking will indicate the percentage of scores which fall below k in the set. This is the PERCENTRANK function.
The second function evaluates a percentage (p) on a set of scores and returns a value (k) such that the percentage of scores in the set falls below k. The value k need not be a member of the set, though it is bound by the minimum and maximum values of the set. This is the PERCENTILE function.
For example, let’s say you have a 4th grade student in a class of 40 who takes a standardized exam that is given to all 4th grade students in the United States. You receive results broken down into a national percentile, a percentile for all private schools, and a percentile within school. Let’s say that this student scores in the 99th percentile nationally, 99th percentile for private schools, and the 21st percentile in-school. Is it possible to conclude that this student is near the bottom of his class and that 78% of his classmates outperformed him on this exam?
The maximum possible score on the exam is 100. In the class of 40, 32 students got a perfect score of 100, and the remaining students got a score of 99 (scores are not disclosed in the reporting of the results). Let’s calculate the value for the 21st percentile in this group.

SELECT wct.PERCENTILE(x,.21) as PERCENTILE
FROM (VALUES
      (99),(99),(99),(99),(99),(99),(99),(99),
      (100),(100),(100),(100),(100),(100),(100),(100),
      (100),(100),(100),(100),(100),(100),(100),(100),
      (100),(100),(100),(100),(100),(100),(100),(100),
      (100),(100),(100),(100),(100),(100),(100),(100)
      ) n(x)
This produces the following result.

            PERCENTILE
----------------------
                   100

Wait, but isn’t 100 the maximum score? How could someone get the maximum score and still be in the 21st percentile?
This gets back to the definition of percentile. While there is no standard definition, the differences in the definitions are about methods of computation rather than interpretation of the resulting statistic. The percentile statistic is the value below which a certain percent of observations fall. In this example, we know that 21 percent of the observations fall below a value of 100. We cannot draw any conclusions about how many observations are above.
Let’s look at another example. There are 40 scores in this example, with a maximum value of 300 for the score. We will calculate the 90th, 95th, and 99th percentile.
SELECT m.p as PERCENTILE
,wct.PERCENTILE(n.x,m.p) as SCORE
FROM (VALUES
      (257),(261),(275),(283),(282),(261),(279),(256),
      (263),(261),(283),(282),(249),(300),(257),(263),
      (286),(261),(275),(245),(288),(267),(292),(255),
      (279),(270),(300),(286),(285),(297),(278),(300),
      (260),(269),(264),(273),(256),(300),(300),(290)
      ) n(x)
CROSS APPLY(VALUES (.90),(.95),(.99)) m(p)
GROUP BY m.p
ORDER BY 1
This produces the following result.

                             PERCENTILE                  SCORE
--------------------------------------- ----------------------
                                   0.90                    300
                                   0.95                    300
                                   0.99                    300


We can see that all three percentiles contain the maximum score, so all we can safely conclude is that a score of 300 is better than 90% of the reported scores. Since 300 is the maximum score, there is no ‘room for improvement’ for scores in the 90th percentile. Furthermore, let’s say the top 10% of the students will be selected for a more intensive program of study. The logical conclusion would be that this test has not let us accurately identify who the top 10% are, so we should look at some other criteria that is capable of producing a clearer differentiation among the top performers.
Frequently, instead of coming up with the score based on a percentile, we will need to come up with percentile based upon the score. To do this, we use the PERCENTRANK function.
In this example, we will use the same test scores from the previous example and for each test score we will calculate the corresponding percentile value. Even though there are 40 scores in our data set, the resultant table will only contain 25 scores as the duplicates are eliminated.
SELECT *
INTO #c
FROM (VALUES
      (257),(261),(275),(283),(282),(261),(279),(256),
      (263),(261),(283),(282),(249),(300),(257),(263),
      (286),(261),(275),(245),(288),(267),(292),(255),
      (279),(270),(300),(286),(285),(297),(278),(300),
      (260),(269),(264),(273),(256),(300),(300),(290)
      ) n(x)
SELECT c1.x
,wct.PERCENTRANK(c2.x,c1.x,5) as PERCENTRANK
FROM #c c1, #c c2
GROUP BY c1.x
This produces the following result.

          x            PERCENTRANK
----------- ----------------------
        245                      0
        249                0.02564
        255                0.05128
        256                0.07594
        257                0.12658
        260                0.17948
        261                0.20125
        263                0.30379
        264                0.35897
        267                0.38461
        269                0.41025
        270                0.43589
        273                0.46153
        275                0.48101
        278                0.53846
        279                0.55696
        282                0.60759
        283                0.65822
        285                0.71794
        286                0.73417
        288                0.79487
        290                0.82051
        292                0.84615
        297                0.87179
        300                0.87939

We can see that a score of 300 corresponds to the 87.9 percentile; a score of 245 corresponds to the 0 percentile. We can see that the 50th percentile lies between 275 and 278 and that the 75th percentile lies between 286 and 288.
Let’s assume that the students have the opportunity to take the exam again, the teacher reviews the materials, and everybody retakes the exam.
SELECT *
INTO #c
FROM (VALUES
      (248),(252),(258),(259),(259),(260),(260),(263),
      (264),(264),(264),(264),(266),(266),(267),(270),
      (272),(273),(276),(278),(278),(281),(282),(282),
      (285),(285),(286),(286),(288),(289),(289),(291),
      (293),(295),(300),(300),(300),(300),(300),(300)
      ) n(x)
SELECT c1.x
,wct.PERCENTRANK(c2.x,c1.x,5) as PERCENTRANK
FROM #c c1, #c c2
GROUP BY c1.x
 
DROP TABLE #c
This produces the following result.

          x            PERCENTRANK
----------- ----------------------
        248                      0
        252                0.02564
        258                0.05128
        259                0.07594
        260                0.12658
        263                0.17948
        264                0.20125
        266                0.30379
        267                0.35897
        270                0.38461
        272                0.41025
        273                0.43589
        276                0.46153
        278                0.48101
        281                0.53846
        282                0.55696
        285                0.60759
        286                0.65822
        288                0.71794
        289                0.73417
        291                0.79487
        293                0.82051
        295                0.84615
        300                0.85355

We can see that there was no improvement in PERCENTRANK for any of their test scores and an actual decline in the percentile value for those with scores of 300. Let’s analyze this data a little further by assigning a test score to an id and keeping track of the scores for test 1 and test 2. We can then select the id, the test scores for test 1 and test 2, and the percentiles for test 1 and test 2.
SELECT *
INTO #d
FROM (VALUES
      (15,1,245),(15,2,248),(9,1,249),(9,2,252),(10,1,255),(10,2,258),
      (7,1,256),(7,2,259),(17,1,256),(17,2,259),(32,1,257),(32,2,260),
      (33,1,257),(33,2,260),(31,1,260),(31,2,263),(26,1,261),(26,2,264),
      (19,1,261),(19,2,264),(8,1,261),(8,2,264),(21,1,261),(21,2,264),
      (22,1,263),(22,2,266),(6,1,263),(6,2,266),(14,1,264),(14,2,267),
      (35,1,267),(35,2,270),(18,1,269),(18,2,272),(28,1,270),(28,2,273),
      (4,1,273),(4,2,276),(20,1,275),(20,2,278),(25,1,275),(25,2,278),
      (36,1,278),(36,2,281),(3,1,279),(3,2,282),(37,1,279),(37,2,282),
      (13,1,282),(13,2,285),(23,1,282),(23,2,285),(29,1,283),(29,2,286),
      (40,1,283),(40,2,286),(16,1,285),(16,2,288),(24,1,286),(24,2,289),
      (39,1,286),(39,2,289),(1,1,288),(1,2,291),(38,1,290),(38,2,293),
      (12,1,292),(12,2,295),(34,1,297),(34,2,300),(27,1,300),(27,2,300),
      (11,1,300),(11,2,300),(2,1,300),(2,2,300),(5,1,300),(5,2,300),
      (30,1,300),(30,2,300)
      ) n(id,tn,sc)
 
SELECT d1.id
,d1.sc as [Test 1 score]
,d3.sc as [Test 2 score]
,wct.PERCENTRANK(d2.sc, d1.sc, 5) [Test 1 PERCENTILE]
,wct.PERCENTRANK(d4.sc, d3.sc, 5) [Test 2 PERCENTILE]
FROM #d d1, #d d2, #d d3, #d d4
WHERE d1.tn = 1
AND d2.tn = 1
AND d3.tn = 2
AND d4.tn = 2
AND d3.id = d1.id
GROUP BY d1.id
,d1.sc
,d3.sc
This produces the following result.

         id Test 1 score Test 2 score      Test 1 PERCENTILE      Test 2 PERCENTILE
----------- ------------ ------------ ---------------------- ----------------------
          1          288          291                0.77548                0.77548
          2          300          300                0.87554                0.85053
          3          279          282                0.55034                0.55034
          4          273          276                0.45028                0.45028
          5          300          300                0.87554                0.85053
          6          263          266                0.30018                0.30018
          7          256          259                0.07504                0.07504
          8          261          264                0.20012                0.20012
          9          249          252                0.02501                0.02501
         10          255          258                0.05003                0.05003
         11          300          300                0.87554                0.85053
         12          292          295                0.82551                0.82551
         13          282          285                0.60037                0.60037
         14          264          267                0.35021                0.35021
         15          245          248                      0                      0
         16          285          288                0.70043                0.70043
         17          256          259                0.07504                0.07504
         18          269          272                0.40025                0.40025
         19          261          264                0.20012                0.20012
         20          275          278                0.47529                0.47529
         21          261          264                0.20012                0.20012
         22          263          266                0.30018                0.30018
         23          282          285                0.60037                0.60037
         24          286          289                0.72545                0.72545
         25          275          278                0.47529                0.47529
         26          261          264                0.20012                0.20012
         27          300          300                0.87554                0.85053
         28          270          273                0.42526                0.42526
         29          283          286                 0.6504                 0.6504
         30          300          300                0.87554                0.85053
         31          260          263                 0.1751                 0.1751
         32          257          260                0.12507                0.12507
         33          257          260                0.12507                0.12507
         34          297          300                0.85053                0.85053
         35          267          270                0.37523                0.37523
         36          278          281                0.52532                0.52532
         37          279          282                0.55034                0.55034
         38          290          293                 0.8005                 0.8005
         39          286          289                0.72545                0.72545
         40          283          286                 0.6504                 0.6504
 
With just a slight adjustment to the SQL, we can get a little clearer picture of how the results changed.
SELECT p.id
,p.[Test 2 score] - p.[Test 1 score] as [Change in Score]
,p.[Test 2 PERCENTILE] - p.[Test 1 PERCENTILE] as [Change In Percentile]
FROM (
      SELECT d1.id
      ,d1.sc as [Test 1 score]
      ,d3.sc as [Test 2 score]
      ,wct.PERCENTRANK(d2.sc, d1.sc, 5) [Test 1 PERCENTILE]
      ,wct.PERCENTRANK(d4.sc, d3.sc, 5) [Test 2 PERCENTILE]
      FROM #d d1, #d d2, #d d3, #d d4
      WHERE d1.tn = 1
      AND d2.tn = 1
      AND d3.tn = 2
      AND d4.tn = 2
      AND d3.id = d1.id
      GROUP BY d1.id
      ,d1.sc
      ,d3.sc
      ) p
This produces the following result.

         id Change in Score   Change In Percentile
----------- --------------- ----------------------
          1               3                      0
          2               0               -0.02501
          3               3                      0
          4               3                      0
          5               0               -0.02501
          6               3                      0
          7               3                      0
          8               3                      0
          9               3                      0
         10               3                      0
         11               0               -0.02501
         12               3                      0
         13               3                      0
         14               3                      0
         15               3                      0
         16               3                      0
         17               3                      0
         18               3                      0
         19               3                      0
         20               3                      0
         21               3                      0
         22               3                      0
         23               3                      0
         24               3                      0
         25               3                      0
         26               3                      0
         27               0               -0.02501
         28               3                      0
         29               3                      0
         30               0               -0.02501
         31               3                      0
         32               3                      0
         33               3                      0
         34               3                      0
         35               3                      0
         36               3                      0
         37               3                      0
         38               3                      0
         39               3                      0
         40               3                      0


It’s clear from this analysis, that everybody who did not have a perfect score on the first test improved their test score by 3 points and that everybody who got a perfect score on the first test got a perfect score on the second. However, none of the test-takers showed an improvement in percentile and the best-performing group, the group with perfect scores, actually showed a decline in percentile. Despite the fact that everybody’s performance improved, the top percentile went from 0.87554 to 0.85053. Thus, the statement that none of the test-takers scored above the 86th percentile on the second test as opposed to none scored above the 88th percentile in the first test would be literally true, but would convey the idea that overall performance had declined, when in fact it had improved.
Let’s look at something a little more complicated. In this example, I use the baseball database from Sean Lahman’s site, http://baseball1.com/statistics/. In this database, he has a table of batting statistics which contains every player for every season for a total of 93,955 rows, which is a pretty small table for a database, but a massive amount of data for a spreadsheet.
Here’s the challenge. We want to calculate a PERCENTRANK for all the players who played in the 2010 season who had more than 199 at-bats (which is a relatively easy way to identify players who are everyday players). But, we want to calculate 4 different statistics all using players who had more than 199 at-bats. First, we want to calculate the percentile using the previous 3 years batting statistics for all MLB players. Second, we want to calculate the percentile using the 2010 statistics for all MLB players. Third, we want to calculate the percentile using the 2010 statistics for each league (AL or NL). Finally, we want to calculate the percentile for the team.
Here’s a picture of the batting table.




Our SQL will use a common table expression to calculate the 4 percentiles for each player, and then we will join to another table to get the player’s first name and last name to produce the report.
;with mycte as (
SELECT b1.playerID
,'3Y' as category
,b1.stint
,b1.H
,wct.PERCENTRANK(b2.H, b1.H, 3) Percentile
FROM BATTING b1, BATTING b2
WHERE b1.yearID = 2010
AND b1.AB > 199
AND b2.yearID > 2007
AND b2.AB > 199
GROUP by b1.playerID
,b1.stint
,b1.H
UNION ALL
SELECT b1.playerID
,'MLB' as category
,b1.stint
,b1.H
,wct.PERCENTRANK(b2.H, b1.H, 3) Percentile
FROM BATTING b1, BATTING b2
WHERE b1.yearID = 2010
AND b1.AB > 199
AND b2.yearID = b1.yearID
AND b2.AB > 199
GROUP by b1.playerID
,b1.stint
,b1.H
UNION ALL
SELECT b1.playerID
,b1.lgid
,b1.stint
,b1.H
,wct.PERCENTRANK(b2.H, b1.H, 3)
FROM BATTING b1, BATTING b2
WHERE b1.yearID = 2010
AND b1.AB > 199
AND b2.yearID = b1.yearID
AND b2.AB > 199
AND b2.lgid = b1.lgid
GROUP by b1.playerID
,b1.lgID
,b1.stint
,b1.H
UNION ALL
SELECT b1.playerID
,b1.teamID
,b1.stint
,b1.H
,wct.PERCENTRANK(b2.H, b1.H, 3)
FROM BATTING b1, BATTING b2
WHERE b1.yearID = 2010
AND b1.AB > 199
AND b2.yearID = b1.yearID
AND b2.AB > 199
AND b2.teamid = b1.teamid
GROUP by b1.playerID
,b1.teamID
,b1.stint
,b1.H
) SELECT master.nameLast + ', ' + master.nameFirst as Player
,m0.H
,m0.Percentile as [3Y Percentile]
,m1.Percentile as [MLB Percentile]
,m2.Percentile as [League Percentile]
,m3.Percentile as [Team Percentile]
FROM mycte m1, mycte m2, mycte m3, mycte m0, master
WHERE m0.playerID = master.playerID
AND m0.category = '3Y'
AND m1.playerID = m0.playerID
AND m1.stint = m0.stint
AND m1.category = 'MLB'
AND m2.playerID = m0.playerID
AND m2.stint = m0.stint
AND m2.category in('AL','NL')
AND m3.playerID = m0.playerID
AND m3.stint = m0.stint
AND m3.category not in('AL','NL','MLB','3Y')
ORDER BY 2 DESC, 1 ASC
Here are the first few rows of the resultant table.



Ichiro Suzuki is number one on the list (no surprise there) with 214 hits, which is in the 99.8 percentile for the last three years, was in the 100th percentile for MLB in 2010, was in the 100th percentile for the American League, and was in the 100th percentile for his team, the Seattle Mariners. Michael Young was in 10th place, with 186 hits which is in the 96.7 percentile for the last three years, 97.2 percentile for all of MLB in 2010, 96.1 percentile for his league and 88.8 for his team. Since he and Josh Hamilton (8th on the list) play for the same team, they each have 88.8 as the team percentile as they tied for the team lead.
It is interesting to note, that the PERCENTRANK function does return a value of 1, meaning the 100th percentile. There are a lot of people that will argue that a percentile value of 1 (or 100) doesn’t make any sense. They might also argue that a percentile value of 0 doesn’t make any sense, either.
Here’s the way that I think about this. If I am doing an analysis of an entire population (in this case, all the MLB baseball players who had more than 199 at-bats in 2010), then it seems entirely reasonable that one player has achieved the maximum performance, and that performance could be assigned a value of 1, as it is greater than all other values. This is why neither Josh Hamilton, nor Michael Young gets a 1 for the team percentile statistic, as neither was greater than all the others in the team.
However, if we are calculating percentiles for a sample, then it seems unlikely that the sample would include either the maximum value or the minimum value. And if we want to make assessments about the population based on the sample, I think that it makes sense to assume that maximum value in the sample must be less than the 100th percentile and that the minimum value in the sample must be greater than the 0th percentile.
This is why, in addition to the PERCENTRANK function, there is also a PERCENTRANK_EXC function (which is the equivalent of the PERCENTRANK.EXC function in EXCEL). The PERCENTRANK_EXC function EXCludes the 0th and 100th percentile. If we change our SQL to use the PERCENTRANK_EXC function instead of the PERCENTRANK function, then our results look like this:

 
If we wanted to answer the question, how many hits would I have needed to be in the 95th percentile in 2010, then we can use the PERCENTILE function. Like the PERCENTRANK_EXC function, there is a PERCENTILE_EXC function, excluding the 0th and 100th percentile.
How does this differ from MATLAB? First, MATLAB does not have the equivalent of the PERCENTRANK function. If you want to calculate the percentile for a given score, you have to write your own MATLAB routine.
MATLAB also has its own implementation of PERCENTILE, which does not agree with either the PERCENTILE or the PERCENTILE_EXC functions. Here’s a very simple MATLAB example.
x=(1:5)
x =
     1     2     3     4     5
 
y=prctile(x, [25, 50, 75])
y =
    1.7500    3.0000    4.2500
 

Here’s what we get with the PERCENTILE and the PERCENTILE_EXC function.
SELECT m.p as PERCENTILE
,wct.PERCENTILE(k.x,m.p) as [SCORE (Include)]
,wct.PERCENTILE_EXC(k.x,m.p) as [SCORE (Exclude)]
FROM (VALUES (1),(2),(3),(4),(5)) k(x)
CROSS APPLY(VALUES (.25),(.50),(.75)) m(p)
GROUP BY m.p
ORDER BY 1
This produces the following results.

            PERCENTILE        SCORE (Include)        SCORE (Exclude)
---------------------- ---------------------- ----------------------
                  0.25                      2                    1.5
                  0.50                      3                    3.0
                  0.75                      4                    4.5
 

As you can see, there is a significant difference. The MATLAB implementation is not supported in either EXCEL or in XLeratorDB.

In our SQL Server 2008 R2 test environment we have about 5.5 million rows of test data for the PERCENTRANK function. Our tests, which produce about 100,000 rows in the resultant table, run in about 11 seconds. Your performance may vary.

 

See Also

 

Tags:
Categories:
Copyright 2008-2013 WestClinTech LLC         Privacy Policy        Terms of Service