*Using the XLeratorDB/statistics 2008 PERCENTILE and PERCENTRANK functions in SQL Server 2008.*

**UPDATE:** XLeratorDB has released a new table-valued PERCENTILE function that offers significant performance improvements over the aggregate function.

If you go to Wikipedia and look up the definition of PERCENTILE, the definition starts “There is no standard definition of percentile, however all definitions yield similar results when the number of observations is very large.” If you are an EXCEL 2010 user, you might have already guessed this, since the PERCENTILE function has morphed into 2 functions; PERCENTILE.INC and PERCENTILE.EXC. If you are not an EXCEL 2010 user, you may have been scratching your head over different results produced by different packages, say EXCEL and Matlab, and wondering which is right.

Considering how often percentiles are used (height and weight charts, test scoring, performance evaluations, figuring wages, etc.) it’s important to understand the different ways to calculate percentile and what a percentile value actually means.

Generally, percentile calculations can be broken down into two functions. The first function evaluates a score (*k*) of some type (something like height, weight, test scores, length, etc.) within a set of scores and returns a value between 0 and 1 (*p*) indicating a relative ranking of that score in relation to all the other scores in the set. This ranking will indicate the percentage of scores which fall below *k *in the set. This is the PERCENTRANK function.

The second function evaluates a percentage (*p*) on a set of scores and returns a value (k) such that the percentage of scores in the set falls below *k*. The value *k* need not be a member of the set, though it is bound by the minimum and maximum values of the set. This is the PERCENTILE function.

For example, let’s say you have a 4^{th} grade student in a class of 40 who takes a standardized exam that is given to all 4^{th} grade students in the United States. You receive results broken down into a national percentile, a percentile for all private schools, and a percentile within school. Let’s say that this student scores in the 99^{th} percentile nationally, 99^{th} percentile for private schools, and the 21^{st} percentile in-school. Is it possible to conclude that this student is near the bottom of his class and that 78% of his classmates outperformed him on this exam?

The maximum possible score on the exam is 100. In the class of 40, 32 students got a perfect score of 100, and the remaining students got a score of 99 (scores are not disclosed in the reporting of the results). Let’s calculate the value for the 21^{st} percentile in this group.

SELECT wct.PERCENTILE(x,.21) as PERCENTILE

FROM (VALUES

(99),(99),(99),(99),(99),(99),(99),(99),

(100),(100),(100),(100),(100),(100),(100),(100),

(100),(100),(100),(100),(100),(100),(100),(100),

(100),(100),(100),(100),(100),(100),(100),(100),

(100),(100),(100),(100),(100),(100),(100),(100)

) n(x)

This produces the following result.

PERCENTILE

----------------------

100

Wait, but isn’t 100 the maximum score? How could someone get the maximum score and still be in the 21

This gets back to the definition of percentile. While there is no standard definition, the differences in the definitions are about methods of computation rather than interpretation of the resulting statistic. The percentile statistic is the value below which a certain percent of observations fall. In this example, we know that 21 percent of the observations fall below a value of 100. We cannot draw any conclusions about how many observations are above.

Let’s look at another example. There are 40 scores in this example, with a maximum value of 300 for the score. We will calculate the 90^{th}, 95^{th}, and 99^{th} percentile.

SELECT m.p as PERCENTILE

,wct.PERCENTILE(n.x,m.p) as SCORE

FROM (VALUES

(257),(261),(275),(283),(282),(261),(279),(256),

(263),(261),(283),(282),(249),(300),(257),(263),

(286),(261),(275),(245),(288),(267),(292),(255),

(279),(270),(300),(286),(285),(297),(278),(300),

(260),(269),(264),(273),(256),(300),(300),(290)

) n(x)

CROSS APPLY(VALUES (.90),(.95),(.99)) m(p)

GROUP BY m.p

ORDER BY 1

This produces the following result.

PERCENTILE SCORE

--------------------------------------- ----------------------

0.90 300

0.95 300

0.99 300

We can see that all three percentiles contain the maximum score, so all we can safely conclude is that a score of 300 is better than 90% of the reported scores. Since 300 is the maximum score, there is no ‘room for improvement’ for scores in the 90

Frequently, instead of coming up with the score based on a percentile, we will need to come up with percentile based upon the score. To do this, we use the PERCENTRANK function.

In this example, we will use the same test scores from the previous example and for each test score we will calculate the corresponding percentile value. Even though there are 40 scores in our data set, the resultant table will only contain 25 scores as the duplicates are eliminated.

SELECT *

INTO #c

FROM (VALUES

(257),(261),(275),(283),(282),(261),(279),(256),

(263),(261),(283),(282),(249),(300),(257),(263),

(286),(261),(275),(245),(288),(267),(292),(255),

(279),(270),(300),(286),(285),(297),(278),(300),

(260),(269),(264),(273),(256),(300),(300),(290)

) n(x)

SELECT c1.x

,wct.PERCENTRANK(c2.x,c1.x,5) as PERCENTRANK

FROM #c c1, #c c2

GROUP BY c1.x

This produces the following result.

x PERCENTRANK

----------- ----------------------

245 0

249 0.02564

255 0.05128

256 0.07594

257 0.12658

260 0.17948

261 0.20125

263 0.30379

264 0.35897

267 0.38461

269 0.41025

270 0.43589

273 0.46153

275 0.48101

278 0.53846

279 0.55696

282 0.60759

283 0.65822

285 0.71794

286 0.73417

288 0.79487

290 0.82051

292 0.84615

297 0.87179

300 0.87939

We can see that a score of 300 corresponds to the 87.9 percentile; a score of 245 corresponds to the 0 percentile. We can see that the 50

Let’s assume that the students have the opportunity to take the exam again, the teacher reviews the materials, and everybody retakes the exam.

SELECT *

INTO #c

FROM (VALUES

(248),(252),(258),(259),(259),(260),(260),(263),

(264),(264),(264),(264),(266),(266),(267),(270),

(272),(273),(276),(278),(278),(281),(282),(282),

(285),(285),(286),(286),(288),(289),(289),(291),

(293),(295),(300),(300),(300),(300),(300),(300)

) n(x)

SELECT c1.x

,wct.PERCENTRANK(c2.x,c1.x,5) as PERCENTRANK

FROM #c c1, #c c2

GROUP BY c1.x

DROP TABLE #c

This produces the following result.

x PERCENTRANK

----------- ----------------------

248 0

252 0.02564

258 0.05128

259 0.07594

260 0.12658

263 0.17948

264 0.20125

266 0.30379

267 0.35897

270 0.38461

272 0.41025

273 0.43589

276 0.46153

278 0.48101

281 0.53846

282 0.55696

285 0.60759

286 0.65822

288 0.71794

289 0.73417

291 0.79487

293 0.82051

295 0.84615

300 0.85355

We can see that there was no improvement in PERCENTRANK for any of their test scores and an actual decline in the percentile value for those with scores of 300. Let’s analyze this data a little further by assigning a test score to an id and keeping track of the scores for test 1 and test 2. We can then select the id, the test scores for test 1 and test 2, and the percentiles for test 1 and test 2.

SELECT *

INTO #d

FROM (VALUES

(15,1,245),(15,2,248),(9,1,249),(9,2,252),(10,1,255),(10,2,258),

(7,1,256),(7,2,259),(17,1,256),(17,2,259),(32,1,257),(32,2,260),

(33,1,257),(33,2,260),(31,1,260),(31,2,263),(26,1,261),(26,2,264),

(19,1,261),(19,2,264),(8,1,261),(8,2,264),(21,1,261),(21,2,264),

(22,1,263),(22,2,266),(6,1,263),(6,2,266),(14,1,264),(14,2,267),

(35,1,267),(35,2,270),(18,1,269),(18,2,272),(28,1,270),(28,2,273),

(4,1,273),(4,2,276),(20,1,275),(20,2,278),(25,1,275),(25,2,278),

(36,1,278),(36,2,281),(3,1,279),(3,2,282),(37,1,279),(37,2,282),

(13,1,282),(13,2,285),(23,1,282),(23,2,285),(29,1,283),(29,2,286),

(40,1,283),(40,2,286),(16,1,285),(16,2,288),(24,1,286),(24,2,289),

(39,1,286),(39,2,289),(1,1,288),(1,2,291),(38,1,290),(38,2,293),

(12,1,292),(12,2,295),(34,1,297),(34,2,300),(27,1,300),(27,2,300),

(11,1,300),(11,2,300),(2,1,300),(2,2,300),(5,1,300),(5,2,300),

(30,1,300),(30,2,300)

) n(id,tn,sc)

SELECT d1.id

,d1.sc as [Test 1 score]

,d3.sc as [Test 2 score]

,wct.PERCENTRANK(d2.sc, d1.sc, 5) [Test 1 PERCENTILE]

,wct.PERCENTRANK(d4.sc, d3.sc, 5) [Test 2 PERCENTILE]

FROM #d d1, #d d2, #d d3, #d d4

WHERE d1.tn = 1

AND d2.tn = 1

AND d3.tn = 2

AND d4.tn = 2

AND d3.id = d1.id

GROUP BY d1.id

,d1.sc

,d3.sc

This produces the following result.

id Test 1 score Test 2 score Test 1 PERCENTILE Test 2 PERCENTILE

----------- ------------ ------------ ---------------------- ----------------------

1 288 291 0.77548 0.77548

2 300 300 0.87554 0.85053

3 279 282 0.55034 0.55034

4 273 276 0.45028 0.45028

5 300 300 0.87554 0.85053

6 263 266 0.30018 0.30018

7 256 259 0.07504 0.07504

8 261 264 0.20012 0.20012

9 249 252 0.02501 0.02501

10 255 258 0.05003 0.05003

11 300 300 0.87554 0.85053

12 292 295 0.82551 0.82551

13 282 285 0.60037 0.60037

14 264 267 0.35021 0.35021

15 245 248 0 0

16 285 288 0.70043 0.70043

17 256 259 0.07504 0.07504

18 269 272 0.40025 0.40025

19 261 264 0.20012 0.20012

20 275 278 0.47529 0.47529

21 261 264 0.20012 0.20012

22 263 266 0.30018 0.30018

23 282 285 0.60037 0.60037

24 286 289 0.72545 0.72545

25 275 278 0.47529 0.47529

26 261 264 0.20012 0.20012

27 300 300 0.87554 0.85053

28 270 273 0.42526 0.42526

29 283 286 0.6504 0.6504

30 300 300 0.87554 0.85053

31 260 263 0.1751 0.1751

32 257 260 0.12507 0.12507

33 257 260 0.12507 0.12507

34 297 300 0.85053 0.85053

35 267 270 0.37523 0.37523

36 278 281 0.52532 0.52532

37 279 282 0.55034 0.55034

38 290 293 0.8005 0.8005

39 286 289 0.72545 0.72545

40 283 286 0.6504 0.6504

With just a slight adjustment to the SQL, we can get a little clearer picture of how the results changed.

SELECT p.id

,p.[Test 2 score] - p.[Test 1 score] as [Change in Score]

,p.[Test 2 PERCENTILE] - p.[Test 1 PERCENTILE] as [Change In Percentile]

FROM (

SELECT d1.id

,d1.sc as [Test 1 score]

,d3.sc as [Test 2 score]

,wct.PERCENTRANK(d2.sc, d1.sc, 5) [Test 1 PERCENTILE]

,wct.PERCENTRANK(d4.sc, d3.sc, 5) [Test 2 PERCENTILE]

FROM #d d1, #d d2, #d d3, #d d4

WHERE d1.tn = 1

AND d2.tn = 1

AND d3.tn = 2

AND d4.tn = 2

AND d3.id = d1.id

GROUP BY d1.id

,d1.sc

,d3.sc

) p

This produces the following result.

id Change in Score Change In Percentile

----------- --------------- ----------------------

1 3 0

2 0 -0.02501

3 3 0

4 3 0

5 0 -0.02501

6 3 0

7 3 0

8 3 0

9 3 0

10 3 0

11 0 -0.02501

12 3 0

13 3 0

14 3 0

15 3 0

16 3 0

17 3 0

18 3 0

19 3 0

20 3 0

21 3 0

22 3 0

23 3 0

24 3 0

25 3 0

26 3 0

27 0 -0.02501

28 3 0

29 3 0

30 0 -0.02501

31 3 0

32 3 0

33 3 0

34 3 0

35 3 0

36 3 0

37 3 0

38 3 0

39 3 0

40 3 0

It’s clear from this analysis, that everybody who did not have a perfect score on the first test improved their test score by 3 points and that everybody who got a perfect score on the first test got a perfect score on the second. However, none of the test-takers showed an improvement in percentile and the best-performing group, the group with perfect scores, actually showed a decline in percentile. Despite the fact that everybody’s performance improved, the top percentile went from 0.87554 to 0.85053. Thus, the statement that none of the test-takers scored above the 86

Let’s look at something a little more complicated. In this example, I use the baseball database from Sean Lahman’s site, http://baseball1.com/statistics/. In this database, he has a table of batting statistics which contains every player for every season for a total of 93,955 rows, which is a pretty small table for a database, but a massive amount of data for a spreadsheet.

Here’s the challenge. We want to calculate a PERCENTRANK for all the players who played in the 2010 season who had more than 199 at-bats (which is a relatively easy way to identify players who are everyday players). But, we want to calculate 4 different statistics all using players who had more than 199 at-bats. First, we want to calculate the percentile using the previous 3 years batting statistics for all MLB players. Second, we want to calculate the percentile using the 2010 statistics for all MLB players. Third, we want to calculate the percentile using the 2010 statistics for each league (AL or NL). Finally, we want to calculate the percentile for the team.

Here’s a picture of the batting table.

Our SQL will use a common table expression to calculate the 4 percentiles for each player, and then we will join to another table to get the player’s first name and last name to produce the report.

;with mycte as (

SELECT b1.playerID

,'3Y' as category

,b1.stint

,b1.H

,wct.PERCENTRANK(b2.H, b1.H, 3) Percentile

FROM BATTING b1, BATTING b2

WHERE b1.yearID = 2010

AND b1.AB > 199

AND b2.yearID > 2007

AND b2.AB > 199

GROUP by b1.playerID

,b1.stint

,b1.H

UNION ALL

SELECT b1.playerID

,'MLB' as category

,b1.stint

,b1.H

,wct.PERCENTRANK(b2.H, b1.H, 3) Percentile

FROM BATTING b1, BATTING b2

WHERE b1.yearID = 2010

AND b1.AB > 199

AND b2.yearID = b1.yearID

AND b2.AB > 199

GROUP by b1.playerID

,b1.stint

,b1.H

UNION ALL

SELECT b1.playerID

,b1.lgid

,b1.stint

,b1.H

,wct.PERCENTRANK(b2.H, b1.H, 3)

FROM BATTING b1, BATTING b2

WHERE b1.yearID = 2010

AND b1.AB > 199

AND b2.yearID = b1.yearID

AND b2.AB > 199

AND b2.lgid = b1.lgid

GROUP by b1.playerID

,b1.lgID

,b1.stint

,b1.H

UNION ALL

SELECT b1.playerID

,b1.teamID

,b1.stint

,b1.H

,wct.PERCENTRANK(b2.H, b1.H, 3)

FROM BATTING b1, BATTING b2

WHERE b1.yearID = 2010

AND b1.AB > 199

AND b2.yearID = b1.yearID

AND b2.AB > 199

AND b2.teamid = b1.teamid

GROUP by b1.playerID

,b1.teamID

,b1.stint

,b1.H

) SELECT master.nameLast + ', ' + master.nameFirst as Player

,m0.H

,m0.Percentile as [3Y Percentile]

,m1.Percentile as [MLB Percentile]

,m2.Percentile as [League Percentile]

,m3.Percentile as [Team Percentile]

FROM mycte m1, mycte m2, mycte m3, mycte m0, master

WHERE m0.playerID = master.playerID

AND m0.category = '3Y'

AND m1.playerID = m0.playerID

AND m1.stint = m0.stint

AND m1.category = 'MLB'

AND m2.playerID = m0.playerID

AND m2.stint = m0.stint

AND m2.category in('AL','NL')

AND m3.playerID = m0.playerID

AND m3.stint = m0.stint

AND m3.category not in('AL','NL','MLB','3Y')

ORDER BY 2 DESC, 1 ASC

Here are the first few rows of the resultant table.

Ichiro Suzuki is number one on the list (no surprise there) with 214 hits, which is in the 99.8 percentile for the last three years, was in the 100^{th} percentile for MLB in 2010, was in the 100^{th} percentile for the American League, and was in the 100^{th} percentile for his team, the Seattle Mariners. Michael Young was in 10^{th} place, with 186 hits which is in the 96.7 percentile for the last three years, 97.2 percentile for all of MLB in 2010, 96.1 percentile for his league and 88.8 for his team. Since he and Josh Hamilton (8^{th} on the list) play for the same team, they each have 88.8 as the team percentile as they tied for the team lead.

It is interesting to note, that the PERCENTRANK function does return a value of 1, meaning the 100^{th} percentile. There are a lot of people that will argue that a percentile value of 1 (or 100) doesn’t make any sense. They might also argue that a percentile value of 0 doesn’t make any sense, either.

Here’s the way that I think about this. If I am doing an analysis of an entire population (in this case, all the MLB baseball players who had more than 199 at-bats in 2010), then it seems entirely reasonable that one player has achieved the maximum performance, and that performance could be assigned a value of 1, as it is greater than all other values. This is why neither Josh Hamilton, nor Michael Young gets a 1 for the team percentile statistic, as neither was greater than all the others in the team.

However, if we are calculating percentiles for a sample, then it seems unlikely that the sample would include either the maximum value or the minimum value. And if we want to make assessments about the population based on the sample, I think that it makes sense to assume that maximum value in the sample must be less than the 100^{th} percentile and that the minimum value in the sample must be greater than the 0^{th} percentile.

This is why, in addition to the PERCENTRANK function, there is also a PERCENTRANK_EXC function (which is the equivalent of the PERCENTRANK.EXC function in EXCEL). The PERCENTRANK_EXC function EXCludes the 0^{th} and 100^{th} percentile. If we change our SQL to use the PERCENTRANK_EXC function instead of the PERCENTRANK function, then our results look like this:

If we wanted to answer the question, how many hits would I have needed to be in the 95^{th} percentile in 2010, then we can use the PERCENTILE function. Like the PERCENTRANK_EXC function, there is a PERCENTILE_EXC function, excluding the 0^{th} and 100^{th} percentile.

How does this differ from MATLAB? First, MATLAB does not have the equivalent of the PERCENTRANK function. If you want to calculate the percentile for a given score, you have to write your own MATLAB routine.

MATLAB also has its own implementation of PERCENTILE, which does not agree with either the PERCENTILE or the PERCENTILE_EXC functions. Here’s a very simple MATLAB example.

x=(1:5)

x =

1 2 3 4 5

y=prctile(x, [25, 50, 75])

y =

1.7500 3.0000 4.2500

Here’s what we get with the PERCENTILE and the PERCENTILE_EXC function.

SELECT m.p as PERCENTILE

,wct.PERCENTILE(k.x,m.p) as [SCORE (Include)]

,wct.PERCENTILE_EXC(k.x,m.p) as [SCORE (Exclude)]

FROM (VALUES (1),(2),(3),(4),(5)) k(x)

CROSS APPLY(VALUES (.25),(.50),(.75)) m(p)

GROUP BY m.p

ORDER BY 1

This produces the following results.

PERCENTILE SCORE (Include) SCORE (Exclude)

---------------------- ---------------------- ----------------------

0.25 2 1.5

0.50 3 3.0

0.75 4 4.5

As you can see, there is a significant difference. The MATLAB implementation is not supported in either EXCEL or in XLeratorDB.

In our SQL Server 2008 R2 test environment we have about 5.5 million rows of test data for the PERCENTRANK function. Our tests, which produce about 100,000 rows in the resultant table, run in about 11 seconds. Your performance may vary.

· PERCENTILES - table-valued function to calculate one or many percentile values from a single dataset

Archive

Monthly

Go

| |||||||||

Sun | Mon | Tue | Wed | Thu | Fri | Sat | |||
---|---|---|---|---|---|---|---|---|---|

28 | 29 | 30 | 1 | 2 | 3 | 4 | |||

5 | 6 | 7 | 8 | 9 | 10 | 11 | |||

12 | 13 | 14 | 15 | 16 | 17 | 18 | |||

19 | 20 | 21 | 22 | 23 | 24 | 25 | |||

26 | 27 | 28 | 29 | 30 | 31 | 1 | |||

2 | 3 | 4 | 5 | 6 | 7 | 8 |

Go