Calculating Percentiles in SQL Server: an update
Aug
7
Written by:
Charles Flock
8/7/2015 2:33 PM
Our original blog on Calculating percentiles in SQL Server, published in 2011, is one of the most popular pages at westclintech.com. In this article, we talk about a tablevalued approach to calculating percentiles which in some situations may improve performance dramatically over the PERCENTILE aggregate function.
Calculating percentiles requires two things: an ordered set of values and the desired percentile, which, for our purposes, is expressed as P/100 (so the 95^{th} percentile would be 0.95). In 2011 we introduced the PERCENTILE and PERCENTILE_EXC aggregate functions for SQL Server 2008 (and 2012 and 2014). These functions are the equivalent of the Excel PERCENTILE.INC and PERCENTILE.EXC functions.
Userdefined aggregate functions in SQL Server are not able to control the order in which data are passed into the function; data arrives in the order that the optimizer thinks best. The percentile functions must store the data and sort it before calculating the percentile. In our environment, we refer to these types of functions as twopass functions as we need to pass over the data twice; first when the function receives the data it must be saved, and second, the saved data must be sorted before the result can be calculated.
Thus, the number of values that need to be sorted has a direct impact on the performance of these two functions. In our testing, these functions perform really well up to about 10,000 rows and then performance starts to lag perceptibly.
Because of this, we have created a new tablevalued function, PERCENTILES, which can be used to calculate one or many percentiles from an ordered dataset. Unlike the aggregate functions which relies on the input dataset to be sorted; it does no ordering of the data. Since the tablevalued function uses dynamic SQL to get the data, simply make sure that the SQL string sent into the function contains the appropriate ORDER BY.
This table tells us the elapsed time, in seconds, to return a single percentile value from a dataset with the indicated number of rows and compares the performance of the aggregate function and the tablevalued function.
Number of Rows

Elapsed Time


Aggregate

TVF

1,000

0.003

0.03

10,000

0.113

0.043

20,000

0.67

0.103

30,000

1.103

0.143

40,000

2.293

0.203

50,000

3.413

0.326

60,000

4.52

0.376

70,000

6.31

0.39

80,000

7.713

0.57

90,000

9.493

0.516

100,000

11.5

0.5

200,000

39.97

1.083

300,000

82.846

1.71

400,000

138.2

2.193

500,000

207.126

2.813

Here's a graphical representation of the data.
As you can see, the tablevalued function performs significantly better than the aggregate as input volumes increase, exhibiting almost linear growth rather than the exponential growth of the aggregate function. In our environment with millions of rows of data with 100 percentile points to be evaluated we have found the tablevalued function to perform thousands of times faster than the aggregate function.
We did further analysis and looked at performance based not only on the size of the input dataset but also based on the number of points of interest. For example, for a given dataset you might be interested in calculating the 0^{th}, 25^{th}, 50^{th}, 75^{th}, and 100^{th} percentile (the quartiles). This results in 5 invocations of the aggregate function. If you were interested in the quintiles, there would be 6 invocations; the deciles, 11 invocations, and so on.
We ran a script that created a normal distribution with a mean of 100 and a standard deviation of 15 for 10,000 to 100,000 rows in multiples of 10,000. For each of these datasets we then calculated 1, 3, 5, 6, 11, 21, 51, and 101 percentile values. The following table summarizes the results for the aggregate function.

Elapsed Time in Seconds


1

3

5

6

11

21

51

101

10,000

0.09

0.13

0.196

0.233

0.37

0.683

1.623

3.056

20,000

0.526

0.956

1.136

1.246

2.316

3.44

9.956

18.796

30,000

1.383

2.363

3.013

2.943

5.236

9.246

23.626

46.536

40,000

2.61

3.68

5.09

5.916

10.84

18.233

42.253

85.866

50,000

4.206

6.06

7.98

8.806

15.913

29.496

68.583

134.423

60,000

5.336

8.096

11.96

12.93

22.776

41.67

99.033

195.47

70,000

7.15

10.753

15.493

17.536

31.146

56.303

136.98

266.03

80,000

9.03

14.643

21.13

22.84

40.01

75.433

177.636

350.563

90,000

10.94

18.373

26.546

29.346

52.6

95.79

224.196

443.503

100,000

13.29

21.596

30.676

34.79

63.193

116.42

279.21

551.986

Calculating a single percentile value from a dataset containing 10,000 rows took 0.09 seconds. When we increase the number of rows to 100,000 it took 13.29 second.
Calculating 101 percentile values from a dataset contain 10,000 rows took 3.056 seconds. When we increase the number of rows to 100,000 it took 551.986 seconds.
Here's a graphical representation of the results.
We modified the script to perform the same calculations using the tablevalued function. We couldn’t use the same script because the syntax for tablevalued functions and aggregate function are different. Here are the results of that test.

Elapsed Time in Seconds


1

3

5

6

11

21

51

101

10,000

0.043

0.043

0.053

0.043

0.046

0.04

0.043

0.043

20,000

0.09

0.09

0.09

0.09

0.09

0.103

0.126

0.116

30,000

0.136

0.133

0.133

0.133

0.136

0.136

0.136

0.133

40,000

0.193

0.176

0.183

0.18

0.183

0.183

0.176

0.183

50,000

0.23

0.263

0.236

0.226

0.223

0.226

0.23

0.226

60,000

0.273

0.28

0.276

0.293

0.276

0.276

0.27

0.276

70,000

0.366

0.336

0.346

0.33

0.346

0.336

0.366

0.34

80,000

0.45

0.396

0.39

0.396

0.42

0.383

0.39

0.41

90,000

0.43

0.456

0.443

0.453

0.473

0.436

0.436

0.446

100,000

0.543

0.506

0.483

0.583

0.486

0.526

0.49

0.513

Calculating a single percentile from a dataset containing 10,000 rows took 0.043 seconds. Calculating 101 percentiles from the dataset also taking 0.043 seconds.
Calculating a single percentile from a dataset containing 100,000 rows took 0.543 seconds. Calculating 101 percentiles from the dataset took 0.513 seconds.
In other words, from 10,000 rows to 100,000 rows and from 1 point to 101 points it never took more 0.543 seconds to calculate the results; far better than the aggregate's performance.
Here are the results of the tests of the tablevalued function, using the same scale as the results from aggregate function.
XLeratorDB will continue to provide the aggregate functions PERCENTILE and PERCENITLE_EXC as well as providing the new tablevalued function PERCENTILES. You can decide which better suits your needs.
If you are already using XLeratorDB, just login to your account and redownload the software from your 'My Orders' page to check it out. If haven't already tried XLeratorDB, just click here to get the 15day trial and discover how to turn your SQL Server into an analytics engine.