*A look at the 32 new aggregate functions in XLeratorDB/statistics2008 v1.08.*

A few months ago, we released XLeratorDB/statistics2008 which included 16 functions that had been converted from scalar functions to aggregates to take advantage of the *multi-input* aggregate function architecture available in SQL Server 2008. Those new aggregate functions perform up to 50 times faster than their scalar equivalents and are much easier to use. Those functions all had one thing in common; they were single-pass solutions. Basically this means that there was no reason to store the data for subsequent calculation. All the calculations requiring the data are done as the data are received into the function.

With the new release of XLeratorDB/statistics2008, we have taken the next logical step: creating functions that require two passes of the data. This means that the data are retained in the function and once all the data have been captured, the calculation can be performed. A simple example of this is the MEDIAN function. It is not possible to calculate the MEDIAN of a dataset until you know the number of items in the dataset and the data are sorted. In an aggregate function there is no way to force or guarantee the order of the data. Thus, the calculation is done once all the data are collected.

In our testing, the new aggregates performed, on average, about 10 times faster than the equivalent scalar implementations, but that will vary by machine. We also found that the more complex the SQL, the greater the performance improvement. We also think that implementing these functions as aggregates makes the SQL simpler and more intuitive.

We have converted twenty-five scalar functions to be multi-input aggregates and added seven new functions. The new aggregate functions that were converted from scalars are:

If you want to continue to use scalar versions of the functions, simply use the function name with ‘_q’ added at the end. For example, the scalar implementation of PERCENTILE, PERCENTILE_q, is still part of the package.

We also added four new functions to maintain parity with EXCEL 2010.

We also added three new functions to calculate the range of a dataset, the inter-quartile range, and the range between any two percentiles.

Let’s look at the PERCENTILE function to see what the change from scalar to aggregate means for SQL Server 2008 users. For purposes of this example, I am using a database of baseball statistics. The table that we are interested in has the following definition:

The first thing that we will calculate is the 95^{th} percentile for home runs (HR) in a season. We are only interested in those batters who have had 200 or more at-bats (AB) and we are only interested in years (yearID) subsequent to 1920. The SQL looks like this:

SELECT yearID

,wct.PERCENTILE(HR,.95) as HR --95th percentile

FROM Batting

WHERE AB > 199 --200 or more at-bats

AND yearID > 1920 --1921 to 2010

GROUP BY yearID

ORDER BY 1 DESC

Here are the first few rows returned by the SQL.

To do the same thing with the scalar implementation, using the PERCENTILE_q function, we would enter the following SQL.

SELECT b.yearid

,wct.PERCENTILE_q(

'SELECT HR

FROM batting

WHERE AB > 199

AND yearid = ' + cast(b.yearid as varchar)

,.95) as HR

FROM batting b

WHERE b.yearid > 1920

GROUP BY b.yearid

ORDER BY 1 DESC

This produces the same results as the aggregate, but the SQL is more complicated. In SQL Server 2005, the PERCENTILE function remains a scalar function. It it designed for simpler statements (PERCENTILE_q, above, allows for far greater flexibility), and to use it in this case requires staging of the data into a temporary table. In SQL Server 2005, we would enter something like this.

SELECT yearid, HR

INTO #b

FROM Batting

WHERE AB > 199

AND YEARID > 1920

SELECT #b.yearid

,wct.PERCENTILE('#b','HR','yearid',#b.yearid, .95)

from #b

GROUP BY #b.yearid

order by 1 DESC

DROP TABLE #b

This, also, produces the same results as the aggregate.

SQL Server 2005 users may be aware of the fact that we have taken some of our multi-input aggregates and implemented them as single input aggregates in SQL Server 2005, by concatenating values together. For example, the SLOPE function in SQL Server 2005 takes fmtSLOPE(y, x) as input, where is fmtSLOPE is a string with the concatenated y- and x-values.

We were able to do that for functions like SLOPE, because they have single-pass implementations. PERCENTILE and the 24 other functions that are included as aggregates in this release, do not. As a result, we will be continuing to maintain their scalar implementations in SQL Server 2005.

Aggregates in SQL Server 2005 have a MaxByteSize of 8000 which means that the aggregate functions can store up to 8000 bytes of data. For some functions this might be OK, but for functions like PERCENTILE or MEDIAN, this seems to be too little data.

SQL Server 2005 users will continue to have the scalar PERCENTILE and PERCENTILE_q functions available in XLeratorDB/statistics.

Let’s get back to the PERCENTILE aggregate. Now that we see how easy it is to use, let’s look at a few more examples.

We now realize that in our example that we want to split out the National League and the American League. We can simply add another column to our SQL.

SELECT yearID

,lgID

,wct.PERCENTILE(HR,.95) as HR --95th percentile

FROM Batting

WHERE AB > 199 --200 or more at-bats

AND yearID > 1920 --1921 to 2010

GROUP BY yearID, lgID

ORDER BY 1 DESC

Here are the first few rows returned by the SQL.

Let’s add the names of the players who were in the 95

SELECT B.playerid

,b.lgid

,B.yearid

,B.HR

FROM (

SELECT yearID

,lgID

,wct.PERCENTILE(HR,.95) as HR

FROM Batting

WHERE AB > 199

AND yearID > 1920

GROUP BY yearID, lgID

) N, BATTING B

WHERE B.yearID = N.yearID

AND B.lgID = N.lgID

AND B.HR >= N.HR

ORDER BY 3 desc, 2 asc, 1 ASC

Here are the first few rows returned by preceding statement.

Finally, let’s see which players have been at the 95

SELECT master.nameLast + ', ' + master.nameFirst as Player

,COUNT(*) as [Number of Seasons]

FROM (

SELECT B.playerid

,b.lgid

,B.yearid

,B.HR

FROM (

SELECT yearID

,lgID

,wct.PERCENTILE(HR,.95) as HR

FROM Batting

WHERE AB > 199

AND yearID > 1920

GROUP BY yearID, lgID

) N, BATTING B

WHERE B.yearID = N.yearID

AND B.lgID = N.lgID

AND B.HR >= N.HR

) m, MASTER

WHERE m.playerID = Master.playerID

GROUP BY master.nameLast, master.nameFirst

ORDER BY 2 DESC

Here are the first few rows returned from this SQL.

On my 32-bit laptop, with 94,000 rows in the Batting table, this last piece of SQL returned 368 rows and ran in less than a second. Could you even do something like this in EXCEL? I certainly wouldn’t know how.

We think that the new aggregate functions in XLeratorDB/statistics2008 are a major improvement over the previous scalar versions. They are more powerful, easier to use, and more intuitive. Let us know what you think.

Archive

Monthly

Go

| |||||||||

Sun | Mon | Tue | Wed | Thu | Fri | Sat | |||
---|---|---|---|---|---|---|---|---|---|

24 | 25 | 26 | 27 | 28 | 29 | 30 | |||

1 | 2 | 3 | 4 | 5 | 6 | 7 | |||

8 | 9 | 10 | 11 | 12 | 13 | 14 | |||

15 | 16 | 17 | 18 | 19 | 20 | 21 | |||

22 | 23 | 24 | 25 | 26 | 27 | 28 | |||

29 | 30 | 31 | 1 | 2 | 3 | 4 |

Go