We are happy to announce that we have re-implemented the following functions as multi-input aggregate functions in SQL Server 2008:
· CORREL
· COVAR
· DEVSQ
· FORECAST
· GROWTH
· INTERCEPT
· KURT
· PEARSON
· RSQ
· SKEW
· SLOPE
· STEYX
· TREND
· ZTEST
We found that after implementing these functions as aggregates, they performed up to twenty-five times faster than their scalar equivalents in our test harness.
As aggregate functions, they require a new syntax. For example, given the following table and data:
CREATE TABLE #c (
PROD nvarchar(10),
YR bigint,
SALES float)
INSERT INTO #c VALUES ('ABC',2005,10000)
INSERT INTO #c VALUES ('ABC',2006,11000)
INSERT INTO #c VALUES ('ABC',2007,12000)
INSERT INTO #c VALUES ('ABC',2008,13000)
INSERT INTO #c VALUES ('ABC',2009,14000)
INSERT INTO #c VALUES ('DEF',2005,10000)
INSERT INTO #c VALUES ('DEF',2006,12000)
INSERT INTO #c VALUES ('DEF',2007,14000)
INSERT INTO #c VALUES ('DEF',2008,16000)
INSERT INTO #c VALUES ('DEF',2009,18000)
INSERT INTO #c VALUES ('GHI',2005,10000)
INSERT INTO #c VALUES ('GHI',2006,12000)
INSERT INTO #c VALUES ('GHI',2007,14000)
INSERT INTO #c VALUES ('GHI',2008,12000)
INSERT INTO #c VALUES ('GHI',2009,10000)
in SQL Server 2005, we would enter the following SQL to calculate the SLOPE:
SELECT PROD,
wct.SLOPE('#c','sales','yr','PROD',PROD)
FROM #C
GROUP BY PROD
This produces the following result.
PROD SLOPE
---------- ----------------------
ABC 1000
DEF 2000
GHI 0
(3 row(s) affected)
In SQL Server 2008 the SQL looks like this:
SELECT PROD,
wct.SLOPE(SALES,YR) as SLOPE
FROM #C
GROUP BY PROD
which produces exactly the same result.
PROD SLOPE
---------- ----------------------
ABC 1000
DEF 2000
GHI 0
(3 row(s) affected)
A similar change occurs in the FORECAST function. In SQL Server 2005, this:
SELECT PROD,
wct.FORECAST(2010, '#c','sales','yr','PROD',PROD) as [2010 Sales]
FROM #C
GROUP BY PROD
produces the following result:
PROD 2010 Sales
---------- ----------------------
ABC 15000
DEF 20000
GHI 11600
(3 row(s) affected)
In SQL Server 2008, it becomes this:
SELECT PROD,
wct.FORECAST(2010,SALES,YR) as [2010 Sales]
FROM #C
GROUP BY PROD
which returns the same result:
PROD 2010 Sales
---------- ----------------------
ABC 15000
DEF 20000
GHI 11600
(3 row(s) affected)
Prior to making this change, XLeratorDB had implemented a type of scalar function that we called ‘range queries’. A range query lets a function consume many rows of multi-column input and return a single result. In SQL Server 2005, aggregate functions could only take a single input, and range queries overcame that limitation. Implemented as scalar functions, however, they could not take advantage of the aggregate function architecture. Many of these functions would collect multiple rows and columns of data, as well as scalar values, to perform complex financial, statistical, or mathematical calculations.
In SQL Server 2008, it became possible to re-implement some of these scalar range-query functions as multi-input aggregate functions. One of our clients was using the SLOPE function on a database containing 430,000 rows, and his query ran in under twelve seconds. He asked whether we could get the time down to around a second. To do that, we took a hard look at the new multi-input aggregate function feature in SQL Server 2008.
We realized almost immediately that functions like these required a ‘one-pass’ solution. The existing XLeratorDB implementations were ‘two-pass’ solutions, matching the precision of the comparable EXCEL functions, and initially we were worried that the potential loss of significance in a one-pass solution might be a problem. We therefore decided to make both options available, so XLeratorDB/statistics 2008 includes both one-pass and two-pass solutions. The two-pass implementation, however, is always slower.
The one-pass implementation lends itself very nicely to the aggregate function architecture: we do not need to keep track of the order in which rows arrive, and a simple calculation at the end of processing produces the result.
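To make the distinction concrete, here is a rough sketch of the two approaches using only built-in T-SQL aggregates against the #c table above. This illustrates the technique, not the actual XLeratorDB implementation:
-- One-pass form: accumulate n, sum(x), sum(y), sum(x*y), and sum(x*x)
-- in a single scan, then combine the sums at the end:
--   slope = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x*x) - sum(x)*sum(x))
-- (the real aggregate also handles NULLs and degenerate cases)
SELECT PROD,
(COUNT(*) * SUM(YR * SALES) - SUM(YR) * SUM(SALES))
/ (COUNT(*) * SUM(YR * YR) - SUM(YR) * SUM(YR)) as SLOPE
FROM #c
GROUP BY PROD

-- Two-pass form: compute the means first, then sum the centered products:
--   slope = sum((x - xbar)*(y - ybar)) / sum((x - xbar)*(x - xbar))
SELECT c.PROD,
SUM((c.YR - m.mx) * (c.SALES - m.my))
/ SUM((c.YR - m.mx) * (c.YR - m.mx)) as SLOPE
FROM #c c JOIN (
SELECT PROD, AVG(CAST(YR as float)) as mx, AVG(SALES) as my
FROM #c
GROUP BY PROD
) m ON m.PROD = c.PROD
GROUP BY c.PROD
The one-pass form is faster but can lose significance when the x-values are large relative to their spread, which is exactly the issue illustrated below.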
Is the one-pass vs. two-pass calculation really much of an issue? On a practical level, we do not think so. Here is an example taken from the EXCEL knowledge base.
CREATE TABLE #n(
y float,
x float
)
INSERT INTO #n VALUES (1, 1000000003)
INSERT INTO #n VALUES (2, 1000000004)
INSERT INTO #n VALUES (3, 1000000002)
INSERT INTO #n VALUES (4, 1000000005)
INSERT INTO #n VALUES (5, 1000000004)
INSERT INTO #n VALUES (6, 1000000007)
SELECT wct.SLOPE(y,x) as [One-Pass]
,wct.SLOPE_q('SELECT y, x from #n') as [Two-Pass]
from #n
In this example we use the one-pass multi-input aggregate SLOPE function and the two-pass range-query (scalar) SLOPE_q function, displaying the results side by side. This produces the following result.
One-Pass Two-Pass
---------------------- ----------------------
NULL 0.775280898876404
(1 row(s) affected)
The one-pass solution returns a NULL and the two-pass solution returns the same value as EXCEL. As the EXCEL documentation points out, this occurs when the data contains a high number of significant digits combined with relatively little variation between data values. If your data meet these criteria, you can simply use the SLOPE_q function, or, if you prefer, use the SLOPE function and adjust the x-values by the mean of x:
SELECT wct.SLOPE(y,x-c.mean) as SLOPE
FROM (
SELECT AVG(x) as mean
from #n
) c, #n
This produces the following result.
SLOPE
----------------------
0.775280898876405
(1 row(s) affected)
We picked these 14 functions because they lent themselves quite nicely to this type of implementation. Other range-query functions did not. AVEDEV, for example, averages the absolute deviations from the mean; the absolute value cannot be expanded algebraically into order-free running sums, so the mean must be known before the deviations can be accumulated, which makes it inherently two-pass.
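As a rough illustration (using the #n table from above, not the XLeratorDB function itself), AVEDEV is naturally expressed as two passes:
-- Pass 1 computes the mean; pass 2 averages the absolute deviations.
-- The ABS() is what prevents collapsing this into a single scan.
SELECT AVG(ABS(n.x - m.mean)) as AVEDEV
FROM #n n CROSS JOIN (
SELECT AVG(x) as mean
FROM #n
) m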
The improvement in performance in our environment was spectacular. We compared the performance of the scalar functions and the aggregate functions on a Dell Inspiron 5305 with an Intel Pentium Dual CPU E2160 processor @ 1.80GHz and 4 GB of RAM, running Windows Vista. The following table summarizes the performance improvements:
Aggregate    Scalar         Performance Multiple
----------   ------------   --------------------
CORREL       CORREL_q       9.6
COVAR        COVAR_q        25.3
DEVSQ        DEVSQ_q        15.6
FORECAST     FORECAST_q     19.2
GROWTH       GROWTH_q       14.9
INTERCEPT    INTERCEPT_q    23.4
KURT         KURT_q         14.9
PEARSON      PEARSON_q      11.3
RSQ          RSQ_q          11.8
SKEW         SKEW_q         15.6
SLOPE        SLOPE_q        24.3
STEYX        STEYX_q        20.2
TREND        TREND_q        19.5
ZTEST        ZTEST_q        11.0
Our tests were run against datasets containing anywhere from 1.5 million to 5.5 million rows, depending on the function, each producing 100,000 rows in the result set. The following table summarizes the average number of rows processed per second for each function:
Aggregate    Rows per second   Scalar         Rows per second
----------   ---------------   ------------   ---------------
CORREL       570,002           CORREL_q       59,540
COVAR        599,867           COVAR_q        23,685
DEVSQ        761,705           DEVSQ_q        48,685
FORECAST     376,916           FORECAST_q     19,615
GROWTH       264,807           GROWTH_q       17,717
INTERCEPT    475,366           INTERCEPT_q    20,342
KURT         709,185           KURT_q         47,577
PEARSON      514,062           PEARSON_q      45,420
RSQ          541,770           RSQ_q          45,890
SKEW         723,648           SKEW_q         46,417
SLOPE        488,503           SLOPE_q        20,140
STEYX        1,439,992         STEYX_q        71,226
TREND        391,245           TREND_q        20,019
ZTEST        514,829           ZTEST_q        46,937
We believe that for the first time, you will be able to perform these types of calculations on large datasets directly in T-SQL.
Here is one example: if you want to use the FORECAST function to calculate next year’s sales for every company listed on the NYSE, NASDAQ, FTSE, and any other exchange you can think of (assuming you have the historical financial data), now you can.
In about a second.
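As a sketch of what such a query might look like (the sales_history table and its columns are hypothetical; substitute your own historical data):
-- Hypothetical table of historical annual sales per ticker:
-- CREATE TABLE sales_history (ticker nvarchar(12), yr int, sales float)
SELECT ticker,
wct.FORECAST(2010, sales, yr) as [2010 Sales]
FROM sales_history
GROUP BY ticker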