We are happy to announce that we have reimplemented the following functions as multi-input aggregate functions in SQL Server 2008:
· CORREL
· COVAR
· DEVSQ
· FORECAST
· GROWTH
· INTERCEPT
· KURT
· PEARSON
· RSQ
· SKEW
· SLOPE
· STEYX
· TREND
· ZTEST
We found that after implementing these functions as aggregates, they performed up to twenty-five times faster than their scalar equivalents in our test harness.
As aggregate functions, they require a new syntax. For example, given the following table and data:
CREATE TABLE #c (
PROD nvarchar(10),
YR bigint,
SALES float)
INSERT INTO #c VALUES ('ABC',2005,10000)
INSERT INTO #c VALUES ('ABC',2006,11000)
INSERT INTO #c VALUES ('ABC',2007,12000)
INSERT INTO #c VALUES ('ABC',2008,13000)
INSERT INTO #c VALUES ('ABC',2009,14000)
INSERT INTO #c VALUES ('DEF',2005,10000)
INSERT INTO #c VALUES ('DEF',2006,12000)
INSERT INTO #c VALUES ('DEF',2007,14000)
INSERT INTO #c VALUES ('DEF',2008,16000)
INSERT INTO #c VALUES ('DEF',2009,18000)
INSERT INTO #c VALUES ('GHI',2005,10000)
INSERT INTO #c VALUES ('GHI',2006,12000)
INSERT INTO #c VALUES ('GHI',2007,14000)
INSERT INTO #c VALUES ('GHI',2008,12000)
INSERT INTO #c VALUES ('GHI',2009,10000)
in SQL Server 2005, we would enter the following SQL to calculate the SLOPE:
SELECT PROD,
wct.SLOPE('#c','sales','yr','PROD',PROD)
FROM #C
GROUP BY PROD
This produces the following result:
PROD       SLOPE
---------- ----------------------
ABC        1000
DEF        2000
GHI        0

(3 row(s) affected)
In SQL Server 2008 the SQL looks like this:
SELECT PROD,
wct.SLOPE(SALES,YR) as SLOPE
FROM #C
GROUP BY PROD
which produces exactly the same result.
PROD       SLOPE
---------- ----------------------
ABC        1000
DEF        2000
GHI        0

(3 row(s) affected)
A similar change occurs in the FORECAST function where this:
SELECT PROD,
wct.FORECAST(2010, '#c','sales','yr','PROD',PROD) as [2010 Sales]
FROM #C
GROUP BY PROD
producing the following result:
PROD       2010 Sales
---------- ----------------------
ABC        15000
DEF        20000
GHI        11600

(3 row(s) affected)
becomes this:
SELECT PROD,
wct.FORECAST(2010,SALES,YR) as [2010 Sales]
FROM #C
GROUP BY PROD
returning the same result:
PROD       2010 Sales
---------- ----------------------
ABC        15000
DEF        20000
GHI        11600

(3 row(s) affected)
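Since FORECAST is just the linear-regression line evaluated at the new x-value, the 2008-style result can be cross-checked using the SLOPE and INTERCEPT aggregates themselves. This is a sketch, assuming INTERCEPT takes the same (known_y, known_x) argument order as SLOPE:

```sql
-- FORECAST(x) = INTERCEPT + SLOPE * x
-- For ABC: intercept = 12000 - 1000*2007 = -1995000, so
-- -1995000 + 1000*2010 = 15000, matching the FORECAST result above
SELECT PROD,
       wct.INTERCEPT(SALES,YR) + wct.SLOPE(SALES,YR) * 2010 as [2010 Sales]
FROM #c
GROUP BY PROD
```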
Prior to making this change, XLeratorDB had implemented a type of scalar function that we called 'range queries'. A range query allows a function to take many rows of multi-column input data and return a single result. In SQL Server 2005, aggregate functions could accept only a single input; range queries overcame that limitation. Implemented as scalar functions, however, they could not take advantage of the aggregate-function architecture. Many of these functions collect multiple rows and columns of data, along with scalar values, to perform complex financial, statistical, or mathematical calculations.
In SQL Server 2008, it became possible to reimplement some of these scalar range-query functions as multi-input aggregate functions. One of our clients was running the SLOPE function against a database containing 430,000 rows; his query ran in under twelve seconds, and he asked whether we could get the time down to around a second. To do that, we took a hard look at the new multi-input aggregate function feature in SQL Server 2008.
We realized almost immediately that functions like these required a 'one-pass' solution. The existing XLeratorDB implementations were 'two-pass' solutions, matching the precision of the comparable Excel functions, and initially we were worried that the potential loss of significance in a one-pass solution might be a problem. We therefore decided to make both options available, so there are one-pass and two-pass solutions in XLeratorDB/statistics 2008. The two-pass implementation, however, is always slower.
The one-pass implementation lends itself very nicely to the aggregate-function architecture: we do not need to keep track of the order in which data arrive, and only a simple calculation is performed at the end of processing.
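For illustration only (this is not necessarily how XLeratorDB implements it), a one-pass slope can be built from running sums using nothing but the built-in aggregates, against the #c table from the example above:

```sql
-- One-pass slope: accumulate n, SUM(x), SUM(y), SUM(xy), SUM(x*x)
-- in a single scan, then apply
--   slope = (n*SUM(xy) - SUM(x)*SUM(y)) / (n*SUM(x*x) - SUM(x)*SUM(x))
-- Note this is exactly the form that can lose significance when the
-- x-values are large with little variation (see the discussion below).
SELECT PROD,
       (COUNT(*) * SUM(YR * SALES) - SUM(YR) * SUM(SALES))
     / (COUNT(*) * SUM(YR * YR)    - SUM(YR) * SUM(YR)) as SLOPE
FROM #c
GROUP BY PROD
```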
Is the one-pass calculation vs. the two-pass calculation really much of an issue? On a practical level, we do not think so. Here is an example that we got from the Excel knowledge base (which you can find here).
CREATE TABLE #n(
y float,
x float
)
INSERT INTO #n VALUES (1, 1000000003)
INSERT INTO #n VALUES (2, 1000000004)
INSERT INTO #n VALUES (3, 1000000002)
INSERT INTO #n VALUES (4, 1000000005)
INSERT INTO #n VALUES (5, 1000000004)
INSERT INTO #n VALUES (6, 1000000007)
SELECT wct.SLOPE(y,x) as [OnePass]
,wct.SLOPE_q('SELECT y, x from #n') as [TwoPass]
from #n
In this example we are using the one-pass multi-input aggregate SLOPE function and the two-pass range-query (scalar) SLOPE_q function and displaying the results side by side. This produces the following result:
OnePass                TwoPass
---------------------- ----------------------
NULL                   0.775280898876404

(1 row(s) affected)
The one-pass solution returns a NULL while the two-pass solution returns the same value as Excel. As the Excel documentation points out, this occurs when the data contain a large number of significant digits combined with relatively little variation between data values. If your data meet these criteria, you can simply use the SLOPE_q function or, if you prefer, use the SLOPE function and adjust the x-values by the mean of x:
SELECT wct.SLOPE(y, x - c.mean) as SLOPE
FROM (
      SELECT AVG(x) as mean
      FROM #n
      ) c, #n
This produces the following result.
SLOPE
----------------------
0.775280898876405

(1 row(s) affected)
We picked these 14 functions because they lent themselves quite nicely to this type of implementation. Other range-query functions, AVEDEV for example, did not lend themselves to one-pass solutions.
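AVEDEV illustrates why: the mean must be known before the absolute deviations can be accumulated, so two scans of the data are unavoidable. A sketch against the #n table above:

```sql
-- AVEDEV = AVG(|x - mean|); the mean is computed in a first pass
-- (the subquery) before the deviations can be summed in a second
SELECT AVG(ABS(n.x - m.mean)) as AVEDEV
FROM #n n
CROSS JOIN (SELECT AVG(x) as mean FROM #n) m
```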
The performance improvement in our environment was spectacular. We compared the scalar functions to the aggregate functions on a Dell Inspiron 5305 with an Intel Pentium Dual CPU E2160 @ 1.80GHz and 4 GB of RAM, running Windows Vista. The following table summarizes the performance improvements:
Aggregate     Scalar         Performance Multiple
---------     ------         --------------------
CORREL        CORREL_q        9.6
COVAR         COVAR_q        25.3
DEVSQ         DEVSQ_q        15.6
FORECAST      FORECAST_q     19.2
GROWTH        GROWTH_q       14.9
INTERCEPT     INTERCEPT_q    23.4
KURT          KURT_q         14.9
PEARSON       PEARSON_q      11.3
RSQ           RSQ_q          11.8
SKEW          SKEW_q         15.6
SLOPE         SLOPE_q        24.3
STEYX         STEYX_q        20.2
TREND         TREND_q        19.5
ZTEST         ZTEST_q        11

Our tests were run against datasets containing anywhere from 1.5 million to 5.5 million rows, depending on the function, with each function producing 100,000 rows in the resultant table. The following table summarizes the average number of rows processed per second for each function:
Aggregate     Rows per second    Scalar         Rows per second
---------     ---------------    ------         ---------------
CORREL                570,002    CORREL_q                59,540
COVAR                 599,867    COVAR_q                 23,685
DEVSQ                 761,705    DEVSQ_q                 48,685
FORECAST              376,916    FORECAST_q              19,615
GROWTH                264,807    GROWTH_q                17,717
INTERCEPT             475,366    INTERCEPT_q             20,342
KURT                  709,185    KURT_q                  47,577
PEARSON               514,062    PEARSON_q               45,420
RSQ                   541,770    RSQ_q                   45,890
SKEW                  723,648    SKEW_q                  46,417
SLOPE                 488,503    SLOPE_q                 20,140
STEYX               1,439,992    STEYX_q                 71,226
TREND                 391,245    TREND_q                 20,019
ZTEST                 514,829    ZTEST_q                 46,937

We believe that, for the first time, you will be able to perform these types of calculations on large datasets directly in T-SQL.
Here is one example: if you want to calculate next year's sales for every company listed on the NYSE, NASDAQ, FTSE, or any other exchange you can think of (assuming that you have the historical financial data) using the FORECAST function, now you can.
In about a second.