Login     Register

        Contact Us     Search

Introducing XLeratorDB/statistics for SQL Server 2008

Oct 8

Written by: Charles Flock
10/8/2010 8:20 AM  RssIcon

We have changed 14 scalar functions to take advantage of the multi-input aggregate capabilities of SQL Server 2008 and improved performance more than twenty-five-fold.
We are happy to announce that we have re-implemented the following functions as multi-input aggregate functions in SQL Server 2008:
·         CORREL
·         COVAR
·         DEVSQ
·         FORECAST
·         GROWTH
·         INTERCEPT
·         KURT
·         PEARSON
·         RSQ
·         SKEW
·         SLOPE
·         STEYX
·         TREND
·         ZTEST
We found that after implementing these functions as aggregates, they performed up to twenty-five times faster than their scalar equivalents in our test harness.
As aggregate functions, they require a new syntax. For example, given the following table and data:
CREATE TABLE #c (
      PROD nvarchar(10),
      YR          bigint,
      SALES float)
 
INSERT INTO #c VALUES ('ABC',2005,10000)
INSERT INTO #c VALUES ('ABC',2006,11000)
INSERT INTO #c VALUES ('ABC',2007,12000)
INSERT INTO #c VALUES ('ABC',2008,13000)
INSERT INTO #c VALUES ('ABC',2009,14000)
INSERT INTO #c VALUES ('DEF',2005,10000)
INSERT INTO #c VALUES ('DEF',2006,12000)
INSERT INTO #c VALUES ('DEF',2007,14000)
INSERT INTO #c VALUES ('DEF',2008,16000)
INSERT INTO #c VALUES ('DEF',2009,18000)
INSERT INTO #c VALUES ('GHI',2005,10000)
INSERT INTO #c VALUES ('GHI',2006,12000)
INSERT INTO #c VALUES ('GHI',2007,14000)
INSERT INTO #c VALUES ('GHI',2008,12000)
INSERT INTO #c VALUES ('GHI',2009,10000)

in SQL Server 2005, we would enter the following SQL to calculate the SLOPE:
SELECT PROD,
wct.SLOPE('#c','sales','yr','PROD',PROD)
FROM #C
GROUP BY PROD
This produdes the following result.
PROD       SLOPE
---------- ----------------------
ABC        1000
DEF        2000
GHI        0
 
(3 row(s) affected)

In SQL Server 2008 the SQL looks like this:
SELECT PROD,
wct.SLOPE(SALES,YR) as SLOPE
FROM #C
GROUP BY PROD
which produces exactly the same result.
PROD       SLOPE
---------- ----------------------
ABC        1000
DEF        2000
GHI        0
 
(3 row(s) affected)

A similar change occurs in the FORECAST function where this:
SELECT PROD,
wct.FORECAST(2010, '#c','sales','yr','PROD',PROD) as [2010 Sales]
FROM #C
GROUP BY PROD
producing the following result
PROD       2010 Sales
---------- ----------------------
ABC        15000
DEF        20000
GHI        11600
 
(3 row(s) affected)
becomes this
SELECT PROD,
wct.FORECAST(2010,SALES,YR) as [2010 Sales]
FROM #C
GROUP BY PROD
 
returning the same result.
PROD       2010 Sales
---------- ----------------------
ABC        15000
DEF        20000
GHI        11600
 
(3 row(s) affected)

Prior to making this change, XLeratorDB had implemented a type of scalar function that we called ‘range queries’. The range query allows many rows of multi-input input data for a function to return a single result. In SQL Server 2005, aggregate functions could only have a single input. Range queries overcame that limitation. Implemented as scalar functions, they could not take advantage of the aggregate function architecture. Many of these functions would collect multiple rows and columns of data as well as scalar values to perform complex financial, statistical, or mathematical calculations.
In SQL Server 2008, it became possible to re-implement some of these scalar range-query functions as multi-input aggregate functions. One of our clients was using the SLOPE function on a database containing 430,000 rows and his query ran in under twelve seconds. He asked to see if we could get the time down to around a second. In order to do this, we took a hard look at the new multi-input aggregate function feature in SQL Server 2008.
We realized almost immediately that functions like these required a ‘one-pass’ solution. The existing XLeratorDB implementations were ‘two-pass’ solutions, matching the precision of the comparable EXCEL functions and, initially, we were worried that the potential loss of significance in a one-pass solution might be a problem. This led us to decide to make both options available, so there are one-pass and two-pass solutions available in XLeratorDB/statistics 2008. However, the two-pass implementation is always slower.
The one-pass implementation lends itself very nicely to the features of the aggregate function. We did not need to keep track of the order of the data received into the function and there was a simple calculation performed at the end of the processing.
Is the one-pass calculation vs. two-pass calculation really much of an issue? On a practical level, I do not think so. This is an example that we got from the EXCEL knowledge base (which you can find here).

 CREATE TABLE #n(
      y     float,
      x     float
      )
 
INSERT INTO #n VALUES (1, 1000000003)
INSERT INTO #n VALUES (2, 1000000004)
INSERT INTO #n VALUES (3, 1000000002)
INSERT INTO #n VALUES (4, 1000000005)
INSERT INTO #n VALUES (5, 1000000004)
INSERT INTO #n VALUES (6, 1000000007)
 
SELECT wct.SLOPE(y,x) as [One-Pass]
,wct.SLOPE_q('SELECT y, x from #n') as [Two-Pass]
from #n

In this example we are using the one-pass multi-input aggregate SLOPE function and the two-pass range query (scalar) SLOPE_q function and displaying the results side-by-side. This produces the following result.
One-Pass               Two-Pass
---------------------- ----------------------
NULL                   0.775280898876404
 
(1 row(s) affected)

The one-pass solution returns a NULL and the two-pass solution returns the same value as EXCEL. As the EXCEL documentation points out, this occurs when the data contains a high number of significant digits combined with relatively little variation between data values. If your data meet these criteria, you can simply use the SLOPE_q function, or, if you prefer, use the SLOPE function and adjust the x-values by the mean of x:
SELECT wct.SLOPE(y,x-c.mean) as SLOPE
FROM (
      SELECT AVG(x) as mean
      from #n
      ) c, #n
 
This produces the following result.
SLOPE
----------------------
0.775280898876405
 
(1 row(s) affected)

We picked these 14 functions because they lent themselves quite nicely to this type of implementation. Other range query functions, like AVEDEV for example, did not lend themselves to one-pass solutions.
The improvement in performance in our environment was spectacular. We compared the results of the scalar functions to the aggregate functions on a Dell Inspiron 5305 with an Intel Pentium Dual CPU E2160 processor @ 1.80GHz with 4 GB of RAM running Windows Vista. The following table summarizes the performance improvements:
Aggregate
Scalar
Performance Multiple
CORREL
CORREL_q
9.6
COVAR
COVAR_q
25.3
DEVSQ
DEVSQ_q
15.6
FORECAST
FORECAST_q
19.2
GROWTH
GROWTH_q
14.9
INTERCEPT
INTERCEPT_q
23.4
KURT
KURT_q
14.9
PEARSON
PEARSON_q
11.3
RSQ
RSQ_q
11.8
SKEW
SKEW_q
15.6
SLOPE
SLOPE_q
24.3
STEYX
STEYX_q
20.2
TREND
TREND_q
19.5
ZTEST
ZTEST_q
11
 
Our tests were run against datasets containing anywhere from 1.5 million to 5.5 million rows, depending on the function, comprising 100,000 rows in the resultant table for each function. The following table summarizes the average number of rows processed each second for each function:

Aggregate
Rows per second
Scalar
Rows per second
CORREL
570,002
CORREL_q
59,540
COVAR
599,867
COVAR_q
23,685
DEVSQ
761,705
DEVSQ_q
48,685
FORECAST
376,916
FORECAST_q
19,615
GROWTH
264,807
GROWTH_q
17,717
INTERCEPT
475,366
INTERCEPT_q
20,342
KURT
709,185
KURT_q
47,577
PEARSON
514,062
PEARSON_q
45,420
RSQ
541,770
RSQ_q
45,890
SKEW
723,648
SKEW_q
46,417
SLOPE
488,503
SLOPE_q
20,140
STEYX
1,439,992
STEYX_q
71,226
TREND
391,245
TREND_q
20,019
ZTEST
514,829
ZTEST_q
46,937
 

We believe that for the first time, you will be able to perform these types of calculations on large datasets directly in T-SQL.

Here is one example: If you want to calculate the next year’s sales for every company listed in the NYSE, NASDAQ, FTSE and any other exchange you can think of (assuming that you have the historical financial data) using the FORECAST function, now you can.
In about a second.

Tags:
Categories:
Copyright 2008-2017 Westclintech LLC         Privacy Policy        Terms of Service