XLeratorDB/statistics is now available
Mar
2
Written by:
Charles Flock
3/2/2009 2:16 PM
Some notes on the release of the statistics package and how it works in relation to EXCEL and in relation to existing SQL Server functions with the same or similar names.
We are pleased to announce the availability of the XLeratorDB/statistics module. It is our most robust module in terms of the number of functions and the complexity of those functions and includes all the familiar functions that you would find in the Excel Statistical Functions reference.
It does not include every function, though we think that it includes every important function. For example, we have included the COUNT function even though SQL Server has a COUNT function, because the COUNT function in EXCEL and the COUNT function in SQL Server do not behave the same way. And, in our opinion, it was very hard to get the SQL Server COUNT function to behave like the EXCEL COUNT function. However, we have not included the COUNTA, COUNTBLANK, COUNTIF, and COUNTIFS functions, as it seems quite straightforward to duplicate the EXCEL functionality using the appropriate WHERE clause. Of course, if you think otherwise, post a comment on the forum and we would be happy to look at adding those functions.
The same is true for AVERAGE which is similar to the SQL Server AVG function. But, as with COUNT, the behavior is not quite the same and we felt that it was important to have an AVERAGE function that behaves in exactly the same way as EXCEL. But we have not included AVERAGEA, AVERAGEIF, or AVERAGEIFS.
In implementing many of these functions we had to make a choice between implementing them as aggregate functions and scalar functions. Aggregate functions require you to use a GROUP BY clause in your SQL if you have selected more than just the aggregate function. Aggregates can also produce some unexpected results if columns from two different tables are not properly joined. We felt that you simply had to know a lot more about T-SQL to use aggregate functions and that this makes the transition from a spreadsheet to a data base that much harder. So, there are no aggregate functions in the statistics package, even though analogous functions in SQL Server are implemented as aggregates. This allows you to use aggregated results in your simple T-SQL statements without having to worry about how aggregates behave; we handle all of that.
We have split the documentation into 2 sections; probability functions and statistical functions. All of the statistical functions will calculate results based on receiving a variable number of rows as input. All of the probability functions manipulate a value in some way to return another value. For example, the CHITEST function is classified as a statistical function since it may receive multiple rows of input for the observed results. The CHIDIST and CHIINV functions, however, are classified as probability functions since they take the supplied values and return a value. While you can call CHIDIST and CHIINV for many rows in a T-SQL statement, they will return a value for each row, whereas CHITEST will not. CHITEST will return a result grouped by whatever value (or values) you have supplied.
In our testing, for example, we have millions of test values for CHIDIST and CHIINV and we get millions of rows returned. For CHITEST, however, we have millions of rows and we get substantially fewer returned, since the CHITEST result is calculated from many rows in the resultant table. Thus, we classify CHITEST as a statistical function, but CHIINV and CHIDIST as probability functions.
For the most part, we have been faithful to the EXCEL paradigm so that if you are familiar with EXCEL using these functions in a SQL Server data base will be very straightforward. However, we changed the CHITEST function to automatically calculate the expected results as we felt that this made the function much easier to use. We also wonder why EXCEL doesn’t do that as there were several articles in the knowledge base about how to calculate the expected values.
We also added some functions that we thought made sense. For example, in EXCEL there is no GAMMA function. There is a GAMMALN function and if you enter EXP(GAMMALN(n)) EXCEL will return G(n). But, why not just have a GAMMA function? So, we added one.
Similarly, there is a GAMMALN function but no FACTLN function. And while we realize that n! = G(n+1), we thought that it made sense to have a FACTLN function. We also thought that it made sense to add BETA (the beta function), BETAI (the incomplete beta function), BICO (the binomial coefficient), GAMMAINC (the incomplete gamma function), GAMMAP and GAMMAQ (the regularized incomplete gamma functions), and INVGAMMAP (the inverse of the regularized incomplete gamma function).
A big part of the release process is the testing. We tested in SQL Server 2005 and 2008, running XP, Vista, 32-bit and 64-bit. We have results for over 3.2 billion tests. We calculate expected values for all the tests by using the worksheetfunction object and we store all the results in the data base. We then construct T-SQL statements to compare the results of XLeratorDB/statistics calculations to the expected values. We do this until we get the differences between the expected results and actual results as close to zero as possible.
For the statistical functions, we agree with most functions to 14 decimal places and we agree with EXCEL out to 7 decimal places for every function except MODE. In EXCEL, MODE is dependent on the order of data when there are ties. We have made a change to the function in SQL Server that allows you to specify whether you want the first tie, the last tie, or NULL returned. Of course, if there are no ties, MODE behaves exactly the same as EXCEL.
In terms of the probability functions, we were only able to match EXCEL out to 6 decimal places (and, in some cases with extreme values, 5 decimal places). Many of the probability functions are based on the calculation of GAMMA. Since EXCEL has no explicit GAMMA calculation and since there is no documentation on the precise implementation of GAMMALN, we cannot really make any comments on the accuracy of GAMMA within EXCEL. We implemented GAMMA using a Lanczos approximation and feel comfortable that our calculations are reliable to a minimum of 8 decimal places and have compared the results to Mathemetica and the Keisan calculation library by Casio.
We think that the statistical functions in this package make SQL Server a much more powerful analytical tool. If you have large samples or large datasets, we feel that you will find the combination of XLeratorDB and SQL Server to be more robust and easier to use then a spreadsheet or even a spreadsheet in combination with a data base. We hope that you find it a satisfying experience. If you have any comments, please post them on the forum , and if you have any questions or problems you can send us an e-mail at support@westclintech.com or call us at 914 231 7137.