In applied statistics there are times when you want to take your sample data and estimate the parameters of the distribution that best describes it. Without going too deeply into the math we look at how to fit a gamma distribution to the data. Specifically, we are trying to find the alpha and beta parameters to pass to the GAMMADIST function to calculate the probability density function. Doing this on the database is interesting because 1) that’s where our data is and 2) it makes it easy to include the calculation in work flows where, for example, we might want to do predictive analytics. Plus, it’s interesting and pretty easy to do.

The focus of the fitting function is on the alpha (α) parameter as the beta (β) parameter can be calculated directly from α.

The initial estimate for α is

The update step is

where ψ is the digamma function and ψ′ is the trigamma function.

Let’s look at our SQL Server implementation.

First, let’s declare some variables

DECLARE @tol as float = 0.00000001490116119384765625;

DECLARE @xbar as float;

DECLARE @logxbar as float;

DECLARE @meanlogx as float;

SELECT

@xbar = AVG(x)

,@logxbar = LOG(AVG(x))

,@meanlogx = AVG(LOG(x))

FROM

(SELECT x as x FROM wct.RANDGAMMA(1000,5,3))n;

Use a recursive CTE to calculate the alpha parameter of the gamma distribution. In this CTE we want to keep track of the number of iterations, the value of alpha and an error term. The number of iterations starts at 0 and terminates at 100. The initial value of alpha is as described above. The error term is simply the absolute value of the change in alpha |α_{(n+1)}-α_{n}|. We will initialize the error term to 1 and then iterate until the error term is less than the tolerance (defined above) or we have reached the maximum number of iterations. The anchor part of our CTE looks like this:

SELECT

0 as iter

,0.5 / (@logxbar - @meanlogx) as alpha

,1e+00 as err

The recursive part of the CTE does three things. First, it increments the iter column by 1. Second, the alpha column incorporates the update equation described above using the XLeratorDB DIGAMMA and TRIGAMMA functions. Third, the err column is populated as the absolute value of the difference between the current alpha value and the alpha value from the previous iteration.

SELECT

n.iter + 1

,n.alpha1

,ABS(n.alpha1 - n.alpha0) as err

FROM (

SELECT iter, alpha, 1/(1/alpha + (@meanlogx - @logxbar + LOG(alpha) - wct.digamma(alpha))/(POWER(alpha,2)*(1/alpha-wct.trigamma(alpha))))

FROM x

WHERE err > @tol

with x as (

SELECT

0 as iter

,0.5 / (@logxbar - @meanlogx) as alpha

,1e+00 as err

UNION ALL

SELECT

n.iter + 1

,n.alpha1

,ABS(n.alpha1 - n.alpha0) as err

FROM (

SELECT iter, alpha, 1/(1/alpha + (@meanlogx - @logxbar + LOG(alpha) - wct.digamma(alpha))/(POWER(alpha,2)*(1/alpha-wct.trigamma(alpha))))

FROM x

WHERE err > @tol

)n(iter, alpha0, alpha1)

)

SELECT

*

FROM

x

ORDER BY

iter ASC

OPTION

(MAXRECURSION 100);

This produces the following result.

You can see that the root-finding algorithm got to within tolerance very quickly. In the next SELECT statement, we generate a new sample of data and we return just the final estimates of alpha and beta and the number of iterations to arrive at a solution.

SELECT

@xbar = AVG(x)

,@logxbar = LOG(AVG(x))

,@meanlogx = AVG(LOG(x))

FROM

(SELECT x as x FROM wct.RANDGAMMA(1000,7,0.5))n;

with x as (

SELECT

0 as iter

,0.5 / (@logxbar - @meanlogx) as alpha

,1e+00 as err

UNION ALL

SELECT

n.iter + 1

,n.alpha1

,ABS(n.alpha1 - n.alpha0) as err

FROM (

SELECT iter, alpha, 1/(1/alpha + (@meanlogx - @logxbar + LOG(alpha) - wct.digamma(alpha))/(POWER(alpha,2)*(1/alpha-wct.trigamma(alpha))))

FROM x

WHERE err > @tol

)n(iter, alpha0, alpha1)

)

SELECT TOP 1

alpha

,@xbar / alpha as beta

,iter

FROM

x

ORDER BY

iter DESC

OPTION

(MAXRECURSION 100);

This produces the following result.

As you can see, the fitting algorithm came with an alpha parameter of 7.35 and a beta parameter 0.48 compared to values of 7.0 and 0.5 which were used to generate the sample data. Obviously, the larger the sample size, the better the fit will be.

In our testing the fit always happens in less than 5 iterations and in no measurable clock time. Your results will depend on your configuration, but this really isn’t a very expensive calculation at all.

If you are interested in doing other sophisticated calculations like this in SQL Server using simple SQL statements download the free 15-day trial. If there are other calculations you would like to see or you have questions about XLeratorDB, please send an email to support@westclintech.com.

Archive

Monthly

Go

| |||||||||

Sun | Mon | Tue | Wed | Thu | Fri | Sat | |||
---|---|---|---|---|---|---|---|---|---|

24 | 25 | 26 | 27 | 28 | 29 | 30 | |||

1 | 2 | 3 | 4 | 5 | 6 | 7 | |||

8 | 9 | 10 | 11 | 12 | 13 | 14 | |||

15 | 16 | 17 | 18 | 19 | 20 | 21 | |||

22 | 23 | 24 | 25 | 26 | 27 | 28 | |||

29 | 30 | 31 | 1 | 2 | 3 | 4 |

Go