Login     Register

        Contact Us     Search

Logistic Regression in SQL Server

Mar 4

Written by: Charles Flock
3/4/2015 10:08 AM  RssIcon

We have added 5 new functions in the latest release of XLeratorDB / statistics 2008. Four of these functions, LOGIT, LOGITSUM, LOGITPRED, and LOGITPROB are SQL Server implementations of the logistic regression, often referred to as the logit regression. The remaining function, VIF, estimates the variance inflation factor, tolerance, and R2 for a set of independent variables allowing you to test for collinearity.
Given a set of independent variables X and dichotomous outcomes {0, 1} Y, you can use the logistic regression to calculate the probability that Y = 1.
Let's look at an example of what this means from Applied Logistic Regression, Third Edition by David W. Hosmer, Jr., Stanley and Rodney X. Sturdivant.
Their CHD data set provides 100 data points consisting of a patient's age and the absence {0} or presence {1} of coronary heart disease (this full example is explored in the LOGIT documentation on our website). A logistic regression of this data attempts to answer a question about the probability of coronary heart disease being present in a patient based on their age, since age is the only independent variable in our data.
The following table provides a summary of the CHD data.

 
 
Coronary Heart Disease
Age Group
n
Absent
Present
20 – 29
10
9
1
30 – 34
15
13
2
35 – 39
12
9
3
40 – 44
15
10
5
45 – 49
13
7
6
50 – 54
8
3
5
55 – 59
17
4
13
60 – 69
10
2
8
Total
100
57
43

Given that there is only a single independent variable, the table above would lead us to the conclusion that the probability that coronary heart disease is present in someone between the ages of 60 and 60 is approximately 80% while the probability that coronary heart disease is present in someone between the ages of 20 and 29 is approximately 10%.
Let's take the CHD dataset and put it into a temp table, #chd, and then use the LOGIT function to calculate the coefficients.
SELECT
   *
INTO
   #chd
FROM (VALUES
    (20,0),(23,0),(24,0),(25,0),(25,1),(26,0),(26,0),(28,0),(28,0),(29,0)
    ,(30,0),(30,0),(30,0),(30,0),(30,0),(30,1),(32,0),(32,0),(33,0),(33,0)
    ,(34,0),(34,0),(34,1),(34,0),(34,0),(35,0),(35,0),(36,0),(36,1),(36,0)
    ,(37,0),(37,1),(37,0),(38,0),(38,0),(39,0),(39,1),(40,0),(40,1),(41,0)
    ,(41,0),(42,0),(42,0),(42,0),(42,1),(43,0),(43,0),(43,1),(44,0),(44,0)
    ,(44,1),(44,1),(45,0),(45,1),(46,0),(46,1),(47,0),(47,0),(47,1),(48,0)
    ,(48,1),(48,1),(49,0),(49,0),(49,1),(50,0),(50,1),(51,0),(52,0),(52,1)
    ,(53,1),(53,1),(54,1),(55,0),(55,1),(55,1),(56,1),(56,1),(56,1),(57,0)
    ,(57,0),(57,1),(57,1),(57,1),(57,1),(58,0),(58,1),(58,1),(59,1),(59,1)
    ,(60,0),(60,1),(61,1),(62,1),(62,1),(63,1),(64,0),(64,1),(65,1),(69,1)
   )n(age,chd)
 
SELECT
   *
FROM
   wct.LOGIT(
   'SELECT
        age
       ,chd
    FROM
       #chd'
   ,2
   )
This returns the following values, but we are really only concerned with the first two.
stat_name          idx               stat_val
---------- ----------- ----------------------
b                    0      -5.30945337391905
b                    1        0.1109211422069
se                   0       1.13365463681529
se                   1     0.0240598358749988
z                    0      -4.68348401840844
z                    1       4.61022023521619
pval                 0   2.82039453738119E-06
pval                 1   4.02242615828777E-06
Wald                 0       21.9350225506872
Wald                 1       21.2541306171968
LL0               NULL      -68.3314913574166
LLM               NULL      -53.6765463471564
chisq             NULL       29.3098900205205
df                NULL                      1
p_chisq           NULL   6.16800830270808E-08
AIC               NULL       111.353092694313
BIC               NULL       116.563433066289
Nobs              NULL                    100
rsql              NULL      0.214468391061534
rsqcs             NULL      0.254051637397186
rsqn              NULL      0.340992792718107
D                 NULL     -0.482789163126702
Iterations        NULL                      5
Converged         NULL                      1
This tells us that based on age we would estimate the probability that coronary heart disease is present in a 24 year-old as
SELECT
   1/(1 + EXP(-(-5.30945 + 0.11092 * 24)))
or approximately 6.6%.
We have also implemented a LOGITSUM function so that you can analyze grouped data. Instead of passing in a column containing the dichotomous results for each observation you can pass in counts of success and failures for each grouping of the data. For example, let's say our coronary heart disease data were presented to us in the following way:
              mean age         chd           N
---------------------- ----------- -----------
                  25.4           1          10
                    32           2          15
               36.9167           3          12
               42.3333           5          15
               47.2308           6          13
                51.875           5           8
               56.8824          13          17
                    63           8          10
 
We could use the LOGITSUM function.
SELECT
   *
INTO
   #chdsum
FROM (VALUES
    (25.4,1,10)
   ,(32,2,15)
   ,(36.9167,3,12)
   ,(42.3333,5,15)
   ,(47.2308,6,13)
   ,(51.875,5,8)
   ,(56.8824,13,17)
   ,(63,8,10)
   )p([mean age],chd,N)
 
SELECT
   *
FROM
   wct.LOGITSUM('
       SELECT
          [mean age],
          chd,
          N-chd
       FROM
          #chdsum'
       ,2
       ,3
   )
This produces the following result.
stat_name          idx               stat_val
---------- ----------- ----------------------
b                    0      -5.20385902245454
b                    1      0.108651473280877
se                   0       1.11798776031707
se                   1     0.0237329519129878
z                    0      -4.65466546876925
z                    1       4.57808508942444
pval                 0   3.24507143474021E-06
pval                 1   4.69251822153992E-06
Wald                 0       21.6659106261529
Wald                 1       20.9588630860104
LL0               NULL      -68.3314913574166
LLM               NULL      -54.1790842644491
chisq             NULL       28.3048141859351
df                NULL                      1
p_chisq           NULL   1.03637918925054E-07
AIC               NULL       112.358168528898
BIC               NULL       117.568508900874
Nobs              NULL                    100
rsql              NULL      0.207113979394092
rsqcs             NULL      0.246516487110749
rsqn              NULL      0.330878975046836
D                 NULL     -0.464151599148838
Iterations        NULL                      5
Converged         NULL                      1
 
Now, when we calculate the probability of the presence of coronary heart disease in a 24 year-old, the formula would be:
SELECT
   1/(1 + EXP(-(-5.20386 + 0.10865 * 24)))
or approximately 6.9%.
You can use the LOGITPROB or the LOGITPRED functions to calculate the probability that Y = 1. For example,
SELECT
   wct.LOGITPRED('-5.20386, 0.10865','24')
produces the same result as the previous SELECT statement. When there are many independent variables and/or many predictions to be made, the LOGITPRED and LOGITPROB functions are simpler to use than explicitly embedding the equation in a SELECT statement.
The VIF function is used to detect collinearity and should be used in conjunction with not only the LOGIT and LOGITSUM functions but also with LINEST, LINEST_q, LOGEST, and LOGEST_q. VIF works by performing ordinary least squares regression on all the combinations of independent variables and calculating the R2 for each combination. From the R2 value, it then calculates the tolerance and the variance inflation factors. Based upon the results of this table-valued function you can determine whether or not to eliminate any of the independent variables from the regression analysis.

You can read more about these function and find more examples in our on-line documentation. You can also try out these functions today by downloading the free 15-day trial and installing it on your SQL Server database. We think that these functions are a great tool to have inside SQL Server as it permits you to do the analysis right where the data is, without having to drag the data across the network, and let's you use simple SQL commands instead of having to learn how to use another tool. We hope that you like it and, as always, if there is something that you would like to see in XLeratorDB, just send us an e-mail at support@westclintech.com.

 

See Also
·         LINEST - Linear regression
·         LOGIT - Logit regression

 

Tags:
Categories:
Copyright 2008-2017 Westclintech LLC         Privacy Policy        Terms of Service