Logistic Regression in SQL Server
Mar
4
Written by:
Charles Flock
3/4/2015 10:08 AM
We have added 5 new functions in the latest release of XLeratorDB / statistics 2008. Four of these functions, LOGIT, LOGITSUM, LOGITPRED, and LOGITPROB are SQL Server implementations of the logistic regression, often referred to as the logit regression. The remaining function, VIF, estimates the variance inflation factor, tolerance, and R2 for a set of independent variables allowing you to test for collinearity.
Given a set of independent variables X and dichotomous outcomes {0, 1} Y, you can use the logistic regression to calculate the probability that Y = 1.
Their CHD data set provides 100 data points consisting of a patient's age and the absence {0} or presence {1} of coronary heart disease (this full example is explored in the LOGIT documentation on our website). A logistic regression of this data attempts to answer a question about the probability of coronary heart disease being present in a patient based on their age, since age is the only independent variable in our data.
The following table provides a summary of the CHD data.
|
|
Coronary Heart Disease
|
Age Group
|
n
|
Absent
|
Present
|
20 – 29
|
10
|
9
|
1
|
30 – 34
|
15
|
13
|
2
|
35 – 39
|
12
|
9
|
3
|
40 – 44
|
15
|
10
|
5
|
45 – 49
|
13
|
7
|
6
|
50 – 54
|
8
|
3
|
5
|
55 – 59
|
17
|
4
|
13
|
60 – 69
|
10
|
2
|
8
|
Total
|
100
|
57
|
43
|
Given that there is only a single independent variable, the table above would lead us to the conclusion that the probability that coronary heart disease is present in someone between the ages of 60 and 60 is approximately 80% while the probability that coronary heart disease is present in someone between the ages of 20 and 29 is approximately 10%.
Let's take the CHD dataset and put it into a temp table, #chd, and then use the LOGIT function to calculate the coefficients.
SELECT
*
INTO
#chd
FROM (VALUES
(20,0),(23,0),(24,0),(25,0),(25,1),(26,0),(26,0),(28,0),(28,0),(29,0)
,(30,0),(30,0),(30,0),(30,0),(30,0),(30,1),(32,0),(32,0),(33,0),(33,0)
,(34,0),(34,0),(34,1),(34,0),(34,0),(35,0),(35,0),(36,0),(36,1),(36,0)
,(37,0),(37,1),(37,0),(38,0),(38,0),(39,0),(39,1),(40,0),(40,1),(41,0)
,(41,0),(42,0),(42,0),(42,0),(42,1),(43,0),(43,0),(43,1),(44,0),(44,0)
,(44,1),(44,1),(45,0),(45,1),(46,0),(46,1),(47,0),(47,0),(47,1),(48,0)
,(48,1),(48,1),(49,0),(49,0),(49,1),(50,0),(50,1),(51,0),(52,0),(52,1)
,(53,1),(53,1),(54,1),(55,0),(55,1),(55,1),(56,1),(56,1),(56,1),(57,0)
,(57,0),(57,1),(57,1),(57,1),(57,1),(58,0),(58,1),(58,1),(59,1),(59,1)
,(60,0),(60,1),(61,1),(62,1),(62,1),(63,1),(64,0),(64,1),(65,1),(69,1)
)n(age,chd)
SELECT
*
FROM
wct.LOGIT(
'SELECT
age
,chd
FROM
#chd'
,2
)
This returns the following values, but we are really only concerned with the first two.
stat_name idx stat_val
---------- ----------- ----------------------
b 0 -5.30945337391905
b 1 0.1109211422069
se 0 1.13365463681529
se 1 0.0240598358749988
z 0 -4.68348401840844
z 1 4.61022023521619
pval 0 2.82039453738119E-06
pval 1 4.02242615828777E-06
Wald 0 21.9350225506872
Wald 1 21.2541306171968
LL0 NULL -68.3314913574166
LLM NULL -53.6765463471564
chisq NULL 29.3098900205205
df NULL 1
p_chisq NULL 6.16800830270808E-08
AIC NULL 111.353092694313
BIC NULL 116.563433066289
Nobs NULL 100
rsql NULL 0.214468391061534
rsqcs NULL 0.254051637397186
rsqn NULL 0.340992792718107
D NULL -0.482789163126702
Iterations NULL 5
Converged NULL 1
This tells us that based on age we would estimate the probability that coronary heart disease is present in a 24 year-old as
SELECT
1/(1 + EXP(-(-5.30945 + 0.11092 * 24)))
or approximately 6.6%.
We have also implemented a LOGITSUM function so that you can analyze grouped data. Instead of passing in a column containing the dichotomous results for each observation you can pass in counts of success and failures for each grouping of the data. For example, let's say our coronary heart disease data were presented to us in the following way:
mean age chd N
---------------------- ----------- -----------
25.4 1 10
32 2 15
36.9167 3 12
42.3333 5 15
47.2308 6 13
51.875 5 8
56.8824 13 17
63 8 10
We could use the LOGITSUM function.
SELECT
*
INTO
#chdsum
FROM (VALUES
(25.4,1,10)
,(32,2,15)
,(36.9167,3,12)
,(42.3333,5,15)
,(47.2308,6,13)
,(51.875,5,8)
,(56.8824,13,17)
,(63,8,10)
)p([mean age],chd,N)
SELECT
*
FROM
wct.LOGITSUM('
SELECT
[mean age],
chd,
N-chd
FROM
#chdsum'
,2
,3
)
This produces the following result.
stat_name idx stat_val
---------- ----------- ----------------------
b 0 -5.20385902245454
b 1 0.108651473280877
se 0 1.11798776031707
se 1 0.0237329519129878
z 0 -4.65466546876925
z 1 4.57808508942444
pval 0 3.24507143474021E-06
pval 1 4.69251822153992E-06
Wald 0 21.6659106261529
Wald 1 20.9588630860104
LL0 NULL -68.3314913574166
LLM NULL -54.1790842644491
chisq NULL 28.3048141859351
df NULL 1
p_chisq NULL 1.03637918925054E-07
AIC NULL 112.358168528898
BIC NULL 117.568508900874
Nobs NULL 100
rsql NULL 0.207113979394092
rsqcs NULL 0.246516487110749
rsqn NULL 0.330878975046836
D NULL -0.464151599148838
Iterations NULL 5
Converged NULL 1
Now, when we calculate the probability of the presence of coronary heart disease in a 24 year-old, the formula would be:
SELECT
1/(1 + EXP(-(-5.20386 + 0.10865 * 24)))
or approximately 6.9%.
You can use the LOGITPROB or the LOGITPRED functions to calculate the probability that Y = 1. For example,
SELECT
wct.LOGITPRED('-5.20386, 0.10865','24')
produces the same result as the previous SELECT statement. When there are many independent variables and/or many predictions to be made, the LOGITPRED and LOGITPROB functions are simpler to use than explicitly embedding the equation in a SELECT statement.
The VIF function is used to detect collinearity and should be used in conjunction with not only the LOGIT and LOGITSUM functions but also with LINEST, LINEST_q, LOGEST, and LOGEST_q. VIF works by performing ordinary least squares regression on all the combinations of independent variables and calculating the R2 for each combination. From the R2 value, it then calculates the tolerance and the variance inflation factors. Based upon the results of this table-valued function you can determine whether or not to eliminate any of the independent variables from the regression analysis.
You can read more about these function and find more examples in our on-line documentation. You can also try out these functions today by downloading the free 15-day trial and installing it on your SQL Server database. We think that these functions are a great tool to have inside SQL Server as it permits you to do the analysis right where the data is, without having to drag the data across the network, and let's you use simple SQL commands instead of having to learn how to use another tool. We hope that you like it and, as always, if there is something that you would like to see in XLeratorDB, just send us an e-mail at support@westclintech.com.
See Also