WestClinTech - SQL Server Functions - Blog - Logistic Regression in SQL Server

Home

XLeratorDB function packages for SQL Server
financial view documentation pricing
statistics view documentation pricing
math view documentation pricing
engineering view documentation pricing
strings view documentation pricing
financial-options view documentation pricing
windowing view documentation pricing

XLeratorDB Compilation packages for SQL Server
Suite incl: financial, statistics, math, engineering & strings pricing
Suite (Developer) requires SQL Server Developer Edition pricing
Suite (Subscription) One-year non-recurring license pricing

SuitePLUS incl: all Suite packages PLUS financial-options pricing
SuitePLUS (Developer) requires SQL Server Developer Edition, also incl: financial-options pricing
SuitePLUS (Subscription) One-year non-recurring license, also incl: financial-options pricing

XLeratorDLL function packages Microsoft .NET API Library
financial (DLL) view documentation pricing SQL Server not required

View All Product Pricing ...

Download Free 15 Day Trial ...

Documentation

Purchase

XLeratorDB function packages for SQL Server (2008 & later)
financial
statistics
math
engineering
strings
financial-options
windowing

XLeratorDB Compilation packages for SQL Server (2008 & later)
Suite
Suite (Developer)
Suite (Subscription)

SuitePLUS
SuitePLUS (Developer)
SuitePLUS (Subscription)

XLeratorDLL function packages Microsoft .NET API Library
financial (DLL)

Legacy XLeratorDB Packages for SQL Server 2005
financial for SQL Server 2005 only
statistics for SQL Server 2005 only
math for SQL Server 2005 only

Suite for SQL Server 2005 only
Suite (Developer) for SQL Server 2005 only
SuitePLUS for SQL Server 2005 only
SuitePLUS (Developer) for SQL Server 2005 only

Download Trial
Case Studies
Blog
Support

Logistic Regression in SQL Server

Mar 4

Written by: Charles Flock
3/4/2015 10:08 AM

We have added 5 new functions in the latest release of XLeratorDB / statistics 2008. Four of these functions, LOGIT, LOGITSUM, LOGITPRED, and LOGITPROB are SQL Server implementations of the logistic regression, often referred to as the logit regression. The remaining function, VIF, estimates the variance inflation factor, tolerance, and R2 for a set of independent variables allowing you to test for collinearity.

Given a set of independent variables X and dichotomous outcomes {0, 1} Y, you can use the logistic regression to calculate the probability that Y = 1.

Let's look at an example of what this means from Applied Logistic Regression, Third Edition by David W. Hosmer, Jr., Stanley and Rodney X. Sturdivant.

Their CHD data set provides 100 data points consisting of a patient's age and the absence {0} or presence {1} of coronary heart disease (this full example is explored in the LOGIT documentation on our website). A logistic regression of this data attempts to answer a question about the probability of coronary heart disease being present in a patient based on their age, since age is the only independent variable in our data.

The following table provides a summary of the CHD data.

		Coronary Heart Disease
Age Group	n	Absent	Present
20 – 29	10	9	1
30 – 34	15	13	2
35 – 39	12	9	3
40 – 44	15	10	5
45 – 49	13	7	6
50 – 54	8	3	5
55 – 59	17	4	13
60 – 69	10	2	8
Total	100	57	43

Given that there is only a single independent variable, the table above would lead us to the conclusion that the probability that coronary heart disease is present in someone between the ages of 60 and 60 is approximately 80% while the probability that coronary heart disease is present in someone between the ages of 20 and 29 is approximately 10%.

Let's take the CHD dataset and put it into a temp table, #chd, and then use the LOGIT function to calculate the coefficients.

SELECT

INTO

#chd

FROM (VALUES

(20,0),(23,0),(24,0),(25,0),(25,1),(26,0),(26,0),(28,0),(28,0),(29,0)

,(30,0),(30,0),(30,0),(30,0),(30,0),(30,1),(32,0),(32,0),(33,0),(33,0)

,(34,0),(34,0),(34,1),(34,0),(34,0),(35,0),(35,0),(36,0),(36,1),(36,0)

,(37,0),(37,1),(37,0),(38,0),(38,0),(39,0),(39,1),(40,0),(40,1),(41,0)

,(41,0),(42,0),(42,0),(42,0),(42,1),(43,0),(43,0),(43,1),(44,0),(44,0)

,(44,1),(44,1),(45,0),(45,1),(46,0),(46,1),(47,0),(47,0),(47,1),(48,0)

,(48,1),(48,1),(49,0),(49,0),(49,1),(50,0),(50,1),(51,0),(52,0),(52,1)

,(53,1),(53,1),(54,1),(55,0),(55,1),(55,1),(56,1),(56,1),(56,1),(57,0)

,(57,0),(57,1),(57,1),(57,1),(57,1),(58,0),(58,1),(58,1),(59,1),(59,1)

,(60,0),(60,1),(61,1),(62,1),(62,1),(63,1),(64,0),(64,1),(65,1),(69,1)

)n(age,chd)

SELECT

FROM

wct.LOGIT(

'SELECT

age

,chd

FROM

#chd'

)

This returns the following values, but we are really only concerned with the first two.

stat_name idx stat_val

---------- ----------- ----------------------

b 0 -5.30945337391905

b 1 0.1109211422069

se 0 1.13365463681529

se 1 0.0240598358749988

z 0 -4.68348401840844

z 1 4.61022023521619

pval 0 2.82039453738119E-06

pval 1 4.02242615828777E-06

Wald 0 21.9350225506872

Wald 1 21.2541306171968

LL0 NULL -68.3314913574166

LLM NULL -53.6765463471564

chisq NULL 29.3098900205205

df NULL 1

p_chisq NULL 6.16800830270808E-08

AIC NULL 111.353092694313

BIC NULL 116.563433066289

Nobs NULL 100

rsql NULL 0.214468391061534

rsqcs NULL 0.254051637397186

rsqn NULL 0.340992792718107

D NULL -0.482789163126702

Iterations NULL 5

Converged NULL 1

This tells us that based on age we would estimate the probability that coronary heart disease is present in a 24 year-old as

SELECT

1/(1 + EXP(-(-5.30945 + 0.11092 * 24)))

or approximately 6.6%.

We have also implemented a LOGITSUM function so that you can analyze grouped data. Instead of passing in a column containing the dichotomous results for each observation you can pass in counts of success and failures for each grouping of the data. For example, let's say our coronary heart disease data were presented to us in the following way:

mean age chd N

---------------------- ----------- -----------

25.4 1 10

32 2 15

36.9167 3 12

42.3333 5 15

47.2308 6 13

51.875 5 8

56.8824 13 17

63 8 10

We could use the LOGITSUM function.

SELECT

INTO

#chdsum

FROM (VALUES

(25.4,1,10)

,(32,2,15)

,(36.9167,3,12)

,(42.3333,5,15)

,(47.2308,6,13)

,(51.875,5,8)

,(56.8824,13,17)

,(63,8,10)

)p([mean age],chd,N)

SELECT

FROM

wct.LOGITSUM('

SELECT

[mean age],

chd,

N-chd

FROM

#chdsum'

)

This produces the following result.

stat_name idx stat_val

---------- ----------- ----------------------

b 0 -5.20385902245454

b 1 0.108651473280877

se 0 1.11798776031707

se 1 0.0237329519129878

z 0 -4.65466546876925

z 1 4.57808508942444

pval 0 3.24507143474021E-06

pval 1 4.69251822153992E-06

Wald 0 21.6659106261529

Wald 1 20.9588630860104

LL0 NULL -68.3314913574166

LLM NULL -54.1790842644491

chisq NULL 28.3048141859351

df NULL 1

p_chisq NULL 1.03637918925054E-07

AIC NULL 112.358168528898

BIC NULL 117.568508900874

Nobs NULL 100

rsql NULL 0.207113979394092

rsqcs NULL 0.246516487110749

rsqn NULL 0.330878975046836

D NULL -0.464151599148838

Iterations NULL 5

Converged NULL 1

Now, when we calculate the probability of the presence of coronary heart disease in a 24 year-old, the formula would be:

SELECT

1/(1 + EXP(-(-5.20386 + 0.10865 * 24)))

or approximately 6.9%.

You can use the LOGITPROB or the LOGITPRED functions to calculate the probability that Y = 1. For example,

SELECT

wct.LOGITPRED('-5.20386, 0.10865','24')

produces the same result as the previous SELECT statement. When there are many independent variables and/or many predictions to be made, the LOGITPRED and LOGITPROB functions are simpler to use than explicitly embedding the equation in a SELECT statement.

The VIF function is used to detect collinearity and should be used in conjunction with not only the LOGIT and LOGITSUM functions but also with LINEST, LINEST_q, LOGEST, and LOGEST_q. VIF works by performing ordinary least squares regression on all the combinations of independent variables and calculating the R² for each combination. From the R² value, it then calculates the tolerance and the variance inflation factors. Based upon the results of this table-valued function you can determine whether or not to eliminate any of the independent variables from the regression analysis.

You can read more about these function and find more examples in our on-line documentation. You can also try out these functions today by downloading the free 15-day trial and installing it on your SQL Server database. We think that these functions are a great tool to have inside SQL Server as it permits you to do the analysis right where the data is, without having to drag the data across the network, and let's you use simple SQL commands instead of having to learn how to use another tool. We hope that you like it and, as always, if there is something that you would like to see in XLeratorDB, just send us an e-mail at support@westclintech.com.

See Also

· LINEST - Linear regression

· LOGEST - Logarithmic regression

· LOGIT - Logit regression

· LOGITPRED - Calculate predicted values based on a logit regression

· LOGITPROB - Calculate predicted values based on a logit regression

· LOGITSUM - Logit regression using summary data

· VIF - Variance inflation factors

Trackback Print

Tags:

Categories:

Location: Blogs Parent Separator