LOGIT
Updated: 27 February 2015
Use the tablevalued function LOGIT to calculate the binary logistic regression coefficients from a table of xvalues (the independent variables) and dichotomous yvalues (the dependent variable). The function supports the entry of the data in raw form, where the yvalues are a column of zeros and ones {0,1}. The yvalue can appear in any column and the column number of the yvalues is specified at input.
x1

x2

x3

…

xn

y

x_{1,1}

x_{1,2}

x_{1,3}

…

x_{1,n}

y_{1}

x_{2,1}

x_{2,2}

x_{2,3}

…

x_{2,n}

y_{2}

…

…

…

…

…

…

x_{m,1}

x_{m,2}

x_{m,3}

…

x_{m,n}

y_{m}

For summary values, where the yvalues are the counts of successes and failures, use the LOGITSUM function.
Logistic regression estimates the probability of an event occurring. Unlike ordinary least squares (see LINEST and LINEST_q) which estimates the value of a dependent variable based on the independent variables, LOGIT measures the probability (p) of an event occurring (1) or not occurring (0). The probability is estimated as:
The LOGIT function works by finding the coefficient (ß) values that maximize the loglikelihood statistic, which is defined as:
using a method of iteratively reweighted least squares.
Syntax
SELECT * FROM [wct].[LOGIT](
<@Matrix_RangeQuery, nvarchar(max),>
,<@y_ColumnNumber, int,>)
@Matrix_RangeQuery
the SELECT statement, as a string, which, when executed, creates the resultant table of x and yvalues used in the calculation.
@y_ColumnNumber
the number of the column in the resultant table returned by @Matrix_RangeQuery which contains the dichotomous outcomes. @y_ColumnNumber must be of the type int or of a type that implicitly converts to int.
Return Types
RETURNS TABLE(
[stat_name] [nvarchar](10) NULL,
[idx] [int] NULL,
[stat_val] [float] NULL
)
Table Description
stat_name
Identifies the statistic being returned:
b

estimated coefficient for each independent variable plus the intercept

se

standard error of b

z

z statistic for b

pval

pvalue (normal distribution) for the z statistic

Wald

Wald statistic

LL0

loglikelihood with just the intercept and no other coefficients

LLM

model loglikelihood

chisq

chi squared statistic

df

degrees of freedom

p_chisq

pvalue of the chisquared statistic

AIC

Akaike information criterion

BIC

Bayesian information criterion

Nobs

number of observations

rsql

loglinear ratio R^{2}

rsqcs

Cox and Snell's R^{2}

rsqn

Nagelkerke's R^{2}

D

deviance

Iterations

number of iteration in iteratively reweighted least squares

Converged

a bit value identifying whether a solution was found (1) or not (0)

AUROC

area under the ROC curve

For more information on how these statistics are calculated see the Examples.
idx
Identifies the subscript for the estimated coefficient (b), standard error of the coefficient (se),
z statistics (z), pvalue of the z statistic (pval), and the Wald statistic.
When the idx is 0, it is referring to the intercept, otherwise the idx identifies the column number of independent variable.
Descriptive statistics other than the ones mentioned above will have an idx of NULL.
stat_val
the calculated value of the statistic.
Remarks
· If @y_ColumnNumber is NULL then the rightmost column in the resultant table is assumed to contain the dichotomous results.
· @Matrix_RangeQuery must return at least 2 columns or an error will be returned.
· If @y_ColumnNumber is not NULL and @y_ColumnNumber < 1 an error will be returned.
· If @y_ColumnNumber is not NULL and @y_ColumnNumber greater than the number of columns returned by @Matrix_RangeQuery an error will be returned.
Examples
Example #1
In this example we use the Coronary Heart Disease data from Applied Logistic Regression, Third Edition by David W. Hosmer, Jr., Stanley Lemeshow, and Rodney X. Sturdivant. The data consist of a single independent variable (age) and an outcome (chd) which indicates the absence (0) or presence (1) of coronary heart disease.
The following SQL populates a temporary table, #chd.
SELECT
*
INTO
#chd
FROM (VALUES
(20,0),(23,0),(24,0),(25,0),(25,1),(26,0),(26,0),(28,0),(28,0),(29,0)
,(30,0),(30,0),(30,0),(30,0),(30,0),(30,1),(32,0),(32,0),(33,0),(33,0)
,(34,0),(34,0),(34,1),(34,0),(34,0),(35,0),(35,0),(36,0),(36,1),(36,0)
,(37,0),(37,1),(37,0),(38,0),(38,0),(39,0),(39,1),(40,0),(40,1),(41,0)
,(41,0),(42,0),(42,0),(42,0),(42,1),(43,0),(43,0),(43,1),(44,0),(44,0)
,(44,1),(44,1),(45,0),(45,1),(46,0),(46,1),(47,0),(47,0),(47,1),(48,0)
,(48,1),(48,1),(49,0),(49,0),(49,1),(50,0),(50,1),(51,0),(52,0),(52,1)
,(53,1),(53,1),(54,1),(55,0),(55,1),(55,1),(56,1),(56,1),(56,1),(57,0)
,(57,0),(57,1),(57,1),(57,1),(57,1),(58,0),(58,1),(58,1),(59,1),(59,1)
,(60,0),(60,1),(61,1),(62,1),(62,1),(63,1),(64,0),(64,1),(65,1),(69,1)
)n(age,chd)
We can run the following SQL to reproduce Table 1.2, Frequency Table of Age Group by CHD from the Hosmer book, verifying that we are using the same data.
SELECT
a.descr as [Age Group]
,COUNT(*) as n
,COUNT(*)  SUM(c.chd) as Absent
,SUM(c.chd) as Present
,AVG(cast(c.chd as float)) as Mean
FROM
#chd c
CROSS APPLY(
SELECT TOP 1
grp
,descr
FROM (VALUES
(20,1,'2029'),(30,2,'3034'),(35,3,'3539'),(40,4,'4044')
,(45,5,'4549'),(50,6,'5055'),(55,7,'5559'),(60,8,'6069')
)n(age,grp,descr)
WHERE
n.age <= c.age
ORDER BY
n.age DESC
)a
GROUP BY
a.descr
,a.grp
ORDER BY
1
This produces the following result.
This SQL calculates the results of the logistic regression.
SELECT
*
FROM
wct.LOGIT(
'SELECT
age
,chd
FROM
#chd'
,2
)
This produces the following result.
Let's reformat some of these to make them easier to read and easier to compare to Hosmer. This SQL reproduces Table 1.3, Results of Fitting the Logistic Regression to the CHDAGE Data, n = 100.
SELECT
n.Variable
,p.b as Coeff
,p.se as [Std. Err]
,p.z
,p.pval as p
FROM (
SELECT
*
FROM
wct.LOGIT(
'SELECT
age
,chd
FROM
#chd'
,2
)
)d
PIVOT(
SUM(stat_val)
FOR
stat_name
IN
(b,se,z,pval)
)p
CROSS APPLY(
VALUES (0,'Constant'),(1,'Age')
)n(idx,variable)
WHERE
p.idx = n.idx
ORDER BY
p.idx DESC
This produces the following result.
As Hosmer points out, the fitted values, are given by the equation
meaning that we can use this equation to predict the probability of the presence of coronary heart disease given a person's age.
If you are not interested in an explanation of how the remaining statistics are calculated, then skip to the next example. In order to explore how these remaining statistics are calculated, let's store the results of the LOGIT function in a table (#mylogit) with the following SQL.
SELECT
stat_name
,idx
,stat_val
INTO
#mylogit
FROM
wct.LOGIT(
'SELECT
age
,chd
FROM
#chd'
,2
)
The standard errors of the coefficients (se) can be calculated as the square root of the diagonal of the covariance matrix:
Where X is the design matrix (a column of ones added to the input data) and W is the diagonal matrix of weights calculated by the iteratively reweighted least squares process.
In this piece of SQL we will create a table which calculates the loglikelihood (LL), the weights (W) and the predicted value, which can then be used to calculate the covariance matrix.
SELECT
*
,(p_obs*LOG(p_pred)+(1p_obs)*LOG(1p_pred)) as LL
,p_pred*(1p_pred) as W
INTO
#m
FROM (
SELECT
age
,chd as p_obs
,EXP(b0.stat_val + age * b1.stat_val)/(1+EXP(b0.stat_val + age * b1.stat_val)) as p_pred
FROM
#mylogit b0
JOIN
#mylogit b1
ON
b0.stat_name = 'b'
AND b0.idx = 0
AND b1.stat_name = 'b'
AND b1.idx = 1
CROSS JOIN
#chd
)n
We can now use the matrix functions from the XLeratorDB math module to verify the results returned by the LOGIT function for the standard errors (se).
SELECT
RowNum as idx
,SQRT(ItemValue) as se
FROM
wct.Matrix(
wct.MATINVERSE(
wct.MATMULT(
wct.TRANSPOSE(
wct.Matrix2String_q('SELECT 1, age FROM #m')
)
,wct.Matrix2String_q('SELECT W, W * age FROM #m')
)
)
)
WHERE
RowNum = ColNum
DROP TABLE
#chd
This produces the following result.
From the coefficients (b) and the se values, we can calculate the z, pval, and Wald values. The following SQL demonstrates the calculation, using the b and se values stored in #mylogit.
SELECT
n.Variable
,p.b / p.se as z
,2 * wct.NORMSDIST(ABS(p.b / p.se)) as pval
,POWER(p.b / p.se, 2) as Wald
FROM (
SELECT
*
FROM
#mylogit
)d
PIVOT(
SUM(stat_val)
FOR
stat_name
IN
(b,se)
)p
CROSS APPLY(
VALUES (0,'Constant'),(1,'Age')
)n(idx,variable)
WHERE
p.idx = n.idx
ORDER BY
p.idx DESC
This produces the following result.
The model loglikelihood (LLM) is simply the sum of the individual loglikelihoods which we have already computed and stored in #m.
SELECT
SUM(LL) as LLM
FROM
#m
This produces the following result.
The calculation of LL0 can be done directly from the input data.
SELECT
f*LOG(f) + s*LOG(s)  n*LOG(n) as LL0
FROM (
SELECT
COUNT(*) as n
,SUM(chd) as s
,COUNT(*)  SUM(chd) as f
FROM
#chd
)n
This produces the following result
Given LL0 and LLM it is relatively straightforward to calculate the rest of the statistics.
SELECT
x.*
FROM (
SELECT
2 *(LLM  LL0) as chisq
,wct.CHIDIST(2 *(LLM  LL0), df) as pchisq
,2 *(LLM  df  1) as AIC
,2 * LLM + LOG(Nobs) *(df + 1) as BIC
,1  LLM/LL0 as rsql
,1  EXP((2/Nobs)*(LLMLL0)) as rsqcs
,(1  EXP((2/Nobs)*(LLMLL0)))/(1exp(2*LL0/Nobs)) as rsqn
,2*LOG(LL0/LLM) as D
FROM (
SELECT
stat_name
,stat_val
FROM
#mylogit
)d
PIVOT(
SUM(stat_val)
FOR
stat_name
IN
(LL0,LLM,df,Nobs)
)p
)q
CROSS APPLY(
VALUES
('chisq',chisq),
('pchisq',pchisq),
('AIC',AIC),
('BIC',BIC),
('rsql',rsql),
('rsqcs',rsqcs),
('rsqn',rsqn),
('D',D)
)x(stat_name, stat_val)
This produces the following result.
We can use the tablevalued function ROCTable for the calculation of the area under the ROC curve (AUROC). The tablevalued function requires the predicted probabilities and associated absence (0) or presence (1) of coronary heart disease as inputs. We will use the LOGITPRED function to create these combinations and store them in a temporary table which will then be called from ROCTable.
SELECT
wct.LOGITPRED('SELECT stat_val FROM #mylogit where stat_name = ''b'' ORDER BY idx',cast(age as varchar(max))) as [p predicted]
,chd as y
INTO
#t
FROM
#chd
SELECT
*
FROM
wct.ROCTABLE('SELECT * FROM #t')
This produces the following result.
You can get more information about the calculation of the area under the ROC curve by going to the ROCTable documentation
Example #2
The dataset contains 4 columns of data which are labeled admit, gre, gpa, and rank. The variables gpa and gre will be treated as continuous. The variable rank has the values 1 through 4 where a 1 indicates the highestranked institutions and a 4 indicates the lowestranked institutions.
We can run the following SQL to get some basic descriptive statistics to make sure that the data are loaded correctly.
SELECT
n.lbl,
wct.QUARTILE(admit,n.x) as Admit,
wct.QUARTILE(gre,n.x) as gre,
wct.QUARTILE(gpa,n.x) as gpa,
wct.QUARTILE(rank,n.x) as rank
FROM
#mydata
CROSS APPLY(VALUES
('Min',0),('1st Quartile',1),('Median',2),('3rd Quartile',3),('Max',4))n(lbl,x)
GROUP BY
n.lbl
UNION
SELECT
'Mean',
AVG(cast(admit as float)) as Admit,
AVG(cast(gre as float)) as gre,
AVG(cast(gpa as float)) as gpa,
AVG(cast(rank as float)) as rank
FROM
#mydata
ORDER BY
3
This produces the following result.
To get the standard deviations, we could run the following SQL.
SELECT
wct.STDEV_S(admit) as Admit,
wct.STDEV_S(gre) as gre,
wct.STDEV_S(gpa) as gpa,
wct.STDEV_S(rank) as rank
FROM
#mydata
This produces the following result.
This SQL will create a twoway contingency table between the admit outcome and the rank predictor. We run this SQL to check to see if there are any zeroes in the contingency table.
SELECT
admit,[1],[2],[3],[4]
FROM (
SELECT admit, rank FROM #mydata
) p
PIVOT(
COUNT([rank]) FOR [rank] IN([1],[2],[3],[4])
) as d
This produces the following result.
This SQL will run the logistic regression and put the results in a temporary table, #mylogit. The only reason for storing the results in the temp table is to make it easier to follow the rest of the example.
SELECT
*
INTO
#mylogit
FROM
wct.LOGIT(
'SELECT
admit
,gre
,gpa
,CASE RANK
WHEN 2 THEN 1
ELSE 0
END
,CASE RANK
WHEN 3 THEN 1
ELSE 0
END
,CASE RANK
WHEN 4 THEN 1
ELSE 0
END
FROM
#mydata'
,1
)
Note that we have taken the rank column and turned it into 3 columns, representing the ranks of 2, 3, and 4, since we are treating the rank variable as discrete and not continuous.
In this SQL, we are going to select the estimated coefficients, the standard error of the estimates, the zvalue, and the probability associated with that zvalue, and return the results in spreadsheet format with the appropriate labels for the estimated coefficients.
SELECT
CASE idx
WHEN 0 THEN 'Intercept'
WHEN 1 THEN 'gre'
WHEN 2 THEN 'gpa'
WHEN 3 THEN 'rank2'
WHEN 4 THEN 'rank3'
WHEN 5 THEN 'rank4'
END as [X]
,ROUND(b,5) as [Estimated]
,ROUND(se,5) as [Std. Error]
,ROUND(z,2) as [z Value]
,ROUND(pval,5) as [Pr(>z)]
,ROUND(Wald,2) as Wald
FROM
#mylogit
PIVOT(
SUM(stat_val) FOR stat_name IN(b,se,z,pval,Wald)
) as pvt
WHERE
idx IS NOT NULL
ORDER BY
idx
This produces the following result.
The results of the regression indicate that both the gre and gpa are statistically significant as are the three terms for rank. You could interpret the results in the following way:
o for every one unit change in gre, the log odds of success (admission) increase by 0.00226.
o for every one unit change in gpa, the log odds of success increase by 0.80404.
o since the rank variable is not continuous it is interpreted differently. For example, attending an undergraduate institution with a rank of 2 (as opposed to 1) decreases the log odds of admission by 0.67544.
The following SQL returns the remaining statistics from the regression analysis.
SELECT
stat_name
,stat_val
FROM
#mylogit
WHERE
idx IS NULL
This produces the following result.
See Also