XLeratorDB/statistics Documentation

SQL Server LOGIT function


LOGIT

Updated: 27 February 2015


Use the table-valued function LOGIT to calculate the binary logistic regression coefficients from a table of x-values (the independent variables) and dichotomous y-values (the dependent variable). The function supports the entry of the data in raw form, where the y-values are a column of zeros and ones {0,1}. The y-value can appear in any column and the column number of the y-values is specified at input.

x1      x2      x3      ...     xn      y
x1,1    x1,2    x1,3    ...     x1,n    y1
x2,1    x2,2    x2,3    ...     x2,n    y2
...     ...     ...     ...     ...     ...
xm,1    xm,2    xm,3    ...     xm,n    ym

For summary values, where the y-values are the counts of successes and failures, use the LOGITSUM function.
Logistic regression estimates the probability of an event occurring. Unlike ordinary least squares (see LINEST and LINEST_q), which estimates the value of a dependent variable based on the independent variables, LOGIT measures the probability (p) of an event occurring (1) or not occurring (0). The probability is estimated as:

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n}}$$
The LOGIT function works by finding the coefficient (β) values that maximize the log-likelihood statistic, which is defined as:

$$LL = \sum_{i=1}^{m} \left[\, y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \,\right]$$

using a method of iteratively re-weighted least squares.
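
For reference, a standard formulation of the iteratively re-weighted least squares update (a sketch of the general technique; the function's internal implementation is not exposed) is:

$$\beta^{(t+1)} = \beta^{(t)} + \left(X^{\mathsf{T}} W X\right)^{-1} X^{\mathsf{T}} \left(y - p\right)$$

where X is the design matrix, W is the diagonal matrix with entries $p_i(1 - p_i)$, and $p_i$ is the predicted probability under the current coefficients $\beta^{(t)}$; the iteration repeats until the coefficients converge.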

Syntax

SELECT * FROM [wct].[LOGIT](
  <@Matrix_RangeQuery, nvarchar(max),>
 ,<@y_ColumnNumber, int,>)
@Matrix_RangeQuery
the SELECT statement, as a string, which, when executed, creates the resultant table of x- and y-values used in the calculation.
@y_ColumnNumber
the number of the column in the resultant table returned by @Matrix_RangeQuery which contains the dichotomous outcomes. @y_ColumnNumber must be of the type int or of a type that implicitly converts to int.

Return Types

RETURNS TABLE(
       [stat_name] [nvarchar](10) NULL,
       [idx] [int] NULL,
       [stat_val] [float] NULL
)

Table Description

stat_name
Identifies the statistic being returned:

b
estimated coefficient for each independent variable plus the intercept
se
standard error of b
z
z statistic for b
pval
p-value (normal distribution) for the z statistic
Wald
Wald statistic
LL0
log-likelihood with just the intercept and no other coefficients
LLM
model log-likelihood
chisq
chi squared statistic
df
degrees of freedom
p_chisq
p-value of the chi-squared statistic
AIC
Akaike information criterion
BIC
Bayesian information criterion
Nobs
number of observations
rsql
log-linear ratio R2
rsqcs
Cox and Snell's R2
rsqn
Nagelkerke's R2
D
deviance
Iterations
number of iterations in the iteratively re-weighted least squares process
Converged
a bit value identifying whether a solution was found (1) or not (0)
AUROC
area under the ROC curve

For more information on how these statistics are calculated see the Examples.
idx
Identifies the subscript for the estimated coefficient (b), standard error of the coefficient (se),
z statistics (z), p-value of the z statistic (pval), and the Wald statistic.
When the idx is 0, it refers to the intercept; otherwise the idx identifies the column number of the independent variable.
Descriptive statistics other than the ones mentioned above will have an idx of NULL.
stat_val
the calculated value of the statistic.
 
Remarks

·         If @y_ColumnNumber is NULL then the right-most column in the resultant table is assumed to contain the dichotomous results.
·         @Matrix_RangeQuery must return at least 2 columns or an error will be returned.
·         If @y_ColumnNumber is not NULL and @y_ColumnNumber < 1 an error will be returned.
·         If @y_ColumnNumber is not NULL and @y_ColumnNumber is greater than the number of columns returned by @Matrix_RangeQuery an error will be returned.
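
For example, the following call (a minimal sketch using the #chd table built in Example #1 below) passes NULL for @y_ColumnNumber, so LOGIT treats the right-most column, chd, as the dichotomous outcome; it is equivalent to passing 2.

SELECT
   *
FROM
   wct.LOGIT(
   'SELECT
        age
       ,chd
    FROM
       #chd'
   ,NULL
   )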

Examples

Example #1
In this example we use the Coronary Heart Disease data from Applied Logistic Regression, Third Edition by David W. Hosmer, Jr., Stanley Lemeshow, and Rodney X. Sturdivant. The data consist of a single independent variable (age) and an outcome (chd) which indicates the absence (0) or presence (1) of coronary heart disease.
The following SQL populates a temporary table, #chd.
SELECT
   *
INTO
   #chd
FROM (VALUES
    (20,0),(23,0),(24,0),(25,0),(25,1),(26,0),(26,0),(28,0),(28,0),(29,0)
    ,(30,0),(30,0),(30,0),(30,0),(30,0),(30,1),(32,0),(32,0),(33,0),(33,0)
    ,(34,0),(34,0),(34,1),(34,0),(34,0),(35,0),(35,0),(36,0),(36,1),(36,0)
    ,(37,0),(37,1),(37,0),(38,0),(38,0),(39,0),(39,1),(40,0),(40,1),(41,0)
    ,(41,0),(42,0),(42,0),(42,0),(42,1),(43,0),(43,0),(43,1),(44,0),(44,0)
    ,(44,1),(44,1),(45,0),(45,1),(46,0),(46,1),(47,0),(47,0),(47,1),(48,0)
    ,(48,1),(48,1),(49,0),(49,0),(49,1),(50,0),(50,1),(51,0),(52,0),(52,1)
    ,(53,1),(53,1),(54,1),(55,0),(55,1),(55,1),(56,1),(56,1),(56,1),(57,0)
    ,(57,0),(57,1),(57,1),(57,1),(57,1),(58,0),(58,1),(58,1),(59,1),(59,1)
    ,(60,0),(60,1),(61,1),(62,1),(62,1),(63,1),(64,0),(64,1),(65,1),(69,1)
   )n(age,chd)
 

We can run the following SQL to reproduce Table 1.2, Frequency Table of Age Group by CHD from the Hosmer book, verifying that we are using the same data.
SELECT
    a.descr as [Age Group]
   ,COUNT(*) as n
   ,COUNT(*) - SUM(c.chd) as Absent
   ,SUM(c.chd) as Present
   ,AVG(cast(c.chd as float)) as Mean
FROM
   #chd c
CROSS APPLY(
   SELECT TOP 1
        grp
       ,descr
   FROM (VALUES
        (20,1,'20-29'),(30,2,'30-34'),(35,3,'35-39'),(40,4,'40-44')
       ,(45,5,'45-49'),(50,6,'50-54'),(55,7,'55-59'),(60,8,'60-69')
       )n(age,grp,descr)
   WHERE
       n.age <= c.age
   ORDER BY
       n.age DESC
   )a
GROUP BY
    a.descr
   ,a.grp
ORDER BY
   1

This produces the following result.
 


This SQL calculates the results of the logistic regression.
SELECT
   *
FROM
   wct.LOGIT(
   'SELECT
        age
       ,chd
    FROM
       #chd'
   ,2
   )

This produces the following result.


Let's reformat some of these to make them easier to read and easier to compare to Hosmer. This SQL reproduces Table 1.3, Results of Fitting the Logistic Regression to the CHDAGE Data, n = 100.
SELECT
    n.Variable
   ,p.b as Coeff
   ,p.se as [Std. Err]
   ,p.z
   ,p.pval as p
FROM (
   SELECT
       *
   FROM
       wct.LOGIT(
       'SELECT
           age
          ,chd
        FROM
          #chd'
       ,2
       )
   )d
PIVOT(
   SUM(stat_val)
FOR
   stat_name
IN
   (b,se,z,pval)
   )p
CROSS APPLY(
   VALUES (0,'Constant'),(1,'Age')
   )n(idx,variable)
WHERE
   p.idx = n.idx
ORDER BY
   p.idx DESC

This produces the following result.
 


As Hosmer points out, the fitted values are given by the equation

$$\hat{\pi}(\text{age}) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1\,\text{age}}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1\,\text{age}}}$$

meaning that we can use this equation to predict the probability of the presence of coronary heart disease given a person's age.
If you are not interested in an explanation of how the remaining statistics are calculated, then skip to the next example. To explore those calculations, let's store the results of the LOGIT function in a table (#mylogit) with the following SQL.
SELECT
    stat_name
   ,idx
   ,stat_val
INTO
   #mylogit
FROM
   wct.LOGIT(
   'SELECT
        age
       ,chd
    FROM
       #chd'
   ,2
   )
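
As a quick sketch of the prediction equation in SQL (the age value of 50 is just an illustrative input), we can compute the fitted probability directly from the stored coefficients:

-- Fitted probability of CHD at age 50, read straight from #mylogit
SELECT
   EXP(b0.stat_val + 50 * b1.stat_val)
   / (1 + EXP(b0.stat_val + 50 * b1.stat_val)) as p_pred
FROM
   #mylogit b0
JOIN
   #mylogit b1
ON
   b0.stat_name = 'b'
   AND b0.idx = 0
   AND b1.stat_name = 'b'
   AND b1.idx = 1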

The standard errors of the coefficients (se) can be calculated as the square root of the diagonal of the covariance matrix:

$$\operatorname{Var}\!\left(\hat{\beta}\right) = \left(X^{\mathsf{T}} W X\right)^{-1}$$

where X is the design matrix (a column of ones added to the input data) and W is the diagonal matrix of weights calculated by the iteratively re-weighted least squares process.

In this SQL we create a table that calculates the log-likelihood (LL), the weights (W), and the predicted value, which can then be used to calculate the covariance matrix.
SELECT
   *
   ,(p_obs*LOG(p_pred)+(1-p_obs)*LOG(1-p_pred)) as LL
   ,p_pred*(1-p_pred) as W
INTO
   #m
FROM (
   SELECT
        age
       ,chd as p_obs
       ,EXP(b0.stat_val + age * b1.stat_val)/(1+EXP(b0.stat_val + age * b1.stat_val)) as p_pred
   FROM
       #mylogit b0
   JOIN
       #mylogit b1
   ON
       b0.stat_name = 'b'
       AND b0.idx = 0
       AND b1.stat_name = 'b'
       AND b1.idx = 1
   CROSS JOIN
       #chd
   )n

We can now use the matrix functions from the XLeratorDB math module to verify the results returned by the LOGIT function for the standard errors (se).
SELECT
    RowNum as idx
   ,SQRT(ItemValue) as se
FROM
   wct.Matrix(
       wct.MATINVERSE(
          wct.MATMULT(
              wct.TRANSPOSE(
                 wct.Matrix2String_q('SELECT 1, age FROM #m')
              )
              ,wct.Matrix2String_q('SELECT W, W * age FROM #m')
          )
       )
   )
WHERE
   RowNum = ColNum

This produces the following result.

From the coefficients (b) and the se values, we can calculate the z, pval, and Wald values. The following SQL demonstrates the calculation, using the b and se values stored in #mylogit.
SELECT
    n.Variable
   ,p.b / p.se as z
   ,2 * wct.NORMSDIST(-ABS(p.b / p.se)) as pval
   ,POWER(p.b / p.se, 2) as Wald
FROM (
   SELECT
       *
   FROM
       #mylogit
   )d
PIVOT(
   SUM(stat_val)
FOR
   stat_name
IN
   (b,se)
   )p
CROSS APPLY(
   VALUES (0,'Constant'),(1,'Age')
   )n(idx,variable)
WHERE
   p.idx = n.idx
ORDER BY
   p.idx DESC

This produces the following result.
 
The model log-likelihood (LLM) is simply the sum of the individual log-likelihoods, which we have already computed and stored in #m.
SELECT
   SUM(LL) as LLM
FROM
   #m

This produces the following result.


The calculation of LL0 can be done directly from the input data.
SELECT
   f*LOG(f) + s*LOG(s) - n*LOG(n) as LL0
FROM (
   SELECT
        COUNT(*) as n
       ,SUM(chd) as s
       ,COUNT(*) - SUM(chd) as f
   FROM
       #chd
   )n

This produces the following result.
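
The formula works because the intercept-only model assigns the same predicted probability, $\hat{p} = s/n$, to every observation, so that

$$LL_0 = s \ln\!\left(\frac{s}{n}\right) + f \ln\!\left(\frac{f}{n}\right) = s \ln s + f \ln f - n \ln n$$

since s + f = n.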
 


Given LL0 and LLM it is relatively straightforward to calculate the rest of the statistics.
SELECT
   x.*
FROM (
   SELECT
        2 *(LLM - LL0) as chisq
       ,wct.CHIDIST(2 *(LLM - LL0), df) as pchisq
       ,-2 *(LLM - df - 1) as AIC
       ,-2 * LLM + LOG(Nobs) *(df + 1) as BIC
       ,1 - LLM/LL0 as rsql
       ,1 - EXP((-2/Nobs)*(LLM-LL0)) as rsqcs
       ,(1 - EXP((-2/Nobs)*(LLM-LL0)))/(1-exp(2*LL0/Nobs)) as rsqn
       ,-2*LOG(LL0/LLM) as D
   FROM (
       SELECT
           stat_name
          ,stat_val
       FROM
          #mylogit
       )d
   PIVOT(
       SUM(stat_val)
   FOR
       stat_name
   IN
       (LL0,LLM,df,Nobs)
       )p
   )q
CROSS APPLY(
   VALUES
    ('chisq',chisq),
    ('pchisq',pchisq),
    ('AIC',AIC),
    ('BIC',BIC),
    ('rsql',rsql),
    ('rsqcs',rsqcs),
    ('rsqn',rsqn),
    ('D',D)
   )x(stat_name, stat_val)

This produces the following result.


We can use the table-valued function ROCTable for the calculation of the area under the ROC curve (AUROC). The table-valued function requires the predicted probabilities and the associated absence (0) or presence (1) of coronary heart disease as inputs. We will use the LOGITPRED function to create these combinations and store them in a temporary table, which is then passed to ROCTable.
SELECT
        wct.LOGITPRED('SELECT stat_val FROM #mylogit where stat_name = ''b'' ORDER BY idx',cast(age as varchar(max))) as [p predicted]
       ,chd as y
INTO
       #t
FROM
       #chd
SELECT
       *
FROM
       wct.ROCTABLE('SELECT * FROM #t')

This produces the following result.

You can get more information about the calculation of the area under the ROC curve in the ROCTable documentation.
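
Before moving on, we drop the temporary tables created in this example; in particular, Example #2 re-creates #mylogit with SELECT ... INTO, which fails if the table already exists.

DROP TABLE #chd
DROP TABLE #m
DROP TABLE #mylogit
DROP TABLE #t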

Example #2

This example is based on R Data Analysis Examples: Logit Regression, which can be found at www.ats.ucla.edu/stat/r/dae/logit.htm. We will use http://www.ats.ucla.edu/stat/data/binary.csv as input to a table called #mydata.
The dataset contains 4 columns of data which are labeled admit, gre, gpa, and rank. The variables gpa and gre will be treated as continuous. The variable rank has the values 1 through 4 where a 1 indicates the highest-ranked institutions and a 4 indicates the lowest-ranked institutions.
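
The following SQL is one way to get the csv into #mydata (a sketch; the local file path is an assumption, and binary.csv has a single header row):

CREATE TABLE #mydata (
    admit  int,
    gre    int,
    gpa    float,
    [rank] int
)
BULK INSERT #mydata
FROM 'C:\temp\binary.csv'   -- hypothetical local path; download binary.csv here first
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    FIRSTROW = 2            -- skip the header row
)
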
We can run the following SQL to get some basic descriptive statistics to make sure that the data are loaded correctly.
SELECT
   n.lbl,
   wct.QUARTILE(admit,n.x) as Admit,
   wct.QUARTILE(gre,n.x) as gre,
   wct.QUARTILE(gpa,n.x) as gpa,
   wct.QUARTILE([rank],n.x) as [rank]
FROM
   #mydata
CROSS APPLY(VALUES
   ('Min',0),('1st Quartile',1),('Median',2),('3rd Quartile',3),('Max',4))n(lbl,x)
GROUP BY
   n.lbl
UNION
SELECT
   'Mean',
   AVG(cast(admit as float)) as Admit,
   AVG(cast(gre as float)) as gre,
   AVG(cast(gpa as float)) as gpa,
   AVG(cast([rank] as float)) as [rank]
FROM
   #mydata
ORDER BY
   3

This produces the following result.
To get the standard deviations, we could run the following SQL.
SELECT
   wct.STDEV_S(admit) as Admit,
   wct.STDEV_S(gre) as gre,
   wct.STDEV_S(gpa) as gpa,
   wct.STDEV_S([rank]) as [rank]
FROM
   #mydata

This produces the following result.


This SQL creates a two-way contingency table of the admit outcome against the rank predictor. We run it to check whether there are any zeroes in the contingency table, since empty cells can make the coefficient estimates unstable.
SELECT
   admit,[1],[2],[3],[4]
FROM (
   SELECT admit, rank FROM #mydata
   ) p
PIVOT(
   COUNT([rank]) FOR [rank] IN([1],[2],[3],[4])
   ) as d

This produces the following result.


This SQL will run the logistic regression and put the results in a temporary table, #mylogit. The only reason for storing the results in the temp table is to make it easier to follow the rest of the example.
SELECT
   *
INTO
   #mylogit
FROM
   wct.LOGIT(
   'SELECT
    admit
   ,gre
   ,gpa
   ,CASE [rank]
       WHEN 2 THEN 1
       ELSE 0
    END as rank2
   ,CASE [rank]
       WHEN 3 THEN 1
       ELSE 0
    END as rank3
   ,CASE [rank]
       WHEN 4 THEN 1
       ELSE 0
    END as rank4
   FROM
       #mydata'
   ,1
   )

Note that we have taken the rank column and turned it into 3 columns, representing the ranks of 2, 3, and 4, since we are treating the rank variable as discrete and not continuous.
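
As a quick check of the encoding (a sketch against #mydata), each distinct rank maps onto the three indicator columns used in the regression:

SELECT DISTINCT
    [rank]
   ,CASE [rank] WHEN 2 THEN 1 ELSE 0 END as rank2
   ,CASE [rank] WHEN 3 THEN 1 ELSE 0 END as rank3
   ,CASE [rank] WHEN 4 THEN 1 ELSE 0 END as rank4
FROM
    #mydata
ORDER BY
    [rank]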


In this SQL, we are going to select the estimated coefficients, the standard error of the estimates, the z-value, and the probability associated with that z-value, and return the results in spreadsheet format with the appropriate labels for the estimated coefficients.
SELECT
    CASE idx
       WHEN 0 THEN 'Intercept'
       WHEN 1 THEN 'gre'
       WHEN 2 THEN 'gpa'
       WHEN 3 THEN 'rank2'
       WHEN 4 THEN 'rank3'
       WHEN 5 THEN 'rank4'
    END as [X]
   ,ROUND(b,5) as [Estimated]
   ,ROUND(se,5) as [Std. Error]
   ,ROUND(z,2) as [z Value]
   ,ROUND(pval,5) as [Pr(>|z|)]
   ,ROUND(Wald,2) as Wald
FROM
   #mylogit
PIVOT(
   SUM(stat_val) FOR stat_name IN(b,se,z,pval,Wald)
   ) as pvt
WHERE
   idx IS NOT NULL
ORDER BY
   idx

This produces the following result.

The results of the regression indicate that both gre and gpa are statistically significant, as are the three terms for rank. You could interpret the results in the following way (a sketch converting the coefficients to odds ratios follows the list):
·         for every one unit change in gre, the log odds of success (admission) increase by 0.00226.
·         for every one unit change in gpa, the log odds of success increase by 0.80404.
·         since the rank variable is not continuous, it is interpreted differently. For example, attending an undergraduate institution with a rank of 2 (as opposed to 1) decreases the log odds of admission by 0.67544.
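
Since log odds can be hard to interpret directly, a common next step is to exponentiate the coefficients to obtain odds ratios. As a minimal sketch using the b values stored in #mylogit:

-- Odds ratio for each predictor: EXP(b), skipping the intercept (idx = 0)
SELECT
    idx
   ,EXP(stat_val) as [odds ratio]
FROM
    #mylogit
WHERE
    stat_name = 'b'
    AND idx > 0
ORDER BY
    idx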


The following SQL returns the remaining statistics from the regression analysis.
SELECT
    stat_name
   ,stat_val
FROM
   #mylogit
WHERE
   idx IS NULL

This produces the following result.
 

 

See Also
·         LINEST - Linear regression

 


