Calculating a Correlation Matrix in SQL Server
May
8
Written by:
Charles Flock
5/8/2014 10:51 AM
As a follow up to our article on calculating a variance – covariance matrix in SQL Server, we look at a similar calculation to produce a correlation matrix and provide 2 relatively straightforward methods for calculating the correlation matrix in SQL Server, using the XLeratorDB function library, and compare them to the calculation provided in the Excel Data Analysis tool.
We can think of a correlation matrix in the following way. Given the following 4-by-n data set
A1
|
B1
|
C1
|
D1
|
A2
|
B2
|
C2
|
D2
|
A3
|
B3
|
C3
|
D3
|
…
|
…
|
…
|
…
|
An
|
Bn
|
Cn
|
Dn
|
the correlation matrix would be a 4-by-4 matrix with the following elements:
correl(A,A)
|
correl(B,A)
|
correl(C,A)
|
correl(D,A)
|
correl(A,B)
|
correl(B,B)
|
correl(C,B)
|
correl(D,B)
|
correl(A,C)
|
correl(B,C)
|
correl(C,C)
|
correl(D,C)
|
correl(A,D)
|
correl(B,D)
|
correl(C,D)
|
correl(D,D)
|
where CORREL is the correlation function and CORREL(A,B) would be the correlation of column A and column B. When the same column in passed into the correlation function twice, the result is one.
1
|
correl(B,A)
|
correl(C,A)
|
correl(D,A)
|
correl(A,B)
|
1
|
correl(C,B)
|
correl(D,B)
|
correl(A,C)
|
correl(B,C)
|
1
|
correl(D,C)
|
correl(A,D)
|
correl(B,D)
|
correl(C,D)
|
1
|
The covariance matrix is a square (m-by-m) matrix where m is the number of columns in the dataset.
Let's look at an example where the data are not stored in third-normal form: what I usually refer to as spreadsheet format. We will just put some data into a table.
SELECT
*
INTO
#t
FROM (VALUES
(1,23,-16.7,14.2,-36.5,-35.6),
(2,6.4,7.5,-0.6,-31.5,-56.4),
(3,20.2,17.9,-4.2,41,-33.8),
(4,20.8,31.3,-7.2,10,23.4),
(5,-0.4,8.9,43.2,35.5,27.6),
(6,6.6,-12.7,21,36.5,-2),
(7,-7.6,-24,32.8,8.5,54.8),
(8,6.2,9.2,15,-19,-51),
(9,8,-5.7,24,37.5,36.8),
(10,2.8,-16.3,7.4,18,-41.8)
)N(rn,A,B,C,D,E)
SELECT
*
FROM
#t
This is what the table should look like.
rn
|
A
|
B
|
C
|
D
|
E
|
1
|
23
|
-16.7
|
14.2
|
-36.5
|
-35.6
|
2
|
6.4
|
7.5
|
-0.6
|
-31.5
|
-56.4
|
3
|
20.2
|
17.9
|
-4.2
|
41
|
-33.8
|
4
|
20.8
|
31.3
|
-7.2
|
10
|
23.4
|
5
|
-0.4
|
8.9
|
43.2
|
35.5
|
27.6
|
6
|
6.6
|
-12.7
|
21
|
36.5
|
-2
|
7
|
-7.6
|
-24
|
32.8
|
8.5
|
54.8
|
8
|
6.2
|
9.2
|
15
|
-19
|
-51
|
9
|
8
|
-5.7
|
24
|
37.5
|
36.8
|
10
|
2.8
|
-16.3
|
7.4
|
18
|
-41.8
|
The correlation matrix is calculated on all the possible 2-column combinations. Order matters, so there are 5 * 5 possible column combinations. One way to approach the calculation is just construct 25 explicit pieces of SQL to do the calculation; something like this.
SELECT
'A' as row,
'A' as col,
wct.CORREL(A,A) as CORREL
FROM
#t
UNION ALL
SELECT
'A' as row,
'B' as col,
wct.CORREL(A,B) as CORREL
FROM
#t
UNION ALL
SELECT
'A' as row,
'C' as col,
wct.CORREL(A,C) as CORREL
FROM
#t
--Keep repeating until all possible
--combinations have been entered, ending
--with E E.
UNION ALL
SELECT
'E' as row,
'E' as col,
wct.CORREL(E,E) as CORREL
FROM
#t
But there is a much more efficient and simpler way to do this. We can UNPIVOT the columns using the CROSS APPLY syntax and automatically generate all the possible column combinations.
SELECT
x.col as row,
y.col,
WCT.CORREL(x.val,y.val) as CORREL
FROM
#t
CROSS APPLY(VALUES
('A',A),('B',B),('C',C),('D',D),('E',E))x(col,val)
CROSS APPLY(VALUES
('A',A),('B',B),('C',C),('D',D),('E',E))y(col,val)
GROUP BY
x.col,
y.col
This produces the following result.
row col CORREL
---- ---- ----------------------
A A 1
B A 0.46938069802843
C A -0.678918454407369
D A -0.178483779232176
E A -0.352988235643861
A B 0.469380698028429
B B 1
C B -0.507984939857974
D B 0.0831788330441863
E B -0.114638730041581
A C -0.678918454407369
B C -0.507984939857974
C C 1
D C 0.260093802856154
E C 0.566800670902373
A D -0.178483779232176
B D 0.0831788330441863
C D 0.260093802856154
D D 1
E D 0.522429502534083
A E -0.352988235643861
B E -0.114638730041581
C E 0.566800670902373
D E 0.522429502534083
E E 1
If you wanted to return the result in 'spreadsheet' format, we can simply wrap our SQL inside a PIVOT statement.
SELECT
[A],[B],[C],[D],[E]
FROM (
SELECT
x.col as row,
y.col,
WCT.CORREL(x.val,y.val) as CORREL
FROM
#t
CROSS APPLY(VALUES
('A',A),('B',B),('C',C),('D',D),('E',E))x(col,val)
CROSS APPLY(VALUES
('A',A),('B',B),('C',C),('D',D),('E',E))y(col,val)
GROUP BY
x.col,
y.col
) D
PIVOT(
SUM(CORREL)
FOR
col in([A],[B],[C],[D],[E])
) as P
ORDER BY
row
This produces the following result.
A
|
B
|
C
|
D
|
E
|
1
|
0.469381
|
-0.67892
|
-0.17848
|
-0.35299
|
0.469381
|
1
|
-0.50798
|
0.083179
|
-0.11464
|
-0.67892
|
-0.50798
|
1
|
0.260094
|
0.566801
|
-0.17848
|
0.083179
|
0.260094
|
1
|
0.52243
|
-0.35299
|
-0.11464
|
0.566801
|
0.52243
|
1
|
It turns out that Excel has a Data Analysis tool for calculating the correlation matrix. Let's see what it returns using the same data.
|
A
|
B
|
C
|
D
|
E
|
A
|
1
|
|
|
|
|
B
|
0.469381
|
1
|
|
|
|
C
|
-0.67892
|
-0.50798
|
1
|
|
|
D
|
-0.17848
|
0.083179
|
0.260094
|
1
|
|
E
|
-0.35299
|
-0.11464
|
0.566801
|
0.52243
|
1
|
Interestingly, Excel only returns the lower triangular portion of the correlation matrix. As you can see from the matrix we created in SQL Server, the correlation matrix is symmetric so that the value for CORREL(A,B) = CORREL(B,A). There is really no explanation in the Excel documentation about why the upper triangular portion of the matrix is not returned.
In dealing with the data in spreadsheet format, our SQL is sensitive to the column definition of the underlying table. In this next example we will use some matrix functions from the XLeratorDB/math library to create the correlation matrix, which returns a result in third-normal form, and which doesn't require un-pivoting the underlying data. This SQL uses the covariance matrix to calculate the correlation matrix.
DECLARE @myMatrix as varchar(max) = wct.MATRIX2STRING_q('
SELECT
*
FROM (VALUES
(23,-16.7,14.2,-36.5,-35.6),
(6.4,7.5,-0.6,-31.5,-56.4),
(20.2,17.9,-4.2,41,-33.8),
(20.8,31.3,-7.2,10,23.4),
(-0.4,8.9,43.2,35.5,27.6),
(6.6,-12.7,21,36.5,-2),
(-7.6,-24,32.8,8.5,54.8),
(6.2,9.2,15,-19,-51),
(8,-5.7,24,37.5,36.8),
(2.8,-16.3,7.4,18,-41.8)
)N(A,B,C,D,E)
')
DECLARE @mrows as float = wct.MROWS(@myMatrix)
DECLARE @invmrows as float = 1e+00/@mrows
DECLARE @x as varchar(max) =(SELECT wct.MATSUB(@MyMatrix, wct.MATMULT(wct.MATMULT(wct.ONES(@mrows,@mrows),@myMatrix),@invmrows)))
;with mycte as (
SELECT
*
FROM
wct.MATRIX(wct.MATMULT(wct.MATMULT(wct.TRANSPOSE(@x),@x),@invmrows))
)
SELECT
xy.RowNum,
xy.ColNum,
xy.ItemValue /(SQRT(x.ItemValue) * SQRT(y.ItemValue)) as ItemValue
FROM
mycte xy
JOIN
mycte x
ON
xy.RowNum = x.ColNum
AND xy.RowNum = x.RowNum
JOIN
mycte y
ON
xy.ColNum = y.ColNum
AND xy.ColNum = y.RowNum
This produces the following result.
RowNum ColNum ItemValue
----------- ----------- ----------------------
0 0 1
0 1 0.46938069802843
0 2 -0.678918454407369
0 3 -0.178483779232176
0 4 -0.352988235643861
1 0 0.46938069802843
1 1 1
1 2 -0.507984939857974
1 3 0.0831788330441863
1 4 -0.114638730041581
2 0 -0.678918454407369
2 1 -0.507984939857974
2 2 1
2 3 0.260093802856154
2 4 0.566800670902373
3 0 -0.178483779232176
3 1 0.0831788330441863
3 2 0.260093802856154
3 3 1
3 4 0.522429502534083
4 0 -0.352988235643861
4 1 -0.114638730041581
4 2 0.566800670902373
4 3 0.522429502534083
4 4 1
As with the CORREL function, if you want the resultant table returned in spreadsheet format you will have to use a PIVOT.
SELECT
[0],[1],[2],[3]
FROM (
SELECT
xy.RowNum,
xy.ColNum,
xy.ItemValue /(SQRT(x.ItemValue) * SQRT(y.ItemValue)) as ItemValue
FROM
mycte xy
JOIN
mycte x
ON
xy.RowNum = x.ColNum
AND xy.RowNum = x.RowNum
JOIN
mycte y
ON
xy.ColNum = y.ColNum
AND xy.ColNum = y.RowNum
)D
PIVOT(
SUM(ItemValue)
FOR
ColNum in([0],[1],[2],[3])
) as P
The following SQL will automatically figure out the number of columns and return the results in spreadsheet format, so that it can be used regardless of the dimension of the input matrix.
DECLARE @myMatrix as varchar(max) = wct.MATRIX2STRING_q('
SELECT
*
FROM (VALUES
(23,-16.7,14.2,-36.5,-35.6),
(6.4,7.5,-0.6,-31.5,-56.4),
(20.2,17.9,-4.2,41,-33.8),
(20.8,31.3,-7.2,10,23.4),
(-0.4,8.9,43.2,35.5,27.6),
(6.6,-12.7,21,36.5,-2),
(-7.6,-24,32.8,8.5,54.8),
(6.2,9.2,15,-19,-51),
(8,-5.7,24,37.5,36.8),
(2.8,-16.3,7.4,18,-41.8)
)N(A,B,C,D,E)
')
DECLARE @mrows as float = wct.MROWS(@myMatrix)
DECLARE @mcols as float = wct.MCOLS(@myMatrix)
DECLARE @invmrows as float = 1e+00/@mrows
DECLARE @x as varchar(max) =(SELECT wct.MATSUB(@MyMatrix, wct.MATMULT(wct.MATMULT(wct.ONES(@mrows,@mrows),@myMatrix),@invmrows)))
DECLARE @sqlstring as varchar(max) = '
with mycte as (
SELECT
*
FROM
wct.MATRIX(wct.MATMULT(wct.MATMULT(wct.TRANSPOSE(' + '''' + @x + '''' + '),' + '''' + @x + '''' + '),' + cast(@invmrows as varchar(max)) + '))
)
SELECT
@cols
FROM (
SELECT
xy.RowNum,
xy.ColNum,
xy.ItemValue / (SQRT(x.ItemValue) * SQRT(y.ItemValue)) as ItemValue
FROM
mycte xy
JOIN
mycte x
ON
xy.RowNum = x.ColNum
AND xy.RowNum = x.RowNum
JOIN
mycte y
ON
xy.ColNum = y.ColNum
AND xy.ColNum = y.RowNum
)D
PIVOT (
SUM(ItemValue)
FOR
ColNum in (@cols)
) as P'
DECLARE @columns as varchar(max) =(
SELECT
'[' + CAST(SeriesValue as varchar(max)) + ']'
FROM
wct.SeriesInt(0, @mcols-1,NULL,NULL,NULL)
ORDER BY
SeriesValue
FOR XML PATH('')
)
SET @columns = REPLACE(@columns, '][','],[')
SET @sqlstring = REPLACE(@sqlstring,'@cols',@columns)
EXEC(@sqlString)
This produces the following result.
0
|
1
|
2
|
3
|
4
|
1
|
0.469381
|
-0.67892
|
-0.17848
|
-0.35299
|
0.469381
|
1
|
-0.50798
|
0.083179
|
-0.11464
|
-0.67892
|
-0.50798
|
1
|
0.260094
|
0.566801
|
-0.17848
|
0.083179
|
0.260094
|
1
|
0.52243
|
-0.35299
|
-0.11464
|
0.566801
|
0.52243
|
1
|
All of the examples in this article were entered in SQL Server 2012. XLeratorDB works in SQL Server 2005, 2008, 2012, and 2014, though some functions might have a different syntax in SQL Server 2005.
You should download the 15-day free trial today and see how you can use XLeratorDB to get the most out of the data in your database engine.