New tTest functions in XLeratorDB/statistics 1.11
Jul
18
Written by:
Charles Flock
7/18/2012 4:56 PM
We have a new functionality for the tTest allowing you to perform sophisticated analysis of your SQL Server data directly on the database using TSQL.
In release 1.11 we have added three new functions for the tTest: TTEST_PAIRED (the paired tTest), TTEST_INDEP (the twosample tTest assuming equal variance), and TTEST_INDEPU (the twoample tTest assuming unequal variance). In addition to calculating the tTest pvalue (which was already available in the existing TTEST function), you can now get other test characteristics, including the test statistic, the degrees of freedom, the pooled variance, and the effect size.
It is possible to get some of this information out of EXCEL, but there are no EXCEL functions that will provide anything other than the pvalue. You need to use the Data Analysis features of EXCEL to get anything other than the pvalue.
Here’s an example of the paired tTest in EXCEL using Data Analysis.
tTest: Paired Two Sample for Means





Variable 1

Variable 2

Mean

175.3142857

207.4

Variance

300.7880952

176.2366667

Observations

7

7

Pearson Correlation

0.806734967


Hypothesized Mean Difference

0


df

6


t Stat

2.914289804


P(T<=t) onetail

0.013415026


t Critical onetail

1.943180281


P(T<=t) twotail

0.026830053


t Critical twotail

2.446911851


Notice that there is no effect size returned in the EXCEL table, and there is no way to calculate the effect size from the table values. Additionally, if the data change, then Data Analysis needs to be rerun, so it’s quite easy for the table and the results to get out of synch. This made EXCEL pretty much unsuitable as a testing platform.
EXCEL also produces results for something called P(T<=t) onetail and P(T<=t) twotail. These values are simply TINV(2*alpha, df) and TINV(alpha, df), respectively. Take note that this is TINV function and not the T.INV function in EXCEL 2010. Using T.INV, the formula would be T.INV(1alpha,df) and T.INV((1alpha)/2,df), respectively.
However, even knowing how the values are calculated doesn’t really tell us what the description means. None of the other applications that we looked at produced this value. If you need to produce this value, you can use the XLeratorDB TINV function to do the calculation.
This led us to use MATLAB and R for testing purposes. We tested against the ttest and ttest2 functions in MATLAB and the t.test function in R.
In R, you can just paste the following into the R console:
x<c(154.3,191,163.4,168.6,187,200.4,162.5)
y<c(230.4,202.8,202.8,216.8,192.9,194.4,211.7)
t.test(x,y,paired = TRUE)
and it will produce the following result.
Paired ttest
data: x and y
t = 2.9143, df = 6, pvalue = 0.02683
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
59.025696 5.145732
sample estimates:
mean of the differences
32.0857
You can see that R also does not report on effect size, but does include upper and lower confidence levels.
In MATLAB, paste this into the Command Window:
x=[154.3,191,163.4,168.6,187,200.4,162.5];
y=[230.4,202.8,202.8,216.8,192.9,194.4,211.7];
[h,p,ci,stats] = ttest(x,y)
And the following values are returned.
h = 1
p = 0.026830052564695
ci = 59.025696174081908 5.145732397346670
stats =
tstat: 2.914289804035312
df: 6
sd: 29.129162281776917
Like R, MATLAB does not calculate the effect size, but does include the confidence interval. It also includes an sd statistic, which is the standard deviation of the differences.
In SQL Server using XLeratorDB, we can enter the following into SQL Server Management Studio.
SELECT t.s
,t.descr
,wct.TTEST_PAIRED(x,y,t.s) as [Value]
FROM (VALUES
(154.3,230.4),
(191,202.8),
(163.4,202.8),
(168.6,216.8),
(187,192.9),
(200.4,194.4),
(162.5,211.7)
) n(x,y)
CROSS APPLY(VALUES
('P1','one tailed p_value'),
('P2','two tailed p_value'),
('T','t observed'),
('ES','effect size'),
('N','num observed'),
('DF','deg freedom'),
('MD','mean difference'),
('SDD','stdev difference'),
('SE','standard error'),
('R','Pearsons correlation'),
('LCL','lower confidence level'),
('UCL','upper confidence level')
)t(s,descr)
GROUP BY t.s, t.descr
This produces the following results.
s descr Value
  
P1 one tailed p_value 0.0134150262823471
P2 two tailed p_value 0.0268300525646943
T t observed 2.91428980403532
ES effect size 1.10149800997837
N num observed 7
DF deg freedom 6
MD mean difference 32.0857142857143
SDD stdev difference 29.1291622817769
SE standard error 11.0097884710321
R Pearsons correlation 0.806734967138168
LCL lower confidence level 59.025696174082
UCL upper confidence level 5.14573239734668
As you can see, XLeratorDB includes a variety of descriptive statistics about the paired tTest, which matches up to the results produced by EXCEL, R, and MATLAB.
However, we had to check the effect size calculations arithmetically. For this we used Summary of Effect Sizes and their Links to Inferential Statistics by Michael Furr published in March 2008. For the paired tTest, the effect size is calculated as Cohen’s d, using the Furr 1.2.3:
Where is the mean difference and s_{D} is the standard deviation of the difference scores. Financial users will immediately recognize this equation as the Sharpe ratio. This makes it possible to use the TTEST_PAIRED function to calculate the Sharpe Ratio in a very straightforward way. Here’s an example with monthly portfolio returns and a riskfree rate of .05 per month.
SELECT wct.TTEST_PAIRED(r,.05,'ES') as Sharpe
FROM (VALUES
(1,0.259),(2,0.198),(3,0.364),(4,0.081),(5,0.057),(6,0.055),
(7,0.188),(8,0.317),(9,0.240),(10,0.184),(11,0.01),(12,0.526)
) n(per,r)
This produces the following result.
Sharpe

0.834363920967124
The paired tTest requires a pair of values. If either of the values is NULL, the pair is not included in the calculations. If you went to perform a tTest against a hypothetical mean, simply enter the hypothetical mean as the yvalue instead of a column name. If you want to test the xvalues against the standard normal distribution, simply enter 0 as the yvalue. If you want to perform a paired tTest against a hypothetical mean, you can simple add the hypothetical mean to the yvalue in the SQL. For example, this calculation returns the values for a paired tTest where the hypothetical mean is 20.
SELECT wct.TTEST_PAIRED(control, treatment+20, 'T') as t_observed
,wct.TTEST_PAIRED(control, treatment+20, 'P1') as p1_value
,wct.TTEST_PAIRED(control, treatment+20, 'P2') as p2_value
,wct.TTEST_PAIRED(control, treatment+20, 'R') as PCC
,wct.TTEST_PAIRED(control, treatment+20, 'LCL') as lower_ci
,wct.TTEST_PAIRED(control, treatment+20, 'UCL') as upper_ci
FROM (VALUES
(237,194),
(289,240),
(257,230),
(228,186),
(303,265),
(275,222),
(262,242),
(304,281),
(244,240),
(233,212)) n(control,treatment)
This produces the following result.
t_observed

p1_value

p2_value

PCC

lower_ci

upper_ci

2.459760185

0.018086

0.036173

0.85834

0.964011

23.03599

Not only does the XLeratorDB function calculate a wide variety of descriptive statistics associated with the paired tTest, but since it is an object on the database, it can consume massive amounts of data very quickly, without having to transport the xvalues and yvalues over the network. In our test environment, a 2.5gHz Dual Core Pentium machine running SQL Server 2008 R2 with 4GB RAM running Windows Server 2008 we processed over 30,000 rows/second, with each row containing a paired x and yvalue.
Now, let’s look at the twosample tTest for independent means assuming equal variance.
Let’s say we had the following values that we wanted to perform the twosample tTest for independent means assuming equal variance
x

Y

165.9

212.1

210.3

203.5

166.8

210.3

182.3

228.4

182.1

206.2

218

203.2

170.1

224.9


202.6

In EXCEL we could simply put these values in columns A and B then using the Data Analysis tool, select the appropriate tTest and put the results in C1.
This produces the following output
tTest: TwoSample Assuming Equal Variances







Variable 1

Variable 2

Mean

185.0714286

211.4

Variance

443.802381

101.0114286

Observations

7

8

Pooled Variance

259.2226374


Hypothesized Mean Difference

0


df

13


t Stat

3.159651739


P(T<=t) onetail

0.0037652


t Critical onetail

1.770933396


P(T<=t) twotail

0.0075304


t Critical twotail

2.160368656


Again, there is no calculation of the effect size. It’s also important to note that the x and yvalues are no longer treated as pairs. Additionally, there is another value produced by EXCEL, called the Pooled Variance.
In R, you can just paste the following into the R console:
x<c(165.9,210.3,166.8,182.3,182.1,218.0,170.1)
y<c(212.1,203.5,210.3,228.4,206.2,203.2,224.9,202.6)
t.test(x,y,var.equal = TRUE)
This produces the following result.
Two Sample ttest
data: x and y
t = 3.1597, df = 13, pvalue = 0.00753
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
44.330372 8.326771
sample estimates:
mean of x mean of y
185.0714 211.4000
In MATLAB, paste this into the Command Window:
x=[165.9,210.3,166.8,182.3,182.1,218.0,170.1];
y=[212.1,203.5,210.3,228.4,206.2,203.2,224.9,202.6];
[h,p,ci,stats] = ttest2(x,y)
This produces the following result.
h = 1
p = 0.007530400275588
ci = 44.330372000894251 8.326770856248594
stats =
tstat: 3.159651739014238
df: 13
sd: 16.100392459894802
Again, neither R nor MATLAB produce the effect size, though both produce confidence intervals which are not produced by EXCEL. R does not provide a figure for the Pooled Variance and MATLAB produces a statistic labeled sd, which is equal to the square root of the pooled variance (which I guess makes it the pooled standard deviation).
Unlike the paired tTest, the tTest for independent means does not require that the number of xvalues equal the number of yvalues. In EXCEL and the other tools, x and y are treated as vectors and passed into the functions as vectors of unequal length. There is no vector data structure in SQL Server, and we can’t just pass in x and y as column names, since they can be of different lengths.
Supporting vectors of unequal length in SQL Server is quite achievable, provided that your data are properly normalized. Instead of thinking about a vector of xvalues and a vector of yvalues, we need to think about a set of data which contains test results and labels which identify the results as belonging to the xsample or the ysample. Of course, the labels do not have to be ‘x’ or ‘y’; they can be anything that you like. The important thing is that each test calculation must consist of exactly two labels and you must identify which of the two labels is the ‘x’ (or leftside) of the test. This is important as it will determine the sign on the tstatistic.
Using the same data as above in XLeratorDB you can paste this into SQL Server Management Studio.
SELECT t.s
,t.descr
,wct.TTEST_INDEP(l,val,t.s,'x') as Value
FROM (VALUES
('x',165.9), ('y',212.1),
('x',210.3), ('y',203.5),
('x',166.8), ('y',210.3),
('x',182.3), ('y',228.4),
('x',182.1), ('y',206.2),
('x',218), ('y',203.2),
('x',170.1), ('y',224.9),
('y',202.6)
) n(l,val)
CROSS APPLY(VALUES
('P1','one tailed pvalue'),
('P2','two tailed_pvalue'),
('T','t observed'),
('ES','effect size'),
('N1','num observed in sample one'),
('N2','num observed in sample two'),
('SD','pooled variance'),
('DF','deg freedom'),
('MD','mean difference'),
('SE','standard error'),
('LCL','lower confidence level'),
('UCL','upper confidence level')
)t(s,descr)
GROUP BY t.s, t.descr
This produces the following result
s descr Value
  
P1 one tailed pvalue 0.00376520013779435
P2 two tailed_pvalue 0.00753040027558871
T t observed 3.15965173901421
ES effect size 1.63527513345741
N1 num observed in sample one 7
N2 num observed in sample two 8
SD pooled variance 259.222637362642
DF deg freedom 13
MD mean difference 26.3285714285714
SE standard error 8.33274474635162
LCL lower confidence level 44.3303720008945
UCL upper confidence level 8.32677085624837
For TTEST_INDEP, the effect size is calculated as
For more information, see Furr 1.2.1
As with the paired tTest, the XLeratorDB function TTEST_INDEP calculates a wide variety of descriptive statistics associated with the twosample tTest for independent means assuming equal variance, Again, since it is an object on the database, it can consume massive amounts of data very quickly, without having to transport the xvalues and yvalues over the network. In our test environment, a 2.5gHz Dual Core Pentium machine running SQL Server 2008 R2 with 4GB RAM running Windows Server 2008 we processed over 50,000 rows/second, with each row containing a single xvalue or a single yvalue.
Finally, let’s look at the twosample t Test assuming Unequal Variances
This produces the following output
tTest: TwoSample Assuming Unequal Variances





Variable 1

Variable 2

Mean

185.0714286

211.4

Variance

443.802381

101.0114286

Observations

7

8

Hypothesized Mean Difference

0


df

8


t Stat

3.01956254


P(T<=t) onetail

0.008285256


t Critical onetail

1.859548038


P(T<=t) twotail

0.016570512


t Critical twotail

2.306004135


Again, there is no calculation of the effect size. Unlike twosample tTest assuming equal variance, there is no Pooled Variance. Also, EXCEL, unlike R and MATLAB, calculates the degrees of freedom as an integer, which is not what happens in the EXCEL TTEST function. This leads to the problem that the numbers produced by Data Analysis in EXCEL do not agree with the numbers produced by the TTEST function in EXCEL. I have no idea why EXCEL would choose to do this, other than the T.DIST function, which can be used to calculated the pvalue in EXCEL, truncates degrees of freedom if it is not an integer. However, the pvalue can also be calculated using the BETA.DIST function, which accepts noninteger values. In our implementation, we calculate the degrees of freedom as a floating point value, rather than exclusively as an integer, keeping TTEST_INDEPU consistent with R, MATLAB, and SPSS.
In R, you can just paste the following into the R console:
x<c(165.9,210.3,166.8,182.3,182.1,218.0,170.1)
y<c(212.1,203.5,210.3,228.4,206.2,203.2,224.9,202.6)
t.test(x,y)
This produces the following result.
Welch Two Sample ttest
data: x and y
t = 3.0196, df = 8.344, pvalue = 0.01579
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
46.291950 6.365193
sample estimates:
mean of x mean of y
185.0714 211.4000
In MATLAB, paste this into the Command Window:
x=[165.9,210.3,166.8,182.3,182.1,218.0,170.1];
y=[212.1,203.5,210.3,228.4,206.2,203.2,224.9,202.6];
[h,p,ci,stats] = ttest2(x,y,.05,'both','unequal')
This produces the following result.
h = 1
p = 0.015785485650734
ci = 46.291949510052831 6.365193347090017
stats =
tstat: 3.019562539931532
df: 8.344151362650967
sd: [21.066617691323422 10.050444197717267]
Using the same data as above in XLeratorDB you can paste this into SQL Server Management Studio.
SELECT t.s
,t.descr
,wct.TTEST_INDEPU(l,val,t.s,'x') as Value
FROM (VALUES
('x',165.9), ('y',212.1),
('x',210.3), ('y',203.5),
('x',166.8), ('y',210.3),
('x',182.3), ('y',228.4),
('x',182.1), ('y',206.2),
('x',218), ('y',203.2),
('x',170.1), ('y',224.9),
('y',202.6)
) n(l,val)
CROSS APPLY(VALUES
('P1','one tailed pvalue'),
('P2','two tailed_pvalue'),
('T','t observed'),
('ES','effect size'),
('N1','num observed in sample one'),
('N2','num observed in sample two'),
('DF','deg freedom'),
('MD','mean difference'),
('SE','standard error'),
('LCL','lower confidence level'),
('UCL','upper confidence level')
)t(s,descr)
GROUP BY t.s, t.descr
This produces the following result.
s descr Value
  
P1 one tailed pvalue 0.00789274282536726
P2 two tailed_pvalue 0.0157854856507345
T t observed 3.01956253993151
ES effect size 1.63527513345741
N1 num observed in sample one 7
N2 num observed in sample two 8
DF deg freedom 8.34415136265098
MD mean difference 26.3285714285714
SE standard error 8.71933304258326
LCL lower confidence level 46.291949510053
UCL upper confidence level 6.36519334708986
For TTEST_INDEPU, the effect size is calculated as
For more information, see Furr 1.2.1
As with the paired tTest, the XLeratorDB function TTEST_INDEPU calculates a wide variety of descriptive statistics associated with the twosample tTest for independent means assuming equal variance, Again, since it is an object on the database, it can consume massive amounts of data very quickly, without having to transport the xvalues and yvalues over the network. In our test environment, a 2.5gHz Dual Core Pentium machine running SQL Server 2008 R2 with 4GB RAM running Windows Server 2008 we processed over 50,000 rows/second, with each row containing a single xvalue or a single yvalue, which is pretty much the same as TTEST_INDEP.
We think that the new tTest functions are a powerful addition to XLeratorDB, allowing users to perform sophisticated statistical calculations right on the database, without having to extract, move, or link their data to any other platform. Let us know what you think.