Login     Register

        Contact Us     Search

New t-Test functions in XLeratorDB/statistics 1.11

Jul 18

Written by: Charles Flock
7/18/2012 4:56 PM  RssIcon

We have a new functionality for the t-Test allowing you to perform sophisticated analysis of your SQL Server data directly on the database using TSQL.
In release 1.11 we have added three new functions for the t-Test: TTEST_PAIRED (the paired t-Test), TTEST_INDEP (the two-sample t-Test assuming equal variance), and TTEST_INDEPU (the two-ample t-Test assuming unequal variance). In addition to calculating the t-Test p-value (which was already available in the existing TTEST function), you can now get other test characteristics, including the test statistic, the degrees of freedom, the pooled variance, and the effect size.
It is possible to get some of this information out of EXCEL, but there are no EXCEL functions that will provide anything other than the p-value. You need to use the Data Analysis features of EXCEL to get anything other than the p-value.
Here’s an example of the paired t-Test in EXCEL using Data Analysis.


 

t-Test: Paired Two Sample for Means
 
 
 
 
Variable 1
Variable 2
Mean
175.3142857
207.4
Variance
300.7880952
176.2366667
Observations
7
7
Pearson Correlation
-0.806734967
 
Hypothesized Mean Difference
0
 
df
6
 
t Stat
-2.914289804
 
P(T<=t) one-tail
0.013415026
 
t Critical one-tail
1.943180281
 
P(T<=t) two-tail
0.026830053
 
t Critical two-tail
2.446911851
 

Notice that there is no effect size returned in the EXCEL table, and there is no way to calculate the effect size from the table values. Additionally, if the data change, then Data Analysis needs to be re-run, so it’s quite easy for the table and the results to get out of synch. This made EXCEL pretty much unsuitable as a testing platform.
EXCEL also produces results for something called P(T<=t) one-tail and P(T<=t) two-tail. These values are simply TINV(2*alpha, df) and TINV(alpha, df), respectively. Take note that this is TINV function and not the T.INV function in EXCEL 2010. Using T.INV, the formula would be T.INV(1-alpha,df) and T.INV((1-alpha)/2,df), respectively.
However, even knowing how the values are calculated doesn’t really tell us what the description means. None of the other applications that we looked at produced this value. If you need to produce this value, you can use the XLeratorDB TINV function to do the calculation.
This led us to use MATLAB and R for testing purposes. We tested against the ttest and ttest2 functions in MATLAB and the t.test function in R.
In R, you can just paste the following into the R console:
x<-c(154.3,191,163.4,168.6,187,200.4,162.5)
y<-c(230.4,202.8,202.8,216.8,192.9,194.4,211.7)
t.test(x,y,paired = TRUE)
 
and it will produce the following result.
        Paired t-test
 
data: x and y
t = -2.9143, df = 6, p-value = 0.02683
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -59.025696 -5.145732
sample estimates:
mean of the differences
              -32.0857
You can see that R also does not report on effect size, but does include upper and lower confidence levels.
In MATLAB, paste this into the Command Window:
x=[154.3,191,163.4,168.6,187,200.4,162.5];
y=[230.4,202.8,202.8,216.8,192.9,194.4,211.7];
[h,p,ci,stats] = ttest(x,y)
 
And the following values are returned.
h =   1
p =   0.026830052564695
ci =  -59.025696174081908 -5.145732397346670
stats =
    tstat: -2.914289804035312
       df: 6
       sd: 29.129162281776917
Like R, MATLAB does not calculate the effect size, but does include the confidence interval. It also includes an sd statistic, which is the standard deviation of the differences.
In SQL Server using XLeratorDB, we can enter the following into SQL Server Management Studio.
SELECT t.s
,t.descr
,wct.TTEST_PAIRED(x,y,t.s) as [Value]
FROM (VALUES
      (154.3,230.4),
      (191,202.8),
      (163.4,202.8),
      (168.6,216.8),
      (187,192.9),
      (200.4,194.4),
      (162.5,211.7)
      ) n(x,y)
CROSS APPLY(VALUES
       ('P1','one tailed p_value'),
       ('P2','two tailed p_value'),
       ('T','t observed'),
       ('ES','effect size'),
       ('N','num observed'),
       ('DF','deg freedom'),
       ('MD','mean difference'),
       ('SDD','stdev difference'),
       ('SE','standard error'),
       ('R','Pearsons correlation'),
       ('LCL','lower confidence level'),
       ('UCL','upper confidence level')
       )t(s,descr)
GROUP BY t.s, t.descr
This produces the following results.
s    descr                                   Value
---- ---------------------- ----------------------
P1   one tailed p_value         0.0134150262823471
P2   two tailed p_value         0.0268300525646943
T    t observed                  -2.91428980403532
ES   effect size                 -1.10149800997837
N    num observed                                7
DF   deg freedom                                 6
MD   mean difference             -32.0857142857143
SDD  stdev difference             29.1291622817769
SE   standard error               11.0097884710321
R    Pearsons correlation       -0.806734967138168
LCL  lower confidence level       -59.025696174082
UCL  upper confidence level      -5.14573239734668
As you can see, XLeratorDB includes a variety of descriptive statistics about the paired t-Test, which matches up to the results produced by EXCEL, R, and MATLAB.
However, we had to check the effect size calculations arithmetically. For this we used Summary of Effect Sizes and their Links to Inferential Statistics by Michael Furr published in March 2008. For the paired t-Test, the effect size is calculated as Cohen’s d, using the Furr 1.2.3:

Where  is the mean difference and sD is the standard deviation of the difference scores. Financial users will immediately recognize this equation as the Sharpe ratio. This makes it possible to use the TTEST_PAIRED function to calculate the Sharpe Ratio in a very straightforward way. Here’s an example with monthly portfolio returns and a risk-free rate of .05 per month.
SELECT wct.TTEST_PAIRED(r,.05,'ES') as Sharpe
FROM (VALUES
      (1,0.259),(2,0.198),(3,0.364),(4,-0.081),(5,0.057),(6,0.055),
      (7,0.188),(8,0.317),(9,0.240),(10,0.184),(11,-0.01),(12,0.526)
      ) n(per,r)
This produces the following result.
                Sharpe
----------------------
     0.834363920967124
The paired t-Test requires a pair of values. If either of the values is NULL, the pair is not included in the calculations. If you went to perform a t-Test against a hypothetical mean, simply enter the hypothetical mean as the y-value instead of a column name. If you want to test the x-values against the standard normal distribution, simply enter 0 as the y-value. If you want to perform a paired t-Test against a hypothetical mean, you can simple add the hypothetical mean to the y-value in the SQL. For example, this calculation returns the values for a paired t-Test where the hypothetical mean is 20.
SELECT wct.TTEST_PAIRED(control, treatment+20, 'T') as t_observed
,wct.TTEST_PAIRED(control, treatment+20, 'P1') as p1_value
,wct.TTEST_PAIRED(control, treatment+20, 'P2') as p2_value
,wct.TTEST_PAIRED(control, treatment+20, 'R') as PCC
,wct.TTEST_PAIRED(control, treatment+20, 'LCL') as lower_ci
,wct.TTEST_PAIRED(control, treatment+20, 'UCL') as upper_ci
FROM (VALUES
      (237,194),
      (289,240),
      (257,230),
      (228,186),
      (303,265),
      (275,222),
      (262,242),
      (304,281),
      (244,240),
      (233,212)) n(control,treatment)
This produces the following result.

t_observed
p1_value
p2_value
PCC
lower_ci
upper_ci
2.459760185
0.018086
0.036173
0.85834
0.964011
23.03599

Not only does the XLeratorDB function calculate a wide variety of descriptive statistics associated with the paired t-Test, but since it is an object on the database, it can consume massive amounts of data very quickly, without having to transport the x-values and y-values over the network. In our test environment, a 2.5gHz Dual Core Pentium machine running SQL Server 2008 R2 with 4GB RAM running Windows Server 2008 we processed over 30,000 rows/second, with each row containing a paired x- and y-value.
Now, let’s look at the two-sample t-Test for independent means assuming equal variance.
Let’s say we had the following values that we wanted to perform the two-sample t-Test for independent means assuming equal variance

x
Y
165.9
212.1
210.3
203.5
166.8
210.3
182.3
228.4
182.1
206.2
218
203.2
170.1
224.9
 
202.6

In EXCEL we could simply put these values in columns A and B then using the Data Analysis tool, select the appropriate t-Test and put the results in C1.


This produces the following output

t-Test: Two-Sample Assuming Equal Variances
 
 
 
 
 
 
Variable 1
Variable 2
Mean
185.0714286
211.4
Variance
443.802381
101.0114286
Observations
7
8
Pooled Variance
259.2226374
 
Hypothesized Mean Difference
0
 
df
13
 
t Stat
-3.159651739
 
P(T<=t) one-tail
0.0037652
 
t Critical one-tail
1.770933396
 
P(T<=t) two-tail
0.0075304
 
t Critical two-tail
2.160368656
 

 
Again, there is no calculation of the effect size. It’s also important to note that the x- and y-values are no longer treated as pairs. Additionally, there is another value produced by EXCEL, called the Pooled Variance.
In R, you can just paste the following into the R console:
x<-c(165.9,210.3,166.8,182.3,182.1,218.0,170.1)
y<-c(212.1,203.5,210.3,228.4,206.2,203.2,224.9,202.6)
t.test(x,y,var.equal = TRUE)
 
This produces the following result.
        Two Sample t-test
 
data: x and y
t = -3.1597, df = 13, p-value = 0.00753
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -44.330372 -8.326771
sample estimates:
mean of x mean of y
 185.0714 211.4000
 
In MATLAB, paste this into the Command Window:
x=[165.9,210.3,166.8,182.3,182.1,218.0,170.1];
y=[212.1,203.5,210.3,228.4,206.2,203.2,224.9,202.6];
[h,p,ci,stats] = ttest2(x,y)
 
This produces the following result.
h = 1
p = 0.007530400275588
ci = -44.330372000894251 -8.326770856248594
stats =
    tstat: -3.159651739014238
       df: 13
       sd: 16.100392459894802
Again, neither R nor MATLAB produce the effect size, though both produce confidence intervals which are not produced by EXCEL. R does not provide a figure for the Pooled Variance and MATLAB produces a statistic labeled sd, which is equal to the square root of the pooled variance (which I guess makes it the pooled standard deviation).
Unlike the paired t-Test, the t-Test for independent means does not require that the number of x-values equal the number of y-values. In EXCEL and the other tools, x and y are treated as vectors and passed into the functions as vectors of unequal length. There is no vector data structure in SQL Server, and we can’t just pass in x and y as column names, since they can be of different lengths.
Supporting vectors of unequal length in SQL Server is quite achievable, provided that your data are properly normalized. Instead of thinking about a vector of x-values and a vector of y-values, we need to think about a set of data which contains test results and labels which identify the results as belonging to the x-sample or the y-sample. Of course, the labels do not have to be ‘x’ or ‘y’; they can be anything that you like. The important thing is that each test calculation must consist of exactly two labels and you must identify which of the two labels is the ‘x’ (or left-side) of the test. This is important as it will determine the sign on the t-statistic.
Using the same data as above in XLeratorDB you can paste this into SQL Server Management Studio.
SELECT t.s
,t.descr
,wct.TTEST_INDEP(l,val,t.s,'x') as Value
FROM (VALUES
      ('x',165.9),      ('y',212.1),
      ('x',210.3),      ('y',203.5),
      ('x',166.8),      ('y',210.3),
      ('x',182.3),      ('y',228.4),
      ('x',182.1),      ('y',206.2),
      ('x',218),        ('y',203.2),
      ('x',170.1),      ('y',224.9),
                        ('y',202.6)
      ) n(l,val)
CROSS APPLY(VALUES
       ('P1','one tailed p-value'),
       ('P2','two tailed_p-value'),
       ('T','t observed'),
       ('ES','effect size'),
       ('N1','num observed in sample one'),
       ('N2','num observed in sample two'),
       ('SD','pooled variance'),
       ('DF','deg freedom'),
       ('MD','mean difference'),
       ('SE','standard error'),
       ('LCL','lower confidence level'),
       ('UCL','upper confidence level')
       )t(s,descr)
GROUP BY t.s, t.descr
This produces the following result
s    descr                                       Value
---- -------------------------- ----------------------
P1   one tailed p-value            0.00376520013779435
P2   two tailed_p-value            0.00753040027558871
T    t observed                      -3.15965173901421
ES   effect size                     -1.63527513345741
N1   num observed in sample one                      7
N2   num observed in sample two                      8
SD   pooled variance                  259.222637362642
DF   deg freedom                                    13
MD   mean difference                 -26.3285714285714
SE   standard error                   8.33274474635162
LCL  lower confidence level          -44.3303720008945
UCL  upper confidence level          -8.32677085624837
For TTEST_INDEP, the effect size is calculated as

For more information, see Furr 1.2.1
As with the paired t-Test, the XLeratorDB function TTEST_INDEP calculates a wide variety of descriptive statistics associated with the two-sample t-Test for independent means assuming equal variance, Again, since it is an object on the database, it can consume massive amounts of data very quickly, without having to transport the x-values and y-values over the network. In our test environment, a 2.5gHz Dual Core Pentium machine running SQL Server 2008 R2 with 4GB RAM running Windows Server 2008 we processed over 50,000 rows/second, with each row containing a single x-value or a single y-value.
Finally, let’s look at the two-sample t Test assuming Unequal Variances
This produces the following output

t-Test: Two-Sample Assuming Unequal Variances
 
 
 
 
Variable 1
Variable 2
Mean
185.0714286
211.4
Variance
443.802381
101.0114286
Observations
7
8
Hypothesized Mean Difference
0
 
df
8
 
t Stat
-3.01956254
 
P(T<=t) one-tail
0.008285256
 
t Critical one-tail
1.859548038
 
P(T<=t) two-tail
0.016570512
 
t Critical two-tail
2.306004135
 

 
Again, there is no calculation of the effect size. Unlike two-sample t-Test assuming equal variance, there is no Pooled Variance. Also, EXCEL, unlike R and MATLAB, calculates the degrees of freedom as an integer, which is not what happens in the EXCEL TTEST function. This leads to the problem that the numbers produced by Data Analysis in EXCEL do not agree with the numbers produced by the TTEST function in EXCEL. I have no idea why EXCEL would choose to do this, other than the T.DIST function, which can be used to calculated the p-value in EXCEL, truncates degrees of freedom if it is not an integer. However, the p-value can also be calculated using the BETA.DIST function, which accepts non-integer values. In our implementation, we calculate the degrees of freedom as a floating point value, rather than exclusively as an integer, keeping TTEST_INDEPU consistent with R, MATLAB, and SPSS.
In R, you can just paste the following into the R console:
x<-c(165.9,210.3,166.8,182.3,182.1,218.0,170.1)
y<-c(212.1,203.5,210.3,228.4,206.2,203.2,224.9,202.6)
t.test(x,y)
This produces the following result.
 
        Welch Two Sample t-test
 
data: x and y
t = -3.0196, df = 8.344, p-value = 0.01579
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -46.291950 -6.365193
sample estimates:
mean of x mean of y
 185.0714 211.4000
 
In MATLAB, paste this into the Command Window:
x=[165.9,210.3,166.8,182.3,182.1,218.0,170.1];
y=[212.1,203.5,210.3,228.4,206.2,203.2,224.9,202.6];
[h,p,ci,stats] = ttest2(x,y,.05,'both','unequal')
This produces the following result.
 
h =     1
p =     0.015785485650734
ci =  -46.291949510052831 -6.365193347090017
stats =
    tstat: -3.019562539931532
       df: 8.344151362650967
       sd: [21.066617691323422 10.050444197717267]
 
Using the same data as above in XLeratorDB you can paste this into SQL Server Management Studio.
SELECT t.s
,t.descr
,wct.TTEST_INDEPU(l,val,t.s,'x') as Value
FROM (VALUES
      ('x',165.9),      ('y',212.1),
      ('x',210.3),      ('y',203.5),
      ('x',166.8),      ('y',210.3),
      ('x',182.3),      ('y',228.4),
      ('x',182.1),      ('y',206.2),
      ('x',218),        ('y',203.2),
      ('x',170.1),      ('y',224.9),
                        ('y',202.6)
      ) n(l,val)
CROSS APPLY(VALUES
       ('P1','one tailed p-value'),
       ('P2','two tailed_p-value'),
       ('T','t observed'),
       ('ES','effect size'),
       ('N1','num observed in sample one'),
       ('N2','num observed in sample two'),
       ('DF','deg freedom'),
       ('MD','mean difference'),
       ('SE','standard error'),
       ('LCL','lower confidence level'),
       ('UCL','upper confidence level')
       )t(s,descr)
GROUP BY t.s, t.descr
 
This produces the following result.
s    descr                                       Value
---- -------------------------- ----------------------
P1  one tailed p-value            0.00789274282536726
P2   two tailed_p-value             0.0157854856507345
T    t observed                      -3.01956253993151
ES   effect size                     -1.63527513345741
N1   num observed in sample one                      7
N2   num observed in sample two                      8
DF   deg freedom                      8.34415136265098
MD   mean difference                 -26.3285714285714
SE   standard error                   8.71933304258326
LCL lower confidence level           -46.291949510053
UCL upper confidence level          -6.36519334708986
 
For TTEST_INDEPU, the effect size is calculated as

For more information, see Furr 1.2.1
As with the paired t-Test, the XLeratorDB function TTEST_INDEPU calculates a wide variety of descriptive statistics associated with the two-sample t-Test for independent means assuming equal variance, Again, since it is an object on the database, it can consume massive amounts of data very quickly, without having to transport the x-values and y-values over the network. In our test environment, a 2.5gHz Dual Core Pentium machine running SQL Server 2008 R2 with 4GB RAM running Windows Server 2008 we processed over 50,000 rows/second, with each row containing a single x-value or a single y-value, which is pretty much the same as TTEST_INDEP.

We think that the new t-Test functions are a powerful addition to XLeratorDB, allowing users to perform sophisticated statistical calculations right on the database, without having to extract, move, or link their data to any other platform. Let us know what you think.

Tags:
Categories:
Copyright 2008-2013 WestClinTech LLC         Privacy Policy        Terms of Service