Probably more bunk has been written about correlation than about any other statistic. We'll keep this short and straight. At its simplest, correlation is a statistical measure of non-random, linear association between pairs of values in a dataset. It's denoted by r, and varies from -1 through +1, where -1 indicates perfect inverse correlation (the regression line goes down left to right), 0 indicates no correlation (there is no regression line; it's just a scatterplot), and +1 indicates perfect direct correlation (the regression line goes up left to right).
For an example we'll use a bit of imaginary data: drop table if exists xydata; create table xydata (id int, x int, y float); insert into xydata values (1 , 68, 4.1),(2 , 71, 4.6),(3 , 62, 3.8),(4 , 75, 4.4),(5 , 58, 3.2), (6 , 60, 3.1),(7 , 67, 3.8),(8 , 68, 4.1),(9 , 71, 4.3),(10, 69, 3.7), (11, 68, 3.5),(12, 67, 3.2),(13, 63, 3.7),(14, 62, 3.3),(15, 60, 3.4), (16, 63, 4.0),(17, 65, 4.1),(18, 67, 3.8),(19, 63, 3.4),(20, 61, 3.6);If you like to think about such problems concretely, you can think of `id` as a subject's id, `x` as a subject's height, and `y` as a subject's score on a self-confidence questionnaire, so we would be computing a correlation between height and self-confidence.
There are many correlation formulas. Most commonly used is the You'll also want the And, you need to calculate the probability that the coefficient you calculated is significantly different from zero. Here we use a simple standard formula to calculate a For this example we restrict ourselves to calculating correlation assuming a normal distribution and no missing values. Calculating the correlation coefficient needs two passes: a first pass to calculate basic statistical quantities, then a second pass to calculate the slope, intercept and correlation coefficient from those basic quantities. Calculating the coefficient of regression and statistical significance needs a third pass. So this will be a 4-step:
If we're computing correlation in a database like MySQL, then, you'd expect Views will be useful. Unfortunately we're immediately bitten by two limitations in the MySQL implementation of Views: subqueries in the Then do we need to encapsulate correlation in a stored procedure? We could, but we needn't because SQL implementations like MySQL provide the required basic statistical quantities, permitting us to do it all in one four-layer query ...
SELECT -- Step 4 N, Slope, avgY - slope*avgX AS Intercept, r AS Correlation, CoeffOfReg, Round( r * SqRt( (N-2)/(1-r*r) ), 2 ) AS t, N-2 AS df FROM ( SELECT -- Step 3 N, avgX, avgY, slope, intercept, r, FORMAT( 1 - SUM((y - intercept - slope*x)*(y - intercept - slope*x))/ ((N-1)*varY), 5 ) AS CoeffOfReg FROM xydata CROSS JOIN ( SELECT -- Step 2 N, avgX, avgY, varY, slope, r, avgY - slope*avgX AS intercept FROM ( SELECT N, avgX, avgY, varY, FORMAT(( N*sumXY - sumX*sumY ) / ( N*sumsqX - sumX*sumX ), 5 ) AS slope, FORMAT(( sumXY - n*avgX*avgY ) / ( (N-1) * SQRT(varX) * SQRT(varY)), 5 ) AS r FROM ( SELECT -- Step 1 COUNT(x) AS N, AVG(x) AS avgX, SUM(x) AS sumX, SUM(x*x) AS sumsqX, VAR_SAMP(x) AS varX, AVG(y) AS avgY, SUM(y) AS sumY, SUM(y*y) AS sumsqY, VAR_SAMP(y) AS varY, SUM(x*y) AS sumXY FROM xydata ) AS sums ) AS calc ) AS stats ) AS result; +------+---------+---------------------+-------------+------------+------+------+ | N | Slope | Intercept | Correlation | CoeffOfReg | t | df | +------+---------+---------------------+-------------+------------+------+------+ | 20 | 0.07066 | -0.8661640047683719 | 0.73064 | 0.53383 | 4.54 | 18 | +------+---------+---------------------+-------------+------------+------+------+You can look up the significance of the t value at a pre-decided level of significance for the number of degrees of freedom in your sample in a standard statistical table of t values, for example this one. As you can see there, the higher the degrees of freedom, the lower the t value needed for a given level of significance. |