Pearson-r in SQL

From: -CELKO- <jcelko212_at_earthlink.net>
Date: 22 Dec 2004 12:13:22 -0800
Message-ID: <1103746401.990096.37280_at_f14g2000cwb.googlegroups.com>



I was playing with doing som stats in SQL, and I need some feedback. What should you do with NULLs in a computation of Pearson's r? Let me do a "cut & paste" of the definition of r and some SQL code at the bottom of this posting.

If (x, y) = (NULL, NULL) then my query will drop them both and the answer stays the same. That seems to follow the usual SQL aggregate function rules for singleton stats like AVG()about dropping NULLs before computing. That was the reason for using COUNT(x) and not COUNT(*). But what is the correct (or reasonable) behavior if (x,y) has one and only one NULL in the pair? My initial thoughts are:

  1. Drop the pair. That is quick and easy with a "WHERE x IS NOT NULL AND y IS NOT NULL" clause.
  2. Convert (x, NULL) to (x, AVG(x)) and (NULL, y) to (AVG(y), y)? Or something that will smooth out the missing values.
  3. Should the code use COUNT(*) instead of COUNT(x) to show that the missing values are affecting the answer?

The strength of the linear association between two variables is quantified by the correlation coefficient. Given a set of observations
(x1, y1), (x2,y2),...(xn,yn), the formula for computing the correlation
coefficient is given by this query:

CREATE TABLE Foobar
(x REAL NOT NULL,

y REAL NOT NULL);

INSERT INTO Foobar VALUES (1.0, 2.0);
INSERT INTO Foobar VALUES (2.0, 5.0);
INSERT INTO Foobar VALUES (3.0, 6.0);

-- r= 0.9608

SELECT
((SUM(x*y) - ((SUM(x)*SUM(y))/COUNT(x)))
/ (SQRT((SUM(x*x)- (SUM(x)*SUM(x)/COUNT(x))) * (SUM(y*y)- (SUM(y)* SUM(y)/COUNT(x)))))) AS pearson_r
FROM Foobar;

The correlation coefficient always takes a value between -1 and 1, with 1 or -1 indicating perfect correlation (all points would lie along a straight line in this case). A positive correlation indicates a positive association between the variables (increasing values in one variable correspond to increasing values in the other variable), while a negative correlation indicates a negative association between the variables (increasing values is one variable corresponding to decreasing values in the other variable). A correlation value close to 0 indicates no association between the variables.

Since the formula for calculating the correlation coefficient standardizes the variables, changes in scale or units of measurement will not affect its value. For this reason, the correlation coefficient is often more useful than a graphical depiction in determining the strength of the association between two variables. Received on Wed Dec 22 2004 - 21:13:22 CET

Original text of this message