Re: Pearson-r in SQL

From: Matthias Klaey <mpky_at_hotmail.com>
Date: Thu, 23 Dec 2004 11:12:56 +0100
Message-ID: <045ls0tjujc7oj024pki0phq606n78v3ih_at_4ax.com>


On 22 Dec 2004 12:13:22 -0800, "-CELKO-" <jcelko212_at_earthlink.net> wrote:

>I was playing with doing some stats in SQL, and I need some feedback.
>What should you do with NULLs in a computation of Pearson's r? Let me
>do a "cut & paste" of the definition of r and some SQL code at the
>bottom of this posting.
>
>If (x, y) = (NULL, NULL) then my query will drop them both and the
>answer stays the same. That seems to follow the usual SQL aggregate
>function rules for singleton stats like AVG() about dropping NULLs
>before computing. That was the reason for using COUNT(x) and not
>COUNT(*).
>
>But what is the correct (or reasonable) behavior if (x,y) has one and
>only one NULL in the pair? My initial thoughts are:
>
>1) Drop the pair. That is quick and easy with a "WHERE x IS NOT NULL
>AND y IS NOT NULL" clause.
>
>2) Convert (x, NULL) to (x, AVG(x)) and (NULL, y) to (AVG(y), y)? Or
>something that will smooth out the missing values.

I think you will always drop the (Null, Null) pairs. With the pairs (Null, y) and (x, Null) it depends on the "cost" of dropping some potentially valuable information.
If you want to keep them, I would first calculate a linear regression on the complete pairs, say y = a + b*x, and then fill in the expected values, i.e., yi(missing) = a + b*xi for a pair (xi, Null) and xi(missing) = (yi - a) / b for a pair (Null, yi), and then calculate the correlation coefficient.
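
A rough sketch of that imputation in SQL, assuming a variant of Foobar
whose columns are nullable (the DDL quoted below declares them NOT
NULL). The inner derived table fits b and a on the complete pairs with
the textbook least-squares formulas, and COALESCE fills in the missing
half of each mixed pair:

SELECT COALESCE(F.x, (F.y - S.a) / S.b) AS x_filled, -- impute x from y
       COALESCE(F.y, S.a + S.b * F.x) AS y_filled    -- impute y from x
  FROM Foobar AS F,
       (SELECT T.b, T.my - T.b * T.mx AS a -- intercept from the means
          FROM (SELECT AVG(x) AS mx, AVG(y) AS my,
                       (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y))
                     / (COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)) AS b
                  FROM Foobar
                 WHERE x IS NOT NULL -- fit on complete pairs only
                   AND y IS NOT NULL) AS T) AS S
 WHERE F.x IS NOT NULL OR F.y IS NOT NULL; -- drop (Null, Null) pairs

The filled pairs can then go into the correlation query unchanged; the
division by S.b of course assumes a nonzero slope.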

>3) Should the code use COUNT(*) instead of COUNT(x) to show that the
>missing values are affecting the answer?

You should always use the number of pairs that actually enter the calculation.

>
>================
>The strength of the linear association between two variables is
>quantified by the correlation coefficient. Given a set of observations
>(x1, y1), (x2,y2),...(xn,yn), the formula for computing the correlation
>coefficient is given by this query:
>
>CREATE TABLE Foobar
>(x REAL NOT NULL,
>y REAL NOT NULL);
>
>INSERT INTO Foobar VALUES (1.0, 2.0);
>INSERT INTO Foobar VALUES (2.0, 5.0);
>INSERT INTO Foobar VALUES (3.0, 6.0);
>-- r= 0.9608
>
>SELECT
>((SUM(x*y) - ((SUM(x)*SUM(y))/COUNT(x)))
>/ (SQRT((SUM(x*x)- (SUM(x)*SUM(x)/COUNT(x)))
>* (SUM(y*y)- (SUM(y)* SUM(y)/COUNT(x))))))
>AS pearson_r
>FROM Foobar;
>
[...]

With all due respect, this is a numerically unstable algorithm; see, e.g.,

  Chan, T.F., Golub, G.H., and LeVeque, R.J. (1983), "Algorithms for
  Computing the Sample Variance: Analysis and Recommendations",
  The American Statistician, 37(3), 242-247

  http://www.amstat.org/publications/tas/

The earliest reference in this paper goes back to

  Welford, B.P. (1962), "Note on a Method for Calculating Corrected Sums of Squares and Products", Technometrics, 4, 419-420

  http://www.amstat.org/publications/technometrics/
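
For reference, the one-pass recurrence from Welford's note updates the
running mean M and the corrected sum of squares S without ever forming
the large sums whose cancellation causes the instability; with
M(1) = x(1) and S(1) = 0,

  M(k) = M(k-1) + (x(k) - M(k-1)) / k
  S(k) = S(k-1) + (x(k) - M(k-1)) * (x(k) - M(k))

so that S(n) / (n - 1) is the sample variance; the paper gives the
analogous update for the corrected sum of products needed for r.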

Since computing the average has become pretty "cheap", I would much prefer

SUM((x - AVG(x)) * (y - AVG(y)))
/ SQRT(SUM((x - AVG(x))^2) * SUM((y - AVG(y))^2))

I don't know whether the exponentiation operator ^2 is standard SQL; if not, just use multiplication.
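
Note that standard SQL does not allow one aggregate inside another, so
as runnable SQL the two-pass form needs the means computed first, e.g.
in a derived table. A minimal sketch against the Foobar table above:

SELECT SUM((F.x - M.mx) * (F.y - M.my))
     / SQRT(SUM((F.x - M.mx) * (F.x - M.mx))
          * SUM((F.y - M.my) * (F.y - M.my))) AS pearson_r
  FROM Foobar AS F,
       (SELECT AVG(x) AS mx, AVG(y) AS my -- first pass: the means
          FROM Foobar) AS M;

On the three sample rows it returns the same r = 0.9608, but the
deviations from the means stay small, which is the whole point of the
two-pass method.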

Greetings
Matthias Kläy

-- 
Matthias Kläy
www.kcc.ch