Path: dp-news.maxwell.syr.edu!spool.maxwell.syr.edu!drn.maxwell.syr.edu!news.maxwell.syr.edu!border1.nntp.dca.giganews.com!nntp.giganews.com!ngpeer.news.aol.com!news.compuserve.com!news-master.compuserve.com!not-for-mail
From: Matthias Klaey <mpky@hotmail.com>
Newsgroups: comp.databases.theory
Subject: Re: Pearson-r in SQL
Date: Thu, 23 Dec 2004 11:12:56 +0100
Organization: Klaey Computing AG
Lines: 97
Message-ID: <045ls0tjujc7oj024pki0phq606n78v3ih@4ax.com>
References: <1103746401.990096.37280@f14g2000cwb.googlegroups.com>
NNTP-Posting-Host: zux167-174.adsl.green.ch
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-Trace: ngspool-d02.news.aol.com 1103796779 4896 80.254.167.174 (23 Dec 2004 10:12:59 GMT)
X-Complaints-To: newsmaster@compuserve.com
NNTP-Posting-Date: Thu, 23 Dec 2004 10:12:59 +0000 (UTC)
X-Newsreader: Forte Agent 2.0/32.652
Xref: dp-news.maxwell.syr.edu comp.databases.theory:29131

On 22 Dec 2004 12:13:22 -0800, "-CELKO-" <jcelko212@earthlink.net>
wrote:

>I was playing with doing som stats in SQL, and I need some feedback.
>What should you do with NULLs in a computation of Pearson's r?  Let me
>do a "cut & paste" of the definition of r and some SQL code at the
>bottom of this posting.
>
>If (x, y) = (NULL, NULL) then my query will drop them both and the
>answer stays the same.  That seems to follow the usual SQL aggregate
>function rules for singleton stats like AVG()about dropping NULLs
>before computing.  That was the reason for using COUNT(x) and not
>COUNT(*).
>
>But what is the correct (or reasonable) behavior if (x,y) has one and
>only one NULL in the pair?  My initial thoughts are:
>
>1) Drop the pair.  That is quick and easy with a "WHERE x IS NOT NULL
>AND y IS NOT NULL" clause.
>
>2) Convert (x, NULL) to (x, AVG(x)) and (NULL, y) to (AVG(y), y)?  Or
>something that will smooth out the missing values.

I think you always will drop the pairs (Null, Null). With the pairs
(Null, y) and (y, Null) it depends on the "cost" of dropping some
potentially valuable information.
If you want to keep them, I would first calculate a linear regression
with the remaining pairs, say y = a + b*x, and then fill in the
expected values, i.e., yi(missing) = a + b*xi for the pair (xi, Null),
and xi(missing) = (y - a) / b, and then calculate the correlation
coefficient.

>3) Should the code use COUNT(*) instead of COUNT(x) to show that the
>missing values are affecting the answer?

You always use the number of pairs that are effectively in the
calculation.

>
>================
>The strength of the linear association between two variables is
>quantified by the correlation coefficient. Given a set of observations
>(x1, y1), (x2,y2),...(xn,yn), the formula for computing the correlation
>coefficient is given by this query:
>
>CREATE TABLE Foobar
>(x REAL NOT NULL,
>y REAL NOT NULL);
>
>INSERT INTO Foobar VALUES (1.0, 2.0);
>INSERT INTO Foobar VALUES (2.0, 5.0);
>INSERT INTO Foobar VALUES (3.0, 6.0);
>-- r= 0.9608
>
>SELECT
>((SUM(x*y) - ((SUM(x)*SUM(y))/COUNT(x)))
>/ (SQRT((SUM(x*x)- (SUM(x)*SUM(x)/COUNT(x)))
>* (SUM(y*y)- (SUM(y)* SUM(y)/COUNT(x))))))
>AS pearson_r
>FROM Foobar;
>
[...]

With all due respect, this is a numerically unstable algorithm, see,
e.g.

  Tony F. Chan, Gene H. Golub, and Randall J. LeVeque
  "Algorithms for Computing the Sample Variance: Analysis and
Recommendations"
  The American Statistician, August 1983, Vol. 37, No. 3, pp 242-247

  http://www.amstat.org/publications/tas/

The earliest reference in this paper goes back to

  Welford, B.P. (1962), "Note on a Method for Calculating Corrected
Sums of Squares and Products", Technometrics, 4, 419-420

  http://www.amstat.org/publications/technometrics/


Since the computation of the average has become pretty "cheap", I
would much more prefer

(SUM(x - AVG(x))*(y - AVG(y)))
/ SQRT(SUM((x - AVG(x))^2) * SUM((y - AVG(y))^2))

I don't know if exponentiation ^2 is standard SQL, otherwise just use
multiplication.

Greetings
Matthias Kläy
-- 
www.kcc.ch
Matthias Kläy
-- 
www.kcc.ch