# Re: Pearson-r in SQL

Date: Wed, 22 Dec 2004 13:49:48 -0800

Message-ID: <j9myd.23$Tg5.117_at_news.oracle.com>

"-CELKO-" <jcelko212_at_earthlink.net> wrote in message
news:1103746401.990096.37280_at_f14g2000cwb.googlegroups.com...

> I was playing with doing som stats in SQL, and I need some feedback.

*> What should you do with NULLs in a computation of Pearson's r? Let me
**> do a "cut & paste" of the definition of r and some SQL code at the
**> bottom of this posting.
**>
**> If (x, y) = (NULL, NULL) then my query will drop them both and the
**> answer stays the same. That seems to follow the usual SQL aggregate
**> function rules for singleton stats like AVG()about dropping NULLs
**> before computing. That was the reason for using COUNT(x) and not
**> COUNT(*).
**>
**> But what is the correct (or reasonable) behavior if (x,y) has one and
**> only one NULL in the pair? My initial thoughts are:
**>
**> 1) Drop the pair. That is quick and easy with a "WHERE x IS NOT NULL
**> AND y IS NOT NULL" clause.
**>
**> 2) Convert (x, NULL) to (x, AVG(x)) and (NULL, y) to (AVG(y), y)? Or
**> something that will smooth out the missing values.
**>
**> 3) Should the code use COUNT(*) instead of COUNT(x) to show that the
**> missing values are affecting the answer?
*

Suppose all the data you have are either (NULL, nonNULL) or (nonNULL, NULL) pairs. What is the corellation between x and y? Received on Wed Dec 22 2004 - 22:49:48 CET