Re: Pearson-r in SQL

From: Mikito Harakiri <mikharakiri_at_iahu.com>
Date: Wed, 22 Dec 2004 13:49:48 -0800
Message-ID: <j9myd.23$Tg5.117_at_news.oracle.com>


"-CELKO-" <jcelko212_at_earthlink.net> wrote in message news:1103746401.990096.37280_at_f14g2000cwb.googlegroups.com...
> I was playing with doing som stats in SQL, and I need some feedback.
> What should you do with NULLs in a computation of Pearson's r? Let me
> do a "cut & paste" of the definition of r and some SQL code at the
> bottom of this posting.
>
> If (x, y) = (NULL, NULL) then my query will drop them both and the
> answer stays the same. That seems to follow the usual SQL aggregate
> function rules for singleton stats like AVG()about dropping NULLs
> before computing. That was the reason for using COUNT(x) and not
> COUNT(*).
>
> But what is the correct (or reasonable) behavior if (x,y) has one and
> only one NULL in the pair? My initial thoughts are:
>
> 1) Drop the pair. That is quick and easy with a "WHERE x IS NOT NULL
> AND y IS NOT NULL" clause.
>
> 2) Convert (x, NULL) to (x, AVG(x)) and (NULL, y) to (AVG(y), y)? Or
> something that will smooth out the missing values.
>
> 3) Should the code use COUNT(*) instead of COUNT(x) to show that the
> missing values are affecting the answer?

Suppose all the data you have are either (NULL, nonNULL) or (nonNULL, NULL) pairs. What is the corellation between x and y? Received on Wed Dec 22 2004 - 22:49:48 CET

Original text of this message