| Oracle FAQ | Your Portal to the Oracle Knowledge Grid | |
Home -> Community -> Usenet -> comp.databases.theory -> Pearson-r in SQL
I was playing with doing som stats in SQL, and I need some feedback.
What should you do with NULLs in a computation of Pearson's r? Let me
do a "cut & paste" of the definition of r and some SQL code at the
bottom of this posting.
If (x, y) = (NULL, NULL) then my query will drop them both and the answer stays the same. That seems to follow the usual SQL aggregate function rules for singleton stats like AVG()about dropping NULLs before computing. That was the reason for using COUNT(x) and not COUNT(*). But what is the correct (or reasonable) behavior if (x,y) has one and only one NULL in the pair? My initial thoughts are:
CREATE TABLE Foobar
(x REAL NOT NULL,
y REAL NOT NULL);
INSERT INTO Foobar VALUES (1.0, 2.0); INSERT INTO Foobar VALUES (2.0, 5.0); INSERT INTO Foobar VALUES (3.0, 6.0);
SELECT
((SUM(x*y) - ((SUM(x)*SUM(y))/COUNT(x)))
/ (SQRT((SUM(x*x)- (SUM(x)*SUM(x)/COUNT(x)))
* (SUM(y*y)- (SUM(y)* SUM(y)/COUNT(x))))))
AS pearson_r
FROM Foobar;
The correlation coefficient always takes a value between -1 and 1, with 1 or -1 indicating perfect correlation (all points would lie along a straight line in this case). A positive correlation indicates a positive association between the variables (increasing values in one variable correspond to increasing values in the other variable), while a negative correlation indicates a negative association between the variables (increasing values is one variable corresponding to decreasing values in the other variable). A correlation value close to 0 indicates no association between the variables.
Since the formula for calculating the correlation coefficient standardizes the variables, changes in scale or units of measurement will not affect its value. For this reason, the correlation coefficient is often more useful than a graphical depiction in determining the strength of the association between two variables. Received on Wed Dec 22 2004 - 14:13:22 CST
![]() |
![]() |