Correlation analysis of two numerical variables
==Computing the correlation==
“R can produce a variety of correlation coefficients, including Pearson, Spearman, Kendall, partial, polychoric, and polyserial.” (“R in action”, p. 153)
Pearson, Spearman, Kendall, partial, polychoric, polyserial
The Pearson correlation coefficient r is defined as:
r=\frac{\sum^{n}_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum^{n}_{i=1}(x_i-\bar{x})^2\sum^{n}_{i=1}(y_i-\bar{y})^2}}
> states <- state.x77[,1:6]
> head(states, 3)
        Population Income Illiteracy Life Exp Murder HS Grad
Alabama       3615   3624        2.1    69.05   15.1    41.3
Alaska         365   6315        1.5    69.31   11.3    66.7
Arizona       2212   4530        1.8    70.55    7.8    58.1
> cor(states)
            Population     Income Illiteracy    Life Exp     Murder     HS Grad
Population  1.00000000  0.2082276  0.1076224 -0.06805195  0.3436428 -0.09848975
Income      0.20822756  1.0000000 -0.4370752  0.34025534 -0.2300776  0.61993232
Illiteracy  0.10762237 -0.4370752  1.0000000 -0.58847793  0.7029752 -0.65718861
Life Exp   -0.06805195  0.3402553 -0.5884779  1.00000000 -0.7808458  0.58221620
Murder      0.34364275 -0.2300776  0.7029752 -0.78084575  1.0000000 -0.48797102
HS Grad    -0.09848975  0.6199323 -0.6571886  0.58221620 -0.4879710  1.00000000
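> attach(as.data.frame(states))  # states is a matrix; attach it as a data frame so columns can be used by name
> sum( (Population-mean(Population)) * (Income-mean(Income)) ) /
+   sqrt(sum( (Population-mean(Population))^2 ) * sum( (Income-mean(Income))^2 ))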
[1] 0.2082276
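The quote above also lists the rank-based Spearman and Kendall coefficients. A minimal sketch of how they might be obtained with the `method` argument of `cor()`, using the same Population and Income columns as above; the variable names `x`, `y`, `r_spearman`, `r_kendall`, and `r_on_ranks` are chosen here for illustration and do not come from the original notes.

```r
# Same two columns of the built-in state.x77 matrix used in this section
x <- state.x77[, "Population"]
y <- state.x77[, "Income"]

# Rank-based alternatives to the default Pearson coefficient
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

# Spearman's rho is Pearson's r computed on the ranks of the data
r_on_ranks <- cor(rank(x), rank(y))
all.equal(r_spearman, r_on_ranks)
```

The last line checks that Spearman's coefficient agrees with the Pearson formula applied to ranks, which ties it back to the manual calculation shown earlier.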
==Testing the correlation coefficient==
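“Because of sampling randomness and the effect of sample size, the simple correlation coefficient reflects only the correlation shown by the sample. Whether the population the sample comes from is correlated or not still requires a test of the correlation coefficient” (“基于R的统计分析与数据挖掘”, p. 68). The test statistic t for the simple correlation coefficient is computed as follows: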
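t_r = \frac{r-\rho}{s_r} \sim t(n-2)
Here \rho is the population correlation coefficient and s_r is the sampling error (standard error) of r. Under the null hypothesis \rho = 0, with s_r = \sqrt{\frac{1-r^2}{n-2}}, this becomes: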
t= \frac{r-0}{\sqrt{\frac{1-r^2}{n-2}}}= \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
> cor.test(Population,Income)
Pearson's product-moment correlation
data: Population and Income
t = 1.475, df = 48, p-value = 0.1467
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.07443435 0.45991855
sample estimates:
cor
0.2082276
> r = 0.2082276
> n = nrow(states) # 50
> ( r*sqrt(n-2) ) / ( sqrt(1-r^2) )
[1] 1.474974
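The p-value reported by `cor.test` can be recovered from this t statistic as well. The sketch below assumes a two-sided test with n - 2 degrees of freedom, matching the `df = 48` and "not equal to 0" lines in the output above; the names `t_stat` and `p_value` are illustrative.

```r
# Recompute r and n from the built-in data so the block is self-contained
r <- cor(state.x77[, "Population"], state.x77[, "Income"])
n <- nrow(state.x77)

# t statistic under the null hypothesis rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
p_value
```

The result should agree with the `p-value = 0.1467` line printed by `cor.test(Population, Income)`.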
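“The partial correlation coefficient is the correlation coefficient between two numerical variables computed while controlling for other numerical variables (called control variables), thereby removing the influence of those variables on the value of the correlation coefficient” (“基于R的统计分析与数据挖掘”, p. 70)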
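One way to make this definition concrete is to correlate the residuals left after regressing each variable on the control variable. The sketch below uses only base R; the choice of Murder and Illiteracy as the two variables and Income as the single control variable is an assumption made for illustration and is not taken from the book.

```r
# Work on a data frame copy of the first six columns of state.x77
states_df <- as.data.frame(state.x77[, 1:6])

# Remove the linear effect of the control variable (Income) from each variable
res_murder     <- resid(lm(Murder ~ Income, data = states_df))
res_illiteracy <- resid(lm(Illiteracy ~ Income, data = states_df))

# Partial correlation = correlation of the two residual series
p_resid <- cor(res_murder, res_illiteracy)

# Cross-check with the closed form for a single control variable
R <- cor(states_df[, c("Murder", "Illiteracy", "Income")])
p_formula <- (R[1, 2] - R[1, 3] * R[2, 3]) /
  sqrt((1 - R[1, 3]^2) * (1 - R[2, 3]^2))

all.equal(p_resid, p_formula)
```

Both routes should give the same value. With several control variables, the residual approach extends directly by adding them to the right-hand side of both regressions.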
==Visualizing the correlation between two numerical variables==
> plot(Population~Income,states)
> lm_fit <- lm(Population~Income,as.data.frame(states))
> lm_fit
Call:
lm(formula = Population ~ Income, data = as.data.frame(states))
Coefficients:
(Intercept) Income
-2464.492 1.513
> abline(lm_fit,col="red")
> coef(lm_fit)
 (Intercept)       Income
-2464.491538     1.512898
> abline(-2464.491538,1.512898, col="red")
> loess_fit <- loess(Population~Income,as.data.frame(states))
> loess_fit
Call:
loess(formula = Population ~ Income, data = as.data.frame(states))
Number of Observations: 50
Equivalent Number of Parameters: 5.46
Residual Standard Error: 4447
> ord <- order(loess_fit$x)
> lines(loess_fit$x[ord], loess_fit$fitted[ord], col="blue")
library(ggplot2)
ggplot(as.data.frame(states), aes(Income, Population)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE, colour = "blue") +
  geom_smooth(method = "lm", se = FALSE, colour = "red")
Correlation analysis of two categorical variables
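“The tool for describing the correlation between two categorical variables is the contingency table” (“基于R的统计分析与数据挖掘”, p. 73)
“Testing the correlation between two categorical variables builds on the contingency table, using its data to analyze whether the two categorical variables are correlated in the population. The method used is the chi-square test. The null hypothesis of the chi-square test is that the two categorical variables in the contingency table are independent” (“基于R的统计分析与数据挖掘”, p. 76)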
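A minimal sketch of the two steps described above: build a contingency table, then test independence with `chisq.test`. The notes do not name a dataset for this part, so the built-in `HairEyeColor` table (collapsed over sex to a Hair by Eye table) is an assumption made purely for illustration.

```r
# Two-way contingency table of hair colour by eye colour (summed over sex)
tab <- margin.table(HairEyeColor, c(1, 2))

# Chi-square test; the null hypothesis is that the two variables are independent
res <- chisq.test(tab)

# The same statistic by hand: expected counts under independence
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
chi_sq   <- sum((tab - expected)^2 / expected)
dof      <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_value  <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

all.equal(unname(res$statistic), chi_sq)
```

A small p-value leads to rejecting the null hypothesis of independence. The manual calculation is included only to show where the statistic comes from.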