==Computing correlations==
“R can produce a variety of correlation coefficients, including Pearson, Spearman, Kendall, partial, polychoric, and polyserial.” (“R in action”, p. 153) (pdf)
Pearson, Spearman, Kendall, partial, polychoric, polyserial. The Pearson correlation coefficient is defined as:
r=\frac{\sum^{n}_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum^{n}_{i=1}(x_i-\bar{x})^2\sum^{n}_{i=1}(y_i-\bar{y})^2}}
 > states <- state.x77[,1:6]
 > states
         Population Income Illiteracy Life Exp Murder HS Grad
 Alabama       3615   3624        2.1    69.05   15.1    41.3
 Alaska         365   6315        1.5    69.31   11.3    66.7
 Arizona       2212   4530        1.8    70.55    7.8    58.1
 # (remaining rows omitted)
 > cor(states)
             Population     Income Illiteracy    Life Exp     Murder     HS Grad
 Population  1.00000000  0.2082276  0.1076224 -0.06805195  0.3436428 -0.09848975
 Income      0.20822756  1.0000000 -0.4370752  0.34025534 -0.2300776  0.61993232
 Illiteracy  0.10762237 -0.4370752  1.0000000 -0.58847793  0.7029752 -0.65718861
 Life Exp   -0.06805195  0.3402553 -0.5884779  1.00000000 -0.7808458  0.58221620
 Murder      0.34364275 -0.2300776  0.7029752 -0.78084575  1.0000000 -0.48797102
 HS Grad    -0.09848975  0.6199323 -0.6571886  0.58221620 -0.4879710  1.00000000
 > # extract the two columns as vectors so they can be referenced by name
 > Population <- states[, "Population"]
 > Income <- states[, "Income"]
 > # Pearson's r computed by hand from the formula above
 > sum( (Population-mean(Population)) * (Income-mean(Income)) ) / sqrt(sum( (Population-mean(Population))^2 ) * sum( (Income-mean(Income))^2 ))
 [1] 0.2082276
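The quotation above lists several coefficients, while the session only computes Pearson's. As a minimal sketch, the `method` argument of `cor()` can select Spearman or Kendall as well. The choice of the `Income` and `Murder` columns is only for illustration. Partial, polychoric, and polyserial correlations are not part of base R's `cor()` and are not shown here.

```r
# Same built-in data as in the session above.
states <- state.x77[, 1:6]
x <- states[, "Income"]   # illustrative choice of columns
y <- states[, "Murder"]

# cor() selects the coefficient through its `method` argument.
r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

# Spearman's coefficient is Pearson's coefficient applied to the ranks.
r_rank <- cor(rank(x), rank(y))

print(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall))
print(all.equal(r_spearman, r_rank))
```

Passing the whole matrix, for example `cor(states, method = "spearman")`, should return the corresponding correlation matrix in the same layout as the Pearson one.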
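==Testing the correlation coefficient==
“Because of sampling randomness and the effect of sample size, the simple correlation coefficient reflects only the correlation exhibited by the sample. Whether the population from which the sample was drawn is correlated or not still requires a test of the correlation coefficient” (“基于R的统计分析与数据挖掘”, p. 68) (pdf). The test statistic t for the simple correlation coefficient test is computed as follows: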
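t_r = \frac{r-\rho}{s_r} \sim t(n-2)
Here \rho is the population correlation coefficient and s_r is the sampling error (standard error) of r. Under the null hypothesis \rho = 0, with s_r = \sqrt{\frac{1-r^2}{n-2}}, the statistic becomes:
t= \frac{r-0}{\sqrt{\frac{1-r^2}{n-2}}}= \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}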
 > cor.test(Population,Income)
         Pearson's product-moment correlation
 data:  Population and Income
 t = 1.475, df = 48, p-value = 0.1467
 alternative hypothesis: true correlation is not equal to 0
 95 percent confidence interval:
  -0.07443435  0.45991855
 sample estimates:
       cor
 0.2082276
 > r = 0.2082276
 > n = nrow(states) # 50
 > ( r*sqrt(n-2) ) / ( sqrt(1-r^2) )
 [1] 1.474974
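The session reproduces the t statistic by hand. As a further sketch, the two-sided p-value can be recovered from the same statistic using the t distribution with n-2 degrees of freedom. The variable names `x`, `y`, `t_stat`, and `p_value` are introduced here for illustration only.

```r
states <- state.x77[, 1:6]
x <- states[, "Population"]
y <- states[, "Income"]

n <- length(x)
r <- cor(x, y)

# Test statistic under H0: rho = 0.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Compare with the values reported by cor.test().
ct <- cor.test(x, y)
print(c(t = t_stat, p = p_value))
print(all.equal(unname(ct$statistic), t_stat))
print(all.equal(unname(ct$p.value), p_value))
```

A p-value above the chosen significance level (0.05 is a common choice) means the null hypothesis of zero population correlation is not rejected, which agrees with the confidence interval in the output above containing 0.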
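“The partial correlation coefficient is the correlation coefficient between two numeric variables computed while controlling for other numeric variables (called control variables), thereby removing the influence of those other variables on the value of the correlation coefficient” (“基于R的统计分析与数据挖掘”, p. 70) (pdf)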
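The definition above can be sketched in base R in two equivalent ways for a single control variable: correlating the residuals left after regressing each variable on the control, and applying the first-order partial correlation formula. Using `Illiteracy` as the control variable is an assumption made only for this illustration; the source does not name one.

```r
d <- as.data.frame(state.x77[, 1:6])

# Residual method: remove the linear effect of the control variable
# (Illiteracy, an illustrative choice) from both variables, then correlate.
res_x <- resid(lm(Population ~ Illiteracy, data = d))
res_y <- resid(lm(Income ~ Illiteracy, data = d))
r_partial_resid <- cor(res_x, res_y)

# First-order formula: r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
r_xy <- cor(d$Population, d$Income)
r_xz <- cor(d$Population, d$Illiteracy)
r_yz <- cor(d$Income, d$Illiteracy)
r_partial_formula <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))

print(c(residual_method = r_partial_resid, formula = r_partial_formula))
```

Both approaches should give the same value. With more than one control variable the residual method extends directly by adding terms to the two regressions.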
==Visualizing the correlation between two numeric variables==
 > plot(Population~Income,states)
 > lm_fit <- lm(Population~Income,as.data.frame(states))
 > lm_fit
 Call:
 lm(formula = Population ~ Income, data = as.data.frame(states))
 Coefficients:
 (Intercept)       Income
   -2464.492        1.513
 > abline(lm_fit,col="red")
 > coef(lm_fit)
  (Intercept)       Income
 -2464.491538     1.512898
 > abline(-2464.491538,1.512898, col="red")  # same line, drawn from the coefficients
 > loess_fit <- loess(Population~Income,as.data.frame(states))
 > loess_fit
 Call:
 loess(formula = Population ~ Income, data = as.data.frame(states))
 Number of Observations: 50
 Equivalent Number of Parameters: 5.46
 Residual Standard Error: 4447
 > ord <- order(loess_fit$x)
 > lines(loess_fit$x[ord], loess_fit$fitted[ord], col="blue")
 library(ggplot2)
 ggplot(as.data.frame(states), aes(Income, Population)) +
   geom_point() +
   geom_smooth(method = "loess", se = FALSE, col = "blue") +
   geom_smooth(method = "lm", se = FALSE, col = "red")
“The tool for describing the correlation between two categorical variables is the contingency table” (“基于R的统计分析与数据挖掘”, p. 73) (pdf)
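A contingency table can be built in base R with `table()`. The sketch below uses the built-in `mtcars` data and treats `cyl` and `am` as categorical variables; this dataset and the dimension labels are assumptions chosen for illustration and do not come from the source.

```r
# Cross-tabulate two categorical variables (illustrative choice of data).
tab <- table(cylinders = mtcars$cyl, transmission = mtcars$am)
print(tab)

# Add row and column totals.
print(addmargins(tab))

# Row proportions: distribution of transmission type within each cylinder group.
print(round(prop.table(tab, margin = 1), 3))
```

Row proportions that differ clearly between rows hint at an association, but whether it holds in the population is a question for a formal test.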
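“Testing the correlation between two categorical variables builds on the contingency table: the table's data are used to analyze whether the two categorical variables are correlated in the population. The method used is the chi-square test (#卡方检验). The null hypothesis of the chi-square test is that the two categorical variables in the contingency table are independent” (“基于R的统计分析与数据挖掘”, p. 76) (pdf)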
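As a sketch of the test described above, the Pearson chi-square statistic can be computed by hand from the observed and expected counts and compared with `chisq.test()`. The hair-by-eye-color table derived from the built-in `HairEyeColor` data is an illustrative assumption, not data from the source.

```r
# Hair x Eye contingency table, summed over Sex (illustrative data).
tab <- margin.table(HairEyeColor, c(1, 2))

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson chi-square statistic and its degrees of freedom.
chi_sq <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

# Compare with the built-in test.
ct <- chisq.test(tab)
print(c(chi_sq = chi_sq, df = df, p = p_value))
print(all.equal(unname(ct$statistic), chi_sq))
```

A small p-value leads to rejecting the null hypothesis of independence. For 2 x 2 tables `chisq.test()` applies a continuity correction by default, so the hand computation would match only with `correct = FALSE`.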