p401
We perform PCA on the USArrests data set, which is part of the base R package. The rows of the data set contain the 50 states, in alphabetical order.
states=row.names(USArrests)
#states
names(USArrests)
## [1] "Murder" "Assault" "UrbanPop" "Rape"
Apply the mean to each column
apply(USArrests, 2, mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
Apply the variance to each column
apply(USArrests, 2, var)
## Murder Assault UrbanPop Rape
## 18.97047 6945.16571 209.51878 87.72916
Use the prcomp() function.
By default, the prcomp() function centers the variables to have mean zero. By using the option scale=TRUE, we scale the variables to have standard deviation one.
pr.out=prcomp(USArrests, scale=TRUE)
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
pr.out$center
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
pr.out$scale
## Murder Assault UrbanPop Rape
## 4.355510 83.337661 14.474763 9.366385
names(pr.out)
## [1] "sdev" "rotation" "center" "scale" "x"
The rotation matrix provides the principal component loadings; each column of pr.out$rotation contains the corresponding principal component loading vector.
pr.out$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
There are four distinct principal components. In general min(n - 1, p) informative principal components in a data set with n observations and p variables.
dim(pr.out$x)
## [1] 50 4
biplot(pr.out, scale=0)
The scale=0 argument to biplot() ensures that the arrows are scaled to represent the loadings; other values for scale give slightly different biplots with different interpretations.
pr.out$rotation=-pr.out$rotation
pr.out$x=-pr.out$x
biplot(pr.out, scale=0)
The prcomp() function also outputs the standard deviation of each principal component
pr.out$sdev
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
The variance explained by each principal component is obtained by squaring these:
pr.var <- pr.out$sdev^2
pr.var
## [1] 2.4802416 0.9897652 0.3565632 0.1734301
To compute the proportion of variance explained by each principal component, we simply divide the variance explained by each principal component by the total variance explained by all four principal components:
pve=pr.var/sum(pr.var)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
We see that the first principal component explains 62.0 % of the variance in the data, the next principal component explains 24.7 % of the variance, and so forth.
We can plot the PVE explained by each component, as well as the cumulative PVE, as follows:
par(mfrow=c(1,2))
plot(pve, xlab="Principal Component",
ylab="Proportion of Variance Explained ",
ylim=c(0,1),type='b')
plot(cumsum(pve), xlab="Principal Component ",
ylab="Cumulative Proportion of Variance Explained ",
ylim=c(0,1), type='b')
cumsum() used above is the cumulative sum.
a=c(1,2,8,-3)
cumsum (a)
## [1] 1 3 11 8