Consider the USArrests data. We will now perform hierarchical clustering on the states.
Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
library(ISLR)
library(ggdendro) # Better dendrograms
library(ggplot2)
hc.complete = hclust(
  dist(USArrests),
  method = "complete"
)
summary(hc.complete)
## Length Class Mode
## merge 98 -none- numeric
## height 49 -none- numeric
## order 50 -none- numeric
## labels 50 -none- character
## method 1 -none- character
## call 3 -none- call
## dist.method 1 -none- character
hc.complete
##
## Call:
## hclust(d = dist(USArrests), method = "complete")
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 50
Dendrogram
plot(hc.complete)
Better Dendrogram
ggdendrogram(hc.complete)
Cutting the tree to have only three branches (clusters)
hc.cut.complete = cutree(hc.complete, 3)
Number of observations in each cluster
table(hc.cut.complete)
## hc.cut.complete
## 1 2 3
## 16 14 20
hc.cut.complete
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
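For readability, the cluster assignments above can also be grouped so that each cluster's member states are listed together (a small convenience step using base R's split, not part of the original output):

# Group state names by their cluster label
split(names(hc.cut.complete), hc.cut.complete)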
Plotting the dendrogram with an abline marking the cut height that yields 3 clusters
par(mfrow=c(1,1))
plot(hc.complete)
abline(h=150, col="red")
Scaling the features
USArrests.scaled = scale(USArrests)
# Fitting the scaled features
hc.complete.scaled = hclust(
  dist(USArrests.scaled),
  method = "complete"
)
plot(hc.complete.scaled)
ggdendrogram(hc.complete.scaled)
ggdendrogram(hc.complete.scaled, rotate = TRUE, theme_dendro = FALSE)
Cutting the tree to have only three branches (clusters)
hc.scaled.cut.complete = cutree(hc.complete.scaled, 3)
table(hc.scaled.cut.complete)
## hc.scaled.cut.complete
## 1 2 3
## 8 11 31
table(hc.scaled.cut.complete, hc.cut.complete)
## hc.cut.complete
## hc.scaled.cut.complete 1 2 3
## 1 6 2 0
## 2 9 2 0
## 3 1 10 20
COMMENTS: Scaling the features changes the clustering results, as the cross-tabulation above shows, and it also compresses the vertical axis of the dendrogram (fusion heights are smaller). The variables should be scaled before the inter-observation dissimilarities are computed: the variables are measured in different units (Murder, Assault, and Rape are arrests per 100,000 residents, while UrbanPop is a percentage), so without scaling, variables with large variances dominate the Euclidean distance even when their magnitude says nothing about their importance to the clustering.
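The point about unequal variances can be checked directly by computing the variance of each column of the raw data (a quick supporting check, not part of the original solution); Assault has by far the largest variance, so it dominates the unscaled Euclidean distances:

# Per-variable variances of the raw (unscaled) data
apply(USArrests, 2, var)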