Cluster dendrogram: An introduction and showcase

By Deependra Dhakal in R

September 7, 2020

A cluster analysis is a classification problem. It is dealt in several ways, one of which is hierarchial agglomeration. The method allows for easy presentation of high dimensional data, more of so when the number of observations is readily fitted into a visualization.

Here’s I deal with a case of clustering typically seen in agriculture and field research where a researcher tests typically a large number of genotypes and seeks to see them organized into distinguishable clusters using dendrogram. Data concerns observations on disease incidence in rice genotypes of various stages – germinating seed to maturity nearing crop. Following provides a descriptive summary of the observation variables.

rice_path <- read_csv(here::here("content", "blog", "data", "rice_genotypes_blight_pathology_disease.csv"))
landraces <- read_csv(here::here("content", "blog", "data", "rice_genotypes_blight_pathology_landraces.csv"))
skimr::skim(rice_path)

Table: Table 1: Data summary

Name rice_path
Number of rows 60
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
severity_percent 0 1 58.79 10.34 33.33 51.86 59.26 66.67 81.49 ▂▅▇▇▂
mean_audpc 0 1 239.55 38.59 134.26 216.88 243.06 263.89 305.56 ▁▃▆▇▅
final_seed_incidence 0 1 51.70 8.67 30.00 46.00 52.67 58.00 69.33 ▂▃▇▇▃

Scaling the data in all columns avoids the trouble with interpretation, particularly with all the different units of measurement in each variable.

# scale data and prepare
rice_path_disease <- scale(rice_path)

# distance and clustering
dist_euclidean <- dist(rice_path_disease, "euclidean")
# hclust_ave <- hclust(dist_euclidean, method = "ave")
hclust_ave <- hclust(dist_euclidean, method = "ward.D2") # ward has high clustering power
hclust_ave$labels <- landraces$landraces[1:60][hclust_ave$order]

# extract data
hclust_ave_data <- ggdendro::dendro_data(hclust_ave, type = "rectangle")

# determine number of clusters
clust_five <- cutree(hclust_ave, k=5)
clust_df <- tibble(label=names(clust_five), cluster=factor(clust_five))

# merge the labels, with clust.df
hclust_ave_data[["labels"]] <- merge(hclust_ave_data[["labels"]], clust_df, by="label")
clustering_dendrogram <- ggplot() + 
  geom_segment(data=hclust_ave_data[["segments"]], aes(x=x, y=y, xend=xend, 
                                        yend=yend)) + 
  geom_text(data=hclust_ave_data[["labels"]], aes(x, y, label=label, hjust=0, color=cluster), 
            size=2.2) +
  coord_flip() + scale_y_reverse(expand=c(0.2, 0)) + 
  theme(axis.line.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.y=element_blank(),
        axis.title.y=element_blank(),
        panel.background=element_rect(fill="white"),
        panel.grid=element_blank()) +
  labs(caption = "Hierarchical clustering using euclidean distance and \nward's method for agglomeration and \nshowing five clusters in dendrogram")

# ggsave("./clustering_dendrogram.png", plot = clustering_dendrogram, device = "png", 
#          width = 8, height = 6, units = "in", dpi = 240)

While we use certain method in clustering, its efficacy as a classifer can be checked with cluster::agnes().

# methods to assess
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")

# function to compute coefficient
ac <- function(x) {
  agnes(rice_path_disease, method = x)$ac
}

map_dbl(m, ac)
##   average    single  complete      ward 
## 0.8171988 0.7323511 0.8932563 0.9257491
Posted on:
September 7, 2020
Length:
2 minute read, 413 words
Categories:
R
Tags:
tidyverse
See Also:
Logistic Regression: Part II - Varietal adoption dataset
Logistic Regression: Part I - Fundamentals
Linear model fitting for regression: Basics and Variation