In this interactive tutorial you will learn how hierarchical clustering works.
You can modify the number of points, but that won’t be our focus in this tutorial. Let’s use 12 points, just to have a different setup than in the 101 lecture. We will focus on the impact of two parameters: the distance metric and the linkage method. For starters, we specify Euclidean distance and single linkage.
Let’s start with the distance metric. The simplest choice is Euclidean distance, which is the straight-line distance between two points. Points 4 and 8 are clearly the closest pair. You can see this visually in the top graph, but also in the dendrogram, where both points are connected at the lowest level. If you specify 11 clusters, two points are considered a pair and all others remain individual clusters. The paired points are points 4 and 8.
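To make the idea of the closest pair concrete, here is a minimal sketch in plain Python. The coordinates and the helper names `euclidean` and `closest_pair` are made up for illustration; they are not the dashboard's data or code.

```python
import math
from itertools import combinations

# Hypothetical 2-D points keyed by label (not the dashboard's points).
points = {
    1: (0.0, 0.0),
    2: (4.0, 0.0),
    3: (4.0, 3.0),
    4: (9.0, 9.0),
    5: (9.5, 9.0),
}

def euclidean(p, q):
    """Straight-line distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def closest_pair(points, metric):
    """Return the pair of labels with the smallest distance under `metric`."""
    return min(
        combinations(sorted(points), 2),
        key=lambda pair: metric(points[pair[0]], points[pair[1]]),
    )

pair = closest_pair(points, euclidean)
print(pair, euclidean(points[pair[0]], points[pair[1]]))
```

The pair found this way is the one that would be joined at the lowest level of the dendrogram.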
The next closest pair is points 9 and 12. You can see that they form a group, as do 1 and 10, 2 and 11, and 3 and 7. Which cluster is added next? Point 5 is closest to the cluster of points 3 and 7. For single linkage, the distance to a cluster is measured to its closest member, which here is point 7. If you now switch to complete linkage, the farthest member of each cluster is considered instead. The farthest point of a close-by cluster is then point 11; point 3 would be the other choice, but it is much farther away.
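The difference between the two linkage rules can be sketched in a few lines. The clusters and the point below are invented so that the two rules disagree; the function names are assumptions for this example only.

```python
import math

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def single_linkage(a, b):
    """Cluster distance = distance between the two closest members."""
    return min(euclidean(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Cluster distance = distance between the two farthest members."""
    return max(euclidean(p, q) for p in a for q in b)

# Hypothetical data: one lone point and two candidate clusters.
lonely = [(0.0, 0.0)]
stretched = [(1.0, 0.0), (10.0, 0.0)]  # one near member, one far member
compact = [(3.0, 0.0), (4.0, 0.0)]     # both members moderately close

for name, linkage in [("single", single_linkage), ("complete", complete_linkage)]:
    d_stretched = linkage(lonely, stretched)
    d_compact = linkage(lonely, compact)
    winner = "stretched" if d_stretched < d_compact else "compact"
    print(name, d_stretched, d_compact, "-> merges with", winner)
```

With these numbers, single linkage prefers the stretched cluster because of its one near member, while complete linkage prefers the compact cluster because its farthest member is still fairly close.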
If you keep the linkage method at complete linkage and set the distance metric to Manhattan, which sums the absolute coordinate differences, the whole picture changes. The lowest level is pretty similar: the two-point pairs are the same. Please set the number of clusters to 4, because then you can see the impact directly in the upper graph. Point 6, which forms a cluster with points 3 and 7 under Euclidean distance, is its own cluster under Manhattan distance. At the same time, all the green points on the right side form a cluster. If you switch to Euclidean you see that points 4 and 8 are on their own, but under Manhattan they are added to the other green points.
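Why can the metric alone change which points end up together? A small sketch with invented coordinates shows that the nearest neighbour of a point can differ between Euclidean and Manhattan distance. All names and values here are illustrative assumptions.

```python
import math

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

# Hypothetical points: one lies diagonally from the origin, one along an axis.
origin = (0.0, 0.0)
candidates = {"diagonal": (3.0, 3.0), "axis": (0.0, 5.0)}

def nearest(metric):
    """Name of the candidate closest to the origin under `metric`."""
    return min(candidates, key=lambda name: metric(origin, candidates[name]))

print("euclidean:", nearest(euclidean))  # diagonal: about 4.24 versus 5.0
print("manhattan:", nearest(manhattan))  # axis: 5.0 versus 6.0
```

Manhattan distance penalizes diagonal offsets more than Euclidean distance does, so the order of merges higher up in the dendrogram can change even when the lowest-level pairs stay the same.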
You can also play with other linkage methods like Ward. At each step, Ward merges the two clusters whose merge causes the smallest increase in total within-cluster variance. So the clusters tend to be more homogeneous.
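As a rough sketch of the Ward criterion, the cost of a merge can be computed as the increase in the sum of squared distances to the cluster centroid (the within-cluster variance multiplied by the cluster size). The clusters and function names below are assumptions made up for this example.

```python
def centroid(cluster):
    n = len(cluster)
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / n for d in range(dims))

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in cluster)

def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    return sse(a + b) - sse(a) - sse(b)

def ward_cost_closed_form(a, b):
    """Equivalent form: |A||B| / (|A| + |B|) * squared centroid distance."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap

# Hypothetical clusters.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0), (7.0, 0.0)]
c = [(2.0, 4.0)]

print(ward_cost(a, b), ward_cost_closed_form(a, b))
print(ward_cost(a, c), ward_cost_closed_form(a, c))
```

The pair with the lower cost is merged first, which is why Ward tends to favour compact clusters.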
By changing the number of clusters you change the threshold. The threshold is a horizontal line cutting through the dendrogram, and the number of clusters determines its height. If you choose only two clusters, the line sits at a height where it crosses exactly two branches. Similarly, you can see the changes that result from other cluster numbers. You can directly see the result by looking at the color mappings in the upper graph.
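The link between the number of clusters and the cut height can be sketched with a naive agglomerative loop that stops once the requested number of clusters remains. This is a simplified illustration with invented points and single linkage, not the dashboard's implementation; libraries such as SciPy provide optimized routines for the same task.

```python
import math

def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def single_link(a, b, points):
    """Single-linkage distance between two clusters of point indices."""
    return min(euclidean(points[i], points[j]) for i in a for j in b)

def agglomerate(points, n_clusters):
    """Merge the closest clusters until only `n_clusters` remain.

    Returns the clusters (lists of point indices) and the merge heights.
    """
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = single_link(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        heights.append(d)
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters, heights

# Hypothetical points on a line, indexed from 0.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0), (20.0, 0.0)]

for k in (3, 2):
    clusters, heights = agglomerate(points, k)
    print(k, clusters, heights)
```

Asking for fewer clusters simply allows more merges, which corresponds to moving the horizontal line higher up in the dendrogram, above the height of the last merge that was performed.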
This interactive dashboard gives you a tool to study the impact of parameter changes on hierarchical clustering. Have fun with it. Thanks for reading; I am happy to receive your feedback.