Assignment 5. Hierarchical Clustering
This assignment requires the similarity matrix computed in Assignment 3. We will provide the matrix for consistency. Implement (in Java or Python) and turn in the following hierarchical clustering program.
Part 1. Pseudo-Code for Hierarchical Clustering.
- Initialize each cluster to be a singleton. I suggest that you use the Java utility class BitSet to do this. So initially you will have 10 BitSets, each with a single bit turned on. You might keep these BitSets in a List or ArrayList (that will be more convenient than an array).
- While more than one cluster exists do:
  - Find the two closest clusters, that is, the pair with the highest similarity. The similarity between two clusters is the average similarity between all pairs, one member from each of the clusters. The similarity between any two elements is given by the similarity matrix.
  - Now output the elements of the two closest clusters and their union. (No change to any data structures at this step.)
  - Remove one of the clusters and replace the other by the union of the two.
  - Go back to the beginning of the while loop.
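The steps above can be sketched in Java roughly as follows. This is only a minimal illustration, not a reference solution: the class name `HierarchicalClusteringSketch`, the method names `averageSimilarity` and `cluster`, and the 4x4 toy matrix in `main` are all made up for this example. The real program should read the provided 10x10 similarity matrix instead. The sketch assumes that a higher matrix value means the two elements are closer, that elements are numbered from 0, and that ties go to the first pair found.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical class name; a sketch of the pseudo-code, not the required solution.
class HierarchicalClusteringSketch {

    // Average similarity over all pairs (i, j) with i in cluster a and j in cluster b.
    static double averageSimilarity(BitSet a, BitSet b, double[][] sim) {
        double sum = 0.0;
        for (int i = a.nextSetBit(0); i >= 0; i = a.nextSetBit(i + 1)) {
            for (int j = b.nextSetBit(0); j >= 0; j = b.nextSetBit(j + 1)) {
                sum += sim[i][j];
            }
        }
        return sum / ((double) a.cardinality() * b.cardinality());
    }

    // Runs the merge loop and returns one line of text per merge:
    // "<first cluster> + <second cluster> -> <union>".
    static List<String> cluster(double[][] sim) {
        // Start with one singleton BitSet per element.
        List<BitSet> clusters = new ArrayList<>();
        for (int i = 0; i < sim.length; i++) {
            BitSet singleton = new BitSet(sim.length);
            singleton.set(i);
            clusters.add(singleton);
        }

        List<String> merges = new ArrayList<>();
        while (clusters.size() > 1) {
            // Find the pair of clusters with the highest average similarity.
            int bestA = 0;
            int bestB = 1;
            double best = Double.NEGATIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double s = averageSimilarity(clusters.get(a), clusters.get(b), sim);
                    if (s > best) {
                        best = s;
                        bestA = a;
                        bestB = b;
                    }
                }
            }

            // Build the union without modifying the two original clusters yet.
            BitSet union = (BitSet) clusters.get(bestA).clone();
            union.or(clusters.get(bestB));
            merges.add(clusters.get(bestA) + " + " + clusters.get(bestB) + " -> " + union);

            // Replace one cluster by the union and remove the other.
            // bestB > bestA, so removing index bestB leaves index bestA valid.
            clusters.set(bestA, union);
            clusters.remove(bestB);
        }
        return merges;
    }

    public static void main(String[] args) {
        // Toy 4x4 matrix, invented for illustration only.
        double[][] sim = {
            {1.0, 0.9, 0.2, 0.1},
            {0.9, 1.0, 0.3, 0.2},
            {0.2, 0.3, 1.0, 0.8},
            {0.1, 0.2, 0.8, 1.0}
        };
        for (String line : cluster(sim)) {
            System.out.println(line);
        }
    }
}
```

On the toy matrix, elements 0 and 1 merge first, then 2 and 3, and finally the two resulting pairs. `BitSet.toString()` prints the indices of the set bits, which is convenient for the required output.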
Part 2.
- Put the output of your program in a *.doc file.
- What is the worst-case time complexity for finding the similarity between two clusters?
- What is the time complexity for finding the two closest clusters?
- What is the time complexity for the entire algorithm?
- In your clustering there is one time that a 5-cluster (a cluster with 5 elements) is merged with a 3-cluster. Looking back at the original virus file, and only at the year property of each virus, what property is true for all elements of the 5-cluster? What property is true for all elements of the 3-cluster?