Clustering

Misc

  • Also see Notebook, pg 57
  • Packages
    • {clustree} - For interrogating clusterings as resolution increases. Allows you to see how samples move as the number of clusters increases.
      • Builds a clustering tree that visualizes the relationships between clusterings at a range of resolutions.
  • For K-Means, the elbow method (i.e. WSS) is awful. Recommended: Calinski-Harabasz Index and BIC, then Silhouette Coefficient or Davies-Bouldin Index
  • {{sklearn.metrics.cluster}}
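    • Usage sketch (toy data via make_blobs and an arbitrary range of k; the metric functions are the scikit-learn implementations of the indices described below):

      from sklearn.cluster import KMeans
      from sklearn.datasets import make_blobs
      from sklearn.metrics import (
        calinski_harabasz_score,
        davies_bouldin_score,
        silhouette_score,
      )

      # toy data purely for illustration
      X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

      for k in range(2, 9):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(
          f"k={k}  "
          f"CH={calinski_harabasz_score(X, labels):.1f} (higher is better)  "
          f"DB={davies_bouldin_score(X, labels):.3f} (lower is better)  "
          f"silhouette={silhouette_score(X, labels):.3f} (higher is better)"
        )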

Spherical/Centroid Based

  • Within-Cluster Sum of Squares (WSS) (aka Inertia)
    • Measures the variability of the observations within each cluster. In general, a cluster that has a small sum of squares is more compact than a cluster that has a large sum of squares.
    • To calculate WCSS, you first find the squared Euclidean distance between a given point and the centroid to which it is assigned. You then repeat this for all points in the cluster and sum the squared distances; summing across clusters gives the total WSS
    • Influenced by the number of observations. As the number of observations increases, the sum of squares becomes larger. Therefore, the within-cluster sum of squares is often not directly comparable across clusters with different numbers of observations.
      • To compare the within-cluster variability of different clusters, use the average distance from centroid instead.
    • Typically used in the elbow method for k-means to choose the number of clusters
  • Average Distance from Centroid
    • The average of the distances from observations to the centroid of each cluster.
    • A measure of the variability of the observations within each cluster: a cluster with a smaller average distance is more compact, while a cluster with a larger average distance exhibits greater variability of the observations within it.
  • Silhouette Coefficient
    • Computationally expensive

    • Formula

      \[ s(i) = \frac{b(i) - a(i)}{\max \{a(i), b(i)\}},\; \text{if}\; |C_I| > 1 \]

      • Intra-Cluster Distance

        \[ a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I, i \neq j} d(i,j) \]

        • The mean distance between i and all the other data points within the same cluster \(C_I\).
        • \(|C_I|\) is the number of points in cluster \(C_I\), and \(d(i,j)\) is the distance between data points i and j within \(C_I\)
      • Inter-Cluster Distance

        \[ b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j) \]

        • The mean distance from i to all the points of its nearest neighboring cluster (the cluster \(C_J\) achieving the minimum)
    • Range: -1 to 1

    • Good: 1

    • Bad: -1

      • Indicates the point is, on average, closer to a neighboring cluster than to its own cluster, i.e. it has likely been assigned to the wrong cluster
  • Calinski-Harabasz Index (aka Variance Ratio Criterion)
    • Compares the between-cluster variance to the within-cluster variance

    • NOT to be used with density-based methods, such as mean-shift clustering, DBSCAN, OPTICS, etc.

      • Clusters from density-based methods are unlikely to be spherical, so centroid-based distances will not be very informative about the quality of the clustering
    • Much faster to compute than the Silhouette score
    • Essentially the ratio of the summed squared inter-cluster distances (cluster centroids to the global centroid) to the summed squared intra-cluster distances (points to their cluster centroid), scaled by the degrees of freedom (see the formula below)

    • Formula

      \[ CH = \frac{\sum_{k=1}^K n_k \lVert c_k - c \rVert^2}{K-1} \cdot \frac{N-K}{\sum_{k=1}^K \sum_{i=1}^{n_k} \lVert d_i - c_k \rVert^2} \]

      • \(n_k\) is the size of the kth cluster
      • \(c_k\) is the feature vector of the centroid of the kth cluster
      • \(c\) is the feature vector of the global centroid of the entire dataset
      • \(d_i\) is the feature vector of data point i
      • \(N\) is the total number of data points
    • Range: no upper bound

    • Higher is better; it indicates the clusters are dense and well separated from each other

  • Davies-Bouldin Index
    • Average similarity between each cluster and its most similar one.

    • NOT to be used with density-based methods, such as mean-shift clustering, DBSCAN, OPTICS, etc.

      • Clusters from density-based methods are unlikely to be spherical, so centroid-based distances will not be very informative about the quality of the clustering
    • Much faster to compute than the Silhouette score

    • Range: \([0, \infty)\); lower is better, with 0 being the best possible score

    • The type of norm used in the formula should probably match the distance type used in the clustering algorithm. See the wiki for details.

    • Formula

      \[ DB = \frac{1}{N} \sum_{i=1}^N D_i \]

      • The average, over all clusters, of each cluster's similarity score with its most similar (nearest neighbor) cluster

      • \(D_i\) is the ith cluster’s worst (i.e. largest) similarity score, \(R_{i,j}\) across all other clusters

        \[ D_i = \max_{j\neq i} R_{i,j} \]

      • Smaller similarity score indicates a better cluster separation

        \[ \begin{aligned} &R_{i,j} = \frac{S_i + S_j}{M_{i,j}}\\ &\begin{aligned} \text{where} \;\; &S_i = \left(\frac{1}{T_i} \sum_{j=1}^{T_i} \lVert X_j - A_i \rVert_{p}^q \right)^{\frac{1}{q}} \;\; \text{and} \\ &M_{i,j} = \lVert A_i - A_j \rVert_p = \left(\sum_{k=1}^n |a_{k,i} - a_{k,j}|^p\right)^{\frac{1}{p}} \end{aligned} \end{aligned} \]

        • \(X_j\): the jth of the \(T_i\) n-dimensional feature vectors assigned to cluster \(C_i\)
        • \(T_i\): the size of cluster \(C_i\)
        • \(A_i\): the centroid of \(C_i\); \(a_{k,i}\) is the kth of its \(n\) elements
        • \(p\): the degree of the norm; typically 2, i.e. the Euclidean norm (distance)
        • \(q\): the moment of the point-to-centroid distances used in the scatter \(S_i\). The Wikipedia article only describes the case \(q = 1\), which makes \(S_i\) the average distance between the feature vectors and the centroid, so that's probably the typical value. (A from-scratch sketch of this formula is in the Py Code bullet at the end of this section.)
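
  • Py Code: a from-scratch NumPy sketch of the metrics above (WSS / average distance from centroid, Calinski-Harabasz, and Davies-Bouldin), for illustration only; in practice use the scikit-learn implementations. X and labels are assumed inputs (feature matrix and cluster assignments), and p/q follow the Davies-Bouldin notation, with q = 1 as the typical choice discussed above.

    import numpy as np


    def wss_and_avg_dist(X, labels):
      """Per-cluster within-cluster sum of squares and average distance to the centroid."""
      X, labels = np.asarray(X), np.asarray(labels)
      out = {}
      for k in np.unique(labels):
        X_k = X[labels == k]
        d = np.linalg.norm(X_k - X_k.mean(axis=0), axis=1)  # point-to-centroid distances
        out[k] = {"wss": np.sum(d ** 2), "avg_dist": d.mean()}
      return out


    def calinski_harabasz(X, labels):
      """CH = [BCSS / (K - 1)] * [(N - K) / WCSS], following the formula above (assumes K >= 2)."""
      X, labels = np.asarray(X), np.asarray(labels)
      clusters = np.unique(labels)
      N, K = len(X), len(clusters)
      global_centroid = X.mean(axis=0)
      bcss = wcss = 0.0
      for k in clusters:
        X_k = X[labels == k]
        centroid_k = X_k.mean(axis=0)
        bcss += len(X_k) * np.sum((centroid_k - global_centroid) ** 2)  # n_k * ||c_k - c||^2
        wcss += np.sum((X_k - centroid_k) ** 2)                         # sum_i ||d_i - c_k||^2
      return (bcss / (K - 1)) * ((N - K) / wcss)


    def davies_bouldin(X, labels, p=2, q=1):
      """DB = mean over clusters of D_i = max_{j != i} (S_i + S_j) / M_ij (assumes K >= 2)."""
      X, labels = np.asarray(X), np.asarray(labels)
      clusters = np.unique(labels)
      K = len(clusters)
      centroids = np.array([X[labels == k].mean(axis=0) for k in clusters])
      # S_i: scatter of cluster i around its centroid (q = 1 -> mean p-norm distance)
      S = np.array([
        np.mean(np.sum(np.abs(X[labels == k] - centroids[i]) ** p, axis=1) ** (q / p)) ** (1 / q)
        for i, k in enumerate(clusters)
      ])
      D = np.zeros(K)
      for i in range(K):
        R_i = []
        for j in range(K):
          if j == i:
            continue
          M_ij = np.sum(np.abs(centroids[i] - centroids[j]) ** p) ** (1 / p)  # p-norm between centroids
          R_i.append((S[i] + S[j]) / M_ij)
        D[i] = max(R_i)  # worst (largest) similarity score for cluster i
      return D.mean()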

Feature Reduction

  • Bayesian Information Criterion (BIC)
    • Higher is better (the score below is a log-likelihood minus a complexity penalty)

    • A goodness-of-fit (GoF) metric; in this context it's typically used for GMMs

    • Not on the same scale as WSS

    • Py Code

      import math

      import numpy as np


      def bic_score(X, labels):
        """
        BIC score for the goodness of fit of clusters.
        This Python function is directly translated from the GoLang code made by the author of the paper.
        The original code is available here: https://github.com/bobhancock/goxmeans/blob/a78e909e374c6f97ddd04a239658c7c5b7365e5c/km.go#L778
        """

        labels = np.asarray(labels)  # ensure the boolean masking below works even if labels is a plain list
        n_points = len(labels)
        n_clusters = len(set(labels))
        n_dimensions = X.shape[1]
        # free parameters: (K - 1) cluster proportions + K * d centroid coordinates + 1 shared variance
        n_parameters = (n_clusters - 1) + (n_dimensions * n_clusters) + 1
        loglikelihood = 0
        for label_name in set(labels):
          X_cluster = X[labels == label_name]
          n_points_cluster = len(X_cluster)
          centroid = np.mean(X_cluster, axis=0)
          # unbiased variance of the cluster's points around its centroid
          # (assumes every cluster has at least 2 points)
          variance = np.sum((X_cluster - centroid) ** 2) / (len(X_cluster) - 1)
          loglikelihood += \
            n_points_cluster * np.log(n_points_cluster) \
            - n_points_cluster * np.log(n_points) \
            - n_points_cluster * n_dimensions / 2 * np.log(2 * math.pi * variance) \
            - (n_points_cluster - 1) / 2

        # penalize the log-likelihood by model complexity (number of free parameters)
        bic = loglikelihood - (n_parameters / 2) * np.log(n_points)

        return bic
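
    • Usage sketch (hypothetical; assumes X is a NumPy feature matrix): pair bic_score() with k-means and compare candidate values of k, keeping the one with the highest BIC

      from sklearn.cluster import KMeans

      bics = {}
      for k in range(2, 9):  # arbitrary candidate range
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        bics[k] = bic_score(X, labels)

      best_k = max(bics, key=bics.get)  # k with the highest (best) BIC score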