General
Misc
- Also see
- Notebook, pgs 57-58
- Diagnostics, Clustering
- “Clustering algorithms always find clusters if you tune their similarity and grouping enough. Unfortunately, clusters by themselves have no inherent meaning. Only understanding the data and the algorithm can lend meaning to the clusters. To make a clustering project well defined, start by knowing what the data means. It’s especially important to know about the process that generates the data points. Based on understanding, you can form a hypothesis about the relationships that might exist between points. An algorithm with thoughtfully chosen parameters can either confirm or deny that hypothesis.” (source)
- For static data, i.e., if the values do not change with time, clustering methods are usually divided into five major categories:
- Partitioning (or Partitional)
- Hierarchical
- Density-Based
- Grid-Based
- Model-Based Methods
- Biclustering (aka Two-Mode Clustering)
- Simultaneously clusters the rows and columns of an (\(n×p\))-dimensional data matrix \(Y\), with rows associated to \(n\) statistical units and columns consisting of \(p\) ordinal outcomes.
- Packages
- {biclustermd} - Biclustering with Missing Data
- {MixSim} - Simulating Data to Study Performance of Clustering Algorithms
- Papers
- Use Cases
- Analysis of political votes
- The voting records of \(n\) politicians on \(p\) legislative bills are only available for the sessions they attended, where absence from a session may represent a deliberate political strategy.
- Recommendation systems
- The data consists of customer (\(n\)) ratings on \(p\) items.
- In the well-known Netflix Prize data, missingness arises when users choose not to rate certain movies, potentially reflecting their lack of interest, preference, or other underlying factors
Terms
- Cluster Centroid - The middle of a cluster. A centroid is a vector that contains one number for each variable, where each number is the mean of a variable for the observations in that cluster. The centroid can be thought of as the multi-dimensional average of the cluster.
- Hard (or Crisp) Clustering - each point belongs to a single cluster
- Soft (or Fuzzy) Clustering - each point is allowed to belong to multiple clusters
Cluster Descriptions
- Packages
- {parameters} - provides various functions for describing, analyzing, and visualizing clusters for various methods
- {clustereval} - Computes the statistical association between the features and the detected cluster labels and whether the associations are significant.
- Categorical: Chi-Square, Fisher’s Exact, or Hypergeometric tests
- Continuous: Mann-Whitney-U test
- Examine variable values at the centroids of each cluster
- A higher absolute value indicates that a certain variable characteristic is more pronounced within that specific cluster (as compared to other cluster groups with lower absolute mean values).
- Distributional statistics for each cluster
- Numeric variables: mean and sd for each variable in that cluster
- Categorical variables:
- binary: percent where event = 1
- multinomial: most prominent category
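- A sketch of these per-cluster summaries, assuming a k-means solution on stand-in data (mtcars); the variable names and k are arbitrary:

```r
library(dplyr)

# Stand-in data and clusters
x  <- scale(mtcars)
km <- kmeans(x, centers = 3, nstart = 25)

mtcars %>%
  mutate(cluster = km$cluster) %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    # Numeric: mean and sd per cluster; for 0/1 columns (e.g. am, vs),
    # the mean is the percent where event = 1
    across(where(is.numeric), list(mean = mean, sd = sd))
  )
```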
- Run a decision tree on clusters
- Each color (orange, blue, green, purple) represents a cluster
- Explains how clusters were generated
- Issue
- Suboptimal Labels: A depth of 3 might be sufficient to fully represent the cluster boundaries, but CART might only achieve 100% accuracy when using a depth of 5 or 6. This is because of the greedy nature of the decision tree algorithms.
- {treeheatr}
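- A sketch of the decision-tree-on-clusters idea, using {rpart} as a stand-in CART implementation (data, variable names, and maxdepth are arbitrary):

```r
library(rpart)
library(rpart.plot)

# Stand-in data and clusters
dat <- scale(iris[, 1:4])
cl  <- kmeans(dat, centers = 3, nstart = 25)

# Fit a shallow classification tree on the cluster labels
tree_dat <- data.frame(cluster = factor(cl$cluster), dat)
fit <- rpart(cluster ~ ., data = tree_dat,
             method  = "class",
             control = rpart.control(maxdepth = 3))  # may need more depth (see Issue above)

rpart.plot(fit)  # each leaf shows the dominant cluster and the splits that define it
```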
- Radar charts
- 3 clusters: blue (highlighted), red, green
- Guessing that the points are the mean values of each variable for each cluster
- Scatter
- Use clustering variables of interest for a scatter plot then label the points with cluster id
- See K-Means >> Example 2 for code
- Interpretable Clustering
- Notes from Introduction to Interpretable Clustering
- Overview of methods with references to papers
- Instead of fitting a classifier in a supervised manner after clustering, clusters are directly constructed using a model.
- Unsupervised Clustering Trees - Trees are constructed by optimizing a variety of clustering metrics through various optimization algorithms
- Others: Specialized Neural Networks, Polytope Machines, and Rectangular Rules
Gaussian Mixture Models (GMM)
- Misc
- Soft clustering algorithm
- Notes from
- Packages
- {mclust} - Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation
- EBook: Model-Based Clustering, Classification, and Density Estimation Using mclust in R
- JOSS: Model-based Methods of Classification: Using the mclust Software in Chemometrics
- Website: mclust.org
- Lists other packages from the group that are associated with GMMs.
- {mixtools} (Vignette) - Provides computational techniques for finite mixture model analysis in which components are regressions, multinomial vectors arising from discretization of multivariate data, or even distributions that are almost completely unspecified.
- Mixtures of parametric distributions (normal, multivariate normal, multinomial, gamma)
- Various Reliability Mixture Models (RMMs),
- Mixtures-of-Regressions settings (linear regression, logistic regression, Poisson regression, linear regression with changepoints, predictor-dependent mixing proportions, random effects regressions, hierarchical mixtures-of-experts)
- Tools for selecting the number of components (bootstrapping the likelihood ratio test statistic, mixturegrams, and model selection criteria)
- Bayesian estimation of mixtures-of-linear-regressions models is available as well as a novel data depth method for obtaining credible bands.
- {EMCluster} - EM Algorithm for Model-Based Clustering of Finite Mixture Gaussian Distribution
- Much of it is in C, so it should be fast
- {flexmix} - A general framework for finite mixtures of regression models using the EM algorithm is implemented.
- Existing drivers implement mixtures of standard linear models, generalized linear models and model-based clustering.
- Video: Extending flexmix to model-based clustering with sparse data
- {otrimle}
- Uses Improper Maximum Likelihood Estimator Clustering (IMLEC) method
- Hyperparameters automatically tuned; Outliers removed
- Robust gaussian mixture clustering algorithm
- Webpage has links to paper, Coretto and Hennig, 2016
- {pivmet} (JOSS) - Pivotal methods for consensus clustering in k-means and mixture modeling. Method mitigates label-switching problem that occurs during Bayesian estimation of mixture models
- {foldcluster} (Paper) - Bayesian Clustering via Fusing of Localized Densities
- Addresses GMM kernel misspecification problem. Clustering results are highly sensitive to kernel misspecification. For example, if Gaussian kernels are used but the true density of data within a cluster is even slightly non-Gaussian, then clusters will be broken into multiple Gaussian components.
- {plotmm} - Tidy Tools for Visualizing Mixture Models. Currently supports: mixtools, flexmix, and EMCluster models
- Resources
- Components of the Algorithm
- “Color” points according to gaussians (clusters)
- Fitting a Gaussian
- Find the center of mass
- 2-dim: calculate the mean of x and the mean of y and that’s the coordinates of your center of mass
- Find the spread of the points
- Steps
- Start with random Gaussians
- Each gaussian has random means, variances
- Color points according to distance to the random gaussians
- The heights in the distributions pic above
- (Forget about old gaussians) Calculate new gaussians based on the colored points
- (Forget about old colors) Color points according to distance to the new gaussians
- Repeat until some threshold is reached (i.e. gaussians or colors don’t change much)
- Tuning
- Initial Conditions (i.e. Good starting points for the random gaussians at the beginning)
- Limits on the mean and variance calculations
- Number of gaussians, k, can be chosen by minimizing the Davies-Bouldin score
- See Diagnostics, Clustering >> Spherical/Centroid Based >> Davies-Bouldin Index
- Running algorithm multiple times
- Like CV grid search algs or bootstrapping
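- A minimal sketch of this workflow with {mclust} (listed under Packages). Note that mclust chooses the number of gaussians by BIC rather than the Davies-Bouldin score, and faithful is just stand-in data:

```r
library(mclust)

# Stand-in data: 2-D numeric (eruptions, waiting)
x <- faithful

# EM over a range of candidate numbers of gaussians; BIC picks G
fit <- Mclust(x, G = 1:9)

summary(fit)              # chosen G and covariance model
head(fit$z)               # soft assignments: posterior probability of each gaussian per point
head(fit$classification)  # hard labels (max probability)
plot(fit, what = "classification")
```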
Hierarchical
- Misc
- Packages
- {heatmaply} (Vignette, Examples) - Visualizes Hierarchical Clustering via Heatmap
- Options for the clustering function for both axes or each individual axis.
- Default is `stats::hclust`, but the introductory vignette also showed that `fastcluster::hclust` can be used. I didn't see anything other than the default used in the other examples.
- Distance methods (through `hclust`): "euclidean" (Default), "maximum", "manhattan", "canberra", "binary", or "minkowski"
- Linkage methods (through `hclust`): "complete" (Default), "ward.D", "ward.D2", "single", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC), or "centroid" (= UPGMC)
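- For reference, a minimal sketch of the same distance/linkage choices with base `stats::hclust` (mtcars as stand-in data):

```r
# Same distance/linkage options as listed above, via base R
x  <- scale(mtcars)                   # stand-in numeric data

d  <- dist(x, method = "euclidean")   # distance method
hc <- hclust(d, method = "ward.D2")   # linkage method

plot(hc)                              # dendrogram
clusters <- cutree(hc, k = 3)         # cut the tree into 3 clusters
table(clusters)
```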
- Examples
Example: {heatmaply} (source)
Code
```r
library(heatmaply)
library(ggplot2)  # for theme() / element_blank()

p <- heatmaply(
  mat,
  # dendrogram = "row",
  xlab = "",
  ylab = "",
  main = "",
  scale = "column",
  margins = c(60, 100, 40, 20),
  grid_color = "white",
  grid_width = 0.00001,
  titleX = FALSE,
  hide_colorbar = TRUE,
  branches_lwd = 0.1,
  label_names = c("Country", "Feature:", "Value"),
  fontsize_row = 5,
  fontsize_col = 5,
  labCol = colnames(mat),
  labRow = rownames(mat),
  heatmap_layers = theme(axis.line = element_blank())
)

# save the widget
# library(htmlwidgets)
# saveWidget(p, file = "~/Desktop/R-graph-gallery/HtmlWidget/heatmapInter.html")
```
- For the data, see the first example in the source article
- The right dendrogram checks which countries tend to have the same features on their numeric variables, and therefore which countries are similar
- Burundi and Angola (bottom) are grouped together. This grouping can be examined using the heatmap where it’s shown they are two countries in strong expansion, with a lot of children per woman but still a strong mortality rate.
- The top dendrogram clusters the features
- It shows birth rate and children per woman (left) are grouped together since they are highly correlated.
Latent Profile Analysis (LPA)
- Sort of like K-Means + GMM
- k, the number of profiles (i.e. clusters), is chosen
- Model outputs probabilities that an observation belongs to any particular cluster
- GOF metrics available
- “As with Exploratory Factor Analysis (EFA) (and other latent-variable models), the assumption of LPA is that the latent (unobserved) factor ‘causes’ (I’m using the term loosely here) observed scores on the indicator variables. So, to refer back to my initial hypothetical example, a monster being a spell caster (the unobserved class) causes it to have high intelligence, low strength, etc. rather than the inverse. This is a worthwhile distinction to keep in mind, since it has implications for how the model is fit.”
- Bin variables that might dominate the profile. This way the profiles will represent a latent variable and not gradations of the dominant variable (e.g. low, middle, and high values of the dominant variable).
- Center the other variables according to the dominant-variable bin their observations fall in (e.g. subtract bin 1’s mean from the values in bin 1).
```r
# From the D&D article where challenge_rating is a likely dominant variable
library(dplyr)

mons_bin <- mons_df %>%
  mutate(cr_bin = ntile(x = challenge_rating, n = 6))

ab_scores <- c("strength", "dexterity", "constitution",
               "intelligence", "wisdom", "charisma")

# Bin means of the ability scores within each challenge-rating bin
mons_bin <- mons_bin %>%
  group_by(cr_bin) %>%
  mutate(across(.cols = all_of(ab_scores),
                .fns = mean,
                .names = "{.col}_bin_mean")) %>%
  ungroup()
```
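A sketch of fitting the profiles themselves, assuming the {tidyLPA} package (not referenced above) and the ability-score columns from the code above (the article would center them on the bin means first):

```r
# Assumes {tidyLPA}; estimate_profiles() fits the latent profiles ({mclust} under the hood)
library(tidyLPA)
library(dplyr)

lpa_fit <- mons_bin %>%
  select(all_of(ab_scores)) %>%       # indicator variables
  estimate_profiles(n_profiles = 3)   # choose k = 3 profiles

get_fit(lpa_fit)   # GOF metrics (AIC, BIC, entropy, ...)
get_data(lpa_fit)  # per-observation class probabilities and assignments
```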
tSNE
- Packages
- {Rtsne}
- t-Distributed Stochastic Neighbor Embedding
- Looks at the local distances between points in the original data space and tries to reproduce them in the low-dimensional representation
- Both UMAP and tSNE attempt to do this but fail (Lior Pachter paper thread: doesn’t preserve local structure; no theorem says that it preserves topology)
- Results depend on a random starting point
- Tuning parameter: perplexity
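- A minimal {Rtsne} sketch (iris as stand-in data; the seed matters because results depend on the random start):

```r
library(Rtsne)

x <- scale(as.matrix(iris[, 1:4]))  # stand-in numeric data

set.seed(1)  # results depend on the random starting point
tsne_out <- Rtsne(x, perplexity = 30, check_duplicates = FALSE)

plot(tsne_out$Y, col = iris$Species, pch = 19,
     xlab = "tSNE 1", ylab = "tSNE 2")
```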
UMAP
- Packages:
- {umap}
- {scDEED} (article) - Detects Dubious t-SNE and UMAP Embeddings and Optimizes Hyperparameters
- scDEED assigns a reliability score to each 2D embedding to indicate how much the data point’s mid-range neighbors change in the 2D space. Observations whose 2D embedding neighbors have been drastically changed through the embedding process are called ‘dubious.’
- {SaturnCoefficient} - A metric expressing the quality of a UMAP layout
- Range: [0, 1]
- Higher value means better dimensionality reduction
- Uniform Manifold Approximation and Projection (a variation of tSNE), which projects the data into a nonlinear, low-dimensional space
- Issues
- See tSNE section for Lior Pachter threads on why not to use tSNE or UMAP
- Two dimensions are not enough to predict with high accuracy. This isn’t an issue if our goal is exploratory data analysis (EDA). However, if we over-interpret this as evidence of discrete clusters defined solely by their position in the two-dimensional scatter plot, we risk drawing incorrect conclusions.
- There are no weights, like PCA loadings, to determine which variables contribute the most to each dimension.
- Computationally intensive
- Preprocessing
- Only for numeric variables
- Standardize
- Differences with tSNE
- Random starting point has less of an impact than in tSNE
- Can be supervised (give it an outcome variable)
- UMAP can take a training model and apply it to test data or new data
- Try PCA first
- If successful (good separation between categories), then prediction may be easier
- If not, UMAP or tSNE is needed
- Tuning parameter: neighbors
- Example used 500 iterations (n_epochs) as limit for convergence
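- A minimal {umap} sketch tying these notes together (standardized numeric stand-in data; n_neighbors and n_epochs are the tuning settings mentioned above):

```r
library(umap)

x <- scale(as.matrix(iris[, 1:4]))  # numeric only, standardized (see Preprocessing)

set.seed(1)
umap_fit <- umap(x, n_neighbors = 15, n_epochs = 500)

head(umap_fit$layout)  # the 2-D embedding

# Apply the trained model to new/test data
new_embedding <- predict(umap_fit, x[1:5, , drop = FALSE])
```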
K-Means
- Seeks to assign n points to k clusters and find cluster centers so as to minimize the sum of squared distances from each point to its cluster center.
- Packages
- {tidyclust} - Clustering for tidymodels
- Engines
- stats and ClusterR run classical K-means
- klaR runs K-Modes models, which are the categorical analog to K-Means, meaning they are intended to be used only on categorical data
- clustMixType runs K-Prototypes, the more general method that works with categorical and numeric data at the same time.
- {ClusterR} - Consists of Gaussian mixture models, k-means, mini-batch-kmeans, k-medoids and affinity propagation clustering algorithms with the option to plot, validate, predict (new data) and find the optimal number of clusters.
- {pivmet} (JOSS) - Pivotal methods for consensus clustering in k-means and mixture modeling
- {fkms} (Paper) - k-means clustering for sparsely observed longitudinal data (i.e. repeated measures of multiple numeric variables over time)
- Normal k-means or mixture models can’t be applied to unbalanced/irregular designs where time and measurement differ for each subject. These methods also ignore autocorrelation.
- Also applicable to densely observed data
- Employs the basis function expansion (i.e. splines) to model the cluster centers
- {sparcl} - Sparse Hierarchical Clustering and Sparse K-Means Clustering
- {clusterHD} (Paper) - Regularized k-Means through hard thresholding
- Uses a \(\mathcal{L}_0\) penalty to induce sparsity in the variables
- Papers
- Wasserstein k-Centres Clustering for Distributional Data
- Distributional data arises when each data point can be regarded as a probability distribution
- Examples of a distributional data set include age distributions of countries and house price distributions of cities.
- The real data example classified 84 (= 42 Austrian districts × 2 genders (men and women)) age distributions into two groups with several clustering methods and determined whether the clustering results reflect gender differences.
- For choosing the number of clusters, the elbow method (i.e. WSS) is usually awful if there are more than a few clusters. Recommended: Calinski-Harabasz Index and BIC, then Silhouette Coefficient or Davies-Bouldin Index (See Diagnostics, Clustering >> Spherical/Centroid Based) (article)
- Base R `kmeans` uses the Hartigan-Wong algorithm (see the sketch below)
- For large k and larger n, the density of cluster centers should be proportional to the density of the points to the power d/(d+2). In other words, the distribution of clusters found by k-means should be more spread out than the distribution of points. This is not in general achieved by commonly used iterative schemes, which stay stuck close to the initial choice of centers.
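- A quick sketch of base `kmeans` plus the Calinski-Harabasz recommendation above, computed by hand from the kmeans output (iris as stand-in data):

```r
# Compare candidate k by the Calinski-Harabasz index
set.seed(42)
x <- scale(iris[, 1:4])   # stand-in numeric data
n <- nrow(x)

ch <- sapply(2:8, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)  # Hartigan-Wong by default
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
})
names(ch) <- 2:8
ch  # pick the k with the largest index
```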
- Issues
- Inefficient in distinguishing between groups of unbalanced sizes. See Packages >> {pivmet} which uses consensus clustering to help mitigate this.
- {tidyclust}
Example 1: Mixed K-Means
Model and Extract Centroids
```r
library(tidymodels)
library(tidyclust)

data("ames", package = "modeldata")

kproto_spec <- k_means(num_clusters = 3) %>%
  set_engine("clustMixType")

kproto_fit <- kproto_spec %>%
  fit(~ ., data = ames)

kproto_fit %>%
  extract_centroids() %>%
  select(11:20) %>%
  glimpse()
#> Rows: 3
#> Columns: 10
#> $ Lot_Config     <fct> Inside, Inside, Inside
#> $ Land_Slope     <fct> Gtl, Gtl, Gtl
#> $ Neighborhood   <fct> College_Creek, North_Ames, Northridge_Heights
#> $ Condition_1    <fct> Norm, Norm, Norm
#> $ Condition_2    <fct> Norm, Norm, Norm
#> $ Bldg_Type      <fct> OneFam, OneFam, OneFam
#> $ House_Style    <fct> Two_Story, One_Story, One_Story
#> $ Overall_Cond   <fct> Average, Average, Average
#> $ Year_Built     <dbl> 1989.977, 1953.793, 1998.765
#> $ Year_Remod_Add <dbl> 1995.934, 1972.973, 2003.035
```
Example 2: Classic K-Means (article, code)
Preprocess and Fit
```r
library(tidymodels)
library(tidyclust)

recipe_kmeans <- recipe(~ ., data = data_prep_tbl) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_rm("ID")

recipe_kmeans %>% prep() %>% juice() %>% glimpse()

model_kmeans <- k_means(num_clusters = 4) %>%
  set_engine("stats")

set.seed(123)

wflw_fit_kmeans <- workflow() %>%
  add_model(model_kmeans) %>%
  add_recipe(recipe_kmeans) %>%
  fit(data_prep_tbl)
```
Visualize
```r
library(tidyquant)  # scale_fill_tq(), theme_tq()
library(plotly)     # ggplotly()

g <- data_prep_tbl %>%
  bind_cols(extract_cluster_assignment(wflw_fit_kmeans), .) %>%
  ggplot(aes(Spent, Income)) +
  geom_point(
    aes(fill = .cluster),
    shape = 21,
    alpha = 0.3,
    size  = 5
  ) +
  geom_smooth(color = "blue", se = FALSE) +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(
    labels = scales::dollar_format(),
    limits = c(0, 200000)
  ) +
  labs(title = "Customer Clusters: Spent vs Income") +
  scale_fill_tq() +
  theme_tq()

ggplotly(g)
```
- See Cluster Descriptions >> Scatter for the output
DBSCAN
- Misc
- Notes from:
- Understanding DBSCAN and Implementation with Python
- Clustering with DBSCAN, Clearly Explained video
- Packages
- {dbscan} - A fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data
- DBSCAN: Density-based spatial clustering of applications with noise
- Jarvis-Patrick Clustering: Clustering using a similarity measure based on shared near neighbors
- SNN Clustering: Shared nearest neighbor clustering
- HDBSCAN: Hierarchical DBSCAN with simplified hierarchy extraction
- FOSC: Framework for optimal selection of clusters for unsupervised and semisupervised clustering of hierarchical cluster tree
- OPTICS/OPTICSXi: Ordering points to identify the clustering structure and cluster extraction methods
- {parameters}
- `n_clusters_dbscan` - Given a “min_size” (aka minPts?), the function estimates the optimal “eps”
- `cluster_analysis` - Shows Sum of Squares metrics and the (standardized) mean value for each variable within each cluster.
- HDBSCAN is the hierarchical density-based clustering algorithm
- Use Cases
- Geospatially Clustering Earthquakes
- Events can occur in irregular shaped clusters (i.e., along faults of different orientations).
- Events can occur in different densities (i.e. some fault zones are more active than others).
- Events can occur far away from fault zones (i.e. outliers)
- Tuning
- eps - The maximum distance between two samples for one to be considered to be connected to the other
- A large eps tends to include more points within a cluster
- Too-large eps will include everything in the same single cluster
- Too-small eps will result in no clustering at all
- minPts (or min_samples) - The minimum number of samples in a neighborhood for a point to be considered as a core point
- Too-small minPts is not meaningful because it will regard every point as a core point.
- Larger minPts can be better to deal with noisy data
- Algorithm
- For each data point, find the points in the neighborhood within eps distance, and define the core points as those with at least minPts neighbors.
- The orange circle represents the eps area
- If minPts = 4, then the top 4 points are core points because they have at least 4 points overlapping the eps area
- Define groups of connected core points as clusters.
- All the green points have been labelled as core points
- Assign each non-core point to a nearby cluster if it’s directly reachable from a neighboring core point, otherwise define it as an outlier.
- The black points are non-core points but are points that overlap the eps area for the outer-most core points.
- Adding these black points finalizes the first cluster
- This process is repeated for the next group of core points and continues until all that’s left are outliers.
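- A minimal sketch with {dbscan} (listed under Packages); iris is stand-in data and the eps value is just an assumed knee from the k-NN distance plot:

```r
library(dbscan)

x <- as.matrix(iris[, 1:4])  # stand-in numeric data

# Heuristic for eps: look for the knee in the k-NN distance plot
kNNdistplot(x, k = 5)
abline(h = 0.5, lty = 2)     # assumed knee around 0.5

db <- dbscan(x, eps = 0.5, minPts = 5)
db$cluster  # 0 = noise/outlier, 1..k = cluster labels
```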
- Advantages
- Doesn’t require users to specify the number of clusters.
- Not sensitive to outliers.
- Clusters formed by DBSCAN can be any shape, which makes it robust to different types of data.
- Disadvantages
- Struggles if the data has a very large variation in densities across clusters, because you can only use one pair of parameters, eps and minPts, on one dataset
- It could be hard to define eps without the domain knowledge of the data
- Clusters not totally reproducible. Clusters are defined sequentially so depending on which group of core points the algorithm starts with and hyperparameter values, some non-core points that are within the eps area of multiple clusters may be assigned to different clusters on different runs of the algorithm.
Mixed Variable Types
- Packages
- {DIBclust} (see paper) - Deterministic Information Bottleneck (DIB) clustering
- Preserves the most relevant information while forming concise and interpretable clusters, guided by principles from information theory
- {kamila} - KAMILA clustering, a novel method for clustering mixed-type data in the spirit of k-means clustering.
- It does not require dummy coding of variables, and is efficient enough to scale to rather large data sets.
- {FactoMineR} - Multivariate Exploratory Data Analysis and Data Mining
- Has function for FAMD with K-Means approach
- {clustMixType} (Vignette) - Performs k-prototypes partitioning clustering for mixed variable-type data
- {cluster} - Can use for PAM w/Gower’s Dissimilarity mixed type clustering
- First compute the dissimilarity matrix using `daisy(df, metric = "gower")`. Then use that dissimilarity matrix as input for `pam(matrix, k, diss = TRUE)`.
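- A minimal sketch of the PAM + Gower route with {cluster} (iris as stand-in mixed-type data: 4 numeric columns plus a factor):

```r
library(cluster)

df <- iris

# Gower dissimilarity handles numeric and categorical columns together
d_gower <- daisy(df, metric = "gower")

# PAM on the precomputed dissimilarity matrix
pam_fit <- pam(d_gower, k = 3, diss = TRUE)

table(pam_fit$clustering)
pam_fit$id.med  # row indices of the medoid observations
```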
- Papers
- A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data
- Introduces {DIBclust}
- Compares the Deterministic Information Bottleneck (DIB) method to KAMILA, K-Prototypes, Factor Analysis for Mixed Data (FAMD) with K-Means, and PAM using Gower’s dissimilarity
- Used Adjusted Rand Index to compare cluster method results with the ground truth.
- Simulation Data
- DIBmix outperforms the other methods in a majority of scenarios with the only exception being when the proportion of categorical variables is high. Alternative bandwidth choices could potentially enhance performance in these cases.
- FAMD performed best with a high proportion of categorical variables probably due to its data reduction step.
- It was only slightly better than DIBmix. The other methods were substantially worse.
- Unlike the other methods, DIBmix effectively handles datasets with unbalanced cluster sizes.
- This robustness arises from the entropy term in its objective function, which minimises for imbalanced clusters, thus allowing exploration of diverse partition structures
- Real Data Sets
- Out of 10 datasets from the UCI Repository, DIBmix and K-Prototypes each had 4 best scores; KAMILA and PAM each had 1. Given the characteristics of the datasets, the results were in line with the results of the simulated data.
- A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data