General
Misc
- Also see
- Notebook, pgs 57-58
- Diagnostics, Clustering
- “Clustering algorithms always find clusters if you tune their similarity and grouping enough. Unfortunately, clusters by themselves have no inherent meaning. Only understanding the data and the algorithm can lend meaning to the clusters. To make a clustering project well defined, start by knowing what the data means. It’s especially important to know about the process that generates the data points. Based on understanding, you can form a hypothesis about the relationships that might exist between points. An algorithm with thoughtfully chosen parameters can either confirm or deny that hypothesis.” (source)
- For static data, i.e., if the values do not change with time, clustering methods are usually divided into five major categories:
- Partitioning (or Partitional)
- Hierarchical
- Density-Based
- Grid-Based
- Model-Based Methods
- Biclustering (aka Two-Mode Clustering)
- Simultaneously clusters the rows and columns of an (\(n×p\))-dimensional data matrix \(Y\), with rows associated to \(n\) statistical units and columns consisting of \(p\) ordinal outcomes.
- Packages
- {biclustermd} - Biclustering with Missing Data
- Papers
- Use Cases
- Analysis of political votes
- The voting records of \(n\) politicians on \(p\) legislative bills are only available for the sessions they attended, where absence from a session may represent a deliberate political strategy.
- Recommendation systems
- The data consists of customer (\(n\)) ratings on \(p\) items.
- In the well-known Netflix Prize data, missingness arises when users choose not to rate certain movies, potentially reflecting their lack of interest, preference, or other underlying factors
Terms
- Cluster Centroid - The middle of a cluster. A centroid is a vector that contains one number for each variable, where each number is the mean of a variable for the observations in that cluster. The centroid can be thought of as the multi-dimensional average of the cluster.
- Hard (or Crisp) Clustering - each point belongs to a single cluster
- Soft (or Fuzzy) Clustering - each point is allowed to belong to multiple clusters
Cluster Descriptions
- Packages
- {parameters} - provides various functions for describing, analyzing, and visualizing clusters for various methods
- {clustereval} - Computes the statistical association between the features and the detected cluster labels and whether those associations are significant.
- Categorical: Chi-Square, Fisher’s Exact, or Hypergeometric tests
- Continuous: Mann-Whitney-U test
- Examine variable values at the centroids of each cluster
- A higher absolute value indicates that a certain variable characteristic is more pronounced within that specific cluster (as compared to other cluster groups with lower absolute mean values).
- Distributional statistics for each cluster
- Numeric variables: mean and sd for each variable in that cluster
- Categorical variables:
- binary: percent where event = 1
- multinomial: most prominent category
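- A possible {dplyr} sketch of these per-cluster summaries (the toy data frame and column names are made up for illustration):

```r
# Toy data: a factor of cluster assignments plus numeric, binary, and
# multinomial variables (all simulated; not from the source)
library(dplyr)

df <- tibble(
  cluster  = factor(sample(1:3, 100, replace = TRUE)),
  income   = rnorm(100, 50000, 10000),
  is_owner = rbinom(100, 1, 0.4),
  region   = sample(c("N", "S", "E", "W"), 100, replace = TRUE)
)

df %>%
  group_by(cluster) %>%
  summarise(
    across(income, list(mean = mean, sd = sd)),                    # numeric: mean and sd
    pct_owner  = mean(is_owner == 1),                              # binary: percent where event = 1
    top_region = names(sort(table(region), decreasing = TRUE))[1]  # multinomial: most prominent category
  )
```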
- Run a decision tree on clusters
- Each color (orange, blue, green, purple) represents a cluster
- Explains how clusters were generated
- Issue
- Suboptimal Labels: A depth of 3 might be sufficient to fully represent the cluster boundaries, but CART might only achieve 100% accuracy when using a depth of 5 or 6. This is because of the greedy nature of the decision tree algorithms.
- {treeheatr}
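- A possible sketch of this approach with {rpart}: fit a shallow CART to predict cluster labels, then read the splits as descriptions of the clusters (package choice, data, and cluster labels are assumptions, not from the source):

```r
library(rpart)
library(rpart.plot)

# Example cluster labels from a quick k-means run on a built-in dataset
df <- mtcars[, c("mpg", "hp", "wt", "disp")]
df$cluster <- factor(kmeans(scale(df), centers = 3)$cluster)

# Shallow tree predicting the cluster label from the clustering variables
tree_fit <- rpart(cluster ~ ., data = df, method = "class",
                  control = rpart.control(maxdepth = 4, cp = 0))

rpart.plot(tree_fit)   # each path to a leaf describes a cluster in terms of the variables
mean(predict(tree_fit, type = "class") == df$cluster)  # fidelity of the tree to the cluster labels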
- Radar charts
- 3 clusters: blue (highlighted), red, green
- The plotted points appear to be the mean value of each variable within the cluster
- Scatter
- Use clustering variables of interest for a scatter plot then label the points with cluster id
- See K-Means >> Example 2 for code
- Interpretable Clustering
- Notes from Introduction to Interpretable Clustering
- Overview of methods with references to papers
- Instead of fitting a classifier in a supervised manner after clustering, clusters are directly constructed using a model.
- Unsupervised Clustering Trees - Trees are constructed by optimizing a variety of clustering metrics through various optimization algorithms
- Others: Specialized Neural Networks, Polytope Machines, and Rectangular Rules
Gaussian Mixture Models (GMM)
- Misc
- Soft clustering algorithm
- Notes from
- Packages
- {mclust} - Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation
- Old package that’s built as a companion to a book (which I don’t have and it’s not freely available). It has a vignette, but it just illustrates the available functions. There aren’t any descriptions, workflows, or interpretations.
- There’s an article in JOSS, Model-based Methods of Classification: Using the mclust Software in Chemometrics, which seems to give a little more idea on how to use the package.
- There’s also a website, mclust.org, which lists other packages from the group that are associated with GMMs. Their documentation isn’t great as well, but I haven’t checked the papers associated with them to see if they’d be much help.
- {otrimle}
- Uses Improper Maximum Likelihood Estimator Clustering (IMLEC) method
- Hyperparameters automatically tuned; Outliers removed
- Robust gaussian mixture clustering algorithm
- Webpage has links to paper, Coretto and Hennig, 2016
- {pivmet} (JOSS) - Pivotal methods for consensus clustering in k-means and mixture modeling. Method mitigates label-switching problem that occurs during Bayesian estimation of mixture models
- {foldcluster} (Paper) - Bayesian Clustering via Fusing of Localized Densities
- Addresses GMM kernel misspecification problem. Clustering results are highly sensitive to kernel misspecification. For example, if Gaussian kernels are used but the true density of data within a cluster is even slightly non-Gaussian, then clusters will be broken into multiple Gaussian components.
- Components of the Algorithm
- “Color” points according to gaussians (clusters)
- Fitting a Gaussian
- Find the center of mass
- 2-dim: calculate the mean of x and the mean of y and that’s the coordinates of your center of mass
- Find the spread of the points (i.e. the variances/covariances)
- Steps
- Start with random Gaussians
- Each gaussian has random means, variances
- Color points according to distance to the random gaussians
- The heights in the distributions pic above
- (Forget about old gaussians) Calculate new gaussians based on the colored points
- (Forget about old colors) Color points according to distance to the new gaussians
- Repeat until some threshold is reached (i.e. gaussians or colors don’t change much)
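- A minimal sketch of these steps (the EM algorithm) for a 1-D, two-component mixture; illustrative only, in practice use an established package like {mclust}:

```r
# Simulated 1-D data from two gaussians
set.seed(42)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(150, mean = 5, sd = 1.5))

# Start with random gaussians (random means, pooled sd, equal weights)
mu    <- sample(x, 2)
sigma <- c(sd(x), sd(x))
pi_k  <- c(0.5, 0.5)

for (iter in 1:200) {
  # "Color" the points: responsibility of each gaussian for each point (E-step)
  dens <- cbind(pi_k[1] * dnorm(x, mu[1], sigma[1]),
                pi_k[2] * dnorm(x, mu[2], sigma[2]))
  resp <- dens / rowSums(dens)

  # Forget the old gaussians: refit mean/sd/weight from the colored points (M-step)
  mu_new    <- colSums(resp * x) / colSums(resp)
  sigma_new <- sqrt(sapply(1:2, function(k) {
    sum(resp[, k] * (x - mu_new[k])^2) / sum(resp[, k])
  }))
  pi_new <- colMeans(resp)

  # Stop once the gaussians no longer change much
  if (max(abs(mu_new - mu)) < 1e-6) break
  mu <- mu_new; sigma <- sigma_new; pi_k <- pi_new
}

mu; sigma; pi_k  # estimated means, sds, and mixing weights
```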
- Tuning
- Initial Conditions (i.e. Good starting points for the random gaussians at the beginning)
- Limits on the mean and variance calculations
- Number of gaussians, k, can be chosen by minimizing the Davies-Bouldin score
- See Diagnostics, Clustering >> Spherical/Centroid Based >> Davies-Bouldin Index
- Running algorithm multiple times
- Like CV grid search algs or bootstrapping
Latent Profile Analysis (LPA)
- Sort of like k-means + GMM
- k number of profiles (i.e. clusters) are chosen
- Model outputs probabilities that an observation belongs to any particular cluster
- GOF metrics available
- "As with Exploratory Factor Analysis (EFA) (and other latent-variable models), the assumption of LPA is that the latent (unobserved) factor 'causes' (I'm using the term loosely here) observed scores on the indicator variables. So, to refer back to my initial hypothetical example, a monster being a spell caster (the unobserved class) causes it to have high intelligence, low strength, etc. rather than the inverse. This is a worthwhile distinction to keep in mind, since it has implications for how the model is fit."
- Bin variables that might dominate the profile. This way the profiles will represent a latent variable and not gradations of the dominant variable (e.g. low, middle, high values of the dominant variable).
- Center the other variables' observations according to the dominant-variable bin those observations fall in (e.g. subtract bin 1's mean from the values in bin 1).
```r
# From the D&D article where challenge_rating is a likely dominant variable
mons_bin <- mons_df %>%
  mutate(cr_bin = ntile(x = challenge_rating, n = 6))

ab_scores <- c("strength", "dexterity", "constitution",
               "intelligence", "wisdom", "charisma")

mons_bin <- mons_bin %>%
  group_by(cr_bin) %>%
  mutate(across(.cols = all_of(ab_scores), .fns = mean, .names = "{.col}_bin_mean")) %>%
  ungroup()
```
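- The notes don't show the profile-fitting step itself; a possible continuation with {tidyLPA} (an assumed package choice, shown here on built-in data for illustration):

```r
library(dplyr)
library(tidyLPA)

# Fit k = 3 profiles on standardized indicator variables
lpa_fit <- iris %>%
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) %>%
  scale() %>%
  as.data.frame() %>%
  estimate_profiles(n_profiles = 3)

get_fit(lpa_fit)    # GOF metrics (LogLik, AIC, BIC, entropy, ...)
get_data(lpa_fit)   # per-observation class probabilities and profile assignments
```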
tSNE
- Packages
- {Rtsne}
- t-Distributed Stochastic Neighbor Embedding
- Looks at the local distances between points in the original data space and tries to reproduce them in the low-dimensional representation
- Both UMAP and tSNE attempt to do this but fail (Lior Pachter paper thread: doesn't preserve local structure; no theorem says that it preserves topology)
- Results depend on a random starting point
- Tuning parameter: perplexity
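- A minimal {Rtsne} sketch (dataset, seed, and parameter values are illustrative):

```r
library(Rtsne)

x <- unique(scale(as.matrix(iris[, 1:4])))  # Rtsne requires no duplicate rows

set.seed(42)                                # results depend on the random starting point
fit <- Rtsne(x, dims = 2, perplexity = 30)  # perplexity is the main tuning parameter

plot(fit$Y, xlab = "tSNE 1", ylab = "tSNE 2")  # low-dimensional embedding
```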
UMAP
- Packages:
- {umap}
- {scDEED} (article) - Detects Dubious t-SNE and UMAP Embeddings and Optimizes Hyperparameters
- scDEED assigns a reliability score to each 2D embedding to indicate how much the data point’s mid-range neighbors change in the 2D space. Observations whose 2D embedding neighbors have been drastically changed through the embedding process are called ‘dubious.’
- Uniform Manifold Approximation and Projection
- See tSNE section for Lior Pachter threads on why not to use tSNE or UMAP
- Preprocessing
- Only for numeric variables
- Standardize
- Projects variables to a nonlinear space
- Variation of tSNE
- Random starting point has less of an impact
- Can be supervised (give it an outcome variable)
- Computationally intensive
- Low-dimensional embedding cannot be interpreted
- No rotation matrix plot like in PCA
- Try PCA first - a linear method (fast)
- If successful (good separation between categories), then prediction may be easier
- If not, UMAP or tSNE is needed
- UMAP can take a trained model and apply it to test data or new data (tSNE can't)
- Tuning parameter: neighbors
- Example used 500 iterations (n_epochs) as limit for convergence
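- A minimal {umap} sketch showing the tuning parameters above and the train-then-project workflow (data and parameter values are illustrative):

```r
library(umap)

num_cols  <- scale(iris[, 1:4])         # numeric variables only, standardized
train_idx <- sample(nrow(num_cols), 100)

cfg <- umap.defaults
cfg$n_neighbors <- 15                   # "neighbors" tuning parameter
cfg$n_epochs    <- 500                  # iteration limit, as in the example above

fit <- umap(num_cols[train_idx, ], config = cfg)
head(fit$layout)                        # 2-D embedding of the training data

# Unlike tSNE, the fitted model can be applied to held-out or new data
new_embedding <- predict(fit, num_cols[-train_idx, ])
```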
K-Means
- Seeks to assign n points to k clusters and find cluster centers so as to minimize the sum of squared distances from each point to its cluster center.
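- In symbols: minimize \(\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2\) over the cluster assignments \(C_1, \dots, C_k\) and centers \(\mu_1, \dots, \mu_k\), where \(\mu_j\) is the centroid of cluster \(C_j\).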
- Packages
- {tidyclust} - Clustering for tidymodels
- Engines
- stats and ClusterR run classical K-means
- klaR runs K-Modes models, which are the categorical analog to K-means, meaning it is intended to be used on only categorical data
- clustMixType runs K-Prototypes, which is the more general method that works with categorical and numeric data at the same time.
- {pivmet} (JOSS) - Pivotal methods for consensus clustering in k-means and mixture modeling
- {fkms} (Paper) - k-means clustering for sparsely observed longitudinal data (i.e. repeated measures of multiple numeric variables over time)
- Normal k-means or mixture models can’t be applied to unbalanced/irregular designs where time and measurement differ for each subject. These methods also ignore autocorrelation.
- Also applicable to densely observed data
- Employs the basis function expansion (i.e. splines) to model the cluster centers
- For choosing the number of clusters, the elbow method (i.e. WSS) is usually awful if there are more than a few clusters. Recommended: Calinski-Harabasz Index and BIC, then Silhouette Coefficient or Davies-Bouldin Index (See Diagnostics, Clustering >> Spherical/Centroid Based) (article)
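- A possible sketch of scanning k with the Calinski-Harabasz index via fpc::calinhara() (one implementation option; the linked article may use another):

```r
library(fpc)

x <- scale(iris[, 1:4])   # standardized numeric clustering variables

# Fit k-means for a range of k and record the Calinski-Harabasz index
ch <- sapply(2:8, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  calinhara(x, km$cluster)
})

data.frame(k = 2:8, calinski_harabasz = ch)  # pick the k with the highest index
```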
- Base R
- kmeans uses the Hartigan-Wong algorithm
- For large k and larger n, the density of cluster centers should be proportional to the density of the points to the power d/(d+2). In other words, the distribution of clusters found by k-means should be more spread out than the distribution of points. This is not in general achieved by commonly used iterative schemes, which stay stuck close to the initial choice of centers.
- Issues
- Inefficient in distinguishing between groups of unbalanced sizes. See Packages >> {pivmet} which uses consensus clustering to help mitigate this.
- {tidyclust}
Example 1: Mixed K-Means
Model and Extract Centroids
```r
library(tidymodels)
library(tidyclust)

data("ames", package = "modeldata")

kproto_spec <- k_means(num_clusters = 3) %>%
  set_engine("clustMixType")

kproto_fit <- kproto_spec %>%
  fit(~ ., data = ames)

kproto_fit %>%
  extract_centroids() %>%
  select(11:20) %>%
  glimpse()
#> Rows: 3
#> Columns: 10
#> $ Lot_Config     <fct> Inside, Inside, Inside
#> $ Land_Slope     <fct> Gtl, Gtl, Gtl
#> $ Neighborhood   <fct> College_Creek, North_Ames, Northridge_Heights
#> $ Condition_1    <fct> Norm, Norm, Norm
#> $ Condition_2    <fct> Norm, Norm, Norm
#> $ Bldg_Type      <fct> OneFam, OneFam, OneFam
#> $ House_Style    <fct> Two_Story, One_Story, One_Story
#> $ Overall_Cond   <fct> Average, Average, Average
#> $ Year_Built     <dbl> 1989.977, 1953.793, 1998.765
#> $ Year_Remod_Add <dbl> 1995.934, 1972.973, 2003.035
```
Example 2: Classic K-Means (article, code)
Preprocess and Fit
```r
recipe_kmeans <- recipe(~ ., data = data_prep_tbl) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_rm("ID")

# Check the preprocessed data
recipe_kmeans %>% prep() %>% juice() %>% glimpse()

model_kmeans <- k_means(num_clusters = 4) %>%
  set_engine("stats")

set.seed(123)
wflw_fit_kmeans <- workflow() %>%
  add_model(model_kmeans) %>%
  add_recipe(recipe_kmeans) %>%
  fit(data_prep_tbl)
```
Visualize
```r
g <- data_prep_tbl %>%
  bind_cols(extract_cluster_assignment(wflw_fit_kmeans), .) %>%
  ggplot(aes(Spent, Income)) +
  geom_point(
    aes(fill = .cluster),
    shape = 21,
    alpha = 0.3,
    size  = 5
  ) +
  geom_smooth(color = "blue", se = FALSE) +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(
    labels = scales::dollar_format(),
    limits = c(0, 200000)
  ) +
  labs(title = "Customer Clusters: Spent vs Income") +
  scale_fill_tq() +
  theme_tq()

ggplotly(g)
```
- See Cluster Descriptions >> Scatter for the output
Approximate Nearest Neighbor (ANN)
- A brute-force kNN lookup runs at O(N*K), where N is the number of items and K is the size of each embedding. Approximate nearest neighbor (ANN) algorithms typically drop the complexity of a lookup to O(log N).
- Misc
- Also see Maximum inner product search using nearest neighbor search algorithms
- It shows a preprocessing transformation that is performed before kNN to make it more efficient
- It might already be implemented in ANN algorithms
- Packages
- {RANN} - Wraps Mount’s and Arya’s ANN library (2010) that’s written in C++. Finds the k nearest neighbours for every point in a given dataset in O(N log N) time.
- Support for approximate as well as exact searches, fixed radius searches and bd as well as kd trees.
- Euclidean (L2) and Manhattan (L1)
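- A minimal {RANN} sketch; eps > 0 switches nn2() from exact to approximate search (data here is simulated for illustration):

```r
library(RANN)

set.seed(1)
embeddings <- matrix(rnorm(10000 * 20), ncol = 20)  # 10k items, 20-dim embeddings
query      <- matrix(rnorm(5 * 20), ncol = 20)      # 5 query points

# eps = 0 is an exact search; eps > 0 returns neighbors within a (1 + eps)
# factor of the true nearest distances
exact  <- nn2(data = embeddings, query = query, k = 10, eps = 0)
approx <- nn2(data = embeddings, query = query, k = 10, eps = 0.5)

str(exact$nn.idx)    # indices of the 10 nearest items per query point
str(exact$nn.dists)  # corresponding (Euclidean) distances
```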
- Commonly used in Recommendation algs to cluster user-item embeddings at the end. Also, any NLP task where you need to do a similarity search of one character embedding to other character embeddings.
- Generally uses one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as Locality-preserving hashing (LPH)
- Locality-Sensitive Hashing (LSH)
- Hashes similar input items into the same “buckets” with high probability.
- The number of buckets is much smaller than the universe of possible input items
- Hash collisions are maximized, not minimized; a collision is when two distinct data points have the same hash.
- Spotify’s Annoy
- Uses a type of LSH, Random Projections Method (RPM) (article didn’t explain this well)
- L RPM hash functions are chosen. Each data point, p, gets hashed into buckets in each of the L hash tables. When a new data point, q, is queried, it gets hashed into buckets the same way p did. All of the points that share a bucket with q are pulled, and those within a certain distance threshold, c*R, of q are considered its (approximate) nearest neighbors. (A toy sketch follows this block.)
- Wiki article on LSH and RPM clears it up a little, but I’d probably have to go to Spotify’s paper to totally make sense of this.
- Also, Annoy builds a forest of random-projection trees and aggregates candidate neighbors across the trees, which is where trees/forests come into it
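- A toy sketch of the random-projections idea (illustrative only, not Annoy's actual implementation; all values are made up):

```r
# Each hash function is a random hyperplane; the sign of the projection gives
# one bit, and L tables of b bits each define the buckets
set.seed(7)
n <- 1000; d <- 32; b <- 8; L <- 4
X <- matrix(rnorm(n * d), ncol = d)   # data points (rows)

hash_table <- function(X, planes) {
  bits <- (X %*% planes) > 0          # n x b matrix of sign bits
  apply(bits, 1, function(r) paste(as.integer(r), collapse = ""))  # bucket keys
}

planes  <- replicate(L, matrix(rnorm(d * b), ncol = b), simplify = FALSE)
buckets <- lapply(planes, function(p) hash_table(X, p))

# Query: pull every point that shares a bucket with q in any of the L tables,
# then keep candidates within distance c * R of q
q     <- rnorm(d)
q_key <- lapply(planes, function(p) hash_table(matrix(q, nrow = 1), p))
cand  <- unique(unlist(Map(function(bk, qk) which(bk == qk), buckets, q_key)))

dists <- sqrt(colSums((t(X[cand, , drop = FALSE]) - q)^2))
cR    <- 5                             # example threshold (c * R)
neighbors <- cand[dists <= cR]
```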
- Facebook AI Similarity Search (FAISS)
- Hierarchical Navigable Small World Graphs (HNSW)
- HNSW has a polylogarithmic time complexity (O(logN))
- Two approximations are available:
- Embeddings are clustered and centroids are calculated. The k nearest centroids are returned.
- Embeddings are clustered into Voronoi cells. The k nearest embeddings in a Voronoi cell or a region of Voronoi cells are returned.
- Both types of approximations have tuning parameters.
- Inverted File Index + Product Quantization (IVFPQ)(article)
DBSCAN
- Misc
- Notes from:
- Understanding DBSCAN and Implementation with Python
- Clustering with DBSCAN, Clearly Explained video
- Packages
- {dbscan}
- {parameters}
- n_clusters_dbscan - Given a "min_size" (aka minPts?), the function estimates the optimal "eps"
- cluster_analysis - Shows Sum of Squares metrics and the (standardized) mean value for each variable within each cluster.
- HDBSCAN is the hierarchical variant of the DBSCAN algorithm
- Use Cases
- Geospatially Clustering Earthquakes
- Events can occur in irregular shaped clusters (i.e., along faults of different orientations).
- Events can occur in different densities (i.e. some fault zones are more active than others).
- Events can occur far away from fault zones (i.e. outliers)
- Tuning
- eps - The maximum distance between two samples for one to be considered to be connected to the other
- A larger eps tends to include more points within a cluster
- Too-large eps will include everything in the same single cluster
- Too-small eps will result in no clustering at all
- minPts (or min_samples) - The minimum number of samples in a neighborhood for a point to be considered as a core point
- Too-small minPts is not meaningful because it will regard every point as a core point.
- Larger minPts can be better to deal with noisy data
- Algorithm
- For each data point, find the points in the neighborhood within eps distance, and define the core points as those with at least minPts neighbors.
- The orange circle represents the eps area
- If minPts = 4, then the top 4 points are core points because they have at least 4 points overlapping the eps area
- Define groups of connected core points as clusters.
- All the green points have been labelled as core points
- Assign each non-core point to a nearby cluster if it’s directly reachable from a neighboring core point, otherwise define it as an outlier.
- The black points are non-core points but are points that overlap the eps area for the outer-most core points.
- Adding these black points finalizes the first cluster
- This process is repeated for the next group of core points and continues until all that’s left are outliers.
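- A minimal {dbscan} sketch (dataset and parameter values are illustrative, not from the source):

```r
library(dbscan)

x <- as.matrix(scale(iris[, 1:4]))   # numeric variables, standardized

# kNNdistplot() helps eyeball a reasonable eps for a given minPts:
# look for the "knee" in the sorted k-nearest-neighbor distances
kNNdistplot(x, k = 5)
abline(h = 0.7, lty = 2)

fit <- dbscan(x, eps = 0.7, minPts = 5)
fit

# Cluster 0 is the noise/outlier group
table(cluster = fit$cluster, species = iris$Species)
```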
- Advantages
- Doesn’t require users to specify the number of clusters.
- Not sensitive to outliers.
- Clusters formed by DBSCAN can be any shape, which makes it robust to different types of data.
- Disadvantages
- Performs poorly if the data has a very large variation in densities across clusters, because you can only use one pair of parameters, eps and minPts, for the whole dataset
- It could be hard to define eps without the domain knowledge of the data
- Clusters are not totally reproducible. Clusters are defined sequentially, so depending on which group of core points the algorithm starts with and on the hyperparameter values, non-core points that fall within the eps area of multiple clusters may be assigned to different clusters on different runs of the algorithm.