Outliers
Misc
- Also see
- Anomaly Detection for ML methods
- EDA, General >> Outliers
- Mathematics, Statistics >> Multivariate >> Depth
- Outlier detection for multivariate data
- Packages
- CRAN Task View
- {ShapleyOutlier} - Multivariate Outlier Explanations using Shapley Values and Mahalanobis Distances
- {robustmatrix} (vignette) - Robust covariance estimation for matrix-valued data and data with Kronecker-covariance structure using the Matrix Minimum Covariance Determinant (MMCD) estimators and outlier explanation using Shapley values.
- Examples of matrix data would be image resolution and repeated measueres (e.g different time points, different spatial locations, different experimental conditions, etc)
- {ClusMIP} (Paper) - Tools for identifying observations influencing the choice of a stochastically selected submodel in the high-dimensional setting where the number of predictors p exceeds the sample size n
- Process
- Data is first divided into two components denoting the potentially influential and clean portions via a consistent high-dimensional clustering scheme.
- e.g. spectral clustering and regularized K-means clustering.
- For each potentiallly influential data point, it gets merged with the clean dataset and analyzed parametrically or nonparametrically as to whether it’s truely an influential point.
- i.e. A model is fit using the clean data plus an influential point. Diagnostics on this model will determine if that point was truly influential.
- Data is first divided into two components denoting the potentially influential and clean portions via a consistent high-dimensional clustering scheme.
- Suitable for diagnosing influential data points (i.e. outliers) that affect regularized linear regression and regularized logistic regression
- Process
- Resources
- Need to examine this article more closely, Taking Outlier Treatment to the Next Level
- Discusses detailed approach to diagnosing outliers , eda, diagnostics, robust regression, winsorizing, nonlinear approaches for nonrandom outliers.
- For Time Series, see bkmks, pkgs in time series >> cleaning/processing >> outliers
- Need to examine this article more closely, Taking Outlier Treatment to the Next Level
- For small sample sizes with outliers
- “With n=36? Just do the whole analysis with and without, and see if your results are roughly pointing in the same direction both ways. With a sample size that small,”rough estimate of direction, justification for bigger study” is the only result you can really achieve anyway.” (Thread)
Tests
- ** All tests assume data is from a Normal distribution **
- See the EDA section for ways to find potential outliers to test
- Grubbs’s Test
Test either a maximum or minimum point
- If you suspect multiple points, you have remove the max/min points above/below the suspect point. Then test the subsetted data. Repeat as necessary
H0: There is no outlier in the data.
Ha: There is an outlier in the data.
Test statistics
\[ G = \frac {\bar{Y} - Y_{\text{min}}}{s}G = \frac {Y_{\text{max}} - \bar{Y}}{s} \]
- Statistics for whether the minimum or maximum sample value is an outlier
The maximum value is outlier if
\[ G > \frac {N-1}{\sqrt{N}} \sqrt{\frac {t^2_{\alpha/(2N),N-2}}{N-2+t^2_{\alpha/(2N),N-2}}} \]
- “<” for minimum
- t is denotes the critical value of the t distribution with (N-2) degrees of freedom and a significance level of α/(2N).
- For testing either the maximum or minimum value, use a significance level of level of α/N
Requirements
- Normally distributed
- More than 7 observations
outliers::grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)
x: a numeric vector of data values
type=10: check if the maximum value is an outlier, 11 = check if both the minimum and maximum values are outliers, 20 = check if one tail has two outliers.
opposite:
- FALSE (default): check value at maximum distance from mean
- TRUE: check value at minimum distance from the mean
two-sided: If this test is to be treated as two-sided, this logical value indicates that.
see bkmk for examples
- Dixon’s Test
- Test either a maximum or minimum point
- If you suspect multiple points, you have remove the max/min points above/below the suspect point. Then test the subsetted data. Repeat as necessary.
- Most useful for small sample size (usually n≤25)
- H0: There is no outlier in the data.
- Ha: There is an outlier in the data.
outliers::dixon.test
- Will only accept a vector between 3 and 30 observations
- “opposite=TRUE” to test the maximum value
- Test either a maximum or minimum point
- Rosner’s Test (aka generalized (extreme Studentized deviate) ESD test) Tests multiple points
- Avoids the problem of masking, where an outlier that is close in value to another outlier can go undetected.
- Most appropriate when n≥20
- H0: There are no outliers in the data set
- Ha: There are up to k outliers in the data set
res <- EnvStats::rosnerTest(x,k)
- x: numeric vector
- k: upper limit of suspected outliers
- alpha: 0.05 default
- The results of the test,
res
, is a list that contains a number of objects
res$all.stats
shows all the calculated statistics used in the outlier determination and the results- “Value” shows the data point values being evaluated
- “Outlier” is True/False on whether the point is determined to be an outlier by the test
- Rs are the test statistics
- λs are the critical values
Preprocessing
- Removal
- An option if there’s sound reasoning (e.g. data entry error, etc.)
- Winsorization
- A typical strategy is to set all outliers (values beyond a certain threshold) to a specified percentile of the data
- Example: A 90% winsorization would see all data below the 5th percentile set to the 5th percentile, and data above the 95th percentile set to the 95th percentile.
- Packages
- Binning
- See Feature-Engineering, General >> Continuous >> Binning
- Depending on the modeling algorithm, binning can help with minimizing the influence of outliers and skewness. Beware of information loss due to too few bins. Some algorithms also don’t perform well with variables with too few bins.
Statistics
- Packages
- {WRS2} - Robust t-tests (independent and dependent samples), robust ANOVA (including between-within subject designs), quantile ANOVA, robust correlation, robust mediation, and nonparametric ANCOVA models based on robust location measures
- {robustHD::corHuber} - Computes a robust correlation based on winsorization
- Univariate Winsorization: The borders for each variable are given by +/− const, thus a symmetric distribution is assumed
- Adjusted Univariate Winsorization: The borders for the two diagonally opposing quadrants containing the minority of the data are shrunken by a factor that depends on the ratio between the number of observations in the major and minor quadrants.
- Bivariate Winsorization: A bivariate normal distribution is assumed and the data are shrunken towards the boundary of a tolerance ellipse with a coverage probability.
- The boundary of this ellipse is thereby given by all points that have a squared Mahalanobis distance equal to the quantile of the chi-square distribution given by coverage probability. Furthermore, the initial correlation matrix required for the Mahalanobis distances is computed based on adjusted univariate winsorization.
- Winsorization and Trimming
For a skewed distribution, a Winsorized Mean (percentage of points replaced) often has less bias than a Trimmed Mean ({DescTools::Winsorize})
For a symmetric distribution, a Trimmed Mean (percentage of points removed) often has less variance than a Winsorized Mean. (
mean(num_vec, trim = 0.10)
{DescTools::YuenTTest} - Yuen t-Test For Trimmed Means
Winsorized t-test
<- WRS2::winmean(x, tr = 0.2) mu_w <- WRS2::winse(x, tr = 0.2) se_w <- (mean_w - mu0) / se_w t_stat <- n - 1 dof <- 2 * pt(-abs(t_stat), df) p_value
- M-Estimators
- Applies a flexible, continuous loss function to down-weight or ignore outliers adaptively.
- Used in robust regression models, e.g. {MASS::rlm} which has Huber, Tukey Bi-Square, and Hampel.
- Note that the k argument is the \(c\) value in the Huber estimator equation
- Offer flexibility, particularly when:
- You need control over outlier influence.
- You’re dealing with regression or multivariate settings.
- Huber’s M Estimator
\[ \rho(u) = \left\{ \begin{array}{lcl} \frac{1}{2}u^2 & \mbox{if}\;|u| \le c \\ 2c|u|-c^2 & \mbox{if}\;|u| \gt c \end{array}\right. \]- A compromise between the mean and median.
- Good general purpose M Estimator
- Uses a squared error for small residuals and switches to absolute deviations for large residuals
- \(u\) is the residual difference between \(y\) (your data) and \(\hat u\) (your guess)
- \(c\) is a tuning constant (also called the threshold or cutoff), which determines the point where the loss function transitions from quadratic to linear. A common choice for \(c\) is 1.345, based on the asymptotic properties of the estimator under normal errors.
- Large c:
- When c is large (e.g., \(c \geq 3\)), the Huber M-estimator behaves similarly to the mean (least squares estimator) because most residuals fall below c, and the quadratic loss dominates.
- Use this when your data is approximately normal with few or no significant outliers.
- Small c:
- When c is small (e.g., \(c \approx 1.345\), the commonly recommended value), the Huber M-estimator is more robust to outliers, as more residuals are treated with the linear loss.
- Use this when your data has heavy tails or is prone to outliers.
- Large c:
- {DescTools::HuberM} - (Generalized) Huber M-estimator of location with MAD scale, being sensible also when the scale is zero where
huber
returns an error. - Process (Iterative optimization methods like Newton-Raphson or gradient descent used)
- Initialize \(\hat u\), e.g. set at the median
- Compute the residual vector, \(u\)
- For each residual \(u_i\), apply the Huber loss function \(\rho(u_i)\) based on whether \(|u_i|\) is smaller or greater than the tuning constant \(c\).
- Update \(\hat u\)
- Iterate until convergence
- Cauchy M-Estimator
\[ \rho(u) = \frac{c^2}{2} \log(1 + \frac{u^2}{c^2}) \]- Process and variables are the same as for Huber’s M-Estimator
- Small Residuals (\(|u| \ll c\)): Approximates a quadratic loss (similar to least squares).
- Large Residuals (\(|u| \gg c\)): The loss grows logarithmically, which heavily down-weights the influence of outliers compared to both Huber and least squares estimators.
- Ideal for extreme outliers and heavy-tailed distributions
- {smoothmest} has a Cauchy ML-Estimator
- ML (Maximum Likelihood) Estimator maximizes a likelihood instead of minimizing a loss.
- For heavy-tailed distributions like the Cauchy, the ML-Estimator is inherently robust due to the nature of the likelihood function.
- Others
- Tukey’s Biweight (or Bisquare) Estimator: Uses a loss function that completely disregards extreme outliers
- MM-Estimators
- Combines robust location and scale estimators
- High Breakdown Point: MM-estimators can tolerate a high proportion of outliers (up to 50% in some cases).
- High Efficiency: By combining robust and efficient estimators, MM-estimators achieve nearly the efficiency of the mean under normal distributions.
- Process
- Compute a robust estimate of the scale or dispersion (e.g., using the MAD or Huber scale estimator) which ensures a high breakdown point
- Use the initial robust estimate as a starting point and refine it by minimizing a second M-estimator with a specific loss function designed for high efficiency Typically Tukey’s bi-weight function or Huber’s function).
- {RobStatTM::covRob} computes a multivariate MM location estimate
- Hodges–Lehmann Estimator
- Packages: {DescTools::HodgesLehmann}
- A robust and nonparametric estimator of a population’s location parameter.
- Preferred for small data, symmetric distributions, and extreme outliers.
- For symmetric populations about one median, such as the Gaussian or normal distribution or the Student t-distribution, the Hodges–Lehmann estimator is a consistent and median-unbiased estimate of the population median.
- Has a Breakdown Point of 0.29, which means that the statistic remains bounded even if nearly 30 percent of the data have been contaminated.
- Sample Median is more robust with breakdown point of 0.50 for symmetric distributions, but is less efficient (i.e. needs more data).
- Has a Breakdown Point of 0.29, which means that the statistic remains bounded even if nearly 30 percent of the data have been contaminated.
- For non-symmetric populations, the Hodges–Lehmann estimator estimates the “pseudo–median”, which is closely related to the population median (relatively small difference).
- The psuedo-median is defined for heavy-tailed distributions that lack a finite mean.
- For two-samples, it’s the median of the difference between a sample from x and a sample from y.
- One-Variable Procedure
- Find all possible two-element subsets of the vector.
- Calculate the mean of each two-element subset.
- Calculate the median of all the subset means.
- Two-Variable Procedure
- Find all possible two-element subsets between the two vectors (i.e. cartesian product)
- Calculate difference between subsets
- Calculate median of differences
Models
- Be aware that if you try interpret the coefficients (inference) these models, you need to understand the distribution (e.g. winsored normal) or method (e.g. m-estimator) being used by the model.
- Bayes has different distributions for increasing uncertainty
- Isolation Forests - See Anomaly Detection >> Isolation Forests
- Support Vector Regression (SVR) - See Algorithms, ML >> Support Vector Machines >> Regression
- Extreme Value Theory approaches
- Quantile Regression - See Regresson, Quantile
- Robust Regression (see bkmks >> Regression >> Other >> Robust Regression)
- {MASS:rlm}
- CRAN Task View
- For linear regression and glms, {robustbase} should probably be your first stop.
- Huber Regression
- See Loss Functions >> Huber Loss
- See bkmks, Regression >> Generalized >> Huber
- Theil-Sen estimator
- {mblm}