Outliers
Misc
- Also see
- Anomaly Detection for ML methods
- EDA, General >> Outliers
- Mathematics, Statistics >> Multivariate >> Depth
- Outlier detection for multivariate data
- Packages
- CRAN Task View
- {ShapleyOutlier} - Multivariate Outlier Explanations using Shapley Values and Mahalanobis Distances
- {robustmatrix} (vignette) - Robust covariance estimation for matrix-valued data and data with Kronecker-covariance structure using the Matrix Minimum Covariance Determinant (MMCD) estimators and outlier explanation using Shapley values.
- Examples of matrix data would be images (pixel grids) and repeated measures (e.g., different time points, different spatial locations, different experimental conditions, etc.)
- Resources
- Need to examine this article more closely, Taking Outlier Treatment to the Next Level
- Discusses a detailed approach to diagnosing outliers: EDA, diagnostics, robust regression, winsorizing, and nonlinear approaches for nonrandom outliers.
- For Time Series, see bkmks, pkgs in time series >> cleaning/processing >> outliers
Tests
- **All tests assume data is from a Normal distribution**
- See the EDA section for ways to find potential outliers to test
- Grubbs’s Test
- Test either a maximum or minimum point
- If you suspect multiple points, remove the max/min points above/below the suspect point, then test the subsetted data. Repeat as necessary.
- H0: There is no outlier in the data.
- Ha: There is an outlier in the data.
- Test statistics
  \[ G = \frac{\bar{Y} - Y_{\text{min}}}{s} \qquad G = \frac{Y_{\text{max}} - \bar{Y}}{s} \]
  - Statistics for whether the minimum or maximum sample value is an outlier
- The maximum value is an outlier if
\[ G > \frac {N-1}{\sqrt{N}} \sqrt{\frac {t^2_{\alpha/(2N),N-2}}{N-2+t^2_{\alpha/(2N),N-2}}} \]
- “<” for minimum
- t denotes the critical value of the t-distribution with N − 2 degrees of freedom and a significance level of α/(2N).
- For a one-sided test of only the maximum or only the minimum value, use a significance level of α/N.
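- A quick base-R check of this critical value (N = 21 and α = 0.05 are arbitrary illustration values):

```r
# Grubbs's two-sided critical value for N = 21, alpha = 0.05
N <- 21
alpha <- 0.05
t2 <- qt(alpha / (2 * N), df = N - 2)^2  # squared t critical value
(N - 1) / sqrt(N) * sqrt(t2 / (N - 2 + t2))
```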
- Requirements
- Normally distributed
- More than 7 observations
- outliers::grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)
  - x: a numeric vector of data values
  - type: 10 (default) checks whether the maximum value is an outlier; 11 checks whether both the minimum and maximum values are outliers; 20 checks whether one tail has two outliers
  - opposite:
    - FALSE (default): check the value at maximum distance from the mean
    - TRUE: check the value at minimum distance from the mean
  - two.sided: logical indicating whether the test should be treated as two-sided
see bkmk for examples
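- A minimal usage sketch on simulated data (the implanted value 10 plays the suspect outlier):

```r
library(outliers)

set.seed(42)
x <- c(rnorm(20), 10)  # 20 standard-normal draws plus one implanted extreme value

# Test whether the maximum value is an outlier (type = 10, the default)
grubbs.test(x, type = 10)
```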
- Dixon’s Test
- Test either a maximum or minimum point
- If you suspect multiple points, remove the max/min points above/below the suspect point, then test the subsetted data. Repeat as necessary.
- Most useful for small sample size (usually n≤25)
- H0: There is no outlier in the data.
- Ha: There is an outlier in the data.
- outliers::dixon.test
  - Will only accept a vector of between 3 and 30 observations
  - opposite = TRUE to test the maximum value
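- A minimal usage sketch on a small simulated sample (the default settings pick the test variant based on sample size):

```r
library(outliers)

set.seed(1)
x <- c(rnorm(10), 6)  # small sample (within the 3-30 limit) plus one extreme value

# Dixon's test for a single outlier
dixon.test(x)
```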
- Rosner’s Test (a.k.a. Generalized Extreme Studentized Deviate (ESD) Test) - Tests multiple points
- Avoids the problem of masking, where an outlier that is close in value to another outlier can go undetected.
- Most appropriate when n≥20
- H0: There are no outliers in the data set
- Ha: There are up to k outliers in the data set
- res <- EnvStats::rosnerTest(x, k)
  - x: numeric vector
  - k: upper limit of suspected outliers
  - alpha: significance level (default 0.05)
- The result of the test, res, is a list that contains a number of objects
- res$all.stats shows all the calculated statistics used in the outlier determination and the results
  - “Value” shows the data point values being evaluated
- “Outlier” is True/False on whether the point is determined to be an outlier by the test
- Rs are the test statistics
- λs are the critical values
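- A minimal usage sketch; k = 4 is an arbitrary upper bound on the number of suspected outliers:

```r
library(EnvStats)

set.seed(7)
x <- c(rnorm(30), 8, 9, 12)  # three implanted extreme values

res <- rosnerTest(x, k = 4)
res$all.stats  # per-iteration statistics, critical values, and outlier flags
```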
Preprocessing
- Removal
- An option if there’s sound reasoning (e.g., a data entry error)
- Winsorization
- A typical strategy is to set all outliers (values beyond a certain threshold) to a specified percentile of the data
- Example: A 90% winsorization would see all data below the 5th percentile set to the 5th percentile, and data above the 95th percentile set to the 95th percentile (see the sketch after this list)
- Packages
  - {DescTools::Winsorize}
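- A minimal base-R sketch of the 90% winsorization described above (the winsorize helper is a hypothetical name for illustration); {DescTools::Winsorize} offers a packaged version:

```r
# Hypothetical helper: clamp values outside the 5th/95th percentiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

set.seed(3)
x <- c(rnorm(50), -20, 25)
range(winsorize(x))  # extremes pulled in to the percentile bounds
```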
- Binning
- See Feature-Engineering, General >> Continuous >> Binning
- Depending on the modeling algorithm, binning can help minimize the influence of outliers and skewness. Beware of information loss from using too few bins; some algorithms also perform poorly on variables with very few bins.
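- A small sketch of quantile binning in base R, which caps the leverage of the extreme value:

```r
set.seed(2)
x <- c(rnorm(50), 15)  # one extreme value

# Quartile bins: the outlier just lands in the top bin
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)
table(bins)
```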
Statistics
- For a skewed distribution, a Winsorized Mean (percentage of points replaced) often has less bias than a Trimmed Mean
- For a symmetric distribution, a Trimmed Mean (percentage of points removed) often has less variance than a Winsorized Mean.
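- A quick illustration with base R’s trimmed mean (trim = 0.1 drops the lowest and highest 10% of points):

```r
set.seed(6)
x <- c(rexp(50), 30)  # right-skewed sample with one extreme value

mean(x)             # ordinary mean, pulled up by the outlier
mean(x, trim = 0.1) # trimmed mean, far less affected
```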
- Hodges–Lehmann Estimator
- Packages: {DescTools::HodgesLehmann}
- A robust and nonparametric estimator of a population’s location parameter.
- For populations that are symmetric about one median, such as the Gaussian (normal) distribution or Student’s t-distribution, the Hodges–Lehmann estimator is a consistent and median-unbiased estimate of the population median.
- Has a Breakdown Point of 0.29, which means that the statistic remains bounded even if nearly 30 percent of the data have been contaminated.
- Sample Median is more robust with breakdown point of 0.50 for symmetric distributions, but is less efficient (i.e. needs more data).
- For non-symmetric populations, the Hodges–Lehmann estimator estimates the “pseudo–median”, which is closely related to the population median (relatively small difference).
- The pseudo-median is defined for heavy-tailed distributions that lack a finite mean.
- For two-samples, it’s the median of the difference between a sample from x and a sample from y.
- One-Variable Procedure (both procedures are sketched in code after this list)
- Find all possible two-element subsets of the vector.
- Calculate the mean of each two-element subset.
- Calculate the median of all the subset means.
- Two-Variable Procedure
- Find all possible pairs between the two vectors (i.e., the Cartesian product)
- Calculate the difference within each pair
- Calculate the median of the differences
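- A minimal base-R sketch of both procedures (hl_one and hl_two are hypothetical helper names); {DescTools::HodgesLehmann} covers the same one- and two-sample cases:

```r
# One-variable: median of the means of all two-element subsets
hl_one <- function(x) {
  median(colMeans(combn(x, 2)))  # combn(x, 2) gives every pair as a column
}

# Two-variable: median of all cross-sample differences (Cartesian product)
hl_two <- function(x, y) {
  median(outer(x, y, "-"))
}

set.seed(5)
x <- rnorm(25)
y <- rnorm(25, mean = 1)
hl_one(x)     # close to median(x)
hl_two(x, y)  # close to the location shift of -1
```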
Models
- Bayesian models can handle outliers by using heavier-tailed likelihood distributions (e.g., Student’s t instead of Gaussian) that allow for increased uncertainty
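- A hedged sketch with {brms} (brm and the student() family are real {brms} functions, but the simple linear setup here is illustrative):

```r
library(brms)

set.seed(8)
d <- data.frame(x = 1:30)
d$y <- 2 * d$x + rnorm(30)
d$y[30] <- 200  # implant an outlier

# Student-t likelihood: fatter tails let the model treat the extreme
# point as plausible noise instead of dragging the regression line
fit <- brm(y ~ x, data = d, family = student())
```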
- Isolation Forests - See Anomaly Detection >> Isolation Forests
- Support Vector Regression (SVR) - See Algorithms, ML >> Support Vector Machines >> Regression
- Extreme Value Theory approaches
- fat tail stuff (need to finish those videos)
- Robust Regression (see bkmks >> Regression >> Other >> Robust Regression)
- {MASS::rlm}
- {robustbase}
- CRAN Task View
- Huber Regression
- See Loss Functions >> Huber Loss
- See bkmks, Regression >> Generalized >> Huber
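- A minimal sketch with {MASS::rlm}, whose default psi function is Huber’s:

```r
library(MASS)

set.seed(4)
d <- data.frame(x = 1:30)
d$y <- 2 * d$x + rnorm(30)
d$y[30] <- 200  # implant an outlier

fit_ols <- lm(y ~ x, data = d)   # ordinary least squares, pulled by the outlier
fit_rob <- rlm(y ~ x, data = d)  # Huber M-estimation (rlm's default psi)
rbind(ols = coef(fit_ols), robust = coef(fit_rob))
```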
- Theil-Sen estimator
- {mblm}
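- A minimal sketch with {mblm} (repeated = FALSE gives the single-median Theil-Sen fit; TRUE gives Siegel’s repeated medians):

```r
library(mblm)

set.seed(9)
d <- data.frame(x = 1:30)
d$y <- 2 * d$x + rnorm(30)
d$y[30] <- 200  # implant an outlier

# Median-based slope is barely moved by the outlier
fit <- mblm(y ~ x, dataframe = d, repeated = FALSE)
coef(fit)
```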