Outliers
Misc
- Also see Anomaly Detection for ML methods
- Packages
- CRAN Task View
- {robustmatrix} (vignette) - Robust covariance estimation for matrix-valued data and data with a Kronecker covariance structure using the Matrix Minimum Covariance Determinant (MMCD) estimators, with outlier explanation via Shapley values
- Examples of matrix-valued data include images and repeated measures (e.g., different time points, different spatial locations, different experimental conditions)
- Resources
- Need to examine this article more closely: Taking Outlier Treatment to the Next Level
    - Discusses a detailed approach to diagnosing outliers: EDA, diagnostics, robust regression, winsorizing, and nonlinear approaches for nonrandom outliers
- For Time Series, see bkmks, pkgs in time series >> cleaning/processing >> outliers
EDA
- IQR
- Observations above \(q_{0.75} + (1.5 \times \text{IQR})\) are considered outliers
- Observations below \(q_{0.25} - (1.5 \times \text{IQR})\) are considered outliers
- Where \(q_{0.25}\) and \(q_{0.75}\) correspond to first and third quartile respectively, and IQR is the difference between the third and first quartile
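The fences above can be computed directly in base R; a minimal sketch with made-up data:

```r
# Flag points outside the 1.5 * IQR fences (toy data)
x <- c(2, 3, 4, 5, 6, 7, 50)
q <- quantile(x, probs = c(0.25, 0.75))  # q_0.25 and q_0.75
iqr <- unname(q[2] - q[1])               # same as IQR(x)
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr
x[x < lower | x > upper]                 # -> 50
```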
- Hampel Filter
- Observations above \(\text{median} + (3 \times \text{MAD})\) are considered outliers
- Observations below \(\text{median} - (3 \times \text{MAD})\) are considered outliers
- Use `mad(vec, constant = 1)` for the MAD (the default `constant` is 1.4826, which rescales the MAD for consistency with the standard deviation under normality)
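A base-R sketch of the Hampel rule above, on made-up data:

```r
# Hampel filter: flag points outside median +/- 3 * MAD (toy data)
x <- c(2, 3, 4, 5, 6, 7, 50)
med <- median(x)                 # 5
raw_mad <- mad(x, constant = 1)  # raw MAD, per the note above
lower <- med - 3 * raw_mad
upper <- med + 3 * raw_mad
x[x < lower | x > upper]         # -> 50
```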
Tests
- **All tests assume the data come from a Normal distribution**
- See the EDA section for ways to find potential outliers to test
- Grubbs’s Test
- Test either a maximum or minimum point
- If you suspect multiple points, you have to remove the max/min points above/below the suspect point, then test the subsetted data. Repeat as necessary
- H0: There is no outlier in the data
- Ha: There is an outlier in the data
- Test Statistics
    \[ G = \frac {\bar{Y} - Y_{\text{min}}}{s} \quad \text{or} \quad G = \frac {Y_{\text{max}} - \bar{Y}}{s} \]
    - Statistics for testing whether the minimum or the maximum sample value, respectively, is an outlier
- The maximum value is an outlier if
\[ G > \frac {N-1}{\sqrt{N}} \sqrt{\frac {t^2_{\alpha/(2N),N-2}}{N-2+t^2_{\alpha/(2N),N-2}}} \]
- “<” for minimum
- \(t_{\alpha/(2N),\,N-2}\) denotes the critical value of the t-distribution with \(N-2\) degrees of freedom at a significance level of \(\alpha/(2N)\)
- For testing either the maximum or minimum value, use a significance level of \(\alpha/N\)
- Requirements
- Normally distributed
- More than 7 observations
- `outliers::grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)`
    - x: a numeric vector of data values
    - type: 10 (default) checks whether the maximum value is an outlier; 11 checks whether both the minimum and maximum values are outliers; 20 checks whether one tail has two outliers
    - opposite:
        - FALSE (default): check the value at maximum distance from the mean
        - TRUE: check the value at minimum distance from the mean
    - two.sided: logical indicating whether the test should be treated as two-sided
- See bkmk for examples
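As a check on the formulas above, the statistic and critical value can be computed by hand in base R (toy data; `outliers::grubbs.test(x, type = 10)` reports the same \(G\) along with a p-value):

```r
# Grubbs's statistic for the maximum, and its critical value at alpha = 0.05
x <- c(2, 3, 4, 5, 6, 7, 50)
N <- length(x)
G <- (max(x) - mean(x)) / sd(x)
t2 <- qt(0.05 / (2 * N), df = N - 2, lower.tail = FALSE)^2
G_crit <- (N - 1) / sqrt(N) * sqrt(t2 / (N - 2 + t2))
G > G_crit  # TRUE -> the maximum is flagged as an outlier
```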
- Dixon’s Test
- Test either a maximum or minimum point
- If you suspect multiple points, you have to remove the max/min points above/below the suspect point, then test the subsetted data. Repeat as necessary
- Most useful for small sample size (usually n≤25)
- H0: There is no outlier in the data.
- Ha: There is an outlier in the data.
- `outliers::dixon.test(x)`
    - Will only accept a vector with between 3 and 30 observations
    - `opposite = TRUE` to test the maximum value
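For the smallest samples, Dixon's statistic is a simple gap-to-range ratio; a base-R sketch on made-up data (this is the \(r_{10}\), a.k.a. Q, variant used for roughly \(n \le 7\); `dixon.test` chooses the variant based on sample size):

```r
# Dixon's Q for a suspect maximum in a small sample (toy data)
x <- sort(c(12, 13, 14, 15, 16, 40))
Q <- (x[6] - x[5]) / (x[6] - x[1])  # gap to nearest neighbor / range
Q  # compare against a tabulated critical value for n = 6
```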
- Rosner’s Test (aka the Generalized Extreme Studentized Deviate (ESD) Test) - Tests multiple points
- Avoids the problem of masking, where an outlier that is close in value to another outlier can go undetected.
- Most appropriate when n≥20
- H0: There are no outliers in the data set
- Ha: There are up to k outliers in the data set
- `res <- EnvStats::rosnerTest(x, k, alpha = 0.05)`
    - x: a numeric vector
    - k: the upper limit on the number of suspected outliers
    - alpha: significance level (default 0.05)
- The result of the test, `res`, is a list that contains a number of objects
- `res$all.stats` shows all the calculated statistics used in the outlier determination and the results
    - “Value” shows the data point values being evaluated
    - “Outlier” is TRUE/FALSE on whether the point is determined to be an outlier by the test
    - The Rs are the test statistics
    - The λs are the critical values
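`EnvStats::rosnerTest` handles all of this for you; the mechanics behind `all.stats` can be sketched in base R (generalized ESD with the standard critical-value formula; the helper name and toy data are made up):

```r
# Generalized ESD (Rosner) sketch: compute R_i and critical values lambda_i
rosner_sketch <- function(x, k, alpha = 0.05) {
  n <- length(x)
  y <- x
  Value <- R <- lambda <- numeric(k)
  for (i in seq_len(k)) {
    dev <- abs(y - mean(y))
    j <- which.max(dev)
    R[i] <- dev[j] / sd(y)  # test statistic on the current subset
    Value[i] <- y[j]
    y <- y[-j]              # drop the most extreme point and repeat
    p <- 1 - alpha / (2 * (n - i + 1))
    t <- qt(p, df = n - i - 1)
    lambda[i] <- (n - i) * t / sqrt((n - i - 1 + t^2) * (n - i + 1))
  }
  last <- max(c(0, which(R > lambda)))  # number of outliers declared
  data.frame(Value, R, lambda, Outlier = seq_len(k) <= last)
}
rosner_sketch(c(2:20, 100), k = 3)
```

Note that a point counts as an outlier if *any* later statistic exceeds its critical value, which is how the procedure avoids masking.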
Preprocessing
- Removal
- An option if there’s sound reasoning (e.g., a data entry error)
- Winsorization
- A typical strategy is to set all outliers (values beyond a certain threshold) to a specified percentile of the data
- Example: A 90% winsorization would see all data below the 5th percentile set to the 5th percentile, and data above the 95th percentile set to the 95th percentile.
- Packages
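A base-R sketch of the 90% winsorization described above (no package needed; the helper name and data are made up):

```r
# 90% winsorization: clamp values to the 5th and 95th percentiles (toy data)
winsorize90 <- function(x) {
  q <- quantile(x, probs = c(0.05, 0.95))
  pmin(pmax(x, q[1]), q[2])
}
winsorize90(c(1, 5, 6, 7, 8, 9, 10, 100))
```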
Statistics
- For a skewed distribution, a Winsorized Mean (percentage of points replaced) often has less bias than a Trimmed Mean
- For a symmetric distribution, a Trimmed Mean (percentage of points removed) often has less variance than a Winsorized Mean.
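A quick comparison of the two statistics on a made-up right-skewed sample (`mean(x, trim = )` is base R; the winsorized mean is clamped manually):

```r
# Trimmed mean drops the extreme 10% per tail; winsorized mean clamps instead
x <- c(1, 2, 2, 3, 3, 4, 4, 5, 6, 200)
trimmed <- mean(x, trim = 0.1)                 # drop lowest/highest 10%
q <- quantile(x, probs = c(0.1, 0.9))
winsorized <- mean(pmin(pmax(x, q[1]), q[2]))  # clamp, then average
c(trimmed = trimmed, winsorized = winsorized)
```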
- Hodges–Lehmann Estimator
- Packages: {DescTools::HodgesLehmann}
- A robust and nonparametric estimator of a population’s location parameter.
- For populations that are symmetric about one median, such as the Gaussian or normal distribution or the Student t-distribution, the Hodges–Lehmann estimator is a consistent and median-unbiased estimate of the population median.
- Has a Breakdown Point of 0.29, which means that the statistic remains bounded even if nearly 30 percent of the data have been contaminated.
- Sample Median is more robust with breakdown point of 0.50 for symmetric distributions, but is less efficient (i.e. needs more data).
- For non-symmetric populations, the Hodges–Lehmann estimator estimates the “pseudo–median”, which is closely related to the population median (relatively small difference).
- The pseudo-median is defined for heavy-tailed distributions that lack a finite mean
- For two-samples, it’s the median of the difference between a sample from x and a sample from y.
- One-Variable Procedure
- Find all possible two-element subsets of the vector.
- Calculate the mean of each two-element subset.
- Calculate the median of all the subset means.
- Two-Variable Procedure
- Find all possible two-element pairs between the two vectors (i.e., the Cartesian product)
- Calculate difference between subsets
- Calculate median of differences
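Both procedures in a base-R sketch (O(n²) pairs, so toy-sized input; note that some definitions of the one-sample estimator also include each point paired with itself, i.e. Walsh averages, which is what `wilcox.test(..., conf.int = TRUE)` uses):

```r
# One-sample: median of the means of all two-element subsets
hl_one_sample <- function(x) median(combn(x, 2, mean))

# Two-sample: median of all pairwise differences x_i - y_j (Cartesian product)
hl_two_sample <- function(x, y) median(outer(x, y, "-"))

hl_one_sample(c(1, 2, 3))           # -> 2
hl_two_sample(c(1, 2, 3), c(0, 1))  # -> 1.5
```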
Models
- Bayesian models can use heavier-tailed distributions (e.g., Student’s t) to increase uncertainty and downweight outliers
- Isolation Forests - See Anomaly Detection >> Isolation Forests
- Support Vector Regression (SVR) - See Algorithms, ML >> Support Vector Machines >> Regression
- Extreme Value Theory approaches
- fat tail stuff (need to finish those videos)
- Robust Regression (see bkmks >> Regression >> Other >> Robust Regression)
- {MASS::rlm}
- {robustbase}
- CRAN Task View
- Huber Regression
- See Loss Functions >> Huber Loss
- See bkmks, Regression >> Generalized >> Huber
- Theil-Sen estimator
- {mblm}