Outliers

Misc

  • Also see Anomaly Detection for ML methods
  • Packages
    • CRAN Task View
    • {robustmatrix} (vignette) - Robust covariance estimation for matrix-valued data and data with Kronecker-covariance structure using the Matrix Minimum Covariance Determinant (MMCD) estimators, and outlier explanation using Shapley values.
      • Examples of matrix data would be image resolution and repeated measures (e.g. different time points, different spatial locations, different experimental conditions, etc.)
  • Resources
    • Need to examine this article more closely, Taking Outlier Treatment to the Next Level
      • Discusses a detailed approach to diagnosing outliers: EDA, diagnostics, robust regression, winsorizing, and nonlinear approaches for nonrandom outliers.
    • For Time Series, see bkmks, pkgs in time series >> cleaning/processing >> outliers

EDA

  • IQR
    • Observations above \(q_{0.75} + (1.5 \times \text{IQR})\) are considered outliers
    • Observations below \(q_{0.25} - (1.5 \times \text{IQR})\) are considered outliers
    • Where \(q_{0.25}\) and \(q_{0.75}\) correspond to first and third quartile respectively, and IQR is the difference between the third and first quartile
  • Hampel Filter
    • Observations above \(\text{median} + (3 \times \text{MAD})\) are considered outliers
    • Observations below \(\text{median} - (3 \times \text{MAD})\) are considered outliers
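    • Use mad(vec, constant = 1) for the raw MAD (the default, constant = 1.4826, rescales the MAD to estimate the standard deviation under normality)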
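
A minimal base R sketch of the two EDA rules above, applied to a made-up vector (the data and the variable names are illustrative assumptions, not from any source). Note that quantile() uses its default interpolation (type 7), so the fences can shift slightly under other quantile definitions.

```r
# Toy data with one obviously extreme value (made up for illustration)
x <- c(12, 14, 14, 15, 15, 16, 16, 17, 18, 45)

# IQR rule: flag points more than 1.5 * IQR beyond the first/third quartile
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
iqr_flags <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr

# Hampel filter: flag points more than 3 raw MADs away from the median
med <- median(x)
raw_mad <- mad(x, constant = 1)
hampel_flags <- abs(x - med) > 3 * raw_mad

# Potential outliers under each rule
x[iqr_flags]
x[hampel_flags]
```

Both rules only nominate candidates; whether a flagged point is really an outlier is a separate question.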

Tests

  • ** All tests assume data is from a Normal distribution **
  • See the EDA section for ways to find potential outliers to test
  • Grubbs’s Test
    • Test either a maximum or minimum point

      • If you suspect multiple points, you have to remove the max/min points above/below the suspect point. Then test the subsetted data. Repeat as necessary
    • H0: There is no outlier in the data.

    • Ha: There is an outlier in the data.

    • Test statistics

      \[ G = \frac{\bar{Y} - Y_{\text{min}}}{s} \quad \text{or} \quad G = \frac{Y_{\text{max}} - \bar{Y}}{s} \]

      • Statistics for whether the minimum or maximum sample value is an outlier
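    • The suspect value is an outlier if

      \[ G > \frac {N-1}{\sqrt{N}} \sqrt{\frac {t^2_{\alpha/(2N),N-2}}{N-2+t^2_{\alpha/(2N),N-2}}} \]

      • The rejection rule is “>” for the minimum as well, because both statistics are nonnegative distances from the mean in units of s
      • t denotes the upper critical value of the t distribution with (N-2) degrees of freedom at a significance level of α/(2N), which is the level for a two-sided test (where G is the largest absolute deviation from the mean divided by s)
      • For a one-sided test of either the maximum or the minimum value, use a significance level of α/N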
    • Requirements

      • Normally distributed
      • More than 7 observations
    • outliers::grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)

      • x: a numeric vector of data values

      • type: 10 (default) = test for one outlier, 11 = test for two outliers on opposite tails (both the minimum and maximum values), 20 = test for two outliers in the same tail.

      • opposite:

        • FALSE (default): test the value with the largest difference from the mean
        • TRUE: test the value at the opposite extreme (e.g. the minimum, if the maximum is the most suspicious)
      • two.sided: logical; set to TRUE to treat the test as two-sided.

    • see bkmk for examples

  • Dixon’s Test
    • Test either a maximum or minimum point
      • If you suspect multiple points, you have to remove the max/min points above/below the suspect point. Then test the subsetted data. Repeat as necessary.
    • Most useful for small sample size (usually n≤25)
    • H0: There is no outlier in the data.
    • Ha: There is an outlier in the data.
    • outliers::dixon.test
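      • Only accepts a vector of 3 to 30 observations
      • By default, tests the value farthest from the mean; “opposite=TRUE” tests the value at the other extreme (e.g. the maximum, if the minimum is the most suspicious)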
  • Rosner’s Test (aka generalized extreme Studentized deviate (ESD) test)
    • Tests multiple points at once
    • Avoids the problem of masking, where an outlier that is close in value to another outlier can go undetected.
    • Most appropriate when n≥20
    • H0: There are no outliers in the data set
    • Ha: There are up to k outliers in the data set
    • res <- EnvStats::rosnerTest(x,k)
      • x: numeric vector
      • k: upper limit of suspected outliers
      • alpha: 0.05 default
      • The result of the test, res, is a list that contains a number of objects
    • res$all.stats shows all the calculated statistics used in the outlier determination and the results
      • “Value” shows the data point values being evaluated
      • “Outlier” is True/False on whether the point is determined to be an outlier by the test
      • “R.i+1” holds the test statistics (the Rs)
      • “lambda.i+1” holds the critical values (the λs)
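
To make the formulas in this section concrete, here is a rough base R sketch of the one-sided Grubbs test (significance level α/N) and of the generalized ESD procedure behind Rosner's test. The function names grubbs_one_sided and rosner_esd, the toy data, and the output column names are hypothetical; this is not the code used by {outliers} or {EnvStats}, and it assumes roughly normal data with k ≤ n - 2.

```r
# One-sided Grubbs test for the maximum or the minimum (level alpha / N)
grubbs_one_sided <- function(x, tail = c("max", "min"), alpha = 0.05) {
  tail <- match.arg(tail)
  n <- length(x)
  g <- if (tail == "max") (max(x) - mean(x)) / sd(x) else (mean(x) - min(x)) / sd(x)
  t_crit <- qt(1 - alpha / n, df = n - 2)  # upper critical value of t
  g_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
  list(G = g, critical = g_crit, outlier = g > g_crit)
}

# Generalized ESD: test up to k outliers, removing the most extreme point each round
rosner_esd <- function(x, k, alpha = 0.05) {
  n <- length(x)
  remaining <- x
  value <- R <- lambda <- numeric(k)
  for (i in seq_len(k)) {
    dev <- abs(remaining - mean(remaining))
    idx <- which.max(dev)
    value[i] <- remaining[idx]
    R[i] <- dev[idx] / sd(remaining)        # test statistic for round i
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    lambda[i] <- (n - i) * t_crit / sqrt((n - i - 1 + t_crit^2) * (n - i + 1))
    remaining <- remaining[-idx]
  }
  # Number of outliers = largest i whose statistic exceeds its critical value
  exceed <- which(R > lambda)
  n_out <- if (length(exceed) > 0) max(exceed) else 0
  data.frame(i = seq_len(k), Value = value, R = R, lambda = lambda,
             Outlier = seq_len(k) <= n_out)
}

# Toy data (made up): ten ordinary values and one extreme value
x <- c(10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 50)
g_max <- grubbs_one_sided(x, "max")
g_min <- grubbs_one_sided(x, "min")
res <- rosner_esd(x, k = 3)
res
```

Because the number of outliers is taken as the largest round whose statistic exceeds its critical value, earlier rounds are flagged even if their own statistic fell short, which is how the procedure avoids masking.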

Preprocessing

  • Removal
    • An option if there’s sound reasoning (e.g. data entry error, etc.)
  • Winsorization
    • A typical strategy is to set all outliers (values beyond a certain threshold) to a specified percentile of the data
    • Example: A 90% winsorization would see all data below the 5th percentile set to the 5th percentile, and data above the 95th percentile set to the 95th percentile.
    • Packages
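
A minimal base R sketch of the 90% winsorization example above. The helper name winsorize and the toy vector are hypothetical, and the percentiles come from quantile() with its default interpolation.

```r
# Clamp values below the lower percentile and above the upper percentile
winsorize <- function(x, lower = 0.05, upper = 0.95) {
  limits <- quantile(x, probs = c(lower, upper), names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

# Toy data (made up): 1 to 19 plus one extreme value
x <- c(1:19, 100)
w <- winsorize(x)

range(x)  # original range
range(w)  # range after winsorizing

# Winsorized mean (extremes replaced) vs. trimmed mean (extremes dropped)
mean(w)
mean(x, trim = 0.05)
```

Values strictly between the two limits are left untouched; only the tails are pulled in.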

Statistics

  • For a skewed distribution, a Winsorized Mean (percentage of points replaced) often has less bias than a Trimmed Mean
  • For a symmetric distribution, a Trimmed Mean (percentage of points removed) often has less variance than a Winsorized Mean.
  • Hodges–Lehmann Estimator
    • Packages: {DescTools::HodgesLehmann}
    • A robust and nonparametric estimator of a population’s location parameter.
    • For populations that are symmetric about one median, such as the Gaussian or normal distribution or the Student t-distribution, the Hodges–Lehmann estimator is a consistent and median-unbiased estimate of the population median.
      • Has a Breakdown Point of 0.29, which means that the statistic remains bounded even if nearly 30 percent of the data have been contaminated.
        • Sample Median is more robust with breakdown point of 0.50 for symmetric distributions, but is less efficient (i.e. needs more data).
    • For non-symmetric populations, the Hodges–Lehmann estimator estimates the “pseudo–median”, which is closely related to the population median (relatively small difference).
      • The pseudo-median is defined even for heavy-tailed distributions that lack a finite mean.
    • For two samples, it’s the median of all pairwise differences between an observation from x and an observation from y.
    • One-Variable Procedure
      • Find all possible two-element subsets of the vector.
      • Calculate the mean of each two-element subset.
      • Calculate the median of all the subset means.
    • Two-Variable Procedure
      • Form all possible pairs with one element from each vector (i.e. the Cartesian product)
      • Calculate the difference within each pair
      • Calculate the median of the differences
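
The two procedures above can be sketched in base R as follows. The function names hl_one_sample and hl_two_sample are hypothetical, and the one-sample version follows the steps exactly as listed (distinct two-element subsets only). Some implementations also pair each point with itself, so results may differ slightly from a packaged estimator.

```r
# One-variable procedure: median of the means of all two-element subsets
hl_one_sample <- function(x) {
  stopifnot(length(x) >= 2)
  pairs <- combn(x, 2)       # 2 x choose(n, 2) matrix, one subset per column
  median(colMeans(pairs))
}

# Two-variable procedure: median of all pairwise differences x_i - y_j
hl_two_sample <- function(x, y) {
  median(outer(x, y, "-"))   # outer() builds the full Cartesian product
}

# Toy data (made up): one contaminated value barely moves the estimate
x <- c(1, 2, 3, 4, 100)
mean(x)           # pulled toward the extreme value
hl_one_sample(x)  # stays near the bulk of the data

hl_two_sample(c(11, 12, 13), c(1, 2, 3))
```

Both functions enumerate every pair, so memory and time grow quadratically with the sample size; that is fine for a sketch but worth keeping in mind for large vectors.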

Models