Ensembling

Misc

  • Statistical ensemble nearly as good as a DL ensemble and was much faster and cheaper. (Thread)
    • 4 models used in the ensemble
      • AutoARIMA and Exponential Smoothing ({forecast})
      • Complex Exponential Smoothing ({smooth})
      • Dynamically Optimized Theta method ({{StatsForecast}})
  • Combinations Seen
    • KNN, random forest and principal components regression together with ARIMA, Unobserved Components Model and linear regression
  • It’s ideal to include individual models with errors uncorrelated with each other.
    • Some teams in the M5 competitions combined models trained on subsets of the training data and with different loss functions.
  • Other methods: complex blending, stacking
  • Complex ensembles can be difficult to maintain in production, so the performance improvement of using one should reflect that fact.

Diagnostics

  • Residual Testing
    • Hyndman suggests taking average degrees of freedom (dof) of the models included in the ensemble to use for residual tests (?)

Averaging (Equal Weighting)

  • aka Blending
  • CDC ensemble: https://github.com/reichlab/covid19-forecast-hub#ensemble-model
    • Each model has a forecast distribution for each horizon
      • “23 quantiles to be submitted for each of the one through four week ahead values for forecasts of deaths, and a full set of 7 quantiles for the one through four week ahead values for forecasts of cases (see technical README for details), and that the 10th quantile of the predictive distribution for a 1 week ahead forecast of cumulative deaths is not below the most recently observed data.”
    • Used the median prediction across all eligible models at each quantile level

Adaptive Ensembling

  • Misc

    • Video, Github (Amaal Saadallah)
      • She also has paper that uses Reinforcement Learning with actor-critic approach to compute weights, EADRL
  • Overview

  • Steps

    1. Drift Detection
      • Drift from stationarity detected which triggers a re-selection of base models and ensemble

      • Calculating drift

        1. Select a testing window with most recent data
        2. Calculate predictions using each base model from the ensemble for this testing window
        3. Calculate scaled root correlation (SRC) between each model’s predictions and the testing window observations
          \[ \text{SRC}(\hat Y_{W,t}^{M_i}, Y_{M,t}) = \sqrt{\frac{1-\operatorname{corr}(\hat Y_{W,t}^{M_i}, Y_{W,t})}{2}} \]
          • \(\hat Y\) is the predictions for model, \(M_i\), and testing window \(W,t\).
          • \(Y\) is testing window observations for testing window \(W,t\)
          • \(\text{corr}\) is the pearson correlation
        4. Hoeffding Bound
          \[ \epsilon_F = \sqrt{\frac{R^2 \log(1/\delta)}{2W}} \]
          • \(R\) is the range of the random variable
          • \(W\) is the number of observations of the testing window used to calculate the \(\text{SRC}\)
          • \(\delta\) is a user-defined hyperparameter
    2. Prune base models
      • CV
      • Remove the awful ones
    3. Cluster pruned set of models based on predictions
      • By selecting a model from each cluster, you get a diversified set of models for the ensemble
        • By reducing covariance between models in your ensemble, it reduces the variance in the ensemble’s residuals (my notes)
      • The representative model from each cluster is selected for the ensemble
      • Suggests using an Improper Maximum Likelihood Estimator Clustering (IMLEC) method
        • {otrimle}
          • Hyperparameters automatically tuned; Outliers removed
          • Seems to be a robust gaussian mixture clustering algorithm
          • Has links to paper, Coretto and Hennig, 2016
        • Says Euclidean distance is bad. This method uses covariance to form clusters which fits with trying to get a set of time series with maximum variance for the ensemble
        • Since this is a GMM you could have models belonging to more than 1 cluster I assume.
          • If 1 model is chosen as representative for more than 1 cluster than probably wouldn’t matter too much. Maybe less redundancy.
          • if 1 model is in more than 1 cluster but not chosen as representative in both, it might be interesting to know that. Maybe means there’s some redundancy in your ensemble but not too much. Might want to score an ensemble with only 1 of cluster representatives and see if it’s better.
      • She has a DTW option but seems to prefer this IMLEC method
        • Showed some RMSE stats where the dtw method had a 50 RMSE and IMLEC had a 43
        • Used partitioning for a clustering method but no tuning was involved
    4. Combine models predictions
      • sliding window average, averaging, stacking, metalearner