Forecasting

Misc

  • Also see
  • Loss Functions/Error Metrics
    • If you care about errors measured in percent (i.e. “relative”) and your data are strictly positive, then “relative” metrics such as MALE and RMSLE, or MAPE and sMAPE, fit this use case, but they have known issues (See Loss Functions)
    • If you care about errors measured in real units (e.g. number of apples), or your data can be zero or negative, then “raw” metrics such as MAE or MSE are more appropriate.
      • MAE and RMSE are scale-dependent, so they can only be used to compare models on a single series, or across multiple series that share the same units. Minimizing the MAE leads to forecasts of the median, while minimizing the RMSE leads to forecasts of the mean.
      • Also see Loss Functions >> Misc >> re Stochastic Gradient Descent in ML/DL models
    • If you want to compare or aggregate performance metrics across time series, then you might want to use scaled metrics such as MASE or MASLE (sketched below)
      • Using MASLE requires that your data be strictly positive
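      • Example sketch: a minimal MASE implementation in R (the function name, arguments, and toy numbers are illustrative, not from the notes); the out-of-sample MAE is divided by the in-sample MAE of a one-step naive forecast, which is what makes it comparable across series

        # MASE for a non-seasonal series
        mase <- function(actual, forecast, train) {
          scale <- mean(abs(diff(train)))            # in-sample MAE of the naive (random walk) forecast
          mean(abs(actual - forecast)) / scale
        }

        train    <- c(110, 112, 115, 119, 118, 121)  # toy in-sample values
        actual   <- c(124, 126)                      # toy holdout values
        forecast <- c(122, 125)                      # toy forecasts
        mase(actual, forecast, train)                # < 1 means it beats the naive benchmark
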
  • Error Components
    • Scale-Dependent Base Error - A measurement of the difference between the forecast and the observed value. It’s scale-dependent because the larger the magnitude of the time series, the larger the error.

    • Positive Transform - You don’t want negative error values to cancel out the positive error values, so you want them all to be positive. A square or absolute value is typically used.

      • Squared positive transforms optimize for the mean, and absolute-value positive transforms optimize for the median
    • Summary Operator - A method for aggregating the errors (e.g. mean, weighted mean, median, geometric mean, etc.)

    • Example: MAE
      \[ \text{MAE} = \mbox{mean}(|e_t|) \]

      • \(e_t\): The residual is the scale-dependent base error
      • \(|\quad |\): The absolute value is the positive transform
      • \(\mbox{mean}\): The mean is the summary operator
    • Example: MAPE
      \[ \text{MAPE} = \mbox{mean}\left ( \left | 100\frac{e_t}{y_t} \right | \right ) \]

      • \(100\frac{e_t}{y_t}\): the percentage error is the base error (dividing by \(y_t\) makes it relative rather than scale-dependent)
      • \(| \quad |\): positive transform
      • \(\mbox{mean}\): summary operator
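    • Example sketch: the three components in R (toy numbers, not from the notes); swapping the positive transform or summary operator turns one metric into another

      actual   <- c(105, 102, 98, 110)        # toy observed values
      forecast <- c(100, 104, 97, 108)        # toy forecasts
      e <- actual - forecast                  # base error (residuals)

      mae  <- mean(abs(e))                    # absolute-value transform + mean summary
      rmse <- sqrt(mean(e^2))                 # squared transform + mean summary, re-expressed on the original scale
      mdae <- median(abs(e))                  # same transform as MAE, median summary operator
      mape <- mean(abs(100 * e / actual))     # percentage base error + absolute value + mean
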
  • Considerations when choosing a metric
    • Use Case: If it’s for a report, you’ll want something more interpretable than if it’s just for modeling.
    • Scale
      • For products, you care more about the forecasting error of big ticket or large sales volume items than for less expensive or smaller sales volume items.
      • If you’re forecasting temperature, then scale is irrelevant.
    • Over/Under-Forecasting
      • Should one have a greater penalty than the other?
    • Time Series Properties
      • Outliers, Intermittency, Level Shifts, etc.
    • Comparable across scales
      • Do we forecast furniture sales better than t-shirt sales?
  • Averaging vs. Pooling Errors
    • If you’re using one model to forecast multiple related series, then in order to compare model performance for model selection, you’ll need to either pool or average the forecasting errors across series for each model.
    • Averaging
      \[ \overline{\text{RMSE}} = \frac{1}{M} \sum_{i = 1}^M \text{RMSE}_i \]
      • The average of the RMSE for each time series
      • \(M\): The number of time series
    • Pooling
      \[ \text{RMSE}_{\text{pooled}} = \sqrt{\frac{1}{MH}\sum_{i=1}^M\sum_{t=1}^H (e_{t,i})^2} \]
      • \(M\): The number of time series
      • \(H\): The forecast horizon
      • Example
        \[ \text{RMSE}_{\text{pooled}} = \sqrt{\frac{1}{2 \cdot 3}(e_{1,1}^2 + e_{2,1}^2 + e_{3,1}^2 + e_{1,2}^2 + e_{2,2}^2 + e_{3,2}^2)} \]
        • \(M=2\) means two time series are forecasted by this model
        • \(H=3\) means the forecast horizon is 3
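    • Example sketch: both aggregations in R (illustrative; errs is a hypothetical \(H \times M\) matrix of forecast errors, one column per series)

      errs <- matrix(c( 2, -1,  3,            # series 1 errors over a 3-step horizon
                       -4,  2, -2),           # series 2 errors
                     nrow = 3)
      rmse_avg    <- mean(apply(errs, 2, function(e) sqrt(mean(e^2))))   # average the per-series RMSEs
      rmse_pooled <- sqrt(mean(errs^2))                                  # pool all M*H squared errors, then take the root
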
  • How does the amount of data affect prediction
    • Could be useful for choosing a training window for production
    • Example: Relative Absolute Error vs number of rows in the training set
      • Interpretation: No pattern?
        • Might’ve been useful to only look at the values from 0.5. It looks like a lot more points cluster at training-set sizes between 1000 and 1200 rows.
    • Example: MAE vs prediction horizon (colored by the number of weeks of data in training set)
      • Interpretation for this data and model: for prediction horizons greater than a couple of weeks, having more data in the training set leads to worse performance
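    • Example sketch: a rough R loop comparing holdout MAE across candidate training window lengths (illustrative; WWWusage and auto.arima are stand-ins, not from the notes)

      library(forecast)

      y <- WWWusage                             # stand-in series (100 observations)
      h <- 10                                   # holdout horizon
      test    <- as.numeric(tail(y, h))
      windows <- c(30, 50, 70, 90)              # candidate training window lengths

      mae_by_window <- sapply(windows, function(w) {
        train <- ts(tail(head(y, length(y) - h), w))
        fc    <- forecast(auto.arima(train), h = h)
        mean(abs(test - as.numeric(fc$mean)))
      })
      setNames(mae_by_window, windows)          # compare MAE by window length
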
  • Does the model predict values close to zero
    • “Labels” are the observed values
    • Bad performance on values close to zero can be the result of the loss function: errors on small values produce small losses, so they aren’t penalized as heavily as the larger losses coming from large values
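    • Example sketch: a toy illustration of that effect in R (numbers are made up); the near-zero observation is off by 900% yet contributes only a small share of the squared loss, so a loss-minimizing model has little incentive to fit it

      actual   <- c(0.5, 100)
      forecast <- c(5, 110)
      loss <- (actual - forecast)^2             # 20.25 and 100
      loss / sum(loss)                          # near-zero point contributes ~17% of the total loss
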

Probabilistic

  • Continuous Ranked Probability Score (CRPS)
    • fabletools::accuracy
    • Measures forecast distribution accuracy
    • Combines a MAE score with the spread of simulated point forecasts
    • See notebook (pg 172)
  • Winkler Score
    • fabletools::accuracy
    • Measures how well a forecast is covered by the prediction intervals (PI)
    • See notebook (pg 172)
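  • Example sketch: computing both with fabletools::accuracy (illustrative toy data; the fable/tsibble setup is an assumption, not from the notes)

    library(fable)
    library(fabletools)
    library(tsibble)
    library(dplyr)

    set.seed(1)
    dat <- tsibble(month = yearmonth("2020 Jan") + 0:35,    # toy monthly series
                   y     = rnorm(36, mean = 100, sd = 5),
                   index = month)
    train <- dat |> filter(month < yearmonth("2022 Jan"))

    fc <- train |> model(naive = NAIVE(y)) |> forecast(h = 12)

    fc |> accuracy(dat, list(crps = CRPS))                          # forecast distribution accuracy
    fc |> accuracy(dat, list(winkler = winkler_score), level = 95)  # quality of the 95% prediction interval
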

Tests

  • Diebold-Mariano
    • Tests whether two methods have the same forecast accuracy

    • Formula (source)

      \[ \begin{aligned} &\text{DM} = \frac{\bar d}{\sqrt{\frac{\gamma_0 + 2 \sum_{k=1}^{h-1}\gamma_k}{n}}} \\ &\begin{aligned} \text{where} \quad \bar d &= \frac{1}{n} \sum_{i=1}^n d_i \\ \gamma_k &= \frac{1}{n} \sum_{i=k+1}^n (d_i - \bar d)(d_{i-k} - \bar d) \end{aligned} \end{aligned} \]

      • \(d_i\): Loss Differential
        • For MSE, \(d_i = r_{i, 1}^2 - r_{i, 2}^2\)
        • For MAE, \(d_i = |r_{i, 1}| - |r_{i, 2}|\)
        • Where \(r_{i,1}\) are the residuals of model 1 and \(r_{i,2}\) are the residuals for model 2
      • \(\gamma_k\): Autocovariance at lag \(k\)
        • Approximated using ACF or Bartlett weights
    • Under the null hypothesis \(\mu = \mathbb{E}[d_i] = 0\) (and assuming the loss differential is stationary), \(\text{DM} \sim \mathcal{N}(0,1)\) asymptotically, so it’s treated as a z-stat

      • The DM statistic for the MASE has been empirically shown to approximate this distribution, while the mean relative absolute error (MRAE), MAPE and sMAPE do not. (wiki)
    • “Tends to reject the null hypothesis too often for small samples” (source), so Harvey, Leybourne and Newbold (1997) proposed a modified version:

      \[ DM_{\text{adj}} = \text{DM} \cdot \sqrt{\frac{n+1-2h+h(h-1)}{n}} \]
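    • Example sketch: the unadjusted statistic computed directly from the formula above (illustrative; r1 and r2 are assumed to be residual vectors from two competing models, squared-error loss)

      dm_stat <- function(r1, r2, h = 1) {
        d    <- r1^2 - r2^2                     # loss differential under squared-error (MSE) loss
        n    <- length(d)
        dbar <- mean(d)
        # autocovariances gamma_0, ..., gamma_{h-1} of the loss differential
        gamma <- sapply(0:(h - 1), function(k) {
          sum((d[(k + 1):n] - dbar) * (d[1:(n - k)] - dbar)) / n
        })
        dbar / sqrt((gamma[1] + 2 * sum(gamma[-1])) / n)   # gamma[1] is gamma_0
      }
      # two-sided p-value under the N(0, 1) approximation:
      # 2 * pnorm(-abs(dm_stat(r1, r2)))
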

    • Example: {forecast} docs

      # Test on out-of-sample one-step forecasts
      f1 <- ets(WWWusage[1:80])
      f2 <- auto.arima(WWWusage[1:80])
      f1.out <- ets(WWWusage[81:100], model = f1)
      #> Model is being refit with current smoothing parameters but initial states are being re-estimated.
      #> Set 'use.initial.values=TRUE' if you want to re-use existing initial values.
      f2.out <- Arima(WWWusage[81:100], model = f2)
      
      accuracy(f1.out)
      #>                     ME    RMSE      MAE       MPE     MAPE      MASE      ACF1
      #> Training set 0.2100836 3.24835 2.570459 0.1203497 1.352355 0.4246845 0.2287215
      accuracy(f2.out)
      #>                    ME     RMSE      MAE       MPE     MAPE      MASE
      #> Training set 1.081679 3.329012 2.437119 0.6810673 1.375924 0.4026544
      #>                      ACF1
      #> Training set -0.004460367
      
      dm.test(residuals(f1.out), residuals(f2.out), h = 1)
      #> 
      #>  Diebold-Mariano Test
      #> 
      #> data:  residuals(f1.out)residuals(f2.out)
      #> DM = -0.14392, Forecast horizon = 1, Loss function power = 2, p-value =
      #> 0.8871
      #> alternative hypothesis: two.sided
      • Uses the modified version proposed by Harvey, Leybourne and Newbold (1997)
      • alternative=“less” - The alternative hypothesis is that method 2 is less accurate than method 1
      • alternative=“greater” - The alternative hypothesis is that method 2 is more accurate than method 1
      • alternative=“two.sided” (default) - The alternative hypothesis is that method 1 and method 2 have different levels of accuracy
      • varestimator = “acf” (default) or “bartlett”: Autocovariance estimation