Forecasting
Misc
- Also see
- Loss Functions/Error Metrics
- If you care about errors measured in percent (i.e. “relative”) and your data are strictly positive, then “relative” metrics such as the MALE and RMSLE or MAPE and sMAPE are also in this class of metrics but have issues (See Loss Functions)
- If you care about errors measured in real units (e.g. number of apples), or your data can be zero or negative, then “raw” metrics such as MAE or MSE are more appropriate.
- MAE and RMSE are scale-dependent, so they can only be used to compare models on 1 series or on multiple series if they have the same units. MAE will lead to forecasts of the median, while minimizing the RMSE will lead to forecasts of the mean
- Also see Loss Functions >> Misc >> re Stochastic Gradient Descent in ML/DL models
- If you want to compare or aggregate performance metrics across time series, then you might want to use scaled metrics such as MASE, MASLE
- Using MASLE will require your data are strictly positive
- Error Components
Scale-Dependent Base Error - A measurement of the difference between the forcast and the observed. It’s scale-dependent because the larger the magnitude of the time series, the larger the error.
Positive Transform - You don’t want negative error values to cancel out the positive error values, so want them all to be positive. A square or absolute value is typically used.
- Squared positive transforms optimize on the mean and absolute value positive transfomrs optimize on the median
Summary Operator - A method for aggregating the errors (e.g. mean, weighted mean, median, geometric mean, etc.)
Example: MAE
\[ \text{MAE} = \mbox{mean}(|e_t|) \]- \(e_t\): The residual is the scale-dependent base error
- \(|\quad |\): The absolute value is the positive transform
- \(\mbox{mean}\): The mean is the summary operator
Example: MAPE
\[ \text{MAPE} = \mbox{mean}\left ( \left | 100\frac{e_t}{y_t} \right | \right ) \]- \(100\frac{e_t}{y_t}\): scale-dependent base error
- \(| \quad |\): positive transform
- \(\mbox{mean}\): summary operator
- Considerations when choosing a metric
- Use Case: If it’s for a report, you’ll want something more interpretable than if it’s just for modeling.
- Scale
- For products, you care more about the forecasting error of big ticket or large sales volume items than for less expensive or smaller sales volume items.
- If you’re forecasting temperature, then scale is irrelevant.
- Over/Under-Forecasting
- Should either have a greater penalty than the other?
- Time Series Properties
- Outliers, Intermittency, Level Shifts, etc.
- Comparable across scales
- Do we forecast furniture sales better than t-shirts?
- Averaging vs. Pooling Errors
- If you’re using one model to forecast multiple related series, then in order to compare model performance for model selection, you’ll need to either pool or average the forecasting errors across series for each model.
- Averaging
\[ \overline{\text{RMSE}} = \frac{1}{M} \sum_{i = 1}^M \text{RMSE}_i \]- The average of the RMSE for each time series
- \(M\): The number of time series
- Pooling
\[ \text{RMSE}_{\text{pooled}} = \sqrt{\frac{1}{MH}\sum_{i=1}^M\sum_{t=1}^H (e_{t,i})^2} \]- \(M\): The number of time series
- \(H\): The forecast horizon
- Example
\[ \text{RMSE}_{\text{pooled}} = \sqrt{\frac{1}{2*3}(e_{1,1}^2 + e_{2,1}^2 + e_{3,1}^2 + e_{1,2}^2 + e_{2,2}^2 + e_{3,2}^2)} \]- \(M=2\) means two time series are forecasted by this model
- \(H=3\) means the forecast horizon is 3
- How does the amount of data affect prediction
- Could be useful for choosing a training window for production
- Example: Relative Absolute Error vs number of rows in the training set
- Interpretation: No pattern?
- Might’ve been useful to only look at the values from 0.5. Looks like a lot more points a cluster at data sizes between 1000 and 1200 rows.
- Interpretation: No pattern?
- Example: MAE vs prediction horizon (colored by the number of weeks of data in training set)
- Interpretation for this data and model: For prediction horizons greater than a couple weeks, having mored data in the training set leads to worse performance
- Does the model predict values close to zero
- “Labels” are the observed values
- Bad performance with values close to zero can be the result of the loss function used where lower losses are not penalized as much as higher losses
Probabilistic
- Continuous Ranked Probability Score (CRPS)
fabletools::accuracy
- Measures forecast distribution accuracy
- Combines a MAE score with the spread of simulated point forecasts
- See notebook (pg 172)
- Winkler Score
fabletools::accuracy
- Measures how well a forecast is covered by the prediction intervals (PI)
- See notebook (pg 172)
Tests
- Diebold-Mariano
Tests whether two methods have the same forecast accuracy
Formula (source)
\[ \begin{aligned} &\text{DM} = \frac{\bar d}{\sqrt{\frac{\gamma_0 + 2 \sum_{k=1}^{h-1}\gamma_k}{n}}} \\ &\begin{aligned} \text{where} \quad \bar d &= \frac{1}{n} \sum_{i=1}^n d_i \\ \gamma_k &= \frac{1}{n} \sum_{i=k+1}^n (d_i - \bar d)(d_{i-k} - \bar d) \end{aligned} \end{aligned} \]
- \(d_i\): Loss Differential
- For MSE, \(d_i = r_{i, 1}^2 - r_{i, 2}^2\)
- For MAE, \(d_i = |r_{i, 1}| - |r_{i, 2}|\)
- Where \(r_{i,1}\) are the residuals of model 1 and \(r_{i,2}\) are the residuals for model 2
- \(\gamma_k\): Autocovariance at lag \(k\)
- Approximated using ACF or Bartlett weights
- \(d_i\): Loss Differential
Assumes \(\text{DM} \sim \mathcal{N}(0,1)\) for \(\mu = \mathbb{E}[d_i] = 0\) (i.e. loss-differential is stationary), so it’s a z-stat
- The DM statistic for the MASE has been empirically shown to approximate this distribution, while the mean relative absolute error (MRAE), MAPE and sMAPE do not. (wiki)
“Tends to reject the null hypothesis too often for small samples” (source). So Harvey, Leybourne and Newbold (1997) proposed a modified version
\[ DM_{\text{adj}} = \text{DM} \cdot \sqrt{\frac{n+1-2h+h(h-1)}{n}} \]
Example: {forecast} docs
# Test on out-of-sample one-step forecasts <- ets(WWWusage[1:80]) f1 <- auto.arima(WWWusage[1:80]) f2 <- ets(WWWusage[81:100], model = f1) f1.out #> Model is being refit with current smoothing parameters but initial states are being re-estimated. #> Set 'use.initial.values=TRUE' if you want to re-use existing initial values. <- Arima(WWWusage[81:100], model = f2) f2.out accuracy(f1.out) #> ME RMSE MAE MPE MAPE MASE ACF1 #> Training set 0.2100836 3.24835 2.570459 0.1203497 1.352355 0.4246845 0.2287215 accuracy(f2.out) #> ME RMSE MAE MPE MAPE MASE #> Training set 1.081679 3.329012 2.437119 0.6810673 1.375924 0.4026544 #> ACF1 #> Training set -0.004460367 dm.test(residuals(f1.out), residuals(f2.out), h = 1) #> #> Diebold-Mariano Test #> #> data: residuals(f1.out)residuals(f2.out) #> DM = -0.14392, Forecast horizon = 1, Loss function power = 2, p-value = #> 0.8871 #> alternative hypothesis: two.sided
- Uses Harvey, Leybourne and Newbold (1997) proposed a modified version
- alternative=“less” - The alternative hypothesis is that method 2 is less accurate than method 1
- alternative=“greater” - The alternative hypothesis is that method 2 is more accurate than method 1
- alternative=“two.sided” (default) - The alternative hypothesis is that method 1 and method 2 have different levels of accuracy
- varestimator = “acf” (default) or “bartlett”: Autocovariance estimation
- Friedman and Neyeni Tests
- Packages
- {tsutils::nemenyi} - Perform nonparametric multiple comparisons, across columns, using the Friedman and the post-hoc Nemenyi tests.
- Friedman - Tests whether a difference between models exists
- Nemenyi - Pairwise tests to see which models differ from which after the Friedman tests’s Null has been rejected.
- Packages