4  Extreme Value Theory

4.1 Misc

  • Packages
    • CRAN Task View

    • {erf} - Able to extrapolate estimates beyond the training data since erf is based on EVT; it is also flexible since it uses a RF

      • video: from the 33min mark to 55:19
      • Q(τ) is the desired quantile you want to estimate
        • Q(τ0) is an intermediate quantile (e.g. 0.80) that can be estimated using a quantile RF (package uses {grf})
          • The choice of τ0 depends on the thickness of the tail (i.e. whether the shape parameter is negative, 0, or positive)
          • 0.80 tends to work reasonably well
          • The higher the threshold you use, the lower the bias but the higher the variance (fewer observations exceed it)
      • ξ(x) and σ(x) indicate that the shape and scale parameters depend on the predictors. They’re estimated by minimizing a negative log-likelihood of the Generalized Pareto distribution in which each observation’s contribution is multiplied by a weight extracted from the quantile RF.
      • Tune the minimum node size and the penalty term on the variability of the shape parameter
      • CV using the deviance metric for model selection
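The extrapolation step from Q(τ0) to Q(τ) can be sketched with the standard Generalized Pareto tail formula. This is a minimal illustration, assuming the shape and scale have already been estimated; the function name and the numeric inputs are hypothetical and are not taken from {erf}.

```python
import math

def extrapolate_quantile(q_tau0, tau0, tau, sigma, xi):
    """Extrapolate from an intermediate quantile Q(tau0) to an extreme quantile Q(tau)
    using the Generalized Pareto approximation of the tail."""
    if not 0.0 < tau0 < tau < 1.0:
        raise ValueError("need 0 < tau0 < tau < 1")
    ratio = (1.0 - tau0) / (1.0 - tau)   # > 1: how much further into the tail we go
    if abs(xi) < 1e-12:
        # xi = 0 limit of the formula (exponential-type tail)
        return q_tau0 + sigma * math.log(ratio)
    return q_tau0 + (sigma / xi) * (ratio ** xi - 1.0)

# Hypothetical values: intermediate quantile 10 at tau0 = 0.80, scale 2, shape 0.5
q99 = extrapolate_quantile(q_tau0=10.0, tau0=0.80, tau=0.99, sigma=2.0, xi=0.5)
q99_light = extrapolate_quantile(q_tau0=10.0, tau0=0.80, tau=0.99, sigma=2.0, xi=0.0)
print(q99, q99_light)
```

A positive shape parameter pushes the extrapolated quantile further out than the exponential-type (ξ = 0) case, which is why the sign of ξ matters so much.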
    • {gbex} - no docs, only paper; gradient boosting for extreme quantile regression; able to extrapolate since it’s based on EVT

    • {evgam} - Extreme Value GAM; able to extrapolate since it’s based on EVT

    • {extRemes} (Tutorial, Slides with examples, Vignette, Bootstrapping) - General functions for performing extreme value analysis

      • Allows for inclusion of covariates into the parameters of the extreme-value distributions, as well as estimation through MLE, L-moments, generalized (penalized) MLE (GMLE), as well as Bayes.
      • Inference methods include parametric normal approximation, profile-likelihood, Bayes, and bootstrapping.
      • Some bivariate functionality and dependence checking (e.g., auto-tail dependence function plot, extremal index estimation) is also included
    • {SpatialGEV} (JOSS, Github): Fast Bayesian inference for spatial extreme value models in R

    • {EQRN} (Paper) - Extreme Quantile Regression Neural Networks for Conditional Risk Assessment (Uses {torch})

  • Papers
    • Distributional regression models for Extended Generalized Pareto distributions
      • As an example of modeling with an Extended Generalized Pareto Distribution (EGPD), precipitation with time and location covariates is modeled using {gamlss}. The authors wrote an extension to be able to use the EGPD distribution in order to capture extremes better than a Gamma distribution.
      • Code is in the paper and also in a github repo.
    • On the optimal prediction of extreme events in heavy-tailed time series with applications to solar flare forecasting
      • Uses a modified fractional ARIMA model (FARIMA or ARFIMA)
      • No code but seems feasible to implement using {extRemes} and code from {forecast::arfima}
    • New flexible versions of extended generalized Pareto model for count data
      • Github
      • The Discrete Generalized Pareto distribution (DGPD) is preferred for high threshold exceedances which makes it ideal for analyzing extreme values and rare events, but it becomes less effective for low threshold exceedances.
      • Peak-Over-Threshold (POT) method approximates the distribution of exceedances above a high threshold using the Generalized Pareto Distribution (GPD).
        • When the threshold is sufficiently high, standard extreme value models, such as the POT method with DGPD approximation can be applied to model the exceedances
        • Choosing an appropriate threshold \(u\) is critical for the effective application of the POT approximation.
          • Setting the threshold too low can introduce bias into the estimates, as the DGPD is justified only in an asymptotic sense.
          • Setting the threshold too high reduces the number of data points, increasing the estimation variance.
          • In practice, selecting a suitable threshold in a continuous setting often involves using parameter stability plots and mean residual life plots which may not always clearly indicate the best threshold.
          • The 3rd extension handles this problem by allowing you to set a lower threshold that’s interpretable to the user while still accurately modeling the distribution of the extreme values.
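The threshold trade-off can be sketched as a parameter stability table: refit the tail model at several candidate thresholds and watch the number of exceedances shrink as the threshold rises. This sketch swaps in method-of-moments GPD estimates (valid only for ξ < 0.5) in place of maximum likelihood to stay dependency-free. The simulated exponential data and all function names are assumptions for illustration.

```python
import random
import statistics

def gpd_moment_fit(excesses):
    """Method-of-moments GPD estimates (xi, sigma) from exceedances over a threshold.
    Uses mean = sigma/(1-xi) and var = sigma^2/((1-xi)^2 (1-2 xi)); needs xi < 0.5."""
    m = statistics.fmean(excesses)
    v = statistics.variance(excesses)
    r = m * m / v                      # equals 1 - 2*xi in theory
    xi = 0.5 * (1.0 - r)
    sigma = 0.5 * m * (1.0 + r)
    return xi, sigma

def stability_table(data, probs=(0.80, 0.90, 0.95, 0.98)):
    """For each candidate threshold (an empirical quantile), refit the GPD to the excesses."""
    xs = sorted(data)
    rows = []
    for p in probs:
        u = xs[int(p * len(xs))]
        excesses = [x - u for x in xs if x > u]
        xi, sigma = gpd_moment_fit(excesses)
        rows.append((p, u, len(excesses), xi, sigma))
    return rows

random.seed(1)
sample = [random.expovariate(1.0) for _ in range(20000)]   # simulated data, true xi = 0
table = stability_table(sample)
for p, u, n_exc, xi, sigma in table:
    print(f"p={p:.2f} u={u:.3f} n={n_exc} xi={xi:+.3f} sigma={sigma:.3f}")
```

A threshold is a reasonable candidate where the shape estimate stops drifting; beyond that point, raising it only costs data.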
      • This paper provides three extensions of the DGPD
        1. The entire distribution of the data, including both bulk and tail and bypassing the threshold selection step
          • For capturing the distribution characteristics of integer-valued data, including their variability and distribution shape.
        2. The entire distribution along with Zero Inflation
          • For datasets where non-negative integer values are prevalent but also have a significant number of zero observations (often referred to as “excessive zeros”)
          • e.g. Environmental data or medical statistics, where zeros might occur more frequently than other values. DEGPD models are designed to account for such zero inflation while still modeling the non-zero counts accurately.
        3. The tail of the distribution for low threshold exceedances
          • For analyzing discrete data where interest lies in exceedances above a specified threshold.
          • e.g. For modeling rare and extreme events such as high precipitation levels or extreme temperature spike counts.
          • Doesn’t require censoring in the likelihood which simplifies the modeling process in cases where setting an appropriate high threshold is challenging.
  • {erf} and {gbex} perform better than regular quantile RF model types for quantiles > 0.80 (video: from the 33min mark to 55:19, results towards the end)
    • Non-ML methods like {evgam} perform poorly on high-dimensional data
  • Why Random Forest models that do NOT incorporate EVT usually don’t produce good results
    • Typical RF weighs every data point equally while a grf (see Regression, Quantile), depending on the quantile estimate, will weigh data points closer to the quantile more heavily
    • Quantile Regression Forests work fine on moderate quantiles (e.g. 0.80) but even those like grfs struggle with more extreme quantiles because no matter how large the quantile you choose, the predicted quantile will be no larger than the most extreme data point. They use empirical methods and have no way to extrapolate.
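The ceiling on empirical quantile estimates can be seen in a small simulation. This is a sketch on simulated Pareto data, not a quantile forest: it only shows that an order-statistic-based quantile can never exceed the sample maximum, however extreme the requested level. The helper name is hypothetical.

```python
import random

def empirical_quantile(sorted_xs, tau):
    """Order statistic at index floor(tau * n), capped at the sample maximum."""
    n = len(sorted_xs)
    idx = min(n - 1, int(tau * n))
    return sorted_xs[idx]

random.seed(7)
alpha = 2.0
xs = sorted(random.paretovariate(alpha) for _ in range(1000))

for tau in (0.80, 0.99, 0.9999, 0.999999):
    true_q = (1.0 - tau) ** (-1.0 / alpha)    # Pareto(x_m = 1) quantile function
    print(tau, round(empirical_quantile(xs, tau), 2), round(true_q, 2), round(xs[-1], 2))
```

Once τ exceeds roughly 1 - 1/n the empirical estimate stays pinned at the largest observation, while the true quantile keeps growing.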

4.2 Distribution Tail Classification

  • Misc
  • Difference between tail events and outliers:
    • Outliers tend to be extreme values that occur very infrequently. Typically they are less than 1% of the data.
    • Tail events are less extreme values compared to outliers but occur with greater frequency.
  • Tail events can be difficult to predict because
    • Although not as rare as outliers, it’s still difficult to get enough data to model these events with sufficient precision.
    • Difficult to obtain leading indicators which are correlated with the likelihood of a tail event occurring
  • Prediction tips
    • Consider binning numerics to help the model learn sparse patterns.
    • Use realtime features
      • Example: Predicting delivery time tail events
        • unexpected rainstorm (weather data)
        • road construction (traffic data)
    • Utilize a quadratic or L2 loss function.
      • Mean Squared Error (MSE) is perhaps the most commonly used example. Because the loss function is calculated based on the squared errors, it is more sensitive to the larger deviations associated with tail events
  • Heavy tails
    • Your random variable distribution is heavy tailed if:
        \[\limsup_{x \to \infty} \frac{\bar{F}(x)}{e^{-\lambda x}} = \infty \quad \text{for all } \lambda > 0\]
        • where \(\bar{F}(x) = 1 - F(x)\) is the survival function of your random variable and \(e^{-\lambda x}\) is the exponential survival function
        • Says if you take the ratio of your most extreme positive values (i.e. your survival function) at the tail (i.e. supremum) (numerator) and those of the positive tail of the exponential survival function (denominator), then that ratio will go to positive infinity as x goes to infinity
        • Or in other words, the probability mass in the tail of your random variable is greater than the probability mass in the tail of the exponential
        • Also means that the moment generating function is infinite (for every t > 0), which means that it can’t be used to calculate distribution parameters
    • Subsets of heavy tails
      • Along with survival function ratio (see above), these tails have additional conditions 
      • long tails
        • common in finance
        • Your random variable distribution is long tailed if:
          • it follows the explosion principle
            • If an extreme event manifests itself, then the probability of an even more extreme event approaches 1
              • no time prediction on the next more extreme event but extreme value theory + timeseries + conditions say extreme events tend to cluster
            • Says, for example, if you take a huge loss in your portfolio, it’s a mistake to think that that value is an upper bound on losses or that the probability of an even larger loss is negligible
          • Not practical to determine from data
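A common way to formalize the explosion principle is the long-tail condition below. This is an assumption here, since the notes give no formula. As a worked check, a Pareto tail satisfies it while an exponential tail does not.

```latex
\[
\lim_{x \to \infty} P(X > x + t \mid X > x)
  = \lim_{x \to \infty} \frac{\bar{F}(x + t)}{\bar{F}(x)} = 1
  \quad \text{for all } t > 0
\]

% Pareto tail, \bar{F}(x) = (x_m / x)^{\alpha}: the ratio tends to 1
\[
\frac{\bar{F}(x + t)}{\bar{F}(x)} = \left( \frac{x}{x + t} \right)^{\alpha} \longrightarrow 1
\]

% Exponential tail, \bar{F}(x) = e^{-\lambda x}: the ratio is constant and below 1
\[
\frac{\bar{F}(x + t)}{\bar{F}(x)} = e^{-\lambda t} < 1
\]
```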
        • subexponential tails
          • subset of long tail
          • Your random variable distribution is subexponential tailed if:
            • it follows the one-shot aka catastrophe principle aka “winner takes all”
                \[P(S_n > x) \sim P(M_n > x) \quad \text{as } x \to \infty\]
                • where \(S_n\) is a partial sum of values of your random variable; \(M_n\) is a partial maximum; \(x\) is a large value
                • says at some point the partial sum, \(S_n\), will be dominated by one large value, \(M_n\)
                • Example: if your portfolio follows this principle, then your total loss can be mostly attributed to one large loss
            • tools available to practically test
          • Examples:
            • log-normal
              • can get normal parameters from lognormal parameters by formula that involves exponentiation (see notebook) or vice versa with logs
              • all statistical moments always exist
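The conversion between lognormal and normal parameters can be sketched as follows, assuming "lognormal parameters" means the mean and variance of X and "normal parameters" means the mean and standard deviation of log(X). This reading, and the function names, are assumptions; the original notebook formula is not shown here.

```python
import math

def lognormal_moments(mu, sigma):
    """Mean and variance of X = exp(Z), where Z ~ Normal(mu, sigma^2)."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    var = (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)
    return mean, var

def normal_params(mean, var):
    """Invert the above: recover (mu, sigma) of log(X) from the mean and variance of X."""
    sigma2 = math.log(1.0 + var / mean ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)

m, v = lognormal_moments(mu=0.0, sigma=1.0)
mu, sigma = normal_params(m, v)   # round trip back to the normal parameters
print(m, v, mu, sigma)
```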
          • fat tails
            • Fat-Tailed distributions describe quantities whose aggregate statistics are driven by rare events. For instance, the top 1% accounts for about 30% of the wealth in the US.
            • The central problem with these types of distributions is insufficient data. In other words, we need a large volume of data (more than is usually available) to estimate its true statistical properties accurately.
              • Estimating the mean
            • Masquerade Problem - Fat-Tailed distributions can appear thin-tailed, but thin-tailed can never appear fat-tailed.
              • Fat-Tailed quantities demonstrate significant regularity (e.g., most viruses are tame, stocks typically move between -1% and 1%)
              • Mistaking fat tails for thin ones is dangerous because a single wrong prediction can erase a long history of correct ones. So, if you’re unsure, it’s better to err on the side of fat tails.
              • Fat tails have a survival function of the form \(\bar{F}(x) = L(x)\,x^{-\alpha}\)
              • L(x) is a slowly varying function, so the decaying inverse power law element, \(x^{-\alpha}\), dominates as x goes to infinity
              • α is a shape parameter, aka “tail index” aka “Pareto index”
            • Examples
              • pareto
                • pareto has a similar relationship with the exponential distribution as lognormal does with normal
                  \[X = x_m e^{Y_{exp}}\]
                  • \(x_m\) is the (positive) minimum of the Pareto-distributed random variable, X, that has index α
                  • \(Y_{exp}\) is exponentially distributed with rate α
                • some theoretical statistical moments may not exist
                  • If the theoretical moments do not exist, then calculating the sample moments is useless
                  • Example: Pareto (α = 1.5) has a finite mean and an infinite variance
                    • Need α > 2 for a finite variance
                    • Need α > 1 for a finite mean
                    • In general you need α > p for the pth moment to exist
                    • If the nth moment is not finite, then the (n+1)th moment is not finite.
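The moment conditions and the Pareto-exponential link can be checked with a short sketch. It uses the textbook raw-moment formula for a Pareto with minimum x_m and index α, and simulates Pareto draws by exponentiating exponential draws. The function names and the seed are illustrative assumptions.

```python
import math
import random

def pareto_raw_moment(alpha, x_m, p):
    """Theoretical p-th raw moment of a Pareto(x_m, alpha); infinite when p >= alpha."""
    if p >= alpha:
        return math.inf
    return alpha * x_m ** p / (alpha - p)

def pareto_from_exponential(alpha, x_m, rng):
    """Draw a Pareto variate as x_m * exp(Y), with Y exponential with rate alpha."""
    return x_m * math.exp(rng.expovariate(alpha))

rng = random.Random(0)
draws = [pareto_from_exponential(1.5, 1.0, rng) for _ in range(10000)]
share_above_4 = sum(d > 4.0 for d in draws) / len(draws)   # theory: (1/4)**1.5 = 0.125

print(pareto_raw_moment(1.5, 1.0, 1))   # finite mean
print(pareto_raw_moment(1.5, 1.0, 2))   # infinite second moment, hence infinite variance
print(share_above_4)
```

The sample variance of `draws` can still be computed, but it estimates a quantity that does not exist. This is the sense in which sample moments are useless in that case.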
  • Light tails
    • Opposite of heavy
    • Instead of being larger than the pdf or survival function of the exponential, it’s equal to or smaller than it.
      • i.e. as x goes to infinity, your function decays as fast as or faster than an exponential
    • Examples
      • exponential, normal
  • Both
    • class depends on parameter values
    • Examples
      • Weibull (heavy tailed when its shape parameter is < 1, light tailed when it is ≥ 1)

  • Tests
    • Notes
      • All the plots below should be used and considered when diagnosing tails
      • Can use the Zipf and ME plots to find the thresholds in the data where it would be useful to start modeling the data as pareto or lognormal
    • Ask these questions
      1. Does the subject matter you’re modeling lead you to expect a certain type of tail?
        • Example: Does the explosion principle hold or not?
      2. Is there an upper bound to your data (theoretical or actual)?
        • Example: Is the upper bound due to the quality of the data?
      3. Do I have over 10,000 observations?
        • In the various plots below, it can be difficult to distinguish between Pareto (fat tail) and Lognormal (long tail) distributions. As a rule of thumb, it usually takes 10K observations to get enough data points in the tail to really be able to tell the two apart.
        • Usually get at least 10K observations in a market risk portfolio, but not in credit risk or operational risk portfolios
    • Q-Q plot
      • exponential quantiles on the y-axis and ordered data on the x-axis
        • See EDA >> Numeric >> Q-Q plot for code
      • if data hugs the diagonal line –> exponential –> light tails
      • if data is concave –> potentially heavy tails
      • if data is convex –> potentially tails that are lighter than an exponential
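The coordinates of an exponential Q-Q plot can be computed as below. This is a minimal sketch with simulated data; the plotting positions (i - 0.5)/n are one common convention and an assumption here. The points could then be passed to any plotting library.

```python
import math
import random

def exponential_qq_points(data):
    """Return (x, y) pairs: ordered data on x, standard exponential quantiles on y."""
    xs = sorted(data)
    n = len(xs)
    # plotting positions (i - 0.5)/n avoid a probability of exactly 1
    qs = [-math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(xs, qs))

random.seed(3)
light = exponential_qq_points(random.expovariate(1.0) for _ in range(2000))
heavy = exponential_qq_points(random.paretovariate(1.5) for _ in range(2000))

# Concavity check on the heavy-tailed sample: the slope flattens further into the tail
lower_slope = (heavy[1000][1] - heavy[200][1]) / (heavy[1000][0] - heavy[200][0])
upper_slope = (heavy[1980][1] - heavy[1000][1]) / (heavy[1980][0] - heavy[1000][0])
print(lower_slope, upper_slope)
```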
    • Zipf plot
      • log-log plot of the empirical survival function of the data
        • log of the pareto survival function makes it linear where the slope of the line is -α
      • indicates if there’s a power law decay in the tails of the data (i.e. fat tails)
        • linearity in this plot is “necessary” but not “sufficient” for confirmation of fat tails (pareto)
        • Curvature is sufficient to say it’s not a pareto
        • The real data shows linearity at the very end, so even though it’s not linear from the beginning, it is still potentially fat tailed
          • Real data often show mixed, complex behaviors.
        • Also note that even in the simulated dataset, the data points at the end have some randomness to them and don’t fall directly on the line.
          • the randomness is called small sample bias; usually not much data in the tails
        • log-normal can look like a pareto if its sigma parameter is large (small data). It will look linear and curve down at the very end.
        • Example above shows lognormal with sd = 1, so sd doesn’t have to be very large to be tricky to discern from a Pareto.
        • If the data has a smallish range (x-axis), then that is a signal to be wary about deeming the distribution to have fat tails
          • This one goes from 0 to 100 while the one above it goes from 0 to a million
          • “Large” or “small” depends on the type of data you’re looking at though. In another subject matter, maybe 100 is considered large, so context matters
        • Compare slopes between your original data (red) and aggregations of your data in a zipf plot; If you have fat tails, the line will be shifted because of aggregation but the slope, α, will remain the same
          • Examples of aggregation methods (halves the sample size)
            1. Order data from largest to smallest; add a1 + an, a2 + an-1, …; plot alongside original data (green)
            2. Order data from largest to smallest; add a1 + a2, a3 + a4, …; plot alongside original data (blue)
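The Zipf plot and the aggregation check can be sketched on simulated Pareto data. The slope is estimated here by ordinary least squares over all points, which is a rough illustrative choice rather than a recommended tail-index estimator. Function names, the seed, and the sample size are assumptions.

```python
import math
import random

def zipf_points(data):
    """Log-log coordinates of the empirical survival function, P(X >= x), for positive data."""
    xs = sorted(data)
    n = len(xs)
    return [(math.log(x), math.log((n - i) / n)) for i, x in enumerate(xs)]

def ols_slope(points):
    """Least-squares slope of y on x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx

def aggregate_adjacent(data):
    """Sum neighbours after sorting from largest to smallest: a1 + a2, a3 + a4, ..."""
    xs = sorted(data, reverse=True)
    return [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]

random.seed(11)
alpha = 2.0
sample = [random.paretovariate(alpha) for _ in range(5000)]

slope = ols_slope(zipf_points(sample))                          # near -alpha for Pareto data
slope_agg = ols_slope(zipf_points(aggregate_adjacent(sample)))  # shifted line, similar slope
print(slope, slope_agg)
```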
    • Mean Excess (ME) plot
      • Calculating the empirical mean excess variable - Order the data from smallest to largest, calc mean1, remove the 1st data point, calc mean2, remove data points 1 and 2, calc mean3, and so on. Then plot the means against the corresponding thresholds

        • lognormal is similar to pareto in this plot as well. The more data you have the easier it will be to distinguish the two.
        • The left equation is for the lognormal curve (with Normal parameters) and the right equation is the pareto
          • Need α > 1, so that the mean is finite
        • Disregard last few points (small sample bias)
        • Points in green circles (only a few points in tails, so difficult to be confident about)
          • left: shows a straight line
          • right: concave down
        • Right plot: curvature at the beginning common in the wild, since you’re not likely dealing with pure distributions but some kind of noisy mixture
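The empirical mean excess function can be sketched directly from its definition: for each observed threshold u, average the amounts by which the larger observations exceed u. For reference, a Pareto with α > 1 has a theoretical mean excess that rises linearly in u with slope 1/(α - 1), while an exponential has a flat one. The function name and the toy data are assumptions.

```python
def mean_excess_points(data):
    """Empirical mean excess e(u) at each observed threshold u (excluding the maximum)."""
    xs = sorted(data)
    points = []
    for k in range(len(xs) - 1):
        u = xs[k]
        tail = [x - u for x in xs[k + 1:] if x > u]   # excesses strictly above u
        if tail:
            points.append((u, sum(tail) / len(tail)))
    return points

points = mean_excess_points([1, 2, 3, 4, 10])
for u, e in points:
    print(u, e)
```

This version is quadratic in the sample size, which is fine for a sketch. A running sum over the sorted data would make it linear.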
    • Maximum to sum plot (MS Plot)
      • S is the partial sum, M is the partial maximum, p is the order of the moment that you want to see if it exists or not
        • For lognormal, all moments always exist
        • For pareto, you usually only need to check up to p = 4 or p = 5
          • For higher levels of p (and hence α) the pareto distribution begins to act like a normal
          • Usually in credit, market, or operational risk markets you’re dealing with pareto 0 < α <= 3
      • Procedure
        • choose a p that you want to check
        • for each n, calculate the sum, maximum, and ratio
        • y-axis is the ratio, x-axis is the n value
        • A lognormal will always converge to 0 for every p you check (black line)
        • When a moment doesn’t exist (i.e. infinite), it just oscillates and never converges (orange line)
        • MS plots always start at 1
        • Potentially with fewer than 100 observations, you could start to see a convergence if one is going to happen. Of course hundreds of observations is better. Point is that it doesn’t take thousands.
      • Left - credit data (real estate losses), Right - operational loss data
          • Left
            • p = 1 definitely exists; p = 2 is iffy; p = 3,4 don’t exist
            • Interpretation: either α is between 1 and 2 or there aren’t enough observations to show a convergence
              • Although n is pretty large in this case
          • Right
            • p = 1 is iffy, the rest don’t exist
            • Interpretation: α might be less than 1
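The MS plot procedure can be sketched as a running ratio of the partial maximum to the partial sum of |x|^p. This is a minimal illustration on toy and simulated data; the function name is hypothetical.

```python
import random

def max_to_sum_ratios(data, p):
    """R_n(p) = max(|x_1|^p, ..., |x_n|^p) / sum(|x_1|^p, ..., |x_n|^p) for n = 1..N."""
    ratios = []
    running_sum = 0.0
    running_max = 0.0
    for x in data:
        v = abs(x) ** p
        running_sum += v
        running_max = max(running_max, v)
        # guard against leading zeros in the data, where the ratio is undefined
        ratios.append(running_max / running_sum if running_sum > 0 else float("nan"))
    return ratios

toy_p1 = max_to_sum_ratios([1, 2, 3], p=1)
toy_p2 = max_to_sum_ratios([1, 2, 3], p=2)

random.seed(5)
pareto_sample = [random.paretovariate(1.5) for _ in range(5000)]
curves = {p: max_to_sum_ratios(pareto_sample, p) for p in (1, 2, 3, 4)}
print({p: round(r[-1], 3) for p, r in curves.items()})
```

Plotting each curve against n gives the MS plot: a curve that settles towards 0 suggests the p-th moment exists, while one that keeps jumping suggests it does not.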
    • Concentration Profile
      • Requirements
        • data >= 0 and mean is finite
      • Similar to the Mean Excess plot, except the Gini index is computed instead of the mean
        • In the wild you can expect mixtures, so there will likely be noisy behavior in the beginning and when the fat tail is reached, a flat line is formed
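A concentration profile can be sketched by computing the Gini index of the observations above a sequence of thresholds. For a pure Pareto with α > 1 the theoretical Gini index is 1/(2α - 1) at every threshold, which is where the flat line comes from. The function names, the simulated data, and the choice of thresholds are assumptions.

```python
import random

def gini(values):
    """Gini index of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

def concentration_profile(data, probs):
    """Gini index of the observations above each empirical-quantile threshold."""
    xs = sorted(data)
    profile = []
    for p in probs:
        tail = xs[int(p * len(xs)):]
        profile.append((p, gini(tail)))
    return profile

random.seed(5)
data = [random.paretovariate(3.0) for _ in range(20000)]   # theoretical Gini = 1/(2*3 - 1) = 0.2
profile = concentration_profile(data, (0.0, 0.5, 0.9))
for p, g in profile:
    print(p, round(g, 3))
```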