3 Extreme Value Theory

3.1 Misc

Packages
- CRAN Task View
- {erf} - Able to extrapolate extimates beyond the training data since erf is based on EVT and is also flexible since it uses a RF
  - video: from the 33min mark to 55:19
  - Q(τ) is the desired quantile you want to estimate
    - Q(τ0) is an intermediate quantile (e.g. 0.80) that can be estimated using a quantile RF (package uses {grf})
      - Depends on thickness of tail (i.e whether the shape parameter is negative, 0, or positive)
      - 0.80 tends to work reasonable well
      - The higher the threshold you use, the less variance but higher bias
  - ξ(x) and σ(x) says the shape and scale parameters depend on the predictors. They’re estimated by minimizing a probability distribution’s log-likelihood which are multiplied by weights extracted from quantile RF.
  - tune minimum node size, penalty term on the variability of the shape parameter
  - cv using deviance metric for model selection
- {gbex} - no docs, only paper, gradient boosting for extreme quantile regression; able to extrapolate since they’re based on EVT
- {evgam} - Extreme Value GAM; able to extrapolate since they’re based on EVT
- {extRemes} (Tutorial, Slides with examples, Vignette, Bootstrapping) - General functions for performing extreme value analysis
  - Allows for inclusion of covariates into the parameters of the extreme-value distributions, as well as estimation through MLE, L-moments, generalized (penalized) MLE (GMLE), as well as Bayes.
  - Inference methods include parametric normal approximation, profile-likelihood, Bayes, and bootstrapping.
  - Some bivariate functionality and dependence checking (e.g., auto-tail dependence function plot, extremal index estimation) is also included
- {SpatialGEV} (JOSS, Github): Fast Bayesian inference for spatial extreme value models in R
- {EQRN} (Paper) - Extreme Quantile Regression Neural Networks for Conditional Risk Assessment (Uses {torch})
- {maxbootR} - Efficient (C++) Bootstrap Methods for Block Maxima. Includes disjoint blocks and sliding blocks
- {fitdistcp} - Distribution Fitting with Calibrating Priors for Commonly Used Distributions
  - Using maximum likelihood gives predictions of extreme return levels that are exceeded more often than expected (i.e. under-predicts extreme values)
  - “For instance, when using the generalised extreme value distribution (GEVD) with 50 annual data values, fitted using maximum likelihood, we find that 200-year return levels are exceeded more than twice as often as expected; i.e. they are exceeded in more than 1 in 100 simulated years.”
  - Bayesian prediction using right Haar priors
Papers
- Distributional regression models for Extended Generalized Pareto distributions
  - As an example of modeling with an Extended Generalized Pareto Distribution (EGPD), precipitation with time and location covariates is modeled using {gamlss}. The authors wrote an extension to be able to use the EGPD distribution in order to capture extremes better than a Gamma distribution.
  - Code is in the paper and also in a github repo.
- On the optimal prediction of extreme events in heavy-tailed time series with applications to solar flare forecasting
  - Uses a modified fractional ARIMA model (FARIMA or ARFIMA)
  - No code but seems feasible to implement using {extRemes} and code from {forecast::arfima}
- New flexible versions of extended generalized Pareto model for count data
  - Github
  - The Discrete Generalized Pareto distribution (DGPD) is preferred for high threshold exceedances which makes it ideal for analyzing extreme values and rare events, but it becomes less effective for low threshold exceedances.
  - Peak-Over-Threshold (POT) method approximates the distribution of exceedances above a high threshold using the Generalized Pareto Distribution (GPD).
    - When the threshold is sufficiently high, standard extreme value models, such as the POT method with DGPD approximation can be applied to model the exceedances
    - Choosing an appropriate threshold \(u\) is critical for the effective application of the POT approximation.
      - Setting the threshold too low can introduce bias into the estimates, as the DGPD is justified only in an asymptotic sense.
      - Setting the threshold too high reduces the number of data points, increasing the estimation variance.
      - In practice, selecting a suitable threshold in a continuous setting often involves using parameter stability plots and mean residual life plots which may not always clearly indicate the best threshold.
      - The 3rd extension handles this problem by allowing you to set a lower threshold that’s interpretable to the user while still accurately modeling the distribution of the extreme values.
  - This paper provides three extensions for the DGPD copula
    1. The entire distribution of the data, including both bulk and tail and bypassing the threshold selection step
      - For capturing the distribution characteristics of integer-valued data, including their variability and distribution shape.
    2. The entire distribution along with Zero Inflation
      - For datasets where non-negative integer values are prevalent but also have a significant number of zero observations (often referred to as ”excessive zeros”)
      - e.g. Environmental data or medical statistics, where zeros might occur more frequently than other values. DEGPD models are designed to account for such zero inflation while still modeling the non-zero counts accurately.
    3. The tail of the distribution for low threshold exceedances
      - For analyzing discrete data where interest lies in exceedances above a specified threshold.
      - e.g. For modeling rare and extreme events such as high precipitation levels or extreme temperature spike counts.
      - Doesn’t require censoring in the likelihood which simplifies the modeling process in cases where setting an appropriate high threshold is challenging.
- On the optimal prediction of extreme events in heavy-tailed time series with applications to solar flare forecasting
- Conformal Prediction for Long-Tailed Classification
{erf} and {gbex} peform better than regular quantile rf model types for quantiles > 0.80 (video: from the 33min mark to 55:19, results towards the end)
- Non-ML methods like {evgam} perform poorly for data with highh dim
Why using Random Forest models that do NOT incorporate EVT usually don’t produce good results.
- Typical RF weighs every data point equally while a grf (see Regression, Quantile), depending on the quantile estimate, will weigh data points closer to the quantile more heavily
- Quantile Regression Forests work fine on moderate quantiles (e.g. 0.80) but even those like grfs struggle with more extreme quantiles because no matter how large the quantile you choose, the predicted quantile will be no larger than the most extreme data point. They use empirical methods and have no way to extrapolate.

Terms

Block Maxima - The highest values recorded within specific, consecutive, non-overlapping periods (or “blocks”) of time or sequences of observations
- e.g. Take a multi-year time series with monthly frequency is divided into yearly blocks. The maximum value within each block are the block maxima.
- Can be considered inefficient since only 1 point is considered from a block of data points.
- Also see
  - Distributions, Gumbel
  - Peaks Over Threshold (POT)
Fisher-Tippett-Gnedenko Theorem - States that, under certain conditions, the distribution of block maxima (if the blocks are large enough) will converge to one of three types of distributions, which are all part of the Generalized Extreme Value (GEV) distribution family. This makes modeling and predicting future extremes possible.
Peaks Over Threshold (POT) - Focuses on identifying and analyzing all observations that exceed a certain high level (the threshold).
- Values above the threshold are called exceedances
- The amount the exceedance exceeds the threshold is the excess
- Exceedance Rate is often modeled as a Poisson process
  - If values above the threshold are highly dependent, a “declustering” scheme might need to be applied first (e.g., ensuring exceedances are separated by a certain minimum time period) to achieve approximate independence before applying the POT model.
- The Pickands–Balkema–de Haan theorem states that for a sufficiently high threshold, the distribution of these excesses can be well-approximated by the Generalized Pareto Distribution (GPD)
- Threshold should be high enough so using a GPD distribution doesn’t produced biased predictions and low enough to include enough exceedances to estimate the shape and scale of the GPD
  - Graphical Methods: Mean Residual Life (MRL) Plot, Parameter Stability Plots
  - Others: Choose a quantile, choose a threshold that’s results in enough points for a valid GPD model, domain knowledge
- In contrast to block maxima, it’s data efficient (doesn’t discard potential extreme points), focuses on all extreme values, and isn’t dependent on the block size.

3.2 Distribution Tail Classification

Misc
- Notes from quantitative risk management lectures QRM 4-3, 4-4, https://www.youtube.com/watch?v=O0fdBwBRGU4
- Packages
  - {tailplots} - Estimators and Plots for Gamma and Pareto Tail Detection
    - Includes a g function that distinguishes between log-convex and log-concave tail behavior.
    - Also includes methods for visualizing these estimators and their associated confidence intervals across various threshold values.
- Skewness and Kurtosis, like all higher moments, have high variances. ({fitdistrplus} vignette)
  - Also see Distributions >> FItting Distributions >> {fitdistrplus} >> Skewness and Kurtosis for an example of bootstrapped skewness and kurtosis values to get an ideal of the variance.
Difference between tail events and outliers:
- Outliers tend to be extreme values that occur very infrequently. Typically they are less than 1% of the data.
- Tail events are less extreme values compared to outliers but occur with greater frequency.
Tail events can be difficult to predict because
- Although not as rare as outliers, it’s still difficult to get enough to data to model these events with any sufficient precision.
- Difficult to obtain leading indicators which are correlated with the likelihood of a tail event occurring
Prediction tips
- Consider binning numerics to help the model learn sparse patterns.
- Use realtime features
  - Example: Predicting delivery time tail events
    - unexpected rainstorm (weather data)
    - road construction (traffic data)
- Utilize a quadratic or L2 loss function.
  - Mean Squared Error (MSE) is perhaps the most commonly used example. Because the loss function is calculated based on the squared errors, it is more sensitive to the larger deviations associated with tail events
Heavy tails
- Your random variable distribution is heavy tailed if:
  - - where the exponential survival function,
    - Says if you take the ratio of your most extreme positive values (i.e. your survival function) at the tail (i.e. supremum)(numerator) and those of the positive tail of exponential survival function (denominator), then that ratio will go to positive infinity as x goes to infinity
    - Or in other words, the probability mass of the pdf of your random variable in the tail is greater than the probability mass that of the exponential pdf
    - Also means that the moment generating function is equal to infinity which means that it can’t be used to calculate distribution parameters
- Subsets of heavy tails
  - Along with survival function ratio (see above), these tails have additional conditions
  - long tails
    - common in finance
    - Your random variable distribution is long tailed if:
      - it follows the explosion principle
        
        If an extreme event manifests itself, then the probability of an even more extreme event approaches 1
        
        no time prediction on the next more extreme event but extreme value theory + timeseries + conditions say extreme events tend to cluster
        
        Says, for example, if you take a huge loss in your portfolio, it’s a mistake to think that that value is an upper bound on losses or that the probability of an even larger loss is negligible
      - Not practical to determine from data
    - subexponential tails
      - subset of long tail
      - Your random variable distribution is subexponential tailed if:
        
        it follows the one-shot aka catastrophe principle aka “winner takes all”
        
        where Sn is a partial sum of values of your random variable; Mn is a partial maximum; x is a large value
        
        says at some point the partial sum, Sn , will be dominated by one large value, Mn
        
        Example: if your portfolio follows this principle, then your total loss can be mostly attributed to one large loss
        
        tools available to practically test
      - Examples:
        
        log-normal
        
        can get normal parameters from lognormal parameters by formula that involves exponentiation (see notebook) or vice versa with logs
        
        all statistical moments always exist
      - fat tails
        
        Fat-Tailed distributions describe quantities whose aggregate statistics are driven by rare events. For instance, the top 1% accounts for about 30% of the wealth in the US.
        
        The central problem with these types of distributions is insufficient data. In other words, we need a large volume of data (more than is usually available) to estimate its true statistical properties accurately.
        
        Estimating the mean
        
        Masquerade Problem - Fat-Tailed distributions can appear thin-tailed, but thin-tailed can never appear fat-tailed.
        
        Fat-Tailed quantities demonstrate significant regularity (e.g., most viruses are tame, stocks typically move between -1% and 1%)
        
        Mistaking fat tails for thin ones is dangerous because a single wrong prediction can erase a long history of correct ones. So, if you’re unsure, it’s better to err on the side of fat tails.
        
        L(x) is just characterized as slowly varying function that gets dominated by the decaying inverse power law element, x-α. as x goes to infinity
        
        α is a shape parameter, aka “tail index” aka “Pareto index”
        
        Examples
        
        pareto
        
        pareto has similar relationship with the exponential distribution as lognormal does with normal
        
        xm is the (positive) minimum of the randomly distributed pareto variable, X that has iindex α
        
        Yexp is exponentially distributed with rate α
        
        some theoretical statistical moments may not exist
        
        If the theoretical moments do not exist, then calculating the sample moments is useless
        
        Example: Pareto (α = 1.5) has a finite mean and an infinite variance
        
        Need α > 2 for a finite variance
        
        Need α > 1 for a finite mean
        
        In general you need α > p for the pth moment to exist
        
        If the nth moment is not finite, then the (n+1)th moment is not finite.
Light tails
- Opposite of heavy
- Instead of larger than pdf or survival function of the exponential version, it’s equal to or smaller than.
  - i.e. your function decays as fast or faster as x goes to infinity as an exponential
- Examples
  - exponential, normal
Both
- class depends on parameter values
- Examples
  - Weibull

Tests
- Notes
  - All the plots below should be used and considered when diagnosing tails
  - Can use the zipf and me plots to find the thresholds in the data where it would be useful to start modeling the data as pareto or lognormal
- Ask these questions
  1. Does the subject matter you’re modeling lead you to expect a certain type of tail?
    - Example: Does the explosion principle hold or not?
  2. Is there an upper bound to your data (theoretical or actual)?
    - Example: Is the upper bound due to the quality of the data
  3. Do I have over 10,000 observations?
    - In the various plots below, it can be difficult to distinguish between Pareto (fat tail) and Lognormal (long tail) distributions. As a rule-of-thumb, usually takes 10K observations to really be able to tell the two apart in order to get enough data points in the tail.
    - Usually get at least 10K observations in a market risk portfolio, but not in credit risk or operational risk portfolios
- Q-Q plot
  - exponential quartiles on the y-axis and ordered data on the x-axis
    - See EDA >> Numeric >> Q-Q plot for code
  - if data hugs the diagonal line –> exponential –> light tails
  - if data is concave –> potentially heavy tails
  - if data is convex –> potentially tails that are lighter than an exponential
- Zipf plot
  - log-log plot of the empirical survival function of the data
    - log of the pareto survival function makes it linear where the slope of the line is -α
  - indicates if there’s a power law decay in the tails of the data (i.e. fat tails)
    - results of this plot is “necessary” but not “sufficient” for confirmation of fat tails (pareto)
    - It is sufficient to say it’s not a pareto if there’s curvature
  - - The real data shows linearity at the very end, so even though it’s not linear from the beginning, it is still potentially fat tailed
      - Real data often show mixed, complex behaviors.
    - Also not that even in the simulated dataset, the data points at the end have some randomness to them and don’t fall directly on the line.
      - the randomness is called small sample bias; usually not much data in the tails
  - - log-normal can look like a pareto if its sigma parameter is large (small data). It will look linear and curve down at the very end.
    - Example above shows lognormal with sd = 1, so sd doesn’t have to be very large to be tricky to discern from a Pareto.
    - If the data has a smallish range (x-axis), then that is a signal to wary about deeming the distribution to having fat tails
      - This one goes from 0 to 100 while the one above it goes from 0 to a million
      - “Large” or “small” depends on the type of data your looking at though. In another subject matter, maybe 100 is considered large, so context matters
  - - Compare slopes between your original data (red) and aggregations of your data in a zipf plot; If you have fat tails, the line will be shifted because of aggregation but the slope, α, will remain the same
      - Examples of aggregation methods (halves the sample size)
        
        Order data from largest to smallest; add a1 + an, a2 + an-1, …; plot alongside original data (green)
        
        Order data from largest to smallest; add a1 + a2, a3 + a4, …; plot alongside original data (blue)
- Mean Excess (ME) plot
  - Calculating the empirical mean excess variable - Order the data, calc mean2, remove the 1 data point, calc mean2, remove data points 1 and 2, calc mean3, and so on. Then plot the means
  - - lognormal is similar to pareto in this plot as well. The more data you have the easier it will be to distinguish the two.
    - The left equation is for the lognormal curve (with Normal parameters) and the right equation is the pareto
      - Need α > 1, so that the mean is finite
  - - Disregard last few points (small sample bias)
    - Points in green circles (only a few points in tails, so difficult to be confident about)
      - left: shows a straight line
      - right: concave down
    - Right plot: curvature at the beginning common in the wild, since you’re not likely dealing with pure distributions but some kind of noisy mixture
- Maximum to sum plot (MS Plot)
  - S is the partial sum, M is the partial maximum, p is the order of the moment that you want to see if it exists or not
    - For lognormal, all moments always exist
    - For pareto, you usually only need to check up to p = 4 or p = 5
      - For higher levels of p (and hence α) the pareto distribution begins to act like a normal
      - Usually in credit, market, or operational risk markets you’re dealing with pareto 0 < α <= 3
  - Procedure
    - choose a p that you want to check
    - for each n, calculate the sum, maximum, and ratio
    - y-axis is the ratio, x-axis is the n value
  - - A lognormal will always converge to 0 for every p you check (black line)
    - When a moment doesn’t exist (i.e. infinite), it just oscillates and never converges (orange line)
    - MS plots always start at 1
    - Potentially with fewer than 100 observations, you could start to see a convergence if one is going to happen. Of course hundreds of observations is better. Point is that it doesn’t take thousands.
  - Left - credit data (real estate losses), Right - operational loss data
    - - Left
        
        p = 1 definitely exists; p = 2 is iffy; p = 3,4 don’t exist
        
        Interpretation: either α is between 1 and 2 or there aren’t enough observations to show a convergence
        
        Although n is pretty large in this case
      - Right
        
        p = 1 is iffy, the rest don’t exist
        
        Interpretation: α might be less than 1
- Concentration Profile
  - Requirements
    - data >= 0 and mean is finite
  - Similar to the Mean Excess plot, except the gini index is computed instead of the mean
  - - In the wild you can expect mixtures, so there will likely be noisy behavior in the beginning and when the fat tail is reached, a flat line is formed