Extreme Value Theory
Misc
Packages
- CRAN Task View
- {EQRN} (Paper) - Extreme Quantile Regression Neural Networks for Conditional Risk Assessment (Uses {torch})
- {erf} - Implements the extremal random forests (ERF), an algorithm to predict extreme conditional quantiles in large dimensions.
- Able to extrapolate estimates beyond the training data since ERF is based on EVT, and it's also flexible since it uses a RF
- video: from the 33min mark to 55:19
- Q(τ) is the desired quantile you want to estimate
- Q(τ0) is an intermediate quantile (e.g. 0.80) that can be estimated using a quantile RF (package uses {grf})
- Depends on the thickness of the tail (i.e. whether the shape parameter is negative, zero, or positive)
- 0.80 tends to work reasonably well
- The higher the threshold you use, the less variance but higher bias
- ξ(x) and σ(x) indicate that the shape and scale parameters depend on the predictors. They’re estimated by minimizing a weighted negative log-likelihood, where each observation’s contribution is multiplied by a weight extracted from the quantile RF.
- tune minimum node size, penalty term on the variability of the shape parameter
- cv using deviance metric for model selection
- {evgam} - Extreme Value GAM; able to extrapolate since they’re based on EVT
- {extRemes} (Tutorial, Slides with examples, Vignette, Bootstrapping) - General functions for performing extreme value analysis
- Allows for inclusion of covariates into the parameters of the extreme-value distributions, with estimation through MLE, L-moments, generalized (penalized) MLE (GMLE), and Bayes.
- Inference methods include parametric normal approximation, profile-likelihood, Bayes, and bootstrapping.
- Some bivariate functionality and dependence checking (e.g., auto-tail dependence function plot, extremal index estimation) is also included
- {fitdistcp} - Distribution Fitting with Calibrating Priors for Commonly Used Distributions
- Using maximum likelihood gives predictions of extreme return levels that are exceeded more often than expected (i.e. it under-predicts extreme values)
- “For instance, when using the generalized extreme value distribution (GEVD) with 50 annual data values, fitted using maximum likelihood, we find that 200-year return levels are exceeded more than twice as often as expected; i.e. they are exceeded in more than 1 in 100 simulated years.”
- Bayesian prediction using right Haar priors
- {gbex} - no docs, only a paper; gradient boosting for extreme quantile regression; able to extrapolate since it’s based on EVT
- {GLmom} (Papers) - Provides generalized L-moments estimation methods for the generalized extreme value (‘GEV’) distribution. (examples work with maximum value time series data)
- GLME (Generalized L-Moment Estimation): Combines L-moments (Hosking, 1990) with penalty functions to regularize the shape parameter, providing more stable estimates especially for small samples (Shin et al., 2025a).
- NS L-moment based estimation: Pure L-moment equations for non-stationary models without penalty (Shin et al., 2025b).
- MAGEV (Model Averaging GEV): Combines MLE and L-moment estimates through weighted model averaging for robust high quantile estimation (Shin et al., 2026).
- {maxbootR} - Efficient (C++) Bootstrap Methods for Block Maxima. Includes disjoint blocks and sliding blocks
- {SpatialGEV} (JOSS, Github): Fast Bayesian inference for spatial extreme value models in R
- {TailID} - Detect Sensitive Points in the Tail. Utilizes the Generalized Pareto Distribution (GPD) for assessing tail behavior and detecting inconsistent points with the Identical Distribution hypothesis of the tail.
Papers
- Distributional regression models for Extended Generalized Pareto distributions
- As an example of modeling with an Extended Generalized Pareto Distribution (EGPD), precipitation with time and location covariates is modeled using {gamlss}. The authors wrote an extension to be able to use the EGPD distribution in order to capture extremes better than a Gamma distribution.
- Code is in the paper and also in a github repo.
- On the optimal prediction of extreme events in heavy-tailed time series with applications to solar flare forecasting
- Uses a modified fractional ARIMA model (FARIMA or ARFIMA)
- No code but seems feasible to implement using {extRemes} and code from {forecast::arfima}
- New flexible versions of extended generalized Pareto model for count data
- Github
- The Discrete Generalized Pareto distribution (DGPD) is preferred for high threshold exceedances which makes it ideal for analyzing extreme values and rare events, but it becomes less effective for low threshold exceedances.
- Peak-Over-Threshold (POT) method approximates the distribution of exceedances above a high threshold using the Generalized Pareto Distribution (GPD).
- When the threshold is sufficiently high, standard extreme value models, such as the POT method with the DGPD approximation, can be applied to model the exceedances
- Choosing an appropriate threshold \(u\) is critical for the effective application of the POT approximation.
- Setting the threshold too low can introduce bias into the estimates, as the DGPD is justified only in an asymptotic sense.
- Setting the threshold too high reduces the number of data points, increasing the estimation variance.
- In practice, selecting a suitable threshold in a continuous setting often involves using parameter stability plots and mean residual life plots which may not always clearly indicate the best threshold.
- The 3rd extension handles this problem by allowing you to set a lower threshold that’s interpretable to the user while still accurately modeling the distribution of the extreme values.
- This paper provides three extensions of the DGPD
- The entire distribution of the data, including both bulk and tail and bypassing the threshold selection step
- For capturing the distribution characteristics of integer-valued data, including their variability and distribution shape.
- The entire distribution along with Zero Inflation
- For datasets where non-negative integer values are prevalent but also have a significant number of zero observations (often referred to as “excessive zeros”)
- e.g. Environmental data or medical statistics, where zeros might occur more frequently than other values. DEGPD models are designed to account for such zero inflation while still modeling the non-zero counts accurately.
- The tail of the distribution for low threshold exceedances
- For analyzing discrete data where interest lies in exceedances above a specified threshold.
- e.g. For modeling rare and extreme events such as high precipitation levels or extreme temperature spike counts.
- Doesn’t require censoring in the likelihood which simplifies the modeling process in cases where setting an appropriate high threshold is challenging.
- Conformal Prediction for Long-Tailed Classification
{erf} and {gbex} perform better than regular quantile RF models for quantiles > 0.80 (video: from the 33min mark to 55:19, results towards the end)
- Non-ML methods like {evgam} perform poorly for high-dimensional data
Why Random Forest models that do NOT incorporate EVT usually don’t produce good results:
- Typical RF weighs every data point equally while a grf (see Regression, Quantile), depending on the quantile estimate, will weigh data points closer to the quantile more heavily
- Quantile Regression Forests work fine on moderate quantiles (e.g. 0.80) but even those like grfs struggle with more extreme quantiles because no matter how large the quantile you choose, the predicted quantile will be no larger than the most extreme data point. They use empirical methods and have no way to extrapolate.
EVT models are specifically designed to model and understand the behavior of extreme values themselves. Rather than treating extremes as nuisances to be minimized, EVT focuses on characterizing the tail behavior of distributions.
When to use EVT:
- Extreme events are scientifically interesting and relevant (not just noise)
- You need to estimate probabilities of rare, high-impact events
- Risk assessment is the goal (finance, insurance, engineering, climate science)
- You’re dealing with inherently heavy-tailed phenomena
- You have sufficient data in the tails to fit EVT models reliably
Terms
Block Maxima - The highest values recorded within specific, consecutive, non-overlapping periods (or “blocks”) of time or sequences of observations
- e.g. A multi-year time series with monthly frequency is divided into yearly blocks. The maximum value within each block is a block maximum.
- Can be considered inefficient since only 1 point is kept from each block of data points.
- Also see
- Distributions, Gumbel
- Peaks Over Threshold (POT)
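The extraction step itself is simple. A minimal sketch (the packages in these notes are R, but the idea is language-agnostic; NumPy is used here, and the simulated Gumbel monthly series is a made-up example):

```python
import numpy as np

rng = np.random.default_rng(42)
monthly = rng.gumbel(loc=10, scale=2, size=120)  # 10 years of monthly observations

# Reshape into non-overlapping yearly blocks (12 months each),
# then take the maximum within each block
block_maxima = monthly.reshape(-1, 12).max(axis=1)
```

Note that only 10 of the 120 observations survive, which is the inefficiency mentioned above.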
Fisher-Tippett-Gnedenko Theorem - States that, under certain conditions, the distribution of block maxima (if the blocks are large enough) will converge to one of three types of distributions, which are all part of the Generalized Extreme Value (GEV) distribution family. This makes modeling and predicting future extremes possible.
Peaks Over Threshold (POT) - Focuses on identifying and analyzing all observations that exceed a certain high level (the threshold).
- Values above the threshold are called exceedances
- The amount the exceedance exceeds the threshold is the excess
- Exceedance Rate is often modeled as a Poisson process
- If values above the threshold are highly dependent, a “declustering” scheme might need to be applied first (e.g., ensuring exceedances are separated by a certain minimum time period) to achieve approximate independence before applying the POT model.
- The Pickands–Balkema–de Haan theorem states that for a sufficiently high threshold, the distribution of these excesses can be well-approximated by the Generalized Pareto Distribution (GPD)
- The threshold should be high enough that using a GPD distribution doesn’t produce biased predictions, and low enough to include enough exceedances to estimate the shape and scale of the GPD
- Graphical Methods: Mean Residual Life (MRL) Plot, Parameter Stability Plots
- Others: Choose a quantile, choose a threshold that results in enough points for a valid GPD model, domain knowledge
- In contrast to block maxima, it’s data efficient (doesn’t discard potential extreme points), focuses on all extreme values, and isn’t dependent on the block size.
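A minimal POT sketch using SciPy’s `genpareto` on simulated data (the 0.90 threshold quantile is an arbitrary illustration choice, not a recommendation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = stats.genpareto.rvs(c=0.3, scale=1.0, size=5000, random_state=rng)

# Pick an intermediate threshold, here the 0.90 empirical quantile
u = np.quantile(x, 0.90)
excesses = x[x > u] - u  # the "excess" is the exceedance minus the threshold

# Fit a GPD to the excesses; loc is fixed at 0 since excesses start at 0
shape, loc, scale = stats.genpareto.fit(excesses, floc=0)
```

Because excesses of a GPD over any threshold are again GPD with the same shape (threshold stability), the fitted shape should land near the true 0.3.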
Survival Function - For a random variable \(X\), the survival function (also called the complementary CDF or reliability function) is defined as:
\[ S(x) = P(X \gt x) = 1 - F(x) \]where \(F(x)\) is the Cumulative Distribution Function (CDF) which is \(P(X\le x)\)
- \(S(x)\) is sometimes written as \(\bar F(x)\)
- The survival function is the tail probability. The rate at which \(S(x) \rightarrow 0\) as \(x \rightarrow \infty\) determines which extreme value domain a distribution belongs to.
- Example: Exponential Survival Function: \(S_{\text{exp}}(x) = e^{-\lambda x}\)
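As a quick numerical check of the definition (a NumPy sketch; the rate and sample size are arbitrary):

```python
import numpy as np

def empirical_survival(data, x):
    """S(x) = P(X > x), estimated as the fraction of observations above x."""
    return np.mean(np.asarray(data) > x)

rng = np.random.default_rng(0)
lam = 2.0
sample = rng.exponential(scale=1 / lam, size=100_000)

# Empirical survival of an Exponential(lam) sample vs the closed form e^(-lam * x)
s_hat = empirical_survival(sample, 1.0)
s_true = np.exp(-lam * 1.0)
```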
Distribution Tail Classification
Misc
- Notes from quantitative risk management lectures QRM 4-3, 4-4, https://www.youtube.com/watch?v=O0fdBwBRGU4
- Skewness and Kurtosis, like all higher moments, have high variances. ({fitdistrplus} vignette)
- Also see Distributions >> Fitting Distributions >> {fitdistrplus} >> Skewness and Kurtosis for an example of bootstrapped skewness and kurtosis values to get an idea of the variance.
- Difference between tail events and outliers:
- Outliers tend to be extreme values that occur very infrequently. Typically they are less than 1% of the data.
- Tail events are less extreme values compared to outliers but occur with greater frequency.
- Tail events can be difficult to predict because
- Although not as rare as outliers, it’s still difficult to get enough data to model these events with sufficient precision.
- Difficult to obtain leading indicators which are correlated with the likelihood of a tail event occurring
- Prediction tips
- Consider binning numerics to help the model learn sparse patterns.
- Use realtime features
- Example: Predicting delivery time tail events
- unexpected rainstorm (weather data)
- road construction (traffic data)
- Utilize a quadratic or L2 loss function.
- Mean Squared Error (MSE) is perhaps the most commonly used example. Because the loss function is calculated based on the squared errors, it is more sensitive to the larger deviations associated with tail events
Heavy tails
- Your random variable distribution is heavy tailed if:
\[ \begin{align} &\limsup_{x \rightarrow + \infty} \frac{\bar F(x)}{\bar F_{\text{exp}}(x)} = \infty, \;\; \forall \lambda \gt 0 \\ &\text{where} \;\; \bar F_{\text{exp}}(x) = P(X \gt x) = e^{-\lambda x} \end{align} \]- \(\bar F_{\text{exp}}\) is the Exponential distribution Survival Function
- Says if you take the ratio of your most extreme positive values (i.e. your survival function) at the tail (i.e. supremum)(numerator) and those of the positive tail of exponential survival function (denominator), then that ratio will go to positive infinity as \(x\) goes to infinity
- Or in other words, the probability mass in the tail of the pdf of your random variable is greater than the probability mass in the tail of the exponential pdf.
- Also means that the moment generating function is infinite for every positive argument, which means it cannot be used to calculate distribution parameters
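The divergence of that ratio is easy to see numerically for a Pareto tail against any exponential rate (sketch; the index and rate values are arbitrary):

```python
import numpy as np

alpha, lam = 1.5, 0.1            # Pareto tail index; a deliberately small exponential rate
x = np.array([10.0, 100.0, 1000.0])

pareto_sf = x ** -alpha          # survival function of a Pareto(x_m = 1, alpha)
exp_sf = np.exp(-lam * x)        # survival function of an Exponential(lam)

ratio = pareto_sf / exp_sf       # blows up as x grows, for any lam > 0
```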
Long Tails
- Subset of Heavy Tails, so retains the Survival Function Ratio characteristic
- Common in finance
- Your random variable distribution is long tailed if it follows the Explosion Principle which says, “If an extreme event manifests itself, then the probability of an even more extreme event approaches 1.”
\[ \lim_{x \rightarrow \infty} P(X \gt x + t|X\gt x) = 1, \;\; \forall t > 0 \]- There is no time prediction on the next more extreme event, but extreme value theory + timeseries + conditions say extreme events tend to cluster
- Example: if you take a huge loss in your portfolio, it’s a mistake to think that that value is an upper bound on losses or that the probability of an even larger loss is negligible
- This characteristic is not practical to determine from data
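For a Pareto the conditional probability in the Explosion Principle has a closed form, so it can be checked directly (sketch; contrast with an Exponential, where the same probability is the constant \(e^{-\lambda t}\) at every \(x\)):

```python
# For X ~ Pareto(x_m = 1, alpha): P(X > x + t | X > x) = ((x + t) / x) ** -alpha,
# which tends to 1 as x grows -- the explosion principle
def cond_exceed(x, t=5.0, alpha=2.0):
    return ((x + t) / x) ** -alpha

probs = [cond_exceed(x) for x in (10, 100, 10_000)]
```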
Subexponential Tails
- Subset of Long Tails, so retains the Survival Function Ratio and the Explosion Principle characteristics
- Your random variable distribution is subexponential tailed if it follows the One-Shot aka Catastrophe Principle aka Winner-Takes-All
\[ P(S_n \gt x) \approx P(M_n \gt x) \; \text{as} \; x \rightarrow \infty \]- \(S_n\) is a partial sum of values of your random variable
- \(M_n\) is a partial maximum
- \(x\) is a large value
- Says at some point the partial sum, \(S_n\), will be dominated by one large value, \(M_n\)
- Example: If your portfolio follows this principle, then your total loss can be mostly attributed to one large loss
- Tools are available to practically test for this characteristic
- Distribution Examples:
- Log-Normal
- Normal distribution parameters can be calculated from Log-Normal parameters by logging, or vice versa through exponentiation
- All statistical moments always exist
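A simulation sketch of the Catastrophe Principle for a Pareto(α = 1.5) sample (the threshold of 100 and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials, alpha = 10, 200_000, 1.5

# Pareto(x_m = 1, alpha) via inverse transform: U ** (-1 / alpha)
samples = rng.random((trials, n)) ** (-1 / alpha)

sums = samples.sum(axis=1)     # S_n for each trial
maxima = samples.max(axis=1)   # M_n for each trial

x = 100.0
p_sum = np.mean(sums > x)      # P(S_n > x)
p_max = np.mean(maxima > x)    # P(M_n > x)
# For subexponential tails the two probabilities are of the same order:
# most of the mass of a large sum comes from its single largest term
```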
Fat Tails
- Subset of Subexponential Tails, so retains the Survival Function Ratio, Explosion Principle, and Catastrophe Principle characteristics
- Fat-Tailed distributions describe quantities whose aggregate statistics are driven by rare events. For instance, the top 1% accounts for about 30% of the wealth in the US.
- The central problem with these types of distributions is insufficient data. In other words, we need a large volume of data (more than is usually available) to estimate its true statistical properties accurately.
- Masquerade Problem - Fat-Tailed distributions can appear thin-tailed, but a thin-tailed distribution can never appear fat-tailed.
- Fat-Tailed quantities demonstrate significant regularity (e.g., most viruses are tame, stocks typically move between -1% and 1%)
- Mistaking fat tails for thin ones is dangerous because a single wrong prediction can erase a long history of correct ones. So, if you’re unsure, it’s better to err on the side of fat tails.
- Survival Function (in general)
\[ \bar F = x^{-\alpha}L(x) \]- \(L(x)\) is characterized as a slowly varying function that gets dominated by the decaying inverse power law element, \(x^{-\alpha}\) , as \(x\) goes to infinity
- \(\alpha\) is a shape parameter, aka Tail Index aka Pareto Index
- Distribution Examples
- Pareto
- Also see Distributions >> Pareto
- The Pareto has a similar relationship with the Exponential distribution as the Log-Normal does with the Normal
\[ \begin{align} &Y_{\text{exp}} = \log \left(\frac{X_{\text{pareto}}}{x_m}\right)\\ &X_{\text{pareto}} = x_m e^{Y_{\text{exp}}} \end{align} \]- \(x_m\) is the (positive) minimum of the Pareto-distributed variable, \(X_{\text{pareto}}\), which has index \(\alpha\)
- \(Y_{\text{exp}}\) is exponentially distributed with rate \(\alpha\)
- Some theoretical statistical moments may not exist
- If the theoretical moments do not exist, then calculating the sample moments is useless
- Example: Pareto (\(\alpha = 1.5\)) has a finite mean and an infinite variance
- Need \(\alpha \gt 2\) for a finite variance
- Need \(\alpha \gt 1\) for a finite mean
- In general, you need \(\alpha \gt p\) for the \(p^{\text{th}}\) moment to exist
- If the \(n^{\text{th}}\) moment is not finite, then the \((n+1)^{\text{th}}\) moment is not finite.
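The log relationship above can be verified by simulation (sketch; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, x_m = 2.5, 1.0

# Pareto(x_m, alpha) via inverse transform
x_pareto = x_m * rng.random(200_000) ** (-1 / alpha)

# Logging a Pareto gives an Exponential with rate alpha,
# so the sample mean of y should be close to 1 / alpha
y = np.log(x_pareto / x_m)
```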
Light Tails
- Opposite of Heavy Tails
- Instead of the probability mass in the tail of your random variable pdf being larger than the probability mass in the tail of the Exponential pdf, it’s equal to or smaller than.
- i.e. your survival function decays as fast as or faster than an exponential as \(x\) goes to infinity
- Distribution Examples
- Exponential
- The Exponential distribution is considered the benchmark for whether a distribution has heavy tails or not.
- Normal
- Some distributions can be heavy or light depending on the parameter values
- e.g. Weibull
Tail Classification Plots
Misc
- Packages
- {tailplots} - Estimators and Plots for Gamma and Pareto Tail Detection
- Includes a g function that distinguishes between log-convex and log-concave tail behavior.
- Also includes methods for visualizing these estimators and their associated confidence intervals across various threshold values.
- Workflow
- All the plots below should be used and considered when diagnosing tails
- Can use the Zipf and ME plots to find the thresholds in the data where it would be useful to start modeling the data as Pareto or Log-Normal
- Ask these questions
- Does the subject matter you’re modeling lead you to expect a certain type of tail?
- Example: Does the explosion principle hold or not?
- Is there an upper bound to your data (theoretical or actual)?
- Example: Is the upper bound due to the quality of the data
- Do I have over 10,000 observations?
- In the various plots below, it can be difficult to distinguish between Pareto (fat tail) and Lognormal (long tail) distributions. As a rule-of-thumb, it usually takes 10K observations to really be able to tell the two apart, since that’s what’s needed to get enough data points in the tail.
- Usually get at least 10K observations in a market risk portfolio, but not in credit risk or operational risk portfolios
Q-Q
- Plot exponential quantiles on the y-axis and the ordered data on the x-axis
- See EDA, General >> Continuous Variables >> Q-Q plot for code
- If data hugs the diagonal line \(\rightarrow\) Exponential \(\rightarrow\) Light Tails
- If data is concave \(\rightarrow\) potentially Heavy Tails
- If data is convex \(\rightarrow\) potentially tails that are lighter than an Exponential
Zipf
- A log-log plot of the empirical survival function of the data
- Taking the log of the Pareto survival function makes it linear, where the slope of the line is \(-\alpha\)
- Interpretation
- Indicates if there’s a power law decay in the tails of the data (i.e. fat tails)
- The result of this plot is “necessary,” but not “sufficient” for confirmation of fat tails (pareto)
- It is sufficient to say it’s not a pareto if there’s curvature
- Example:
- The real data shows linearity at the very end, so even though it’s not linear from the beginning, it is still potentially fat tailed
- Real data often show mixed, complex behaviors.
- Also note that even in the simulated dataset, the data points at the end have some randomness to them and don’t fall directly on the line.
- The randomness is called small sample bias, i.e. there’s usually not much data in the tails.
- Example: Log-Normal
- A Log-Normal variable can look like a Pareto if its sigma parameter is large (small data). It will look linear and curve down at the very end.
- Example above shows Log-Normal with sd = 1, so the sd doesn’t have to be very large to be tricky to discern from a Pareto.
- If the data has a smallish range (x-axis), then that is a signal to be wary about deeming the distribution as having fat tails
- This one goes from 0 to 100 while the one above it goes from 0 to a million
- “Large” or “small” depends on the type of data you’re looking at though. In another subject matter, maybe 100 is considered large, so context matters
- Example: Aggregate and Compare
- Compare slopes between your original data (red) and aggregations of your data in a zipf plot. If you have fat tails, the line will be shifted because of aggregation but the slope, \(\alpha\), will remain the same
- Examples of aggregation methods (halves the sample size)
- Order data from largest to smallest; add \(a_1 + a_n, a_2 + a_{n-1}, \ldots\); plot alongside original data (green)
- Order data from largest to smallest; add \(a_1 + a_2, a_3 + a_4, \ldots\); plot alongside original data (blue)
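A Zipf-plot sketch in NumPy: build the empirical survival function, then plot (or here, line-fit) log S against log x and read off \(-\alpha\) from the slope. Dropping the largest 50 points is an arbitrary choice to dodge the small-sample bias mentioned above:

```python
import numpy as np

rng = np.random.default_rng(11)
alpha = 1.5
x = np.sort(rng.random(50_000) ** (-1 / alpha))  # Pareto(1, alpha) sample, ascending

# Empirical survival: for the k-th smallest value, S_hat = (n - k) / n
n = len(x)
sf = 1.0 - np.arange(1, n + 1) / n

# Fit a line to log(S) vs log(x), dropping the largest points (small-sample bias)
keep = slice(0, n - 50)
slope, intercept = np.polyfit(np.log(x[keep]), np.log(sf[keep]), 1)
# For a Pareto the slope estimate should be close to -alpha
```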
Mean Excess (ME)
- Calculating the Empirical Mean Excess variable:
- Order the data
- Calculate mean (1)
- Remove the 1 data point
- Calculate mean (2)
- Remove data points 1 and 2
- Calculate mean (3)
- Continue until you run out of data (I assume)
- Then plot the means
- A Log-Normal variable looks similar to Pareto in this plot as well. The more data you have the easier it will be to distinguish the two.
- (In top figure) The left equation is for the Log-Normal curve (with Normal parameters) and the right equation is the Pareto
- Need \(\alpha \gt 1\), so that the mean is finite
- Example:
- Disregard last few points (small sample bias)
- Points in green circles (only a few points in tails, so difficult to be confident about)
- Left: Straight line
- Right: Concave down
- Right plot: Curvature at the beginning common in the wild, since you’re not likely dealing with pure distributions but some kind of noisy mixture
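The empirical mean-excess construction above can be sketched as (NumPy assumed; the Exponential sample is an illustration):

```python
import numpy as np

def mean_excess(data):
    """For each threshold u (taken at the sorted data points), the average of
    (x - u) over all observations x above u."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    u = x[:-1]  # thresholds: every point except the largest
    # Suffix sums give the total of the points above each threshold
    suffix = np.cumsum(x[::-1])[::-1]
    tail_means = suffix[1:] / np.arange(n - 1, 0, -1)
    return u, tail_means - u

# For an Exponential, the mean excess is flat at the scale parameter (memorylessness)
rng = np.random.default_rng(5)
u, me = mean_excess(rng.exponential(scale=2.0, size=50_000))
```

A roughly constant curve points to an exponential-type tail; an upward-sloping line points to a Pareto-type (fat) tail.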
Maximum to Sum (MS)
Maximum to Sum Ratio
\[ \begin{align} &S_n(p) = \sum_{i=1}^n |X_i|^p \\ &M_n(p) = \max\{|X_1|^p, \ldots, |X_n|^p\}\\ &R_n(p) = \frac{M_n(p)}{S_n(p)}; \;\; n\ge1,\; p\gt 0 \end{align} \]
- \(S\) is the partial sum, \(M\) is the partial maximum, and \(p\) is the order of the moment that you want to see if it exists or not
Procedure
- Choose a \(p\) that you want to check
- For each \(n\), calculate the sum, maximum, and ratio
- Plot where the y-axis is the ratio, and the x-axis is the \(n\) value
MS plots always start at 1
For the Log-Normal, all moments always exist
For the Pareto, you usually only need to check up to \(p = 4\) or \(p = 5\)
- For higher levels of \(p\) (and hence \(\alpha\)) the Pareto distribution begins to act like a Normal
- Usually in credit, market, or operational risk portfolios you’re dealing with Pareto \(0 \le \alpha \le 3\)
Interpretation
- A lognormal will always converge to 0 for every p you check (black line)
- When a moment doesn’t exist (i.e. infinite), it just oscillates and never converges (orange line)
- Potentially with fewer than 100 observations, you could start to see a convergence if one is going to happen. Of course, hundreds of observations are better. The point is that it doesn’t take thousands.
- \(p = 1\) definitely exists; \(p = 2\) is iffy; \(p = 3,4\) don’t exist
- Interpretation:
- Either \(\alpha\) is between 1 and 2 or there aren’t enough observations to show a convergence
- Although \(n\) is pretty large in this case
- \(p = 1\) is iffy, the rest don’t exist
- Interpretation: \(\alpha\) might be less than 1
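The \(R_n(p)\) sequence is cheap to compute with cumulative maxima and sums (sketch; the lognormal parameters are arbitrary):

```python
import numpy as np

def ms_ratio(x, p):
    """R_n(p) = M_n(p) / S_n(p), computed cumulatively over the sample."""
    xp = np.abs(np.asarray(x, dtype=float)) ** p
    return np.maximum.accumulate(xp) / np.cumsum(xp)

rng = np.random.default_rng(9)
lognorm = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)

# All moments of a lognormal exist, so R_n(p) drifts toward 0 for every p
r2 = ms_ratio(lognorm, p=2)
```

For a Pareto with \(\alpha \lt p\), the same sequence oscillates and never settles near 0.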
Concentration Profile
- Requirements
- Data \(\ge 0\) and mean is finite
- Similar to the Mean Excess plot, except the gini index is computed instead of the mean
- In the wild you can expect mixtures, so there will likely be noisy behavior in the beginning, and when the fat tail is reached, a flat line is formed.
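A concentration-profile sketch: compute the Gini index of the exceedances above a grid of threshold quantiles (NumPy assumed; the grid range and size are arbitrary choices):

```python
import numpy as np

def gini(x):
    """Gini index of a non-negative sample (requires a finite mean)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    # Standard order-statistics formula for the sample Gini index
    return 2 * np.sum(i * x) / (n * x.sum()) - (n + 1) / n

def concentration_profile(data, n_thresholds=50):
    """Gini of the exceedances above each of a grid of threshold quantiles."""
    x = np.asarray(data, dtype=float)
    qs = np.linspace(0.0, 0.98, n_thresholds)
    return qs, np.array([gini(x[x > np.quantile(x, q)]) for q in qs])
```

For a Pareto with index \(\alpha \gt 1\), the exceedances above any threshold are again Pareto with the same \(\alpha\), so the profile flattens at the constant \(1/(2\alpha - 1)\); that flat stretch is the signature described above.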