Generalized Additive Models

Misc

Packages

{mgcv} - Mixed GAM Computation Vehicle with Automatic Smoothness Estimation
- See R >> Documents >> Regression >> GAMs >> Generalized Additive Models: An Introduction with R, Second Edition
- bam: Uses numerical methods are designed for datasets containing upwards of several tens of thousands of data
  - Has a much lower memory footprint than gam
  - Can compute on a cluster set up by {parallel}
  - If discrete=TRUE
    - Uses a method based on discretization of covariate values and C code level parallelization (controlled by the nthreads argument instead of the cluster argument) is used
    - Number of response data can not exceed .Machine$integer.max
{mgcvUtils} - Various mgcv-related GAM utilities
{gamlss} - GAM modeling where all the parameters of the assumed distribution for the response can be modelled as functions of the explanatory variables
{gamboostLSS} - Boosting models for fitting generalized additive models for location, shape and scale (‘GAMLSS’) to potentially high dimensional data.
{bamlss} - Bayesian Additive Models for Location, Scale, and Shape (and Beyond) (GAMLSS)
{GeDS} - Geometrically Designed Spline Regression
- Alternative to a traditional GAM which estimates the smoothing parameter while keeping the number of knots and locations fixed
- Efficiently estimates the number of knots and their positions, as well as the spline order
- Models: GAMs, Component-wise Gradient Boosting, Functional Gradient Boosting (FGB)
- Distributions: Any distribution from the Exponential family

Also see
- Feature Engineering, Splines
- Mixed Effects, GLMM >> Examples >> Example 1, 2
Resources
- Video “Introduction to Generalized Additive Models with R and mgcv” Gavin Simpson
- Video “The Wonderful World of mgcv” Noam Ross
- Hierarchical generalized additive models in ecology: an introduction with mgcv (2019)
- Yes! You can do that in mgcv
Papers
- Bayesian views of generalized additive modelling
  - Bayesian GAMs explainer
Large gaps in the values of the predictor variable can be a problem if you are trying to interpolate between those gaps. (See bkmks, method = "reml" + s(x, m = 1))
Thread discussing an example using basis type, bs = “sz”, which is meant for separating a baseline from other effects
Partial Effect Plots - Show the component contributions, on the link scale, of each model term to the linear predictor.
- {gratia::draw}
- Sound similar to Partial Dependence Plots/Profiles except instead of the average response value on the Y-axis, it’s the effect size.
- The Y-axis on these plots is typically centred around 0 due to most smooths having a sum-to-zero identifiability constraint applied to them
- Show link-scale predictions of the response for each smooth, conditional upon all other terms in the model, including any parametric effects (i.e. fixed effects) and the intercept, having zero contribution.
- These plots show adjusted predictions, just where the adjustment includes setting the contribution of all other model terms to the predicted value to zero

Description

Misc
- Notes from Bayesian Views of Generalized Additive Modelling (See Papers)
- Effective Degrees of Freedom (EDF): The degrees of freedom actually used by the model, once the penalty is taken into account
  - Usually defined as the sum of the diagonal elements of the hat matrix
Model
\[ g(\mu_i) = \boldsymbol{\alpha_i^T \theta} + s_1(x_{1i}) + s_2(x_{2i}) + s_3(x_{3i}, x_{4i}) \]
- $\mu_i = \mathbb{E}(Y_i)$
- $Y_i \sim EF(\mu_i, \phi)$
  - $Y_i(i = 1, \ldots, n)$ is the response
  - $EF(\mu_i, \phi)$ indicates an exponential family distribution with mean $\mu_i$ and scale parameter $\phi$.
- $\boldsymbol \alpha_i^T$ is a vector of slopes and intercept covariates, where $\theta$ are their associated coefficients.
- $s_j$ are smooth functions of one or more covariates $x_{1i}$, $x_{2i}$, $x_{3i}$, $x_{4i}$, ….
Splines (aka Smooths)
\[ s(x) = \sum_{k=1}^K \beta_k b_k (x) \]
- Concept: A complicated function can be formed by summing smaller, less complicated basis functions.
- $\beta_k$ are coefficients to be estimated
- $b_k$ are fixed basis functions (with maximum complexity or basis dimension $K$)
  - To avoid overfitting (too large of a $K$), this term gets penalized according to its wiggliness.
Penalty
\[ \sum_{m=1}^M \boldsymbol{\lambda_m \beta^T S_m \beta} \]
- $\lambda_m$ are estimated smoothing parameters that control the influence of the penalty
- $\beta$ is a vector of coefficients
- $S_m$ is a matrix of the fixed parts of the penalty
  - These are integrated (sometimes summed) squared derivatives (“changes in”) $b_k s$
- Note that multiple $\lambda$ can correspond to a single smooth or multiple smooths may share a single $\lambda$, so $M$ is not necessarily the number of unique $\lambda$ in the model.
- Example
  - A thin-plate regression spline was fitted to the data with differing smoothing parameters ($\lambda$)
  - The blue line is the function used to generate the points (with noise added)
  - The black line is the fit with differing $\lambda$ values
  - Estimated $\lambda$ has an EDF of 8.3
  - $\lambda = 0$ (i.e. no penalty) has a maximum EDF, EDF = 49
  - $\lambda = \infty$ (numerically) leading to a linear fit and an EDF of 1.

Diagnostics

{gratia::appraise}
- QQ plot of deviance residuals,
- Scatterplot of deviance residuals against the linear predictor,
- Histogram of deviance residuals, and
- Scatterplot of observed vs fitted values.
{gratia::draw(mod, residuals = TRUE)} - Adds partial residuals to partial effects plots
- Can help diagnose overfitting in your spline terms
“Deviance explained” is the R² value for GAMs

mgcv::gam.check(gam_fit)

## Method: GCV  Optimizer: magic
## Smoothing parameter selection converged after 19 iterations.
## The RMS GCV score gradient at convergence was 5.938335e-08 .
## The Hessian was positive definite.
## Model rank =  21 / 22 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##                                           k'  edf  k-index p-value   
## s(id)                                   1.00  0.35    0.82  <2e-16 ***
## s(log_profit_rug_business_b)            9.00  8.52    1.01    0.69   
## s(log_profit_rug_business_b):treatment 10.00  1.50    1.01    0.62   
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Check if the size of the basis expansion (k) for each smooth is sufficiently large
- k.check can also do this
- If all your smoothing predictors are not sufficiently large, then this indicates that using a GAM is a bad fit for your data.
- See SO post from Simpson

Formal test for the necessity of a smooth
```
m <- 
  gam(y ~ x + s(x, m = c(2, 0), bs = "tp"),
      data = foo,
      method = "REML",
      family = binomial())
```
- See EDA, General >> Continuous Predictor vs Outcome >> Continuous and Categorical for examples
- bs = "tp" is just the default thin plate basis function
- Fit the predictor of interest as a linear term (x) plus a smooth function of x
- Modify the basis for the smooth so that it no longer includes linear functions in the span of the basis with m = c(2, 0)
  - m: Controls the penalty on the wiggliness of spline
  - 2: An order-2 penalty (the default and most common) penalizes the second derivative of the function, which relates to its curvature.
    - Higher values would penalize higher-order derivatives, resulting in even smoother functions.
  - 0: Specifies that no null space basis is required.
    - The null space is the span of functions that aren’t affected by the (main) penalty, because they have 0 second derivative. (i.e. terms that doen’t have curvature)
    - For an order-2 penalty, the null space typically includes constant and linear terms.
- summary will give a test for the necessity of the wiggliness provided by the smooth over the linear effect estimated by the linear term. Check the p-value of the smooth term. If it’s significant, then a spline should be used. In your model, you wouldn’t use the zeroed out null space specification though.
- From Simpson SO post
- Also see Wood’s “Generalized Additive Models: An Introduction with R”, 2nd Ed, section 6.12.3, “Testing a parametric term against a smooth alternative” p 312-313 (R >> Documents >> Regression >> gam)