Regularized

Misc

  • Regularized Logistic Regression is most necessary when the number of candidate predictors is large relative to the effective sample size, 3np(1−p), where n is the sample size and p is the proportion of Y=1 (Harrell)
  • If using a sparse matrix, then you don’t need to normalize predictors
  • Preprocessing
    • Standardize numerics
    • Dummy or factor categoricals
    • Remove NAs (e.g. na.omit)
  • Packages
    • {glmnet} - handles families: Gaussian, binomial, Poisson, probit, quasi-poisson, and negative binomial GLMs, along with a few other special cases: the Cox model, multinomial regression, and multi-response Gaussian.
    • {robustHD}: Robust methods for high-dimensional data, in particular linear model selection techniques based on least angle regression and sparse regression
    • In {{sklearn}} (see Model building, sklearn >> Algorithms >> Stochastic Gradient Descent (SGD)), the hyperparameter names differ from those in R
      • lambda (R) is alpha (py)
      • alpha (R) is l1_ratio (py): alpha = 1 in R and l1_ratio = 1 in sklearn both give the pure L1 (LASSO) penalty
    • {SLOPE} - Lasso regression that handles correlated predictors by clustering them
    • {{Multi-Layer-Kernel-Machine}} - Multi-Layer Kernel Machine (MLKM) is a Python package for multi-scale nonparametric regression and confidence bands. The method integrates random feature projections with a multi-layer structure
    • {BoomSpikeSlab} - MCMC for Spike and Slab Regression
      • Spike and slab regression is Bayesian regression with prior distributions containing a point mass at zero. The posterior updates the amount of mass on this point, leading to a posterior distribution that is actually sparse, in the sense that if you sample from it many coefficients are actually zeros. Sampling from this posterior distribution is an elegant way to handle Bayesian variable selection and model averaging.
      • {ScaleSpikeSlab} - Scalable Spike-and-Slab
        • A scalable Gibbs sampling implementation for high dimensional Bayesian regression with the continuous spike-and-slab prior.
      • For variable selection, the BSS prior seems to work best with Bayesian Model Averaging (BMA) (paper)
  • Predictive Performance Comparison between Logistic LASSO and Ridge
    • In general, Ridge outperforms LASSO unless the data are noisy
    • Small n and \(\le\) 10 Events per Variable (EPV) \(\rightarrow\) Bad Performance
    • Large \(n\) and 10 EPV \(\rightarrow\) Reasonable Performance
    • Large \(n\) and \(\gt\) 30 EPV \(\rightarrow\) Penalization effects are small
    • Between 10 and 30 EPV \(\rightarrow\)
      • Binary prediction models perform worse than continuous prediction models
        • Think this refers to the dreaded dichotomization of continuous response variables.
      • Performance depends on the size of \(n\)
    • A completely balanced multinomial outcome variable performs worse than a slightly unbalanced one.
      • Good performance for balanced multinomial variables requires large EPVs
    • At \(\gt\) 50 EPV, performance doesn’t improve much
  • Variable Selection
    • For Inference, only Adaptive LASSO is capable of handling block and time series dependence structures in data
      • See A Critical Review of LASSO and Its Derivatives for Variable Selection Under Dependence Among Covariates
        • “We found that one version of the adaptive LASSO of Zou (2006) (AdapL.1se) and the distance correlation algorithm of Febrero-Bande et al. (2019) (DC.VS) are the only ones quite competent in all these scenarios, regarding to different types of dependence.”
        • There’s a deeper description of the model in the supplemental materials of the paper. I think the “.1se” means it’s using the lambda.1se value from cross-validation (e.g. cv.glmnet).
      • Regarding the distance correlation algorithm (a feature selection algorithm used in this paper as a benchmark against the LASSO variants)
        • “the distance correlation algorithm for variable selection (DC.VS) of Febrero-Bande et al. (2019). This makes use of the correlation distance (Székely et al., 2007; Szekely & Rizzo, 2017) to implement an iterative procedure (forward) deciding in each step which covariate enters the regression model.”
        • Starting from the null model, the distance correlation function, dcor.xy, in {fda.usc} is used to choose the next covariate
          • Presumably the covariate with the largest distance correlation is chosen at each step; I’m not sure what the stopping criterion is
        • The algorithm is discussed in the paper, Variable selection in Functional Additive Regression Models
      • Harrell is skeptical. “I’d be surprised if the probability that adaptive lasso selects the “right” variables is more than 0.1 for N < 500,000.”
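A minimal sketch of the two sample-size quantities used in the notes above: the effective sample size 3np(1−p) and events per variable (EPV). The function names are hypothetical, and counting the smaller outcome class as the "events" is an assumption (conventions vary between papers).

```python
def effective_sample_size(n, p):
    """Effective sample size 3*n*p*(1-p) for a binary outcome with P(Y=1) = p."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a proportion between 0 and 1")
    return 3 * n * p * (1 - p)


def events_per_variable(n_events, n_nonevents, n_candidate_predictors):
    """EPV, assuming the less frequent outcome class counts as the 'events'."""
    return min(n_events, n_nonevents) / n_candidate_predictors


# 1000 observations, 10% of them with Y = 1, and 20 candidate predictors
n, p = 1000, 0.1
ess = effective_sample_size(n, p)        # about 270, far below n = 1000
epv = events_per_variable(100, 900, 20)  # 100 events / 20 predictors = 5.0

print(f"effective sample size: {ess:.1f}")
print(f"events per variable:   {epv:.1f}")
```

With an EPV of 5, this hypothetical data set falls in the "Small n and ≤ 10 EPV" range listed above, where penalization matters most.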

Concepts

  • Shrinking effect estimates turns out to always be best
    • OLS is the Best Linear Unbiased Estimator (BLUE), but being unbiased means the variance of the estimated effects is large from sample to sample and therefore outcome variable predictions using OLS don’t generalize well.
    • If you predicted y using the sample mean times some coefficient, it’s always(?) the case that you’ll have a better generalization error with a coefficient less than 1 (shrinkage).
  • Regularized Regression vs OLS
    • As N ↑, standard errors ↓
      • regularized regression and OLS regression produce similar predictions and coefficient estimates.
    • As the number of covariates ↑ (relative to the sample size), variance of estimates ↑
      • regularized regression and OLS regression produce much different predictions and coefficient estimates
      • Therefore OLS predictions are usually fine in a low dimension world (not usually the case)
  • Model Equation
    \[ \underset{\hat\beta}{\text{argmin}}\; \mathcal{L}(\lambda, \alpha) = \frac{1}{2n} \sum_{i=1}^n (y_i - \hat y_i)^2 + \lambda \left(\frac{1}{2}(1-\alpha)\;||\hat\beta||_2^2\; + \alpha \; ||\hat \beta||_1\right) \]
    • \(\lambda\) : The penalization factor
    • \(\alpha = 1\) : LASSO
    • \(\alpha = 0\) : Ridge
    • \(0 \lt \alpha \lt 1\) : Elastic Net
    • \(||\hat \beta||_2^2\) : The sum of squared coefficients. The L2 norm has been squared, so the square root isn’t taken.
      • When \(\alpha = 0\), only the (squared) L2 penalty is applied.
    • \(||\hat \beta||_1\): The sum of the absolute values of the coefficients, i.e. the L1 norm.
      • When \(\alpha = 1\), only the L1 penalty is applied.
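As a rough illustration of the model equation above, the sketch below evaluates the penalized loss for fixed predictions and coefficients. The function name and the toy numbers are made up for illustration; the intercept is assumed to be excluded from `beta`.

```python
def elastic_net_objective(y, y_hat, beta, lam, alpha):
    """Loss from the model equation: squared error / (2n) plus the mixed penalty."""
    n = len(y)
    squared_error = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / (2 * n)
    l2_squared = sum(b ** 2 for b in beta)  # squared L2 norm (no square root)
    l1 = sum(abs(b) for b in beta)          # L1 norm
    penalty = lam * (0.5 * (1 - alpha) * l2_squared + alpha * l1)
    return squared_error + penalty


# Toy values (hypothetical): three observations, two coefficients
y = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.0, 2.0]
beta = [0.5, -1.5]
lam = 0.1

lasso_loss = elastic_net_objective(y, y_hat, beta, lam, alpha=1.0)  # L1 only
ridge_loss = elastic_net_objective(y, y_hat, beta, lam, alpha=0.0)  # L2 only
enet_loss = elastic_net_objective(y, y_hat, beta, lam, alpha=0.5)   # mixture

print(lasso_loss, ridge_loss, enet_loss)
```

Setting `lam=0` recovers the unpenalized squared-error term, which is why the penalized and OLS fits coincide as the penalty vanishes.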

Ridge

  • The regularization reduces the influence of correlated variables on the model because the weight is shared between the correlated predictors, so neither alone has a large weight. This is unlike Lasso, which just drops one of the variables (which one gets dropped isn’t consistent).
  • Linear transformations of the design matrix (e.g. rescaling a predictor) will affect the predictions made by ridge regression.
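Both ridge properties can be checked on a toy problem. The sketch below solves the two-predictor ridge system in closed form, assuming no intercept, centered data, and the same \(\frac{1}{2n}\) scaling as the model equation (which gives \((X^TX + n\lambda I)\beta = X^Ty\)). The helper name and the data are hypothetical.

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X'X + n*lam*I) b = X'y for two predictors via Cramer's rule."""
    n = len(y)
    a = sum(v * v for v in x1) + n * lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + n * lam
    r1 = sum(u * v for u, v in zip(x1, y))
    r2 = sum(u * v for u, v in zip(x2, y))
    det = a * d - b * b
    return (d * r1 - b * r2) / det, (a * r2 - b * r1) / det


x1 = [-1.5, -0.5, 0.5, 1.5]   # centered predictor
y = [2.0 * v for v in x1]     # outcome is exactly 2 * x1
lam = 0.1

# 1) Two identical (perfectly correlated) predictors: the weight is shared.
b1, b2 = ridge_two_predictors(x1, list(x1), y, lam)
pred_original = b1 * 1.0 + b2 * 1.0      # prediction at x1 = x2 = 1

# 2) Same information, but the second predictor is expressed in units
#    that are 10 times larger: the ridge prediction changes.
x2_rescaled = [10.0 * v for v in x1]
c1, c2 = ridge_two_predictors(x1, x2_rescaled, y, lam)
pred_rescaled = c1 * 1.0 + c2 * 10.0     # same point in the new units

print(f"shared weights: {b1:.4f}, {b2:.4f}")   # both about 0.9615
print(f"prediction before rescaling: {pred_original:.4f}")  # about 1.9231
print(f"prediction after rescaling:  {pred_rescaled:.4f}")  # about 1.9984
```

The unpenalized fit would predict 2 at this point in either set of units, so the difference between the two ridge predictions comes entirely from the penalty acting on differently scaled coefficients. This is the usual motivation for standardizing predictors first.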

Lasso

  • When lasso drops a variable, it doesn’t mean that the variable wasn’t important.
    • The variable, \(x_1\), could have been correlated with another variable, \(x_2\), and lasso happens to drop \(x_1\) because, in this sample, \(x_2\) predicted the outcome just a tad better.
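A small sketch of this behavior, using a plain coordinate descent solver for the \(\alpha = 1\) case of the model equation (no intercept, centered toy data). The solver is a textbook-style illustration, not how {glmnet} is implemented, and the data are made up: \(x_1\) has a correlation of about 0.91 with both \(x_2\) and the outcome, yet its coefficient is set exactly to zero.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(columns, y, lam, n_sweeps=200):
    """Minimize (1/2n) * sum of squared residuals + lam * L1 norm of beta."""
    n = len(y)
    beta = [0.0] * len(columns)
    for _ in range(n_sweeps):
        for j, xj in enumerate(columns):
            # Partial residual: remove the fit of every predictor except j
            partial = [
                y[i] - sum(beta[k] * columns[k][i]
                           for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            z = sum(xj[i] * partial[i] for i in range(n)) / n
            scale = sum(v * v for v in xj) / n
            beta[j] = soft_threshold(z, lam) / scale
    return beta


x1 = [-1.5, -0.5, 0.5, 1.5]
x2 = [-1.0, -1.0, 0.0, 2.0]   # x1 plus a small perturbation
y = list(x2)                  # in this sample, x2 predicts y slightly better

beta = lasso_coordinate_descent([x1, x2], y, lam=0.1)
print(beta)  # x1 is dropped (coefficient 0.0); x2 keeps about 0.9333
```

A zero coefficient for \(x_1\) here says only that \(x_1\) adds little once \(x_2\) is in the model, not that \(x_1\) is unrelated to the outcome.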

Adaptive LASSO

  • Purple dot indicates that it’s a weighted (\(w_j\)) version of LASSO
  • Green checkmark indicates that its optimization is a convex problem
  • “Better Selection” and “Bias Reduction” are its advantages over standard LASSO
  • Weighted versions of the LASSO encode the relative importance of each covariate through a suitable choice of weights. Combined with iteration, this modification allows for a reduction of the bias.
    • Zou (2006) says that you should choose your weights so the adaptive Lasso estimates have the Oracle Property:
      • You will always identify the set of nonzero coefficients…when the sample size is infinite
      • The estimates are unbiased, normally distributed, and have the correct variance (Zou (2006) has the technical definition)…when the sample size is infinite.
    • To have these properties, \(w_j = \frac{1}{|\hat\beta_j|^q}\), where \(q > 0\) and \(\hat\beta_j\) is an unbiased estimate of the true parameter, \(\beta_j\)
      • Generally, people choose the Ordinary Least Squares (OLS) estimate of \(\beta_j\) because it will be unbiased. Ridge regression produces coefficient estimates that are biased, so you cannot guarantee the Oracle Property holds.
        • In practice, this probably doesn’t matter. The Oracle Property is an asymptotic guarantee (when \(n \rightarrow \infty\)), so it doesn’t necessarily apply to your data with a finite number of observations. There may be scenarios where using Ridge estimates for weights performs really well. Zou (2006) recommends using Ridge regression over OLS when your variables are highly correlated.
  • See article, Adaptive LASSO, for examples with a continuous, binary, and multinomial outcome
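To see how the weights \(w_j = 1/|\hat\beta_j|^q\) change the shrinkage, the sketch below uses a simplifying assumption that is not in the notes: an orthonormal design (\(\frac{1}{n}X^TX = I\)), where the LASSO solution is the soft-thresholded initial estimate and the adaptive version thresholds each coefficient at \(\lambda w_j\). The initial estimates are made-up numbers and the function names are hypothetical.

```python
def adaptive_weights(initial_estimates, q=1.0):
    """w_j = 1 / |beta_j|^q; a zero initial estimate gets an infinite weight."""
    return [float("inf") if b == 0 else 1.0 / abs(b) ** q
            for b in initial_estimates]


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


beta_init = [2.0, 0.5, 0.05]   # hypothetical initial (e.g. OLS) estimates
lam = 0.1

weights = adaptive_weights(beta_init, q=1.0)   # about [0.5, 2.0, 20.0]

# Standard LASSO: every coefficient is shrunk by the same amount, lam
lasso = [soft_threshold(b, lam) for b in beta_init]

# Adaptive LASSO: coefficient j is shrunk by lam * w_j
adaptive = [soft_threshold(b, lam * w) for b, w in zip(beta_init, weights)]

print("weights: ", weights)
print("lasso:   ", lasso)      # about [1.9, 0.4, 0.0]
print("adaptive:", adaptive)   # about [1.95, 0.3, 0.0]
```

The largest coefficient is shrunk by 0.05 instead of 0.1, which is the bias reduction, while the smallest faces a threshold of 2.0 instead of 0.1 and is removed more decisively.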

Firth’s Estimator

  • Penalized Logistic Regression estimator

  • For sample sizes less than around n = 1000 or for sparse data, using the Firth estimator is recommended

  • Misc

    • Packages
      • {brglm2} -
      • {logistf} - Includes FLIC and FLAC extensions; uses profile penalized likelihood confidence intervals which outperform Wald intervals; includes a function that performs a penalized likelihood ratio test on some (or all) selected factors
        • emmeans::emmeans is supported
    • Invariant to linear transformations of the design matrix (i.e. predictor variables) unlike Ridge Regression
    • While the standard Firth correction leads to shrinkage in all parameters, including the intercept, and hence produces predictions which are biased towards 0.5, FLIC and FLAC are able to exclude the intercept from shrinkage while maintaining the desirable properties of the Firth correction and ensure that the sum of the predicted probabilities equals the number of events.
  • Penalized Likelihood

    \[ L^*(\beta\;|\;y) = L(\beta\;|\;y)\;|I(\beta)|^{\frac{1}{2}} \]

    • Equivalent to penalization of the log-likelihood by the Jeffreys prior

    • \(I(\beta)\) is the Fisher information matrix, i.e. minus the (expected) second derivative of the log-likelihood

  • Maximum Likelihood vs Firth’s Correction

    • Bias

    • Variance

    • Coefficient and CI bar comparison on a small dataset (n = 35, k = 7)

  • Limitations

    • Relies on maximum likelihood estimation, which can be sensitive to datasets with large random sampling variation. In such cases, Ridge Regression may be a better choice as it provides some shrinkage and can stabilize the estimates by pulling them towards the observed event rate.
    • Less effective than ridge regression in datasets with highly correlated covariates
    • For the Firth Estimator, the Wald Test can perform poorly in data sets with extremely rare events.
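A minimal sketch of the penalized likelihood from this section, restricted to the simplest possible case: an intercept-only logistic model, where the Fisher information is the scalar \(np(1-p)\). Under that assumption the penalized score is \(y - np + \frac{1}{2}(1 - 2p)\), which has the closed-form solution \(\hat p = (y + 0.5)/(n + 1)\). The function names are hypothetical, and this is not how {logistf} or {brglm2} are implemented.

```python
import math


def penalized_loglik(theta, events, n):
    """Log-likelihood plus 0.5 * log|I(theta)| for an intercept-only model."""
    p = 1.0 / (1.0 + math.exp(-theta))
    loglik = events * math.log(p) + (n - events) * math.log(1.0 - p)
    fisher_info = n * p * (1.0 - p)   # 1x1 information "matrix"
    return loglik + 0.5 * math.log(fisher_info)


def firth_intercept(events, n, n_iter=50):
    """Newton's method on the penalized score; returns the estimated P(Y=1)."""
    theta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + math.exp(-theta))
        score = events - n * p + 0.5 * (1.0 - 2.0 * p)
        hessian = -(n + 1.0) * p * (1.0 - p)
        theta -= score / hessian
    return 1.0 / (1.0 + math.exp(-theta))


# No events in 10 observations: the maximum likelihood estimate is p = 0
# (an infinite logit), while the Firth estimate stays finite.
p_firth = firth_intercept(events=0, n=10)
p_closed_form = (0 + 0.5) / (10 + 1)

print(f"Firth estimate: {p_firth:.5f}")        # about 0.04545
print(f"closed form:    {p_closed_form:.5f}")  # about 0.04545
```

The estimate moves from 0 toward 0.5, which is the intercept shrinkage that FLIC and FLAC are designed to correct.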