Chapter Summaries

Chapter 2

  • Using counts
    • Garden of forking paths - all the potential ways we could get our sample data (a sequence of marbles drawn from a bag) given that a hypothesis is true (e.g. 1 blue, 3 white marbles in a bag)
    • “Conjectures” are the candidate explanations (e.g. the bag holds 1 blue and 3 white marbles), and each conjecture generates its own garden of forking paths
    • With data, we count the number of ways (forking paths) the data are consistent with each conjecture (this count plays the role of the likelihood)
    • With new data, the old counts become a prior, and updated total path counts = prior counts * new path counts (i.e. prior * likelihood)
  • As components of a model
    • The conjecture (aka parameter value), p, with the most paths is our best guess at the truth. The path counts are converted to probabilities (i.e. the probability of each parameter value) and are then called “relative plausibilities”.
    • The likelihood is a function (e.g. dbinom) that gives the probability of an observation given a parameter value (conjecture)
    • Prior probabilities (or relative plausibilities) must be assigned to each unknown parameter.
    • The updated relative plausibility of a conjecture, p, is the posterior probability
    • posterior(p1) = (prior(p1) * likelihood(p1)) / sum(prior * likelihood products over every possible value of the parameter p)
      • The denominator normalizes the updated plausibility so that the updated plausibilities across all parameter values sum to 1 (necessary for the result to formally be a probability distribution); see the grid-approximation sketch below
  • dbinom(x, size, prob) - probability mass function for the binomial distribution (R calls it a density)
    • x = # of observations of the event (e.g. hitting water on the globe)
    • size = sample size (N) (number of tosses)
    • prob = parameter value (conjecture)(i.e. hypothesized proportion of water on the earth) (p)
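  • A minimal grid-approximation sketch of the count → prior × likelihood → normalize recipe above, using dbinom for the book’s globe-tossing example (grid size and the flat prior are illustrative choices):

```r
# grid of candidate parameter values (conjectures) for p = proportion of water
p_grid <- seq(from = 0, to = 1, length.out = 100)

prior <- rep(1, 100)                                 # flat prior over the grid
likelihood <- dbinom(6, size = 9, prob = p_grid)     # 6 waters observed in 9 tosses
unstd_posterior <- prior * likelihood                # prior * likelihood
posterior <- unstd_posterior / sum(unstd_posterior)  # normalize so plausibilities sum to 1

p_grid[which.max(posterior)]                         # conjecture with the most "paths"
```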

Chapter 3 - Sampling

  • Sampling from the posterior
    • rbinom - generates random draws from the binomial distribution
  • Summarizing the posterior - mean, median, MAP (maximum a posteriori estimate), HPDI (highest posterior density interval)
  • Posterior predictive distribution (PPD)
    • Goodness-of-fit question: if we were to make predictions with this model (using the posterior for p), how consistent is our model with the data?
      • Answer: if data simulated from the PPD resemble the observed data, then the model is consistent with the data (i.e. good fit); see the sketch below
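  • A short sketch of sampling from the grid posterior and building a posterior predictive distribution with rbinom (continues the Chapter 2 grid; sample sizes are arbitrary):

```r
# reuse p_grid / posterior from the Chapter 2 sketch
samples <- sample(p_grid, prob = posterior, size = 1e4, replace = TRUE)

# summarize the posterior
mean(samples); median(samples)
rethinking::HPDI(samples, prob = 0.89)   # highest posterior density interval

# posterior predictive distribution: simulate new tosses using the sampled p values
ppd <- rbinom(1e4, size = 9, prob = samples)
table(ppd) / length(ppd)                 # compare simulated counts to the observed 6-of-9
```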

Chapter 4 - Linear Models

  • The posterior distribution is the probability density over every combination of all the parameter values
    • posterior distribution: after considering every possible combination of the parameters, it assigns a relative plausibility to every combination, conditional on this model and these data (from Ch. 6)
      • The posterior is the joint distribution of all combinations of the parameters at the same time, Pr(parameters|outcome, predictors)
    • Many posterior distributions are approximately gaussian/multivariate gaussian
    • Example: intercept model
      • it’s a density over every combination of values of the mean and sd
    • Example: Height ~ Weight
      • The posterior is Pr(α, β, σ | H, W)
        • which is proportional to Normal(H|μ,σ) ⨯ Normal(α|178,100) ⨯ LogNormal(β|0,1) ⨯ Uniform(σ|0,10), where μ = α + βW
  • Intro to priors
  • Model Notation
    • Example single-variable regression where height is the response and weight the predictor (see the quap sketch at the end of this chapter’s notes)
      • hi ~ Normal(μi, σ)  # response
      • μi = α + βxi  # response mean (deterministic, i.e. no longer a parameter)
      • α ~ Normal(178, 100) # intercept
      • β ~ Normal(0, 10) # slope
      • σ ~ Uniform(0, 50) # response s.d.
      • parameters are α, β, and σ
  • Centering/standardization of predictors can remove correlation between parameters
    • Without this transformation, parameters and their uncertainties will co-vary within the posterior distribution
      • e.g. posterior draws with high intercepts will systematically pair with low slopes (or vice versa)
    • Without independent parameters
      • They can’t be interpreted independently
      • Effects on prediction aren’t independent
  • CIs, PIs
  • basis splines (aka b-splines)
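  • A hedged quap sketch of the model notation above, using the book’s Howell1 data with a centered weight so the intercept/slope correlation is removed (prior values follow these notes and are illustrative):

```r
library(rethinking)
data(Howell1)
d <- Howell1[Howell1$age >= 18, ]           # adults only
d$weight_c <- d$weight - mean(d$weight)     # centering the predictor

m <- quap(
  alist(
    height ~ dnorm(mu, sigma),    # response
    mu <- a + b * weight_c,       # response mean (deterministic)
    a ~ dnorm(178, 100),          # intercept prior
    b ~ dnorm(0, 10),             # slope prior
    sigma ~ dunif(0, 50)          # response sd prior
  ),
  data = d
)
precis(m)
round(vcov(m), 3)   # with centering, the off-diagonal (a, b) covariance is near zero
```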

Chapter 5 - Multi-Variable Linear Models

  • Model Notation
    • DivorceRatei ~ Normal(μi, σ)
    • μi = α + β1MedianAgeMarriage_si + β2MarriageRate_si
    • α ~ Normal(10, 10)
    • β1 ~ Normal(0, 1)
    • β2 ~ Normal(0, 1)
    • σ ~ Uniform(0, 10)
  • Interpretation
  • DAGs
  • Inferential Plots
    • predictor residual, counter-factual, and posterior prediction
  • masking
    • correlation between predictors, combined with opposite-sign correlations of each with the outcome variable, can lead to larger estimated effects in a multivariable regression than in the individual bivariate regressions
  • categorical variables (not ordinals)
    • Using an index variable is preferred to dummy variables
      • the index method allows the priors for each category to have the same uncertainty
    • with the index method, no separate intercept term is used in the model specification (each category gets its own α; see the quap sketch at the end of this chapter’s notes)
    • Contrasts
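  • A minimal quap sketch of index-variable coding for a categorical predictor, with made-up data and illustrative priors (y, cid, and the prior values are assumptions, not the book’s example):

```r
library(rethinking)

# hypothetical data: outcome y and a 3-level categorical predictor coded as an index
d <- data.frame(
  y   = rnorm(90, mean = rep(c(10, 12, 14), each = 30), sd = 2),
  cid = rep(1:3, each = 30)      # index variable: 1, 2, 3 (no dummy columns)
)

m <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a[cid],                # one intercept per category, no overall intercept
    a[cid] ~ dnorm(10, 5),       # same prior (same uncertainty) for every category
    sigma ~ dunif(0, 10)
  ),
  data = d
)
precis(m, depth = 2)             # depth = 2 shows the vector parameter a[1..3]

# contrast between categories 1 and 2, computed from posterior samples
post <- extract.samples(m)
quantile(post$a[, 1] - post$a[, 2], c(0.055, 0.5, 0.945))
```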

Chapter 6 - Colliders, Multicollinearity, Post-Treatment Bias

  • multicollinearity
    • Consequences
      • the posterior distribution will seem to suggest that none of the collinear variables is reliably associated with the outcome, even if all of the collinear variables are in reality strongly associated with the outcome.
        • The posterior distributions of the parameter estimates will have very large spreads (std.devs)
        • i.e. parameter mean estimates shrink and their std.devs inflate as compared to the bivariate regression results.
      • predictions won’t be biased but interpretation of effects will be impossible
    • solutions
      • Think causally about what links the collinear variables and regress using that variable instead of the collinear ones
      • Use data reduction methods
  • post-treatment bias
    • mistaken inferences arising from including variables that are consequences of other variables
      • i.e. the values of the variable are a result after treatment has been applied
      • e.g. using presence of fungus as a predictor even though its value is determined after the anti-fungus treatment has been applied
    • Consequence: it can mask or unmask treatment effects depending on the causal model (DAG)
  • collider bias
    • When you condition on a collider, it induces statistical—but not necessarily causal— associations.
    • Consequence:
      • The statistical correlations/associations are present in the data and may mislead us into thinking they are causal.
        • The variables involved may still be useful for predictive models, since the backdoor paths do provide valid information about statistical associations within the data.
      • Depending on the causal model, these induced effects can be inflated
    • A more complicated demonstration of Simpson’s Paradox (see My Appendix)
  • Applications of Backdoor Criterion
    • See Causal Inference >> Statistical Rethinking
    • Recipe (see the dagitty sketch after this list)
      1. List all of the paths connecting X (the potential cause of interest) and Y (the outcome).
      2. Classify each path by whether it is open or closed. A path is open unless it contains a collider.
      3. Classify each path by whether it is a backdoor path. A backdoor path has an arrow entering X.
      4. If there are any backdoor paths that are also open, decide which variable(s) to condition on to close it.
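  • A small dagitty sketch of the recipe above; the DAG itself is hypothetical (Z is a confounder, C is a collider), so the variable names are assumptions:

```r
library(dagitty)

# hypothetical DAG: backdoor X <- Z -> Y is open; X -> C <- Y is closed by the collider C
dag <- dagitty("dag {
  X -> Y
  Z -> X
  Z -> Y
  X -> C
  Y -> C
}")

paths(dag, from = "X", to = "Y")                     # steps 1-2: list paths and whether each is open
adjustmentSets(dag, exposure = "X", outcome = "Y")   # steps 3-4: what to condition on (here: Z)
```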

Chapter 7 - Information Theory, Prediction Metrics, Model Comparison

  • Regularizing prior (type of skeptical prior)
    • Typically priors with smaller sd values
    • Flat priors result in a posterior that encodes as much of the training sample as possible. (i.e. overfitting)
    • When tuned properly, reduces overfitting while still allowing the model to learn the regular features of a sample.
      • Too skeptical (i.e. sd too small) results in underfitting
  • Information Entropy
    • The uncertainty contained in a probability distribution is the (negative) average log-probability of an event: H(p) = -sum(p_i * log(p_i))
  • Kullback-Leibler Divergence (K-L Divergence)
    • The additional uncertainty induced by using probabilities from one distribution (q) to describe another distribution (p): D_KL(p, q) = sum(p_i * (log(p_i) - log(q_i)))
  • Log Pointwise Predictive Density (lppd)
    • Sum over observations of the log of each observation’s average probability, where the average is taken over posterior samples (see the toy computation at the end of this chapter’s notes)
      • larger is better
    • The log average probability approximates the cross-entropy term of the K-L divergence, which is the only part needed to compare models
  • Deviance
    • -2*lppd
      • smaller is better
    • Differences in deviance between models approximate differences in K-L divergence (the unknown true-distribution term is the same for every model and cancels)
  • Predictive accuracy metrics (note: information-criterion metrics can’t be used to compare models with different likelihood functions)
    1. Pareto-Smoothed Importance Sampling Cross-Validation (PSIS)
      • Estimates out-of-sample LOO-CV lppd
        • loo pkg
          • “elpd_loo” - larger is better
          • “looic” - is just (-2 * elpd_loo) to convert it to the deviance scale, therefore smaller is better
        • Rethinking pkg: smaller is better
      • Importance-weights each posterior sample by how much it would matter if a given observation were left out
      • The largest importance weights are modeled with a generalized Pareto distribution and smoothed (stabilized); the fitted Pareto shape parameter k flags highly influential observations
    2. Widely Applicable Information Criterion (WAIC)
      • Deviance with a penalty term: the sum, over observations, of the variance (across posterior samples) of each observation’s log-probability
      • Estimates out-of-sample deviance
        • loo pkg:
          • “elpd_waic”: larger is better
          • “waic”:  is just (-2 * elpd_waic) to convert it to deviance scale, therefore smaller is better
        • Rethinking pkg: smaller is better
    3. Bayes Factor
      • The ratio (or difference, when logged) of the average likelihoods (the denominator of Bayes’ theorem, i.e. the marginal likelihood) of two models.
  • Model Comparison
    • To judge whether two models are “easy to distinguish” (i.e. roughly, whether their scores differ meaningfully), compare the difference between the best model’s WAIC and each other model’s WAIC against the standard error of that difference
    • Leave-one-out cross-validation (LOO-CV)
      • Has serious issues, I think (see the Vehtari paper for recommendations; haven’t read it yet)
  • Outliers
    • Detection - High p_waic (WAIC) and k (PSIS) values can indicate outliers
    • Solutions
      • Mixture Model
      • Robust regression using a Student-t distribution for the outcome variable; as the shape parameter ν decreases toward 1, the tails become thicker
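  • A toy R computation of the quantities above (the distributions p and q and the three “observations” are made up for illustration):

```r
# information entropy: H(p) = -sum(p * log(p))
p <- c(0.3, 0.7)                  # "true" distribution (made up)
q <- c(0.25, 0.75)                # model's distribution (made up)
H <- -sum(p * log(p))

# K-L divergence: extra uncertainty from using q to describe p
D_KL <- sum(p * (log(p) - log(q)))

# lppd: for each observation, log of its average probability over posterior samples
# (here: 3 hypothetical binomial counts out of 9 trials, with stand-in posterior samples for p)
obs <- c(6, 5, 7)
p_samples <- rbeta(1e4, 7, 4)     # Beta(7, 4) = flat-prior posterior after 6 successes in 9 trials
lppd <- sum(sapply(obs, function(y) log(mean(dbinom(y, size = 9, prob = p_samples)))))
dev <- -2 * lppd                  # deviance: smaller is better
```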

Chapter 8 - Interactions

  • continuous:categorical interaction
    • Coded similarly to coding categoricals (index method)
  • continuous:continuous interaction
    • Coded very similarly to the traditional R formula (an explicit product term; see the quap sketch below)
    • The prior for the interaction coefficient is typically the same as the priors for the slopes of the variables in the interaction
  • Plots
    • A counterfactual plot can be used to show the reverse of the typical interaction interpretation (i.e. the association of the continuous predictor conditioned on the categorical one)
    • Triptych plots are faceted predictor-vs-fitted plots: each panel shows the relationship between one interaction variable and the outcome at a different bin, quantile, or level of the other interaction variable
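  • A minimal quap sketch of a continuous:continuous interaction (the data, variable names, and prior values are made up for illustration):

```r
library(rethinking)

# hypothetical standardized predictors x and z, and an outcome with a true interaction
set.seed(8)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- rnorm(100, mean = 0.5 * d$x + 0.3 * d$z + 0.4 * d$x * d$z, sd = 1)

m_int <- quap(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a + bx * x + bz * z + bxz * x * z,   # explicit interaction term
    a   ~ dnorm(0, 0.5),
    bx  ~ dnorm(0, 0.5),
    bz  ~ dnorm(0, 0.5),
    bxz ~ dnorm(0, 0.5),                       # same prior as the main-effect slopes
    sigma ~ dexp(1)
  ),
  data = d
)
precis(m_int)
```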

Chapter 9 - MCMC

  • Gibbs
    • Optimizes sampling the joint posterior density by using conjugate priors
    • Inefficient for complex models
    • Can’t discern bad chains as well as HMC
  • Hamiltonian Monte Carlo (HMC)
    • Uses Hamiltonian differential equations in a particle physics simulation to sample the joint posterior density
      • Momentum and direction are randomly chosen
    • Hyperparameters
      • Tuned to avoid the U-turn problem (the simulated path doubling back on itself), which would produce autocorrelated, sequentially dependent samples
      • Determined during warm-up in HMC
        • Stan uses an adaptive No-U-Turn Sampler (NUTS)
      • Leapfrog steps (L) - paths between sampled posterior value combinations are made up of leapfrog steps
      • Step Size (ε) - The length of a leapfrog step is the step size
  • Diagnostics
    • Effective Sample Size (ESS) - Measures the amount by which autocorrelation in samples increases uncertainty (standard errors) relative to an independent sample
      • Bulk_ESS - effective sample size around the bulk of the posterior (i.e. around the mean or median)
        • When value is much lower than the actual number of iterations (minus warmup) of your chains, it means the chain is inefficient, but possibly still okay
      • Tail_ESS - effective sample size in the tails of the posterior
        • No idea what’s good here.
    • Rhat (Gelman-Rubin convergence diagnostic) - estimate of the convergence of Markov chains to the target distribution
      • If the chains converge, Rhat approaches 1 from above
      • If value is above 1.00, it usually indicates that the chain has not yet converged, and probably you shouldn’t trust the samples.
        • Early versions of this diagnostic can fail for more complex models (i.e. bad chains even when value = 1)
    • Trace Plots
      • Multi-line plot depicting the sampling of parameter values in the joint posterior
      • lazy, fat caterpillars = good chains
      • Not recommended since 1 pathological chain can remain hidden in the plot
    • Trank (trace rank) plots
      • Layered histograms of each chain’s rank-ordered samples; make it easier to judge each chain’s health than trace plots do (see the bayesplot sketch after this list)
  • Set-up
    • Warm-up samples
      • More complex models require more warm-up
      • Start with the default and adjust based on ESS values
    • Post-Warmup samples
      • 200 for mean estimates using not-too-complex regression models
      • Many more required for
        • Complex models
        • Finer resolution of the tails
        • Non-Gaussian distributions
    • Chains
      • debugging: 1
        • Some stan errors only display when 1 chain is used
      • Validation of chains: 3 or 4
      • Final Run: only need 1 but can use more depending on compute power/# of vCPUs
  • Problems with ugly chains in trace/trank plots
    • Solutions for the 2 examples were to use weakly informative priors ¯\_(ツ)_/¯
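  • A hedged brms/bayesplot sketch of the diagnostics above (the data and model are hypothetical; any brm() fit is inspected the same way):

```r
library(brms)
library(bayesplot)

# hypothetical data and model
set.seed(9)
d <- data.frame(x = rnorm(100))
d$y <- rnorm(100, mean = 1 + 2 * d$x, sd = 1)

fit <- brm(y ~ x, data = d, family = gaussian(),
           chains = 4, iter = 2000, warmup = 1000, seed = 9)

summary(fit)              # reports Rhat, Bulk_ESS, and Tail_ESS for each parameter

draws <- as.array(fit)    # iterations x chains x parameters
mcmc_trace(draws, pars = c("b_Intercept", "b_x"))         # trace plots: fat, lazy caterpillars = good
mcmc_rank_overlay(draws, pars = c("b_Intercept", "b_x"))  # trank plots: overlapping, roughly uniform ranks = good
```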

Chapter 10 - GLM Concepts

  • The principle of maximum entropy provides an empirically successful way to choose likelihood functions. Information entropy is essentially a measure of the number of ways a distribution can arise, according to stated assumptions. By choosing the distribution with the biggest information entropy, we thereby choose a distribution that obeys the constraints on outcome variables, without importing additional assumptions. Generalized linear models arise naturally from this approach, as extensions of the linear models in previous chapters.
  • The maximum entropy distribution is the one with the greatest information entropy (i.e. log number of ways per event) and is the most plausible distribution.
    • No guarantee that this is the best probability distribution for the real problem you are analyzing. But there is a guarantee that no other distribution more conservatively reflects your assumptions.
    • maximum entropy also provides a way to generate a more informative prior that embodies the background information, while assuming as little else as possible.
  • Omitted variable bias can have worse effects with GLMs
  • Gaussian
    • A perfectly uniform distribution would have infinite variance, in fact. So the variance constraint is actually a severe constraint, forcing the high-probability portion of the distribution to a small area around the mean.
    • The Gaussian distribution gets its shape by being as spread out as possible for a distribution with fixed variance.
    • The Gaussian distribution is the most conservative distribution for a continuous outcome variable with finite variance.
      • The mean µ doesn’t matter here, because entropy doesn’t depend upon location, just shape.
  • Binomial
    • Binomial distribution has the largest entropy of any distribution that satisfies these constraints:
      • only two unordered events (i.e. dichotomous)
      • constant expected value (i.e. exp_val = sum(prob*num_events))
    • If only two un-ordered outcomes are possible and you think the process generating them is invariant in time—so that the expected value remains constant at each combination of predictor values— then the distribution that is most conservative is the binomial.
  • Other distributions
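  • A small sketch of the “largest entropy” idea: several candidate distributions with the same expected value, where the flattest admissible one has the highest entropy (in the spirit of the book’s marble example; the candidate probabilities here are illustrative):

```r
# information entropy: H(p) = -sum(p * log(p))
H <- function(p) -sum(p * log(p))

# candidate distributions over four outcomes {ww, bw, wb, bb},
# all constrained to the same expected number of blue marbles (1)
cand <- list(
  A = c(1/4, 1/4, 1/4, 1/4),   # the binomial (p = 0.5) over the four sequences
  B = c(2/6, 1/6, 1/6, 2/6),
  C = c(1/6, 2/6, 2/6, 1/6),
  D = c(1/8, 4/8, 2/8, 1/8)
)

sapply(cand, function(q) sum(q * c(0, 1, 1, 2)))   # check: expected value is 1 for each
sapply(cand, H)                                    # A (the binomial) has the largest entropy
```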

Chapter 11

  • Logistic Regression models a 0/1 outcome and data is at the case level
    • Example: Acceptance, A; Gender, G; Department, D
      • Ai ~ Bernoulli(pi)
      • logit(pi) = α[Gi, Di]
  • Binomial Regression models the counts of a Bernoulli variable that have been aggregated by some group variable(s)
    • Example: Acceptance counts that have been aggregated by department and gender
      • Ai ~ Binomial(Ni, pi)
      • logit(pi) = α[Gi, Di]
  • Results are the same no matter whether you choose to fit a logistic regression with case-level data or aggregate the case-level data into counts and fit a binomial regression
  • brms models (see the sketch at the end of this chapter’s notes)
    • Logistic
      • family = bernoulli
      • formula: outcome_var ~ predictors (the 0/1 outcome is modeled directly; trials() is only needed for binomial families)
    • Binomial:
      • family = binomial
      • formula
        • balanced groups: outcome_var | trials(constant_n)
        • unbalanced groups: outcome_var | trials(column_with_n_for_each_row)
  • rstanarm models specified just like using glm
  • Normal priors that are flat on the outcome (probability) scale for logistic or binomial models do NOT have high sds. High sds on the logit scale say that the outcome event probability is almost always close to 0 or close to 1.
    • For flat intercept: sd = 1.5
    • For flat slope: sd = 1.0
    • See Bayes, priors for details on other options
  • Effects
    • Types
      • Absolute Effects - The effect a (counterfactual) change in predictor value (type of treatment) has on the outcome (probability of an event)
        • Contrast of the predicted values (e.g. marginal means)
      • Relative Effects - The effect a (counterfactual) change in predictor value (type of treatment) has on the outcome (odds of an event)
  • UC Berkeley gender discrimination analysis
  • Poisson - when N is very large and the probability of an event, p, is very small, the binomial’s expected value and variance are approximately equal; the resulting shape is the Poisson distribution
  • Flat Normal priors for Poisson also do NOT have high sds
    • Not sure if these are standard flat priors, but the priors in the example were
      • Intercept sd = 0.5
      • Slope sd = 0.2
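  • A hedged brms sketch of the logistic vs. aggregated-binomial forms above (the data frames and the column names admit, gender, and applications are made up; the priors follow the sd = 1.5 / 1.0 guidance in these notes):

```r
library(brms)

# hypothetical case-level data: one row per applicant, admit is 0/1
set.seed(11)
d_case <- data.frame(gender = factor(sample(c("F", "M"), 200, replace = TRUE)))
d_case$admit <- rbinom(200, size = 1, prob = ifelse(d_case$gender == "F", 0.4, 0.5))

# aggregated data: one row per gender, admit is now a count out of `applications` trials
d_agg <- aggregate(admit ~ gender, data = d_case, FUN = sum)
d_agg$applications <- as.vector(table(d_case$gender))

priors <- c(prior(normal(0, 1.5), class = Intercept),
            prior(normal(0, 1.0), class = b))

# case-level (Bernoulli / logistic) model: no trials() term
fit_logit <- brm(admit ~ gender, data = d_case, family = bernoulli(), prior = priors)

# aggregated (Binomial) model: counts with trials(N)
fit_binom <- brm(admit | trials(applications) ~ gender, data = d_agg,
                 family = binomial(), prior = priors)

# the posterior for the gender coefficient should be essentially identical in both
```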