Garden of forking paths - all the potential ways we can get our sample data (sequence of marbles drawn from a bag) given a hypothesis is true (e.g. 1 blue, 3 white marbles in a bag)
“Conjectures” are the candidate explanations of the data (e.g. the bag contains 1 blue and 3 white marbles), and each conjecture generates its own garden of forking paths
With data, we count the number of ways (forking paths) the data is consistent with each conjecture (aka likelihood)
With new data, old counts become a prior, and updated total path counts = prior * new path counts (i.e. prior * likelihood)
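A minimal R sketch of this path counting (bag of 4 marbles; the draw sequence blue, white, blue and the extra blue draw follow the book's example as I remember it):

```r
# Conjectures: the bag of 4 marbles could contain 0, 1, 2, 3, or 4 blue marbles
n_blue <- 0:4
n_white <- 4 - n_blue

# Observed draws (with replacement): blue, white, blue
# Count the forking paths consistent with the data under each conjecture
ways <- n_blue * n_white * n_blue
ways                     # 0 3 8 9 0

# Convert counts to relative plausibilities (probabilities)
ways / sum(ways)

# New data: one more blue draw. Old counts act as the prior:
# updated counts = prior counts * new counts (prior * likelihood)
updated <- ways * n_blue
updated / sum(updated)
```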
As components of a model
The conjecture (aka parameter value), p, with the most paths is our best guess at the truth. The path counts are converted to probabilities (i.e. probability of a parameter value) and are then called “relative plausibilities”.
The likelihood is a function (e.g. dbinom) that gives the probability of an observation given a parameter value (conjecture)
Prior probabilities (or relative plausibilities) must be assigned to each unknown parameter.
The updated relative plausibility of a conjecture, p, is the posterior probability
posterior(p1) = (prior(p1) * likelihood(p1)) / sum(prior * likelihood products over all possible values of the parameter, p)
The denominator normalizes the updated plausibility so that the updated plausibilities across all parameter values sum to 1 (necessary for it to formally be a probability distribution)
dbinom(x, size, prob) - probability density function for the binomial distribution
x = # of observations of the event (e.g. hitting water on the globe)
size = sample size (N) (number of tosses)
prob = parameter value (conjecture)(i.e. hypothesized proportion of water on the earth) (p)
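A grid-approximation sketch in R, putting these components together (globe tossing with 6 waters in 9 tosses, essentially the book's example):

```r
# Grid of candidate parameter values (conjectures for the proportion of water)
p_grid <- seq(from = 0, to = 1, length.out = 1000)

# Flat prior over the grid
prior <- rep(1, 1000)

# Likelihood: probability of observing 6 waters in 9 tosses for each candidate p
likelihood <- dbinom(6, size = 9, prob = p_grid)

# Posterior = prior * likelihood, normalized so it sums to 1
unstd_posterior <- prior * likelihood
posterior <- unstd_posterior / sum(unstd_posterior)
```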
Chapter 3 - Sampling
Sampling from the posterior
rbinom - random variable generator of the binomial distribution
Summarizing the posterior - mean, median, MAP, HPDI
posterior prediction distribution
GOF Question: If we were to make predictions about the most probable p using this model, how consistent is our model with the data?
Answer: simulate data from the posterior predictive distribution (PPD); if the observed data look typical of the simulated data (i.e. fall within the bulk of the PPD), then the model is consistent with the data. (i.e. good fit)
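A sketch of sampling from that grid posterior and summarizing it (continues the grid code above; chainmode, HPDI, and PI are from the rethinking package):

```r
library(rethinking)

# Draw samples of p from the grid, weighted by posterior probability
samples <- sample(p_grid, size = 1e4, replace = TRUE, prob = posterior)

# Posterior summaries
mean(samples)                    # posterior mean
median(samples)                  # posterior median
chainmode(samples, adj = 0.01)   # MAP (mode)
HPDI(samples, prob = 0.89)       # highest posterior density interval
PI(samples, prob = 0.89)         # percentile (equal-tail) interval

# Posterior predictive distribution: simulate counts of water in 9 new tosses,
# propagating parameter uncertainty by using the posterior samples of p
ppd <- rbinom(1e4, size = 9, prob = samples)
table(ppd) / 1e4                 # compare with the observed count (6 of 9)
```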
Chapter 4 - Linear Models
The posterior distribution is the probability density of every combination of all the parameter values
posterior distribution: after considering every possible combination of the parameters, it’s the assigned relative plausibilities to every combination, conditional on this model and these data. (from Ch.6)
The posterior is the joint distribution of all combinations of the parameters at the same time, Pr(parameters|outcome, predictors)
Many posterior distributions are approximately gaussian/multivariate gaussian
Example: intercept model
it’s a density over every combination of values of the mean and s.d.
Example: Height ~ Weight
The posterior is Pr(α, β , σ | H, W)
which is proportional to Normal(H|μ,σ) ⨯ Normal(α|178,100) ⨯ LogNormal(β|0,1) ⨯ Uniform(σ|0,10)
Intro to priors
Model Notation
Example single variable regression where height is the response and weight the predictor
hi ~ Normal(μi, σ) # response
μi = α + βxi # response mean (deterministic, i.e. no longer a parameter)
α ~ Normal(178, 100) # intercept
β ~ Normal(0, 10) # slope
σ ~ Uniform(0, 50) # response s.d.
parameters are α, β, and σ
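A sketch of this model in rethinking::quap (Howell1 is the book's height data; restricting to adults is my assumption, to keep the height-weight relationship roughly linear):

```r
library(rethinking)
data(Howell1)
d <- Howell1[Howell1$age >= 18, ]   # adults only

m <- quap(
  alist(
    height ~ dnorm(mu, sigma),   # hi ~ Normal(mu_i, sigma)
    mu <- a + b * weight,        # mu_i = alpha + beta * x_i (deterministic)
    a ~ dnorm(178, 100),         # intercept prior
    b ~ dnorm(0, 10),            # slope prior
    sigma ~ dunif(0, 50)         # response s.d. prior
  ),
  data = d
)
precis(m)   # posterior means, s.d.s, and intervals for a, b, sigma
```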
Centering/standardization of predictors can remove correlation between parameters
Without this transformation, parameters and their uncertainties will co-vary within the posterior distribution
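A short sketch of centering, continuing the quap model above (m); the intercept then becomes the expected height at the average weight:

```r
# Center the predictor
d$weight_c <- d$weight - mean(d$weight)

m_c <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + b * weight_c,
    a ~ dnorm(178, 100),
    b ~ dnorm(0, 10),
    sigma ~ dunif(0, 50)
  ),
  data = d
)

# Posterior correlation between a and b is now near zero
# (compare with cov2cor(vcov(m)) for the uncentered model)
round(cov2cor(vcov(m_c)), 2)
```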
predictor residual, counterfactual, and posterior prediction plots
masking
when predictors are correlated with each other and each has an opposite-sign association with the outcome, they mask one another; including both in a multiple regression can lead to larger estimated effects than the individual bivariate regressions show
categorical variables (not ordinals)
Using an index variable is preferred to dummy variables
the index method allows the priors for each category to have the same uncertainty
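A hedged sketch of index coding in quap (uses the male column of Howell1 from the sketch above; the prior values are assumptions for illustration):

```r
# Index variable: 1 = female, 2 = male (instead of a 0/1 dummy variable)
d$sex <- ifelse(d$male == 1, 2, 1)

m_sex <- quap(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a[sex],              # one intercept per category
    a[sex] ~ dnorm(178, 20),   # same prior (same uncertainty) for every category
    sigma ~ dunif(0, 50)
  ),
  data = d
)
precis(m_sex, depth = 2)   # depth = 2 shows the vector parameter a[1], a[2]
```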
multicollinearity
the posterior distribution will seem to suggest that none of the collinear variables is reliably associated with the outcome, even if all of the collinear variables are in reality strongly associated with the outcome.
The posterior distributions of the parameter estimates will have very large spreads (std.devs)
i.e. parameter mean estimates shrink and their std.devs inflate as compared to the bivariate regression results.
predictions won’t be biased but interpretation of effects will be impossible
solutions
Think causally about what links the collinear variables and regress using that variable instead of the collinear ones
Use data reduction methods
post-treatment bias
mistaken inferences arising from including variables that are consequences of other variables
i.e. the values of the variable are a result after treatment has been applied
e.g. using presence of fungus as a predictor even though its value is determined after the anti-fungus treatment has been applied
Consequence: it can mask or unmask treatment effects depending on the causal model (DAG)
collider bias
When you condition on a collider, it induces statistical (but not necessarily causal) associations.
Consequence:
The statistical correlations/associations are present in the data and may mislead us into thinking they are causal.
That said, the variables involved may still be useful for predictive models, since the backdoor paths do provide valid information about statistical associations within the data.
Depending on the causal model, these induced effects can be inflated
A more complicated demonstration of Simpson’s Paradox (see My Appendix)
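A small simulation sketch (variable names are mine) showing how conditioning on a collider induces a statistical association between two causally independent variables:

```r
set.seed(42)
n <- 1000
x <- rnorm(n)                    # X and Y are causally independent
y <- rnorm(n)
coll <- rnorm(n, mean = x + y)   # collider: caused by both X and Y

coef(lm(y ~ x))["x"]             # ~0: no marginal association
coef(lm(y ~ x + coll))["x"]      # clearly negative: induced by conditioning on the collider
```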
Pareto-Smoothed Importance Sampling (PSIS / PSIS-LOO)
Weights observations based on their influence on the posterior (importance weights)
Fits a Pareto distribution to the largest importance weights (the most influential observations) and uses it to smooth/stabilize them
loo pkg: “looic” is just (-2 * elpd_loo) to convert it to the deviance scale, therefore smaller is better
Rethinking pkg: smaller is better
Widely Applicable Information Criterion (WAIC)
Deviance with a penalty term equal to the sum, over observations, of the variance (across posterior samples) of each observation’s log-probability
Estimates out-of-sample deviance
loo pkg:
“elpd_waic”: larger is better
“waic”: is just (-2 * elpd_waic) to convert it to deviance scale, therefore smaller is better
Rethinking pkg: smaller is better
Bayes Factor
The ratio (or difference when logged) of the average likelihoods (the denominator of bayes theorem) of two models.
Model Comparison
To judge whether two models are “easy to distinguish” (i.e. kinda like whether their scores are statistically different), we look at the difference between the best (lowest) WAIC and each other model’s WAIC, along with the standard error of that difference
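A hedged sketch of this comparison with rethinking::compare (m1, m2, m3 are hypothetical fitted models of the same outcome; the column names are from memory):

```r
library(rethinking)

# m1, m2, m3: hypothetical quap/ulam fits to the same data
compare(m1, m2, m3, func = WAIC)
# Columns (roughly): WAIC, SE, dWAIC (difference from the best model),
# dSE (standard error of that difference), pWAIC (penalty), weight.
# If dWAIC is small relative to dSE, the models are hard to distinguish.
```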
Leave-one-out cross-validation (LOO-CV)
Has serious issues, I think (see Vehtari paper for recommendations; haven’t read it yet)
Outliers
Detection - High p_waic (WAIC) and k (PSIS) values can indicate outliers
Solutions
Mixture Model
Robust Regression using a t-distribution for the outcome variable. As the shape parameter, ν, approaches 1 from above, the tails become thicker.
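A hedged sketch of a robust regression in quap using rethinking's Student-t density (the outcome y, predictor x, fixed ν = 2, and prior values are assumptions for illustration; dat is a hypothetical data frame of standardized variables):

```r
library(rethinking)

m_robust <- quap(
  alist(
    y ~ dstudent(2, mu, sigma),   # thick-tailed likelihood, nu fixed at 2
    mu <- a + b * x,
    a ~ dnorm(0, 0.2),
    b ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = dat
)
```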
Chapter 8 - Interactions
continuous:categorical interaction
Coded similarly to coding categoricals (index method)
continuous:continuous interaction
Coded very similarly to the traditional R formula (the interaction term is just the product of the two predictors)
The prior for the interaction coefficient is typically the same as the priors for the slopes of the variables in the interaction
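A sketch of a continuous:continuous interaction in quap, roughly the book's tulips model (the exact prior values here may differ slightly from the text):

```r
library(rethinking)
data(tulips)
d <- tulips
d$blooms_std <- d$blooms / max(d$blooms)
d$water_cent <- d$water - mean(d$water)
d$shade_cent <- d$shade - mean(d$shade)

m_int <- quap(
  alist(
    blooms_std ~ dnorm(mu, sigma),
    # the interaction term is just the product of the two centered predictors
    mu <- a + bw * water_cent + bs * shade_cent + bws * water_cent * shade_cent,
    a   ~ dnorm(0.5, 0.25),
    bw  ~ dnorm(0, 0.25),
    bs  ~ dnorm(0, 0.25),
    bws ~ dnorm(0, 0.25),   # same prior as the main-effect slopes
    sigma ~ dexp(1)
  ),
  data = d
)
```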
Plots
A counterfactual plot can be used to show the reverse of the typical interaction interpretation (i.e. association of continuous predictor conditioned on the categorical)
Triptych plots are a faceted plot of one of the interaction variables vs. the fitted outcome, where you facet by bins, quantiles, or levels of the other interaction variable
Chapter 9 - MCMC
Gibbs
Optimizes sampling the joint posterior density by using conjugate priors
Inefficient for complex models
Can’t discern bad chains as well as HMC
Hamiltonian Monte Carlo (HMC)
Uses Hamiltonian differential equations in a physics simulation (a frictionless particle gliding across the posterior surface) to sample the joint posterior density
Momentum and direction are randomly chosen
Hyperparameters
Used to reduce autocorrelation of the sampling (sampling is sequential) (U-Turn Problem)
Determined during warm-up in HMC
Stan uses NUTS2 (a second-generation No-U-Turn Sampler)
Leapfrog steps (L) - paths between sampled posterior value combinations are made up of leapfrog steps
Step Size (ε) - The length of a leapfrog step is the step size
Diagnostics
Effective Sample Size (ESS) - Measures the amount by which autocorrelation in samples increases uncertainty (standard errors) relative to an independent sample
Bulk_ESS - effective sample size around the bulk of the posterior (i.e. around the mean or median)
When value is much lower than the actual number of iterations (minus warmup) of your chains, it means the chain is inefficient, but possibly still okay
Tail_ESS - effective sample size in the tails of the posterior
No idea what’s good here.
Rhat (Gelman-Rubin convergence diagnostic) - estimate of the convergence of Markov chains to the target distribution
If the chains converge, Rhat approaches 1 from above
If value is above 1.00, it usually indicates that the chain has not yet converged, and probably you shouldn’t trust the samples.
Early versions of this diagnostic can fail for more complex models (i.e. bad chains even when value = 1)
Trace Plots
Multi-line plot depicting the sampling of parameter values in the joint posterior
lazy, fat caterpillars = good chains
Not recommended since 1 pathological chain can remain hidden in the plot
Trank plots
A layered histogram of within-chain sample ranks that makes each chain’s health easier to judge than trace plots do
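A hedged sketch of checking these diagnostics with the rethinking package (m is a hypothetical ulam fit):

```r
library(rethinking)

precis(m)      # summary includes effective sample size (n_eff / ess_bulk,
               # depending on version) and Rhat for each parameter
traceplot(m)   # trace plots: look for lazy, fat caterpillars
trankplot(m)   # trank plots: per-chain histograms of sample ranks;
               # healthy chains give overlapping, roughly uniform histograms
```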
Set-up
Warm-up samples
More complex models require more warm-up
Start with the defaults and adjust based on ESS values
Post-Warmup samples
200 for mean estimates using not-too-complex regression models
Much moar required for
Complex models
Finer resolution of the tails
Non-Gaussian distributions
Chains
debugging: 1
Some stan errors only display when 1 chain is used
Validation of chains: 3 or 4
Final Run: only need 1 but can use more depending on compute power/# of vCPUs
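A hedged sketch of how these settings are passed to rethinking::ulam (flist and dat are placeholders for a model alist() and its data):

```r
library(rethinking)

fit <- ulam(
  flist,            # model definition (alist of likelihood + priors)
  data   = dat,
  chains = 4,       # 3-4 for validation; 1 when debugging Stan errors
  cores  = 4,       # run chains in parallel across vCPUs
  iter   = 1000,    # total iterations per chain (warmup + post-warmup)
  warmup = 500      # increase for more complex models
)
```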
Problems with ugly chains in trace/trank plots
Solutions for the 2 examples were to use weakly informative priors ¯\_(ツ)_/¯
Chapter 10 - GLM Concepts
The principle of maximum entropy provides an empirically successful way to choose likelihood functions. Information entropy is essentially a measure of the number of ways a distribution can arise, according to stated assumptions. By choosing the distribution with the biggest information entropy, we thereby choose a distribution that obeys the constraints on outcome variables, without importing additional assumptions. Generalized linear models arise naturally from this approach, as extensions of the linear models in previous chapters.
The maximum entropy distribution is the one with the greatest information entropy (i.e. log number of ways per event) and is the most plausible distribution.
No guarantee that this is the best probability distribution for the real problem you are analyzing. But there is a guarantee that no other distribution more conservatively reflects your assumptions.
maximum entropy also provides a way to generate a more informative prior that embodies the background information, while assuming as little else as possible.
Omitted variable bias can have worse effects with GLMs
Gaussian
A perfectly uniform distribution would have infinite variance, in fact. So the variance constraint is actually a severe constraint, forcing the high-probability portion of the distribution to a small area around the mean.
The Gaussian distribution gets its shape by being as spread out as possible for a distribution with fixed variance.
The Gaussian distribution is the most conservative distribution for a continuous outcome variable with finite variance.
The mean µ doesn’t matter here, because entropy doesn’t depend upon location, just shape.
Binomial
Binomial distribution has the largest entropy of any distribution that satisfies these constraints:
only two unordered events (i.e. dichotomous)
constant expected value (i.e. exp_val = sum(prob*num_events))
If only two un-ordered outcomes are possible and you think the process generating them is invariant in time—so that the expected value remains constant at each combination of predictor values— then the distribution that is most conservative is the binomial.
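A small sketch computing information entropy for candidate distributions that all satisfy these constraints (two unordered events over two trials, expected count of “blue” equal to 1); the candidate probabilities are made up for illustration:

```r
# Information entropy: H(p) = -sum(p * log(p))
entropy <- function(p) -sum(p * log(p))

# Distributions over the outcomes {ww, bw, wb, bb}, each with
# expected number of "b" events equal to 1
candidates <- list(
  A = c(1/4, 1/4, 1/4, 1/4),   # the binomial with n = 2, p = 0.5
  B = c(2/6, 1/6, 1/6, 2/6),
  C = c(1/6, 2/6, 2/6, 1/6),
  D = c(1/8, 4/8, 2/8, 1/8)
)

sapply(candidates, entropy)   # A (the binomial) has the largest entropy
```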
Other distributions
Chapter 11 - Binomial and Poisson GLMs
Logistic Regression models a 0/1 outcome and data is at the case level
Example: Acceptance, A; Gender, G; Department, D
Ai ~ Bernoulli(pi)
logit(pi) = α[Gi, Di]
Binomial Regression models the counts of a Bernoulli variable that have been aggregated by some group variable(s)
Example: Acceptance counts that have been aggregated by department and gender
Ai ~ Binomial(Ni, pi)
logit(pi) = α[Gi, Di]
Results are the same no matter whether you choose to fit a logistic regression with case-level data or aggregate the case-level data into counts and fit a binomial regression
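A hedged sketch of both forms in ulam (admissions-style example; the data list names, priors, and the additive gender + department effects are my simplifications of the α[Gi, Di] notation above):

```r
library(rethinking)

# Case-level (logistic/Bernoulli): dat_case has admit (0/1), gid, did indices
m_logit <- ulam(
  alist(
    admit ~ dbern(p),
    logit(p) <- a[gid] + delta[did],
    a[gid] ~ dnorm(0, 1.5),
    delta[did] ~ dnorm(0, 1.5)
  ),
  data = dat_case, chains = 4
)

# Aggregated (binomial): dat_agg has admit counts out of applications per row
m_binom <- ulam(
  alist(
    admit ~ dbinom(applications, p),
    logit(p) <- a[gid] + delta[did],
    a[gid] ~ dnorm(0, 1.5),
    delta[did] ~ dnorm(0, 1.5)
  ),
  data = dat_agg, chains = 4
)
# Posterior inferences about a and delta should match across the two forms
```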
Poisson - when N is very large and the probability of an event, p, is very small, the binomial’s expected value and variance are approximately the same (both ≈ Np). The shape of this distribution is the Poisson
Because of the log link, “flat” Normal priors for Poisson models do NOT have high s.d.s (a wide prior on the log scale implies absurdly large counts; see the prior predictive sketch below)
Not sure if these are standard flat priors, but the priors in the example were
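A quick prior predictive sketch of why a wide Normal prior on the log scale is not actually “flat” for a Poisson mean (the specific s.d. values are assumptions for illustration):

```r
# Poisson GLM with a log link: lambda = exp(alpha)
alpha_wide <- rnorm(1e5, 0, 10)
mean(exp(alpha_wide))      # absurdly large expected count

alpha_tight <- rnorm(1e5, 3, 0.5)
mean(exp(alpha_tight))     # roughly exp(3 + 0.5^2 / 2) ~ 23, a plausible count
```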