Chapter 2
Counting the Ways
One blue marble is drawn from the bag and replaced. The bag is shaken, and a white marble is drawn and replaced. Finally, the bag is shaken, and a blue marble is drawn and replaced.
- Each ring is an iid observation (bag shaken, a marble drawn and replaced)
- In this example, the “garden of forking paths” is the set of all potential draw sequences (each consisting of 3 observations), given the conjecture that there are 1 blue and 3 white marbles in the bag
If we actually do draw a marble, record the observation, replace the marble, and repeat 2 more times, and the result is blue, white, blue, then we count the number of paths in each conjecture’s garden that are consistent with that outcome
- For conjecture 1 blue, 3 white, (1, 3, 1) is the number of paths in each ring, respectively, that remain consistent with the sequence of recorded observations.
- When multiplied together, the product equals the total number of consistent paths: 1 ⨯ 3 ⨯ 1 = 3.
After the bag is shaken, a new marble is drawn (new data): it’s blue. The previous counts now become the prior counts.
The number of ways this new blue marble can be drawn under each conjecture is used to update that conjecture’s prior count through multiplication (see the sketch below).
- This is equivalent to starting over and drawing another marble after the previous 3 iid observations.
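A minimal R sketch of this counting and updating logic, for a bag of 4 marbles and the draw sequence blue, white, blue:

```r
blue <- 0:4                       # conjectured number of blue marbles in a bag of 4
ways <- blue * (4 - blue) * blue  # paths consistent with blue, white, blue
ways
#> [1] 0 3 8 9 0

# New data: one more blue draw. Multiply the prior counts by the
# number of ways each conjecture can produce a blue marble.
new_ways <- ways * blue
new_ways
#> [1]  0  3 16 27  0
```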
Plausibilities
- The plausibility of a conjecture (\(p_1\)) is the (prior plausibility given \(p_1\)) ⨯ (“new count” given \(p_1\)). Then, that product is standardized into a probability so that it is comparable to other conjectures.
- \(\text{plausibility}_{p_1} = \frac{\text{prior plausibility}_{p_1} \,\times\, \text{new count}_{p_1}}{\sum_i \text{new count}_{p_i} \,\times\, \text{prior plausibility}_{p_i}}\), where the sum in the denominator runs over all conjectures (including \(p_1\))
- It’s the probability of the conjecture given the new data
- The plausibility of a conjecture, \(p\), after seeing new evidence, \(D_{new}\), is proportional to the ways the conjecture, \(p\), can produce the new evidence, \(D_{new}\), times the prior plausibility of the conjecture, \(p\).
- Equivalent Notations:
Plausibility of \(p\) after \(D_{new}\) ∝ ways \(p\) can produce \(D_{new}\) ⨯ prior plausibility of \(p\)
\(\text{Plausibility of } p \text{ after } D_{new} = \frac{\text{Ways} \; p \; \text{can produce}\ D_{new}\ \times\ \text{Prior plausibility of}\ p}{\text{Sum of products}}\)
Sum of products = the sum of (WAYS ⨯ prior plausibility) across all conjectures. With a flat prior, this is just the sum of the WAYS of each conjecture. For this example:
- Conjecture 0 blues = 0 ways
- Conjecture 1 blue = 3 ways (current example)
- Conjecture 2 blues = 8 ways
- Conjecture 3 blues = 9 ways
- Conjecture 4 blues = 0 ways
- Therefore, the sum of products = 0 + 3 + 8 + 9 + 0 = 20
If the prior plausibility of the conjecture of 1 blue marble = 1 (and likewise for the rest of the conjectures, i.e. a flat prior), then the plausibility of the conjecture 1 blue = (3 ⨯ 1)/20 = 0.15. The plausibility calculation normalizes the counts to be between 0 and 1.
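In R, this normalization is one line; the counts are the WAYS listed above:

```r
ways <- c(0, 3, 8, 9, 0)       # consistent paths for conjectures of 0-4 blue marbles
prior <- rep(1, length(ways))  # flat prior
plausibility <- (ways * prior) / sum(ways * prior)
plausibility
#> [1] 0.00 0.15 0.40 0.45 0.00
```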
In Bayesian Language
- A conjectured proportion of blue marbles, p, is usually called a parameter value. It’s just a way of indexing possible explanations of the data
- In the example below, the proportion of surface water is the unknown parameter, but the conjecture could also be other things like sample size, treatment effect, group variation, etc.
- There can also be multiple unknown parameters for the likelihood to consider.
- Every parameter must have a corresponding prior probability assigned to it.
- The relative number of ways that a value p can produce the data is usually called a likelihood.
- It is derived by enumerating all the possible data sequences that could have happened and then eliminating those sequences inconsistent with the data (i.e. paths_consistent_with_data / total_paths).
- As a model component, the likelihood is a function that gives the probability of an observation given a parameter value (conjecture)
- “How likely is your sample data out of all possible sample data of the same length?”
- Example: The proportion of water to land on the earth:
- W is distributed Binomially with N trials and a probability of p for W in each trial, \(W \sim \mbox{Binomial}(N, p)\)
- “The count of ‘water’ observations (globe is tossed and finger lands on water), W, is distributed binomially, with probability p of ‘water’ on each toss of a globe and N tosses in total.”
- Notation: \(L(p \ | \ W, N)\)
- Assumptions:
- Observations are independent of each other
- The probability of observation of W (water) is the same for every observation
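Written out, the binomial likelihood that encodes these assumptions is:

\(\Pr(W \mid N, p) = \frac{N!}{W!\,(N - W)!} \; p^W (1 - p)^{N - W}\)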
`dbinom(x, size, prob)`
- Finds the probability (i.e. likelihood) of getting a certain number of successes (x) in a certain number of trials (size) where the probability of success on each trial is fixed (prob).
- Args
- x = # of observations of water (W)
- size = sample size (N) (number of tosses)
- prob = parameter value (conjecture), i.e. the hypothesized proportion of water on the earth (p)
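For instance, the likelihood of 6 water observations in 9 tosses under the conjecture p = 0.5 (the counts 6 and 9 are assumed here for illustration):

```r
dbinom(6, size = 9, prob = 0.5)
#> [1] 0.1640625
```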
- The prior plausibility of any specific p is usually called the prior probability.
- A distribution of initial plausibilities for every value of a parameter
- Expresses prior knowledge about a parameter and constrains estimates to reasonable ranges
- Unless there’s already strong evidence for using a particular prior, multiple priors should be tried to see how sensitive the estimates are to the choice of a prior
- Example where the prior is a probability distribution for the parameter:
- p is distributed Uniformly between 0 and 1, (i.e. each conjecture is equally likely), \(p \sim \mbox{Uniform}(0, 1)\)
- Weakly Informative or Regularizing priors: conservative; guards against inferences of strong association
- Mathematically equivalent to penalized likelihood
- The new, updated relative plausibility of a specific p is called the posterior probability.
- The set of estimates, aka relative plausibilities of different parameter values, aka posterior probabilities, conditional on the data — is known as the posterior distribution or posterior density (e.g. \(Pr(p \ | \ N, W)\)).
- Thoughts
- The likelihood, prior, and posterior densities are each probability densities with an area of 1, and looking at the marble tables, the individual posterior probabilities sum to 1. The product of multiplying the prior and likelihood densities, however, does not generally have an area of 1 (for densities, this “sum” is an integration). The denominator (i.e. the sum of products) therefore standardizes each of these products so that the posterior density does have an area of 1.
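In symbols, using the notation from above:

\(\Pr(p \mid N, W) = \frac{\Pr(W \mid N, p) \times \Pr(p)}{\Pr(W)}, \quad \text{where} \quad \Pr(W) = \int_0^1 \Pr(W \mid N, p)\,\Pr(p)\,dp\)

The denominator \(\Pr(W)\) is the “sum of products” (the likelihood averaged over the prior), and dividing by it is what gives the posterior density an area of 1.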
Numerical Solvers for the Posterior Distribution
- Grid Approximation - compute the posterior distribution using only a finite grid of potential parameter values for a set of unknown parameters
- Doesn’t scale well as the number of parameters grows
- Steps:
- Decide how many values you want to use in your grid (e.g. `seq(from = 0, to = 1, length.out = 1000)`). The number of parameter values in your grid equals the number of points in your posterior distribution.
- Compute the prior value for each parameter value in your grid (e.g. `rep(1, 1000)` for a uniform prior).
- Compute the likelihood for each grid value (e.g. using `dbinom(x, size, prob = grid)`).
- Multiply the likelihood times the prior; this is the unstandardized posterior.
- Standardize that posterior by dividing by `sum(unstd_posterior)`.
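Putting the steps together, a minimal sketch (again assuming 6 waters in 9 tosses for illustration):

```r
p_grid <- seq(from = 0, to = 1, length.out = 1000)   # grid of conjectured values of p
prior <- rep(1, 1000)                                # uniform prior at each grid value
likelihood <- dbinom(6, size = 9, prob = p_grid)     # likelihood at each grid value
unstd_posterior <- likelihood * prior                # unstandardized posterior
posterior <- unstd_posterior / sum(unstd_posterior)  # standardize to sum to 1
plot(p_grid, posterior, type = "l")                  # posterior peaks near 6/9
```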
- Quadratic approximation - near its peak, the posterior distribution can often be represented quite well by a Gaussian distribution. The log of a Gaussian (posterior) distribution is quadratic, hence the name.
- Steps:
- Find the mode of the posterior, typically via an optimization algorithm. With a uniform prior, this is equivalent to the MLE
- Estimate the curvature of the posterior using another numerical method
- Needs larger sample sizes to be accurate; how large is model dependent.
- {rethinking} function, `quap()`
- Inputs are a likelihood function (e.g. `dbinom`), a prior function (e.g. `dunif`), and data for the likelihood function
- Outputs the mean and standard deviation of the (approximately Gaussian) posterior distribution
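A minimal sketch of a `quap()` call for the globe-tossing example (the W = 6, L = 3 counts are assumed for illustration):

```r
library(rethinking)

globe_qa <- quap(
  alist(
    W ~ dbinom(W + L, p),  # binomial likelihood
    p ~ dunif(0, 1)        # uniform prior on the proportion of water
  ),
  data = list(W = 6, L = 3)
)

precis(globe_qa)  # posterior mean, sd, and interval of the quadratic approximation
```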
- MCMC only briefly mentioned