Chapter 2

Counting the Ways

  • One blue marble is drawn from the bag and replaced. The bag is shaken, and a white marble is drawn and replaced. Finally, the bag is shaken, and a blue marble is drawn and replaced

    • Each ring is an iid observation (the bag is shaken, and a marble is drawn and replaced)
    • In this example, the “garden of forking paths” is the set of all potential draw sequences (each consisting of 3 observations), given the conjecture that the bag holds 1 blue and 3 white marbles
  • If we actually do draw a marble, record the observation, replace the marble, and repeat 2 more times, and the result is blue, white, blue, then we count the number of paths in each conjecture’s garden that are consistent with that outcome

    • For conjecture 1 blue, 3 white, (1, 3, 1) is the number of paths in each ring, respectively, that remain consistent with the sequence of recorded observations.
      • When multiplied together, 1 ⨯ 3 ⨯ 1 = 3, the total number of paths consistent with the data under this conjecture.
  • After the bag is shaken, a new marble is drawn (new data) — it’s blue. Previous counts are now the prior counts.

    • The number of ways this new blue marble can be drawn under each conjecture is used to update that conjecture’s prior count through multiplication (see the R sketch after this list).

      • This is equivalent to starting over and drawing another marble after the previous 3 iid observations.
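
A minimal R sketch of this updating-by-multiplication, using the counts from the worked example (the vector order is conjectures of 0, 1, 2, 3, and 4 blue marbles out of 4):

```r
# Counts of paths consistent with the first three draws (blue, white, blue),
# one entry per conjecture of 0-4 blue marbles in a bag of 4
prior_counts <- c(0, 3, 8, 9, 0)

# Ways each conjecture can produce the new observation (one blue marble)
ways_new_blue <- 0:4

# Updated counts = prior counts times ways to produce the new data
new_counts <- prior_counts * ways_new_blue
new_counts
#> [1]  0  3 16 27  0
```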

Plausibilities

  • The plausibility of a conjecture (p1) is the (prior plausibility given p1) ⨯ (“New Count” given p1). Then, that product is standardized into a probability so that it is comparable to other conjectures.
    • \(\text{plausibility}_{p_1} = \dfrac{\text{new count}_{p_1} \times \text{prior plausibility}_{p_1}}{\sum_i \left( \text{new count}_{p_i} \times \text{prior plausibility}_{p_i} \right)}\), where the sum in the denominator runs over all conjectures (including \(p_1\))
    • It’s the probability of the conjecture given the new data
  • The plausibility of a conjecture, p, after seeing new evidence, \(D_{new}\), is proportional to the ways the conjecture, p, can produce the new evidence, \(D_{new}\), times the prior plausibility of the conjecture, p.
    • Equivalent Notations:
      • Plausibility of p after \(D_{new}\) ∝ ways p can produce \(D_{new}\) ⨯ prior plausibility of p

      • Plausibility of p after \(D_{new}\) \(= \frac{\text{Ways} \; p \; \text{can produce} \; D_{new} \; \times \; \text{Prior plausibility of} \; p}{\text{Sum of products}}\)

        • Sum of products = the sum, over all conjectures, of (WAYS ⨯ prior plausibility). With a flat prior of 1 for every conjecture, this is simply the sum of each conjecture’s WAYS. For example:

          • Conjecture 0 blues = 0 ways
          • Conjecture 1 blue   = 3 ways (current example)
          • Conjecture 2 blues = 8 ways
          • Conjecture 3 blues = 9 ways
          • Conjecture 4 blues = 0 ways
          • Therefore sum of products = 20
        • If the prior plausibility of every conjecture = 1 (a flat prior), then the plausibility of the 1-blue conjecture = (3 ⨯ 1)/20 = 0.15. The plausibility calculation normalizes the counts so they lie between 0 and 1 and sum to 1 (see the R sketch below).
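
A minimal R sketch of this standardization, using the ways counted above and a flat prior (vector order is conjectures of 0-4 blue marbles):

```r
# Ways each conjecture (0, 1, 2, 3, 4 blue marbles) can produce the observed draws
ways  <- c(0, 3, 8, 9, 0)
prior <- rep(1, length(ways))   # flat prior: every conjecture starts equally plausible

# Multiply ways by prior, then divide by the sum of products to standardize
plausibility <- (ways * prior) / sum(ways * prior)
plausibility
#> [1] 0.00 0.15 0.40 0.45 0.00
```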

In Bayesian Language

  • A conjectured proportion of blue marbles, p, is usually called a parameter value. It’s just a way of indexing possible explanations of the data
    • In the example below, the proportion of surface water is the unknown parameter, but the conjecture could also be other things like sample size, treatment effect, group variation, etc.
    • There can also be multiple unknown parameters for the likelihood to consider.
    • Every parameter must have a corresponding prior probability assigned to it.
  • The relative number of ways that a value p can produce the data is usually called a likelihood.
    • It is derived by enumerating all the possible data sequences that could have happened and then eliminating those sequences inconsistent with the data (i.e. paths_consistent_with_data / total_paths).
    • As a model component, the likelihood is a function that gives the probability of an observation given a parameter value (conjecture)
      • “How likely is your observed sample, out of all possible samples of the same length?”
    • Example: The proportion of water to land on the earth:
      • W is distributed Binomially with N trials and a probability of p for W in each trial, \(W \sim \mbox{Binomial}(N, p)\)
      • “The count of ‘water’ observations (globe is tossed and finger lands on water), W, is distributed binomially, with probability p of ‘water’ on each toss of a globe and N tosses in total.”
      • Notation: \(L(p \ | \ W, N)\)
      • Assumptions:
        1. Observations are independent of each other
        2. The probability of observation of W (water) is the same for every observation
      • dbinom(x, size, prob)
        • Finds the probability (i.e. likelihood) of getting a certain number of successes (x) in a certain number of trials (size) where the probability of success on each trial is fixed (prob).
        • Args
          • x = # of observations of water (W)
          • size = sample size (N) (number of tosses)
          • prob = parameter value (conjecture), i.e. the hypothesized proportion of water on the earth (p); see the example below
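
For instance, assuming (for illustration) W = 6 waters observed in N = 9 tosses and a conjectured proportion of 0.5:

```r
# Likelihood of observing 6 waters in 9 tosses if the true proportion of water is 0.5
dbinom(6, size = 9, prob = 0.5)
#> [1] 0.1640625
```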
  • The prior plausibility of any specific p is usually called the prior probability.
    • A distribution of initial plausibilities for every value of a parameter
    • Expresses prior knowledge about a parameter and constrains estimates to reasonable ranges
    • Unless there’s already strong evidence for using a particular prior, multiple priors should be tried to see how sensitive the estimates are to the choice of a prior
    • Example where the prior is a probability distribution for the parameter:
      • p is distributed Uniformly between 0 and 1, (i.e. each conjecture is equally likely), \(p \sim \mbox{Uniform}(0, 1)\)
    • Weakly Informative or Regularizing priors: conservative; guards against inferences of strong association
      • Mathematically equivalent to penalized likelihood
  • The new, updated relative plausibility of a specific p is called the posterior probability.
  • The set of estimates, aka relative plausibilities of different parameter values, aka posterior probabilities, conditional on the data — is known as the posterior distribution or posterior density (e.g. \(Pr(p \ | \ N, W)\)).
  • Thoughts
    • The prior and posterior are probability densities over the parameter, each with an area of 1, and in the marble tables the individual posterior probabilities likewise sum to 1. The likelihood, however, viewed as a function of the parameter, need not integrate to 1, so the product of the prior and the likelihood generally does not have an area of 1. The denominator (i.e. the sum of products, or for densities the integral of those products) therefore standardizes each product so that the posterior density does have an area of 1 (see the quick check below).

Numerical Solvers for the Posterior Distribution

  • Grid Approximation - compute the posterior distribution over a finite grid of candidate parameter values, rather than over the full continuous range, for a set of unknown parameters
    • Doesn’t scale well as the number of parameters grows
    • Steps:
      1. Decide how many values you want to use in your grid (e.g. seq(from = 0, to = 1, length.out = 1000))
        • Number of parameter values in your grid is equal to the number of points in your posterior distribution
      2. Compute the prior value for each parameter value in your grid (e.g. rep(1, 1000), a uniform prior)
      3. Compute the likelihood for each grid value (e.g. dbinom(x, size, prob = p_grid))
      4. Multiply the likelihood by the prior; this is the unstandardized posterior
      5. Standardize that posterior by dividing by sum(unstd_posterior) (see the sketch below)
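
A minimal sketch of these steps in R, assuming the globe-tossing data of W = 6 waters in N = 9 tosses and a flat prior:

```r
# 1. Grid of candidate parameter values
p_grid <- seq(from = 0, to = 1, length.out = 1000)

# 2. Prior at each grid value (flat/uniform)
prior <- rep(1, 1000)

# 3. Likelihood of the data at each grid value
likelihood <- dbinom(6, size = 9, prob = p_grid)

# 4. Unstandardized posterior = likelihood x prior
unstd_posterior <- likelihood * prior

# 5. Standardize so the posterior sums to 1
posterior <- unstd_posterior / sum(unstd_posterior)

plot(p_grid, posterior, type = "l",
     xlab = "proportion of water (p)", ylab = "posterior plausibility")
```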
  • Quadratic approximation - near its peak, the posterior distribution can often be represented quite well by a Gaussian distribution. The log of a Gaussian distribution is a quadratic function (a parabola), hence the name.
    • Steps:
      1. Find the mode (peak) of the posterior, usually with an optimization (hill-climbing) algorithm. With a uniform prior this is equivalent to the maximum likelihood estimate (MLE)
      2. Estimate the curvature of the posterior near the mode using another numerical method; the curvature determines the standard deviation of the Gaussian approximation
    • Needs larger sample sizes. How large is model dependent.
    • {rethinking} function, quap( )
      • Inputs are a likelihood function (e.g. dbinom), a prior distribution (e.g. dunif), and the data for the likelihood function
      • Outputs the posterior mean and the standard deviation of the Gaussian approximation to the posterior (see the sketch below)
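
A minimal sketch of calling quap() for the globe-tossing example, assuming the {rethinking} package's alist-based formula interface and data of W = 6 waters and L = 3 lands (9 tosses total):

```r
library(rethinking)

globe_qa <- quap(
  alist(
    W ~ dbinom(W + L, p),   # binomial likelihood: W waters out of W + L tosses
    p ~ dunif(0, 1)         # flat prior on the proportion of water
  ),
  data = list(W = 6, L = 3)
)

# precis() reports the posterior mean and standard deviation
# of the quadratic (Gaussian) approximation
precis(globe_qa)
```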
  • MCMC only briefly mentioned