A priori - a type of knowledge that can be derived by reason alone
A priori analyses are performed as part of the research planning process.
A posteriori - a type of knowledge that expresses an empirical fact unknowable by reason alone.
Same as post-hoc; a post-hoc analysis is conducted after the experiment.
bias - Bias in statistics refers to a systematic error or distortion that leads to inaccurate or misleading results (also see unbiased estimator).
Models can inherit biases present in the data or introduce biases due to assumptions or simplifications made during the modeling process.
Estimates of population parameters (e.g., mean, proportion, regression coefficient) are typically based on a sample of data. An unbiased estimator is one whose expected value is equal to the true population parameter value, meaning that on average, the estimate will be correct. However, biased estimators can systematically over- or underestimate the true parameter value.
ceteris paribus - Latin for “all things being equal” or “other things held constant.”
closed form - a mathematical expression that uses a finite number of standard operations. It may contain constants, variables, certain well-known operations, and functions, but usually no limit, differentiation, or integration.
consistency - Requires that the outcome of the procedure with unlimited data should identify the underlying truth. Usage is restricted to cases where essentially the same procedure can be applied to any number of data items. In complicated applications of statistics, there may be several ways in which the number of data items may grow. For example, records for rainfall within an area might increase in three ways: records for additional time periods; records for additional sites within a fixed area; records for extra sites obtained by extending the size of the area. In such cases, the property of consistency may be limited to one or more of the possible ways a sample size can grow.
degrees of freedom - In the context of the variable–sample size tradeoff, usually means \(n - p\), where \(n\) is the number of rows and \(p\) is the number of variables. The more variables used in the model, the fewer degrees of freedom, and therefore less power and precision.
exchangeability - A sequence of random variables is exchangeable if we can swap around, or reorder, the variables in the sequence without changing their joint distribution.
Every IID (independent, identically distributed) sequence is exchangeable, but not the other way around. Every exchangeable sequence is, however, identically distributed.
Example: If you draw a sequence of red and blue marbles from a bag without replacement, the sample is exchangeable but not independent: drawing a red marble changes the probability of drawing a red or blue marble next (see the sketch below).
Time series, hierarchical, repeated-measures, spatial, etc. data are non-exchangeable.
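A minimal sketch of the marble example above, assuming a hypothetical bag of 3 red and 2 blue marbles. Drawing without replacement is exchangeable (the joint probability is unchanged by reordering) but not independent (the first draw changes the second draw's odds).

```python
from fractions import Fraction

red, blue = 3, 2          # hypothetical bag contents
total = red + blue

# Exchangeable: P(red then blue) equals P(blue then red)
p_rb = Fraction(red, total) * Fraction(blue, total - 1)
p_br = Fraction(blue, total) * Fraction(red, total - 1)
print(p_rb, p_br)          # 3/10 and 3/10

# Not independent: the chance the 2nd marble is red depends on the 1st draw
p_red2_given_red1 = Fraction(red - 1, total - 1)   # 2/4
p_red2_given_blue1 = Fraction(red, total - 1)      # 3/4
print(p_red2_given_red1, p_red2_given_blue1)
```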
efficiency - A test, estimator, etc. is more efficient than another test, estimator, etc. if it requires fewer observations to obtain the same level of performance.
ergodicity - the idea that a point of a moving system, either a dynamical system or a stochastic process, will eventually visit all parts of the space that the system moves in, in a uniform and random sense
external validity - Our estimates are externally valid if inferences and conclusions can be generalized from the population and setting studied to other populations and settings. (also see internal validity)
identifiable (aka point-identifiable) - theoretically possible to learn the true values of this model’s underlying parameters after obtaining an infinite number of observations from it (see non-identifiability, partially identifiable)
ill-conditioned - A matrix \(A\) is ill-conditioned when there is a huge gap between its largest and smallest singular values (obtained, e.g., from the singular value decomposition, SVD); the ratio of the two is called the condition number.
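A minimal sketch, using a made-up nearly collinear matrix, of how the condition number is the ratio of the largest to the smallest singular value.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])     # nearly collinear columns -> ill-conditioned

singular_values = np.linalg.svd(A, compute_uv=False)
cond = singular_values.max() / singular_values.min()
print(cond)                        # ~4e4; matches np.linalg.cond(A)
```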
internal validity - our estimates are internally valid if statistical inferences about causal effects are valid for the population being studied. (also see external validity)
intractable - problems for which there exist no efficient algorithms to solve them. Most intractable problems have an algorithm – the same algorithm – that provides a solution, and that algorithm is brute-force search.
locality - effects have causes and chains of cause and effect must be unbroken in space and time (not the case in ‘entanglement’)
marginalization (aka averaged over) - The process of eliminating one or more variables from a joint probability distribution or a multivariate statistical model to obtain the distribution or model for a subset of variables. The resulting distribution or model is called a marginal distribution or marginal model. It allows you to focus on the behavior of specific variables while considering the uncertainty associated with others.
For example, marginalizing over a joint distribution (i.e. many variables) gets you a marginal distribution (i.e. fewer variables). In other words, if you have a joint probability distribution for two variables \(X\) and \(Y\), the marginal distribution of \(X\) is obtained by summing or integrating over all possible values of \(Y\). Similarly, the marginal distribution of \(Y\) is obtained by summing or integrating over all possible values of \(X\).
Once you have the marginal distribution, you can compute conditional distributions. For example, after obtaining the marginal distribution \(P(X,Y)\) from the joint distribution \(P(X,Y,Z)\), you can compute the conditional distributions \(P(X|Y)\) and \(P(Y|X)\).
The uncertainty associated with \(Z\) is indirectly considered in the sense that the marginal distribution \(P(X,Y)\) accounts for all possible values of \(Z\) by integrating over them. However, \(P(X,Y)\) itself doesn’t provide explicit information about the uncertainty associated with \(Z\).
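A minimal sketch with a made-up discrete joint distribution \(P(X,Y,Z)\): summing over \(Z\) gives the marginal \(P(X,Y)\), and normalizing that table gives a conditional such as \(P(Y|X)\).

```python
import numpy as np

rng = np.random.default_rng(0)
joint_xyz = rng.random((2, 3, 4))          # axes: X, Y, Z (arbitrary sizes)
joint_xyz /= joint_xyz.sum()               # make it a proper probability table

p_xy = joint_xyz.sum(axis=2)               # marginalize (sum) over Z -> P(X, Y)
p_x = p_xy.sum(axis=1)                     # marginalize over Y too -> P(X)

p_y_given_x = p_xy / p_x[:, None]          # P(Y|X) = P(X, Y) / P(X)
print(p_y_given_x.sum(axis=1))             # each row sums to 1
```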
Marginalization in a modeling sense can be described as a weighted average (Claude Sonnet): \[
\mathbb{E}(Y|X) = \int \mathbb{E}(Y|X,Z)f(Z)dZ
\]
where \(f(Z)\) is the probability density function (pdf) of the omitted \(Z\) in the sample.
The goal is to estimate the expected value (or mean) of \(Y\) given the values of \(X\), which is denoted as \(\mathbb{E}(Y|X)\). However, since \(Z\) is omitted from the model, we cannot directly estimate \(\mathbb{E}(Y|X, Z)\), which is the conditional mean of \(Y\) given both \(X\) and \(Z\).
Instead, what the model provides is a weighted average of all possible conditional means \(\mathbb{E}(Y|X, Z)\) for each value of the omitted \(Z\), weighted by the distribution (frequency/probability) of \(Z\) in the sample data.
Each possible conditional mean \(\mathbb{E}(Y|X, Z)\) is weighted by how frequent/probable that value of \(Z\) is in the data, according to its pdf \(f(Z)\).
The weightings \(f(Z)\) ensure that values of \(Z\) that are more common in the sample get higher weights in the average.
This weighted average provides an estimate of \(\mathbb{E}(Y|X)\) that has “averaged over” or accounted for the distribution of the omitted \(Z\), even though \(Z\) itself is not explicitly included in the model.
In essence, it combines all the different conditional means corresponding to different values of \(Z\) into one overall mean, using the sample distribution of \(Z\) as weights.
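A minimal numeric sketch of this weighted-average view, using a made-up binary \(Z\) and hypothetical conditional means: the conditional means \(\mathbb{E}(Y|X, Z=z)\) are combined using the sample frequencies of \(Z\) as weights.

```python
# Hypothetical conditional means for one fixed value of X
e_y_given_x_z = {0: 2.0, 1: 5.0}     # E(Y | X, Z=0) and E(Y | X, Z=1)
p_z = {0: 0.7, 1: 0.3}               # sample distribution of the omitted Z

# E(Y|X) = sum_z E(Y|X, Z=z) * P(Z=z)  -- discrete analogue of the integral
e_y_given_x = sum(e_y_given_x_z[z] * p_z[z] for z in p_z)
print(e_y_given_x)                   # 2.0*0.7 + 5.0*0.3 = 2.9
```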
“When we condition on baseline characteristics in \(X\), these describe types of patients. So we obtain patient-type-specific tendencies of \(Y\). When \(X\) omits an important patient characteristic we obtain patient-type-specific values up to the resolution of how “type” is measured. Such conditional estimates will be marginal over omitted covariates, i.e., they will average over the sample distribution of omitted covariates. For linear models this is consequential only in not further reducing the residual variance, so some efficiency is lost. For nonlinear models such as logistic and Cox models, the consequence is that the treatment effect is a kind of weighted average over the sample distribution of omitted covariates that are important. That doesn’t make it wrong or unhelpful. The effect of this on the average is to underestimate the true treatment effect that compares like with like by conditioning on all important covariates.”
non-identifiability - the structure of the data and model does not make it possible to estimate the parameter’s value (i.e., two or more parametrizations of the model are observationally equivalent). Multicollinearity is a type of non-identifiability problem. (see identifiable, partially identifiable)
overdetermined system - In linear regression, when there are more observations than features, \(n > p\).
paired data - Data where there are two observations or measurements taken on the same subject or experimental unit under different conditions or at different times. The observations are “paired” because they are not independent – they are related in some way since they come from the same subject.
Before-After Study: Measuring some response variable on subjects before and after some treatment or intervention. For example, measuring blood pressure before and after taking a new medication (see the sketch after this list).
Case-Control Study: Measuring a variable on matched pairs of cases (with some condition) and controls (without the condition). The cases and controls are matched on variables like age, gender, etc.
Repeated Measures: Taking multiple measurements on the same subjects under different conditions or at different time points. For example, measuring reaction times for the same subjects under different drug dosages.
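A minimal sketch of a before-after (paired) comparison with made-up blood pressure values. Because the observations are paired, the analysis works with within-subject differences (here via a paired t-test) rather than treating the two columns as independent samples.

```python
import numpy as np
from scipy import stats

before = np.array([142, 150, 138, 160, 155, 147, 149, 151])
after  = np.array([135, 147, 136, 151, 149, 140, 143, 150])

diffs = before - after                              # within-subject differences
t_stat, p_value = stats.ttest_rel(before, after)    # paired t-test
print(diffs.mean(), t_stat, p_value)
```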
partial coefficient - The coefficient of a variable in a multivariable regression. In a simple regression, the coefficient of the variable is just called the “regression coefficient.”
partially identifiable (aka set identifiable) - non-identifiable but possible to learn the true values of a certain subset of the model parameters
robust - a “robust” estimator in statistics is one that is insensitive to outliers, whereas a “robust” estimator in econometrics is insensitive to heteroskedasticity and autocorrelation (Hyndman)
support (aka range) - the set of values that the random variable can take.
For discrete random variables, it is the set of all the realizations that have a strictly positive probability of being observed.
For continuous random variables, it is the set of all numbers whose probability density is strictly positive.
underspecification - In general, the solution to a problem is underspecified if there are many distinct solutions that solve the problem equivalently.
unbiased estimator - A statistic used to approximate a population parameter that is accurate on average.
“Accurate” in this sense means that its expected value equals the parameter, so it is neither a systematic overestimate nor underestimate. When a systematic over- or underestimate does happen, the mean of the difference is called a “bias.”
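A minimal simulation sketch, with made-up normal data, contrasting a biased and an unbiased estimator of the population variance: dividing by \(n\) systematically underestimates the variance, while dividing by \(n-1\) is correct on average.

```python
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0
n, reps = 10, 20_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(reps, n))
biased = samples.var(axis=1, ddof=0)      # divide by n
unbiased = samples.var(axis=1, ddof=1)    # divide by n - 1

print(biased.mean(), unbiased.mean())     # ~3.6 vs ~4.0 on average
```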
Weak Law of Large Numbers (Bernoulli’s theorem) - states that if you have a sample of independent and identically distributed random variables, then as the sample size grows larger, the sample mean will tend toward (converge in probability to) the population mean.
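A minimal sketch of this convergence, using made-up exponential data with population mean 2: the running sample mean settles toward the population mean as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=100_000)   # population mean = 2.0

running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, running_mean[n - 1])              # approaches 2.0
```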