Multilevel
Misc
- Also see Post-Hoc Analysis, general
- Packages
- {effectsize} - Has many of the metrics discussed here and others — with confidence intervals
Tukey Test
Difference in effects
Example: Is there a statistically significant difference between the estimated effects of the categories of the fixed effect, “Season” Data from Multilevel Modeling and Effects Statistics for Sports Scientists in R
library(multcomp) # pairwise comparisons <- glht(fit, linfct=mcp(Season="Tukey")) fit_tukey summary(fit_tukey) ## ## Simultaneous Tests for General Linear Hypotheses ## ## Multiple Comparisons of Means: Tukey Contrasts ## ## ## Fit: lmer(formula = Distance ~ Season + (1 | Athlete), data = data) ## ## Linear Hypotheses: ## Estimate Std. Error z value Pr(>|z|) ## Postseason - Inseason == 0 36.71 90.08 0.408 0.911 ## Preseason - Inseason == 0 1166.00 90.08 12.944 <1e-05 *** ## Preseason - Postseason == 0 1129.29 110.32 10.236 <1e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## (Adjusted p values reported -- single-step method)
emmeans(fit, specs = pairwise ~ Season) ## $emmeans ## Season emmean SE df lower.CL upper.CL ## Inseason 5104 137 20.8 4818 5389 ## Postseason 5140 151 30.6 4831 5449 ## Preseason 6270 151 30.6 5961 6579 ## Degrees-of-freedom method: kenward-roger ## Confidence level used: 0.95 ## $contrasts ## contrast estimate SE df t.ratio p.value ## Inseason - Postseason -36.7 90.1 978 -0.408 0.9125 ## Inseason - Preseason -1166.0 90.1 978 -12.944 <.0001 ## Postseason - Preseason -1129.3 110.3 978 -10.236 <.0001 -of-freedom method: kenward-roger Degrees: tukey method for comparing a family of 3 estimates P value adjustment
- Interpretation
- There is NOT a difference between the effect that Postseason has on Distance and the effect that Inseason has on Distance.
- There is a difference with between the other two pairs of categores
- Estimated mean distance given season type
- I’m not sure these estimates are appropriate in this situation since the Season variable is inherently unbalanced.
- Also see emmeans Post-Hoc Analysis, emmeans
- Interpretation
Cohen’s D
Standardized difference in means given a grouping variable
Generally recommended to use \(g_{\text{rm}}\) or \(g_{\text{av}}\)
Standard practice is use whichever one of those two values is closer to \(d_s\) , because it helps make the result comparable with between-subject studies.
Correction for bias can be important when dof < 50
Appropriate Version Per Use Case
Use Version Independent groups, power analyses where \(\sigma_\text{pop}\) is known or \(\sigma\) is calculated with \(n\) \(d_{\text{pop}}\) Independent groups, power analyses where \(\sigma_\text{pop}\) is unknown or \(\sigma\) is calculated with \(n-1\) \(d_s\) Independent groups, corrects for small sample bias; report for use in meta-analyses \(g\) Independent groups, when treatment might affect SD \(\Delta\) Correlated groups; generally recommended over \(g_{\text{rm}}\) \(g_{\text{av}}\) Correlated groups; more conservative than \(g_{\text{av}}\) \(g_{\text{rm}}\) Correlated groups; power analyses \(d_z\)
Notes from: Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs (Lakens)
Can be used to compare effects across studies, even when the dependent variables are measured in different ways
Examples
When one study uses 7-point scales to measure dependent variables, while the other study uses 9-point scales
When completely different measures are used, such as when one study uses self-report measures, and another study used physiological measurements.
The bias-corrected version is known as Hedges’ g, and in the r family of effect sizes, the correction for eta squared (η2) is known as omega squared (ω2)
Guidelines
Range: 0 to \(\infty\)
Cohen (1992)
- |d| < 0.2 “negligible”
- |d| < 0.5 “small”
- |d| < 0.8 “medium”
- otherwise “large”
Values should not be interpreted rigidly
- e.g. Small effect sizes can have large consequences, such as an intervention that leads to a reliable reduction in suicide rates with an effect size of d = 0.1.
The only reason to use these benchmarks is when the findings are extremely novel, and cannot be compared to related findings in the literature.
Two groups of Independent Observations (Between-Subjects)
\[ \begin{align} d_s &= \frac{\bar X_1 - \bar X_2}{\sqrt{\frac{(n_1-1)SD^2_1 + (n_2-1)SD^2_2}{n_1 + n_2 - 2}}}\\ &= t\;\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\\ & \approx \frac{2t}{\sqrt{N}} \end{align} \]
Where the denominator is the pooled standard deviation
\(t\) is the t-value of two-sample t-test
Typically used in an a priori power analysis for between-subjects designs
Hedges’ g (bias-corrected)
\[ g_s = d_s \times \left(1-\frac{3}{4(n_1 + n_2) - 9}\right) \]
- The same correction is used for all types of Cohen’s d
- The difference between Hedges’s gs and Cohen’s ds is very small, especially in sample sizes above 20
Interpretation: A percentage of the standard deviation. Best to relate it to other effect sizes in the literature and it’s practical consequences if possible.
- e.g. \(d_s = 0.5\) says the difference in means is half a standard deviation.
Whenever standard deviations differ substantially between groups, Glass’s \(\Delta\) should also be reported
One Sample or Correlated Samples (Within-Subjects)
\[ \begin{aligned} &d_z = \frac{M_{\text{diff}}}{S_{\text{diff}}} = \frac{t}{\sqrt{n}} \\ &\begin{aligned} \text{where} \quad S_{\text{diff}}^{(1)} &= \sqrt{\frac{\sum(X_{\text{diff}} - M_{\text{diff}})^2}{N-1}} \\ S_{\text{diff}}^{(2)} &= \sqrt{\text{SD}_1^2 + \text{SD}_2^2 - (2\cdot r\cdot \text{SD}_1 \cdot \text{SD}_2)} \end{aligned} \end{aligned} \]
- \(M_{\text{diff}}\) is the difference between the mean (M) of the difference scores and the comparison value, \(\mu\) (typically 0)
- For paired data, the mean of the difference scores is equal to the difference in means of the two groups, so you may see it described or calculated either way.
- \(X_{\text{diff}}\) are the difference scores (i.e. the difference between the repeated measurements)
- \(S_\text{diff}\) is the SD of the difference scores.
- It can be calculated two different ways, but I doubt both are equal to each other.
- The second way seems to be the preferred way since it incorporates a correlation measure.
- \(t\) is the t-value of a paired samples t-test
- \(r\) is the correlation between measurements
- \(M_{\text{diff}}\) is the difference between the mean (M) of the difference scores and the comparison value, \(\mu\) (typically 0)
Repeated Measures (Within-Subjects)
\[ d_{\text{rm}} = d_z \cdot \sqrt{2(1-r)} \]
Alternative
\[ d_{\text{av}} = \frac{M_{\text{diff}}}{\frac{\text{SD}_1 + \text{SD}_2}{2}} \]
- Ignores the correlation between measures
If it is believe that the intervention/treatment affected the SD after the intervention, then it is advised to only use either (pre-treatment) \(\text{SD}_1\) (recommended) or (post-treatment) \(\text{SD}_2\) and report which one is used. The calculated effect is then known as Glass’s \(\boldsymbol{\Delta}\)
Example: Distance (outcome), Season (Grouping variable)
Comparing Distance means given Season (3 levels) type
Data from Multilevel Modeling and Effects Statistics for Sports Scientists in R
Another package, {effectsize}, is similar in that its formula arg only allows for grouping variables with only 2 levels
- May have other features though, since it’s part of the easystats suite.
library(effsize) ::cohen.d(preseason_data$Distance, inseason_data$Distance) effsize## ## Cohen's d ## ## d estimate: 0.9157833 (large) ## 95 percent confidence interval: ## lower upper ## 0.7493283 1.0822383
- Season is a categorical fixed effect with 3 levels
- Other Available Arguments: hedges.correction, pooled, paired, within, noncentral
library(rstatix) %>% data ::cohens_d(Distance ~ Season, ci = TRUE) rstatix #> .y. group1 group2 effsize n1 n2 conf.low conf.high magnitude #> * <chr> <chr> <chr> <dbl> <int> <int> <dbl> <dbl> <ord> #> 1 Distance Inseason Postseason -0.0317 600 200 -0.18 0.13 negligible #> 2 Distance Inseason Preseason -0.877 600 200 -1.06 -0.71 large #> 3 Distance Postseason Preseason -0.884 200 200 -1.09 -0.68 large
- Same types of arguments as {effsize} are available and also bootstrap CIs
- Magnitude (interpretation) by Cohen’s (1992) guidelines
<- esci::estimate_mdiff_two( estimate data = mydata, outcome_variable = Prediction, grouping_variable = Exposure, conf_level = 0.95, assume_equal_variance = TRUE )$es_smd |> estimate::gather(key = "type", tidyrvalue = "value") #> type value #> 1 outcome_variable_name Prediction #> 2 grouping_variable_name Exposure #> 3 effect 20 ‒ 1 #> 4 effect_size 0.571611929854665 #> 5 LL 0.327273973938463 #> 6 UL 0.81492376943417 #> 7 numerator 11.3842850063322 #> 8 denominator 19.8603120279963 #> 9 SE 0.124402744976289 #> 10 df 268 #> 11 d_biased 0.573217832141019 $es_smd_properties$message estimate#> This standardized mean difference is called d_s because the standardizer used was s_p. d_s has been corrected for bias. Correction for bias can be important when df < 50. See the rightmost column for the biased value.
- Fairly large effect: d = 0.57 95% CI [0.33, 0.81] and the confidence interval is fairly narrow
- Makes available the type of cohen’s d and the denominator used
Common Language Effect Size (CLES)
AKA Probability of Superiority
Converts the effect size into a percentage which is supposed to more understandable for laymen
Misc
- Notes from The Common Language Effect Size Statistic
- Packages
Interpretation
- Between-Subjects: The probability that a randomly sampled person from the first group will have a higher observed measurement than a randomly sampled person from the second group
- Within-Subjects: The probability that an individual has a higher value on one measurement than the other.
Formula
Assumes variables are normally distributed and \(\sigma_1 = \sigma_2\)
Original paper gives some evidence that these formulas are pretty robust to violations though.
Recommended only for continuous variables
Between-Subjects
\[ \begin{align} \tilde d &= \frac{|M_1 - M_2|}{\sqrt{p_1\text{SD}_1^2 + p_2\text{SD}_2^2}} \\ Z &= \frac{\tilde d}{\sqrt{2}} \end{align} \]
- \(M_i\): The mean of the ith group variable
- \(p_i\): The proportion of the sample size of the ith group variable
Within Subjects
\[ Z = \frac{|M_1 - M_2|}{\sqrt{\operatorname{SD}_1^2 + \operatorname{SD}_2^2 - 2 \times r \times \operatorname{SD}_1 \times \operatorname{SD}_2}} \]
- \(r\) is the Pearson correlation between the group variables
Alternative Generalization
\(A_{1,2} = P(X_1 > X_2) + 0.5 \times P(X_1 = X_2)\)
Applies for any, not necessarily continuous, distribution that is at least ordinally scaled
Equal to CL in the continuous case
Interpreted as an estimate of the value of CL that would be obtained if the distribution of X were continuous.
Eta Squared
Also see Post-Hoc Analysis, General >> Frequentist Difference-in-Means >> Nonparametric (Rank Eta-Squared, Rank Epsillon-Squared)
Notes from: Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs (Lakens)
Effect Size for ANOVA
Measures the proportion of the variation in Y that is associated with membership of the different groups defined by X, or the sum of squares of the effect divided by the total sum of squares
Eta Squared is an uncorrected effect size estimate that estimates the amount of variance explained based on the sample, and not based on the entire population.
Partial Eta Squared (\(\eta_p^2\)) to improve the comparability of effect sizes between studies, which expresses the sum of squares of the effect in relation to the sum of squares of the effect and the sum of squares of the error associated with the effect.
Although \(\eta_p^2\) is more useful when the goal is to compare effect sizes across studies, it is not perfect, because \(\eta_p^2\) differs when the same two means are compared in a within-subjects design or a between-subjects design.
An \(\eta^2\) of 0.13 means that 13% of the total variance can be accounted for by group membership.
CIs should be at 90%, because if you use 95%, it’s possible that even with a significant F-test, the CI will contain 0. For 90%, this doesn’t happen.
Eta Squared
\[ \eta^2 = \frac{\text{SS}_{\text{effect}}}{\text{SS}_{\text{total}}} \]
- \(\text{SS}_{\text{effect}}\) and \(\text{SS}_{\text{total}}\) are obtained from the ANOVA results
Omega Squared - The correction for eta squared (\(\eta^2\)) is known as omega squared (\(\omega^2\)). Still biased but less biased. The difference is typically small, and the bias decreases as the sample size increases.
\[ \begin{align} \omega^2 &= \frac{\operatorname{df}_{\text{effect}}(\operatorname{MS_{\text{effect}}}-\operatorname{MS_{\text{error}}})}{\operatorname{SS_{\text{total}}} + \operatorname{MS_{\text{error}}}} \quad \text{(between-subjects)} \\ \omega^2 &= \frac{\operatorname{df}_{\text{effect}}(\operatorname{MS_{\text{effect}}}-\operatorname{MS_{\text{error}}})}{\operatorname{SS_{\text{total}}} + \operatorname{MS_{\text{subjects}}}} \quad \text{(within-subjects)} \\ \end{align} \]
Epsilon Squared - Analogous to adjusted R2 (Allen, 2017, p. 382), and has been found to be less biased (Carroll & Nordholm, 1975) even though Omega Squared is more popular.
Partial Eta Squared
\[ \begin{align} \eta_p^2 &= \frac{\operatorname{SS_{\text{effect}}}}{\operatorname{SS_{\text{effect}}} + \operatorname{SS_{\text{error}}}} \quad \text{(fixed and measured variables)}\\ \eta_p^2 &= \frac{F \times \operatorname{df}_{\text{effect}}}{F \times \operatorname{df}_{\text{effect}} + \operatorname{df}_{\text{error}}} \quad \text{(fixed variables)} \end{align} \]
Fixed (e.g., manipulated), not random (e.g., measured)
Bias-Lessened
\[ \omega_p^2 = \frac{\operatorname{df}_{\text{effect}}(\operatorname{MS_{\text{effect}}}-\operatorname{MS_{\text{error}}})}{\operatorname{df}_{\text{effect}} \times \operatorname{MS_{\text{effect}}} + (N - \operatorname{df}_{\text{effect}}) \times \operatorname{MS_{\text{error}}}} \]
- Same equation whether it’s for between-subject designs and within-subject designs
Recommend researchers report \(\eta_G^2\) and/or \(\eta_p^2\), at least until generalized omega-squared is automatically provided by statistical software packages
- For designs where all factors are manipulated between participants, \(\eta_p^2\) and \(\eta_G^2\) are identical, so either effect size can be reported. For within-subjects designs and mixed designs where all factors are manipulated, \(\eta_p^2\) can always be calculated from the F-value and the degrees of freedom using formula 13, but \(\eta_G^2\) cannot be calculated from the reported results,and therefore I recommend reporting \(\eta_G^2\) for these designs
- supplementary spreadsheet provides a relatively easy way to calculate \(\eta_G^2\) for commonly used designs. For designs with measured factors or covariates, neither \(\eta_p^2\) nor \(\eta_G^2\) can be calculated from the
Appropriate Version Per Use Cases
Use Case Version Less Biased Version Comparisons within a single study \(\eta^2\) \(\omega^2\) Power analyses, and for comparisons of effect sizes across studies with the same experimental design \(\eta_p^2\) \(\omega_p^2\) Meta-Analyses to compare across various experimental designs \(\eta_G^2\) \(\omega_G^2\) Guidelines
Cohen’s benchmarks were developed for comparisons between unrestricted populations (e.g., men vs. women), and using these benchmarks when interpreting the \(\eta_p^2\) effect size in designs that include covariates or repeated measures is not consistent with the onsiderations upon which the benchmarks were based.
- Although \(\eta_G^2\) can be compared using these guidelines, it is preferable to compare effect sizes with those in the literature.
Field (2013) (source)
Range Interpretation ES \(\lt\) 0.01 Very Small 0.01 \(\le\) ES \(\lt\) 0.06 Small 0.06 \(\le\) ES \(\lt\) 0.14 Medium ES \(\ge\) 0.14 Large Jacob Cohen (1992) (source) - Applicable to One-Way ANOVAs or Partial Eta or Partial Omega in multiway ANOVA.
Range Interpretation ES \(\lt\) 0.02 Very Small 0.02 \(\le\) ES \(\lt\) 0.13 Small 0.13 \(\le\) ES \(\lt\) 0.26 Medium ES \(\ge\) 0.26 Large