Multilevel

Misc

Also see Post-Hoc Analysis, general
Packages
- {effectsize} - Has many of the metrics discussed here and others — with confidence intervals

Tukey Test

Difference in effects

Example: Is there a statistically significant difference between the estimated effects of the categories of the fixed effect, “Season”? Data from Multilevel Modeling and Effects Statistics for Sports Scientists in R

{multcomp}
{emmeans}

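library(lme4)
library(multcomp)
# random-intercept model (same formula as the "Fit:" line in the output below)
fit <- lmer(Distance ~ Season + (1 | Athlete), data = data)
# all pairwise (Tukey) comparisons between the levels of Season
fit_tukey <- glht(fit, linfct = mcp(Season = "Tukey"))
summary(fit_tukey)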
## 
## Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: lmer(formula = Distance ~ Season + (1 | Athlete), data = data)
## 
## Linear Hypotheses:
##                            Estimate Std. Error z value Pr(>|z|)   
## Postseason - Inseason == 0        36.71   90.08   0.408    0.911   
## Preseason - Inseason == 0       1166.00   90.08  12.944   <1e-05 ***
## Preseason - Postseason == 0     1129.29  110.32  10.236   <1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

library(emmeans)
emmeans(fit, specs = pairwise ~ Season)
## $emmeans
##  Season    emmean  SE   df lower.CL upper.CL
##  Inseason    5104 137 20.8     4818    5389
##  Postseason  5140 151 30.6     4831    5449
##  Preseason   6270 151 30.6     5961    6579
## Degrees-of-freedom method: kenward-roger 
## Confidence level used: 0.95 
## $contrasts
##  contrast               estimate    SE  df t.ratio p.value
##  Inseason - Postseason     -36.7  90.1 978  -0.408 0.9125 
##  Inseason - Preseason    -1166.0  90.1 978 -12.944 <.0001 
##  Postseason - Preseason  -1129.3 110.3 978 -10.236 <.0001 
## Degrees-of-freedom method: kenward-roger 
## P value adjustment: tukey method for comparing a family of 3 estimates

Interpretation
- There is no statistically significant difference between the effect that Postseason has on Distance and the effect that Inseason has on Distance (adjusted p ≈ 0.91).
- There is a statistically significant difference between the other two pairs of categories (Preseason vs. Inseason and Preseason vs. Postseason).
- Estimated mean distance given season type (the $emmeans table)
  - I’m not sure these estimates are appropriate in this situation since the Season variable is inherently unbalanced.
  - Also see Post-Hoc Analysis, emmeans

Cohen’s D

Standardized difference in means given a grouping variable

Generally recommended to use \(g_{\text{rm}}\) or \(g_{\text{av}}\)

Standard practice is to use whichever one of those two values is closer to \(d_s\), because it helps make the result comparable with between-subject studies.
Correction for bias can be important when dof < 50

Appropriate Version Per Use Case

Use	Version
Independent groups, power analyses where \(\sigma_\text{pop}\) is known or \(\sigma\) is calculated with \(n\)	\(d_{\text{pop}}\)
Independent groups, power analyses where \(\sigma_\text{pop}\) is unknown or \(\sigma\) is calculated with \(n-1\)	\(d_s\)
Independent groups, corrects for small sample bias; report for use in meta-analyses	\(g\)
Independent groups, when treatment might affect SD	\(\Delta\)
Correlated groups; generally recommended over \(g_{\text{rm}}\)	\(g_{\text{av}}\)
Correlated groups; more conservative than \(g_{\text{av}}\)	\(g_{\text{rm}}\)
Correlated groups; power analyses	\(d_z\)

Notes from: Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs (Lakens)
Can be used to compare effects across studies, even when the dependent variables are measured in different ways
- Examples
  - When one study uses 7-point scales to measure dependent variables, while the other study uses 9-point scales
  - When completely different measures are used, such as when one study uses self-report measures, and another study used physiological measurements.
The bias-corrected version is known as Hedges’ g, and in the r family of effect sizes, the correction for eta squared (\(\eta^2\)) is known as omega squared (\(\omega^2\))
Guidelines
- Range: \(|d|\) runs from 0 to \(\infty\) (the sign only indicates the direction of the difference)
- Cohen (1992)
  - |d| < 0.2 “negligible”
  - |d| < 0.5 “small”
  - |d| < 0.8 “medium”
  - otherwise “large”
- Others: Automated Interpretation of Indices of Effect Size
- Values should not be interpreted rigidly
  - e.g. Small effect sizes can have large consequences, such as an intervention that leads to a reliable reduction in suicide rates with an effect size of d = 0.1.
- The only reason to use these benchmarks is when the findings are extremely novel, and cannot be compared to related findings in the literature.
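The Cohen (1992) cutoffs listed above can be applied mechanically. A minimal base-R sketch follows; the function name `interpret_d` is made up for illustration and is not from any package, and the labels should still be read with the caveats above in mind.

```r
# Sketch: label |d| using the Cohen (1992) cutoffs listed in these notes.
# interpret_d() is a hypothetical helper, not a function from any package.
interpret_d <- function(d) {
  cut(abs(d),
      breaks = c(0, 0.2, 0.5, 0.8, Inf),
      labels = c("negligible", "small", "medium", "large"),
      right = FALSE)  # intervals are [lower, upper), matching "|d| < 0.2", etc.
}

labels_out <- interpret_d(c(0.1, -0.35, 0.57, 0.92))
print(labels_out)
```

Using `abs(d)` means a negative difference is labeled by its magnitude only.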
Two groups of Independent Observations (Between-Subjects)

\[ \begin{align} d_s &= \frac{\bar X_1 - \bar X_2}{\sqrt{\frac{(n_1-1)SD^2_1 + (n_2-1)SD^2_2}{n_1 + n_2 - 2}}}\\ &= t\;\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\\ & \approx \frac{2t}{\sqrt{N}} \end{align} \]
- Where the denominator is the pooled standard deviation
- \(t\) is the t-value of a two-sample t-test (equal variances assumed)
- Typically used in an a priori power analysis for between-subjects designs
- Hedges’ g (bias-corrected)
  
  \[ g_s = d_s \times \left(1-\frac{3}{4(n_1 + n_2) - 9}\right) \]
  - The same correction is used for all types of Cohen’s d
  - The difference between Hedges’ \(g_s\) and Cohen’s \(d_s\) is very small, especially in sample sizes above 20
- Interpretation: The difference in means expressed as a fraction (or multiple) of the standard deviation. Best to relate it to other effect sizes in the literature and its practical consequences if possible.
  - e.g. \(d_s = 0.5\) says the difference in means is half a standard deviation.
- Whenever standard deviations differ substantially between groups, Glass’s \(\Delta\) should also be reported
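As a rough check on the between-subjects formulas above, here is a minimal base-R sketch that computes \(d_s\) from the pooled SD, recovers it from the t-value, and applies the Hedges correction. The vectors `x1` and `x2` are made-up illustrative data, and treating the second group as the "control" for Glass’s \(\Delta\) is an assumption.

```r
# Hypothetical scores for two independent groups (illustrative only)
x1 <- c(5.1, 6.3, 5.8, 7.0, 6.1, 5.5)
x2 <- c(4.2, 5.0, 4.8, 5.9, 4.4, 5.1)
n1 <- length(x1)
n2 <- length(x2)

# Pooled standard deviation (the denominator of d_s)
sd_pooled <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))

# Cohen's d_s from the means and the pooled SD
d_s <- (mean(x1) - mean(x2)) / sd_pooled

# The same value recovered from the equal-variance two-sample t statistic
t_val <- unname(t.test(x1, x2, var.equal = TRUE)$statistic)
d_s_from_t <- t_val * sqrt(1 / n1 + 1 / n2)

# Hedges' g_s: small-sample bias correction
g_s <- d_s * (1 - 3 / (4 * (n1 + n2) - 9))

# Glass's delta: standardize by one group's SD only
# (assumption: the second group plays the role of the control group)
glass_delta <- (mean(x1) - mean(x2)) / sd(x2)

print(c(d_s = d_s, d_s_from_t = d_s_from_t, g_s = g_s, glass_delta = glass_delta))
```

With 12 observations in total the correction factor is \(1 - 3/39\), so \(g_s\) is about 8% smaller than \(d_s\); the gap shrinks quickly as the sample grows.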
One Sample or Correlated Samples (Within-Subjects)

\[ \begin{aligned} &d_z = \frac{M_{\text{diff}}}{S_{\text{diff}}} = \frac{t}{\sqrt{n}} \\ &\begin{aligned} \text{where} \quad S_{\text{diff}}^{(1)} &= \sqrt{\frac{\sum(X_{\text{diff}} - M_{\text{diff}})^2}{N-1}} \\ S_{\text{diff}}^{(2)} &= \sqrt{\text{SD}_1^2 + \text{SD}_2^2 - (2\cdot r\cdot \text{SD}_1 \cdot \text{SD}_2)} \end{aligned} \end{aligned} \]
- \(M_{\text{diff}}\) is the difference between the mean (M) of the difference scores and the comparison value, \(\mu\) (typically 0)
  - For paired data, the mean of the difference scores is equal to the difference in means of the two groups, so you may see it described or calculated either way.
- \(X_{\text{diff}}\) are the difference scores (i.e. the difference between the repeated measurements)
- \(S_\text{diff}\) is the SD of the difference scores.
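  - It can be calculated two different ways, and both give the same value when the sample SDs and the sample Pearson correlation are used, since \(\operatorname{Var}(X_1 - X_2) = \text{SD}_1^2 + \text{SD}_2^2 - 2 \cdot r \cdot \text{SD}_1 \cdot \text{SD}_2\).
  - The second way is useful when only summary statistics (the two SDs and their correlation) are available rather than the raw difference scores.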
- \(t\) is the t-value of a paired samples t-test
- \(r\) is the correlation between measurements
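The two expressions for \(S_{\text{diff}}\) and the \(t/\sqrt{n}\) shortcut can be compared numerically. A minimal base-R sketch, using made-up paired measurements (`pre` and `post` are illustrative names, with a comparison value of 0):

```r
# Hypothetical paired measurements on the same 8 subjects (illustrative only)
pre  <- c(12.1, 14.3, 11.8, 15.2, 13.0, 12.7, 14.9, 13.5)
post <- c(13.4, 15.1, 12.0, 16.8, 13.9, 13.1, 16.2, 14.0)
n <- length(pre)

x_diff <- post - pre       # difference scores
m_diff <- mean(x_diff)     # comparison value mu = 0

# Way 1: SD of the difference scores
s_diff_1 <- sd(x_diff)

# Way 2: from the two sample SDs and the sample Pearson correlation
r <- cor(pre, post)
s_diff_2 <- sqrt(var(pre) + var(post) - 2 * r * sd(pre) * sd(post))

# d_z from its definition
d_z <- m_diff / s_diff_1

# d_z recovered from the paired-samples t statistic
t_val <- unname(t.test(post, pre, paired = TRUE)$statistic)
d_z_from_t <- t_val / sqrt(n)

print(c(s_diff_1 = s_diff_1, s_diff_2 = s_diff_2, d_z = d_z, d_z_from_t = d_z_from_t))
```

Both versions of \(S_{\text{diff}}\) agree up to floating-point error here, because the sample variance of a difference equals the sum of the variances minus twice the covariance.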
Repeated Measures (Within-Subjects)

\[ d_{\text{rm}} = d_z \cdot \sqrt{2(1-r)} \]
- Alternative
  
  \[ d_{\text{av}} = \frac{M_{\text{diff}}}{\frac{\text{SD}_1 + \text{SD}_2}{2}} \]
  - Ignores the correlation between measures
- If it is believed that the intervention/treatment affected the SD after the intervention, then it is advised to use only one of (pre-treatment) \(\text{SD}_1\) (recommended) or (post-treatment) \(\text{SD}_2\) as the standardizer and report which one is used. The calculated effect is then known as Glass’s \(\boldsymbol{\Delta}\)
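The within-subjects variants above differ only in their standardizer, which a small helper makes easy to see side by side. This is a base-R sketch under stated assumptions: `within_subject_d` is a made-up name (not a package function), the data are illustrative, and the first vector is assumed to be the pre-treatment measurement.

```r
# Sketch: within-subjects standardized mean differences from two paired vectors.
# within_subject_d() is a hypothetical helper, not from any package.
within_subject_d <- function(x1, x2) {
  m_diff <- mean(x2) - mean(x1)
  r <- cor(x1, x2)
  s_diff <- sqrt(var(x1) + var(x2) - 2 * r * sd(x1) * sd(x2))
  d_z <- m_diff / s_diff
  list(
    d_z = d_z,
    d_rm = d_z * sqrt(2 * (1 - r)),           # corrects d_z for the correlation
    d_av = m_diff / ((sd(x1) + sd(x2)) / 2),  # ignores the correlation
    glass_delta = m_diff / sd(x1)             # assumes x1 is pre-treatment
  )
}

# Hypothetical paired data (illustrative only)
pre  <- c(12.1, 14.3, 11.8, 15.2, 13.0, 12.7, 14.9, 13.5)
post <- c(13.4, 15.1, 12.0, 16.8, 13.9, 13.1, 16.2, 14.0)
res <- within_subject_d(pre, post)
print(unlist(res))
```

When the two SDs are equal, \(d_{\text{rm}}\) and \(d_{\text{av}}\) coincide, which is one way to see why they are usually close in practice.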

Example: Distance (outcome), Season (Grouping variable)

Comparing Distance means given Season (3 levels) type
Data from Multilevel Modeling and Effects Statistics for Sports Scientists in R

Another package, {effectsize}, is similar in that its formula argument only accepts grouping variables with 2 levels

May have other features though, since it’s part of the easystats suite.

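library(effsize)
# preseason_data and inseason_data are assumed to be subsets of the data for those two Season levels
effsize::cohen.d(preseason_data$Distance, inseason_data$Distance)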
## 
## Cohen's d
## 
## d estimate: 0.9157833 (large)
## 95 percent confidence interval:
##    lower    upper 
## 0.7493283 1.0822383

Season is a categorical fixed effect with 3 levels
Other Available Arguments: hedges.correction, pooled, paired, within, noncentral

library(rstatix)
data %>% 
  rstatix::cohens_d(Distance ~ Season, ci = TRUE)

#>     .y.        group1     group2    effsize     n1     n2   conf.low conf.high magnitude 
#> *  <chr>       <chr>      <chr>       <dbl>   <int>  <int>    <dbl>    <dbl> <ord>     
#> 1 Distance   Inseason  Postseason   -0.0317    600    200     -0.18     0.13  negligible
#> 2 Distance   Inseason   Preseason    -0.877    600    200     -1.06    -0.71       large     
#> 3 Distance Postseason   Preseason    -0.884    200    200     -1.09    -0.68       large

Same types of arguments as {effsize} are available and also bootstrap CIs
Magnitude (interpretation) by Cohen’s (1992) guidelines

estimate <- esci::estimate_mdiff_two(
  data = mydata,
  outcome_variable = Prediction,
  grouping_variable = Exposure,
  conf_level = 0.95,
  assume_equal_variance = TRUE
)
estimate$es_smd |> 
  tidyr::gather(key = "type", 
                value = "value")
#>                      type             value
#> 1   outcome_variable_name        Prediction
#> 2  grouping_variable_name          Exposure
#> 3                  effect            20 ‒ 1
#> 4             effect_size 0.571611929854665
#> 5                      LL 0.327273973938463
#> 6                      UL  0.81492376943417
#> 7               numerator  11.3842850063322
#> 8             denominator  19.8603120279963
#> 9                      SE 0.124402744976289
#> 10                     df               268
#> 11               d_biased 0.573217832141019

estimate$es_smd_properties$message
#> This standardized mean difference is called d_s because the standardizer used was s_p. d_s has been corrected for bias. Correction for bias can be important when df < 50.  See the rightmost column for the biased value.

Medium-sized effect by the Cohen (1992) guidelines above: d = 0.57, 95% CI [0.33, 0.81], and the confidence interval is fairly narrow
Makes available the type of Cohen’s d and the denominator used

Common Language Effect Size (CLES)

AKA Probability of Superiority
Converts the effect size into a percentage, which is supposed to be more understandable for laymen
Misc
- Notes from The Common Language Effect Size Statistic
- Packages
  - {ebtools::cles}
Interpretation
- Between-Subjects: The probability that a randomly sampled person from the first group will have a higher observed measurement than a randomly sampled person from the second group
- Within-Subjects: The probability that an individual has a higher value on one measurement than the other.
Formula
- Assumes variables are normally distributed and \(\sigma_1 = \sigma_2\)
  - Original paper gives some evidence that these formulas are pretty robust to violations though.
  - Recommended only for continuous variables
- Between-Subjects
  
  \[ \begin{align} \tilde d &= \frac{|M_1 - M_2|}{\sqrt{p_1\text{SD}_1^2 + p_2\text{SD}_2^2}} \\ Z &= \frac{\tilde d}{\sqrt{2}} \end{align} \]
  - \(M_i\): The mean of the i^th group variable
  - \(p_i\): The proportion of the sample size of the i^th group variable
- Within Subjects
  
  \[ Z = \frac{|M_1 - M_2|}{\sqrt{\operatorname{SD}_1^2 + \operatorname{SD}_2^2 - 2 \times r \times \operatorname{SD}_1 \times \operatorname{SD}_2}} \]
  - \(r\) is the Pearson correlation between the group variables
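The formulas above stop at \(Z\); the CLES itself is the standard normal probability below \(Z\), i.e. \(\Phi(Z)\). That last step is not written out in these notes, so treat it as an assumption taken from the usual definition. A minimal base-R sketch with made-up function names and inputs:

```r
# Sketch: CLES from summary statistics. The function names are hypothetical;
# the final step CLES = pnorm(Z) is the usual normal-theory definition.
cles_between <- function(m1, m2, sd1, sd2, n1, n2) {
  p1 <- n1 / (n1 + n2)
  p2 <- n2 / (n1 + n2)
  d_tilde <- abs(m1 - m2) / sqrt(p1 * sd1^2 + p2 * sd2^2)
  z <- d_tilde / sqrt(2)
  pnorm(z)
}

cles_within <- function(m1, m2, sd1, sd2, r) {
  z <- abs(m1 - m2) / sqrt(sd1^2 + sd2^2 - 2 * r * sd1 * sd2)
  pnorm(z)
}

# Illustrative inputs: a one-SD difference between two equally sized groups
cl_b <- cles_between(m1 = 1, m2 = 0, sd1 = 1, sd2 = 1, n1 = 50, n2 = 50)
# Illustrative inputs: same means and SDs, measurements correlated at r = 0.5
cl_w <- cles_within(m1 = 1, m2 = 0, sd1 = 1, sd2 = 1, r = 0.5)
print(c(between = cl_b, within = cl_w))
```

Because the absolute difference is used, the result is always at least 0.5; it describes the group with the higher mean.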
Alternative Generalization
- \(A_{1,2} = P(X_1 > X_2) + 0.5 \times P(X_1 = X_2)\)
- Applies for any, not necessarily continuous, distribution that is at least ordinally scaled
  - Equal to CL in the continuous case
  - Interpreted as an estimate of the value of CL that would be obtained if the distribution of X were continuous.
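The generalization \(A_{1,2}\) can be estimated directly by comparing every observation in one group with every observation in the other. A minimal base-R sketch; `a_measure` is a made-up name and the inputs are illustrative:

```r
# Sketch: sample estimate of A = P(X1 > X2) + 0.5 * P(X1 = X2).
# a_measure() is a hypothetical helper, not from any package.
a_measure <- function(x1, x2) {
  diffs <- outer(x1, x2, "-")  # all pairwise differences x1[i] - x2[j]
  mean(diffs > 0) + 0.5 * mean(diffs == 0)
}

# Illustrative ordinal-style data with one tie (4 vs 4)
a_12 <- a_measure(c(3, 4, 5), c(1, 4, 6))
print(a_12)
```

Ties count as half a "win" for each group, so swapping the two groups always gives the complement, `1 - A`.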

Eta Squared

Also see Post-Hoc Analysis, General >> Frequentist Difference-in-Means >> Nonparametric (Rank Eta-Squared, Rank Epsilon-Squared)
Notes from: Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs (Lakens)
Effect Size for ANOVA
Measures the proportion of the variation in Y that is associated with membership of the different groups defined by X, or the sum of squares of the effect divided by the total sum of squares
Eta Squared is an uncorrected effect size estimate that estimates the amount of variance explained based on the sample, and not based on the entire population.
Partial Eta Squared (\(\eta_p^2\)) improves the comparability of effect sizes between studies by expressing the sum of squares of the effect relative to the sum of squares of the effect plus the sum of squares of the error associated with the effect.
Although \(\eta_p^2\) is more useful when the goal is to compare effect sizes across studies, it is not perfect, because \(\eta_p^2\) differs when the same two means are compared in a within-subjects design or a between-subjects design.
An \(\eta^2\) of 0.13 means that 13% of the total variance can be accounted for by group membership.
CIs should be at 90%: the F-test is one-sided, so a 90% CI corresponds to \(\alpha = 0.05\). With a 95% CI, it’s possible that even with a significant F-test, the CI will contain 0. With a 90% CI, this doesn’t happen.
Eta Squared

\[ \eta^2 = \frac{\text{SS}_{\text{effect}}}{\text{SS}_{\text{total}}} \]
- \(\text{SS}_{\text{effect}}\) and \(\text{SS}_{\text{total}}\) are obtained from the ANOVA results
Omega Squared - The correction for eta squared (\(\eta^2\)) is known as omega squared (\(\omega^2\)). Still biased but less biased. The difference is typically small, and the bias decreases as the sample size increases.

\[ \begin{align} \omega^2 &= \frac{\operatorname{df}_{\text{effect}}(\operatorname{MS_{\text{effect}}}-\operatorname{MS_{\text{error}}})}{\operatorname{SS_{\text{total}}} + \operatorname{MS_{\text{error}}}} \quad \text{(between-subjects)} \\ \omega^2 &= \frac{\operatorname{df}_{\text{effect}}(\operatorname{MS_{\text{effect}}}-\operatorname{MS_{\text{error}}})}{\operatorname{SS_{\text{total}}} + \operatorname{MS_{\text{subjects}}}} \quad \text{(within-subjects)} \\ \end{align} \]
Epsilon Squared - Analogous to adjusted R² (Allen, 2017, p. 382), and has been found to be less biased (Carroll & Nordholm, 1975) even though Omega Squared is more popular.
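To see how the three estimates relate, here is a minimal base-R sketch for a one-way between-subjects design. The data are made up, and the epsilon squared formula \((\text{SS}_{\text{effect}} - \text{df}_{\text{effect}} \cdot \text{MS}_{\text{error}}) / \text{SS}_{\text{total}}\) is not given in these notes; it is the commonly cited form and should be treated as an assumption here.

```r
# Hypothetical one-way between-subjects data (illustrative only)
dat <- data.frame(
  y = c(4.1, 5.0, 4.6, 5.3, 6.2, 6.8, 5.9, 6.5, 7.9, 8.4, 7.2, 8.8),
  g = factor(rep(c("a", "b", "c"), each = 4))
)

# ANOVA table: row 1 is the effect (g), row 2 is the residual error
tab <- summary(aov(y ~ g, data = dat))[[1]]
ss_effect <- tab[["Sum Sq"]][1]
ss_error  <- tab[["Sum Sq"]][2]
ss_total  <- ss_effect + ss_error
df_effect <- tab[["Df"]][1]
ms_effect <- tab[["Mean Sq"]][1]
ms_error  <- tab[["Mean Sq"]][2]

eta_sq   <- ss_effect / ss_total
omega_sq <- df_effect * (ms_effect - ms_error) / (ss_total + ms_error)
# Assumed formula for epsilon squared (not stated in these notes)
epsilon_sq <- (ss_effect - df_effect * ms_error) / ss_total

print(c(eta_sq = eta_sq, epsilon_sq = epsilon_sq, omega_sq = omega_sq))
```

In a one-way design \(\eta^2\) equals the \(R^2\) of the corresponding linear model and this \(\varepsilon^2\) equals its adjusted \(R^2\), which matches the "analogous to adjusted R²" description.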
Partial Eta Squared

\[ \begin{align} \eta_p^2 &= \frac{\operatorname{SS_{\text{effect}}}}{\operatorname{SS_{\text{effect}}} + \operatorname{SS_{\text{error}}}} \quad \text{(fixed and measured variables)}\\ \eta_p^2 &= \frac{F \times \operatorname{df}_{\text{effect}}}{F \times \operatorname{df}_{\text{effect}} + \operatorname{df}_{\text{error}}} \quad \text{(fixed variables)} \end{align} \]
- Fixed (e.g., manipulated), not random (e.g., measured)
- Bias-Lessened
  
  \[ \omega_p^2 = \frac{\operatorname{df}_{\text{effect}}(\operatorname{MS_{\text{effect}}}-\operatorname{MS_{\text{error}}})}{\operatorname{df}_{\text{effect}} \times \operatorname{MS_{\text{effect}}} + (N - \operatorname{df}_{\text{effect}}) \times \operatorname{MS_{\text{error}}}} \]
  - Same equation whether it’s for between-subject designs or within-subject designs
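The F-based formula is handy when a paper reports only the test statistic and its degrees of freedom. A minimal base-R sketch of both formulas above; the function names and the reported values (F(2, 57) = 4.5, N = 60, and the two mean squares) are made up for illustration:

```r
# Sketch: partial eta squared from a reported F and its degrees of freedom.
# Function names are hypothetical, not from any package.
eta_p_sq_from_f <- function(f, df_effect, df_error) {
  (f * df_effect) / (f * df_effect + df_error)
}

# Sketch: partial omega squared from mean squares and total sample size n
omega_p_sq <- function(df_effect, ms_effect, ms_error, n) {
  df_effect * (ms_effect - ms_error) /
    (df_effect * ms_effect + (n - df_effect) * ms_error)
}

# Illustrative reported result: F(2, 57) = 4.5 with MS_effect = 9, MS_error = 2
eta_p <- eta_p_sq_from_f(f = 4.5, df_effect = 2, df_error = 57)
omega_p <- omega_p_sq(df_effect = 2, ms_effect = 9, ms_error = 2, n = 60)
print(c(eta_p_sq = eta_p, omega_p_sq = omega_p))
```

For these inputs the sums-of-squares form gives the same \(\eta_p^2\) (18 / (18 + 114)), and \(\omega_p^2\) comes out smaller, as expected for the less biased estimate.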
Lakens recommends that researchers report \(\eta_G^2\) and/or \(\eta_p^2\), at least until generalized omega-squared is automatically provided by statistical software packages
- For designs where all factors are manipulated between participants, \(\eta_p^2\) and \(\eta_G^2\) are identical, so either effect size can be reported. For within-subjects designs and mixed designs where all factors are manipulated, \(\eta_p^2\) can always be calculated from the F-value and the degrees of freedom (formula 13 in the paper), but \(\eta_G^2\) cannot be calculated from the reported results, and therefore Lakens recommends reporting \(\eta_G^2\) for these designs
- The paper’s supplementary spreadsheet provides a relatively easy way to calculate \(\eta_G^2\) for commonly used designs. For designs with measured factors or covariates, neither \(\eta_p^2\) nor \(\eta_G^2\) can be calculated from the

Appropriate Version Per Use Case

Use Case	Version	Less Biased Version
Comparisons within a single study	\(\eta^2\)	\(\omega^2\)
Power analyses, and for comparisons of effect sizes across studies with the same experimental design	\(\eta_p^2\)	\(\omega_p^2\)
Meta-Analyses to compare across various experimental designs	\(\eta_G^2\)	\(\omega_G^2\)

Guidelines
- Cohen’s benchmarks were developed for comparisons between unrestricted populations (e.g., men vs. women), and using these benchmarks when interpreting the \(\eta_p^2\) effect size in designs that include covariates or repeated measures is not consistent with the considerations upon which the benchmarks were based.
  - Although \(\eta_G^2\) can be compared using these guidelines, it is preferable to compare effect sizes with those in the literature.
- Field (2013) (source)
  
  Range	Interpretation
  ES \(\lt\) 0.01	Very Small
  0.01 \(\le\) ES \(\lt\) 0.06	Small
  0.06 \(\le\) ES \(\lt\) 0.14	Medium
  ES \(\ge\) 0.14	Large
- Jacob Cohen (1992) (source) - Applicable to One-Way ANOVAs or Partial Eta or Partial Omega in multiway ANOVA.
  
  Range	Interpretation
  ES \(\lt\) 0.02	Very Small
  0.02 \(\le\) ES \(\lt\) 0.13	Small
  0.13 \(\le\) ES \(\lt\) 0.26	Medium
  ES \(\ge\) 0.26	Large
