Multilevel
Misc
- Also see Post-Hoc Analysis, general
- Packages
- {effectsize} - Has many of the metrics discussed here and others — with confidence intervals
Tukey Test
Difference in effects
Example: Is there a statistically significant difference between the estimated effects of the categories of the fixed effect, “Season”?
Data from Multilevel Modeling and Effects Statistics for Sports Scientists in R
```r
library(lme4)     # lmer()
library(multcomp) # glht(), mcp()

# model as reported in the "Fit:" line of the output below
fit <- lmer(Distance ~ Season + (1 | Athlete), data = data)

# pairwise comparisons
fit_tukey <- glht(fit, linfct = mcp(Season = "Tukey"))
summary(fit_tukey)
##
##   Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lmer(formula = Distance ~ Season + (1 | Athlete), data = data)
##
## Linear Hypotheses:
##                             Estimate Std. Error z value Pr(>|z|)
## Postseason - Inseason == 0     36.71      90.08   0.408    0.911
## Preseason - Inseason == 0    1166.00      90.08  12.944   <1e-05 ***
## Preseason - Postseason == 0  1129.29     110.32  10.236   <1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
```
```r
library(emmeans)

emmeans(fit, specs = pairwise ~ Season)
## $emmeans
##  Season     emmean  SE   df lower.CL upper.CL
##  Inseason     5104 137 20.8     4818     5389
##  Postseason   5140 151 30.6     4831     5449
##  Preseason    6270 151 30.6     5961     6579
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
##
## $contrasts
##  contrast               estimate    SE  df t.ratio p.value
##  Inseason - Postseason     -36.7  90.1 978  -0.408  0.9125
##  Inseason - Preseason    -1166.0  90.1 978 -12.944  <.0001
##  Postseason - Preseason  -1129.3 110.3 978 -10.236  <.0001
##
## Degrees-of-freedom method: kenward-roger
## P value adjustment: tukey method for comparing a family of 3 estimates
```
- Interpretation
- There is no statistically significant difference between the effect that Postseason has on Distance and the effect that Inseason has on Distance (adjusted p ≈ 0.91).
- There is a statistically significant difference within each of the other two pairs of categories (Preseason vs. Inseason and Preseason vs. Postseason).
- The `$emmeans` table gives the estimated mean distance for each season type
- I’m not sure these estimates are appropriate in this situation since the Season variable is inherently unbalanced.
- Also see Post-Hoc Analysis, emmeans
Cohen’s D
Standardized difference in means given a grouping variable
Generally recommended to use \(g_{\text{rm}}\) or \(g_{\text{av}}\)
Standard practice is to use whichever of those two values is closer to \(d_s\), because it helps make the result comparable with between-subject studies.
Correction for bias can be important when the degrees of freedom (df) < 50
Appropriate Version Per Use Case
| Use | Version |
|---|---|
| Independent groups, power analyses where \(\sigma_\text{pop}\) is known or \(\sigma\) is calculated with \(n\) | \(d_{\text{pop}}\) |
| Independent groups, power analyses where \(\sigma_\text{pop}\) is unknown or \(\sigma\) is calculated with \(n-1\) | \(d_s\) |
| Independent groups, corrects for small sample bias; report for use in meta-analyses | \(g\) |
| Independent groups, when treatment might affect SD | \(\Delta\) |
| Correlated groups; generally recommended over \(g_{\text{rm}}\) | \(g_{\text{av}}\) |
| Correlated groups; more conservative than \(g_{\text{av}}\) | \(g_{\text{rm}}\) |
| Correlated groups; power analyses | \(d_z\) |
Notes from: Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs (Lakens)
Can be used to compare effects across studies, even when the dependent variables are measured in different ways
Examples
When one study uses 7-point scales to measure dependent variables, while the other study uses 9-point scales
When completely different measures are used, such as when one study uses self-report measures, and another study used physiological measurements.
The bias-corrected version is known as Hedges’ g, and in the r family of effect sizes, the correction for eta squared (\(\eta^2\)) is known as omega squared (\(\omega^2\))
Guidelines
Range: \(|d|\) runs from 0 to \(\infty\) (the sign of \(d\) only indicates the direction of the difference)
Cohen (1992)
- |d| < 0.2 “negligible”
- |d| < 0.5 “small”
- |d| < 0.8 “medium”
- otherwise “large”
Values should not be interpreted rigidly
- e.g. Small effect sizes can have large consequences, such as an intervention that leads to a reliable reduction in suicide rates with an effect size of d = 0.1.
These benchmarks should only be used when the findings are extremely novel and cannot be compared to related findings in the literature.
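The cut-offs above can be applied mechanically, which is roughly what the magnitude labels printed by some packages do. A minimal sketch in base R; `label_cohens_d()` is a hypothetical helper written for this note, not a function from any package:

```r
# Hypothetical helper: label |d| using the Cohen (1992) cut-offs listed above
label_cohens_d <- function(d) {
  cut(
    abs(d),
    breaks = c(0, 0.2, 0.5, 0.8, Inf),
    labels = c("negligible", "small", "medium", "large"),
    right = FALSE  # intervals are [lower, upper), so |d| = 0.2 counts as "small"
  )
}

# Illustrative values only
d_labels <- as.character(label_cohens_d(c(-0.0317, 0.1, 0.57, -0.877)))
print(d_labels)
```

The labels are only a convenience; they do not replace comparison with effect sizes reported in the related literature.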
Two groups of Independent Observations (Between-Subjects)
\[ \begin{align} d_s &= \frac{\bar X_1 - \bar X_2}{\sqrt{\frac{(n_1-1)SD^2_1 + (n_2-1)SD^2_2}{n_1 + n_2 - 2}}}\\ &= t\;\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\\ & \approx \frac{2t}{\sqrt{N}} \end{align} \]
Where the denominator is the pooled standard deviation
\(t\) is the t-value of a two-sample t-test
Typically used in an a priori power analysis for between-subjects designs
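A minimal base R sketch of the three expressions for \(d_s\) above. The data are simulated purely for illustration (not the data used elsewhere in these notes), and the variable names are my own:

```r
set.seed(1)

# Simulated two-group data (illustrative only)
x1 <- rnorm(30, mean = 10, sd = 2)
x2 <- rnorm(40, mean = 9, sd = 2)
n1 <- length(x1)
n2 <- length(x2)

# Pooled standard deviation (the denominator of d_s)
sd_pooled <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
d_s <- (mean(x1) - mean(x2)) / sd_pooled

# Same value recovered from the equal-variance two-sample t statistic
t_val <- unname(t.test(x1, x2, var.equal = TRUE)$statistic)
d_from_t <- t_val * sqrt(1 / n1 + 1 / n2)

# Approximation 2t / sqrt(N); exact only when n1 == n2
d_approx <- 2 * t_val / sqrt(n1 + n2)

print(c(d_s = d_s, d_from_t = d_from_t, d_approx = d_approx))
```

The first two values agree exactly; the third drifts further from them as the group sizes become more unequal.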
Hedges’ g (bias-corrected)
\[ g_s = d_s \times \left(1-\frac{3}{4(n_1 + n_2) - 9}\right) \]
- The same correction is used for all types of Cohen’s d
- The difference between Hedges’ \(g_s\) and Cohen’s \(d_s\) is very small, especially in sample sizes above 20
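To see how quickly the correction shrinks, here is a small sketch of the correction factor. `hedges_correction()` is a hypothetical helper name, and \(d_s = 0.5\) is an arbitrary illustrative value:

```r
# Hypothetical helper implementing the correction factor 1 - 3 / (4(n1 + n2) - 9)
hedges_correction <- function(n1, n2) {
  1 - 3 / (4 * (n1 + n2) - 9)
}

d_s <- 0.5  # illustrative value

g_small <- d_s * hedges_correction(10, 10)    # total N = 20
g_large <- d_s * hedges_correction(100, 100)  # total N = 200

print(round(c(g_small = g_small, g_large = g_large), 3))
# g_small is about 0.479, g_large about 0.498
```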
Interpretation: The difference in means expressed as a fraction of the standard deviation. Best to relate it to other effect sizes in the literature and its practical consequences if possible.
- e.g. \(d_s = 0.5\) says the difference in means is half a standard deviation.
Whenever standard deviations differ substantially between groups, Glass’s \(\Delta\) should also be reported
One Sample or Correlated Samples (Within-Subjects)
\[ \begin{aligned} &d_z = \frac{M_{\text{diff}}}{S_{\text{diff}}} = \frac{t}{\sqrt{n}} \\ &\begin{aligned} \text{where} \quad S_{\text{diff}}^{(1)} &= \sqrt{\frac{\sum(X_{\text{diff}} - M_{\text{diff}})^2}{N-1}} \\ S_{\text{diff}}^{(2)} &= \sqrt{\text{SD}_1^2 + \text{SD}_2^2 - (2\cdot r\cdot \text{SD}_1 \cdot \text{SD}_2)} \end{aligned} \end{aligned} \]
- \(M_{\text{diff}}\) is the difference between the mean (M) of the difference scores and the comparison value, \(\mu\) (typically 0)
- For paired data, the mean of the difference scores is equal to the difference in means of the two groups, so you may see it described or calculated either way.
- \(X_{\text{diff}}\) are the difference scores (i.e. the difference between the repeated measurements)
- \(S_\text{diff}\) is the SD of the difference scores.
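- It can be calculated two different ways. The two are algebraically identical when \(\text{SD}_1\), \(\text{SD}_2\), and \(r\) are computed from the same paired data, since the variance of a difference equals the sum of the variances minus twice the covariance.
- The second way is useful when only the summary statistics (the two SDs and their correlation) are available rather than the raw difference scores.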
- \(t\) is the t-value of a paired samples t-test
- \(r\) is the correlation between measurements
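A quick numerical check of the two expressions for \(S_{\text{diff}}\) and of \(d_z = t / \sqrt{n}\). This is a sketch on simulated paired data (illustrative only), with the comparison value \(\mu\) assumed to be 0:

```r
set.seed(42)

# Simulated paired measurements (illustrative only)
pre  <- rnorm(25, mean = 50, sd = 8)
post <- pre + rnorm(25, mean = 3, sd = 5)
diffs <- post - pre
n <- length(diffs)

# (1) SD of the difference scores
s_diff_1 <- sd(diffs)

# (2) From the two SDs and the correlation between measurements
r <- cor(pre, post)
s_diff_2 <- sqrt(var(pre) + var(post) - 2 * r * sd(pre) * sd(post))

# d_z from its definition and from the paired t statistic
d_z <- mean(diffs) / s_diff_1
t_val <- unname(t.test(post, pre, paired = TRUE)$statistic)
d_z_from_t <- t_val / sqrt(n)

print(c(s_diff_1 = s_diff_1, s_diff_2 = s_diff_2))
print(c(d_z = d_z, d_z_from_t = d_z_from_t))
```

Both pairs of values agree up to floating-point error.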
Repeated Measures (Within-Subjects)
\[ d_{\text{rm}} = d_z \cdot \sqrt{2(1-r)} \]
Alternative
\[ d_{\text{av}} = \frac{M_{\text{diff}}}{\frac{\text{SD}_1 + \text{SD}_2}{2}} \]
- Ignores the correlation between measures
If it is believed that the intervention/treatment affected the SD, then it is advised to standardize by only one SD, either the pre-treatment \(\text{SD}_1\) (recommended) or the post-treatment \(\text{SD}_2\), and to report which one was used. The calculated effect is then known as Glass’s \(\boldsymbol{\Delta}\)
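The within-subjects variants can be computed side by side to see how much the choice of standardizer matters. A sketch on simulated pre/post data (illustrative only); the variable names are my own, and Glass’s \(\Delta\) here uses the pre-treatment SD:

```r
set.seed(42)

# Simulated pre/post measurements (illustrative only)
pre  <- rnorm(25, mean = 50, sd = 8)
post <- pre + rnorm(25, mean = 3, sd = 5)

m_diff <- mean(post) - mean(pre)
r      <- cor(pre, post)
s_diff <- sd(post - pre)

d_z         <- m_diff / s_diff                       # standardized by SD of differences
d_rm        <- d_z * sqrt(2 * (1 - r))               # corrects d_z for the correlation
d_av        <- m_diff / ((sd(pre) + sd(post)) / 2)   # average of the two SDs
glass_delta <- m_diff / sd(pre)                      # pre-treatment SD only

print(round(c(d_z = d_z, d_rm = d_rm, d_av = d_av, glass_delta = glass_delta), 3))
```

With highly correlated measurements, \(d_z\) is typically much larger than the other three, which is one reason it is not directly comparable with between-subjects effect sizes.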
Example: Distance (outcome), Season (Grouping variable)
Comparing Distance means given Season (3 levels) type
Data from Multilevel Modeling and Effects Statistics for Sports Scientists in R
Another package, {effectsize}, is similar to {effsize} (used below) in that its formula argument only accepts grouping variables with 2 levels
- May have other features though, since it’s part of the easystats suite.
```r
library(effsize)

effsize::cohen.d(preseason_data$Distance, inseason_data$Distance)
##
## Cohen's d
##
## d estimate: 0.9157833 (large)
## 95 percent confidence interval:
##     lower     upper
## 0.7493283 1.0822383
```
- Season is a categorical fixed effect with 3 levels
- Other Available Arguments: hedges.correction, pooled, paired, within, noncentral
```r
library(rstatix)

data %>%
  rstatix::cohens_d(Distance ~ Season, ci = TRUE)
#>   .y.      group1     group2     effsize    n1    n2 conf.low conf.high magnitude
#> * <chr>    <chr>      <chr>        <dbl> <int> <int>    <dbl>     <dbl> <ord>
#> 1 Distance Inseason   Postseason -0.0317   600   200    -0.18      0.13 negligible
#> 2 Distance Inseason   Preseason  -0.877    600   200    -1.06     -0.71 large
#> 3 Distance Postseason Preseason  -0.884    200   200    -1.09     -0.68 large
```
- Same types of arguments as {effsize} are available and also bootstrap CIs
- Magnitude (interpretation) by Cohen’s (1992) guidelines
```r
estimate <- esci::estimate_mdiff_two(
  data = mydata,
  outcome_variable = Prediction,
  grouping_variable = Exposure,
  conf_level = 0.95,
  assume_equal_variance = TRUE
)

estimate$es_smd |>
  tidyr::gather(key = "type", value = "value")
#>                      type             value
#> 1   outcome_variable_name        Prediction
#> 2  grouping_variable_name          Exposure
#> 3                  effect            20 ‒ 1
#> 4             effect_size 0.571611929854665
#> 5                      LL 0.327273973938463
#> 6                      UL  0.81492376943417
#> 7               numerator  11.3842850063322
#> 8             denominator  19.8603120279963
#> 9                      SE 0.124402744976289
#> 10                     df               268
#> 11               d_biased 0.573217832141019

estimate$es_smd_properties$message
#> This standardized mean difference is called d_s because the standardizer used was s_p. d_s has been corrected for bias. Correction for bias can be important when df < 50. See the rightmost column for the biased value.
```
- Moderate effect (medium by Cohen’s benchmarks): d = 0.57, 95% CI [0.33, 0.81]; the confidence interval is fairly narrow
- The output reports which type of Cohen’s d was calculated and the denominator (standardizer) used
Common Language Effect Size
AKA Probability of Superiority
Converts the effect size into a percentage, which is supposed to be more understandable for laymen
Misc
- Notes from The Common Language Effect Size Statistic
- Packages
Interpretation
- Between-Subjects: The probability that a randomly sampled person from the first group will have a higher observed measurement than a randomly sampled person from the second group
- Within-Subjects: The probability that an individual has a higher value on one measurement than the other.
Formula
Assumes variables are normally distributed and \(\sigma_1 = \sigma_2\)
Original paper gives some evidence that these formulas are pretty robust to violations though.
Recommended only for continuous variables
Between-Subjects
\[ \begin{align} \tilde d &= \frac{|M_1 - M_2|}{\sqrt{p_1\text{SD}_1^2 + p_2\text{SD}_2^2}} \\ Z &= \frac{\tilde d}{\sqrt{2}} \end{align} \]
- \(M_i\): The mean of the ith group variable
- \(p_i\): The proportion of the sample size of the ith group variable
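The formula above stops at \(Z\). In the original common language effect size, the probability is then read off the standard normal CDF, \(\text{CL} = \Phi(Z)\); the sketch below assumes that final step. `cles_between()` is a hypothetical helper and the summary statistics are made up for illustration:

```r
# Hypothetical helper: between-subjects CL from summary statistics
cles_between <- function(m1, m2, sd1, sd2, n1, n2) {
  p1 <- n1 / (n1 + n2)  # proportion of the sample in group 1
  p2 <- n2 / (n1 + n2)  # proportion of the sample in group 2
  d_tilde <- abs(m1 - m2) / sqrt(p1 * sd1^2 + p2 * sd2^2)
  z <- d_tilde / sqrt(2)
  pnorm(z)  # assumed final step: CL = Phi(Z), valid under normality
}

# Illustrative values: d_tilde = 0.5
cl <- cles_between(m1 = 10, m2 = 9, sd1 = 2, sd2 = 2, n1 = 50, n2 = 50)
print(round(cl, 2))  # about 0.64
```

Read as: a randomly sampled person from the higher-scoring group has roughly a 64% chance of scoring above a randomly sampled person from the other group.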
Within Subjects
\[ Z = \frac{|M_1 - M_2|}{\sqrt{\operatorname{SD}_1^2 + \operatorname{SD}_2^2 - 2 \times r \times \operatorname{SD}_1 \times \operatorname{SD}_2}} \]
- \(r\) is the Pearson correlation between the group variables
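A sketch of the within-subjects version on simulated paired data (illustrative only), again assuming the final step \(\text{CL} = \Phi(Z)\). The raw proportion of individuals with a higher second measurement is printed alongside as a rough sanity check, not as an equivalent statistic:

```r
set.seed(7)

# Simulated paired measurements (illustrative only)
pre  <- rnorm(40, mean = 50, sd = 8)
post <- pre + rnorm(40, mean = 3, sd = 5)

r <- cor(pre, post)  # Pearson correlation between the two measurements

# Denominator is the SD of the difference scores, written via SD_1, SD_2 and r
z <- abs(mean(post) - mean(pre)) /
  sqrt(var(pre) + var(post) - 2 * r * sd(pre) * sd(post))

cl_within <- pnorm(z)           # assumed final step: CL = Phi(Z)
prop_higher <- mean(post > pre) # observed share with a higher second measurement

print(c(cl_within = cl_within, prop_higher = prop_higher))
```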
Alternative Generalization
\(A_{1,2} = P(X_1 > X_2) + 0.5 \times P(X_1 = X_2)\)
Applies for any, not necessarily continuous, distribution that is at least ordinally scaled
Equal to CL in the continuous case
Interpreted as an estimate of the value of CL that would be obtained if the distribution of X were continuous.
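\(A_{1,2}\) can be estimated directly by comparing every observation in one group with every observation in the other. A minimal sketch; `vd_a()` is a hypothetical helper and the ordinal scores are made up for illustration:

```r
# Hypothetical helper: A = P(X1 > X2) + 0.5 * P(X1 = X2), over all pairs
vd_a <- function(x1, x2) {
  greater <- outer(x1, x2, ">")   # TRUE where an x1 value beats an x2 value
  equal   <- outer(x1, x2, "==")  # TRUE where the pair is tied
  mean(greater) + 0.5 * mean(equal)
}

# Illustrative ordinal scores
x1 <- c(3, 4, 4, 5, 2)
x2 <- c(2, 3, 3, 4, 1)

a_12 <- vd_a(x1, x2)
print(a_12)  # 0.74 (16 of 25 pairs greater, 5 of 25 tied)
```

Ties are split evenly between the groups, so `vd_a(x1, x2) + vd_a(x2, x1)` is always 1.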
Eta Squared
Notes from: Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs (Lakens)
Effect Size for ANOVA
Measures the proportion of the variation in Y that is associated with membership of the different groups defined by X, or the sum of squares of the effect divided by the total sum of squares
Eta squared is an uncorrected effect size estimate: it estimates the amount of variance explained based on the sample, not the entire population.
Partial eta squared (\(\eta_p^2\)) improves the comparability of effect sizes between studies by expressing the sum of squares of the effect relative to the sum of squares of the effect plus the sum of squares of the error associated with the effect.
Although \(\eta_p^2\) is more useful when the goal is to compare effect sizes across studies, it is not perfect, because \(\eta_p^2\) differs when the same two means are compared in a within-subjects design or a between-subjects design.
An \(\eta^2\) of 0.13 means that 13% of the total variance can be accounted for by group membership.
CIs should be at 90%. The F-test is one-sided, so a 90% CI corresponds to a test at \(\alpha = 0.05\). With a 95% CI, it’s possible that the CI will contain 0 even with a significant F-test; with a 90% CI, this doesn’t happen.
Eta Squared
\[ \eta^2 = \frac{\text{SS}_{\text{effect}}}{\text{SS}_{\text{total}}} \]
\(\text{SS}_{\text{effect}}\) and \(\text{SS}_{\text{total}}\) are obtained from the ANOVA results
The correction for eta squared (\(\eta^2\)) is known as omega squared (\(\omega^2\)). It is still biased, but less so. The difference is typically small, and the bias decreases as the sample size increases.
\[ \begin{align} \omega^2 &= \frac{\operatorname{df}_{\text{effect}}(\operatorname{MS_{\text{effect}}}-\operatorname{MS_{\text{error}}})}{\operatorname{SS_{\text{total}}} + \operatorname{MS_{\text{error}}}} \quad \text{(between-subjects)} \\ \omega^2 &= \frac{\operatorname{df}_{\text{effect}}(\operatorname{MS_{\text{effect}}}-\operatorname{MS_{\text{error}}})}{\operatorname{SS_{\text{total}}} + \operatorname{MS_{\text{subjects}}}} \quad \text{(within-subjects)} \\ \end{align} \]
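A base R sketch of \(\eta^2\) and the between-subjects \(\omega^2\), computed from the ANOVA table of a simulated one-way design (data and variable names are illustrative, not from the source):

```r
set.seed(123)

# Simulated one-way between-subjects data (illustrative only)
dat <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 20)),
  y = c(rnorm(20, mean = 10), rnorm(20, mean = 11), rnorm(20, mean = 12))
)

# ANOVA table with sums of squares, mean squares and degrees of freedom
tab <- anova(lm(y ~ group, data = dat))

ss_effect <- tab["group", "Sum Sq"]
ss_error  <- tab["Residuals", "Sum Sq"]
ss_total  <- ss_effect + ss_error
df_effect <- tab["group", "Df"]
ms_effect <- tab["group", "Mean Sq"]
ms_error  <- tab["Residuals", "Mean Sq"]

eta_sq   <- ss_effect / ss_total
omega_sq <- df_effect * (ms_effect - ms_error) / (ss_total + ms_error)

print(round(c(eta_sq = eta_sq, omega_sq = omega_sq), 3))
```

\(\omega^2\) comes out smaller than \(\eta^2\) because the numerator removes the part of the effect sum of squares expected from error alone.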
Partial Eta Squared
\[ \begin{align} \eta_p^2 &= \frac{\operatorname{SS_{\text{effect}}}}{\operatorname{SS_{\text{effect}}} + \operatorname{SS_{\text{error}}}} \quad \text{(fixed and measured variables)}\\ \eta_p^2 &= \frac{F \times \operatorname{df}_{\text{effect}}}{F \times \operatorname{df}_{\text{effect}} + \operatorname{df}_{\text{error}}} \quad \text{(fixed variables)} \end{align} \]
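The second (F-based) form applies only to designs with fixed (e.g., manipulated) factors, not to designs with random (e.g., measured) factors or covariates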
Bias-Lessened
\[ \omega_p^2 = \frac{\operatorname{df}_{\text{effect}}(\operatorname{MS_{\text{effect}}}-\operatorname{MS_{\text{error}}})}{\operatorname{df}_{\text{effect}} \times \operatorname{MS_{\text{effect}}} + (N - \operatorname{df}_{\text{effect}}) \times \operatorname{MS_{\text{error}}}} \]
- Same equation for between-subjects designs and within-subjects designs
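A sketch of \(\eta_p^2\) (both forms) and \(\omega_p^2\) for one factor of a simulated, balanced two-way between-subjects design with fixed factors. Data and names are illustrative only:

```r
set.seed(123)

# Simulated balanced 2 x 3 design, 10 observations per cell (illustrative only)
dat <- expand.grid(rep = 1:10, a = c("a1", "a2"), b = c("b1", "b2", "b3"))
dat$y <- rnorm(nrow(dat), mean = ifelse(dat$a == "a2", 1, 0))
n_total <- nrow(dat)

tab <- anova(lm(y ~ a * b, data = dat))

# Quantities for factor "a"
ss_effect <- tab["a", "Sum Sq"]
ss_error  <- tab["Residuals", "Sum Sq"]
df_effect <- tab["a", "Df"]
df_error  <- tab["Residuals", "Df"]
ms_effect <- tab["a", "Mean Sq"]
ms_error  <- tab["Residuals", "Mean Sq"]
f_val     <- tab["a", "F value"]

# Partial eta squared from sums of squares and from F and the dfs
eta_p_ss <- ss_effect / (ss_effect + ss_error)
eta_p_f  <- (f_val * df_effect) / (f_val * df_effect + df_error)

# Less biased partial omega squared
omega_p <- df_effect * (ms_effect - ms_error) /
  (df_effect * ms_effect + (n_total - df_effect) * ms_error)

print(round(c(eta_p_ss = eta_p_ss, eta_p_f = eta_p_f, omega_p = omega_p), 3))
```

The two \(\eta_p^2\) values agree, which is what makes it possible to recover \(\eta_p^2\) from a reported F-value and its degrees of freedom.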
Lakens recommends that researchers report \(\eta_G^2\) and/or \(\eta_p^2\), at least until generalized omega-squared is automatically provided by statistical software packages
- For designs where all factors are manipulated between participants, \(\eta_p^2\) and \(\eta_G^2\) are identical, so either effect size can be reported. For within-subjects designs and mixed designs where all factors are manipulated, \(\eta_p^2\) can always be calculated from the F-value and the degrees of freedom (the F-based partial eta squared formula above; formula 13 in Lakens), but \(\eta_G^2\) cannot be calculated from the reported results, and therefore Lakens recommends reporting \(\eta_G^2\) for these designs
- The paper’s supplementary spreadsheet provides a relatively easy way to calculate \(\eta_G^2\) for commonly used designs. For designs with measured factors or covariates, neither \(\eta_p^2\) nor \(\eta_G^2\) can be calculated from the reported results
Appropriate Version Per Use Cases
| Use Case | Version | Less Biased Version |
|---|---|---|
| Comparisons within a single study | \(\eta^2\) | \(\omega^2\) |
| Power analyses, and for comparisons of effect sizes across studies with the same experimental design | \(\eta_p^2\) | \(\omega_p^2\) |
| Meta-analyses to compare across various experimental designs | \(\eta_G^2\) | \(\omega_G^2\) |

Guidelines
Cohen’s benchmarks were developed for comparisons between unrestricted populations (e.g., men vs. women), and using these benchmarks when interpreting the \(\eta_p^2\) effect size in designs that include covariates or repeated measures is not consistent with the considerations upon which the benchmarks were based.
- Although \(\eta_G^2\) can be compared using these guidelines, it is preferable to compare effect sizes with those in the literature.