RCT

Follow-up thread to Harrell’s RCT post

Misc

  • Reasons for not running an RCT
    1. It’s just not technically feasible to have individual-level randomization of users as we would in a classical A/B test
      • e.g. randomizing which individuals see a billboard ad is not possible
    2. We can randomize but expect interference between users assigned to different experiences, either through word-of-mouth, mass media, or even our own ranking systems; in short, the stable unit treatment value assumption (SUTVA) would be violated, biasing the results
  • A gold-standard design is a 6-period 2-treatment randomized crossover study; the patient actually receives both treatments and her responses can be compared (Harrell)
  • ATE for RCT:
    • Non-theoretical ATE (i.e. calculated from actual data) is sample-averaged; population sampling weights are unavailable for RCT subject groups. So this ATE applies to a replication of the study with similar sampling patterns. ATE does not apply to the population and in fact may apply to no one due to lack of conditioning on patient characteristics. The ATE used in 99% of papers has nothing to do with the population but uses only convenience-sample weighting. Some papers even blatantly call it population-averaged.
    • “They test causal hypotheses about a group of patients with symptoms & other ‘diagnostic’ findings that form entry criteria for the RCT & may only be available in sufficient numbers in specialist centres.”
    • Gelman
      • “the drug works on some people and not others—or in some comorbidity scenarios and not others—we realize that “the treatment effect” in any given study will depend entirely on the patient mix. There is no underlying number representing the effect of the drug. Ideally one would like to know what sorts of patients the treatment would help, but in a clinical trial it is enough to show that there is some clear average effect. My point is that if we consider the treatment effect in the context of variation between patients, this can be the first step in a more grounded understanding of effect size.”
    • Gelman regarding a 0.1 ATE for a treatment in an education study
      • “Actually, though, an effect of 0.1 GPA is a lot. One way to think about this is that it’s equivalent to a treatment that raises GPA by 1 point for 10% of people and has no effect on the other 90%. That’s a bit of an oversimplification, but the point is that this sort of intervention might well have little or no effect on most people. In education and other fields, we try lots of things to try to help students, with the understanding that any particular thing we try will not make a difference most of the time. If mindset intervention can make a difference for 10% of students, that’s a big deal. It would be naive to think that it would make a difference for everybody: after all, many students have a growth mindset already and won’t need to be told about it.”
      • “Maybe in some fields of medicine this is cleaner because you can really isolate the group of patients who will be helped by a particular treatment. But in social science this seems much harder.”
        • Me: So, a 0.1 effect wouldn’t be large if there were no variation (i.e. the same size effect for everyone), but that’s very unlikely to be the case.
  • Calculation of standard errors differs by RCT type so that variation within arms can validly be used to estimate variation between arms.
  • Random Sampling vs Random Treatment Allocation (source)
    • Random Sampling: licenses the use of measures of uncertainty for (sub)groups of sampled patients.
    • Random Treatment Allocation: licenses the use of measures of uncertainty for the differences between the allocated groups.
      • Re RCTs:
        • licenses the use of measures of uncertainty for hazard ratios, odds ratios, risk ratios, median/mean survival differences, absolute risk reduction, etc. that measure differences between groups.
        • Because there is no random sampling, measures of uncertainty are not licensed by the randomization procedure for cohort-specific estimates such as the median survival observed in each treatment cohort.
          • For those, we can use descriptive measures such as standard deviation (SD), interquartile range etc. Measures of uncertainty will require further assumptions to be considered valid. Further discussion here
  • Balance - Balanced allocations are more efficient in that, for a fixed total sample size, they lead to lower variances
    • Variance of the mean difference (e.g. between treatment and control groups) for an unbalanced design:

      \[ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \]
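
    • A quick numeric check (R; a minimal sketch with illustrative numbers): for a fixed total sample size, the balanced split minimizes this variance.

      # Variance of the mean difference: sigma1^2/n1 + sigma2^2/n2
      var_diff <- function(n1, n2, s1 = 1, s2 = 1) s1^2 / n1 + s2^2 / n2
      var_diff(50, 50)  # balanced split of n = 100: 0.04
      var_diff(80, 20)  # unbalanced split of n = 100: 0.0625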

  • Follow-Up
    • In-Trial Follow-Up (ITFU)
      • Without ITFU, the unbiased ascertainment of outcomes may be compromised and statistical power considerably reduced
      • Strategies
        • Face-to-face follow-up is widely used during the initial “in-trial” period, but is costly if employed longer term.
        • Telephone-based approaches are more practical, with the ability to contact many participants coordinated by a central trial office
        • Postal follow-up has been shown to be effective.
        • Web-based techniques may become more widespread as technological advances develop.
    • Post-Trial Follow-Up (PTFU)
      • RCTs are costly and usually involve a relatively brief treatment period with limited follow-up. A treatment response restricted to this brief “in-trial” period can potentially underestimate the long-term benefits of treatment and also may fail to detect delayed hazards.
      • Strategies
        • See ITFU strategies
        • Use of routine health records can provide detailed information relatively inexpensively, but the availability of such data and rules governing access to it varies across countries.

RCT Features

  • Three valuable design features of clinical trials are concurrent control, randomization and blinding.
    • Blinding is weak at best without Randomization
    • Randomization is impossible without Concurrent Control
    • Concurrent Control is necessary for the other two, so it can be regarded as the most important of the three.
  • Blinding or Masking - patients are unaware of the treatment they are receiving and treating physicians are unaware of the treatment they are administering
    • Prevents differential care during follow-up, accounts for nonspecific effects associated with receiving an intervention (placebo effects), may facilitate blinding of outcome assessors, and may improve adherence.
  • Concurrent Control - the effect of a treatment should be assessed by comparing the results of subjects treated with the intervention being studied with the results of subjects treated concurrently (i.e. at the ‘same’ places at the ‘same’ times) with a control treatment, for example, placebo.
    • In reality
      • Re ‘same time’: the idea behind concurrent control is that patients’ recruitment times will vary randomly within treatment arms in the same way as between arms, so that variation in outcomes arising from the former can be used to judge variation in outcomes arising from the latter.
        • Timing matters: The time at which a patient is recruited into the trial matters, but it should not be biasing if patients are randomized to intervention and control throughout the trial. It will tend to increase the variance of the treatment effect, and rightly so, but this component may be (partially) eliminated by modeling a trend effect.
          • Example: 1990s AIDS studies found survival of patients who were recruited later into trials tended to be better than those recruited earlier
      • Re ‘same place’: The vast majority of randomized clinical trials are run in many centers. Variation in design around this multicenter aspect is the primary difference between the various types of RCTs (see below). All of these types are regarded as employing concurrent control, but their standard errors are calculated differently.
    • Consequence of violations
      • If variation from center to center is ignored and patients have not been randomized concurrently, then Fisher’s exact test, Pearson’s chi-square and Student’s t will underestimate the variation
  • Randomized assignment means that eligible units are randomly assigned to a treatment or comparison group. Each eligible unit has an equal chance of being selected. This tends to generate internally valid impact estimates under the weakest assumptions.
    • Randomization also allows us to achieve statistical independence, which eliminates omitted variable bias. Statistical independence implies that the treatment variable is not correlated with the other variables. The key assumption is that randomization effectively produces two groups that are statistically identical with respect to observed and unobserved characteristics. In other words, the treatment group is the same as the control group on average.
      • i.e. randomization process renders the experimental groups largely comparable. Thus, we can attribute any differences in the final metrics between the experimental groups to the intervention.
    • In the absence of randomization, we might fall victim to omitted variable bias because our treatment variable will probably be endogenous, i.e. correlated with other variables excluded from the model.

Assignment Options

Randomization

  • Individual Randomization - Each individual or patient is allocated to an intervention group (there may be more than one) or a control group, and simple statistical analyses of participant outcomes are used to evaluate whether the intervention was effective.
    • These analyses assume that all participants are completely independent (i.e. they do not influence each other, and their measured outcomes are affected by the intervention or usual care in the same way)
    • Stratified Randomization - Study participants are first divided into strata or subgroups based on one or more relevant characteristics or variables that are thought to influence response to treatment (i.e. confounders). Then, within each stratum, participants are randomly assigned to the treatment or control groups.
      • The purpose is to ensure that the treatment and control groups are balanced with respect to these potentially confounding variables, which increases the precision and statistical power of the study.
      • Stratification variables could be age, gender, severity of disease, or any other potentially confounding factor.
      • To obtain valid inference following stratified randomisation, treatment effects should be estimated with adjustment for stratification variables.
        • If stratification variables are originally continuous but have been discretized for purposes of stratification (e.g. Age), then the continuous form of these variables should be used in the model. (Paper)
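      • A minimal sketch of the procedure in base R (strata, arm names, and sample sizes are illustrative):

        # Randomize to treatment/control separately within each stratum so
        # the allocation ratio is preserved across strata.
        set.seed(42)
        n_per_stratum <- 6
        strata <- c("young", "old")
        assign_arms <- function(n) sample(rep(c("treatment", "control"), n / 2))
        alloc <- do.call(rbind, lapply(strata, function(s) {
          data.frame(stratum = s, arm = assign_arms(n_per_stratum))
        }))
        table(alloc$stratum, alloc$arm)  # 3/3 within each stratum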
    • Block Randomization
      • The purpose is to control for known sources of variability and remove their effects from the experimental error
      • Twitter thread presenting an example of the procedure for a block randomized RCT
      • Similar to Stratified Randomization but blocks are formed based on practical considerations, such as time, location, or specific characteristics of participants or experimental units. (Models seem to be the same though)
        • Example: These hospitals volunteered to participate in my study and I wish to randomize individuals within groups. I have no reason to believe that a particular hospital environment will affect the treatment’s efficacy on the outcome variable (i.e. hospital is not a confounder), but I wish to account for it in order to reduce error.
      • Designs with unequal sized blocks are known as Incomplete Block Designs
        • The assumption of homogeneous variances across blocks is likely to be violated when block sizes differ greatly. This violates one of the standard ANOVA assumptions.
        • Typically modeled by Weighted Least Squares or Mixed Effects: random effects for blocks to account for between-block variability, and fixed effects for the treatments (sketch below).
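        • A hedged sketch of the mixed-effects approach with lme4 (simulated data; all names are illustrative):

          library(lme4)
          set.seed(1)
          d <- data.frame(
            block     = factor(rep(1:8, each = 10)),
            treatment = factor(rep(c("control", "active"), times = 40))
          )
          # True treatment effect of 0.5 plus block-level and residual noise
          d$outcome <- 0.5 * (d$treatment == "active") +
            rnorm(8)[as.integer(d$block)] + rnorm(80)
          # Random intercepts absorb between-block variability;
          # treatment enters as a fixed effect.
          fit <- lmer(outcome ~ treatment + (1 | block), data = d)
          summary(fit)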
      • Block Designs
        • Latin Square
          • It’s commonly used in situations where there are two sources of variability that need to be controlled, such as when testing multiple treatments in different orders or sequences while accounting for variations in time or space.

          • Appropriate when complete randomization of treatments is feasible and desirable. This design requires no natural structure or hierarchy among experimental units (see Split-Plot for comparison), and treatments can be applied independently to each unit.

          • Each treatment appears once in each row and each column of the Latin square, ensuring that each treatment occurs in every possible position relative to the other treatments.

          • Useful for reducing noise and increasing efficiency.

          • Variations of the Latin square design, such as the Graeco-Latin square design, can be used to extend the concept to accommodate groups or blocks of experimental units. In these designs, the basic principles of the Latin square are applied within each block or group, allowing for more flexibility in experimental design while still controlling for sources of variation

          • Example: Does Music Genre Affect Work Productivity

                        Location 1  Location 2  Location 3  Location 4
            Time 1      Genre A     Genre B     Genre C     Genre D
            Time 2      Genre D     Genre A     Genre B     Genre C
            Time 3      Genre C     Genre D     Genre A     Genre B
            Time 4      Genre B     Genre C     Genre D     Genre A
            • Determine if there are significant differences in task performance across the different music genres while controlling for potential confounding factors (time of day and office location).
            • Each cell is a different study participant who receives a particular treatment at a particular location and time.
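          • A minimal sketch of constructing a (cyclic) Latin square for this example in base R; a real design would also randomly permute rows, columns, and genre labels:

            genres <- c("A", "B", "C", "D")
            k <- length(genres)
            # Cyclic construction: each genre appears exactly once per row and column
            square <- outer(seq_len(k), seq_len(k),
                            function(i, j) genres[((i + j - 2) %% k) + 1])
            dimnames(square) <- list(paste("Time", 1:k), paste("Location", 1:k))
            square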
        • Split-Plot
          • Particularly useful when there is a hierarchical structure to the experimental units
          • Handles hard-to-randomize factors such as logistical factors
          • The whole plots are for hard-to-randomize factors and subplots are for easy-to-randomize factors. It controls variability due to the whole plot factor.
          • Example: Effects of Learning Environment and Study Technique on Memory Recall
            • Main Plots (Groups of Participants)

              Plot 1 Plot 2
              Plot 3 Plot 4
              • Each plot consists of 20 participants
              • Plot 1: Participants from Psychology Department
              • Plot 2: Participants from Biology Department
              • Plot 3: Participants from English Department
              • Plot 4: Participants from Mathematics Department
              • Note how it would be impossible to randomly assign students to each department.
            • Subplots within each main plot

                           Repeated Reading  Mind Mapping
              Quiet Room   subgroup 1        subgroup 2
              Noisy Café   subgroup 3        subgroup 4
              • Each subgroup has 5 participants who are randomly assigned to a combination of the two treatments: Environment and Study Technique.
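            • A hedged analysis sketch with lme4 (simulated data): department is the hard-to-randomize whole-plot factor, modeled as a random intercept; the randomized within-plot factors are fixed effects.

              library(lme4)
              set.seed(5)
              d <- expand.grid(
                department  = c("Psych", "Bio", "English", "Math"),
                environment = c("Quiet Room", "Noisy Cafe"),
                technique   = c("Repeated Reading", "Mind Mapping"),
                rep         = 1:5
              )
              d$recall <- 50 + 2 * (d$environment == "Quiet Room") +
                3 * (d$technique == "Mind Mapping") +
                rnorm(4, sd = 2)[as.integer(d$department)] +
                rnorm(nrow(d), sd = 4)
              fit <- lmer(recall ~ environment * technique + (1 | department),
                          data = d)
              summary(fit)  # note: only 4 whole plots, so the department variance is rough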
        • Family Block - Naturally occurring groups or families (e.g. litters, plots with similar soil conditions, etc.).
    • Response-Adaptive Randomization (RAR)
      • Notes from Thread
      • Sounds like Multi-Armed Bandit Algorithm (Also see Experiments, A/B Testing >> Multi-Armed Bandit Algorithms)
      • Papers
      • The allocation ratio (usually set at 1:1 between intervention and control) is varied in response to the accruing observed outcomes – ratio is changed in favor of the emerging better treatment (e.g. from 1:1 to 2:1 etc)
      • The intuitive appeal behind RAR is that patients should theoretically have a higher probability of being allocated the better treatment over time and this could be viewed to be more patient-centred
      • Outcome data can be volatile in the early phases of a trial when the sample size is small, so care must be taken especially early on; some suggest a “burn-in” phase to allow the data to stabilize first
      • It is crucial to actively monitor RAR throughout the trial to ensure the system is working appropriately and the allocation ratios remain controlled.
        • Allocation concealment must also be secure so that the trial results are not inadvertently unblinded
      • Concerns: loss of statistical efficiency, worries about temporal trends affecting the analysis, etc. See Proschan & Evans for further details.
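      • A hedged simulation sketch of the idea (Beta-Bernoulli updating, Thompson-sampling flavor; burn-in length, horizon, and response rates are illustrative):

        set.seed(7)
        p_true <- c(control = 0.30, treatment = 0.45)  # unknown in practice
        succ <- c(control = 0, treatment = 0)
        fail <- c(control = 0, treatment = 0)
        alloc <- character(200)
        for (i in seq_along(alloc)) {
          if (i <= 40) {                      # burn-in: fixed 1:1 allocation
            arm <- c("control", "treatment")[(i %% 2) + 1]
          } else {                            # draw from each arm's Beta posterior
            draws <- rbeta(2, 1 + succ, 1 + fail)
            arm <- names(p_true)[which.max(draws)]
          }
          y <- rbinom(1, 1, p_true[arm])
          succ[arm] <- succ[arm] + y
          fail[arm] <- fail[arm] + 1 - y
          alloc[i] <- arm
        }
        prop.table(table(alloc))  # allocation drifts toward the better arm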
  • Cluster Randomization - One in which intact social units, or clusters of individuals rather than individuals themselves, are randomized to different intervention groups
    • All participants recruited from the practice, school or workplace are allocated to either the intervention or the control group
    • The outcomes of the intervention are still measured at the individual level, but the level at which the comparison is made is the practice, school or workplace
    • Matched Pairs - A sample of clusters is paired according to baseline, cluster-level covariates and, within each pair, one cluster is selected at random for treatment.
    • Advantages
      • Members from intervention and control groups are less likely to have direct contact with each other and are less likely to pass on components of the intervention to the control group. (i.e. contamination)
      • There may also be increased compliance due to group participation.
      • Clusters are typically consistent in their management.
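    • A minimal sketch of matched-pair cluster randomization in base R (hypothetical clinic-level data):

      set.seed(9)
      clinics <- data.frame(id = 1:10, baseline = sort(runif(10)))
      clinics$pair <- rep(1:5, each = 2)  # adjacent baseline values form pairs
      # Within each pair, pick one clinic at random for treatment
      clinics$arm <- as.vector(replicate(5, sample(c("treatment", "control"))))
      clinics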

Treatment Strategy

  • Parallel Group - Subjects are randomized to one or more study arms (aka treatment groups) and each study arm will be allocated a different intervention. After randomization each participant will stay in their assigned treatment arm for the duration of the study
    • Think this is just typical randomization into treatment/control groups but can be extended to include multiple treatment arms.
    • “Change from Baseline” (aka Change Scores) should never be the outcome variable
    • Central Question: For two patients with the same premeasurement value of x, one given treatment A and the other treatment B, will the patients tend to have different post-treatment values of y?
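    • A minimal analysis sketch for that question (simulated data): ANCOVA, i.e. regress the post-treatment value on the premeasurement and treatment, rather than analyzing change scores.

      set.seed(3)
      d <- data.frame(arm = rep(c("A", "B"), each = 50), y_pre = rnorm(100))
      d$y_post <- 0.6 * d$y_pre + 0.4 * (d$arm == "B") + rnorm(100)
      fit <- lm(y_post ~ y_pre + arm, data = d)
      coef(summary(fit))["armB", ]  # treatment effect for a fixed premeasurement value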
  • Crossover Group - Subjects are randomly allocated to study arms where each arm consists of a sequence of two or more treatments given consecutively.
    • i.e. each subject receives more than one treatment and each treatment occurs sequentially over the duration of the study.
    • Example: AB/BA study - Subjects allocated to the AB study arm receive treatment A first, followed by treatment B, and vice versa in the BA arm.
    • Allows the response of a subject to treatment A to be contrasted with the same subject’s response to treatment B.
      • Removing patient variation in this way makes crossover trials potentially more efficient than similar sized, parallel group trials in which each subject is exposed to only one treatment.
    • In theory treatment effects can be estimated with greater precision given the same number of subjects.
    • Misc
      • Best practice is to avoid this design if there is a reasonable chance of Carry Over
      • Also see
        • Senn SJ. Crossover trials in clinical research. Chichester: John Wiley; 1993, 2002.
          • “Readable approach to the problems of designing and analysing crossover trials”
          • See R >> Documents >> Experimental Design
          • Kurz notes on Chapters 3 and 4 with updated R code
    • Issue: Carry Over
      • Effects of one treatment may “carry over” and alter the response to subsequent treatments.
      • (Pre-experiment) Solution: introduce a washout (no treatment) period between consecutive treatments which is long enough to allow the effects of a treatment to wear off.
        • A variation is to restrict outcome measurement to the latter part of each treatment period. Investigators then need to understand the likely duration of action of a given treatment and its potential for interaction with other treatments.
      • Testing for Carry Over
        • If carry over is present the outcome on a given treatment will vary according to its position in the sequence of treatments.
        • Example: Concluding that there was no carry over because an analysis of variance found no statistically significant interaction between treatment sequence and outcome.
        • However, such tests have limited power and cannot rule out a type II error (wrongly concluding there is no carry over effect); see the analysis sketch at the end of this Crossover Group item.
      • (Post-experiment) Solution: If Carry Over is detected:
        • Option 1: Treat the study as though it were a parallel group trial and confine analysis to the first period alone.
          • The advantages of the crossover are lost, with the wasted expense of discarding the data from the second period.
          • More importantly, the significance test comparing the first periods may be invalid
        • Option 2 (applicable only to studies with at least three treatment periods, e.g. ABB/BAA)
          • Model the carry over effect and use it to adjust the treatment estimate.
          • Such approaches, while statistically elegant, are based on assumptions which can rarely be justified in practice.
          • See Senn paper above
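    • A hedged analysis sketch for an AB/BA design with lme4 (simulated data): treatment, period, and sequence as fixed effects and subject as a random intercept; the sequence term is the usual low-power carry over check discussed above.

      library(lme4)
      set.seed(13)
      n <- 30  # subjects per sequence
      d <- data.frame(
        subject  = factor(rep(1:(2 * n), each = 2)),
        sequence = rep(c("AB", "BA"), each = 2 * n),
        period   = factor(rep(1:2, times = 2 * n))
      )
      d$treatment <- ifelse((d$sequence == "AB") == (d$period == 1), "A", "B")
      d$y <- 0.5 * (d$treatment == "B") + 0.2 * (d$period == 2) +
        rnorm(2 * n)[as.integer(d$subject)] + rnorm(4 * n, sd = 0.5)
      fit <- lmer(y ~ treatment + period + sequence + (1 | subject), data = d)
      summary(fit)  # sequenceBA coefficient = sequence (carry over-related) term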
  • Randomized designs are classified as completely randomized design, complete block design, randomized block design, Latin square design, split-plot design, crossover design, family block design, stepped-wedge cluster design, etc.
    • Completely Randomized Parallel Group trial - Any given center will have some patients randomly allocated to intervention and some to control. Randomization includes centers (i.e. a patient is randomly assigned both treatment/control and the center at which they will receive it)
      • Parallel Group Blocked by Center - Randomization happens within each center (i.e. each center handles their own randomization). Treatment/Control ratio is the same for each center.
        • “Center” should be included as a variable in the model.
    • Cluster-Randomized trial - Randomly allocate some centers to dispense the intervention and some the control
      • Fundamental unit of inference becomes the center and patients are regarded as repeated measures on it
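      • A hedged sketch of the corresponding analysis with lme4 (simulated data): centers are the randomized units; patients are repeated measures within center, handled with a random intercept.

        library(lme4)
        set.seed(11)
        centers <- data.frame(center = 1:20,
                              arm = sample(rep(c("control", "treatment"), 10)))
        d <- merge(data.frame(center = rep(1:20, each = 15)), centers)
        d$y <- 0.3 * (d$arm == "treatment") +
          0.5 * rnorm(20)[d$center] + rnorm(nrow(d))
        fit <- lmer(y ~ arm + (1 | center), data = d)
        summary(fit)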
  • Examples
    • The effects of a leading mindfulness meditation app (Headspace) on mental health, productivity, and decision-making (Paper)
      • RCT with 2,384 US adults recruited via social media ads.
      • Four-week experiment
        • first group is given free access to the app (worth $13)
        • second group receives, in addition, a $10 incentive to use the app on at least four or ten separate days during the first two weeks
        • third group serves as a (waitlist) control group

Sources of Bias

  • Selection Bias - Occurs when there are systematic differences between baseline characteristics of groups.
    • If the assignment was not properly randomized, or the random assignment was not sufficiently concealed (i.e. poor allocation concealment), then the person enrolling participants may have been aware of the allocation sequence and influenced which patients were assigned to each group based on their prognostic factors
    • Example: if groups are not comparable on key demographic factors, then between-group differences in treatment outcomes cannot necessarily be attributed solely to the study intervention.
    • Example: The assignment of patients to a group is influenced by knowledge of which treatment they will receive
    • Solutions:
      • Randomized Assignment - RCTs attempt to address selection bias by randomly assigning participants to groups – but it is still important to assess whether randomization was done well enough to eliminate the influence of confounding variables.
      • Blinding - participants and investigators should remain unaware of which group participants are assigned to.
  • Performance Bias - Refers to systematic differences between groups that occur during the study; can lead to overestimated treatment effects (e.g. via nonspecific effects of the physical component of interventions)
    • Example: if participants know that they are in the active treatment rather than the control condition, this could create positive expectations that have an impact on treatment outcome beyond that of the intervention itself.
    • Solution: Blinding - participants and investigators should remain unaware of which group participants are assigned to.
      • More easily achieved in medication trials than in surgical trials
  • Detection Bias - Refers to systematic differences in the way outcomes are determined.
    • Example: if providers in a psychotherapy trial are aware of the investigators’ hypotheses, this knowledge could unconsciously influence the way they rate participants’ progress.
    • Solution: Attention to conflicts of interest and Blinding (also see Performance Bias) - RCTs address this by utilizing independent outcome assessors who are blind to participants’ assigned treatment groups and investigators’ expectations.
  • Attrition Bias - occurs when there are systematic differences between groups in withdrawals from a study.
    • It’s common for participants to drop out of a trial before or in the middle of treatment, and researchers who only include those who completed the protocol in their final analyses are not presenting the full picture.
    • Solution: Intention to Treat analysis - Analyses should include all participants who were randomized into the study, and not only participants who completed some or all of the intervention.
  • Reporting Bias - Refers to systematic differences between reported and unreported data.
    • Example: publication bias - occurs because studies with positive results are more likely to be published, and tend to be published more quickly, than studies with findings supporting the null hypothesis.
    • Example: outcome reporting bias - occurs when researchers only write about study outcomes that were in line with their hypotheses.
    • Solution: Requirements that RCT protocols be published in journals or on trial registry websites, which allows for confirmation that all primary outcomes are reported in study publications.
  • Other Bias - A catch-all category that includes specific situations not covered by the above domains.
    • Includes bias that can occur when study interventions are not delivered with fidelity, or when there is “contamination” between experimental and control interventions within a study (for example, participants in different treatment conditions discussing the interventions they are receiving with each other).