Simulation, Data

Misc

{simstudy}

Misc

Reference

  • Available distributions (link)

    • Probability Distributions
    • nonrandom: For constants; can be a numeric or a string with a formula that defines a dependency on another variable
    • clusterSize: For variable cluster sizes but a constant total sample size
      • formula: The (fixed) total sample size
      • variance: A (non-negative) dispersion measure that represents the variability of size across clusters
        • If the dispersion is set to 0, then cluster sizes are constant
    • trtAssign: For treatment assignment
      • formula: Ratio which is separated by semicolons and number of treatments

        • e.g. 2 values = 2 groups and “1;2” says group 2 has twice as many units and group 1
      • variance: Stratification; ratio in formula is used as the stratification ratio (e.g. unbalanced treatment groups → unbalanced stratification)

      • Example

        def <- 
          defData(def, 
                  varname = "rx", 
                  dist = "trtAssign",
                  formula = "1;1;2", 
                  variance = "male;over65")
        
        count(studytbl, rx)
        #> # A tibble: 3 × 2
        #>     rx    n
        #>   <int> <int>
        #> 1    1    84
        #> 2    2    82
        #> 3    3  164
        
        count(studytbl, male, rx)
        #> # A tibble: 6 × 3
        #>   male    rx    n
        #>   <int> <int> <int>
        #> 1    0    1    40
        #> 2    0    2    39
        #> 3    0    3    78
        #> 4    1    1    44
        #> 5    1    2    43
        #> 6    1    3    86
        
        count(studytbl, over65, rx)
        #> # A tibble: 6 × 3
        #>   over65    rx    n
        #>   <int> <int> <int>
        #> 1      0    1    66
        #> 2      0    2    65
        #> 3      0    3  130
        #> 4      1    1    18
        #> 5      1    2    17
        #> 6      1    3    34
  • Functions

    • defData(dtDefs = NULL, varname, formula, variance = 0, dist = "normal", link = "identity", id = "id") - Initially creates a data.table or adds a column to a data.table with instructions about creating a variable
      • formula: Numeric constant or string formula for the mean, probability of event (binary), probability of success (binomial), etc.
    • defDataAdd(dtDefs = NULL, varname, formula, variance = 0, dist = "normal", link = "identity") - Creates a variable definition like defData but is used to augment a already generated dataset. Used as input to addColumns which will generate the variable data from the instructions in this object and add it as a column to the already generated dataset.
    • genCluster(dtClust, cLevelVar, numIndsVar, level1ID, allLevel2 = TRUE) - After generating cluster-level data, this function takes the number of clusters and the sizes of each cluster from that data, and does something like expand.grid to generate an individual-level dataset. Also, adds an id variable.
      • dtClust: Cluster-Level Data
      • cLevelVar: Cluster variable from the cluster-level data
      • numIndsvar: Variable with the number of units per cluster from the cluster-level data
      • level1ID: Name you want for your individual-level ID variable

Variable Dependence

  • Binary depends on a Binary
    • Definitions

      def <- defData(varname = "male", dist = "binary",
                     formula = .5 , id="cid")
      def <- defData(def, varname = "over65", dist = "binary",
                     formula = "-1.7 + .8*male", link="logit")
    • What’s happening

      male <- c(1,1,0,1,0,0,0,1,0,1)
      logits <- -1.7 + 0.8 * male
      probabilities <- boot::inv.logit(logits)
      over65 <- rbinom(n = 10, size = 1, prob = probabilities)
      • The formula in the logits line defines the relationship between being male and being over 65yrs old.
      • Males in this sample will have a higher probability (0.2890505) of being over 65yrs old than females (0.1544653)
      • To sample from a Bernoulli distribution, set size = 1
      • over65 is an indicator where each value is determined by a separate probability parameter for a Bernoulli distribution

Clustered with Cluster-Level Random Effect

  • Example: Fixed Cluster sizes; Balanced

    • Cluster Definitions

      d0 <- defData(varname = "n", formula = 20, dist = "nonrandom")
      d0 <- defData(d0, varname = "a", formula = 0, variance = 0.33)
      d0 <- defData(d0, varname = "rx", formula = "1;1", dist = "trtAssign")
      d1 <- defDataAdd(varname = "y", formula = "18 + 1.6 * rx + a",
                       variance = 16, dist = "normal")
      • n: sample size for the cluster
        • dist = “nonrandom” and formula = 20 says use a constant for the cluster sizer
      • rx: treatment indicator
        • dist = “trtAssign” and formula = “1;1” says 2 treatment groups and they’re balanced
      • y: the individual-level outcome is a function of the treatment assignment and the cluster effect, as well as random individual-level variation
      • a: random individual-level variation (i.e. random effect)
        • Random Effects are sampled from \(\mathcal{N}(0, \sigma)\) where the variance is typically estimated in a Mixed Effects model.
    • Generate Cluster-Level Data

      set.seed(2761)
      dc <- genData(10, d0, "site")
      dc
      ##    site  n      a rx
      ##  1:    1 20 -0.3548  1
      ##  2:    2 20 -1.1232  1
      ##  3:    3 20 -0.5963  0
      ##  4:    4 20 -0.0503  1
      ##  5:    5 20  0.0894  0
      ##  6:    6 20  0.5294  1
      ##  7:    7 20  1.2302  0
      ##  8:    8 20  0.9663  1
      ##  9:    9 20  0.0993  0
      ## 10:  10 20  0.6508  0
      • Generates 10 clusters labelled as site according to the instructions in d0
    • Generate Individual Level Data

      dd <- genCluster(dc, "site", "n", "id")
      dd <- addColumns(d1, dd)
      dd
      ##      site  n      a rx  id    y
      ##  1:    1 20 -0.355  1  1 17.7
      ##  2:    1 20 -0.355  1  2 16.2
      ##  3:    1 20 -0.355  1  3 19.2
      ##  4:    1 20 -0.355  1  4 20.6
      ##  5:    1 20 -0.355  1  5 14.7
      ##  ---                           
      ## 196:  10 20  0.651  0 196 25.3
      ## 197:  10 20  0.651  0 197 22.1
      ## 198:  10 20  0.651  0 198 13.2
      ## 199:  10 20  0.651  0 199 15.6
      ## 200:  10 20  0.651  0 200 13.8
      • genCluster performs an expand.grid to generate an individual-level dataset along with adding an ID variable
      • addColumns uses individual-level data and outcome variable definition to generate the outcome variable and add it to the dataset.
  • Example: Varying Cluster Sizes and therefore Varying Sample Size

    d0 <- defData(varname = "n", formula = 20, dist = "poisson")
    genData(10, d0, "site")
    ##    site  n
    ##  1:    1 13
    ##  2:    2 18
    ##  3:    3 21
    ##  4:    4 26
    ##  5:    5 25
    ##  6:    6 27
    ##  7:    7 23
    ##  8:    8 30
    ##  9:    9 23
    ## 10:  10 20
    • Formula sets the poisson distribution parameter, \(\lambda = 20\). So sizes are sampled from poisson distribution with that mean/variance
    • To increase the variability between clusters, use the negative binomial distribution
    • Most likely leads to an unbalanced design
  • Example: Varying Cluster Sizes but Constant Sample Size

    # moderately varying cluster sizes
    d0 <- defData(varname = "n", formula = 200, variance = 0.2, dist = "clusterSize")
    genData(10, d0, "site")
    
    ##    site  n
    ##  1:    1 20
    ##  2:    2 28
    ##  3:    3 25
    ##  4:    4 24
    ##  5:    5 28
    ##  6:    6 22
    ##  7:    7  7
    ##  8:    8 13
    ##  9:    9 22
    ## 10:  10 11
    
    # Very highly varying cluster sizes
    d0 <- defData(varname = "n", formula = 200, variance = 5, dist = "clusterSize")
    genData(10, d0, "site")
    ##    site  n
    ##  1:    1  10
    ##  2:    2  2
    ##  3:    3  17
    ##  4:    4  2
    ##  5:    5  49
    ##  6:    6 110
    ##  7:    7  1
    ##  8:    8  4
    ##  9:    9  1
    ## 10:  10  4
    • Total sample size is fixed at 200 (formula), but individual cluster sizes are allowed to vary.
    • variance: A dispersion parameter that controls the amount of varying of the cluster sizes

{DeclareDesign}

Misc

Basics

  • Process
    • Define the data generating process
    • Define the model function
      • Input:
        • Data frame called data
      • Output:
        • 1-row data frame with mandatory and optional columns.
          • Mandatory: term, estimate
          • Optional: std.error, conf.low, conf.high, p.value, cost, etc.
    • Declare the rue value of the quantity of interest
    • Combine objects, simulate, and analyze the results
    • Compare results of various inputs, models, etc.
  • Example: Power Calculationn for Simple linear model
    • DGP

      library(marginaleffects); library(DeclareDesign)
      set.seed(2024)
      
      sample_size <- 100
      treatment_effect <- 0.2
      
      dgp <- declare_model(
        N = sample_size,
      
        # standard normal noise
        e = rnorm(N, mean = 0, sd = 1),
      
        # random dummy treatment with equal probability of treatment and control
        D = rbinom(N, size = 1, prob = 0.5),
      
        # intercept
        b0 = 0,
      
        # outcome equation
        Y = b0 + treatment_effect * D + e
      )
      
      d <- dgp()
      head(d, 3)
      #>    ID          e D b0          Y
      #> 1 001  0.9819694 1  0 1.18196941
      #> 2 002  0.4687150 0  0 0.46871504
      #> 3 003 -0.1079713 1  0 0.09202867
      • Note that sample size and estimated effect size are defined outside the functions. This is because they are variables we intend vary in order to compare the results.
      • declare_model is a function factory (i.e. returns a function)
      • Each call of dgp will produce a different dataset.
    • Model

      fit <- function(data) {
        model <- lm(Y ~ D, data = data)
        #> out <- data.frame(
        #>   "term" = "OLS",
        #>   "estimate" = coef(model)["D"]
        #> )
        out <- 
          marginaleffects::avg_comparisons(model, 
                                           variables = "D")
        return(out)
      }
      
      fit(d)
      #>   term contrast   estimate std.error statistic   p.value  s.value   conf.low conf.high predicted_lo predicted_hi predicted
      #> 1    D    1 - 0 0.07115999 0.1984708 0.3585414 0.7199382 0.474055 -0.3178356 0.4601556    0.0545284    0.1256884 0.0545284
      • In more complex models, with interactions or non-linear components, avg_comparisons will be a convenient way to estimate the average treatment effect (ATE), and it will return results in the “tidy” format required by DeclareDesign

      • The manual way is included to show how to define the output in case you aren’t using a model covered by {marginaleffects}

    • Truth and Results

      truth <- declare_inquiry(ATE = treatment_effect)
      
      design <- dgp + estimator + truth
      
      diagnose_design(design)
      #> Research design diagnosis based on 500 simulations. Diagnosis completed in 9 secs.
      #> 
      #>  Design Inquiry Term Mean Estimand Mean Estimate  Bias SD Estimate RMSE Power Coverage N Sims
      #>  design     ATE    D          0.20          0.20 -0.00        0.20 0.20  0.17     0.96    500
      • These simulation results suggest that our estimation strategy yields unbiased estimates. Unfortunately, the statistical power of our research design is very low.
    • Compare

      design_list <- 
        redesign(design,
                 sample_size = c(100, 500),
                 treatment_effect = c(0.2, 0.5)
        )
      
      diagnose_design(design_list)
      #> Research design diagnosis based on 500 simulations. Diagnosis completed in 9 secs.
      #> 
      #>    Design sample_size treatment_effect Inquiry Term Mean Estimand Mean Estimate  Bias SD Estimate RMSE Power Coverage N Sims
      #>  design_1         100              0.2     ATE    D          0.20          0.20 -0.00        0.21 0.21  0.19     0.95    500
      #>  design_2         500              0.2     ATE    D          0.20          0.19 -0.01        0.09 0.09  0.56     0.94    500
      #>  design_3         100              0.5     ATE    D          0.50          0.51  0.01        0.20 0.20  0.71     0.94    500
      #>  design_4         500              0.5     ATE    D          0.50          0.51  0.01        0.09 0.09  1.00     0.95    500
      • Still very little bias. Increasing the sample size by 5x increased the power by around 3x (0.56). Increasing the detectable effect size by 2.5x increased the power by around 4x (0.71). Increasing both increased the power by around 5x (1.00)

Examples

  • Example: Model Comparison

    library(marginaleffects); library(DeclareDesign)
    
    sample_size <- 100
    treatment_effect <- 0.2
    
    # DGP
    dgp <- declare_model(
      N = sample_size,
      e = rnorm(N, mean = 0, sd = 1),
      D = rbinom(N, size = 1, prob = 0.5),
    
      # random normal covariates
      fabricatr::draw_multivariate(c(X1, X2) ~ MASS::mvrnorm(
        n = N,
        mu = c(0, 0),
        Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2)
      )),
    
      Y = 1 + treatment_effect * D + 1 * X1 + 1 * X2 + e
    )
    
    # Model
    fit <- function(data) {
      # Linear model with regression adjustment for covariates
      adjusted <- lm(Y ~ D * (X1 + X2), data = data)
      adjusted <- 
        avg_comparisons(adjusted, 
                        variables = "D", 
                        vcov = "HC3")
    
      # Linear model with a single binary predictor
      unadjusted <- lm(Y ~ D, data = data)
      unadjusted <- 
        avg_comparisons(unadjusted, 
                        variables = "D", 
                        vcov = "HC3")
    
      # Combine the results
      out = rbind(adjusted, unadjusted)
    
      # Label the results
      out$term = c("Adjusted", "Unadjusted")
      return(out)
    }
    
    dgp() |> fit()
    #>        Term          Contrast Estimate Std. Error      z Pr(>|z|)   S  2.5 % 97.5 %
    #>  Adjusted   mean(1) - mean(0)   -0.045      0.207 -0.217    0.828 0.3 -0.451  0.361
    #>  Unadjusted mean(1) - mean(0)   -0.307      0.374 -0.820    0.412 1.3 -1.040  0.426
    
    # Truth and Compare
    truth <- declare_inquiry(ATE = treatment_effect)
    estimators <- declare_estimator(handler = fit)
    design <- dgp + truth + estimators
    diagnose_design(design)
    #>  Design Inquiry       Term N Sims Mean Estimand Mean Estimate   Bias SD Estimate   RMSE  Power Coverage
    #>  design     ATE   Adjusted    500          0.20          0.20   0.00        0.21   0.21   0.17     0.95
    #>                                          (0.00)        (0.01) (0.01)      (0.01) (0.01) (0.02)   (0.01)
    #>  design     ATE Unadjusted    500          0.20          0.20  -0.00        0.42   0.42   0.08     0.93
    #>                                          (0.00)        (0.02) (0.02)      (0.01) (0.01) (0.01)   (0.01)
    • Bootstrapped standard errors are in parentheses in the bottom row

    • No bias and as expected, the regression with the adjustment variables has more power.