Simulation, Data
Misc
- Todo - for the “trtAssign” mess with ratio and the number of ratios
- Also see
- Bkmks Data >> Data Simulation
- Pandas-Time Series-Simulation
- A Gaussian and Standard GARCH time-series that’s frequently encountered in econometrics
- Packages
- {simChef} (Vignette) - Rapidly plan, carry out, and summarize statistical simulation studies in a flexible, efficient, and low-code manner.
- Intuitive tidy grammar of data science simulations
- Powerful abstractions for distributed simulation processing backed by {future}
- Automated generation of interactive R Markdown simulation documentation, situating results next to the workflows needed to reproduce them.
- {structmcmc} - A set of tools for performing structural inference for Bayesian Networks using MCMC
- Papers
{simstudy}
Misc
Reference
Available distributions (link)
- Probability Distributions
- nonrandom: For constants; can be a numeric or a string with a formula that defines a dependency on another variable
- clusterSize: For variable cluster sizes but a constant total sample size
- formula: The (fixed) total sample size
- variance: A (non-negative) dispersion measure that represents the variability of size across clusters
- If the dispersion is set to 0, then cluster sizes are constant
- trtAssign: For treatment assignment
- formula: The group ratio, separated by semicolons; the number of values determines the number of treatment groups
  - e.g. 2 values = 2 groups, and “1;2” says group 2 has twice as many units as group 1
- variance: Stratification variables; the ratio in formula is used as the stratification ratio (e.g. unbalanced treatment groups → unbalanced stratification)
Example
```r
def <- defData(def, varname = "rx", dist = "trtAssign",
               formula = "1;1;2", variance = "male;over65")

count(studytbl, rx)
#> # A tibble: 3 × 2
#>      rx     n
#>   <int> <int>
#> 1     1    84
#> 2     2    82
#> 3     3   164

count(studytbl, male, rx)
#> # A tibble: 6 × 3
#>    male    rx     n
#>   <int> <int> <int>
#> 1     0     1    40
#> 2     0     2    39
#> 3     0     3    78
#> 4     1     1    44
#> 5     1     2    43
#> 6     1     3    86

count(studytbl, over65, rx)
#> # A tibble: 6 × 3
#>  over65    rx     n
#>   <int> <int> <int>
#> 1      0     1    66
#> 2      0     2    65
#> 3      0     3   130
#> 4      1     1    18
#> 5      1     2    17
#> 6      1     3    34
```
Functions
defData(dtDefs = NULL, varname, formula, variance = 0, dist = "normal", link = "identity", id = "id")
- Initially creates a data.table, or adds a column to a data.table, with instructions about creating a variable
- formula: Numeric constant or string formula for the mean, probability of event (binary), probability of success (binomial), etc.
defDataAdd(dtDefs = NULL, varname, formula, variance = 0, dist = "normal", link = "identity")
- Creates a variable definition like defData, but is used to augment an already generated dataset. Used as input to addColumns, which will generate the variable data from the instructions in this object and add it as a column to the already generated dataset.
genCluster(dtClust, cLevelVar, numIndsVar, level1ID, allLevel2 = TRUE)
- After generating cluster-level data, this function takes the number of clusters and the sizes of each cluster from that data, and does something like expand.grid to generate an individual-level dataset. Also adds an id variable.
- dtClust: Cluster-Level Data
- cLevelVar: Cluster variable from the cluster-level data
- numIndsVar: Variable with the number of units per cluster from the cluster-level data
- level1ID: Name you want for your individual-level ID variable
Variable Dependence
- Binary depends on a Binary
Definitions
```r
def <- defData(varname = "male", dist = "binary", formula = .5, id = "cid")
def <- defData(def, varname = "over65", dist = "binary",
               formula = "-1.7 + .8*male", link = "logit")
```
What’s happening
```r
male <- c(1, 1, 0, 1, 0, 0, 0, 1, 0, 1)
logits <- -1.7 + 0.8 * male
probabilities <- boot::inv.logit(logits)
over65 <- rbinom(n = 10, size = 1, prob = probabilities)
```
- The formula in the logits line defines the relationship between being male and being over 65 years old.
- Males in this sample have a higher probability (0.2890505) of being over 65 years old than females (0.1544653)
- To sample from a Bernoulli distribution, set size = 1
- over65 is an indicator where each value is determined by a separate probability parameter for a Bernoulli distribution
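The two probabilities quoted above can be checked directly with base R's plogis (the inverse logit, equivalent to boot::inv.logit):

```r
# Inverse logit of the linear predictor for males (male = 1) and females (male = 0)
plogis(-1.7 + 0.8 * 1)  # 0.2890505
plogis(-1.7 + 0.8 * 0)  # 0.1544653
```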
Clustered with Cluster-Level Random Effect
Example: Fixed Cluster sizes; Balanced
Cluster Definitions
```r
d0 <- defData(varname = "n", formula = 20, dist = "nonrandom")
d0 <- defData(d0, varname = "a", formula = 0, variance = 0.33)
d0 <- defData(d0, varname = "rx", formula = "1;1", dist = "trtAssign")

d1 <- defDataAdd(varname = "y", formula = "18 + 1.6 * rx + a",
                 variance = 16, dist = "normal")
```
- n: sample size for the cluster
- dist = “nonrandom” and formula = 20 says to use a constant cluster size
- rx: treatment indicator
- dist = “trtAssign” and formula = “1;1” says 2 treatment groups and they’re balanced
- y: the individual-level outcome is a function of the treatment assignment and the cluster effect, as well as random individual-level variation
- a: the cluster-level random effect
- Random effects are sampled from \(\mathcal{N}(0, \sigma^2)\), where the variance, \(\sigma^2\), is what's typically estimated in a Mixed Effects model.
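As a base-R sketch of that sampling step (outside simstudy): the variance argument is a variance, so the equivalent rnorm draw passes its square root as the standard deviation.

```r
# Draw 10 cluster-level random effects with variance 0.33;
# rnorm takes a standard deviation, hence sqrt(0.33)
a <- rnorm(10, mean = 0, sd = sqrt(0.33))
```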
Generate Cluster-Level Data
```r
set.seed(2761)

dc <- genData(10, d0, "site")
dc
##     site  n       a rx
##  1:    1 20 -0.3548  1
##  2:    2 20 -1.1232  1
##  3:    3 20 -0.5963  0
##  4:    4 20 -0.0503  1
##  5:    5 20  0.0894  0
##  6:    6 20  0.5294  1
##  7:    7 20  1.2302  0
##  8:    8 20  0.9663  1
##  9:    9 20  0.0993  0
## 10:   10 20  0.6508  0
```
- Generates 10 clusters labelled as site according to the instructions in d0
Generate Individual Level Data
```r
dd <- genCluster(dc, "site", "n", "id")
dd <- addColumns(d1, dd)
dd
##      site  n      a rx  id    y
##   1:    1 20 -0.355  1   1 17.7
##   2:    1 20 -0.355  1   2 16.2
##   3:    1 20 -0.355  1   3 19.2
##   4:    1 20 -0.355  1   4 20.6
##   5:    1 20 -0.355  1   5 14.7
##  ---
## 196:   10 20  0.651  0 196 25.3
## 197:   10 20  0.651  0 197 22.1
## 198:   10 20  0.651  0 198 13.2
## 199:   10 20  0.651  0 199 15.6
## 200:   10 20  0.651  0 200 13.8
```
- genCluster performs an expand.grid-like operation to generate an individual-level dataset and adds an ID variable
- addColumns uses the individual-level data and the outcome variable definition to generate the outcome variable and add it to the dataset.
Example: Varying Cluster Sizes and therefore Varying Sample Size
```r
d0 <- defData(varname = "n", formula = 20, dist = "poisson")
genData(10, d0, "site")
##     site  n
##  1:    1 13
##  2:    2 18
##  3:    3 21
##  4:    4 26
##  5:    5 25
##  6:    6 27
##  7:    7 23
##  8:    8 30
##  9:    9 23
## 10:   10 20
```
- formula sets the Poisson distribution parameter, \(\lambda = 20\), so cluster sizes are sampled from a Poisson distribution with that mean/variance
- To increase the variability between clusters, use the negative binomial distribution
- Most likely leads to an unbalanced design
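A sketch of the negative binomial alternative, assuming simstudy's convention that formula is the mean and variance is the dispersion parameter for dist = "negBinomial":

```r
library(simstudy)

# Same mean cluster size as the Poisson example, but with extra
# between-cluster variability controlled by the dispersion parameter
d0 <- defData(varname = "n", formula = 20, variance = 0.5, dist = "negBinomial")
genData(10, d0, "site")
```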
Example: Varying Cluster Sizes but Constant Sample Size
```r
# Moderately varying cluster sizes
d0 <- defData(varname = "n", formula = 200, variance = 0.2, dist = "clusterSize")
genData(10, d0, "site")
##     site  n
##  1:    1 20
##  2:    2 28
##  3:    3 25
##  4:    4 24
##  5:    5 28
##  6:    6 22
##  7:    7  7
##  8:    8 13
##  9:    9 22
## 10:   10 11

# Very highly varying cluster sizes
d0 <- defData(varname = "n", formula = 200, variance = 5, dist = "clusterSize")
genData(10, d0, "site")
##     site   n
##  1:    1  10
##  2:    2   2
##  3:    3  17
##  4:    4   2
##  5:    5  49
##  6:    6 110
##  7:    7   1
##  8:    8   4
##  9:    9   1
## 10:   10   4
```
- Total sample size is fixed at 200 (formula), but individual cluster sizes are allowed to vary.
- variance: A dispersion parameter that controls how much the cluster sizes vary
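A quick check of the fixed-total property (a sketch; assumes genData allocates exactly the formula total across clusters, as both outputs above suggest):

```r
library(simstudy)

d0 <- defData(varname = "n", formula = 200, variance = 0.2, dist = "clusterSize")
dc <- genData(10, d0, "site")
sum(dc$n)  # the split varies across runs, but the total stays at 200
```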
{DeclareDesign}
Misc
- Docs, Website, EBook
- Notes from
- Sample Size Calculations in R: Money vs. Power
- Shows how to incorporate cost into the design. If cost is just a function of sample size, then I’m not sure how useful it is since it’s a simple calculation that can be done on your own.
Basics
- Process
- Define the data generating process
- Define the model function
  - Input: A data frame called data
  - Output: A 1-row data frame with mandatory and optional columns
    - Mandatory: term, estimate
    - Optional: std.error, conf.low, conf.high, p.value, cost, etc.
- Declare the true value of the quantity of interest
- Combine objects, simulate, and analyze the results
- Compare results of various inputs, models, etc.
- Example: Power Calculation for a Simple Linear Model
DGP
```r
library(marginaleffects); library(DeclareDesign)

set.seed(2024)

sample_size <- 100
treatment_effect <- 0.2

dgp <- declare_model(
  N = sample_size,
  # standard normal noise
  e = rnorm(N, mean = 0, sd = 1),
  # random dummy treatment with equal probability of treatment and control
  D = rbinom(N, size = 1, prob = 0.5),
  # intercept
  b0 = 0,
  # outcome equation
  Y = b0 + treatment_effect * D + e
)

d <- dgp()
head(d, 3)
#>    ID          e D b0          Y
#> 1 001  0.9819694 1  0 1.18196941
#> 2 002  0.4687150 0  0 0.46871504
#> 3 003 -0.1079713 1  0 0.09202867
```
- Note that the sample size and treatment effect size are defined outside the functions. This is because they are variables we intend to vary in order to compare the results.
- declare_model is a function factory (i.e. returns a function)
- Each call of dgp will produce a different dataset.
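To see the function-factory behavior in isolation, here is a minimal sketch (a simplified model, not the full DGP above): each call of the returned function re-draws the random components, so consecutive datasets differ.

```r
library(DeclareDesign)

# A toy model: 100 rows of standard normal noise
dgp <- declare_model(
  N = 100,
  e = rnorm(N, mean = 0, sd = 1),
  Y = e
)

d1 <- dgp()
d2 <- dgp()
identical(d1, d2)  # FALSE (almost surely): each call re-draws e
```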
Model
```r
fit <- function(data) {
  model <- lm(Y ~ D, data = data)
  #> out <- data.frame(
  #>   "term" = "OLS",
  #>   "estimate" = coef(model)["D"]
  #> )
  out <- marginaleffects::avg_comparisons(model, variables = "D")
  return(out)
}

fit(d)
#>  term contrast   estimate std.error statistic   p.value  s.value   conf.low conf.high predicted_lo predicted_hi predicted
#> 1    D    1 - 0 0.07115999 0.1984708 0.3585414 0.7199382 0.474055 -0.3178356 0.4601556    0.0545284    0.1256884 0.0545284
```
- In more complex models, with interactions or non-linear components, avg_comparisons is a convenient way to estimate the average treatment effect (ATE), and it returns results in the “tidy” format required by DeclareDesign
- The manual way (the commented-out lines) is included to show how to define the output in case you aren’t using a model covered by {marginaleffects}
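As a fuller sketch of that manual route (the column names follow the mandatory/optional tidy convention listed earlier; the helper name fit_manual is hypothetical):

```r
# Hypothetical hand-rolled estimator returning the tidy columns
# DeclareDesign expects: term and estimate, plus optional extras
fit_manual <- function(data) {
  model <- lm(Y ~ D, data = data)
  est <- summary(model)$coefficients["D", ]
  ci <- confint(model)["D", ]
  data.frame(
    term      = "OLS",
    estimate  = est[["Estimate"]],
    std.error = est[["Std. Error"]],
    p.value   = est[["Pr(>|t|)"]],
    conf.low  = ci[[1]],
    conf.high = ci[[2]]
  )
}
```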
Truth and Results
```r
truth <- declare_inquiry(ATE = treatment_effect)
estimator <- declare_estimator(handler = fit)

design <- dgp + estimator + truth

diagnose_design(design)
#> Research design diagnosis based on 500 simulations. Diagnosis completed in 9 secs.
#>
#> Design Inquiry Term Mean Estimand Mean Estimate  Bias SD Estimate RMSE Power Coverage N Sims
#> design     ATE    D          0.20          0.20 -0.00        0.20 0.20  0.17     0.96    500
```
- These simulation results suggest that our estimation strategy yields unbiased estimates. Unfortunately, the statistical power of our research design is very low.
Compare
```r
design_list <- redesign(design,
                        sample_size = c(100, 500),
                        treatment_effect = c(0.2, 0.5))

diagnose_design(design_list)
#> Research design diagnosis based on 500 simulations. Diagnosis completed in 9 secs.
#>
#> Design   sample_size treatment_effect Inquiry Term Mean Estimand Mean Estimate  Bias SD Estimate RMSE Power Coverage N Sims
#> design_1         100              0.2     ATE    D          0.20          0.20 -0.00        0.21 0.21  0.19     0.95    500
#> design_2         500              0.2     ATE    D          0.20          0.19 -0.01        0.09 0.09  0.56     0.94    500
#> design_3         100              0.5     ATE    D          0.50          0.51  0.01        0.20 0.20  0.71     0.94    500
#> design_4         500              0.5     ATE    D          0.50          0.51  0.01        0.09 0.09  1.00     0.95    500
```
- Still very little bias. Increasing the sample size by 5x increased the power by around 3x (0.56); increasing the effect size by 2.5x increased the power by around 4x (0.71); increasing both increased the power by around 5x (1.00)
Examples
Example: Model Comparison
```r
library(marginaleffects); library(DeclareDesign)

sample_size <- 100
treatment_effect <- 0.2

# DGP
dgp <- declare_model(
  N = sample_size,
  e = rnorm(N, mean = 0, sd = 1),
  D = rbinom(N, size = 1, prob = 0.5),
  # random normal covariates
  fabricatr::draw_multivariate(c(X1, X2) ~ MASS::mvrnorm(
    n = N,
    mu = c(0, 0),
    Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2)
  )),
  Y = 1 + treatment_effect * D + 1 * X1 + 1 * X2 + e
)

# Model
fit <- function(data) {
  # Linear model with regression adjustment for covariates
  adjusted <- lm(Y ~ D * (X1 + X2), data = data)
  adjusted <- avg_comparisons(adjusted, variables = "D", vcov = "HC3")
  # Linear model with a single binary predictor
  unadjusted <- lm(Y ~ D, data = data)
  unadjusted <- avg_comparisons(unadjusted, variables = "D", vcov = "HC3")
  # Combine the results
  out <- rbind(adjusted, unadjusted)
  # Label the results
  out$term <- c("Adjusted", "Unadjusted")
  return(out)
}

dgp() |> fit()
#>        Term          Contrast Estimate Std. Error      z Pr(>|z|)   S  2.5 % 97.5 %
#>    Adjusted mean(1) - mean(0)   -0.045      0.207 -0.217    0.828 0.3 -0.451  0.361
#>  Unadjusted mean(1) - mean(0)   -0.307      0.374 -0.820    0.412 1.3 -1.040  0.426

# Truth and Compare
truth <- declare_inquiry(ATE = treatment_effect)
estimators <- declare_estimator(handler = fit)

design <- dgp + truth + estimators

diagnose_design(design)
#> Design Inquiry       Term N Sims Mean Estimand Mean Estimate   Bias SD Estimate   RMSE  Power Coverage
#> design     ATE   Adjusted    500          0.20          0.20   0.00        0.21   0.21   0.17     0.95
#>                                          (0.00)        (0.01) (0.01)      (0.01) (0.01) (0.02)   (0.01)
#> design     ATE Unadjusted    500          0.20          0.20  -0.00        0.42   0.42   0.08     0.93
#>                                          (0.00)        (0.02) (0.02)      (0.01) (0.01) (0.01)   (0.01)
```
- Bootstrapped standard errors are shown in parentheses beneath each estimate
- No bias, and, as expected, the regression with the adjustment variables has more power.