Simulation, Data
Misc
- Todo - for the “trtAssign” mess with ratio and the number of ratios
- Also see
- Bkmks Data >> Data Simulation
- Pandas-Time Series-Simulation
- A Gaussian and Standard GARCH time-series that’s frequently encountered in econometrics
- Packages
- {simChef} (Vignette) - Rapidly plan, carry out, and summarize statistical simulation studies in a flexible, efficient, and low-code manner.
- Intuitive tidy grammar of data science simulations
- Powerful abstractions for distributed simulation processing backed by {future}
- Automated generation of interactive R Markdown simulation documentation, situating results next to the workflows needed to reproduce them.
- {structmcmc} - A set of tools for performing structural inference for Bayesian Networks using MCMC
- Papers
{simstudy}
Misc
Reference
Available distributions (link)
- Probability Distributions
- nonrandom: For constants; can be a numeric or a string with a formula that defines a dependency on another variable
- clusterSize: For variable cluster sizes but a constant total sample size
- formula: The (fixed) total sample size
- variance: A (non-negative) dispersion measure that represents the variability of size across clusters
- If the dispersion is set to 0, then cluster sizes are constant
- trtAssign: For treatment assignment
- formula: The group ratio, separated by semicolons; the number of values determines the number of treatment groups
  - e.g. 2 values = 2 groups, and “1;2” says group 2 has twice as many units as group 1
- variance: Stratification variables; the ratio in formula is used as the stratification ratio (e.g. unbalanced treatment groups → unbalanced stratification)
Example
```r
def <- defData(def, varname = "rx", dist = "trtAssign",
               formula = "1;1;2", variance = "male;over65")

count(studytbl, rx)
#> # A tibble: 3 × 2
#>      rx     n
#>   <int> <int>
#> 1     1    84
#> 2     2    82
#> 3     3   164

count(studytbl, male, rx)
#> # A tibble: 6 × 3
#>    male    rx     n
#>   <int> <int> <int>
#> 1     0     1    40
#> 2     0     2    39
#> 3     0     3    78
#> 4     1     1    44
#> 5     1     2    43
#> 6     1     3    86

count(studytbl, over65, rx)
#> # A tibble: 6 × 3
#>  over65    rx     n
#>   <int> <int> <int>
#> 1      0     1    66
#> 2      0     2    65
#> 3      0     3   130
#> 4      1     1    18
#> 5      1     2    17
#> 6      1     3    34
```
Functions
defData(dtDefs = NULL, varname, formula, variance = 0, dist = "normal", link = "identity", id = "id")
- Initially creates a data.table, or adds a column to a data.table, with instructions about creating a variable
- formula: Numeric constant or string formula for the mean, probability of event (binary), probability of success (binomial), etc.
defDataAdd(dtDefs = NULL, varname, formula, variance = 0, dist = "normal", link = "identity")
- Creates a variable definition like defData, but is used to augment an already generated dataset. Used as input to addColumns, which will generate the variable data from the instructions in this object and add it as a column to the already generated dataset.
genCluster(dtClust, cLevelVar, numIndsVar, level1ID, allLevel2 = TRUE)
- After generating cluster-level data, this function takes the number of clusters and the sizes of each cluster from that data, and does something like expand.grid to generate an individual-level dataset. Also adds an id variable.
- dtClust: Cluster-Level Data
- cLevelVar: Cluster variable from the cluster-level data
- numIndsVar: Variable with the number of units per cluster from the cluster-level data
- level1ID: Name you want for your individual-level ID variable
Variable Dependence
- Binary depends on a Binary
Definitions
```r
def <- defData(varname = "male", dist = "binary", formula = .5, id = "cid")
def <- defData(def, varname = "over65", dist = "binary",
               formula = "-1.7 + .8*male", link = "logit")
```
What’s happening
```r
male <- c(1, 1, 0, 1, 0, 0, 0, 1, 0, 1)
logits <- -1.7 + 0.8 * male
probabilities <- boot::inv.logit(logits)
over65 <- rbinom(n = 10, size = 1, prob = probabilities)
```
- The formula in the logits line defines the relationship between being male and being over 65 years old.
- Males in this sample have a higher probability (0.2890505) of being over 65 years old than females (0.1544653)
- To sample from a Bernoulli distribution, set size = 1
- over65 is an indicator where each value is determined by a separate probability parameter for a Bernoulli distribution
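The two probabilities quoted above can be checked directly with base R's plogis (the inverse logit, equivalent to boot::inv.logit):

```r
# Inverse logit of the linear predictor for males (male = 1) and females (male = 0)
plogis(-1.7 + 0.8 * 1)  # 0.2890505
plogis(-1.7 + 0.8 * 0)  # 0.1544653
```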
Clustered with Cluster-Level Random Effect
Example: Fixed Cluster sizes; Balanced
Cluster Definitions
```r
d0 <- defData(varname = "n", formula = 20, dist = "nonrandom")
d0 <- defData(d0, varname = "a", formula = 0, variance = 0.33)
d0 <- defData(d0, varname = "rx", formula = "1;1", dist = "trtAssign")

d1 <- defDataAdd(varname = "y", formula = "18 + 1.6 * rx + a",
                 variance = 16, dist = "normal")
```
- n: sample size for the cluster
- dist = “nonrandom” and formula = 20 says to use a constant cluster size
- rx: treatment indicator
- dist = “trtAssign” and formula = “1;1” says 2 treatment groups and they’re balanced
- y: the individual-level outcome is a function of the treatment assignment and the cluster effect, as well as random individual-level variation
- a: the cluster-level random effect
- Random effects are sampled from \(\mathcal{N}(0, \sigma^2)\), where the variance, \(\sigma^2\), is what's typically estimated in a Mixed Effects model.
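As a base-R sketch of that sampling step (outside simstudy): the variance argument is a variance, so the equivalent rnorm draw passes its square root as the standard deviation.

```r
# Draw 10 cluster-level random effects with variance 0.33;
# rnorm takes a standard deviation, hence sqrt(0.33)
a <- rnorm(10, mean = 0, sd = sqrt(0.33))
```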
Generate Cluster-Level Data
```r
set.seed(2761)

dc <- genData(10, d0, "site")
dc
##     site  n       a rx
##  1:    1 20 -0.3548  1
##  2:    2 20 -1.1232  1
##  3:    3 20 -0.5963  0
##  4:    4 20 -0.0503  1
##  5:    5 20  0.0894  0
##  6:    6 20  0.5294  1
##  7:    7 20  1.2302  0
##  8:    8 20  0.9663  1
##  9:    9 20  0.0993  0
## 10:   10 20  0.6508  0
```
- Generates 10 clusters labelled as site according to the instructions in d0
Generate Individual Level Data
```r
dd <- genCluster(dc, "site", "n", "id")
dd <- addColumns(d1, dd)
dd
##      site  n      a rx  id    y
##   1:    1 20 -0.355  1   1 17.7
##   2:    1 20 -0.355  1   2 16.2
##   3:    1 20 -0.355  1   3 19.2
##   4:    1 20 -0.355  1   4 20.6
##   5:    1 20 -0.355  1   5 14.7
##  ---
## 196:   10 20  0.651  0 196 25.3
## 197:   10 20  0.651  0 197 22.1
## 198:   10 20  0.651  0 198 13.2
## 199:   10 20  0.651  0 199 15.6
## 200:   10 20  0.651  0 200 13.8
```
- genCluster performs an expand.grid-like operation to generate an individual-level dataset and adds an ID variable
- addColumns uses the individual-level data and the outcome variable definition to generate the outcome variable and add it to the dataset.
Example: Varying Cluster Sizes and therefore Varying Sample Size
```r
d0 <- defData(varname = "n", formula = 20, dist = "poisson")
genData(10, d0, "site")
##     site  n
##  1:    1 13
##  2:    2 18
##  3:    3 21
##  4:    4 26
##  5:    5 25
##  6:    6 27
##  7:    7 23
##  8:    8 30
##  9:    9 23
## 10:   10 20
```
- formula sets the Poisson distribution parameter, \(\lambda = 20\), so cluster sizes are sampled from a Poisson distribution with that mean/variance
- To increase the variability between clusters, use the negative binomial distribution
- Most likely leads to an unbalanced design
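A sketch of the negative binomial alternative, assuming simstudy's convention that formula is the mean and variance is the dispersion parameter for dist = "negBinomial":

```r
library(simstudy)

# Same mean cluster size as the Poisson example, but with extra
# between-cluster variability controlled by the dispersion parameter
d0 <- defData(varname = "n", formula = 20, variance = 0.5, dist = "negBinomial")
genData(10, d0, "site")
```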
Example: Varying Cluster Sizes but Constant Sample Size
```r
# Moderately varying cluster sizes
d0 <- defData(varname = "n", formula = 200, variance = 0.2, dist = "clusterSize")
genData(10, d0, "site")
##     site  n
##  1:    1 20
##  2:    2 28
##  3:    3 25
##  4:    4 24
##  5:    5 28
##  6:    6 22
##  7:    7  7
##  8:    8 13
##  9:    9 22
## 10:   10 11

# Very highly varying cluster sizes
d0 <- defData(varname = "n", formula = 200, variance = 5, dist = "clusterSize")
genData(10, d0, "site")
##     site   n
##  1:    1  10
##  2:    2   2
##  3:    3  17
##  4:    4   2
##  5:    5  49
##  6:    6 110
##  7:    7   1
##  8:    8   4
##  9:    9   1
## 10:   10   4
```
- Total sample size is fixed at 200 (formula), but individual cluster sizes are allowed to vary.
- variance: A dispersion parameter that controls how much the cluster sizes vary
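A quick check of the fixed-total property (a sketch; assumes genData allocates exactly the formula total across clusters, as both outputs above suggest):

```r
library(simstudy)

d0 <- defData(varname = "n", formula = 200, variance = 0.2, dist = "clusterSize")
dc <- genData(10, d0, "site")
sum(dc$n)  # the split varies across runs, but the total stays at 200
```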
{DeclareDesign}
Misc
- Docs, Website, EBook
- Notes from
- Sample Size Calculations in R: Money vs. Power
- Shows how to incorporate cost into the design. If cost is just a function of sample size, then I’m not sure how useful it is since it’s a simple calculation that can be done on your own.
Basics
- Process
- Define the data generating process
- Define the model function
  - Input: A data frame called data
  - Output: A 1-row data frame with mandatory and optional columns
    - Mandatory: term, estimate
    - Optional: std.error, conf.low, conf.high, p.value, cost, etc.
- Declare the true value of the quantity of interest
- Combine objects, simulate, and analyze the results
- Compare results of various inputs, models, etc.
- Example: Power Calculation for a Simple Linear Model
DGP
```r
library(marginaleffects); library(DeclareDesign)

set.seed(2024)

sample_size <- 100
treatment_effect <- 0.2

dgp <- declare_model(
  N = sample_size,
  # standard normal noise
  e = rnorm(N, mean = 0, sd = 1),
  # random dummy treatment with equal probability of treatment and control
  D = rbinom(N, size = 1, prob = 0.5),
  # intercept
  b0 = 0,
  # outcome equation
  Y = b0 + treatment_effect * D + e
)

d <- dgp()
head(d, 3)
#>    ID          e D b0          Y
#> 1 001  0.9819694 1  0 1.18196941
#> 2 002  0.4687150 0  0 0.46871504
#> 3 003 -0.1079713 1  0 0.09202867
```
- Note that the sample size and treatment effect size are defined outside the functions. This is because they are variables we intend to vary in order to compare the results.
- declare_model is a function factory (i.e. returns a function)
- Each call of dgp will produce a different dataset.
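To see the function-factory behavior in isolation, here is a minimal sketch (a simplified model, not the full DGP above): each call of the returned function re-draws the random components, so consecutive datasets differ.

```r
library(DeclareDesign)

# A toy model: 100 rows of standard normal noise
dgp <- declare_model(
  N = 100,
  e = rnorm(N, mean = 0, sd = 1),
  Y = e
)

d1 <- dgp()
d2 <- dgp()
identical(d1, d2)  # FALSE (almost surely): each call re-draws e
```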
Model
```r
fit <- function(data) {
  model <- lm(Y ~ D, data = data)
  #> out <- data.frame(
  #>   "term" = "OLS",
  #>   "estimate" = coef(model)["D"]
  #> )
  out <- marginaleffects::avg_comparisons(model, variables = "D")
  return(out)
}

fit(d)
#>  term contrast   estimate std.error statistic   p.value  s.value   conf.low conf.high predicted_lo predicted_hi predicted
#> 1    D    1 - 0 0.07115999 0.1984708 0.3585414 0.7199382 0.474055 -0.3178356 0.4601556    0.0545284    0.1256884 0.0545284
```
- In more complex models, with interactions or non-linear components, avg_comparisons is a convenient way to estimate the average treatment effect (ATE), and it returns results in the “tidy” format required by DeclareDesign
- The manual way (the commented-out lines) is included to show how to define the output in case you aren’t using a model covered by {marginaleffects}
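As a fuller sketch of that manual route (the column names follow the mandatory/optional tidy convention listed earlier; the helper name fit_manual is hypothetical):

```r
# Hypothetical hand-rolled estimator returning the tidy columns
# DeclareDesign expects: term and estimate, plus optional extras
fit_manual <- function(data) {
  model <- lm(Y ~ D, data = data)
  est <- summary(model)$coefficients["D", ]
  ci <- confint(model)["D", ]
  data.frame(
    term      = "OLS",
    estimate  = est[["Estimate"]],
    std.error = est[["Std. Error"]],
    p.value   = est[["Pr(>|t|)"]],
    conf.low  = ci[[1]],
    conf.high = ci[[2]]
  )
}
```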
Truth and Results
```r
truth <- declare_inquiry(ATE = treatment_effect)
estimator <- declare_estimator(handler = fit)

design <- dgp + estimator + truth

diagnose_design(design)
#> Research design diagnosis based on 500 simulations. Diagnosis completed in 9 secs.
#>
#> Design Inquiry Term Mean Estimand Mean Estimate  Bias SD Estimate RMSE Power Coverage N Sims
#> design     ATE    D          0.20          0.20 -0.00        0.20 0.20  0.17     0.96    500
```
- These simulation results suggest that our estimation strategy yields unbiased estimates. Unfortunately, the statistical power of our research design is very low.
Compare
```r
design_list <- redesign(design,
                        sample_size = c(100, 500),
                        treatment_effect = c(0.2, 0.5))

diagnose_design(design_list)
#> Research design diagnosis based on 500 simulations. Diagnosis completed in 9 secs.
#>
#> Design   sample_size treatment_effect Inquiry Term Mean Estimand Mean Estimate  Bias SD Estimate RMSE Power Coverage N Sims
#> design_1         100              0.2     ATE    D          0.20          0.20 -0.00        0.21 0.21  0.19     0.95    500
#> design_2         500              0.2     ATE    D          0.20          0.19 -0.01        0.09 0.09  0.56     0.94    500
#> design_3         100              0.5     ATE    D          0.50          0.51  0.01        0.20 0.20  0.71     0.94    500
#> design_4         500              0.5     ATE    D          0.50          0.51  0.01        0.09 0.09  1.00     0.95    500
```
- Still very little bias. Increasing the sample size by 5x increased the power by around 3x (0.56); increasing the effect size by 2.5x increased the power by around 4x (0.71); increasing both increased the power by around 5x (1.00)
Examples
Example: Model Comparison
```r
library(marginaleffects); library(DeclareDesign)

sample_size <- 100
treatment_effect <- 0.2

# DGP
dgp <- declare_model(
  N = sample_size,
  e = rnorm(N, mean = 0, sd = 1),
  D = rbinom(N, size = 1, prob = 0.5),
  # random normal covariates
  fabricatr::draw_multivariate(c(X1, X2) ~ MASS::mvrnorm(
    n = N,
    mu = c(0, 0),
    Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2)
  )),
  Y = 1 + treatment_effect * D + 1 * X1 + 1 * X2 + e
)

# Model
fit <- function(data) {
  # Linear model with regression adjustment for covariates
  adjusted <- lm(Y ~ D * (X1 + X2), data = data)
  adjusted <- avg_comparisons(adjusted, variables = "D", vcov = "HC3")
  # Linear model with a single binary predictor
  unadjusted <- lm(Y ~ D, data = data)
  unadjusted <- avg_comparisons(unadjusted, variables = "D", vcov = "HC3")
  # Combine the results
  out <- rbind(adjusted, unadjusted)
  # Label the results
  out$term <- c("Adjusted", "Unadjusted")
  return(out)
}

dgp() |> fit()
#>        Term          Contrast Estimate Std. Error      z Pr(>|z|)   S  2.5 % 97.5 %
#>    Adjusted mean(1) - mean(0)   -0.045      0.207 -0.217    0.828 0.3 -0.451  0.361
#>  Unadjusted mean(1) - mean(0)   -0.307      0.374 -0.820    0.412 1.3 -1.040  0.426

# Truth and Compare
truth <- declare_inquiry(ATE = treatment_effect)
estimators <- declare_estimator(handler = fit)

design <- dgp + truth + estimators

diagnose_design(design)
#> Design Inquiry       Term N Sims Mean Estimand Mean Estimate   Bias SD Estimate   RMSE  Power Coverage
#> design     ATE   Adjusted    500          0.20          0.20   0.00        0.21   0.21   0.17     0.95
#>                                          (0.00)        (0.01) (0.01)      (0.01) (0.01) (0.02)   (0.01)
#> design     ATE Unadjusted    500          0.20          0.20  -0.00        0.42   0.42   0.08     0.93
#>                                          (0.00)        (0.02) (0.02)      (0.01) (0.01) (0.01)   (0.01)
```
- Bootstrapped standard errors are shown in parentheses beneath each estimate
- No bias, and, as expected, the regression with the adjustment variables has more power.