Simulation, Data
Misc
- Todo: for “trtAssign”, experiment with the ratio and the number of ratios
- Also see
- Bkmks Data >> Data Simulation
- Pandas-Time Series-Simulation
- A Gaussian and Standard GARCH time-series that’s frequently encountered in econometrics
- Packages
- {simChef} (Vignette) - Rapidly plan, carry out, and summarize statistical simulation studies in a flexible, efficient, and low-code manner.
- Intuitive tidy grammar of data science simulations
- Powerful abstractions for distributed simulation processing backed by {future}
- Automated generation of interactive R Markdown simulation documentation, situating results next to the workflows needed to reproduce them.
- {structmcmc} - A set of tools for performing structural inference for Bayesian Networks using MCMC
- Papers
{simstudy}
Misc
Reference
Available distributions (link)
- Probability Distributions
- nonrandom: For constants; can be a numeric or a string with a formula that defines a dependency on another variable
- clusterSize: For variable cluster sizes but a constant total sample size
- formula: The (fixed) total sample size
- variance: A (non-negative) dispersion measure that represents the variability of size across clusters
- If the dispersion is set to 0, then cluster sizes are constant
- trtAssign: For treatment assignment
- formula: Treatment ratio with values separated by semicolons; the number of values sets the number of treatment groups
- e.g. 2 values = 2 groups, and “1;2” says group 2 has twice as many units as group 1
- variance: Stratification variables, separated by semicolons; the ratio in formula is also used as the stratification ratio (e.g. unbalanced treatment groups → unbalanced stratification)
Example
    def <- defData(def, varname = "rx", dist = "trtAssign",
                   formula = "1;1;2", variance = "male;over65")

    count(studytbl, rx)
    #> # A tibble: 3 × 2
    #>      rx     n
    #>   <int> <int>
    #> 1     1    84
    #> 2     2    82
    #> 3     3   164

    count(studytbl, male, rx)
    #> # A tibble: 6 × 3
    #>    male    rx     n
    #>   <int> <int> <int>
    #> 1     0     1    40
    #> 2     0     2    39
    #> 3     0     3    78
    #> 4     1     1    44
    #> 5     1     2    43
    #> 6     1     3    86

    count(studytbl, over65, rx)
    #> # A tibble: 6 × 3
    #>   over65    rx     n
    #>    <int> <int> <int>
    #> 1      0     1    66
    #> 2      0     2    65
    #> 3      0     3   130
    #> 4      1     1    18
    #> 5      1     2    17
    #> 6      1     3    34
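To make the ratio string concrete, here is a small base R sketch of how "1;1;2" translates into a number of groups and expected group shares. It only illustrates the arithmetic and is not simstudy's internal code; the variable names are made up for this example, and 330 is the total of the counts in the output above.

```r
# Illustrative arithmetic only; not how simstudy implements trtAssign
ratio_string <- "1;1;2"
ratio <- as.numeric(strsplit(ratio_string, ";", fixed = TRUE)[[1]])

n_groups   <- length(ratio)      # number of treatment groups implied by the string
shares     <- ratio / sum(ratio) # expected proportion of units in each group
expected_n <- shares * 330       # 330 = 84 + 82 + 164 from the count() output

n_groups
shares
expected_n
```

The expected counts (82.5, 82.5, 165) are close to the observed 84, 82, and 164; the small differences are plausibly due to allocation being done within the strata.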
Functions
defData(dtDefs = NULL, varname, formula, variance = 0, dist = "normal", link = "identity", id = "id")
- Creates a definitions data.table, or adds a row to an existing one, with the instructions for generating a variable
- formula: Numeric constant or string formula for the mean, probability of event (binary), probability of success (binomial), etc.
defDataAdd(dtDefs = NULL, varname, formula, variance = 0, dist = "normal", link = "identity")
- Creates a variable definition like defData, but is used to augment an already generated dataset. Used as input to addColumns, which generates the variable from the instructions in this object and adds it as a column to the already generated dataset.
genCluster(dtClust, cLevelVar, numIndsVar, level1ID, allLevel2 = TRUE)
- After generating cluster-level data, this function takes the number of clusters and the size of each cluster from that data and does something like expand.grid to generate an individual-level dataset. Also adds an ID variable.
- dtClust: Cluster-level data
- cLevelVar: Cluster variable from the cluster-level data
- numIndsVar: Variable with the number of units per cluster from the cluster-level data
- level1ID: Name you want for your individual-level ID variable
Variable Dependence
- Binary depends on a Binary
Definitions
    def <- defData(varname = "male", dist = "binary",
                   formula = .5, id = "cid")
    def <- defData(def, varname = "over65", dist = "binary",
                   formula = "-1.7 + .8*male", link = "logit")
What’s happening
    male <- c(1,1,0,1,0,0,0,1,0,1)
    logits <- -1.7 + 0.8 * male
    probabilities <- boot::inv.logit(logits)
    over65 <- rbinom(n = 10, size = 1, prob = probabilities)
- The formula in the logits line defines the relationship between being male and being over 65yrs old.
- Males in this sample will have a higher probability (0.2890505) of being over 65yrs old than females (0.1544653)
- To sample from a Bernoulli distribution, set size = 1
- over65 is an indicator where each value is determined by a separate probability parameter for a Bernoulli distribution
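The two probabilities quoted above can be reproduced in base R, where plogis is the inverse logit. This is a small check rather than part of the original workflow; the names p_female, p_male, and odds_ratio are introduced here for illustration.

```r
# Inverse logit of the linear predictor -1.7 + 0.8 * male
p_female <- plogis(-1.7)        # male = 0
p_male   <- plogis(-1.7 + 0.8)  # male = 1

round(c(female = p_female, male = p_male), 7)
# female = 0.1544653, male = 0.2890505

# The coefficient 0.8 is a log odds ratio, so the odds ratio is exp(0.8)
odds_ratio <- (p_male / (1 - p_male)) / (p_female / (1 - p_female))
all.equal(odds_ratio, exp(0.8))
```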
Clustered with Cluster-Level Random Effect
Example: Fixed Cluster sizes; Balanced
Cluster Definitions
    d0 <- defData(varname = "n", formula = 20, dist = "nonrandom")
    d0 <- defData(d0, varname = "a", formula = 0, variance = 0.33)
    d0 <- defData(d0, varname = "rx", formula = "1;1", dist = "trtAssign")
    d1 <- defDataAdd(varname = "y", formula = "18 + 1.6 * rx + a",
                     variance = 16, dist = "normal")
- n: sample size for the cluster
- dist = “nonrandom” and formula = 20 says use a constant cluster size of 20
- rx: treatment indicator
- dist = “trtAssign” and formula = “1;1” says 2 treatment groups and they’re balanced
- y: the individual-level outcome is a function of the treatment assignment and the cluster effect, as well as random individual-level variation (variance = 16)
- a: cluster-level random effect
- Random effects are sampled from \(\mathcal{N}(0, \sigma^2)\), here with \(\sigma^2 = 0.33\); the variance is typically estimated in a Mixed Effects model.
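Written out, the cluster definitions above imply the following data-generating model. The indices \(i\) (individual) and \(j\) (cluster) and the symbol \(\epsilon_{ij}\) are notation introduced here, not taken from the package.

\[
y_{ij} = 18 + 1.6\,\text{rx}_j + a_j + \epsilon_{ij}, \qquad a_j \sim \mathcal{N}(0,\ 0.33), \qquad \epsilon_{ij} \sim \mathcal{N}(0,\ 16)
\]

Under this model, the share of outcome variance attributable to clusters is \(0.33 / (0.33 + 16) \approx 0.02\).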
Generate Cluster-Level Data
    set.seed(2761)
    dc <- genData(10, d0, "site")
    dc
    ##     site  n       a rx
    ##  1:    1 20 -0.3548  1
    ##  2:    2 20 -1.1232  1
    ##  3:    3 20 -0.5963  0
    ##  4:    4 20 -0.0503  1
    ##  5:    5 20  0.0894  0
    ##  6:    6 20  0.5294  1
    ##  7:    7 20  1.2302  0
    ##  8:    8 20  0.9663  1
    ##  9:    9 20  0.0993  0
    ## 10:   10 20  0.6508  0
- Generates 10 clusters labelled as site according to the instructions in d0
Generate Individual Level Data
    dd <- genCluster(dc, "site", "n", "id")
    dd <- addColumns(d1, dd)
    dd
    ##      site  n      a rx  id    y
    ##   1:    1 20 -0.355  1   1 17.7
    ##   2:    1 20 -0.355  1   2 16.2
    ##   3:    1 20 -0.355  1   3 19.2
    ##   4:    1 20 -0.355  1   4 20.6
    ##   5:    1 20 -0.355  1   5 14.7
    ##  ---
    ## 196:   10 20  0.651  0 196 25.3
    ## 197:   10 20  0.651  0 197 22.1
    ## 198:   10 20  0.651  0 198 13.2
    ## 199:   10 20  0.651  0 199 15.6
    ## 200:   10 20  0.651  0 200 13.8
- genCluster performs something like an expand.grid to generate an individual-level dataset and adds an ID variable
- addColumns uses the individual-level data and the outcome variable definition to generate the outcome variable and add it to the dataset.
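The expansion step can be imitated in base R to see what it does conceptually. This is a rough sketch with a toy data frame (hypothetical values, not the dc object above), and it is not how genCluster is implemented.

```r
# Toy cluster-level data: one row per cluster, n = cluster size
clusters <- data.frame(site = 1:3, n = c(2, 3, 1), rx = c(1, 0, 1))

# Repeat each cluster row n times so there is one row per individual
individuals <- clusters[rep(seq_len(nrow(clusters)), times = clusters$n), ]
rownames(individuals) <- NULL

# Add an individual-level ID, playing the role of level1ID
individuals$id <- seq_len(nrow(individuals))

individuals
```

Cluster-level columns such as rx are carried down to every individual in the cluster, which is why a later individual-level definition can refer to them.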
Example: Varying Cluster Sizes and therefore Varying Sample Size
    d0 <- defData(varname = "n", formula = 20, dist = "poisson")
    genData(10, d0, "site")
    ##     site  n
    ##  1:    1 13
    ##  2:    2 18
    ##  3:    3 21
    ##  4:    4 26
    ##  5:    5 25
    ##  6:    6 27
    ##  7:    7 23
    ##  8:    8 30
    ##  9:    9 23
    ## 10:   10 20
- formula sets the Poisson distribution parameter, \(\lambda = 20\), so cluster sizes are sampled from a Poisson distribution with mean and variance both equal to 20
- To increase the variability between clusters, use the negative binomial distribution
- Most likely leads to an unbalanced design
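To see why the negative binomial gives more variable cluster sizes at the same mean, the two distributions can be compared directly in base R. This sketch uses rpois and rnbinom rather than simstudy, and the size parameter k = 5 is an arbitrary value chosen for illustration.

```r
set.seed(1)
mu <- 20  # same mean cluster size for both distributions
k  <- 5   # hypothetical negative binomial size parameter (smaller = more spread)

pois_sizes <- rpois(1e5, lambda = mu)
nb_sizes   <- rnbinom(1e5, mu = mu, size = k)

# Theoretical variances: Poisson = mu = 20; negative binomial = mu + mu^2 / k = 100
c(poisson = var(pois_sizes), negbin = var(nb_sizes))
```

With the mean held at 20, the negative binomial variance is about five times the Poisson variance for this choice of k.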
Example: Varying Cluster Sizes but Constant Sample Size
    # Moderately varying cluster sizes
    d0 <- defData(varname = "n", formula = 200, variance = 0.2, dist = "clusterSize")
    genData(10, d0, "site")
    ##     site  n
    ##  1:    1 20
    ##  2:    2 28
    ##  3:    3 25
    ##  4:    4 24
    ##  5:    5 28
    ##  6:    6 22
    ##  7:    7  7
    ##  8:    8 13
    ##  9:    9 22
    ## 10:   10 11

    # Very highly varying cluster sizes
    d0 <- defData(varname = "n", formula = 200, variance = 5, dist = "clusterSize")
    genData(10, d0, "site")
    ##     site   n
    ##  1:    1  10
    ##  2:    2   2
    ##  3:    3  17
    ##  4:    4   2
    ##  5:    5  49
    ##  6:    6 110
    ##  7:    7   1
    ##  8:    8   4
    ##  9:    9   1
    ## 10:   10   4
- Total sample size is fixed at 200 (formula), but individual cluster sizes are allowed to vary.
- variance: A dispersion parameter that controls how much the cluster sizes vary
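As a quick check of the fixed-total claim, the cluster sizes printed in the two runs above can be summed in base R. The vectors below are copied from that output; the names moderate and extreme are just labels for this example.

```r
# Cluster sizes copied from the two genData() outputs above
moderate <- c(20, 28, 25, 24, 28, 22, 7, 13, 22, 11)  # variance = 0.2
extreme  <- c(10, 2, 17, 2, 49, 110, 1, 4, 1, 4)      # variance = 5

# Both runs add up to the fixed total given in formula
c(moderate = sum(moderate), extreme = sum(extreme))

# The larger dispersion setting produces a much wider spread of sizes
c(moderate = sd(moderate), extreme = sd(extreme))
```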