Analysis
Misc
- Packages
- {survey} - Lumley’s package
- {srvyr} - Brings parts of dplyr’s syntax to survey analysis, using the survey package.
- {svylme} (Paper) - Mixed models for complex surveys
- {svyvgam} - Inference based on the survey package for the wide range of parametric models in the ‘VGAM’ package
- {jtools} - Convenience functions for the survey package’s svyglm objects as well as weighted regressions
- {dropout} (JOSS) - Classifies missing values based on the occurrence of single missing values, section dropouts and complete dropouts which allows it to produce summary statistics of different response patterns and relate them to the overall occurrence of missing values.
- {fastsurvey} (article): Making the {survey} hundreds of times faster using {Rcpp}
- {svyROC} - Estimation of the ROC Curve and the AUC for Complex Survey Data
- {csSampling} (Vignette) - Provides estimation of Bayesian models for data collected from complex survey samples by combining functionality from Stan (via rstan and brms) and the survey package.
- Estimates the weighted stan model and provides an asymptotic covariance correction for model mis-specification due to using survey sampling weights as plug in values in the likelihood (i.e. design effect).
- {FHDI} (Vignette) - Functions for fractional hot deck imputation as well as fully efficient fractional imputation for handling item nonresponse under a missing at random (MAR) assumption
- bbw - The blocked weighted bootstrap is an estimation technique for use with data from two-stage cluster sampled surveys in which either prior weighting or posterior weighting is implemented
- Resources
- Exploring Complex Survey Data Analysis in R - Ebook from {srvyr} owner
- Introduction to Regression Methods for Public Health Using R: Chapter 8 Analyzing Complex Survey Data
- Survey Data Analysis with R - UCLA article that gives a nice overview of the {survey}
- Statistics in Survey Sampling
- Papers
- Questionnaire data can be modelled using ordinal regression (Liddell & Kruschke, 2018)
- When there are cluster-level variables, using a cross-validation split that ignores clustering tended to choose a lasso regularisation parameter that was smaller than ideal, resulting in overfitting of the model. (Lumley)
- Pairwise Likelihood (original, mathy paper, factor analysis w/ordinal data paper, usage on customer survey data ($) paper)
- A special case of composite likelihood methods that uses lower-order conditional or marginal log-likelihoods instead of the full log-likelihood
- When the number of items is greater than five (p > 5), Full Information Likelihood (FIML) is only feasible when the Item Response Theory (IRT) framework is used. However, even in IRT, FIML becomes very computationally heavy as the number of factors increases. Using Pairwise likelihood is a suitable alternative
- Ignoring the survey design features (such as stratification, clustering and unequal selection probabilities) can lead to erroneous inferences on model parameters because of sample selection bias caused by informative sampling.
- It is tempting to expand the models by including among the auxiliary variables all the design variables that define the selection process at the various levels and then ignore the design and apply standard methods to the expanded model. The main difficulties with this approach are the following:
- Not all design variables may be known or accessible to the analyst
- Too many design variables can lead to difficulties in making inference from the expanded model
- The expanded model may no longer be of scientific interest to the analyst
- Design-based approach can provide asymptotically valid repeated sampling inferences without changing the analyst’s model.
- Resampling methods, such as the jackknife and the bootstrap for survey data, can provide valid variance estimators and associated inferences on the census parameters
- In other cases, it is necessary to estimate the model variance of the census parameters from the sample. The estimator of the total variance is then given by the sum of this estimator and the re-sampling variance estimator.
- Example: In an education study of students, schools (first stage sampling units) may be selected with probabilities proportional to school size and students (second stage units) within selected schools by stratified random sampling.
- See Surveys, Sampling Methods >> Probabilistic Sampling Methods >> Multi-Stage Sampling
- Again, ignoring the survey design and using traditional methods for multi-level models can lead to erroneous inferences in the presence of sample selection bias
- In the design-based approach, estimation of variance component parameters of the model is more difficult than that of regression parameters.
- Asymptotically valid even when the sample sizes within sampled clusters (level 1 units) are small, unlike some of the existing methods, but knowledge of the joint inclusion probabilities within sampled clusters is required.
- Large variations in cluster sizes may cause an issue, see Lumley
- A special case of composite likelihood methods that uses lower-order conditional or marginal log-likelihoods instead of the full log-likelihood
- Example of debiasing a dataset by other means than by weighting by population
- The economist created a death-by-covid risk probability model. They had a bunch of medical records with patient comorbidities, age, gender, positive test, hospitalized, death/no death, etc. (people with other illnesses already) and were worried that the people who tested positive but just stayed at home (i.e. no medical records like younger people). Not correcting for this bias of undetected cases would bias their risk probabilities.
- Failing to correct this bias would lead to underestimating the risks associated with comorbidities, and to overestimating the risks among those without listed conditions.
- They used an estimated metric, national cfr per age group per gender per week (separate dataset from CDC which has stats on groups with and without medical records). When a week’s sample cfr didn’t match that week’s national cfr, they would randomly sample people in the dataset who didn’t meet the selection criteria (i.e. positive covid test) and assign them a positive test. They continued to add these reclassified people to that week’s sample until the sample cfr matched the national cfr. Thus, debiasing they’re data set.
- Thought this was an interesting case because it used a estimated metric to “weight” subgroups within their sample to make it more representative of the “true” population.
- Also see Projects >> Rolling COVID-19 CFR
- https://www.economist.com/graphic-detail/2021/03/11/how-we-built-our-covid-19-risk-estimator
- The economist created a death-by-covid risk probability model. They had a bunch of medical records with patient comorbidities, age, gender, positive test, hospitalized, death/no death, etc. (people with other illnesses already) and were worried that the people who tested positive but just stayed at home (i.e. no medical records like younger people). Not correcting for this bias of undetected cases would bias their risk probabilities.
Terms
- Ceiling and Floor Effects - An artificial lower limit on the value that a variable can attain, causing the distribution of scores to be skewed.
- Occur when the tests or scales are relatively easy or difficult such that substantial proportions of individuals obtain either maximum or minimum scores and that the true extent of their abilities cannot be determined.
- Sounds kind of like censoring (See Regression, Other >> Censored and Truncated Data)
- Ceiling or floor effects alone would induce, respectively, attenuation or inflation in mean estimates. And both ceiling and floor effects would result in attenuation in variance estimates.
- {DACF}
- Recovers mean and variance given data with ceiling/floor effects
- Allows for mean comparison tests such as t-test and ANOVA for data with ceiling/floor effects
- Example: The distribution of scores on an ability test will be skewed by a floor effect if the test is much too difficult for many of the respondents and many of them obtain zero scores.
- Example: The distribution of scores on an ability test will be skewed by a ceiling effect if the test is much too easy for many of the respondents and many of them obtain perfect scores
- Design Effect (DEff) - A measure of the expected impact of a sampling design on the variance of an estimator for some parameter of a population. (wiki)
- It is calculated as the ratio of the variance of an estimator based on a sample from an (often) complex sampling design, to the variance of an alternative estimator based on a simple random sample (SRS) of the same number of elements.
- Design Effect = 1: Says the variance of your estimator (and hence its standard error) is the same in your complex survey design as if you’d just sampled simply from the population at random.
- Design Effect > 1: These are the norm, and says that you have a higher variance in your estimate as a price of something else in your complex design
- Example: You are sampling clusters that live closer together to make it easier for interviewers to interview several households at once, which reduces cost but increases variance.
- STATA vs {survey}: Default results for this effect will differ when the stratification variable you are splicing your sample up into for estimation purposes (i.e. mean per province) happens to be the very variable that you used as a basis for over- and under-sampling in the first place. (see article for details)
- Design Effect isn’t typically scrutinized over at the stratified variable level. It’s used at the overall survey-level (i.e. overall mean estimate and not mean estimate per province), so this typically isn’t an issue).
- The STATA Design Effect per strata method has an intuitive interpretation that could be interesting though, so see the article for details on calculating it manually in R.
Weights
Misc
- Surveys responses are often biased due to coverage error, sampling error and non-response bias. Weighting is often an important step when analyzing survey data. For each unit in the sample (e.g. respondent to a survey), we attach a weight that can be understood as the approximate number of people from the target population that this respondent represents. Weights adjust the sample distribution more towards the population distribution
- The green bars show the sample with weights applied.
- The weighted average will also be less biased to the extent the response is correlated with respondent’s age.
- The weighted distribution is not fully corrected, mainly because of bias-variance considerations
- Packages
- Papers
- Replicate weight variance estimation is a class of resampling approaches to variance estimation, extensions of the jackknife and bootstrap to survey data.
- Replicate weights are designed by setting the weights of observations in some clusters to nearly zero and other clusters to nearly their sampling weights.
- In a jackknife, one cluster will have zero weight and the others will have weight close to their sampling weight; in a bootstrap, about 37% of clusters will have zero weight and 63% will have weight close to 1 or 2 or some other small multiple of their sampling weight.
- There are also split-half replicates where half the clusters have zero weight and the other half have double weight.
- For surveys that don’t have built-in replicate weights {survey} has algorithms to construct them.
Types
- Frequency Weights
- Steps
- Remove the duplicate observations
- Duplicates don’t add any additional information
- Weight each observation by the square root of number of times it appeared in the original dataset
\[ \sqrt{w_i} \cdot x_i \] - SSE needs to be divided by n - k + 1
- Where n is the number of observations in the original dataset and k is the number of predictors in the regression
- Remove the duplicate observations
- Steps
- Importance Weights - focus on how much each row of the data set should influence model estimation. These can be based on data or arbitrarily set to achieve some goal.
- Analytic Weights - If a data point has an associated precision, analytic weighting helps a model focus on the data points with less uncertainty (such as in meta-analysis).
- (Inverse) Probability Weights (wiki) - {{balance}} refers to this type as “inverse propensity weights”
- Also see below, {{balance}} >> Steps >> Adjust >> Options
- “Directly inverting propensity score estimates can lead to instability, bias, and excessive variability due to large inverse weights, especially when treatment overlap is limited” (See paper details and solution)
- Used to reduce bias when respondents have different probabilities of selection
- Adjusts a non-random sample to represent a population by weighting the sample units. It assumes two samples:
- A sample of respondents to a survey (or in a more general framework, a biased panel).
- A sample of a target population, often referred to as “reference sample” or “reference survey.”
- This sample includes a larger coverage of the population or a better sampling properties in a way that represents the population better.
- It often includes only a limited number of covariates and doesn’t include the outcome variables (the survey responses).
- In different cases it can be the whole target apopulation (in case it is available), a census data (based on a survey) or an existing survey.
- Propensity Score - The probability to be included in the sample (the respondents group) conditioned on the characteristics of the unit
- Let \(p_i = \text{Pr}\{i \in S \| x_i\} \quad \mbox{with}\; i = 1 \ldots n\).
- \(i\) is the unit (aka respondent), \(n\) is the total number of respondents, \(S\) is the sample of respondents
- \(X\) is a set of covariates that are available for the sample and the target population
- \(p_i\) is the estimated probability of being in the sample using logistic regression
- Data includes both sample and target population
- Outcome is a binary variable (1/0): 1 = Sample, 0 = Target
- Covariates are \(X\)
- Also see Econometrics, Propensity Score Analysis
- Calculate Weights
\[ w_i = \frac{1-p_i}{p_i}d_i \]- \(d_i\) is …?
- Covariate Balancing Propensity Scores (CBPS)
- When estimating propensity score, there is often a process of adjusting the model and choosing the covariates for better covariate balancing. The goal of CBPS is to allow the researcher to avoid this iterative process and suggest an estimator that is optimizing both the propensity score and the balance of the covariates together.
- Main advantage is in cases when the researcher wants better balance on the covariates than traditional propensity score methods - because one believes the assignment model might be misspecified and would like to avoid the need to fit followup models to improve the balance of the covariates.
- Also see
- Misc >> packages >> {CBPS}
- {{balance}} >> Steps >> Adjust >> Options
{{balance}}
- Docs
- A Python package for adjusting biased data samples.
- Provides eda, weights calculation, comparison of variables before and after weighting
Steps
- EDA:
- Understanding the initial bias in the sample data relative to a target population we would like to infer
- Summary Statistics
- The limitation of using the mean is that it is not easily comparable between different variables since they may have different variances.
- ASMD (Absolute Standardized Mean Deviation) measures the difference between the sample and target for each covariate.
It uses weighted average and std.dev for the calculations (e.g.: to take design weights into account).
This measure is the same as taking the absolute value of Cohen’s d statistic (also related to SSMD), when using the (weighted) standard deviation of the population.
\[ d = \frac{\bar x_1 - \bar x_2}{s} \]- Not sure why it says “(weighted)” when it’s the std.dev of the population since weights are applied to sample data. Maybe the population estimate is itself a weighted calculation.
- Guidelines on effect size for Cohen’s D should apply here, too.
- For categorical variables, the ASMD can be calculated as the average of the ASMD applied to each of the one-hot encoding of the categories of the variable
Also see
- Visualizations
- Calculate Weights:
- Adjust the data to correct for the bias by producing weights for each unit in the sample based on propensity scores
- Preprocessing (“using best practices in the field”):
- Transformations are done on both the sample dataframe and the target dataframe together
- Missing values - adds a column ‘_is_na’ to any variable that contains missing values
- Considered as a separate category for the adjustment
- Feature Engineering
- Continuous - Bucketed into 10 quantiles buckets.
- Categorical - Rare categories (with less than 5% prevalence) are grouped together so to avoid overfitting rare events
- Methods
- Inverse Propensity Weighting (IPW)
- Coefficients, parameters of the fitted models are available
- See above, Types >> (Inverse) Probability Weights Using LASSO logistic regression keeps the inflation of the variance as minimal as possible while still addressing the meaningful differences in the covariates between the sample and the target
- Design Effect (max_de) for tuning penalty factor, \(\lambda\), and the trimming ratio parameter
- A measure of the expected impact of a sampling design on the variance of an estimator for some parameter
- max_de=X - The regularization parameter and the trimming ratio parameter are chosen by a grid search over the 10 models with the max design effect value
- Default is 1.5
- Assumption: larger design effect often implies better covariate balancing.
- Within these 10 models, the model with the smallest ASMD is chosen.
- max_de=None - Optimization is performed by cross-validation of the logistic model
- The penalty factor, \(\lambda\), is chosen when the MSE is at most 1 standard error from the minimal MSE
- The trimming ratio parameter is set by the user, and default to 20
- Covariate Balancing Propensity Scores (CBPS)
- Estimates the propensity score in a way that optimizes prediction of the probability of sample inclusion as well as the covariates balance.
- Also see
- Types >> Covariate Balancing Propensity Scores (CBPS)
- Misc >> packages >> {CBPS}
- Design Effect (
max_de
)- A measure of the expected impact of a sampling design on the variance of an estimator for some parameter
- Default is 1.5; If “None”, then optimization is unconstrained
- Post-Stratification
- Inverse Propensity Weighting (IPW)
- Post-processing of the weights:
- Trims - trims the weights in order to avoid overfitting of the model and unnecessary variance inflation.
- Options
- Mean-Ratio - The ratio from above according to which the weights are trimmed by mean(weights) * ratio. Default is 20.
- Percentile - Winsorization is applied
- Options
- Normalizing to population size - Weights can be described as approximating the number of units in the population this unit of the sample represents.
- Trims - trims the weights in order to avoid overfitting of the model and unnecessary variance inflation.
- Compare data with and without weights
- Evaluate the final bias and the variance inflation after applying the fitted weights.
- Compares ASMD score (See EDA), Design Effect, Model proportion deviance explained (if inverese propensity weighting method was used)
- ASMD: since categorical variables are hot-encoded, a comparison (with/without weights) is made for each level
- Comparison of means is available
- Similar charts used in EDA are available that show a comparison between weighted/not weighted
- Response Rates with/without weights
- Effects on outcome variable
Modeling
Tidymodels
Misc
-
<- tribble( example_data ~lang_1, ~lang_2, ~lang_3, "English", "Italian", NA, "Spanish", NA, "French", "Armenian", "English", "French", NA, NA, NA ) recipe(~., data = example_data) |> step_dummy_multi_choice(starts_with("lang")) |> prep() |> bake(new_data = NULL) #> # A tibble: 4 × 5 #> lang_1_Armenian lang_1_English lang_1_French lang_1_Italian lang_1_Spanish #> <int> <int> <int> <int> <int> #> 1 0 1 0 1 0 #> 2 0 0 1 0 1 #> 3 1 1 1 0 0 #> 4 0 0 0 0 0
- Might be useful for preprocessing items with the same possible answers since creating one-hot encodes the typical way would result in duplicate columns.
-
Frequency weights are used for all parts of the preprocessing, model fitting, and performance estimation operations.
- This includes v-fold CV splits for now (see Using case weights with tidymodels for details)
Importance weights only affect the model estimation and supervised recipes steps (i.e. depend on the outcome variable).
- Not used with yardstick functions for calculating measures of model performance.
Example: Importance weights
<- training_sim %>% training_sim mutate( case_wts = ifelse(class == "class_1", 60, 1), case_wts = parsnip::importance_weights(case_wts) ) set.seed(2) <- vfold_cv(training_sim, strata = class) sim_folds <- sim_rec recipe(class ~ ., data = training_sim) %>% step_ns(starts_with("non_linear"), deg_free = 10) %>% step_normalize(all_numeric_predictors()) <- lr_spec logistic_reg(penalty = tune(), mixture = 1) %>% set_engine("glmnet") <- lr_wflow workflow() %>% add_model(lr_spec) %>% add_recipe(sim_rec) %>% add_case_weights(case_wts) <- metric_set(sensitivity, specificity) cls_metrics <- tibble(penalty = 10^seq(-3, 0, length.out = 20)) grid set.seed(3) <- lr_res %>% lr_wflow tune_grid(resamples = sim_folds, grid = grid, metrics = cls_metrics) autoplot(lr_res) # calibration curves
Description
- Binary outcome; lasso
- class_1 (80 obs) is severely imbalanced with class_2 (4920)
- class_1 observations get a weight of 60 since 4920/80 = 61.5 which is ~ 60
recipe
will automatically detect the weights (pretty sure it doesn’t matter whether no case_wts is included in formula, e.g. class ~ .)- Since these are performance weights and
step_ns
andstep_normalize
don’t depend on the outcome variable (i.e. supervised), case weights are not used in these transformations.
- Since these are performance weights and
Steps
- Add case_wts variable to df
- Use
add_case_weights
function in workflow code
Remove the case weights from a workflow
<- lr_unwt_wflow %>% lr_wflow remove_case_weights()
- Useful if you want to make a comparison between models
Visualization
- Likert Response Pre and Post Intervention
Example: (source)
-
Code
<- dat %>% q2 filter(group == "Control") %>% rename(Pre = pre, Post = post) %>% make_long(Pre, Post) %>% mutate(node = factor(node, levels = c(7,6,5,4,3,2,1)), next_node = factor(next_node, levels = c(7,6,5,4,3,2,1))) %>% ggplot(aes(x = x, next_x = next_x, node = node, next_node = next_node, fill = factor(node))) + geom_sankey(alpha = 0.7, node.color = 'black') + geom_sankey_label(aes(label = node), alpha = 0.75, size = 3, color = "black", fill = "gray80") + scale_x_discrete(expand = c(0.05,0.05)) + theme_institute(base_size = 14) + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.title.y=element_blank(), axis.text.y=element_blank(), axis.ticks=element_blank(), legend.position = "bottom", plot.title = element_text(hjust = 0.5)) + guides(fill = guide_legend(reverse = T, nrow = 1)) + labs(title = "Pre-post", fill = "Response", x = "") <- dat %>% q3 filter(group == "Control") %>% group_by(post) %>% tally() %>% mutate(freq = n / sum(n), res = str_c(n, "\n(", round(freq*100, 1), "%)")) %>% ggplot(aes(x = as.factor(post), y = freq)) + geom_bar(aes(fill = as.factor(post)), stat="identity", alpha = 0.8, colour = "black") + theme_institute(base_size = 14) + theme(legend.position = "none", panel.grid.major.x = element_blank(), plot.background = element_blank(), plot.title = element_text(hjust = 0.5)) + scale_y_continuous(labels = scales::percent_format(), breaks = seq(0, max_prop, by = 0.05), expand = expansion(mult = c(0, 0.1))) + coord_cartesian(ylim = c(0, max_prop)) + scale_fill_viridis_d(option = "plasma", end = 0.85, direction = -1) + labs(title = "Post", fill = "Response", x = "Response", y = "") + geom_text(aes(label = res), vjust = -0.1, family = "Barlow Semi Condensed") + guides(fill = guide_legend(nrow = 1)) <- (p1 + p2 + p3) / (q1 + q2 + q3) patchwork + patchwork plot_annotation(tag_levels = list(c('Intervention', '', '', 'Control', '', ''))) & theme(plot.tag.position = c(0, 1), plot.tag = element_text(face = "bold", hjust = 0, vjust = 0))
Only included code for 1 bar and 1 alluvial. The author wrote each chart out individually along with some data manipulation code which is a lot of code. Haven’t looked closely, but I think he could’ve wrote functions for bars and alluvials then looped inputs.
The alluvial gives you a sense of how much each response changed from pre to post treatment while the bars shows how the overall response distribution changed.
Alluvial and Horizontal Dots (Codeby M.Kay)
Tables
The author uses a function from his personal package that requires {flextable} (I think) which seems be the source of the table theme.
-
Code
library(gtsummary) tbl_merge(tbls = list(dat %>% mutate("Change in water intake" = fct_case_when(post < pre ~ "Decrease", == post ~ "No change", pre > pre ~ "Increase")) %>% post filter(group == "Control") %>% select("Change in water intake") %>% tbl_summary(), %>% dat mutate("Change in water intake" = fct_case_when(post < pre ~ "Decrease", == post ~ "No change", pre > pre ~ "Increase")) %>% post filter(group == "Intervention") %>% select("Change in water intake") %>% tbl_summary()), tab_spanner = c("**Control**", "**Intervention**")) %>% thekids_table(colour = "Saffron")
By Response Category (and Treatment)
Code
library(gtsummary) %>% dat mutate(Change = fct_case_when(post < pre ~ "Decrease", == post ~ "No change", pre > pre ~ "Increase")) %>% post select(group, pre_l, Change) %>% tbl_strata( strata = group, ~.x %>% tbl_summary( by = pre_l) %>% modify_header(all_stat_cols() ~ "**{level}**"), .combine_with = "tbl_stack" %>% ) thekids_table(colour = "Saffron")
-