Missingness

Misc

  • Missing data can reduce the statistical power of a study and can produce biased estimates, leading to invalid conclusions
  • Packages
    • {mice} (Multivariate Imputation by Chained Equations) - Imputes mixes of continuous, binary, unordered categorical and ordered categorical data
      • Based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model.
      • Impute continuous two-level data, and maintain consistency between imputations by means of passive imputation.
      • Many diagnostic plots are implemented to inspect the quality of the imputations.
    • {rbmi} - Imputation of missing data in clinical trials with continuous multivariate normal longitudinal outcomes.
      • Supports imputation under a missing at random (MAR) assumption, reference-based imputation methods, and delta adjustments (as required for sensitivity analysis such as tipping point analyses).
      • Methods
        • Bayesian and approximate Bayesian multiple imputation combined with Rubin’s rules for inference
        • Frequentist conditional mean imputation combined with (jackknife or bootstrap) resampling
    • {naniar} - Tidyverse-compliant methods to summarize, visualize, and manipulate missing data (see the quick sketch after this package list)
    • {simputation} - Model-based, multivariate, donor, and simple statistical methods available
    • {NPBayesImputeCat}: Non-Parametric Bayesian Multiple Imputation for Categorical Data
      • Provides routines to i) create multiple imputations for missing data and ii) create synthetic data for statistical disclosure control, for multivariate categorical data, with or without structural zeros
      • Imputations and syntheses are based on Dirichlet process mixtures of multinomial distributions, which is a non-parametric Bayesian modeling approach that allows for flexible joint modeling
      • Vignette
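    • A quick first look at missingness with {naniar}; a minimal sketch using the built-in airquality dataset:

      library(naniar)

      # per-variable missingness counts and percentages
      miss_var_summary(airquality)

      # heatmap of missing vs. observed cells across the dataset
      vis_miss(airquality)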
  • Recommendations (article) (a {mice} sketch follows this list)
    • N > 200 \(\rightarrow\) Nonparametric
      • {mice::mice.impute.cart} (decision trees)
        • Fits one decision tree then samples from the leaves of that tree which approximates drawing samples from the conditional distribution
      • {mice-drf} (distributional random forest)
        • DRF estimates distributions in its leaves, so sampling using its predictions is like sampling from a conditional distribution (See Algorithms, ML >> Trees >> Distributional Trees/Forests)
        • {missForest} doesn’t do this. It just uses the predictions as the imputed values, which is essentially a conditional mean method and not a distributional method. (See Choosing a Method)
      • Currently, DL models like GAIN perform well, but aren’t significantly outperforming ML models
    • N < 200 \(\rightarrow\) Parametric
      • e.g. {mice::mice.impute.norm.nob} (Gaussian: uses the variance of the residuals of a linear model and its predictions (means) as the parameters of a Normal distribution (rnorm), then draws the imputed values from that distribution)
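    • A minimal sketch of both recommendations with {mice} (hypothetical: assumes a data frame dat that contains missing values):

      library(mice)

      # N > 200: nonparametric imputation via decision trees (CART)
      imp_cart <- mice(dat, method = "cart", m = 5, seed = 1)

      # N < 200: parametric imputation that draws from a Normal distribution
      imp_norm <- mice(dat, method = "norm.nob", m = 5, seed = 1)

      # extract the first completed dataset
      dat_complete <- complete(imp_cart, action = 1)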
  • “But more precisely, even having the correct model of the analysis stage does not absolve the analyst of considering the relationship between the imputation stage variables, the causal model, and the missingness mechanism. It turns out that in this simple example, imputing with an analysis-stage collider is innocuous (so long as it is excluded at the analysis stage). But imputation-stage colliders can wreck MI even if they are excluded from the analysis stage.”
  • **Don’t impute missing values before your training/test split.** Fit the imputation on the training data only; otherwise information from the test set leaks into training (see the sketch below).
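    • A minimal base-R sketch of the correct order (hypothetical: assumes a data frame dat with a numeric column x):

      set.seed(1)
      idx   <- sample(nrow(dat), size = 0.8 * nrow(dat))
      train <- dat[idx, ]
      test  <- dat[-idx, ]

      # learn the imputation value from the training data only...
      med <- median(train$x, na.rm = TRUE)

      # ...then apply it to both splits so no test information leaks
      train$x[is.na(train$x)] <- med
      test$x[is.na(test$x)]   <- med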
  • Imputation Types
    • Full-information maximum likelihood
    • Multiple imputation
    • One-step Bayesian imputation
  • Missingness Types: MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random)
  • Multivariate Imputation with Chained Equation (MICE) assumes MAR
    • Method entails creating multiple imputations for each missing value as opposed to just one. The algorithm addresses statistical uncertainty and enables users to impute values for data of different types.
  • Stochastic Regression Imputation is problematic
    • Popular among practitioners though
    • Issues
      • Stochastic regression imputation might lead to implausible values (e.g. negative incomes).
      • Stochastic regression imputation has problems with heteroscedastic data
    • Bayesian PMM handles these issues
  • Missingness in RCTs due to dropouts (aka loss to follow-up)
    • Notes from To impute or not: the case of an RCT with baseline and follow-up measurements
      • {mice} used for imputation
    • Bias in treatment effect due to missingness
      • If there are adjustment variables that affect unit dropout, then bias increases as variation in the treatment effect across units increases (aka heterogeneity)
        • In the example, a baseline measurement of the outcome variable, used as an explanatory variable, was also causal of missingness. Greater values of this variable resulted in greater bias.
        • Using multiple imputation resulted in less bias than just using complete cases, but still underestimated the treatment effect.
      • If there are no such variables, then there is no bias due to heterogeneous treatment effects
        • Complete cases of the data can be used
    • Last observation carried forward (LOCF; see the sketch after this list)
      • Sometimes used in clinical trials because it tends to be conservative, setting a higher bar for showing that a new therapy is significantly better than a traditional therapy.
      • Must assume that the previous value (e.g. a 2008 score) is similar to the subsequent value (e.g. the 2010 score).
      • Information about trajectories over time is thrown away.
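      • A minimal LOCF sketch with {dplyr} and {tidyr} (hypothetical: assumes a long-format data frame panel with columns id, visit, and score):

        library(dplyr)
        library(tidyr)

        # carry each subject's last observed score forward to later visits
        panel_locf <- panel |>
          arrange(id, visit) |>
          group_by(id) |>
          fill(score, .direction = "down") |>
          ungroup()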

Choosing a Method

  • (** Don’t use this. Just putting it here in order to be aware of it **) “Standard Procedure” for choosing an imputation method (article)
    • Issues
      • Some methods will be favored based on the metric used
        • Conditional Means Methods (RMSE)
        • Conditional Medians Methods (MAE)
      • Methods that use a conditional mean (e.g. regression, mean, kNN, or {missForest}) as the imputed value will be preferred by RMSE
        • From the article:
          • RMSE would choose the Regression imputation model rather than the Gaussian imputation model even though the Gaussian model best represents the data.
          • A similar situation arises with MAE and models that estimate conditional medians.
    • Steps (see the sketch after this list)
      1. Select some observations
      2. Set their status to missing
      3. Impute them with different methods
      4. Compare their imputation accuracy
        • For numeric variables, RMSE or MAE typically used
        • For categoricals, percentage of correct predictions (PCP)
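      • For awareness only, a minimal sketch of these steps (hypothetical: assumes a complete numeric data frame dat_full with a column x):

        library(mice)

        set.seed(1)
        dat_miss <- dat_full

        # steps 1-2: select some observations and set them to missing
        holdout <- sample(nrow(dat_full), size = 20)
        dat_miss$x[holdout] <- NA

        # step 3: impute with a candidate method
        dat_imp <- complete(mice(dat_miss, method = "pmm", m = 1, printFlag = FALSE))

        # step 4: compare imputation accuracy (RMSE here, with the caveats above)
        sqrt(mean((dat_imp$x[holdout] - dat_full$x[holdout])^2))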
  • Initial Considerations
    • If a dataset’s feature has missing data in more than 80% of its records, it is probably best to remove that feature altogether.
    • If a feature with missing values is strongly correlated with other missing values, it’s worth considering using advanced imputation techniques that use information from those other features to derive values to replace the missing data.
    • If a feature’s values are missing not at random (MNAR), remove methods like MICE from consideration.

Diagnostics

Bayesian

  • Misc
  • Predictive Mean Matching (PMM)
    • Notes from:

    • Uses a Bayesian regression to predict a missing value, then randomly picks a value from a group of observed values that are closest to the predicted value.

    • Steps

      1. Estimate a linear regression model:
        • Use the variable we want to impute as \(Y\).
        • Use a set of good predictors as \(X\) (Guidelines for the selection of \(X\) can be found in van Buuren, 2012, p. 128).
        • Use only the observed values of \(X\) and \(Y\) to estimate the model.
      2. Draw randomly from the posterior predictive distribution of \(\hat \beta\) and produce a new set of coefficients \(\beta^*\).
        • This Bayesian step is needed for all multiple imputation methods to create some random variability in the imputed values.
      3. Calculate predicted values for observed and missing \(Y\).
        • Use \(\hat \beta\) to calculate predicted values for observed \(Y\).
        • Use \(\beta^*\) to calculate predicted values for missing \(Y\).
      4. For each case where \(Y\) is missing, find the closest predicted values among cases where \(Y\) is observed.
        • Example:
          • \(Y_i\) is missing. Its predicted value is 10 (based on \(\beta^*\)).
          • Our data consists of five observed cases of \(Y\) with the values 6, 3, 22, 7, and 12.
          • In step 3, we predicted the values 7, 2, 20, 9, and 13 for these five observed cases (based on \(\hat \beta\)).
          • The predictive mean matching algorithm selects the observed cases (typically three) whose predicted values are closest to the predicted value for our missing \(Y_i\). Hence, the algorithm selects the cases with predicted values 7, 9, and 13 (the closest values to 10).
      5. Draw randomly one of these three close cases and impute the missing value \(Y_i\) with the observed value of this close case.
        • Example: Continued
          • The algorithm draws randomly from 6, 7, and 12 (the observed values that correspond to the predicted values 7, 9, and 13).
          • The algorithm chooses 12 and substitutes this value for \(Y_i\).
      6. In case of multiple imputation (strongly advised), steps 1-5 are repeated several times.
        • Each repetition of steps 1-5 creates a new imputed data set.
        • With multiple imputation, missing data is typically imputed 5 times.
    • Example

      library(mice)

      # impute the data m times with predictive mean matching,
      # then extract a completed dataset
      data_imp <- 
        complete(mice(data,
                      m = 5,
                      method = "pmm"))
      • m is the number of times to impute the data
      • complete formats the data into different shapes according to an action argument
      • Running parlmice instead of mice imputes in parallel (parlmice has been superseded by futuremice in recent versions of {mice})

Multiple Imputation Fit

  • AKA “multiply” imputed data
  • The key difficulty of multiple imputation is that it produces K replicated datasets, corresponding to different estimated values for the missing data in the original dataset.
  • Packages
    • {merTools} - Tools for aggregating results for multiply imputed Mixed Effects model data
  • Fitting a regression model with multiply imputed data (see the sketch below)
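    • A minimal sketch of the standard {mice} workflow (fit the model on each completed dataset, then pool with Rubin’s rules), using the nhanes data shipped with {mice}:

      library(mice)

      # impute, fit the model on each of the m completed datasets,
      # then pool the estimates with Rubin's rules
      imp <- mice(nhanes, m = 5, method = "pmm", seed = 1, printFlag = FALSE)
      fit <- with(imp, lm(chl ~ age + bmi))
      summary(pool(fit))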

Time Series

ML

  • Random Forest
    • See StatQuest: Random Forests Part 2: Missing data and clustering video for more details
    • Process: Classification model
      • Missingness is in the training data
        • Choose initial values for the missing data
          • Look at the predictor’s values for observations that have the same outcome value as the observation with the missing data
            • Categorical: For example, if the row has an observed outcome of 1 (i.e. event), the algorithm looks at that predictor’s values where outcome = 1 and chooses the most common category as the initial value
            • Numeric: Same as categorical, except the median of the predictor’s values is chosen as the initial value
        • Create a “Proximity Matrix” to determine which observation is most similar to observation with the missing data
          • The matrix values are counts of how many times each row ends up in the same terminal node as the missing-data row, across all the trees in the forest
          • The counts are then divided by the number of trees in the forest
        • Categorical: Weights for each category are calculated (see video). These weights are multiplied by the observed frequency of each category in the training data. The category with the highest weighted frequency becomes the new value for the missing data.
        • Numerical: The weights are used to calculate a weighted average of the observed values, and that weighted average becomes the new value for the missing data.
        • The process is repeated until the values stop changing within a tolerance.
      • Missingness in the out-of-sample data
        • A copy of the observation with the missingness is made for each outcome category.
        • The proximity matrix procedure is run for each copy.
        • Then a prediction for each copy, with its new values, is made in each tree of the forest (the outcome label for each copy having been stripped).
        • Whichever copy has its (stripped) outcome label predicted correctly by the most trees wins, and that label is the prediction for that observation.
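    • {randomForest}’s rfImpute() implements this proximity-based imputation for training data; a minimal sketch (hypothetical: assumes a data frame dat_na with a factor outcome y and NAs only in the predictors):

      library(randomForest)

      # rough initial fills, then iterative re-imputation using the
      # forest's proximity matrix, as described above
      set.seed(1)
      dat_imp <- rfImpute(y ~ ., data = dat_na, iter = 5, ntree = 300)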