

  • Packages

    • {multiverse} - makes it easy to specify and execute all combinations of reasonable analyses of a dataset

  • Regression Workflow (Paper)

  • ML Workflow

  • Sources of Data Leakage (article)

    • Feature Leakage - Often happens with features that are directly related to the target and exist as a result of the target event. Normally, feature leakage happens because the feature value is updated at a point in time after the target event.
      • Examples:
        • You are trying to predict loan default of a certain customer and one of your features is the number of outbound calls (i.e. calls from the bank) that the customer had in past 30 days. What you don’t know is that in this fictional bank the customer receives outbound calls only after they’ve entered into a default scenario.
        • You are trying to predict a certain disease of a patient and are using the number of times the patient went through a specific diagnostic test. However, you later find that this test is only prescribed to people after it’s been determined that they have a high likelihood of having the disease.
    • Data Splits (See Cross-Validation >> K-Fold >> Misc)
      • Distributional statistics (e.g. mean, standard deviation, min/max) used to transform the training set should be computed on the training set only and then reused to transform the validation and test sets, e.g. for scaling/standardization/normalization.
      • Imputation should happen after the train/test split.
      • Perform outlier analysis (i.e. if removing rows) only on the training set.
      • Subsampling for class imbalance should only be on the training set. (See Classification >> Class Imbalance >> CV)
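A minimal sketch of the data-split rules above, assuming pandas is available. The helper names `fit_preprocessing` and `apply_preprocessing` and the toy column `x` are hypothetical, chosen only for illustration. Imputation means and scaling statistics are learned from the training set and then reused unchanged on the test set.

```python
import pandas as pd

def fit_preprocessing(train: pd.DataFrame, num_cols: list[str]) -> dict:
    """Learn imputation and scaling statistics from the training set only."""
    means = train[num_cols].mean()
    filled = train[num_cols].fillna(means)
    # Population std of the imputed training data; avoid dividing by zero
    stds = filled.std(ddof=0).replace(0, 1.0)
    return {"means": means, "stds": stds}

def apply_preprocessing(data: pd.DataFrame, num_cols: list[str], stats: dict) -> pd.DataFrame:
    """Impute and standardize any split using the training-set statistics."""
    out = data.copy()
    out[num_cols] = (out[num_cols].fillna(stats["means"]) - stats["means"]) / stats["stds"]
    return out

# Toy data (hypothetical): the split happens before any statistics are computed
train = pd.DataFrame({"x": [1.0, 2.0, None, 3.0]})
test = pd.DataFrame({"x": [None, 10.0]})

stats = fit_preprocessing(train, ["x"])
train_scaled = apply_preprocessing(train, ["x"], stats)
test_scaled = apply_preprocessing(test, ["x"], stats)  # no test-set statistics are used
```

The missing test value is filled with the training mean, so it maps to 0 after standardization. The test set's own mean never enters the calculation.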
    • Duplicate Rows (See EDA, General >> Preprocessing)
    • Checks
      • Fit an extremely shallow decision tree and check whether one feature's importance is enormously larger than that of all the others
      • If there is a time stamp, you might be able to determine if certain feature values (indicator or counts) occur very close to the event time (e.g. churn, loan default).
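The shallow-tree check could look roughly like the sketch below, assuming scikit-learn and NumPy are installed. The data is synthetic, the feature names are hypothetical, and the 0.9 importance threshold is an arbitrary assumption rather than a rule from the article. The `outbound_calls` column is built to mimic the loan-default example: it is non-zero only after the target event.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)  # synthetic binary target (e.g. default / no default)

X = np.column_stack([
    rng.normal(size=n),               # pure noise
    rng.normal(size=n) + 0.3 * y,     # weakly predictive feature
    y * rng.integers(1, 5, size=n),   # leaky: non-zero only when the event happened
])
feature_names = ["noise", "weak_signal", "outbound_calls"]

# An extremely shallow tree: a leaky feature tends to dominate the first split
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
importances = dict(zip(feature_names, tree.feature_importances_))

# Flag features whose importance dwarfs the rest (threshold is an assumption)
suspects = [name for name, imp in importances.items() if imp > 0.9]
```

A flagged feature is not proof of leakage. It is a prompt to find out when and how that feature's values are recorded relative to the target event.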
  • Make ML model pipelines reusable and reproducible

    • Notes from 7 Tips to Future-Proof Machine Learning Projects
    • Modularization - Useful for debugging and iteration
      • Don’t use declarative programming. Create functions/classes for preprocessing, training, tuning, etc., and keep them in separate files. You’ll call these functions in the main script
        • Helper function

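          ## preprocessing.py ##
          import pandas as pd

          def data_preparation(data):
              # Drop columns that are not used for modeling
              data = data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
              numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
              # Fill missing numeric values with the column means of the data passed in
              data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
              # Extract the month from the date and store it as a string (categorical)
              data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
              return data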
        • Main script

          from preprocessing import data_preparation 
          train_preprocessed = data_preparation(train_data)
          inference_preprocessed = data_preparation(inference_data)
      • Keep parameters in a separate config file
        • Config file

          ## parameters.py ##
          DROP_COLS = ['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am']
          NUM_COLS = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
        • Preprocessing script

          ## preprocessing.py ##
          import pandas as pd
          from parameters import DROP_COLS, NUM_COLS

          def data_preparation(data):
              # Column lists come from the config file instead of being hard-coded
              data = data.drop(DROP_COLS, axis=1)
              data[NUM_COLS] = data[NUM_COLS].fillna(data[NUM_COLS].mean())
              data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
              return data
    • Versioning Code, Data, and Models - Useful for investigating drift
      • See tools like DVC, MLFlow, Weights and Biases, etc. for model and data versioning
        • Important to save data snapshots throughout the project lifecycle, for example: raw data, processed data, train data, validation data, test data and inference data.
      • GitHub and dbt for code versioning
    • Consistent Structures - Consistency in project structures and naming can reduce human error, improve communication, and just make things easier to find.
      • Naming examples:

        • <model-name>-<parameters>-<model-version>

        • <model-name>-<data-version>-<use-case>
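One way to keep such names consistent is to generate them from a single helper instead of typing them by hand. The sketch below follows the first pattern. The function name, the parameter formatting, and the example values are all hypothetical.

```python
def model_artifact_name(model_name: str, parameters: dict, model_version: str) -> str:
    """Build a name of the form <model-name>-<parameters>-<model-version>."""
    # Sort parameters so the same settings always produce the same name
    params = "_".join(f"{key}{value}" for key, value in sorted(parameters.items()))
    return f"{model_name}-{params}-{model_version}"

name = model_artifact_name("xgb", {"eta": 0.1, "depth": 6}, "v3")
print(name)  # xgb-depth6_eta0.1-v3
```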

      • Example: Reduced project template based on {{cookiecutter}}

        ├── data
        │   ├── output         <- The output data from the model.
        │   ├── processed      <- The final, canonical data sets for modeling.
        │   └── raw            <- The original, immutable data dump.
        ├── models             <- Trained and serialized models, model predictions, or model summaries
        ├── notebooks          <- Jupyter notebooks. 
        ├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
        │   └── figures        <- Generated graphics and figures to be used in reporting
        ├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
        │                         generated with `pip freeze > requirements.txt`
        └── code               <- Source code for use in this project.
            ├──    <- Makes src a Python module
            ├── data           <- Scripts to generate and process data
            │   ├──
            │   └──
            ├── models         <- Scripts to train models and then use trained models to make
            │   │                 predictions
            │   ├──
            │   └──
            └── analysis       <- Scripts to create exploratory and results oriented visualizations
  • Model is performing well on the training set but much worse on the validation/test set

    • Andrew Ng calls the validation set the “Dev Set” 🙄
    • Test: Randomly sample the training set, hold that sample out of model fitting, and use it as a new validation set. Score your model on this new validation set
      • “Train-Dev” is the sampled validation set
      • Possibilities
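        • Variance: The model also scores poorly on Train-Dev, so the gap is not caused by a difference between the training and validation/test distributions
          • The model has been overfit to the training data
        • Data Mismatch: The model scores well on Train-Dev but poorly on the validation/test sets, which indicates the data distribution of the training set is NOT the same as the validation/test sets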
          • Unlucky and the split was bad
            • Something may be wrong with the splitting function
          • Split ratio needs adjusting. Validation set isn’t getting enough data to be representative.
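The Train-Dev comparison can be summarized as a small decision rule. The sketch below is only an illustration: the function name, the use of error rates, and the `tol` threshold of 0.02 are assumptions, not part of the original test.

```python
def diagnose_gap(train_err: float, train_dev_err: float, dev_err: float,
                 tol: float = 0.02) -> str:
    """Compare errors on the training, Train-Dev, and validation (dev) sets."""
    variance_gap = train_dev_err - train_err   # same distribution, unseen rows
    mismatch_gap = dev_err - train_dev_err     # unseen rows, different source
    if variance_gap > tol and variance_gap >= mismatch_gap:
        return "variance"        # model is overfit to the training data
    if mismatch_gap > tol:
        return "data mismatch"   # training and validation distributions differ
    return "no large gap"

print(diagnose_gap(0.01, 0.09, 0.10))   # variance
print(diagnose_gap(0.01, 0.015, 0.10))  # data mismatch
```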
  • Model is performing well on the validation/test set but not in the real world

    • Investigate the validation/test set and figure out why it’s not reflecting real world data. Then, apply corrections to the dataset.
      • e.g. distributions of your validation/test sets should look like the real world data.
    • Change the metric
      • Consider weighting cases that your model is performing extremely poorly on.
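As an illustration of weighting within a metric, the sketch below computes a weighted error rate in plain Python. The function name, the toy labels, and the weight of 5 are hypothetical. The idea is that mistakes on the up-weighted cases count for more, so the metric stops hiding them.

```python
def weighted_error(y_true, y_pred, weights):
    """Share of total weight that falls on misclassified cases."""
    total = sum(weights)
    wrong = sum(w for t, p, w in zip(y_true, y_pred, weights) if t != p)
    return wrong / total

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]   # the third case is misclassified

plain = weighted_error(y_true, y_pred, [1, 1, 1, 1, 1])     # 1/5
weighted = weighted_error(y_true, y_pred, [1, 1, 5, 1, 1])  # 5/9
```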
  • Splits

    • Harrell: “not appropriate to split data into training and test sets unless n>20,000 because of the luck (or bad luck) of the split.”
    • If your dataset is over 1M rows, then having a test set of 200K might be overkill (e.g. ratio of 60/20/20).
      • Might be better to use a ratio of 98/1/1 for big data projects and 60/20/20 for smaller data projects
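To make the ratios concrete, this sketch converts a train/validation/test ratio into row counts. The function name is hypothetical and the rounding rule is one arbitrary choice among several.

```python
def split_sizes(n_rows: int, ratios=(0.6, 0.2, 0.2)):
    """Return (train, validation, test) row counts for the given ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    n_val = round(n_rows * ratios[1])
    n_test = round(n_rows * ratios[2])
    # Give any rounding remainder to the training set
    return n_rows - n_val - n_test, n_val, n_test

print(split_sizes(1_000_000))                      # (600000, 200000, 200000)
print(split_sizes(1_000_000, (0.98, 0.01, 0.01)))  # (980000, 10000, 10000)
print(split_sizes(10_000))                         # (6000, 2000, 2000)
```

With 1M rows, a 98/1/1 split still leaves 10,000 rows each for validation and test.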
    • link
      • Shows that simple data splitting does not give valid confidence intervals (even asymptotically) when one refits the model on the whole dataset. Thus, if one wants valid confidence intervals for prediction error, we can only recommend either data splitting without refitting the model (which is viable when one has ample data), or nested CV.