Tuning

Misc

  • Packages
    • {tune} - Various grid searches via {dials} grids (space_filling, regular, and random)
    • {finetune}
      • Grid search via racing with ANOVA models or win/loss statistics
      • Simulated Annealing
    • {mlr3tuning} - The hyperparameter optimization package of the mlr3 ecosystem
    • {mlr3hyperband} - Adds the Hyperband and Successive Halving algorithm
      • Cheaper to evaluate, but suffer from approximations errors in small-budget evaluations. (mango paper)
    • {mlr3tuningspaces} - A collection of search spaces from scientific articles for commonly used learners.
      • Ready-to-use search spaces for many popular machine learning algorithms (i.e. ranges for hyperparameters)
      • Grids can also be created using {paradox}
    • {{optuna}}
      • Algorithms: Grid Search, Random Search, Tree-structured Parzen Estimator, CMA-ES, Gaussian process-based, Algorithm to enable partial fixed parameters, Nondominated Sorting Genetic Algorithm II, A Quasi Monte Carlo sampling
      • Parallelized by using PostgreSQL or MySQL to communicate between workers
    • {{hyperopt}}
      • Algorithms: Random Search, Tree of Parzen Estimators (TPE), Adaptive TPE
      • Parallelized through Spark or MongoDB to communicate between workers
  • Tree Parameter Categories (many overlap)
    • Tree Structure and Learning
    • Training Speed
    • Accuracy
    • Overfitting
  • When the search space is quite large, try the particle swarm method or genetic algorithm for optimization.
  • Early Stopping can lower computational costs and decrease practitioner downtime

Pairwise Tuning

  • Tunes a pair of parameters at a time. Once the first pair of parameters is tuned, those values replace the default parameter values, and the next pair of parameters is tuned, etc.
  • Limits the computational cost of performing a full grid search jointly with all parameters at once supposedly without sacrificing much in terms of predictive performance.
    • A full grid search or other tuning method can be applied to each pair.
  • Might be beneficial to create pairs that affect the same tuning area of the model fit (e.g. subsampling, regularization, tree complexity) so that the tuning process might capture any interaction effects between the parameters.
  • Example: XGBoost (article)
    • Parameter Pairs
      • (max_depth, eta)
      • (subsample, colsample_bytree)
      • (min_child_weight, gamma), and
      • (reg_alpha, reg_lambda)
    • Process
      • max_depth and eta are tuned
      • Tuned values for max_depth and eta replace their default values
      • subsample and colsample_bytree are tuning using the model with the tuned values for max_depth and eta
      • etc.
  • Viz

    • Each pair of parameters along with the loss metric are plotted in a 3D chart
    • matplotlib::plot_trisurf() uses Surface Triangulation is used to interpolate the gaps between the tested parameter values
    • If the chart has multiple pronounced dips and bumps (left chart):
      • May indicate that there’s a minima in one of the dips that might be better than the chosen parameter values as the interpolation process might have smoothed over that area a bit.
        • Might want to play with the smoothing parameter a bit to try and get a clearer idea of the range of values to further investigate.
      • Indicates the model is sensitive to this pair of parameters which might translate into a model instability, when we pass a new type of dataset into the tuned model in the deployment domain.

Bayesian Optimization

  • Packages
    • {tune::tune_bayes} - tidymodels bayesian optimization; uses a Gaussian process (GP) model; parallelized through {future}
    • {mlr3mbo} - mlr3 modular implementation for single- and multi-objective Bayesian Optimization
      • Gaussian processes for low to medium dimensional numeric search spaces and random forests for higher dimensional mixed (and/or hierarchical) search spaces (source)
    • {rBayesianOptimization} - A Pure R implementation of Bayesian Global Optimization with Gaussian Processes.
    • {{mango}} (paper)
      • Parallelized by using a Dask distributed cluster
      • Solve Combined Algorithm/Classifier Selection and Hyperparameter optimization (CASH) problem using multiple GP surrogates.
      • GP surrogates shown to outperform tree-structured Parze estimators (TPE), and TPE is designed to be sequential in nature, thus suffers performance loss in parallel search.
  • tl;dr
    • Builds a surrogate model using Gaussian Processes that estimates model score
    • This surrogate is then provided with configurations picked randomly, and the one that gives the best score is kept for training
    • Each new training updates the posterior knowledge of the surrogate model.
  • Components:
    • The black box function to optimize: f(x).
      • We want to find the value of x which globally optimizes f(x) (aka objective function, the target function, or the loss function)
    • The acquisition function: a(x)
      • Used to generate new values of x for evaluation with f(x).
      • a(x) internally relies on a Gaussian process model m(X, y) to generate new values of x.
  • Steps:
    1. Define the black box function f(x), the acquisition function a(x) and the search space of the parameter x.
    2. Generate some initial values of x randomly, and measure the corresponding outputs from f(x).
    3. Fit a Gaussian process model m(X, y) onto X = x and y = f(x) (i.e. surrogate model for f(x))
    4. The acquisition function a(x) then uses m(X, y) to generate new values of x as follows:
      • Use m(X, y) to predict how f(x) varies with x.
      • The value of x which leads to the largest predicted value in m(X, y) is then suggested as the next sample of x to evaluate with f(x).
    5. Repeat the optimization process in steps 3 and 4 until we finally get a value of x that leads to the global optimum of f(x).
      • All historical values of x and f(x) should be used to train the Gaussian process model m(X, y) in the next iteration — as the number of data points increases, m(X, y) becomes better at predicting the optimum of f(x).

Tree-based Parzen Estimators (TPE)

  • Misc
    • Notes from HyperOpt Demystified
    • Libraries
    • Typically outperforms basic bayesian optimization, but the main selling point is it handles complex hyperparameter relationships via a tree structure.
    • Supports categorical variables (cat hyperparams?) which traditional Bayesian optimization does not.
  • Process
    • Train a model with several sets of randomly-selected hyperparameters, returning objective function values.
    • Split our observed objective function values into “good” and “bad” groups, according to some threshold gamma (γ).
    • Calculate the “promisingness” score, which is just P(x|good) / P(x|bad).
    • Determine the hyperparameters that maximize promisingness via mixture models.
    • Fit our model using the hyperparameters from step 4.
    • Repeat steps 2–5 until a stopping criteria.
  • Tips/Tricks
    • HyperOpt is parallelizable via both Apache Spark and MongoDB. If you’re working with multiple cores, wether it be in the cloud or on your local machine, this can dramatically reduce runtime.
    • If you’re parallelizing the tuning process via Apache Spark, use a SparkTrialsobject for single node ML models (sklearn) and a Trails object for parallelized ML models (MLlib).
    • MLflow easily integrates with HyperOpt.
    • Don’t narrow down the search space too early. Some combinations of hyperparameters may be surprisingly effective.
    • Defining the search space can be tricky, especially if you don’t know the functional form of your hyperparameters. However, from personal experience TPE is pretty robust to misspecification of those functional forms.
    • Choosing a good objective function goes a long way. In most cases, error is not created equal. If a certain type of error is more problematic, make sure to build that logic into to your function.

Decision Trees

  • Misc
  • Hyperparameters
    • Maximum Depth (aka Tree Depth): Maximum level a tree can “descend” during the training process
      • Too high may lead to overfit
      • Too low may lead to underfit
    • Minimum Samples Split (aka Min N, minsplit): The minimum number of observations that must exist in a node in order for a split to be attempted
      • Too low may lead to overfit
      • Too high may lead to underfit
      • Defaults
        • {rpart}: 20
      • Recommended Range
        • {21 = 2, . . . , 27} (source)
    • Minimum Samples Leaf (minbucket): Number of observations in a terminal node after the split has “potentially” happened
      • Too low may lead to overfit
      • Too high may lead to underfit
      • Defaults
        • {rpart}: minsplit/3
      • Recommended Range
        • {20 = 1, . . . , 26} (source)
    • Minimum Impurity Decrease: Sets the threshold for the amount of impurity decrease that must occur in order for there to be another split
      • Too low may lead to overfit
      • Too high may lead to underfit
    • Maximum Features: Randomly choosing a set of features for each split
      • Useful for high dimension datasets; adds some randomness
      • Too high can lead to long training times
      • Too low may lead to underfit
    • Complexity Parameter (cp): The threshold complexity parameter for {rpart}. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile
      \[ R_{cp}(T) = R(T) + \text{cp} \cdot |T| \cdot R_1(T) \]
      • \(R\) is the risk (?)
      • \(T_1\) is the tree with no splits
      • \(|T|\) is the number of splits
      • cp = 1 will always result in a tree with no splits
      • For regression models, the scaled cp has a very direct interpretation: if any split does not increase the overall R2 of the model by at least cp then that split is decreed to be, a priori, not worth pursuing.
      • Default: 0.01
      • Recommended Range
        • [10−4, 10−1= 0.1] (source)

Random Forest

  • Misc
    • Docs: Ranger
    • Implicit features selection is performed by splitting, but performance decreases substantially if over 100 noise-like features are added & drastically if over 500 noise-like features
  • Hyperparameters
    • num_trees: Total number of trees
      • This should not be tuned as more trees almost always leads to better performance, but instead should be set the maximum that is computationally practical with your resources and your time constraint (paper which references this paper)
      • Default: 500 (Ranger) with 2000 potentially a maximum (?)
    • mtry: The number of potential features that can get split at each node
      • Most influential hyperparameter for random forests.
      • Increasing it improves performance in the presence of noise
        • Similar improvements can be had with (explicit) feature selection (e.g. recursive feature elimation)
      • Default:
        • {ranger}: \(\sqrt{p}\) , square root of the number of features
      • Range
        • {1, . . . , p}
    • replace: Should observations be sampled with or without replacement
      • Default
        • {ranger}: TRUE
    • sample.fraction: The proportion of randomly sampled observations
      • Default
        • {ranger}: 1 if replace = TRUE, 0.632 if replace = FALSE
      • Recommended Range

LightGBM

  • Notes
    • Parameters are listed from most to least important
    • Seems like the strategy should be to tune structure first, then move to accuracy or overfitting parameters based on results
    • Missing values should be encoded as NA_integer_
    • Processing: it is recommended to rescale data before training so that features have similar mean and standard deviation
  • Hyperparameters
    • Structure
      • num_leaves: the number of decision nodes in a tree
        • kaggle recommendation: 2^(max_depth)
          • translates to a range of 8 - 4096
      • max_depth: The complexity of each tree
        • kaggle recommendation: 3 - 12
      • min_data_in_leaf: the minimum number of observations that fit the decision criteria in a leaf
        • Value depends on sample size and num_leaves
        • lightgbm doc recommendation: 100 - 10000 for large datasets
      • linear_tree (docs): fits piecewise linear gradient boosting tree
        • Tree splits are chosen in the usual way, but the model at each leaf is linear instead of constant
          • The first tree has constant leaf values
        • Helps to extrapolate linear trends in forecasting
        • Categorical features are used for splits as normal but are not used in the linear models
        • Increases memory use; no L1 regularization
    • Accuracy
      • n_estimators: controls the number of decision trees
      • learning_rate: step size parameter of the gradient descent
        • Kaggle recommendation: 0.01 - 0.3
          • Moving outside this range is usually towards zero
      • max_bin: controls the maximum number of bins that continuous features will bucketed into
        • default = 255
    • Overfitting
      • lambda_l1lambda_l2: regularization
        • Default: 0
        • Kaggle recommendation: 0 - 100
      • min_gain_to_split: the reduction in training loss that results from adding a split point
        • Default: 0
        • Extra regularization in large parameter grids
        • Reduces training time
      • bagging_fraction: randomly select this percentage of data without resampling
        • Default: 1
        • * must set bagging_freq to an integer *
      • feature_fraction: specifies the percentage of features to sample from when training each tree
        • Default: 1
        • * Must set bagging_freq to an integer *
      • bagging_freq: frequency for bagging
        • Default: 0 (disabled)
        • (Integer) e.g. Setting to 2 means perform bagging at every 2nd iteration
      • stopping_rounds: early stopping
  • Issues (from docs)
    • Poor Accuracy
      • Use large max_bin (may be slower)
      • Use small learning_rate with large num_iterations
      • Use large num_leaves (may cause over-fitting)
      • Use bigger training data
    • Overfitting
      • Use small max_bin
      • Use small num_leaves
      • Use min_data_in_leaf and min_sum_hessian_in_leaf
      • Use bagging by set bagging_fraction and bagging_freq
      • Use feature sub-sampling by set feature_fraction
      • Use bigger training data
      • Try lambda_l1lambda_l2 and min_gain_to_split for regularization
      • Try max_depth to avoid growing deep tree
      • Try extra_trees
      • Try increasing path_smooth

XGBoost

  • Notes
    • Drob starts with learning_rate = 0.01 and tune other parameters before coming back to tune learning rate
    • Dancho tunes learning_rate first which he says is responsible for about 80% of the model’s performance. Then, he fixes the tuned learning rate and tunes the rest of the grid of hyperparameters.
      • Cuts training time by 10x
    • Kuhn suggests setting trees to about 500 and tune stop_iter
      • stop_iter: early stopping; stops if no improvement has been made after this many iterations
    • Uber found that the most important hyperparameters were:
      • tree_depth, trees, learning_rate, and min_n
    • Paper’s list of important hyperparamters
      • eta, nrounds, max_depth, colsample_bytree, colsample_bylevel, lambda, alpha, subsample
    • tree_method (more details about exact, approx)- Specify which tree construction algorithm you want to use. Trade-offs between accuracy and speed
      • “exact” - accurate algorithm, but it is not very scalable as during each split find procedure it iterates over all entries of input data.
        • Inefficient when the data does not completely fit into memory
        • Doesn’t support distributed training
      • “approx” - uses quantile sketch and gradient histograms
      • “hist” - method used in lightgbm with slight changes (binning continuous features)
        • Applies some of approx performance enhancements (e.g. bin caching)
        • Typically faster than approx
      • “gpu_hist” - gpu implementation of “hist”
        • Much faster than “hist” and usually requires less memory
          • Guessing because it uses the gpu memory
        • Not available for some OSs (link)
      • “auto” - Default. Chooses fastest method (exact or approx) based on data size
        • For larger datasets, approx will be used
  • Hyperparameters
    • trees (aka (?) n_estimators): The number of trees in your forest
      • Grid Value Examples: seq(50, 700, 50), seq(250, 1500, 25)
    • min_n: minimum number of data points in a node that is required for the node to be split further
      • aka minimum child weight
    • mtry: The number of predictors to randomly sample for each split in a tree
      • Grid Value Examples: c(3,5,7), c(5, 7, 9)
      • More Predictors –> higher mtry
    • Shrinkage
      • learning_rate (aka eta) aka step size; explainer
        • Shrinks the feature weights to make the boosting process more conservative
        • More trees \(\rightarrow\) lower learning rate
        • Default: 0.3
        • Range: [0,1]
        • Recommended Range
          • [10-4, 100 = 1] (source)
          • Grid value examples: 0.2, 0.01, 0.008, 0.005
    • Tree-Booster Constraints
      • tree_depth (aka max_depth): The maximum depth of each tree
        • Increasing makes the model more likely to overfit
        • Consumes memory when training a deep tree.
        • “exact” tree method requires non-zero value.
        • Default: 6
        • Range: [0,∞]
        • Recommended Range
          • Grid value examples: 4 - 8 (also see lightgbm recommendations for max_depth)
          • {1, . . . , 20} (source)
      • min_child_weight
        • Minimum sum of weights needed in a node in order for the algorithm to make another split
        • Increasing makes the model less likely to overfit
        • Default: 1
        • Range: [0,∞]
    • Regularization
      • reg_lambda (aka lambda), reg_alpha (aka alpha), and gamma:
        • These are parameters of the regularization function
        • Also see
        • High Alpha: generates a sparses model (i.e. has many null leaf weights)
          • L1 regularization term on weights
          • Simplifies the model
          • Reduces the model size as it’s not necessary to store null values.
          • Default: 0
          • Recommended Range
        • High Lambda: more stringently penalizes weights of less influential samples
          • L2 regularization term on weights
          • Simplifies the model
          • Has largest effect when data is smaller
          • Default: 1
          • Recommended Range
        • High Gamma: fewer nodes
          • Affects the amount of gain needed to add another split
          • Default: 0
          • Range: [0,∞]
    • Random Subsampling
      • subsample:
        • Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees. and this will prevent overfitting.
        • Subsampling will occur once in every boosting iteration
        • Default: 1
        • Range: (0,1]
        • Recommended Range
      • colsample_bytree
        • Subsample ratio of columns when constructing each tree.
        • Default: 1
        • Recommended Range:
      • colsample_bylevel
        • subsample ratio of columns for each depth level
        • Default: 1
        • Recommended Range
    • Iteration Control
      • num_boost_round (aka nrounds)- Total number of boosting iterations
        • Recommended Range
      • early_stopping_round - Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training

Catboost

  • Search Spaces
    • Source
      • num_estimators: 100-2000, step size 25
      • max_depth: 5-25
      • num_leaves: 8-128
      • max_features: 40 - 80% of total features
      • min_samples: 5-100, step size of 25
      • learning_rate: 0.001 - 0.2
      • random_state: random integer

SVM

  • Misc
  • Hyperparameters
    • gamma - All the kernels except the linear one require the gamma parameter. (
      • Defaults
        • {e1071} : 1/p (where p is the number of features)
      • Recommended Range:
    • coef0 - Parameter needed for kernels of type polynomial and sigmoid
      • Defaults:
        • {e1071}: 0
    • cost – The cost of constraints violation — it is the ‘C’-constant of the regularization term in the Lagrange formulation.
      • C = 1/λ (R) or 1/α (sklearn)
      • When C is small, the regularization is strong, so the slope will be small
      • Defaults
        • {e1071}: 1
      • Recommended Range
    • degree - Degree of the polynomial kernel function
      • Defaults
        • {e1071}: 3
      • Recommended Range
    • epsilon - Needed for insensitive loss function
      • When the value of epsilon is small, the model is robust to the outliers.
      • When the value of epsilon is large, it will take outliers into account.
      • Defaults
        • {e1071}: 0.1
    • nu - For {e1071}, needed for types: nu-classification, nu-regression, and one-classification

Ensembles

  • Misc
    • Notes from
      • Hyperparameters Tuning for Machine Learning Model Ensembles
        • See article for
          • Details on performance, durations, etc. between sequential and simultaneous tuning methods
          • Links to repo, experimental paper
        • tldr;
          • They like the simultaneous tuning because had the best metric performance with the lowest variance (more stable results)
          • I think the sequential tuning method was comparable (and better in some cases) to the simultaneous tuning method but is probably faster and less costly.
    • Composite Structures
      • W/sequential model pipelines (bottom to top)
        • “dtr” = decision tree regression
      • W/feature engineering models
        • On the left side are typical structures with individual predictive models being fed into an ensemble model (e.g. Random Forest)
        • On the right side, some kind of feature engineering process or modeling happens prior to the predictive model/ensemble model.
  • Sequential Tuning
    • Only one model is tuned at a time
    • The scoring, during the tuning process, happens on the ensemble model
    • Example
      • Structure
        • (top pipe) KNN feeds its predictions to the ridge regression which feeds it’s predictions to the lasso regression (ensemble model)
        • (bottom pipe) Ridge regression feeds its predictions to the random forest which feeds it’s predictions to the lasso regression (ensemble model)
        • Red arrows indicate how far the data/predictions have travelled through the structure. Here it’s all the way to the ensemble model
        • Red circle indictates which model is currently being tuned
        • Models without hyperparameter values are models with default values for their hyperparameters
      • Figure shows that the KNN and the RR (bottom pipe) models have already been tuned and the RR model (top pipe) is the model being tuned.
      • The red Y’ indicates that the prediction scoring, while ridge regression is being tuned, is happening at the ensemble model.
      • RF gets tuned next then finally the ensemble model
  • Simultaneous Tuning
    • All models, including the ensemble model, are tuned at the same time
    • Computationally expensive
      • I’d think it would be more monetarily expensive as well given that you likely have to provision more machines in any realistic scenario to get a decent training time (one for each pipe?)
    • Example
      • See Sequential tuning example for details on the structure and what items in the figure represent.

DL

  • Increasing or decreasing the number of training iterations
  • learning_rate - The gradient descent learning rate, also called \(\eta\) pr \(\alpha\)
    • Recommended Range
  • regularizer_l2 - L2 penalty on weights
    • Recommended Range
  • epochs - The number of complete passes of the entire training dataset through the learning algorithm
    • Problem dependent
    • Typically optimized with early stopping
    • Recommended Range
      • 10, 100, 500, 1000, and larger (source)
  • Weight Decay
    • Improves data efficiency by > 50%
    • Frequently found in the best hyperparam configs
    • Among the most important hparams to tune
    • Tricky to tune
  • Learning Rate Scheduling (article): The schedule reduces the learning rate as training progresses, so take smaller step sizes near and around the optimum
    • Time-based Decay
      \[ \alpha = \frac{\alpha_0}{1+\text{decay} + \text{epoch}} \]
      • \(\alpha\): Learning Rate
      • \(\alpha_0\): Initial Learning Rate
      • \(\text{decay}\): Decay Rate
      • \(\text{epoch}\): Number of interations
    • Step Decay

      \[ \alpha = \alpha_0 \times \operatorname{factor}^{\frac{\text{epoch}}{\text{step}}} \]
      • \(\alpha\): Learning Rate
      • \(\alpha_0\): Initial Learning Rate
      • \(\text{factor}\): Factor that the learning rate will be reduced by
      • \(\text{epoch}\): Number of interations
      • \(\text{step}\): Number of epochs after which the learning rate should be reduced
    • Exponential Decay
      \[ \alpha = \alpha_0 \times e^{-\text{decay} \times \text{epoch}} \]
      • \(\alpha\): Learning Rate
      • \(\alpha_0\): Initial Learning Rate
      • \(\text{epoch}\): Number of interations
      • \(\text{decay}\): Decay Rate
    • Others: Performance Scheduling, 1Cycle Scheduling, and Power Scheduling.

Regularized Regression

  • Lambda - Penalization factor
    • Recommended Range:
    • [ 2-12, 212 ] (source)