sklearn
Misc
Extensions
- {{DeepChecks}} - For model diagnostics
- {{feature-engine}} - Multiple transformers to engineer and select features to use in machine learning models.
- {{permetrics}} (vignette) - 111 diagnostic metrics for regression, classification, and clustering.
Preventing data leakage in sklearn
Use fit_transform on the train data - this ensures that the transformer learns from the train set only and transforms it simultaneously. Then, call the transform method on the test set to transform it based on the information learned from the training data (i.e., the mean and variance of the training data).
- Prevents data leakage
- Somehow the transform parameters calculated on the training data have to be saved, so they can be applied to the production data. This can be done with Pipelines (see Pipelines >> Misc) by serializing the Pipeline objects.
u_transf = LabelEncoder()
item_transf = LabelEncoder()

# encoding
df['user'] = u_transf.fit_transform(df['user'])
df['item'] = item_transf.fit_transform(df['item'])

# decoding
df['item'] = item_transf.inverse_transform(df['item'])
df['user'] = u_transf.inverse_transform(df['user'])
- A more robust way is to use sklearn’s Pipelines (see Pipelines below)
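A minimal sketch of the fit-on-train / transform-on-test pattern described above (assumes numeric X_train and X_test already exist):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/variance from the train set only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics on the test set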
Tuning
- Pick a metric for GridSearchCV and RandomizedSearchCV
- Default is accuracy for classification, which is rarely okay
Using metric from sklearn
gs = GridSearchCV(estimator=svm.SVC(),
                  param_grid={'kernel': ('linear', 'rbf'), 'C': [1, 10]},
                  scoring='f1_micro')
- List of metrics available
Using a custom metric
from sklearn.metrics import fbeta_score, make_scorer

custom_metric = make_scorer(fbeta_score, beta=2)
gs = GridSearchCV(estimator=svm.SVC(),
                  param_grid={'kernel': ('linear', 'rbf'), 'C': [1, 10]},
                  scoring=custom_metric)
Also see Diagnostics, Classification >> Scores >> Lift Score
Score model
Also see Pipelines >> Tuning a Pipeline
Basic
model = RandomForestClassifier(max_depth=2, random_state=0, warm_start=True, n_estimators=1)
model.fit(X_train, y_train)
model.score(X_test, y_test)
Classification
from sklearn.metrics import classification_report

rep = classification_report(y_test, y_pred, output_dict=True)
- See Diagnostics, Classification >> Multinomial for definitions of multinomial scores
Multiple metrics function
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def score(y_true, y_pred, index):
    """Calculate precision, recall, and f1 score"""
    metrics = precision_recall_fscore_support(y_true, y_pred, average='weighted')
    performance = {'precision': metrics[0], 'recall': metrics[1], 'f1': metrics[2]}
    return pd.DataFrame(performance, index=[index])
predict_proba outputs probabilities for classification models
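A minimal usage sketch (assumes a fitted classifier model and X_test from the examples above):
proba = model.predict_proba(X_test)  # shape (n_samples, n_classes), one column per class
pos_proba = proba[:, 1]              # probability of the positive class in the binary case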
Optimization
- Misc
- For fast iteration and if you have access to GPUs, it’s better to use {{XGBoost, LightGBM or CatBoost}} since sklearn is only CPU capable.
- Fastest to slowest: CatBoost, LightGBM, XGBoost
- {{sklearnex}}
Intel® Extension for Scikit-learn that dynamically patches scikit-learn estimators to use Intel(R) oneAPI Data Analytics Library as the underlying solver
- Requirements
- The processor must have x86 architecture.
- The processor must support at least one of SSE2, AVX, AVX2, AVX512 instruction sets.
- ARM* architecture is not supported.
- Intel® processors provide better performance than other CPUs.
Misc
- Docs
- Benchmarks
- Algorithms
- Currently the extension does not support multi-output or sparse data for Random Forest and K-Nearest Neighbors
Non-interactively
python -m sklearnex my_application.py
Interactively for all algorithms
from sklearnex import patch_sklearn
patch_sklearn()
# then import sklearn estimators
Interactively for specific algorithms
from sklearnex import patch_sklearn
patch_sklearn(['RandomForestRegressor', 'SVC', 'DBSCAN'])
Unpatch
from sklearnex import unpatch_sklearn
unpatch_sklearn()
Need to reimport estimators after executing
- Speeding up retraining a model (article)
- Potentially useful for slow training models that need to be retrained often.
- Not all models have these methods available
warm_start = True permits the use of the existing fitted model attributes to initialize a new model in a subsequent call to fit
- Doesn't learn new parameters, so it shouldn't be used to fix concept drift
- i.e. The new data should have the same distribution as the original data which maintains the same relationship with the output variable
Example
# original fit
model = RandomForestClassifier(max_depth=2, random_state=0, warm_start=True, n_estimators=1)
model.fit(X_train, y_train)

# fit with new data
model.n_estimators += 1
model.fit(X2, y2)
- Partial Fit
Does learn new model parameters
Example
# original fit
model = SGDClassifier()
model.partial_fit(X_train, y_train, classes=np.unique(y))

# fit with new data
model.partial_fit(X2, y2)
Preprocessing
Misc
- Resources
- SciKit-Learn Transformation Docs
train_test_split with shuffle=False doesn't choose random rows for each partition. For example, a 90-10 split puts the first 90% of the rows in Train, which leaves the last 10% of the rows in Test.
- shuffle=True (the default) randomly shuffles the rows before partitioning, which is equivalent to randomly selecting rows for each partition
Stratified train/test splits
np.random.seed(2019)

# Generate stratified split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Train set class weights
>>> pd.Series(y_train).value_counts(normalize=True)
1    0.4
0    0.4
2    0.2
dtype: float64

# Test set class weights
>>> pd.Series(y_test).value_counts(normalize=True)
1    0.4
0    0.4
2    0.2
- Or you can use StratifiedKFold, which stratifies automatically.
Stratified Train/Validate/Test splits
X_train, X_, y_train, y_ = train_test_split(
    df['token'], tags, train_size=0.7, stratify=tags, random_state=RS
)
X_val, X_test, y_val, y_test = train_test_split(
    X_, y_, train_size=0.5, stratify=y_, random_state=RS
)
print(f'train: {len(X_train)} ({len(X_train)/len(df):.0%})\n'
      f'val: {len(X_val)} ({len(X_val)/len(df):.0%})\n'
      f'test: {len(X_test)} ({len(X_test)/len(df):.0%})')
Check proportions of stratification variable (e.g. “tags”) in splits
split = pd.DataFrame({
    'y_train': Counter(', '.join(i for i in row) for row in mlb.inverse_transform(y_train)),
    'y_val': Counter(', '.join(i for i in row) for row in mlb.inverse_transform(y_val)),
    'y_test': Counter(', '.join(i for i in row) for row in mlb.inverse_transform(y_test))
}).reindex(tag_dis.index)

split = split / split.sum(axis=0)

split.plot(
    kind='bar',
    figsize=(10, 4),
    title='Tag Distribution per Split',
    ylabel='Proportion of observations'
);
Bin numerics:
KBinsDiscretizer(n_bins=4)
Impute Nulls/Nas
SimpleImputer (Docs)
col_transformer = ColumnTransformer(
    # Transformer name, transformer object, and columns
    [("ImputePrice", SimpleImputer(strategy="median"), ["price"])],
    # All other columns go to the remainder transformer
    remainder=SimpleImputer(strategy="constant", fill_value=-1)
)
- Takes “price” and replaces Nulls with median; all other columns get constant, -1, to replace their Nulls
- "most_frequent" strategy is also available
IterativeImputer (Docs) - Multivariate imputer that estimates values to impute for each feature with missing values from all the others.
KNNImputer (Docs) - Multivariate imputer that estimates missing features using nearest samples.
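A minimal sketch of both imputers (assumes a numeric DataFrame df; IterativeImputer is still experimental, so the enable_* import is required):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

df_iterative = IterativeImputer(random_state=0).fit_transform(df)  # model-based, uses the other features
df_knn = KNNImputer(n_neighbors=5).fit_transform(df)               # uses the nearest samples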
Log:
FunctionTransformer(lambda value: np.log1p(value))
Standardize
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
standardized_data = scaler.fit_transform(df)
standardized_df = pd.DataFrame(standardized_data, columns=df.columns)
- Check: standardized_df.apply(["mean", "std"])
Scaling
Min/Max
from sklearn.preprocessing import MinMaxScaler

# Create a scaler object
mm_scaler = MinMaxScaler()

# Transform the feature values
churn[["Customer_Age", "Total_Trans_Amt"]] = mm_scaler.fit_transform(churn[["Customer_Age", "Total_Trans_Amt"]])

# check the feature value range after transformation
churn["Customer_Age"].apply(["min", "max"])
- The feature_range parameter allows you to change the output range
- The default range for the MinMaxScaler is [0, 1]
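A hedged sketch of changing the output range with feature_range (the (0, 5) range here is just illustrative):
from sklearn.preprocessing import MinMaxScaler

mm_scaler_5 = MinMaxScaler(feature_range=(0, 5))
scaled = mm_scaler_5.fit_transform(churn[["Customer_Age"]])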
Ordinal
encode_cat = ColumnTransformer(
    [('cat', OrdinalEncoder(), make_column_selector(dtype_include='category'))],
    remainder='passthrough'
)
Categorical DType (like R factor type)
Assign predefined unordered categories so that when a value for some column does not appear in the training set, receiving that value in the test set does not lead to incorrectly assigned labels or any other logical error.
- With multiple categorical columns, it can be difficult to stratify them in the training/test splits
- Must know all possible categories for each categorical variable
import pandas as pd
from pandas import CategoricalDtype

def transform_data(df: pd.DataFrame, target: pd.Series, frac: float = 0.07,
                   random_state: int = 42) -> pd.DataFrame:
    "Transform non-numeric columns into categorical type and clean data."
    categories_map = {
        'gender': {1: 'male', 2: 'female'},
        'education': {1: 'graduate', 2: 'university', 3: 'high_school',
                      4: 'others', 5: 'others', 6: 'others', 0: 'others'},
        'marital_status': {1: 'married', 2: 'single', 3: 'others', 0: 'others'}
    }
    for col_id, col_map in categories_map.items():
        df[col_id] = df[col_id].map(col_map).astype(
            CategoricalDtype(categories=list(set(col_map.values())))
        )
    return df
One-hot encoding
** Don't use pandas get_dummies because it doesn't handle categories that aren't in your test/production set **
- Wonder if get_dummies creates 1 less dummy than the number of categories like it should
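A hedged aside on that question (assumes the churn data from the examples above): get_dummies keeps all k levels unless you pass drop_first=True, and OneHotEncoder has an analogous drop='first' option.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

dummies = pd.get_dummies(churn['Marital_Status'], drop_first=True)  # k-1 dummy columns
onehot_k_minus_1 = OneHotEncoder(drop='first')                      # also drops one level per feature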
Basic
# Create a one-hot encoder
onehot = OneHotEncoder()

# Create an encoded feature
encoded_features = onehot.fit_transform(churn[["Marital_Status"]]).toarray()

# Create DataFrame with the encoded features
encoded_df = pd.DataFrame(encoded_features, columns=onehot.categories_)
With Training and Test Data (source)
# Initialize the encoder
onehot = OneHotEncoder(handle_unknown='ignore')

# Define columns to transform
trans_columns = ['color_1_', 'color_2_']

# Fit and transform the data
enc_data = onehot.fit_transform(training_data[trans_columns])

# Get feature names
feature_names = onehot.get_feature_names_out(trans_columns)

# Convert to DataFrame
enc_df = pd.DataFrame(enc_data.toarray(), columns=feature_names)

# Concatenate with the numerical data
final_df = pd.concat([training_data[['numerical_1']], enc_df], axis=1)

# Transform test data
test_encoded = onehot.transform(test_data[trans_columns])
test_feature_names = onehot.get_feature_names_out(trans_columns)
test_encoded_df = pd.DataFrame(test_encoded.toarray(), columns=test_feature_names)
final_test_df = pd.concat([test_data[['numerical_1']], test_encoded_df], axis=1)
Pipeline
from sklearn.preprocessing import OneHotEncoder
# one hot encode categorical features
ohe = OneHotEncoder(handle_unknown='ignore')

from sklearn.pipeline import Pipeline
# store one hot encoder in a pipeline
categorical_processing = Pipeline(steps=[('ohe', ohe)])

from sklearn.compose import ColumnTransformer
# create the ColumnTransformer object
preprocessing = ColumnTransformer(
    transformers=[('categorical', categorical_processing, ['gender', 'job'])],
    remainder='passthrough'
)
Class Imbalance
SMOTE with {{imblearn}}
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X_train, y_train = make_classification(n_samples=500, n_features=5, n_informative=3)
X_res, y_res = SMOTE().fit_resample(X_train, y_train)
Pipelines
- Misc
Helps by creating more maintainable and clearly written code. Reduces mistakes by transforming train/test sets automatically.
Pipeline objects are estimators and can be serialized and saved like any other estimator
import joblib

# saving the pipeline into a binary file
joblib.dump(pipe, 'wine_pipeline.bin')

# loading the saved pipeline from a binary file
pipe = joblib.load('wine_pipeline.bin')
Basic Feature Transforming and Model Fitting Pipeline
format: Pipeline(steps = [(‘<step1_name>’, function), (‘<step2_name>’, function)])
Example
np.random.seed(2019)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# creating the pipeline with its different steps
pipe = Pipeline([
    ('feature_selection', VarianceThreshold(threshold=0.1)),
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier())
])

# fitting the pipeline with data
pipe.fit(X_train, y_train)

# making predictions
predictions = pipe.predict(X_test)

# checking the accuracy
# sklearn function; multi-class: accuracy, binary: jaccard index (similarity)
accuracy_score(y_test, predictions)
Pipeline transforms according to the sequence of the steps inserted into the list of tuples
Column Transformers
Example: transform by column type
# creating pipeline for numerical features
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler())
])

# creating pipeline for categorical features
categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('one_hot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('numerical', numerical_pipe, make_column_selector(dtype_include=['int', 'float'])),
    ('categorical', categorical_pipe, make_column_selector(dtype_include=['object', 'category']))
])

pipe = Pipeline([
    ('column_transformer', preprocessor),
    ('model', KNeighborsClassifier())
])

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
ColumnTransformer takes a list of tuples: name, transformer, columns
- n_jobs and verbose args are also available
- remainder="passthrough" keeps all other columns not listed, passing them through untransformed (the default, remainder="drop", drops them instead)
- Can also provide a transformer object
- methods: fit_transform, get_feature_names_out, get_params, etc
make_column_selector allows you to select the type of column to apply the transformer to (docs)
Example: Apply sequence of transformations to a column
col_transformer = ColumnTransformer(
    [
        ("PriceTransformerPipeline",
         # Pipeline -> multiple transformation steps
         Pipeline([
             ("MeanImputer", SimpleImputer(strategy="median")),
             ("LogTransformation", FunctionTransformer(lambda value: np.log1p(value))),
             ("StdScaler", StandardScaler())
         ]),
         ["price"])
    ],
    remainder=SimpleImputer(strategy="constant", fill_value=-1)
)
- For the "price" column, it replaces Nulls with the median, then log transforms, then standardizes. All other columns get their Nulls replaced with -1.
Function Transformers
Misc
- If using a method in the sklearn.preprocessing module, then you're able to use the fit, transform, and fit_transform methods (I think)
- Data must be the first argument of the function
- The inverse_func argument for FunctionTransformer allows you to include a back-transform function
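A minimal sketch of pairing func with inverse_func for a reversible transform (assumes a numeric array or DataFrame X exists):
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)
logged = log_transformer.transform(X)                  # forward transform
original = log_transformer.inverse_transform(logged)   # back-transform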
Steps
- Create a function that transforms the data
- Create a FunctionTransformer object using the function as the argument
- Add the function-transformer to the pipeline by including it as an argument to make_pipeline
Example: Make numerics 32-bit instead of 64-bit to save memory
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def reduce_memory(X: pd.DataFrame, y=None):
    """Simple function to reduce memory usage by casting numeric columns to float32."""
    num_cols = X.select_dtypes(include=np.number).columns
    for col in num_cols:
        X[col] = X[col].astype("float32")
    return X

ReduceMemoryTransformer = FunctionTransformer(func=reduce_memory)

# Plug into a pipeline
>>> make_pipeline(SimpleImputer(), ReduceMemoryTransformer)
- Data goes through the SimpleImputer first then the ReduceMemoryTransformer
Custom Transformer Classes
Misc
- For more complex transforming tasks
Steps
- Create a class that inherits from the BaseEstimator and TransformerMixin classes of sklearn.base
- Inheriting from these classes allows Sklearn pipelines to recognize our classes as custom estimators and automatically adds fit_transform to your class
- Add transforming methods to the class
Class Skeleton
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self):
        pass

    def transform(self):
        pass

    def inverse_transform(self):
        pass
Example: Log transforming outcome variable
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import PowerTransformer

class CustomLogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self._estimator = PowerTransformer()  # init a transformer

    def fit(self, X, y=None):
        X_copy = np.copy(X) + 1  # add one in case of zeroes
        self._estimator.fit(X_copy)
        return self

    def transform(self, X):
        X_copy = np.copy(X) + 1
        return self._estimator.transform(X_copy)  # perform scaling

    def inverse_transform(self, X):
        X_reversed = self._estimator.inverse_transform(np.copy(X))
        return X_reversed - 1  # subtract 1 after the inverse transform

reg_lgbm = lgbm.LGBMRegressor()
final_estimator = TransformedTargetRegressor(
    regressor=reg_lgbm, transformer=CustomLogTransformer()
)
final_estimator.fit(X_train, y_train)
fit returns the transformer itself since it returns self
- Estimates the optimal parameter lambda for each feature
transform returns transformed features
- Applies the power transform to each feature using the estimated lambdas in fit
fit_transform does both at once
custom_log.fit(tps_df)
transformed_tps = custom_log.transform(tps_df)

# or
transformed_tps = custom_log.fit_transform(tps_df)
inverse_transform returns the back-transformed features
TransformedTargetRegressor transforms the target y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. It takes as arguments the regressor that will be used for prediction and the transformer that will be applied to the target variable.
- The regressor parameter accepts both regressors and pipelines that end with regressors
- If the transformer is a function, like np.log, you can pass it to the func argument
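A minimal sketch of the function route, assuming X_train, y_train, and X_test exist (Ridge and the log1p/expm1 pair are just illustrative choices):
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

ttr = TransformedTargetRegressor(regressor=Ridge(), func=np.log1p, inverse_func=np.expm1)
ttr.fit(X_train, y_train)
preds = ttr.predict(X_test)  # predictions come back on the original scale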
Example: Dummy transformer + FeatureUnion (article)
The classes (below) that inherit from this class get its methods along with fit_transform
class DummyTransformer(BaseEstimator, TransformerMixin):
    """
    Dummy class that allows us to modify only the methods that interest us,
    avoiding redundancy.
    """
    def __init__(self):
        return None

    def fit(self, X=None, y=None):
        return self

    def transform(self, X=None):
        return self
Transformer classes
class Preprocessor(DummyTransformer):
    """
    Class used to preprocess text
    """
    def __init__(self, remove_stopwords: bool):
        self.remove_stopwords = remove_stopwords
        return None

    def transform(self, X=None):
        preprocessed = X.apply(lambda x: preprocess_text(x, self.remove_stopwords)).values
        return preprocessed


class SentimentAnalysis(DummyTransformer):
    """
    Class used to generate sentiment
    """
    def transform(self, X=None):
        sentiment = X.apply(lambda x: get_sentiment(x)).values
        return sentiment.reshape(-1, 1)  # <-- note the reshape to a 2-D column
Each class inherits the dummy transformer and its methods
preprocess_text and get_sentiment are user functions that are defined earlier in the article
Create pipeline
vectorization_pipeline = Pipeline(steps=[
    ('preprocess', Preprocessor(remove_stopwords=True)),  # the first step is to preprocess the text
    ('tfidf_vectorization', TfidfVectorizer()),           # the second step applies vectorization on the preprocessed text
    ('arr', FromSparseToArray())                          # the third step converts a sparse matrix into a numpy array in order to show it in a dataframe
])
TfidfVectorizer and FromSparseToArray are other classes in the article that I didn't include in the Transformer classes chunk to save space
Combine transformed features
# vectorization_pipeline is a pipeline within a pipeline
features = [
    ('vectorization', vectorization_pipeline),
    ('sentiment', SentimentAnalysis()),
    ('n_chars', NChars()),
    ('n_sentences', NSentences())
]
combined = FeatureUnion(features)  # this is where we merge the features together

# Get col names: subsets the second step of the second object in the vectorization_pipeline to retrieve
# the terms generated by the tf-idf then adds the other three column names to it
cols = vectorization_pipeline.steps[1][1].get_feature_names() + ["sentiment", "n_chars", "n_sentences"]
features_df = pd.DataFrame(combined.transform(df['corpus']), columns=cols)
- Pipelines are combined with FeatureUnion, features are transformed, and coerced into a pandas df which can be used to train a model
- NChars and NSentences are other classes in the article that I didn’t include in the Transformer classes chunk to save space
Tuning a Pipeline
Example
parameters = {
    'column_transformer__numerical__imputer__strategy': ['mean', 'median'],
    'column_transformer__numerical__scaler': [StandardScaler(), MinMaxScaler()],
    'model__n_neighbors': [3, 6, 10, 15],
    'model__weights': ['uniform', 'distance'],
    'model__leaf_size': [30, 40]
}

# defining a scorer and a GridSearchCV instance
my_scorer = make_scorer(accuracy_score, greater_is_better=True)
search = GridSearchCV(pipe, parameters, cv=3, scoring=my_scorer, n_jobs=-1, verbose=1)

# search for the best hyperparameter combination within our defined hyperparameter space
search.fit(X_train, y_train)

# changing pipeline parameters to gridsearch results and refitting
pipe.set_params(**search.best_params_)
pipe.fit(X_train, y_train)

# making predictions
predictions = pipe.predict(X_test)

# checking accuracy
accuracy_score(y_test, predictions)
See Column Transformer section example for details on this pipeline
Note the double underscore used in the keys of the parameter dict
- Double underscores separate the names of steps inside a nested pipeline, with the last name being the argument of the transformer function being tuned
- Example: 'column_transformer__numerical__imputer__strategy'
  - column_transformer (pipeline step name) >> numerical (pipeline step name) >> imputer (pipeline step name) >> strategy (arg of the SimpleImputer function)
View tuning results
best_params = search.best_params_
print(best_params)

# Stores the optimum model in best_pipe
best_pipe = search.best_estimator_
print(best_pipe)

result_df = pd.DataFrame.from_dict(search.cv_results_, orient='columns')
print(result_df.columns)
Display Pipelines in Jupyter
from sklearn import set_config

set_config(display="diagram")
giant_pipeline
Algorithms
Misc
Histogram-based Gradient Boosting Regression Tree
- Histogram-based models are more efficient since they bin the continuous features
- Inspired by LightGBM. Much faster than GradientBoostingRegressor for big datasets (n_samples >= 10 000).
sklearn.ensemble.HistGradientBoostingRegressor
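A minimal sketch (assumes X_train, y_train, X_test already exist; the hyperparameter values are just illustrative):
from sklearn.ensemble import HistGradientBoostingRegressor

hgb = HistGradientBoostingRegressor(max_iter=200, learning_rate=0.1, random_state=0)
hgb.fit(X_train, y_train)   # bins continuous features into histograms; handles NaNs natively
preds = hgb.predict(X_test)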
Stochastic Gradient Descent (SGD)
- algorithm
- Not a class of models, merely an optimization technique
SGDClassifier(loss='log') results in logistic regression, i.e. a model equivalent to LogisticRegression which is fitted via SGD instead of being fitted by one of the other solvers in LogisticRegression.
SGDRegressor(loss='squared_error', penalty='l2') and Ridge solve the same optimization problem, via different means.
- Penalized regression hyperparameters are labelled differently than in R
  - lambda (R) is alpha (py)
  - alpha (R) is l1_ratio (py)
- Can be successfully applied to large-scale and sparse machine learning problems often encountered in text classification and natural language processing
- Advantages:
  - Efficiency
  - Ease of implementation (lots of opportunities for code tuning)
- Disadvantages:
  - SGD requires a number of hyperparameters such as the regularization parameter and the number of iterations
  - SGD is sensitive to feature scaling
- Processing
- Make sure you permute (shuffle) your training data before fitting the model or use shuffle=True to shuffle after each iteration (used by default).
- Features should be standardized using e.g. make_pipeline(StandardScaler(), SGDClassifier())
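A minimal sketch, assuming X_train and y_train exist (note: newer sklearn versions spell the logistic loss 'log_loss'; older ones use 'log'):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

# scale first, since SGD is sensitive to feature scaling
clf = make_pipeline(StandardScaler(), SGDClassifier(loss='log_loss', penalty='l2', alpha=1e-4))
clf.fit(X_train, y_train)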
BisectingKMeans
- Centroids are picked progressively (instead of simultaneously) based on the previous clusters: a cluster is split each time until the number of clusters, K, is reached
- Advantages
  - More efficient with a large number of clusters
  - Cheaper computational costs
  - It does not produce empty clusters
  - The clustering result is well ordered and creates a visible hierarchy
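A minimal usage sketch (assumes a feature matrix X; BisectingKMeans requires scikit-learn >= 1.1):
from sklearn.cluster import BisectingKMeans

bkm = BisectingKMeans(n_clusters=4, random_state=0).fit(X)
labels = bkm.labels_             # cluster assignments
centers = bkm.cluster_centers_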
XGBoost with GPU
gbm = xgb.XGBClassifier(
    n_estimators=100000,
    max_depth=6,
    objective="binary:logistic",
    learning_rate=.1,
    subsample=1,
    scale_pos_weight=99,
    min_child_weight=1,
    colsample_bytree=1,
    tree_method='gpu_hist',
    use_label_encoder=False
)

eval_set = [(X_train, y_train), (X_val, y_val)]
fit_model = gbm.fit(
    X_train, y_train,
    eval_set=eval_set,
    eval_metric='auc',
    early_stopping_rounds=20,
    verbose=True
)
tree_method='gpu_hist' specifies the use of the GPU
Plot importance
fig, ax2 = plt.subplots(figsize=(8, 11))
xgb.plot_importance(gbm, importance_type='gain', ax=ax2)
GBM with Quantile Loss (PIs)
gbm_lower = GradientBoostingRegressor(
    loss="quantile", alpha=alpha/2, random_state=0
)
gbm_upper = GradientBoostingRegressor(
    loss="quantile", alpha=1 - alpha/2, random_state=0
)

gbm_lower.fit(X_train, y_train)
gbm_upper.fit(X_train, y_train)

test['gbm_lower'] = gbm_lower.predict(X_test)
test['gbm_upper'] = gbm_upper.predict(X_test)
CV/Splits
- KFold
from sklearn.model_selection import KFold
cv = KFold(n_splits=7, shuffle=True)
- StratifiedKFold
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=1121218)
- For classification, class ratios are held to the same ratios in both the training and test sets
- Class ratios are preserved across all folds and iterations
LeavePOut:
from sklearn.model_selection import LeaveOneOut, LeavePOut
- Data is so limited that you have to perform a CV where you set aside only a few rows of data in each iteration
LeaveOneOut is P = 1 for LeavePOut
ShuffleSplit, StratifiedShuffleSplit
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

cv = ShuffleSplit(n_splits=7, train_size=0.75, test_size=0.25)
cv = StratifiedShuffleSplit(n_splits=7, test_size=0.5)
- Not a CV, just repeats the train/test split process multiple times
- Using different random seeds should resemble a robust CV process if done for enough iterations
- TimeSeriesSplit
from sklearn.model_selection import TimeSeriesSplit
cv = TimeSeriesSplit(n_splits=7)
- With time series data, the ordering of samples matters.
Group Data
- Data is not iid (e.g. multi-level data)
- Options
- Works just like the non-group methods but with a groups arg for the grouping variable
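A minimal sketch of a group-aware splitter (assumes X, y, an estimator model, and a groups array such as subject IDs; GroupKFold is one of several group-aware options):
from sklearn.model_selection import GroupKFold, cross_val_score

cv = GroupKFold(n_splits=5)
# all rows from a given group end up in the same fold
scores = cross_val_score(model, X, y, groups=groups, cv=cv)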