Training

Misc

Packages
- {autokeras} - Leverages Neural Architecture Search (NAS) techniques behind the scenes. It uses a trial-and-error approach powered by Keras Tuner under the hood to test different configurations. Once a good candidate is found, it trains it to convergence and evaluates.
- {keras-tuner} - Focused on hyperparameter optimization. You define the search space (e.g., number of layers, number of units, learning rates), and it uses optimization algorithms (random search, Bayesian optimization, Hyperband) to find the best configuration.
Resources
- Automating Deep Learning: A Gentle Introduction to AutoKeras and Keras Tuner
  - Basic intro to {autokeras} and {keras-tuner}
Use early stopping and set the number of iterations arbitrarily high. This removes the chance of terminating learning too early.
Strategies
- Includes Active Learning, Transfer Learning, Few-Shot Learning, Weakly-Supervised Learning, Self-Supervised Learning, Semi-Supervised Learning
Common Variations of Gradient Descent
- Batch Gradient Descent: Use the entire training dataset to compute the gradient of the loss function. This is the most robust but not computationally feasible for large datasets.
- Stochastic Gradient Descent: Use a single data point to compute the gradient of the loss function. This method is the quickest but the estimate can be noisy and the convergence path slow.
- Mini-Batch Gradient Descent: Use a subset of the training dataset to compute the gradient of the loss function. The size of the batches varies and depends on the size of the dataset. This is the best of both worlds of both batch and stochastic gradient descent.
  - Use the largest batch sizes possible that can fit inside your computer’s GPU RAM as they parallelize the computation
GPU methods
Data Storage
- Data samples are iteratively loaded from a storage location and fed into the DL model
  - The speed of each training step and, by extension, the overall time to model convergence, is directly impacted by the speed at which the data samples can be loaded from storage
- Which metrics are important depends on the application
  - Time to first sample might be important in a video application
  - Average sequential read time and Total processing time would be important for DL models
- Factors
  - Data Location: The location of the data and its distance from the training machine can impact the latency of the data loading.
  - Bandwidth: The bandwidth on the channel of communication between the storage location and the training machine will determine the maximum speed at which data can be pulled.
  - Sample Size: The size of each individual data sample will impact the number of overall bytes that need to be transported.
  - Compression: Note that while compressing your data will reduce the size of the sample data, it will also add a decompression step on the training instance.
  - Data Format: The format in which the samples are stored can have a direct impact on the overhead of data loading.
  - File Sizes: The sizes of the files that make up the dataset can impact the number of pulls from the data storage. The sizes can be controlled by the number of samples that are grouped together into files.
  - Software Stack: Software utilities exhibit different performance behaviors when pulling data from storage. Among other factors, these behaviors are determined by how efficient the system resources are utilized.
- Metrics
  - Time to First Sample — How much time does it take until the first sample in the file is read?
  - Average Sequential Read Time — What is the average read time per sample when iterating sequentially over all of the samples?
  - Total Processing Time — What is the total processing time of the entire data file?
  - Average Random Read Time — What is the average read time when reading samples at arbitrary offsets?
- Reading from S3 (article)