In the previous post I explained that, after a certain point, improving the efficiency of the data loaders or increasing their number has only a marginal impact on the training speed of deep learning models.
Today I will explain a method that can further speed up your training, provided that you have already achieved sufficient data loading efficiency.
Comparison of training pipelines with and without prefetcher.

Typical training pipeline

In a typical deep learning pipeline, one must load the batch data from CPU to GPU before the model can be trained on that batch.
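As a rough illustration of such a pipeline, here is a minimal sketch that assumes PyTorch (the post does not name a framework). The toy dataset, the linear model and the `train_one_epoch` helper are all made up for this example. The point to notice is that the host-to-device copy sits on the critical path: the forward pass cannot start until the copy has finished.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, optimizer, loss_fn, device):
    """Plain training loop: load a batch, copy it to the device, then train."""
    model.train()
    steps = 0
    for inputs, targets in loader:
        # The copy to the device happens here, right before the batch is used.
        inputs = inputs.to(device)
        targets = targets.to(device)

        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        steps += 1
    return steps


# Falls back to CPU when no GPU is present, so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy data: 64 samples, 8 features, 2 classes.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
# num_workers=0 keeps the example single-process; real pipelines use more.
loader = DataLoader(dataset, batch_size=16, num_workers=0)

model = nn.Linear(8, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
steps = train_one_epoch(model, loader, optimizer, nn.CrossEntropyLoss(), device)
print(steps)  # 64 samples / batch size 16 = 4 steps
```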
Have you wondered why some of your training scripts halt every n batches where n is the number of loader processes? This likely means your pipeline is bottlenecked by data loading time, as shown in the following animation:
Training is bottlenecked by data loading time.

In the animation above, the mean loading time for each batch is 2 seconds and there are 7 loader processes, but the forward and backward pass for each batch takes only 100 ms.
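The arithmetic behind this pattern can be checked with a small simulation. The sketch below is a simplified model, not a measurement: it assumes every batch takes exactly the mean loading time, that loader processes never sit idle, and that batches are consumed in order. The function name `simulate_stalls` is made up for this example.

```python
def simulate_stalls(num_workers, load_ms, step_ms, num_batches):
    """Return how long (in ms) the trainer waits before each batch."""
    waits = []
    trainer_free_at = 0
    for k in range(num_batches):
        # Batch k is the (k // num_workers + 1)-th batch of worker k % num_workers,
        # so it becomes available after that many full loading periods.
        ready_at = (k // num_workers + 1) * load_ms
        start = max(ready_at, trainer_free_at)
        waits.append(start - trainer_free_at)
        trainer_free_at = start + step_ms
    return waits


waits = simulate_stalls(num_workers=7, load_ms=2000, step_ms=100, num_batches=21)
stalled = [k for k, w in enumerate(waits) if w > 0]
print(stalled)    # [0, 7, 14]: a pause every 7 batches
print(waits[7])   # 1300 ms of waiting before batch 7

# In steady state the trainer is busy 7 * 100 ms out of every 2000 ms.
utilization = 7 * 100 / 2000
print(utilization)  # 0.35
```

Under these assumptions the trainer consumes 7 batches in 700 ms and then waits 1300 ms for the next round, so it is busy only about 35% of the time.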