In the previous post I explained that, after a certain point, improving the efficiency of the data loaders or increasing their number has only a marginal impact on deep learning training speed.
Today I will explain a method that can further speed up your training, provided you have already achieved sufficient data loading efficiency.
Comparison of training pipelines with and without a prefetcher.

Typical training pipeline

In a typical deep learning pipeline, the batch data must be copied from CPU to GPU before the model can be trained on that batch.
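To make the cost of this serial load-then-train pattern concrete, here is a toy simulation of such a loop. The timings are stand-ins I chose to mirror the 20:1 ratio discussed below (loading takes 0.02 s per batch, the training step 0.001 s), not measurements from any real model:

```python
import time

def load_batch():
    """Simulated data loading + CPU-to-GPU copy (stand-in timing)."""
    time.sleep(0.02)

def train_step():
    """Simulated forward + backward pass (stand-in timing)."""
    time.sleep(0.001)

start = time.perf_counter()
for _ in range(10):
    load_batch()   # the GPU sits idle during this call
    train_step()
elapsed = time.perf_counter() - start

# Total time is roughly 10 * (0.02 + 0.001) seconds: loading dominates,
# and the compute is almost free by comparison.
print(f"{elapsed:.3f}s for 10 batches")
```

Because the load and the training step run strictly one after the other, the total time is the sum of both, which is exactly the waste a prefetcher removes.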
Have you ever wondered why some of your training scripts halt every n batches, where n is the number of loader processes? This likely means your pipeline is bottlenecked by data loading time, as shown in the following animation:
Training bottlenecked by data loading time. In the animation above, the mean loading time for each batch is 2 seconds and there are 7 loader processes, but the forward+backward pass for each batch takes only 100 ms. The workers therefore deliver batches in bursts: every 2 seconds, 7 batches arrive at once, the GPU consumes them in about 0.7 seconds, and then it stalls until the next burst.
Screenshot of the web app.

My UCI class is full is a web application that helped 1400+ students enroll in 5300+ courses from Fall 2016 to Fall 2019. The app sent 52K+ alerts to students when a class without a waitlist had a spot open for them to enroll in.
This app is discontinued as of December 2019.
Developing the first version

At HackUCI 2015, Yang Jiao and I developed an app to help students enroll in classes without a waitlist.