How Much Is Your Time Worth? A Story of Using the Right Hardware to Significantly Decrease Your Time to Results for Deep Learning Training
Data science, at its core, is about time. Time to increased sales, time to finding cures for diseases, time to understanding. Thus, it is critical to think about what hardware you are using for deep learning training, which is by far the most computationally intensive part of artificial intelligence and of data science in general. A 10% reduction in training time can make a huge difference in employee productivity and time to results. How about a 25% decrease in training time? Let me show you how to get it!
I recently read a Medium article by Thilina Rajapakse comparing 7 different BERT models. BERT is a state-of-the-art language model for natural language processing (NLP). In his article, he compared the models in terms of final accuracy and training time. His work was done on a custom-built, AMD-processor-based workstation with an Nvidia Titan RTX GPU, which is built on the Turing architecture and is a very popular PC/workstation GPU. In my conversations with customers, I often hear that they do their training on a laptop or on a workstation with a configuration similar to the one Thilina used. His article, along with these customer comments, motivated me to determine how much of a difference an enterprise-level server with top-shelf CPUs and GPUs can make in training time for these same models. The work I am presenting here did not focus on how the models work or on comparing accuracy, but rather on training time. Because the models are so different, they allow for a broad, unbiased comparison.
I set up two different enterprise servers in our lab, which I will call Server 1 (S1) and Server 2 (S2), to run the same training that Thilina did. My goal was to compare training times between S1, S2, and Thilina's workstation (WS). Whereas WS had a Titan RTX GPU, S1 and S2 had Nvidia V100 GPUs, Nvidia's most powerful GPU. I expected S1 and S2 to outperform WS, but I did not know by how much. Would the difference be significant? Would it justify the difference in price? Those were the two main questions I was trying to answer. In essence, by running deep learning training on an enterprise server, how much more productive, and thus profitable, could a company be?
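To give a concrete sense of what "running the same training" looks like, here is a minimal sketch of how one epoch of fine-tuning can be timed for each of the models in Table 1 using the Hugging Face transformers Trainer. This is not the actual benchmark code (Thilina's article and my GitHub page have the real setup, which is built on the simpletransformers library); the dataset, batch size, and sequence length below are placeholders.

```python
import time
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# The seven models compared in Table 1.
MODEL_NAMES = [
    "bert-base-cased", "roberta-base", "distilbert-base-uncased",
    "xlnet-base-cased", "distilroberta-base",
    "bert-base-multilingual-cased", "distilbert-base-multilingual-cased",
]

def time_one_epoch(model_name, train_dataset, num_labels=2):
    """Fine-tune `model_name` for one epoch on `train_dataset` (assumed to be
    a datasets.Dataset with 'text' and 'label' columns) and return the
    wall-clock training time in minutes."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    tokenized = train_dataset.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=f"./results/{model_name}",
        num_train_epochs=1,                  # one epoch, as in Table 1
        per_device_train_batch_size=8,       # placeholder batch size
        fp16=torch.cuda.is_available(),      # mixed precision when a GPU is present
    )
    trainer = Trainer(model=model, args=args, train_dataset=tokenized)

    start = time.time()
    trainer.train()
    return (time.time() - start) / 60.0

# Example usage (train_dataset would be the classification data from Thilina's article):
# times = {name: time_one_epoch(name, train_dataset) for name in MODEL_NAMES}
```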
To keep this post concise, only a summary of the results is presented (see Table 1 below). Full details of the training performed and the specs of S1, S2, and WS can be found on my GitHub page. Table 1 lists each model, followed by its training time on S1, S2, and WS. Overall, S1 reduced training time by 13.6% compared to S2 and by 24.7% compared to WS. S2 reduced training time by 16.6% compared to WS. An example from a customer is even more dramatic: a data scientist took a random forest model he had been training on his laptop and trained it on a server very similar to S1. The training time dropped from 152 minutes to 26 minutes, nearly a 6x speedup.
Do these results demonstrate that using an enterprise-level server rather than a workstation or laptop is significantly better? If decreasing your training time by 25% makes you more productive, then yes. Beyond the lower training times, S1 had 8 GPUs, while S2 and WS each had 1 GPU, although S2 can support up to 4 GPUs. (In this testing, only one GPU was utilized.) This means that 8 different people can train concurrently on S1. Imagine 8 data scientists training models at the same time and doing it 25% faster than they could on a workstation or laptop. Not only is the training time decreased, it is decreased for as many as 8 people at once. This is a big deal.
To find out how you can implement an enterprise-level server and have it set up to perform your training, contact me. I love helping companies utilize deep learning to its full potential and make a fundamental shift in their operations.
Table 1: Comparison of Training Times for 7 Different BERT Models on 3 Machines for 1 Epoch
| BERT Model | S1 (min:sec) | S2 (min:sec) | WS (min:sec) | % Decrease in Training Time (S1 vs WS) | % Decrease in Training Time (S2 vs WS) | % Decrease in Training Time (S1 vs S2) |
| --- | --- | --- | --- | --- | --- | --- |
| bert-base-cased | 19:32 | 23:16 | 22:17 | 12.3% | -4.4% | 16.1% |
| roberta-base | 19:40 | 23:23 | 29:59 | 34.4% | 22.0% | 15.9% |
| distilbert-base-uncased | 10:37 | 12:26 | 15:34 | 31.8% | 20.1% | 14.6% |
| xlnet-base-cased | 58:17 | 64:57 | 64:25 | 9.5% | -0.01% | 10.3% |
| distilroberta-base | 11:02 | 12:47 | 15:59 | 31.0% | 20.0% | 13.7% |
| bert-base-multilingual-cased | 20:36 | 23:56 | 24:38 | 16.4% | 28.4% | 13.9% |
| distilbert-base-multilingual-cased | 11:46 | 13:08 | 18:49 | 37.5% | 30.2% | 10.4% |
| Overall % Decrease in Training Time | | | | 24.7% | 16.6% | 13.6% |

Times are the average training time for 1 epoch, reported as min:sec.
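For reference, the "% Decrease in Training Time" columns in Table 1 are plain percentage reductions computed from the min:sec times. The small helper below (illustrative, not part of the original benchmark code) reproduces the first-row S1 vs WS value.

```python
def to_seconds(mm_ss: str) -> int:
    """Convert a 'min:sec' string such as '19:32' to total seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)

def pct_decrease(faster: str, slower: str) -> float:
    """Percent decrease in training time of `faster` relative to `slower`."""
    f, s = to_seconds(faster), to_seconds(slower)
    return 100.0 * (s - f) / s

# Example: bert-base-cased, S1 (19:32) vs WS (22:17) -> 12.3
print(round(pct_decrease("19:32", "22:17"), 1))
```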
For details on the dataset used and the code, check out Thilina's excellent article here.
Follow me on Twitter @pacejohn