This advertorial is sponsored by Intel®
Most commercial deep learning applications today use 32-bits of floating point precision (ƒp32) for training and inference workloads. Various researchers have demonstrated that both deep learning training and inference can be performed with lower numerical precision, using 16-bit multipliers for training and 8-bit multipliers or fewer for inference with minimal to no loss in accuracy (higher precision – 16-bits vs. 8-bits – is usually needed during training to accurately represent the gradients during the backpropagation phase). Using these lower numerical precisions (training with 16-bit multipliers accumulated to 32-bits or more and inference with 8-bit multipliers accumulated to 32-bits) will likely become the standard over the next year, in particular for convolutional neural networks (CNNs).
There are two main benefits of lower precision. First, many operations are memory bandwidth bound, and reducing precision would allow for better usage of cache and reduction of bandwidth bottlenecks. Thus, data can be moved faster through the memory hierarchy to maximize compute resources. Second, the hardware may enable higher operations per second (OPS) at lower precision as these multipliers require less silicon area and power.
In this article, we review the history of lower numerical precision training and inference and describe how …