AUTHOR=Hawks Benjamin, Duarte Javier, Fraser Nicholas J., Pappalardo Alessandro, Tran Nhan, Umuroglu Yaman TITLE=Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference JOURNAL=Frontiers in Artificial Intelligence VOLUME=4 YEAR=2021 URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2021.676564 DOI=10.3389/frai.2021.676564 ISSN=2624-8212 ABSTRACT=
Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, which removes insignificant synapses, and quantization, which reduces the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra-low-latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term