AUTHOR=Georganas Evangelos, Kalamkar Dhiraj, Avancha Sasikanth, Adelman Menachem, Aggarwal Deepti, Anderson Cristina, Breuer Alexander, Bruestle Jeremy, Chaudhary Narendra, Kundu Abhisek, Kutnick Denise, Laub Frank, Md Vasimuddin, Misra Sanchit, Mohanty Ramanarayan, Pabst Hans, Retford Brian, Ziv Barukh, Heinecke Alexander

TITLE=Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning and HPC Workloads

JOURNAL=Frontiers in Applied Mathematics and Statistics

VOLUME=Volume 8 - 2022

YEAR=2022

URL=https://www.frontiersin.org/journals/applied-mathematics-and-statistics/articles/10.3389/fams.2022.826269

DOI=10.3389/fams.2022.826269

ISSN=2297-4687

ABSTRACT=<p>During the past decade, novel Deep Learning (DL) algorithms, workloads and hardware have been developed to tackle a wide range of problems. Despite the advances in workload and hardware ecosystems, the programming methodology of DL systems is stagnant. DL workloads either leverage highly optimized, yet platform-specific and inflexible kernels from DL libraries or, in the case of novel operators, rely on reference implementations built <italic>via</italic> DL framework primitives with underwhelming performance. This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of DL workloads with high productivity. TPPs define a compact, yet versatile set of 2D-tensor operators [or a virtual Tensor Instruction Set Architecture (ISA)], which can subsequently be used as building blocks to construct complex operators on high-dimensional tensors. The TPP specification is platform-agnostic, so code expressed <italic>via</italic> TPPs is portable, whereas the TPP implementation is highly optimized and platform-specific. We demonstrate the efficacy and viability of our approach using standalone kernels and end-to-end DL &amp; High Performance Computing (HPC) workloads expressed entirely <italic>via</italic> TPPs that outperform state-of-the-art implementations on multiple platforms.</p>