- ¹Fermi National Accelerator Laboratory, Batavia, IL, United States
- ²Massachusetts Institute of Technology, Cambridge, MA, United States
- ³Northwestern University, Evanston, IL, United States
Machine learning algorithms are becoming increasingly prevalent and performant in the reconstruction of events in accelerator-based neutrino experiments. These sophisticated algorithms can be computationally expensive. At the same time, the data volumes of such experiments are rapidly increasing. The demand to process billions of neutrino events with many machine learning algorithm inferences creates a computing challenge. We explore a computing model in which heterogeneous computing with GPU coprocessors is made available as a web service. The coprocessors can be efficiently and elastically deployed to provide the right amount of computing for a given processing task. With our approach, Services for Optimized Network Inference on Coprocessors (SONIC), we integrate GPU acceleration specifically for the ProtoDUNE-SP reconstruction chain without disrupting the native computing workflow. With our integrated framework, we accelerate the most time-consuming task, track and particle shower hit identification, by a factor of 17. This results in a factor of 2.7 reduction in the total processing time when compared with CPU-only production. For this particular task, only 1 GPU is required for every 68 CPU threads, providing a cost-effective solution.
Introduction
Fundamental particle physics has pushed the boundaries of computing for decades. As detectors have become more sophisticated and granular, particle beams more intense, and data sets larger, the biggest fundamental physics experiments in the world have been confronted with massive computing challenges.
The Deep Underground Neutrino Experiment (DUNE) (Abi et al., 2020), the future flagship neutrino experiment based at Fermi National Accelerator Laboratory (Fermilab), will conduct a rich program in neutrino and underground physics, including determination of the neutrino mass hierarchy (Qian and Vogel, 2015) and measurements of CP violation (Nunokawa et al., 2008) in neutrino mixing using a long baseline accelerator-based neutrino beam, detection and measurements of atmospheric and solar neutrinos (Capozzi et al., 2019), searches for supernova neutrino bursts (Scholberg, 2012) and neutrinos from other astronomical sources, and searches for physics at the grand unification scale via proton decay (DUNE collaboration and Kudryavtsev, 2016).
The detectors will consist of four modules, of which at least three are planned to be
Due to the number of channels and long readout times of the detectors, the data volume produced by the detectors will be very large: uncompressed continuous readout of a single module will be nearly
In addition to applications in real-time data selection, accelerated ML inference that can scale to process large data volumes will be important for offline reconstruction and selection of neutrino interactions. The individual events are expected to have a size on the order of a few gigabytes, and extended readout events (associated, for example, with supernova burst events) may be significantly larger, up to
In this paper, we focus on the acceleration of the inference of deep ML models as a solution for processing large amounts of data in the ProtoDUNE single-phase apparatus (ProtoDUNE-SP) (DUNE Collaboration et al., 2020) reconstruction workflow. For ProtoDUNE-SP, ML inference is the most computationally intensive part of the full event processing chain and is run repeatedly on hundreds of millions of events. A growing trend to improve computing power has been the development of hardware that is dedicated to accelerating certain kinds of computations. Pairing a specialized coprocessor with a traditional CPU, an arrangement referred to as heterogeneous computing, greatly improves performance. These specialized coprocessors exploit the natural parallelism of such computations to provide higher data throughput. In this study, the coprocessors employed are graphics processing units (GPUs); however, the approach can accommodate multiple types of coprocessors in the same workflow. ML algorithms, and in particular deep neural networks, are a driver of this computing architecture revolution.
For optimal integration of GPUs into the neutrino event processing workflow, we deploy them “as a service.” The specific approach is called Services for Optimized Network Inference on Coprocessors (SONIC) (Krupa et al., 2020; Duarte et al., 2019; Pedro, 2019), which employs a client-server model. The primary processing job, including the clients, runs on the CPU, as is typically done in particle physics, and the ML model inference is performed on a GPU server. This can be contrasted with a more traditional model with a GPU directly connected to each CPU node. The SONIC approach allows a more flexible computing architecture for accelerating particle physics computing workflows, providing the optimal number of heterogeneous computing resources for a given task.
The rest of this paper is organized as follows. We first discuss the related works that motivated and informed this study. In Setup and Methodology, we describe the tasks for ProtoDUNE-SP event processing and the specific reconstruction task for which an ML algorithm has been developed. We detail how the GPU coprocessors are integrated into the neutrino software framework as a service on the client side and how we set up and scale out GPU resources in the cloud. In Results, we present the results, which include single job and batch job multi-CPU/GPU latency and throughput measurements. Finally, in Summary and Outlook, we summarize the study and discuss further applications and future work.
Related Work
Inference as a service was first employed for particle physics in Ref. (Duarte et al., 2019). This initial study utilized custom field programmable gate arrays (FPGAs) manufactured by Intel Altera and provided through the Microsoft Brainwave platform (Caulfield et al., 2016). These FPGAs achieved low-latency, high-throughput inference for large convolutional neural networks such as ResNet-50 (He et al., 2016) using single-image batches. This acceleration of event processing was demonstrated for the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC), using a simplified workflow focused on inference with small batch sizes. Our study with GPUs for neutrino experiments focuses on large batch size inferences. GPUs are used in elements of the simulation of events in the IceCube Neutrino Observatory (Halzen and Klein, 2010); recently, a large-scale cloud burst was deployed for the elements that run on GPUs (Sfiligoi et al., 2020). The ALICE experiment at the LHC is planning to use GPUs for real-time processing and data compression of their Time Projection Chamber subdetector (ALICE Collaboration and Rohr, 2020). The LHCb experiment at the LHC is considering using GPUs for the first level of their trigger system (Aaij et al., 2020).
Modern deep ML algorithms have been embraced by the neutrino reconstruction community because popular computer vision and image processing techniques are highly compatible with the neutrino reconstruction task and the detectors that collect the data. NOνA has applied a custom convolutional neural network (CNN), inspired by GoogLeNet (Szegedy et al., 2015), to the classification of neutrino interactions for their segmented liquid scintillator-based detector (Aurisano et al., 2016). MicroBooNE, which uses a LArTPC detector, has conducted an extensive study of various CNN architectures and demonstrated their effectiveness in classifying and localizing single particles in a single wire plane, classifying neutrino events and localizing neutrino interactions in a single plane, and classifying neutrino events using all three planes (MicroBooNE collaboration et al., 2017). In addition, MicroBooNE has applied a class of CNNs, known as semantic segmentation networks, to 2D images formed from real data acquired from the LArTPC collection plane, in order to classify each pixel as being associated with an electromagnetic (EM) particle, another type of particle, or background (MicroBooNE et al., 2019). DUNE, which will also use LArTPC detectors, has implemented a CNN based on the SE-ResNet (Hu et al., 2018) architecture to classify neutrino interactions in simulated DUNE far detector events (DUNE Collaboration et al., 2006). Lastly, a recent effort has successfully demonstrated an extension of the 2D pixel-level semantic segmentation network from MicroBooNE to three dimensions, using submanifold sparse convolutional networks (Graham and van der Maaten, 2017; Dominé and Terao, 2020).
Setup and Methodology
In this study, we focus on a specific computing workflow, the ProtoDUNE-SP reconstruction chain, to demonstrate the power and flexibility of the SONIC approach. ProtoDUNE-SP, assembled and tested at the CERN Neutrino Platform (the NP04 experiment at CERN) (Pietropaolo, 2017), is designed to act as a test bed and full-scale prototype for the elements of the first far detector module of DUNE. It is currently the largest LArTPC ever constructed and is vital to develop the technology required for DUNE. This includes the reconstruction algorithms that will extract physics objects from the data obtained using LArTPC detectors, as well as the associated computing workflows.
In this section, we will first describe the ProtoDUNE-SP reconstruction workflow and the ML model that is the current computing bottleneck. We will then describe the SONIC approach and how it was integrated into the LArTPC reconstruction software framework. Finally, we will describe how this approach can be scaled up to handle even larger workflows with heterogeneous computing.
ProtoDUNE-SP Reconstruction
The workflow used in this paper is the full offline reconstruction chain for the ProtoDUNE-SP detector, which is a good representative of event reconstruction in present and future accelerator-based neutrino experiments. In each event, ionizing particles pass through the liquid argon, emitting scintillation light that is recorded by photodetectors. The corresponding pulses are reconstructed as optical hits. These hits are grouped into flashes from which various parameters are determined, including time of arrival, spatial characteristics, and number of photoelectrons detected.
After the optical reconstruction stage, the workflow proceeds to the reconstruction of LArTPC wire hits. Figure 1 shows a 6 GeV/c electron event in the ProtoDUNE detector.
FIGURE 1. A 6 GeV/c electron event in the ProtoDUNE detector. The x axis shows the wire number. The y axis shows the time tick in units of 0.5 μs. The color scale represents the charge deposition. DUNE Collaboration et al. (2020a)
FIGURE 2. Architecture of the neural network used by the EmTrackMichelId module in the ProtoDUNE-SP reconstruction chain: the output of a 2D convolutional (Conv2D) layer is flattened and passed to two fully connected layers.
Reconstruction begins by applying a deconvolution procedure to recover the original waveforms by disentangling the effects of electronics and field responses after noise mitigation. The deconvolved waveforms are then used to find and reconstruct wire hits, providing information such as time and collected charge. Once the wire hits have been reconstructed, the 2D information provided by the hits in each plane is combined with that from the other planes in order to reconstruct 3D space points. This information is primarily used to resolve ambiguities caused by the induction wires in one plane wrapping around into another plane.
The disambiguated collection of reconstructed 2D hits is then fed into the next stage, which consists of a modular set of algorithms provided by the Pandora software development kit (Marshall and Thomson, 2013). This stage finds the high-level objects associated with particles, like tracks, showers, and vertices, and assembles them into a hierarchy of parent-daughter nodes that ultimately point back to the candidate neutrino interaction.
The final module in the chain, EmTrackMichelId, is an ML algorithm that classifies reconstructed wire hits as being track-like, shower-like, or Michel electron-like (Michel, 1950). This algorithm begins by constructing
CNN Model for Track Identification
The neural network (Figure 2) employed by the EmTrackMichelId module of the ProtoDUNE-SP reconstruction chain consists of a 2D convolutional layer followed by two fully connected (FC) layers. The convolutional layer takes each of the
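To make the shape flow of this Conv2D-flatten-FC-FC structure concrete, the following is a minimal NumPy-only sketch. The patch size, kernel size, number of feature maps, hidden width, and three output classes below are illustrative assumptions, not the published network dimensions.

```python
import numpy as np

def conv2d_valid(x, k):
    """Single-channel 2D convolution (valid padding, stride 1)."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def emtrack_like_forward(patch, kernels, w1, b1, w2, b2):
    """Conv2D -> flatten -> FC -> FC(softmax), mirroring the
    Conv + two-FC structure described for EmTrackMichelId."""
    feature_maps = [np.maximum(conv2d_valid(patch, k), 0.0) for k in kernels]  # ReLU
    flat = np.concatenate([f.ravel() for f in feature_maps])
    hidden = np.maximum(flat @ w1 + b1, 0.0)
    logits = hidden @ w2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax over track / shower / Michel scores

# Illustrative dimensions (assumptions, not the published values)
rng = np.random.default_rng(0)
P, K, NF, H, NCLASS = 48, 5, 8, 32, 3
patch = rng.normal(size=(P, P))
kernels = [rng.normal(size=(K, K)) for _ in range(NF)]
flat_dim = NF * (P - K + 1) ** 2
w1, b1 = rng.normal(size=(flat_dim, H)), np.zeros(H)
w2, b2 = rng.normal(size=(H, NCLASS)), np.zeros(NCLASS)
probs = emtrack_like_forward(patch, kernels, w1, b1, w2, b2)
```

Each wire hit yields one such patch, so a single event drives tens of thousands of forward passes, which is why this module dominates the CPU budget.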
GPU Inference as a Service With LArSoft
ProtoDUNE-SP reconstruction code is based on the LArSoft C++ software framework (Snider and Petrillo, 2017), which provides a common set of tools shared by many LArTPC-based neutrino experiments. Within this framework, EmTrackMichelId, which is described in ProtoDUNE-SP Reconstruction, is a LArSoft “module” that makes use of the PointIdAlg “algorithm.” EmTrackMichelId passes the wire number and peak time associated with a hit to PointIdAlg, which constructs the patch and performs the inference task to classify it.
In this study, we follow the SONIC approach that is also in development for other particle physics applications. It is a client-server model, in which the coprocessor hardware used to accelerate the neural network inference is separate from the CPU client and accessed as a (web) service. The neural network inputs are sent via TCP/IP network communication to the GPU. In this case, a synchronous, blocking call is used. This means that the thread makes the web service request and then waits for the response from the server side, only proceeding once the server sends back the network output. In ProtoDUNE-SP, the CPU usage of the workflow, described in ProtoDUNE-SP Reconstruction, is dominated by the neural network inference. Therefore, a significant increase in throughput can still be achieved despite including the latency from the remote call while the CPU waits for the remote inference. An asynchronous, non-blocking call would be slightly more efficient, as it would allow the CPU to continue with other work while the remote call was ongoing. However, this would require significant development in LArSoft for applications of task-based multithreading, as described in Ref. (Bocci et al., 2020).
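The synchronous, blocking pattern described above can be sketched as follows; `remote_infer` is a stand-in (an assumption, not the real LArSoft or Triton API) for the gRPC round trip to the server.

```python
import time

def remote_infer(batch):
    """Stand-in for the gRPC round trip to the Triton server; the real
    client serializes the patch batch and blocks on the server's reply."""
    time.sleep(0.01)            # emulate network + inference latency
    return [0] * len(batch)     # dummy per-patch classifications

def process_event(patches, batch_size=4):
    """Blocking pattern: the CPU thread sends one batch of patches and
    waits for the response before constructing the next batch."""
    results = []
    for i in range(0, len(patches), batch_size):
        batch = patches[i:i + batch_size]
        results.extend(remote_infer(batch))  # thread blocks here
    return results

out = process_event(list(range(10)))
```

An asynchronous variant would issue the request and continue preparing the next batch while waiting, which is the task-based multithreading development referred to above.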
The ModelInterface class in LArSoft is used by PointIdAlg to access the underlying ML model that performs the inference. Previously, this inference was performed locally on a CPU using the TfModelInterface subclass of ModelInterface. In this work, the functionality to access the GPU as a service was realized by implementing the C++ client interface, provided by the Nvidia Triton inference server (Nvidia, 2019), in a new ModelInterface subclass called tRTisModelInterface. In this new subclass, the patches constructed by PointIdAlg are put into the proper format and transmitted to the GPU server for processing; the call then blocks until inference results are received from the server. Communication between the client and server is achieved through remote procedure calls based on gRPC (Google, 2018). The desired model interface subclass and its parameters are specified at run time by the user through a FHiCL (Fermilab Hierarchical Configuration Language) configuration file. The code is found in the LArSoft/larrecodnn package (Garren et al., 2020). On the server side, we deploy Nvidia T4 GPUs, which target data center acceleration. Figure 3 illustrates how EmTrackMichelId, PointIdAlg, and ModelInterface interact with each other. It also shows how each of the two available subclasses, TfModelInterface and tRTisModelInterface, accesses the underlying model.
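Selecting the Triton-backed interface at run time might look like the following FHiCL fragment; the parameter names and endpoint below are illustrative assumptions, not the exact keys used in larrecodnn.

```fcl
# Sketch only: parameter names and values are illustrative,
# not the exact larrecodnn FHiCL keys.
physics.producers.emtrackmichelid.PointIdAlg.ModelInterface: "tRTisModelInterface"
physics.producers.emtrackmichelid.PointIdAlg.TritonModelName: "emtrackmichelid"
physics.producers.emtrackmichelid.PointIdAlg.TritonServerURL: "<gke-endpoint>:8001"
```

Because the choice is a configuration parameter, the same reconstruction job can fall back to TfModelInterface (CPU-only inference) without recompilation.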
FIGURE 3. The interaction between EmTrackMichelId, PointIdAlg, and ModelInterface, described in the text, is depicted in the figure above. CPU-only inference is provided by TfModelInterface, while GPU-accelerated inference, via the GPUaaS approach, is provided by tRTisModelInterface.
This flexible approach has several advantages:
• Rather than requiring one coprocessor per CPU with a direct connection over PCIe, many worker nodes can send requests to a single GPU, as depicted in Figure 4. This allows heterogeneous resources to be allocated and re-allocated based on demand and task, providing significant flexibility and potential cost reduction. The CPU-GPU system can be “right-sized” to the task at hand, and with modern server orchestration tools, described in the next section, it can elastically deploy coprocessors.
• There is a reduced dependency on open-source ML frameworks in the experimental code base. Otherwise, the experiment would be required to integrate and support separate C++ APIs for every framework in use.
• In addition to coprocessor resource scaling flexibility, this approach allows the event processing to use multiple types of heterogeneous computing hardware in the same job, making it possible to match the computing hardware to the ML algorithm. The system could, for example, use both FPGA and GPU servers to accelerate different tasks in the same workflow.
FIGURE 4. Depiction of the client-server model using Triton where multiple CPU processes on the client side are accessing the AI model on the server side.
There are also challenges to implementing a computing model that accesses coprocessors as a service. Orchestration of the client-server model can be more complicated, though we find that this is facilitated with modern tools like the Triton inference server and Kubernetes. In Summary and Outlook, we will discuss future plans to demonstrate production at full scale. Networking from client to server incurs additional latency, which may lead to bottlenecks from limited bandwidth. For this particular application, we account for and measure the additional latency from network bandwidth, and it is a small, though non-negligible, contribution to the overall processing.
The Triton software also handles load balancing for servers that provide multiple GPUs, further increasing the flexibility of the server. In addition, the Triton server can host multiple models from various ML frameworks. One particularly powerful feature of the Triton inference server is dynamic batching, which combines multiple requests into optimally-sized batches in order to perform inference as efficiently as possible for the task at hand. This effectively enables simultaneous processing of multiple events without any changes to the experiment software framework, which assumes one event is processed at a time.
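The dynamic batching idea can be illustrated with a toy sketch: requests from independent clients are queued and greedily merged up to a preferred batch size before inference runs on the combined batch. This is a simplification of Triton's scheduler (which also applies a queuing delay), shown only to convey the mechanism.

```python
from collections import deque

def dynamic_batcher(requests, max_batch=8):
    """Toy version of Triton-style dynamic batching: greedily merge
    queued requests (each a list of inputs) into one batch of at most
    `max_batch` inputs before running inference on the batch."""
    q = deque(requests)
    batches = []
    while q:
        batch = []
        while q and len(batch) + len(q[0]) <= max_batch:
            batch.extend(q.popleft())
        if not batch:               # single request larger than max_batch
            batch = q.popleft()
        batches.append(batch)
    return batches

# Four requests of 3 inputs each merge into two batches of 6 (<= 8)
merged = dynamic_batcher([[1] * 3, [2] * 3, [3] * 3, [4] * 3], max_batch=8)
```

Larger merged batches keep the GPU pipeline full, which is why requests from different CPU processes (and hence different events) can be served together without any change to the one-event-at-a-time client code.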
Kubernetes Scale out
We performed tests on many different combinations of computing hardware, which provided a deeper understanding of networking limitations within both Google Cloud and on-premises data centers. Even though the Triton Inference Server does not consume significant CPU power, the number of CPU cores provisioned for the node did have an impact on the maximum ingress bandwidth achieved in the early tests.
To scale the Nvidia T4 GPU throughput flexibly, we deployed a Google Kubernetes Engine (GKE) cluster for server-side workloads. The cluster is deployed in the US-Central data center, which is located in Iowa; this affects the data travel latency. The cluster was configured using a Deployment and ReplicaSet, Kubernetes artifacts for application deployment, management, and control. They hold resource requests, container definitions, persistent volumes, and other information describing the desired state of the containerized infrastructure. Additionally, a load-balancing service was deployed to distribute incoming network traffic among the Pods. We implemented Prometheus-based monitoring, which provided insight into three aspects: system metrics for the underlying virtual machine, Kubernetes metrics on the overall health and state of the cluster, and inference-specific metrics gathered from the Triton Inference Server via a built-in Prometheus publisher. All metrics were visualized through a Grafana instance, also deployed within the same cluster. The setup is depicted in Figure 5.
FIGURE 5. The Google Kubernetes Engine setup which demonstrates how the Local Compute FermiGrid farm communicates with the GPU server and how the server is orchestrated through Kubernetes.
A Pod is a group of one or more containers with shared storage and network, and a specification for how to run the containers. A Pod’s contents are always co-located and co-scheduled, and run in a shared context within Kubernetes Nodes (Kubernetes, 2020). We kept the Pod to Node ratio at 1:1 throughout the studies, with each Pod running an instance of the Triton Inference Server (v20.02-py3) from the Nvidia Docker repository. The Pod hardware requests aim to maximize the use of allocatable virtual CPU (vCPU) and memory and to use all GPUs available to the container.
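A Deployment realizing this 1:1 Pod-to-Node layout might resemble the fragment below; the names, image tag, and resource numbers are illustrative assumptions rather than the exact manifest used in this study.

```yaml
# Sketch only: names, image tag, and resource values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
spec:
  replicas: 1                      # one Pod per GPU node
  selector:
    matchLabels: {app: triton}
  template:
    metadata:
      labels: {app: triton}
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:<tag>
        ports:
        - containerPort: 8001      # gRPC endpoint used by the clients
        resources:
          limits:
            nvidia.com/gpu: 1      # all GPUs attached to the node
```

Scaling the GPU capacity then amounts to changing `replicas` (or the GPUs requested per Pod), with the load-balancing service presenting a single endpoint to the clients.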
In this scenario, it can be naively assumed that a small instance n1-standard-2 with 2 vCPUs,
Given these parameters, we found that the ideal setup for optimizing ingress bandwidth was to provision multiple Pods on 16-vCPU machines with fewer GPUs per Pod. For GPU-intensive tests, we took advantage of having a single point of entry, with Kubernetes balancing the load and provisioning multiple identical Pods behind the scenes, with the total GPU requirement defined as the sum of the GPUs attached to each Pod.
Results
Using the setup described in the previous section to deploy GPUs as a service (GPUaaS) to accelerate machine learning inference, we measure the performance and compare against the default CPU-only workflow in ProtoDUNE-SP.
First, we describe the baseline CPU-only performance. We then measure the server-side performance in different testing configurations, in both standalone and realistic conditions. Finally, we scale up the workflow and make detailed measurements of performance. We also derive a model of how we expect performance to scale and compare it to our measured results.
CPU-Only Baseline
To compare against our heterogeneous computing system, we first measure the throughput of the CPU-only process. The workflow processes events from a Monte Carlo simulation of cosmic ray events in ProtoDUNE-SP, produced with the CORSIKA generator (Heck et al., 1998). The radiological backgrounds, including 39Ar, 42Ar, 222Rn, and 85Kr, are also simulated using the RadioGen module in LArSoft. Each event corresponds to a
TABLE 1. CPU types and distribution for the grid worker nodes used for the “big-batch” clients (see text for more details).
We measure the time it takes for each module in the reconstruction chain to run. We divide the modules into two categories: the non-ML modules and the ML module. The time values are given in Table 2. Of the CPU time in the ML module, we measure that
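The relation between the per-module and overall speedups follows Amdahl's law. As a check, plugging in the numbers quoted elsewhere in this paper (roughly 65% of CPU time in EmTrackMichelId, accelerated by a factor of 17) reproduces the reported overall gain:

```python
def overall_speedup(p, module_speedup):
    """Amdahl's law: p is the fraction of job time spent in the
    accelerated module, module_speedup its acceleration factor."""
    return 1.0 / ((1.0 - p) + p / module_speedup)

# ~65% of CPU time in the ML module, module accelerated by ~17x:
s = overall_speedup(0.65, 17.0)   # ~2.6, consistent with the quoted 2.7x
```

The residual gap to 2.7 comes from rounding in the quoted inputs; the exact fractions are given in Table 2.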
Server-Side Performance
To get a standardized measure of the performance, we first use standard tools for benchmarking the GPU performance. Then we perform a stress test on our GPUaaS instance to understand the server-side performance under high load.
Server Standalone Performance
The baseline performance of the GPU server running the EmTrackMichelId model is measured using the perf_client tool included in the Nvidia Triton inference server package. The tool emulates a simple client by generating requests over a defined time period. It then returns the latency and throughput, repeating the test until the results are stable. We define the baseline performance as the throughput obtained at the saturation point of the model on the GPU. We attain this by increasing the client-side request concurrency—the maximum number of unanswered requests by the client—until the throughput saturates. We find that the model reaches this limit quickly at a client-side concurrency of only 2 requests. At this point, the throughput is determined to be
Saturated Server Stress Test
To understand the behavior of the GPU server performance in a more realistic setup, we set up many simultaneous CPU processes to make inference requests to the GPU. This saturates the GPUs, keeping the pipeline of inference requests as full as possible. We measure several quantities from the GPU server in this scenario. To maximize throughput, we activate the dynamic batching feature of Triton, which allows the server to combine multiple requests together in order to take advantage of the efficient batch processing of the GPU. This requires only one line in the server configuration file.
In this setup, we run 400 simultaneous CPU processes that send requests to the GPU inference server. This is the same compute farm described in CPU-Only Baseline. The jobs are held in an idle state until all jobs are allocated CPU resources and all input files are transferred to local storage on the grid worker nodes, at which point the event processing begins simultaneously. This ensures that the GPU server is handling inference requests from all the CPU processes at the same time. This test uses a batch size of 1693. We monitor the following performance metrics of the GPU server at 10-minute intervals:
• GPU server throughput: for the 4-GPU server, we measure that the server is performing about 122,000 inferences per second for large batch and dynamic batching; this amounts to 31,000 inferences per second per GPU. This is shown in Figure 6 (top left). This is higher than the measurement from the standalone performance client, by a factor of
• GPU processing usage: we monitor how occupied the GPU processing units are. We find that the GPU is
FIGURE 6. Top left: The number of inferences per second processed by the 4-GPU server, which saturates at approximately 126,000. Top right: The GPU usage, which peaks around 60%. Bottom: The number of total batches processed by the 4-GPU server. The incoming batches are sent to the server with size 1693, but are combined up to size 5,358 for optimal performance.
• GPU batch throughput: we measure how many batches of inferences are processed by the server. The batch size sent by the CPU processor is 1693 on average, but dynamic batching prefers to run at a typical batch size of 5,358. This is shown in Figure 6 (bottom).
Scale out Modeling
In the previous section, we discussed in detail the GPU server performance. With that information, we study in detail the performance of the entire system and the overall improvement expected in throughput.
To describe important elements of the as-a-service computing model, we first define some concepts and variables. Many of these elements have been described in previous sections, but we collect them here to present a complete model.
•
• p is the fraction of
•
•
•
•
•
•
•
With each element of the system latency now defined, we can model the performance of SONIC. Initially, we assume blocking modules and zero communication latency. We define p as the fraction of the event which can be accelerated, such that the total time of a CPU-only job is trivially defined as:
We replace the time for the accelerated module with the GPU latency terms:
This reflects the ideal scenario when the GPU is always available for the CPU job. We also include tlatency, which accounts for the preprocessing, bandwidth, and travel time to the GPU. The value of
Here,
Therefore, the total latency is constant when the GPUs are not saturated and increases linearly in the saturated case proportional to
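Since the variable definitions above are truncated in this copy, the qualitative behavior of the model can be sketched as follows. The function and all numerical values are illustrative assumptions, not the paper's fitted parameters: below saturation the accelerated-module time is a constant (GPU latency plus communication latency), while above saturation requests queue and the time grows linearly with the number of clients.

```python
def job_time(n_clients, t_cpu, t_gpu, t_latency, max_clients):
    """Toy scaling model: t_cpu is the unaccelerated (CPU) part of the
    job, t_gpu the GPU inference time, t_latency the preprocessing plus
    network travel overhead, and max_clients the number of simultaneous
    clients the GPU server can serve before saturating."""
    accel = t_gpu + t_latency
    if n_clients > max_clients:
        # Saturated: requests queue, so inference time scales linearly
        # with the ratio of clients to server capacity.
        accel = t_gpu * (n_clients / max_clients) + t_latency
    return t_cpu + accel

# Flat below saturation (e.g. ~270 clients for a 4-GPU server),
# then linear growth above it (all numbers illustrative):
flat = job_time(100, t_cpu=60.0, t_gpu=10.0, t_latency=2.0, max_clients=270)
sat = job_time(540, t_cpu=60.0, t_gpu=10.0, t_latency=2.0, max_clients=270)
```

This flat-then-linear shape is exactly what is observed in the measurements of the next section (Figure 7).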
Measurements Deploying SONIC
To test the performance of the SONIC approach, we use the setup described in the “server stress test” in Server-Side Performance. We vary the number of simultaneous jobs from 1 to 400 CPU processes. To test different computing model configurations, we run the inference with two different batch sizes: 235 (small batch) and 1693 (large batch). This size is specified at run time through a parameter for the EmTrackMichelId module in the FHiCL (The Fermilab Hierarchical Configuration Language) configuration file describing the workflow. With the small batch size, inferences are carried out in approximately 235 batches per event. Increasing the batch size to 1693 reduces the number of inference calls sent to the Triton server to 32 batches per event, which decreases the travel latency. We also test the performance impact of enabling or disabling dynamic batching on the server.
In Figure 7 (left), we show the performance results for the latency of the EmTrackMichelId module for small batch size vs. large batch size, with dynamic batching turned off. The most important performance feature is the basic trend. The processing time is flat as a function of the number of simultaneous CPU processes up to 190 (270) processes for small (large) batch size. After that, the processing time begins to grow, as the GPU server becomes saturated and additional latency is incurred while service requests are queued. For example, in the large batch case, the performance of the EmTrackMichelId module is constant whether there are 1 or 270 simultaneous CPU processes making requests to the server. Therefore, using fewer than 270 simultaneous CPU processes for the 4-GPU server is an inefficient use of the GPU resources, and we find that the optimal ratio of CPU processes to a single GPU is 68:1.
FIGURE 7. Processing time for the EmTrackMichelId module as a function of simultaneous CPU processes, using a Google Kubernetes 4-GPU cluster. Left: small batch size vs. large batch size, with dynamic batching turned off. Right: large batch size performance with dynamic batching turned on and off. In both plots, the dotted lines indicate the predictions of the latency model, specifically Eq. (4).
As described in Scale out Modeling,
In Figure 7 (right), we show the performance of the SONIC approach for large batch size with dynamic batching enabled or disabled, considering up to 400 simultaneous CPU processes. We find that, for our particular model, the large batch size of 1693 is already optimal: the performance is the same with or without dynamic batching. We also find that the latency model for large batch size matches the data well.
We stress that, until the GPU server is saturated, the EmTrackMichelId module now takes about
TABLE 3. A comparison of results in Table 2 with results using GPUaaS.
Finally, it is important to note that throughout our studies using commercially available cloud computing, we have observed that there are variations in the GPU performance. This could result from a number of factors beyond our control, related to how CPU and GPU resources are allocated and configured in the cloud. Often, these factors are not even exposed to the users and therefore difficult to monitor. That said, the GPU performance, i.e. the number of inferences per second, is a non-dominant contribution to the total job latency. Volatility in the GPU throughput primarily affects the ratio of CPU processes to GPUs. We observe variations at the 30%–40% level, and in this study, we generally present conservative performance numbers.
Summary and Outlook
In this study, we demonstrate for the first time the power of accessing GPUs as a service with the Services for Optimized Network Inference on Coprocessors (SONIC) approach to accelerate computing for neutrino experiments. We integrate the Nvidia Triton inference server into the LArSoft software framework, which is used for event processing and reconstruction in liquid argon neutrino experiments. We explore the specific example of the ProtoDUNE-SP reconstruction workflow. The reconstruction processing time is dominated by the EmTrackMichelId module, which runs neural network inference for a fairly traditional convolutional neural network algorithm over thousands of patches of the ProtoDUNE-SP detector. In the standard CPU-only workflow, the module consumes 65% of the overall CPU processing time.
We explore the SONIC approach, which abstracts the neural network inference as a web service. A 4-GPU server is deployed using the Nvidia Triton inference server, which includes powerful features such as load balancing and dynamic batching. The inference server is orchestrated using Google Cloud Platform’s Kubernetes Engine. The SONIC approach provides flexibility in dynamically scaling the GPUaaS to match the inference requests from the CPUs, right-sizing the heterogeneous resources for optimal usage of computing. It also provides flexibility in dealing with different machine learning (ML) software frameworks and tool flows, which are constantly improving and changing, as well as flexibility in the heterogeneous computing hardware itself, such that different GPUs, FPGAs, or other coprocessors could be deployed together to accelerate neural network algorithms. In this setup, the EmTrackMichelId module is accelerated by a factor of 17, and the total event processing time goes from
With these promising results, there are a number of interesting directions for further studies.
• Integration into full-scale production: A natural next step is to deploy this workflow at full scale, moving from 400 simultaneous CPU processes up to 1000–2000. While this should be fairly straightforward, running multiple production campaigns will pose additional operational challenges. For example, the ability to instantiate the server on demand from the client side would be preferable, and the GPU resources should scale in an automated way when they become saturated. There are also operational challenges in ensuring that the right model is being served and that server-side metadata is preserved automatically.
• Server platforms: Related to the point above, physics experiments would ultimately prefer to run the servers without relying on the cloud, instead using local servers in lab and university data centers. Preliminary tests have been conducted with a single GPU server at the Fermilab Feynman Computing Center. Larger-scale tests are necessary, including the use of cluster orchestration platforms. Finally, a similar setup should be explored at high performance computing (HPC) centers, where a large amount of GPU resources may be available.
• Further GPU optimization: Thus far, the studies have not explored significant optimization of the actual GPU operations. In this paper, a standard 32-bit floating point implementation of the model was loaded in the Triton inference server. A simple extension would be to try model optimization using 8-bit or 16-bit operations. This would further improve the GPU performance and thereby increase the optimal CPU-to-GPU ratio. More involved training-side optimizations might yield similar physics performance at a reduced computational cost. For example, quantization-aware training tools such as QKeras (Coelho et al., 2020) and Brevitas (Pappalardo et al., 2020) could maintain performance at reduced precision better than simple post-training quantization.
• More types of heterogeneous hardware: In this study, we have deployed GPUs as a service, while in other studies, FPGAs and ASICs as a service were also explored. For this particular model and use case, with large batch sizes, GPUs already perform very well. However, the inference for other ML algorithms may be more efficient on other types of heterogeneous computing hardware. It is therefore important to study our workflow on other platforms and types of hardware.
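To make the reduced-precision point above concrete, here is a minimal NumPy sketch of simple symmetric post-training quantization of a weight tensor to 8 bits. This is an illustration of the general idea only, not the paper's workflow; tools such as QKeras and Brevitas instead quantize during training to better preserve physics performance:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric, per-tensor post-training quantization of float32 weights to int8."""
    scale = float(np.abs(w).max()) / 127.0   # one quantization step per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to float32 for comparison with the original weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(seed=0)
w = rng.normal(size=(3, 3, 16, 16)).astype(np.float32)  # a toy conv kernel
q, scale = quantize_int8(w)

# Round-trip error is bounded by half a quantization step.
err = float(np.abs(w - dequantize(q, scale)).max())
print(f"max round-trip error {err:.4f}, quantization step {scale:.4f}")
```

Storing weights as int8 quarters the memory traffic relative to float32 and enables faster integer arithmetic on hardware that supports it, which is the source of the expected inference speedup.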
By capitalizing on the synergy of ML and parallel computing technology, we have introduced SONIC, a non-disruptive computing model that provides accelerated heterogeneous computing with coprocessors, to neutrino physics computing. We have demonstrated large speed improvements in the ProtoDUNE-SP reconstruction workflow and anticipate more applications across neutrino physics and high energy physics more broadly.
Data Availability Statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Author Contributions
MW, TY, BHa, KP, KK, and JK have contributed to the software implementation and the execution and visualization of results. MF and BHo have contributed to the cloud computing and GPU orchestration and tools for monitoring the workflow. PH, NT and all other authors have contributed to the conceptual direction and coordination of the study.
Funding
MF, BHa, BHo, KK, KP, NT, MW, and TY are supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics. NT and BHa are partially supported by the U.S. Department of Energy Early Career Award. KP is partially supported by the High Velocity Artificial Intelligence grant as part of the Department of Energy High Energy Physics Computational HEP program. PH is supported by NSF grants #1934700 and #193146. JK is supported by NSF grant #190444. Cloud credits for this study were provided by the Internet2-managed Exploring Cloud to accelerate Science program (NSF grant #190444).
Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments
We acknowledge the Fast Machine Learning collective as an open community of multi-domain experts and collaborators. This community was important for the development of this project. We acknowledge the DUNE collaboration for providing the ProtoDUNE-SP reconstruction code and simulation samples. We would like to thank Tom Gibbs and Geetika Gupta from Nvidia for their support of this project. We thank Andrew Chappell, Javier Duarte, and Steven Timm for their detailed feedback on the manuscript.
References
Aaij, R., Albrecht, J., Belous, M., Billoir, P., Boettcher, T., Brea Rodríguez, A., et al. (2020). Allen: a high level trigger on GPUs for LHCb. Comput. Softw. Big Sci. 4, 7. doi:10.1007/s41781-020-00039-7 [arXiv:1912.09161]
Abi, B., Acciarri, R., Acero, M. A., Adamov, G., Adams, D., Adinolfi, M., et al. (2020). Deep underground neutrino experiment (DUNE), far detector technical design report, volume I: introduction to DUNE. JINST 15, T08008 [arXiv:2002.02967].
ALICE Collaboration, and Rohr, D. (2019). GPU-based reconstruction and data compression at ALICE during LHC Run 3, in 24th International Conference on Computing in High Energy and Nuclear Physics (CHEP 2019). EPJ Web Conf. 245, 10005. doi:10.1051/epjconf/202024510005 [arXiv:2006.04158]
Aurisano, A., Radovic, A., Rocco, D., Himmel, A., Messier, M., Niner, E., et al. (2016). A convolutional neural network neutrino event classifier. JINST 11, P09001. doi:10.1088/1748-0221/11/09/P09001
Ayres, D., Drake, G. R., Goodman, M. C., Grudzinski, J. J., Guarino, V. J., Talaga, R. L., et al. (2007). The NOvA technical design report. FERMILAB-DESIGN-2007-01. doi:10.2172/935497
Bocci, A., Dagenhart, D., Innocente, V., Kortelainen, M., Pantaleo, F., and Rovere, M. (2020). Bringing heterogeneity to the CMS software framework. EPJ Web Conf. 245, 05009. doi:10.1051/epjconf/202024505009
Capozzi, F., Li, S.W., Zhu, G., and Beacom, J. F. (2019). DUNE as the next-generation solar neutrino experiment. Phys. Rev. Lett. 123, 131803. doi:10.1103/PhysRevLett.123.131803 [arXiv:1808.08232]
Caulfield, A., Chung, E., Putnam, A., Angepat, H., Fowers, J., Haselman, M., et al. (2016). “A cloud-scale acceleration architecture,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, October 15–19, 2016 (IEEE). Available at: https://www.microsoft.com/en-us/research/publication/configurable-cloud-acceleration/.
Coelho, C.N., Kuusela, A., Zhuang, H., Aarrestad, T., Loncar, V., Ngadiuba, J., et al. (2020). Ultra low-latency, low-area inference accelerators using heterogeneous deep quantization with QKeras and hls4ml. [arXiv:2006.10159]
Dominé, L., and Terao, K. (2020). Scalable deep convolutional neural networks for sparse, locally dense liquid argon time projection chamber data. Phys. Rev. D. 102, 012005 [arXiv:1903.05663]. doi:10.1103/PhysRevD.102.012005
Duarte, J., Harris, P., Hauck, S., Holzman, B., Hsu, S.-C., Jindariani, S., et al. (2019). FPGA-accelerated machine learning inference as a service for particle physics computing. Comput. Softw. Big Sci. 3, 13. doi:10.1007/s41781-019-0027-2 [arXiv:1904.08986]
DUNE Collaboration, and Kudryavtsev, V. A. (2016). Underground physics with DUNE. J. Phys. Conf. Ser. 718, 062032. doi:10.1088/1742-6596/718/6/062032 [arXiv:1601.03496]
DUNE Collaboration, Abi, B., Abed Abud, A., Acciarri, R., Acero, M. A., Adamov, G., et al. (2020a). First results on ProtoDUNE-SP liquid argon time projection chamber performance from a beam test at the CERN Neutrino Platform. JINST 15, P12004 [arXiv:2007.06722].
DUNE Collaboration, Abi, B., Acciarri, R., Acero, M. A., Adamov, G., Adams, D., et al. (2020b). Neutrino interaction classification with a convolutional neural network in the DUNE far detector. [arXiv:2006.15052].
Garren, L., Wang, M., and Yang, T. (2020). larrecodnn, [software] version v08_02_00 (accessed 2020-04-07). Available at: https://github.com/LArSoft/larrecodnn.
Google, (2018). gRPC, [software] version v1.19.0 (accessed 2020-02-17) Available at: https://grpc.io/
Graham, B., and van der Maaten, L. (2017). Submanifold sparse convolutional networks. [arXiv:1706.01307].
Halzen, F., and Klein, S. R. (2010). IceCube: an instrument for neutrino astronomy. Rev. Sci. Instrum. 81, 081101. doi:10.1063/1.3480478 [arXiv:1007.1247]
He, K., Zhang, X., Ren, S., and Sun, J. (2016). “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, June 27–30, 2016 (IEEE). doi:10.1109/CVPR.2016.90 [arXiv:1512.03385]
Heck, D., Knapp, J., Capdevielle, J., Schatz, G., and Thouw, T. (1998). CORSIKA: a Monte Carlo code to simulate extensive air showers. Tech. Rep. FZKA-6019
Hu, J., Shen, L., and Sun, G. (2018). “Squeeze-and-excitation networks,” in IEEE/CVF Conference on computer vision and Pattern recognition, Salt Lake City, UT, June 18–23, 2018 (IEEE). doi:10.1109/CVPR.2018.00745
Krupa, J., Lin, K., Flechas, M.A., Dinsmore, J., Duarte, J., Harris, P., et al. (2020). GPU coprocessors as a service for deep learning inference in high energy physics. [arXiv:2007.10359].
Marshall, J., and Thomson, M. (2013). “Pandora particle flow algorithm,” in International conference on calorimetry for the high energy frontier, 305. arXiv:1308.4537.
Michel, L. (1950). Interaction between four half-spin particles and the decay of the μ-meson. Proc. Phys. Soc. 63, 514. doi:10.1088/0370-1298/63/5/311
MicroBooNE Collaboration, Acciarri, R., Adams, C., An, R., Asaadi, J., Auger, M., et al. (2017a). Convolutional neural networks applied to neutrino events in a liquid argon time projection chamber. JINST 12, P03011. doi:10.1088/1748-0221/12/03/P03011 [arXiv:1611.05531]
MicroBooNE Collaboration, Acciarri, R., Adams, C., An, R., Aparicio, A., Aponte, S., et al. (2017b). Design and construction of the MicroBooNE detector. JINST 12, P02017. doi:10.1088/1748-0221/12/02/P02017 [arXiv:1612.05824]
MicroBooNE Collaboration, Adams, C., Alrashed, M., An, R., Anthony, J., Asaadi, J., et al. (2019). Deep neural network for pixel-level electromagnetic particle identification in the MicroBooNE liquid argon time projection chamber, Phys. Rev. D. 99, 092001 [arXiv:1808.07269].
Nvidia, (2019). Triton inference server, [software] version v1.8.0 (accessed 2020-02-17) Available at: https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/index.html.
Nunokawa, H., Parke, S. J., and Valle, J. W. (2008). CP violation and neutrino oscillations. Prog. Part. Nucl. Phys. 60, 338. doi:10.1016/j.ppnp.2007.10.001 [arXiv:0710.0554]
Pappalardo, A., Franco, G., and Fraser, N. (2020). Xilinx/brevitas: Pretrained 4b MobileNet V1 r2, [software] version quant_mobilenet_v1_4b-r2. doi:10.5281/zenodo.3979501
Pedro, K. (2019). SonicCMS, [software] version v5.0.0 (accessed 2020-02-17) Available at: https://github.com/hls-fpga-machine-learning/SonicCMS.
Pietropaolo, F. (2017). Review of liquid-argon detectors development at the CERN neutrino platform, J. Phys. Conf. Ser. 888, 012038. doi:10.1088/1742-6596/888/1/012038
Qian, X., and Vogel, P. (2015). Neutrino mass hierarchy. Prog. Part. Nucl. Phys. 83, 1 [arXiv:1505.01891].
Scholberg, K. (2012). Supernova neutrino detection. Ann. Rev. Nucl. Part. Sci. 62, 81 [arXiv:1205.6003].
Sfiligoi, I., Wuerthwein, F., Riedel, B., and Schultz, D. (2020). Running a pre-exascale, geographically distributed, multi-cloud scientific simulation, High Performance Computing 12151, 18. [arXiv:2002.06667]. doi:10.1007/978-3-030-50743-5_2
Snider, E., and Petrillo, G. (2017). LArSoft: toolkit for simulation, reconstruction and analysis of liquid argon TPC neutrino detectors. J. Phys. Conf. Ser 898, 042057. doi:10.1088/1742-6596/898/4/042057
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). “Going deeper with convolutions,” in IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, June 12–15, 2015 (IEEE). doi:10.1109/CVPR.2015.7298594
The Fermilab Hierarchical Configuration Language. Available at: https://cdcvs.fnal.gov/redmine/projects/fhicl/wiki
Keywords: machine learning, heterogeneous (CPU+GPU) computing, GPU (graphics processing unit), particle physics, cloud computing (SaaS)
Citation: Wang M, Yang T, Flechas MA, Harris P, Hawks B, Holzman B, Knoepfel K, Krupa J, Pedro K and Tran N (2021) GPU-Accelerated Machine Learning Inference as a Service for Computing in Neutrino Experiments. Front. Big Data 3:604083. doi: 10.3389/fdata.2020.604083
Received: 08 September 2020; Accepted: 06 November 2020;
Published: 14 January 2021.
Edited by:
Daniele D’Agostino, National Research Council (CNR), Italy

Copyright © 2021 Wang, Yang, Flechas, Harris, Hawks, Holzman, Knoepfel, Krupa, Pedro and Tran. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Michael Wang, mwang@fnal.gov