Benchmarking in the Data Center: Expanding to the Cloud
A workshop held in conjunction with PPoPP 2022 (Principles and Practice of Parallel Programming)
UPDATE 2022.01.11: The workshop will be held virtually due to rising Omicron cases
High performance computing (HPC) is no longer confined to universities and national research laboratories; it is increasingly used in industry and in the cloud. User education needs to take this into account. Users need to be able to evaluate what benefits HPC can bring to their companies, what type of computational resources (e.g., multi- and many-core CPUs, GPUs, hybrid systems) would be best for their workloads, and how to evaluate what they should pay for these resources. Another issue that arises in shared computing environments is privacy: in commercial HPC environments, the data produced and software used typically have commercial value and therefore need to be protected.
The recent broad adoption of machine learning has motivated the migration of HPC workloads to cloud data centers, and there is growing community interest in performance evaluation in this area, especially for end-to-end workflows. In addition to traditional performance benchmarking and high performance system evaluation (including absolute performance and energy efficiency), as well as configuration optimization, this workshop will discuss issues of particular importance to commercial HPC. Benchmarking has typically involved running specific workloads reflective of common HPC usage; with the growing diversity of workloads, however, theoretical performance modeling is also of interest, allowing performance to be predicted from a minimal set of measurements. The workshop will consist of submitted papers, invited talks, and a panel of industry representatives.
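As a toy illustration of predicting performance from a minimal set of measurements, a roofline-style model needs only two machine parameters (peak compute rate and memory bandwidth) plus a workload's arithmetic intensity. The numbers below are illustrative placeholders, not figures from any system discussed at the workshop.

```python
# Roofline-style performance prediction: attainable FLOP/s is capped either
# by peak compute throughput or by memory bandwidth times arithmetic intensity.
# All machine parameters here are illustrative placeholders.

def roofline(peak_gflops: float, bw_gbs: float, intensity_flop_per_byte: float) -> float:
    """Predicted attainable performance in GFLOP/s."""
    return min(peak_gflops, bw_gbs * intensity_flop_per_byte)

# Hypothetical node: 3000 GFLOP/s peak, 200 GB/s memory bandwidth.
for ai in (0.5, 4.0, 64.0):
    print(f"AI={ai:5.1f} flop/byte -> {roofline(3000, 200, ai):7.1f} GFLOP/s")
```

Low-intensity workloads land on the bandwidth-bound slope of the roof, while high-intensity workloads saturate at the compute peak.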
Date and Location
Sat 2 - Sun 3 April 2022
Seoul, South Korea
Co-located with PPoPP 2022
Submission
We invite novel, unpublished research paper submissions within the scope of this workshop. Submission topics include, but are not limited to, the following areas:
- Multi-, many-core CPUs, GPUs, hybrid system evaluation
- Performance, power, efficiency, and cost analysis
- HPC, data center, and cloud workloads and benchmarks
- System, workload, and workflow configuration and optimization
Authors are invited to submit their work as a regular paper (up to 8 pages including references). All papers must be prepared in ACM Conference Format using the two-column acmart format and the SIGPLAN proceedings template. Submissions must be written in English.
Submitted papers will be peer-reviewed by the technical program committee (TPC). The submitted manuscripts should not include author names and affiliations, as a double-blind review process will be followed. Review of supplementary material is at the discretion of the reviewers; papers must be complete and self-contained.
Workshop submission site:
Apr. 2, 09:00 - 13:00, UTC-4, Eastern Time (US & Canada)
(refer to the PPoPP program page for a time zone conversion tool)
|09:00: Welcome Remarks, Aleksandr Drozd, Workshop Co-chair|
|Research Paper and Keynote Session (09:05 - 11:50)|
|09:05-09:35||Working with Proxy-Applications: Interesting Findings, Lessons Learned, and Future Directions
Abstract: A considerable amount of research and engineering went into designing proxy applications, which represent common high-performance computing workloads, to co-design and evaluate the current generation of supercomputers, e.g., RIKEN’s Supercomputer Fugaku, ANL’s Aurora, or ORNL’s Frontier. These scaled-down versions of important scientific applications have helped researchers around the world identify bottlenecks and performance characteristics and fine-tune architectures for their needs. We present some of our findings, which will also influence the next generation of HPC systems. However, not all aspects of these proxy applications are compelling, and to accommodate an extremely heterogeneous future, we have to reconsider how to co-design supercomputers during the next decade. Hence, we will introduce a new kind of benchmarking set, which we hope solves some of the open issues and can be used to close the ever-widening gap between the sustained and peak performance of our supercomputers.
Dr. Jens Domke
RIKEN Center for Computational Science (R-CCS), Japan
SLIDES | VIDEO
|09:35-10:05||Benchmarking For Processor Performance Variations: A Case Study With An Arm Processor
Abstract: This talk presents our ongoing study of processor performance variations on an Arm processor, the HiSilicon Kunpeng 920 (KP920). Processor performance variations have been recognized as one of the main causes of performance anomalies and performance degradation on supercomputers. We designed a suite of micro-benchmarks for the KP920 processor and found performance variations. Using a top-down approach, we located their sources: they are related to the ALUs, on-chip cache subsystems, DDR4 memory subsystems, and interconnects. We then characterized these sources of performance variation and their performance impacts. Finally, we showed how to avoid performance degradation by diagnosing performance variation on the KP920 processor.
Dr. James Lin
Vice Director, Network & Information Center (NIC) and Vice Director, High Performance Computing Center
Shanghai Jiao Tong University, China
SLIDES | VIDEO
|10:05-10:35||Benchmarking Apache Arrow Flight - A Wire-Speed Protocol for Data Transfer, Querying and Microservices
Abstract: Moving structured data between different big data frameworks or data warehouses/storage systems often causes significant overhead; frequently, more than 80% of the total data-access time is spent in the serialization/deserialization step. Columnar data formats are gaining popularity in both analytics and transactional databases. Apache Arrow, a unified columnar in-memory data format, promises efficient data storage, access, manipulation, and transport. In addition, with the introduction of the Arrow Flight communication capabilities, built on top of gRPC, Arrow enables high-performance data transfer over TCP networks. Arrow Flight allows parallel Arrow RecordBatch transfer over networks in a platform- and language-independent way, and offers high performance, parallelism, and security based on open-source standards. In this paper, we bring together some recently implemented use cases of Arrow Flight with their benchmarking results. These use cases include bulk Arrow data transfer, querying subsystems, and Flight as a microservice integrated into different frameworks, showing the throughput and scalability of this protocol. We show that Flight is able to achieve up to 6000 MB/s and 4800 MB/s throughput for DoGet() and DoPut() operations, respectively. On nodes with Mellanox ConnectX-3 or Connect-IB interconnects, Flight can utilize up to 95% of the total available bandwidth. Flight is scalable and can use up to half of the available system cores efficiently for bidirectional communication. For query systems like Dremio, Flight is an order of magnitude faster than the ODBC and turbodbc protocols: the Arrow Flight based implementation on Dremio performs 20x and 30x better compared to turbodbc and ODBC connections, respectively. We briefly outline some recent Flight-based use cases in big data frameworks such as Apache Spark and Dask and in remote Arrow data processing tools. We also discuss limitations and the future outlook of Apache Arrow and Arrow Flight as a whole.
Mr. Tanveer Ahmad
TU Delft, The Netherlands
SLIDES | VIDEO
|Coffee Break (10:35 - 10:50)|
|10:50-11:20||Methodology and Challenges for Benchmarking and Assessing Performance of Systems with Different Instruction Set Architectures
Abstract: Evaluating the performance of a computer system is an increasingly challenging task. Most industry benchmarks do not capture the complexities and application behaviors of modern and emerging applications, and industry trends toward disaggregated hardware, distributed applications, and heterogeneous hardware complicate performance assessment even further. While most commercial high-performance systems in the past decade adopted a single ISA, one of the big opportunities of recent years has been the resurgence of multiple ISAs in both the client space and the datacenter. Measuring and comparing systems with different ISAs and system architectures presents unique issues and requires rethinking benchmarking methodology; it can also open a window into how we evaluate and project performance for future heterogeneous systems. In this talk I will discuss how I tackled this problem over the past few years at Arm, and what challenges we see going forward.
Dr. Andrea Pellegrini
Arm Advanced Server Team, USA
SLIDES | VIDEO
|11:20-11:50||High Frequency Performance Monitoring via Architectural Event Measurement
Abstract: Obtaining detailed software execution information via hardware performance counters is a powerful analysis technique. Performance counters provide an effective method to monitor program behavior, so performance bottlenecks due to hardware architecture or software design and implementation can be identified, isolated, and addressed. The granularity and overhead of the monitoring mechanism, however, are paramount to proper analysis. Many prior designs provide performance counter monitoring with inherent drawbacks such as intrusive code changes, a slow timer system, or the need for a kernel patch. In this talk, I will present K-LEB (Kernel - Lineage of Event Behavior), a new monitoring mechanism that can produce precise, non-intrusive, low-overhead, periodic performance counter data using a kernel module based design. Our approach has been evaluated on three case studies to demonstrate its effectiveness, correctness, and efficiency. By moving the responsibility of timing into kernel space, K-LEB can gather periodic data at a 100 μs rate, a 100x finer granularity than comparable performance counter monitoring approaches. At the same time, it reduces monitoring overhead by at least 58.8%, and the difference between its recorded performance counter readings and those of other tools is less than 0.3%.
Dr. Chen Liu
Clarkson University, USA
SLIDES | VIDEO
|11:50-12:50||Industry Panel Discussion
Panelists:
- Hai Ah Nam, Lawrence Berkeley National Laboratory
- Principal AI/HPC Architect
- Senior Vice President of Networking
- Oak Ridge National Laboratory
Moderators: Verónica G. Melesse Vergara, Oscar Hernandez – Industrial Panel Chairs
|12:50: Closing Remarks, Kevin A. Brown, Workshop Co-chair|
Please register at the main conference website.
|Paper submission deadline:|
|Workshop:||2-3 April 2022|
Workshop Co-chairs
Kevin Brown (Argonne National Laboratory)
Aleksandr Drozd (RIKEN Center for Computational Science)
Industrial Panel Chairs
Verónica G. Melesse Vergara (Oak Ridge National Laboratory)
Oscar Hernandez (NVIDIA)
Program Committee
- Artur Podobas, KTH Royal Institute of Technology
- Bok Jik Lee, Seoul National University
- Christopher Chang, National Renewable Energy Laboratory
- Dave Cheney, VMware
- Emil Vatai, RIKEN R-CCS
- Gerd Heber, The HDF Group
- Hamid Reza Zohouri, KLA
- Jorji Nonaka, RIKEN
- Kate Isaacs, University of Arizona
- Neil McGlohon, Rensselaer Polytechnic Institute
- Shweta Salaria, IBM
- Sid Raskar, Argonne National Laboratory
- Steven Varga, Varga Consulting
- Tianqi Xu, Preferred Networks
- Trevor Steil, Lawrence Livermore National Laboratory
- Zhang Xianyi, PerfXLab Technology