Benchmarking in the Data Center: Expanding to the Cloud
A workshop held in conjunction with PPoPP 2022 (Principles and Practice of Parallel Programming)
UPDATE 2022.01.11: The workshop will be held virtually due to rising Omicron cases
High performance computing (HPC) is no longer confined to universities and national research laboratories; it is increasingly used in industry and in the cloud. User education needs to take this into account. Users need to be able to evaluate what benefits HPC can bring to their companies, what type of computational resources (e.g., multi- and many-core CPUs, GPUs, hybrid systems) would be best for their workloads, and how to evaluate what they should pay for these resources. Another issue that arises in shared computing environments is privacy: in commercial HPC environments, the data produced and software used typically have commercial value and therefore need to be protected.
The recent broad adoption of machine learning has motivated the migration of HPC workloads to cloud data centers, and there is growing community interest in performance evaluation in this area, especially for end-to-end workflows. In addition to traditional performance benchmarking and high performance system evaluation (including absolute performance and energy efficiency), as well as configuration optimization, this workshop will discuss issues of particular importance to commercial HPC. Benchmarking has typically involved running specific workloads reflective of common HPC usage; with the growing diversity of workloads, however, theoretical performance modeling is also of interest, allowing performance to be predicted from a minimal set of measurements. The workshop will consist of submitted papers, invited talks, and a panel of industry representatives.
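As a toy illustration of predicting performance from a minimal set of measurements, a roofline-style model needs only two machine parameters (peak compute rate and memory bandwidth) plus a workload's arithmetic intensity. The numbers below are illustrative placeholders, not figures from any system discussed at the workshop.

```python
# Roofline-style performance prediction: attainable FLOP/s is capped either
# by peak compute throughput or by memory bandwidth times arithmetic intensity.
# All machine parameters here are illustrative placeholders.

def roofline(peak_gflops: float, bw_gbs: float, intensity_flop_per_byte: float) -> float:
    """Predicted attainable performance in GFLOP/s."""
    return min(peak_gflops, bw_gbs * intensity_flop_per_byte)

# Hypothetical node: 3000 GFLOP/s peak, 200 GB/s memory bandwidth.
for ai in (0.5, 4.0, 64.0):
    print(f"AI={ai:5.1f} flop/byte -> {roofline(3000, 200, ai):7.1f} GFLOP/s")
```

Low-intensity workloads land on the bandwidth-bound slope of the roof, while high-intensity workloads saturate at the compute peak.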
Date and Location
Sat 2 - Sun 3 April 2022
Seoul, South Korea
Co-located with PPoPP 2022
Submission
We invite novel, unpublished research paper submissions within the scope of this workshop. Submission topics include, but are not limited to, the following areas:
- Multi-, many-core CPUs, GPUs, hybrid system evaluation
- Performance, power, efficiency, and cost analysis
- HPC, data center, and cloud workloads and benchmarks
- System, workload, and workflow configuration and optimization
Authors are invited to submit their work as a regular paper (up to 8 pages including references). All papers must be prepared in ACM Conference Format using the two-column acmart format and the SIGPLAN proceedings template. Submissions must be written in English.
Submitted papers will be peer-reviewed by the technical program committee (TPC). The submitted manuscripts should not include author names and affiliations, as a double-blind review process will be followed. Review of supplementary material is at the discretion of the reviewers; papers must be complete and self-contained.
Workshop submission site:
Apr. 2, 09:00 - 13:00, UTC-4, Eastern Time (US & Canada)
(refer to the PPoPP program page for a time zone conversion tool)
|09:00: Welcome Remarks, Aleksandr Drozd, Workshop Co-chair|
|Research Paper and Keynote Session (09:05 - 11:50)|
|09:05-09:35||Working with Proxy-Applications: Interesting Findings, Lessons Learned, and Future Directions
Abstract: A considerable amount of research and engineering went into designing proxy applications, which represent common high-performance computing workloads, to co-design and evaluate the current generation of supercomputers, e.g., RIKEN’s Supercomputer Fugaku, ANL’s Aurora, or ORNL’s Frontier. These scaled-down versions of important scientific applications have helped researchers around the world identify bottlenecks and performance characteristics and fine-tune architectures for their needs. We present some of our findings, which will also influence the next generation of HPC systems. However, not all aspects of these proxy applications are compelling, and to accommodate an extremely heterogeneous future, we have to reconsider how to co-design supercomputers during the next decade. Hence, we will introduce a new kind of benchmarking set, which we hope solves some of the open issues and can be used to close the ever-widening gap between the sustained and peak performance of our supercomputers.
Dr. Jens Domke
RIKEN Center for Computational Science (R-CCS), Japan
SLIDES | VIDEO
|09:35-10:05||Benchmarking For Processor Performance Variations: A Case Study With An Arm Processor
Abstract: This talk presents our ongoing study of processor performance variations on an Arm processor, the HiSilicon Kunpeng 920 (KP920). Processor performance variations have been recognized as one of the main causes of performance anomalies and performance degradation on supercomputers. We designed a suite of micro-benchmarks for the KP920 processor and found performance variations. Using a top-down approach, we located their sources: they are related to the ALUs, on-chip cache subsystems, DDR4 memory subsystems, and interconnects. We then characterized these sources of performance variation and their performance impacts. Finally, we showed how to avoid performance degradation by diagnosing performance variation on the KP920 processor.
Dr. James Lin
Vice Director, Network & Information Center (NIC) and Vice Director, High Performance Computing Center
Shanghai Jiao Tong University, China
SLIDES | VIDEO
|10:05-10:35||Benchmarking Apache Arrow Flight - A Wire-Speed Protocol for Data Transfer, Querying and Microservices
Abstract: Moving structured data between different big data frameworks or data warehouses/storage systems often causes significant overhead; frequently, more than 80% of the total data-access time is spent in the serialization/deserialization step. Columnar data formats are gaining popularity in both analytics and transactional databases. Apache Arrow, a unified columnar in-memory data format, promises efficient data storage, access, manipulation, and transport. In addition, with the introduction of the Arrow Flight communication capabilities, built on top of gRPC, Arrow enables high-performance data transfer over TCP networks. Arrow Flight allows parallel Arrow RecordBatch transfer over networks in a platform- and language-independent way, and offers high performance, parallelism, and security based on open-source standards. In this paper, we bring together some recently implemented use cases of Arrow Flight with their benchmarking results. These use cases include bulk Arrow data transfer, querying subsystems, and Flight as a microservice integrated into different frameworks, showing the throughput and scalability of this protocol. We show that Flight is able to achieve up to 6000 MB/s and 4800 MB/s throughput for DoGet() and DoPut() operations, respectively. On nodes with Mellanox ConnectX-3 or Connect-IB interconnects, Flight can utilize up to 95% of the total available bandwidth. Flight is scalable and can use up to half of the available system cores efficiently for bidirectional communication. For query systems like Dremio, Flight is an order of magnitude faster than the ODBC and turbodbc protocols: the Arrow Flight based implementation on Dremio performs 20x and 30x better compared to turbodbc and ODBC connections, respectively. We briefly outline some recent Flight-based use cases in big data frameworks such as Apache Spark and Dask and in remote Arrow data processing tools. We also discuss limitations and the future outlook of Apache Arrow and Arrow Flight as a whole.
Mr. Tanveer Ahmad
TU Delft, The Netherlands
SLIDES | VIDEO
|Coffee Break (10:35 - 10:50)|
|10:50-11:20||Methodology and Challenges for Benchmarking and Assessing Performance of Systems with Different Instruction Set Architectures
Abstract: Evaluating the performance of a computer system is an increasingly challenging task. Most industry benchmarks do not capture the complexities and application behaviors of modern and emerging applications, and industry trends toward disaggregated hardware, distributed applications, and heterogeneous hardware complicate performance assessment even further. While most commercial high-performance systems in the past decade adopted a single ISA, one of the big opportunities of recent years has been the resurgence of multiple ISAs in both the client space and the datacenter. Measuring and comparing systems with different ISAs and system architectures presents unique issues and requires rethinking benchmarking methodology; it can also open a window into how we evaluate and project performance for future heterogeneous systems. In this talk I will discuss how I tackled this problem over the past few years at Arm, and what challenges we see going forward.
Dr. Andrea Pellegrini
Arm Advanced Server Team, USA
SLIDES | VIDEO
|11:20-11:50||High Frequency Performance Monitoring via Architectural Event Measurement
Abstract: Obtaining detailed software execution information via hardware performance counters is a powerful analysis technique. Performance counters provide an effective method to monitor program behavior, so performance bottlenecks due to hardware architecture or software design and implementation can be identified, isolated, and addressed. The granularity and overhead of the monitoring mechanism, however, are paramount to proper analysis. Many prior designs provide performance counter monitoring with inherent drawbacks such as intrusive code changes, a slow timer system, or the need for a kernel patch. In this talk, I will present K-LEB (Kernel - Lineage of Event Behavior), a new monitoring mechanism that can produce precise, non-intrusive, low-overhead, periodic performance counter data using a kernel module based design. Our approach has been evaluated on three case studies to demonstrate its effectiveness, correctness, and efficiency. By moving the responsibility of timing into kernel space, K-LEB can gather periodic data at a 100 μs rate, a 100x finer granularity than comparable performance counter monitoring approaches. At the same time, it reduces monitoring overhead by at least 58.8%, and the difference between its recorded performance counter readings and those of other tools is less than 0.3%.
Dr. Chen Liu
Clarkson University, USA
SLIDES | VIDEO
|11:50-12:50||Industry Panel Discussion
Panelists:
- Hai Ah Nam, Lawrence Berkeley National Laboratory
- Principal AI/HPC Architect
- Senior Vice President of Networking
- Oak Ridge National Laboratory
Moderators: Verónica G. Melesse Vergara, Oscar Hernandez – Industrial Panel Chairs
|12:50: Closing Remarks, Kevin A. Brown, Workshop Co-chair|
Please register at the main conference website.
|Paper submission deadline:|
|Workshop:||2-3 April 2022|
Workshop Co-chairs
Kevin Brown (Argonne National Laboratory)
Aleksandr Drozd (RIKEN Center for Computational Science)
Industrial Panel Chairs
Verónica G. Melesse Vergara (Oak Ridge National Laboratory)
Oscar Hernandez (NVIDIA)
Program Committee
- Artur Podobas, KTH Royal Institute of Technology
- Bok Jik Lee, Seoul National University
- Christopher Chang, National Renewable Energy Laboratory
- Dave Cheney, VMware
- Emil Vatai, RIKEN R-CCS
- Gerd Heber, The HDF Group
- Hamid Reza Zohouri, KLA
- Jorji Nonaka, RIKEN
- Kate Isaacs, University of Arizona
- Neil McGlohon, Rensselaer Polytechnic Institute
- Shweta Salaria, IBM
- Sid Raskar, Argonne National Laboratory
- Steven Varga, Varga Consulting
- Tianqi Xu, Preferred Networks
- Trevor Steil, Lawrence Livermore National Laboratory
- Zhang Xianyi, PerfXLab Technology