Benchmarking in the Data Center: Expanding to the Cloud
Workshop held in conjunction with PPoPP 2023: Principles and Practice of Parallel Programming 2023
High performance computing (HPC) is no longer confined to universities and national research laboratories; it is increasingly used in industry and in the cloud. User education also needs to take this into account. Users need to be able to evaluate what benefits HPC can bring to their companies, what type of computational resources (e.g. multi- and many-core CPUs, GPUs, hybrid systems) would be best for their workloads, and what they should pay for these resources. Another issue that arises in shared computing environments is privacy: in commercial HPC environments, the data produced and software used typically have commercial value and hence need to be protected.
The recent general adoption of machine learning has motivated the migration of HPC workloads to cloud data centers, and there is growing interest in the community in performance evaluation in this area, especially for end-to-end workflows. In addition to traditional performance benchmarking and high performance system evaluation (including absolute performance and energy efficiency), as well as configuration optimizations, this workshop will discuss issues that are of particular importance in commercial HPC. Benchmarking has typically involved running specific workloads that are reflective of typical HPC workloads, yet with the growing diversity of workloads, theoretical performance modeling is also of interest to allow for performance prediction given a minimal set of measurements. The workshop will be composed of submitted papers, invited talks and a panel composed of representatives from industry.
Submission
We invite novel, unpublished research paper submissions within the scope of this workshop. Paper submission topics include, but are not limited to, the following areas:
- Multi-, many-core CPUs, GPUs, hybrid system evaluation
- Performance, power, efficiency, and cost analysis
- HPC, data center, and cloud workloads and benchmarks
- System, workload, and workflow configuration and optimization
Authors are invited to submit work as a regular paper (up to 8 pages including references). All papers must be prepared in ACM Conference Format using the 2-column acmart format and SIGPLAN proceedings template. Submissions must be written in English.
Submitted papers will be peer-reviewed by the technical program committee (TPC). The submitted manuscripts should not include author names and affiliations, as a double-blind review process will be followed. Review of supplementary material is at the discretion of the reviewers; papers must be complete and self-contained.
Workshop submission site: BID 23 (EasyChair)
|Paper submission deadline:|
|Author notification:||27. January 2023|
|Workshop:||25. February 2023|
Date and Location
Sat 25. February 2023 (afternoon; 13:20-17:40)
Hotel Bonaventure Montreal, Canada (Room: Westmount 6)
Co-located with PPoPP 2023
Join Remotely
Zoom webinar (password: BID23PPoPP)
Feb. 25, 13:20 - 13:50, Eastern Standard Time (EST), UTC -5
(refer to PPoPP 2023 Program page for time zone conversion tool)
|13:20: Welcome Remarks, Jens Domke, Workshop Chair VIDEO|
|Research and Invited Paper Session (13:20 - 15:20)|
|13:20-13:50||Efficiently Processing Massive Graphs using Commodity Multicore Machines
Abstract: Processing large graphs quickly and cost-effectively is a major challenge, and existing systems capable of processing large graphs have high computational cost, only solve a limited set of problems, and have poor theoretical guarantees. In this talk I will describe recent algorithmic work that builds on the Graph-Based Benchmark Suite (GBBS), which provides solutions to a very broad set of fundamental graph problems that can scale to the largest publicly available graph, the WebDataCommons hyperlink graph, with over 200 billion edges. Our solutions have strong provable bounds and most of them run in a matter of seconds to minutes on a commodity multicore machine with a terabyte of main memory.
Laxman Dhulipala, PhD
Department of Computer Science at the University of Maryland, College Park
SLIDES (N/A) | VIDEO
|13:50-14:20||Accurate and efficient software microbenchmarks
Abstract: Software is often improved incrementally. Each software optimization should be assessed with microbenchmarks. In a microbenchmark, we record performance measures such as elapsed time or instruction counts during specific tasks, often in idealized conditions. In principle, the process is easy: if the new code is faster, we adopt it. Unfortunately, there are many pitfalls, such as unrealistic statistical assumptions and poorly designed benchmarks. Abstractions like cloud computing add further challenges. We illustrate effective benchmarking practices with examples.
Daniel Lemire, PhD
Data Science Laboratory of the University of Quebec (TELUQ)
SLIDES | VIDEO
|14:20-14:50||Overview of SPEC HPC Benchmarks and Details of the SPEChpc 2021 Benchmark
Abstract: The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse standardized benchmarks and tools to evaluate performance and energy efficiency for the newest generation of computing systems. The SPEC High Performance Group (HPG) focuses specifically on developing industry standard benchmarks for HPC systems and has a track record of producing high-quality benchmarks serving both academia and industry. This talk provides an overview of the HPC benchmarks that are available from SPEC and SPEC HPG and then dives into the details of the newest benchmark, SPEChpc 2021. This benchmark covers all prevalent programming models, supports hybrid execution on CPUs and GPUs, and scales from a single node all the way to thousands of nodes and GPUs. In addition to covering the architecture and use cases of the benchmark, the talk gives an outlook on future benchmark directions.
Robert Henschel, Dr.
Program Director for Research Engagement
SLIDES | VIDEO
|14:50-15:20||TailWAG: Tail Latency Workload Analysis and Generation
Abstract: Server performance has always been an active field of research, spanning topics including databases, network applications, and cloud computing. Tail latency is one of the most important performance metrics for server performance. In this work, we study how different timing components affect server workload performance, including thread contention and interrupt handling. We introduce a workload analysis and generation tool: TailWAG. TailWAG relies on measured timing components from applications to generate a synthetic workload that mimics the original behavior. TailWAG can help server designers and application developers explore performance bottlenecks and optimization opportunities at early stages.
University of Wisconsin-Madison
SLIDES | VIDEO
|Coffee Break (15:20 - 15:40)|
|Research and Invited Paper Session (15:40 - 17:40)|
|15:40-16:10||gprofng: The Next Generation GNU Profiling Tool
Abstract: In this talk we present an overview of gprofng, the next generation profiling tool for GNU/Linux. It has been part of the GNU binutils tools suite since version 2.39. This profiler has its roots in the Performance Analyzer from the Oracle Developer Studio product. Gprofng is a standalone tool, however, and specifically targets GNU/Linux. Gprofng comprises several tools to collect and view the performance data. It supports the profiling of programs written in C, C++, Java, or Scala running on systems using processors from Intel, AMD, Arm, or compatible vendors. The extent of the support is processor dependent, but the basic views are always available. The support for Fortran beyond F77 is currently limited and under evaluation. Any executable in the ELF (Executable and Linkable Format) object format can be used for profiling with gprofng. If debug information is available, gprofng can provide more details, but this is not a requirement. Gprofng offers full and transparent support for shared libraries and multithreading using POSIX Threads, OpenMP, or Java Threads. This tool can also be used when the source code of the target executable is not available. Gprofng also works with unmodified executables. There is no need to recompile or instrument the code. Profiling the production executable ensures that the profile reflects the actual run time behaviour and conditions of a production run. After the data has been collected, the performance information can be viewed at the function, source, and disassembly level. Individual thread views are supported as well. One of the very powerful features of gprofng is the ability to compare two or more profiles. This allows for an easy way to spot regressions or find scalability bottlenecks, for example. In addition to text-based views, there is also a tool that generates an HTML-based structure, allowing the results to be viewed in a web browser.
In the talk, we start with a description of the architecture of the gprofng tools suite. This is followed by an overview of the various tools that are available, plus the main features. A brief comparison with gprof will be made, but the bulk of the talk consists of examples to show the functionality and features. This includes the HTML functionality and also a sneak preview of the GUI that is under development. We conclude with some plans for future developments.
Ruud van der Pas, Drs.
Senior Principal Software Engineer
SLIDES | VIDEO
|16:10-16:40||Benchmarking MPI for Deep Learning and HPC Workloads
Abstract: Modern High-Performance Computing (HPC) architectures have necessitated scalable hybrid programming models to obtain peak performance on multi-core and multi-GPU systems. The Message Passing Interface (MPI) has traditionally been used to program HPC clusters for scientific workloads. However, in recent years, we have seen the widespread adoption of MPI for Machine Learning (ML). Although the workloads have changed, we still use many existing techniques to micro-benchmark and evaluate one-sided, point-to-point, and collective communication. Modern ML-specific communication benchmarks are also used to explore application scalability. Finally, we will discuss MPI Partitioned Communication, a new addition to the MPI-4.0 standard. We ask: how does benchmarking this API differ from existing MPI evaluation techniques, and what workloads benefit from using MPI Partitioned Communication?
Queen's University, Canada
SLIDES | VIDEO
|16:40-17:10||Too many chefs in the kitchen - An argument in favor of Program Execution Models (PXMs)
Abstract: Computer architectures are now more parallel and heterogeneous than ever. However, balancing programmability, portability and performance (3P) is still challenging. A large number of programming models, programming frameworks and execution models have been proposed, with the promise to solve these issues. Still, applications and developers also put a large effort into overcoming the challenges of interoperability between these frameworks. In the meantime, application developers still fear that newly implemented code will not work on future machines. In an era of extreme heterogeneity, the current programming and system infrastructure does not seem to have a convincing solution to the 3P problem. Furthermore, if we want to be able to provide software modularity, there must exist a clear distinction between the semantics of the execution model and the user application, such that they can evolve independently. In this talk we argue that if we want to evolve computer systems, software tools and applications for the 3P problem, a well-defined program execution model should be vertically integrated across all the components of the system (i.e. architecture, middleware, operating systems, compilers, libraries, programming models, etc.). Additionally, we support this argument through a view of the evolution of computers and some success examples. Finally, while the intention of this talk is to encourage discussion around the topic, a possible PXM is presented as a potential solution to the 3P problem.
Jose Manuel Monsalve, PhD
Argonne National Laboratory
SLIDES | VIDEO
|17:10-17:40||Benchmarking in Google Cloud: Google Cloud HPC-Toolkit + Ramble
Abstract: This talk will focus on HPC Benchmarking at Google Cloud. We'll cover our high level process, along with the tools we have developed to aid our ability to benchmark new workloads. Primarily, we will focus on a new tool (Ramble) for creating reproducible experiments. We will describe some of the functionality Ramble provides, along with what a typical user workflow looks like.
Doug Jacobsen, PhD
SLIDES | VIDEO
|17:40: Closing Remarks, Jens Domke and Aleksandr Drozd, Panel Chair VIDEO|
PPoPP 2023 Registration Website (early bird registration deadline is January 31st)
Attending
We will support presenters with travel restrictions by setting up a way to present remotely. However, regular registration fees, as stated on the PPoPP website, will still apply.
Jens Domke (RIKEN Center for Computational Science)
Industrial Panel Chairs
Aleksandr Drozd (RIKEN Center for Computational Science)
Anara Kozhokanova (RWTH Aachen)
Miwako Tsuji (RIKEN R-CCS)
Artur Podobas (KTH Royal Institute of Technology)
Dossay Oryspayev (Brookhaven National Laboratory)
Emil Vatai (RIKEN R-CCS)
Hitoshi Murai (RIKEN R-CCS)
Holger Brunst (TU Dresden)
Joseph Schuchart (University of Tennessee, Knoxville)
Robert Henschel (Indiana University)
Sascha Hunold (TU Wien)
Samar Aseeri (King Abdullah University of Science and Technology)
Juan (Jenny) Chen (National University of Defense Technology, China)
Benson Muite (Kichakato Kizito)