Gauge: An Interactive Data-Driven Visualization Tool for HPC Application I/O Performance Analysis

10 November 2020

Scientific applications running on modern supercomputers often use only a fraction of the processing power those machines can provide. Inefficient use of the storage system is a common cause of this limited speedup. Unfortunately, because modern High Performance Computing (HPC) systems are so complex, diagnosing exactly what causes an application to underperform is very difficult. Even with logs collected at various points in the HPC system (e.g., compute, network, and storage nodes), extracting knowledge about these components and their interactions is not straightforward.

Gauge is an HPC analysis and visualization tool we developed in collaboration with Argonne National Laboratory. Gauge is targeted at two audiences: (1) scientists, who can use Gauge to diagnose and improve the I/O throughput of their applications, and (2) facility administrators, who can use Gauge to gain more insight into the workloads of their supercomputers.

Gauge works by taking a collection of HPC system logs and building an easy-to-navigate hierarchy of jobs. In this hierarchy, similar jobs are grouped together so that I/O behaviors can be spotted more easily. Gauge lets the user traverse this hierarchy to better understand why certain jobs or groups of jobs behave as they do, and provides various graphs for discerning differences between jobs and groups of jobs. We provide an interactive instance of Gauge that uses data collected on the Argonne Leadership Computing Facility’s (ALCF) Theta supercomputer, and invite readers to try it out.
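To make the idea of a job hierarchy concrete, here is a minimal Python sketch of how similar jobs could be grouped into a tree. It is not Gauge's actual algorithm: it uses plain single-linkage agglomerative clustering as a stand-in, and the job IDs, the two features (log-scaled bytes read and written), and all the numbers are invented for illustration.

```python
from itertools import combinations
from math import dist

# Hypothetical per-job I/O features: (log10 bytes read, log10 bytes written).
# Job IDs and values are made up for this example, not taken from Theta logs.
jobs = {
    "job_a": (9.0, 3.0),
    "job_b": (9.1, 3.2),
    "job_c": (4.0, 8.0),
    "job_d": (4.3, 8.1),
    "job_e": (6.5, 6.5),
}

def leaves(node):
    """Return the job IDs under a node (a job ID string or a (left, right) pair)."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def linkage_distance(a, b, features):
    """Single linkage: distance between the closest pair of jobs in two groups."""
    return min(dist(features[x], features[y])
               for x in leaves(a) for y in leaves(b))

def build_hierarchy(features):
    """Repeatedly merge the two closest groups until a single tree remains."""
    nodes = sorted(features)
    while len(nodes) > 1:
        a, b = min(combinations(nodes, 2),
                   key=lambda pair: linkage_distance(pair[0], pair[1], features))
        # Replace the two merged groups with one parent node.
        nodes = [n for n in nodes if n != a and n != b] + [(a, b)]
    return nodes[0]

tree = build_hierarchy(jobs)
```

In the resulting nested tuples, each pair is an internal node of the hierarchy, so the read-heavy jobs (`job_a`, `job_b`) end up under one branch and the write-heavy jobs (`job_c`, `job_d`) under another. Walking down from the root corresponds to moving from broad groups of jobs to individual jobs.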

Example hierarchy of jobs run on the Theta supercomputer (left). Example visualization of a group of HPC jobs (right).

Our PDSW20 paper [1] explains how Gauge works and how it is used. For more details on how machine learning can be used to better understand HPC system I/O throughput, see our SC20 paper [2] on explainable local models of I/O throughput.

[1] E. del Rosario, M. Currier, M. Isakov, S. Madireddy, P. Balaprakash, P. Carns, R. Ross, and M. Kinsy, “Gauge: An interactive data-driven visualization tool for HPC application I/O performance analysis,” in 2020 IEEE/ACM Fifth International Parallel Data Systems Workshop (PDSW), 2020.

[2] M. Isakov, E. del Rosario, S. Madireddy, P. Balaprakash, P. Carns, R. Ross, and M. Kinsy, “HPC I/O throughput bottleneck analysis with explainable local models,” in SC’20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020.