In recent times, GPUs have taken the lead as primary accelerators in high-performance systems, concentrating much computational power in GPU clusters. While multi-GPU acceleration benefits many HPC and ML applications, inter-GPU communication, handled traditionally by the host, can hinder scalability. The host traditionally controls execution by managing kernels, communication, and synchronization. This CPU involvement can be fully shifted to devices, improving performance for multi-GPU communication tasks.

In this talk, first, we present a fully autonomous multi-GPU execution model, eliminating CPU involvement after the initial kernel launch. Our CPU-free model leverages techniques such as persistent kernels and device-initiated communication, reducing communication overhead. We validate the model on iterative solvers—2D/3D Jacobian stencil and Conjugate Gradient (CG). We then proceed with accompanying tools for CPU-free execution, including CPU-free code generation with the compiler and a CPU-free runtime system. Second, we introduce a multi-GPU communication profiling tool that fills the gap in the vendor’s tool portfolio when used on CPU-free execution. Our tool excels in tracking both peer-to-peer transfers and communication library calls. It provides a variety of visualization modes and levels of detail, ranging from a broad overview of data movement across the system to the precise instructions and memory addresses involved.