Bandwidth and latencies are central performance limiters for Lattice QCD. To overcome bandwidth limiters one way is to reduce the number of bits need by e.g., mixed precision solvers. These provide great speedups but increase the relative importance of latency limiters. We discuss techniques that QUDA uses to reduce latencies from GPU-CPU and GPU-network transfers and their impact for strong-scaling HMC simulations, where these matter most.
Mathias Wagner Kate Clark (NVIDIA)