Gauge-covariant smearing based on the 3D lattice Laplacian can be used to create extended operators that have better overlap with hadronic ground states. The smearing is often applied iteratively. For staggered quarks, using two-link parallel transport preserves taste properties. We found that such iterative smearing took an inordinate amount of time when done on the CPU, so we have implemented the procedure in QUDA.
Instead of carrying out two consecutive parallel transports between nearest-neighbor sites on each smearing iteration, we calculate the product of the two links joining next-to-nearest-neighbor sites once and reuse it for all iterations. This reduces both the number of floating-point operations and the amount of communication required.
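The reuse of the two-link product can be illustrated with a minimal toy sketch. This is not the QUDA implementation: it assumes a one-dimensional periodic lattice with U(1) phases standing in for SU(3) links, and the function names and the smearing weight `alpha` are hypothetical choices made only for illustration.

```python
import cmath
import math
import random


def make_links(n, seed=0):
    """Random U(1) phases U[x] on the link from site x to x+1 (toy stand-in for SU(3))."""
    rng = random.Random(seed)
    return [cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for _ in range(n)]


def two_link_products(U):
    """W[x] = U[x] * U[x+1], the transporter from site x+2 back to site x."""
    n = len(U)
    return [U[x] * U[(x + 1) % n] for x in range(n)]


def step_two_hops(U, psi, alpha):
    """One smearing iteration using two consecutive nearest-neighbor transports."""
    n = len(psi)
    out = []
    for x in range(n):
        fwd = U[x] * (U[(x + 1) % n] * psi[(x + 2) % n])
        bwd = U[(x - 1) % n].conjugate() * (
            U[(x - 2) % n].conjugate() * psi[(x - 2) % n]
        )
        out.append(psi[x] + alpha * (fwd + bwd - 2 * psi[x]))
    return out


def step_precomputed(W, psi, alpha):
    """One smearing iteration using the precomputed two-link field W."""
    n = len(psi)
    out = []
    for x in range(n):
        fwd = W[x] * psi[(x + 2) % n]
        bwd = W[(x - 2) % n].conjugate() * psi[(x - 2) % n]
        out.append(psi[x] + alpha * (fwd + bwd - 2 * psi[x]))
    return out


def smear(U, psi, alpha, n_iter, precompute=True):
    """Apply n_iter smearing iterations; W is built once when precompute is True."""
    if precompute:
        W = two_link_products(U)  # computed once, reused every iteration
        for _ in range(n_iter):
            psi = step_precomputed(W, psi, alpha)
    else:
        for _ in range(n_iter):
            psi = step_two_hops(U, psi, alpha)
    return psi


n_sites = 8
links = make_links(n_sites)
source = [1 + 0j] + [0j] * (n_sites - 1)  # point source at site 0
fast = smear(links, source, alpha=0.1, n_iter=5, precompute=True)
slow = smear(links, source, alpha=0.1, n_iter=5, precompute=False)
```

In this sketch the precomputed path performs one link multiplication per direction per site in each iteration instead of two, and it reads only the field at distance two rather than also needing the intermediate neighbor. Because every hop moves by two sites, a source on an even site never populates odd sites.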
We present the performance of this code on some recent GPUs.